A novel machine learning approach (svmSomatic) to distinguish somatic and germline mutations using next-generation sequencing data

Yu-Fang Mao, Xi-Guo Yuan, Yu-Peng Cun

Citation: Yu-Fang Mao, Xi-Guo Yuan, Yu-Peng Cun. A novel machine learning approach (svmSomatic) to distinguish somatic and germline mutations using next-generation sequencing data. Zoological Research, 2021, 42(2): 246-249. doi: 10.24272/j.issn.2095-8137.2021.014

Funds: This study was supported by the CAS Pioneer Hundred Talents Program and the National Natural Science Foundation of China (32070683) to Y.P.C.
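  • Abstract: Somatic mutations are a major class of variation in cancer genomes and are closely linked to tumor initiation and progression. Detecting single nucleotide variants (SNVs) facilitates downstream analyses in cancer research. Many SNV detection methods exist, but most of them need a matched normal sample to identify somatic variants, and matched normal samples are often difficult to obtain. Methods that detect somatic variants from tumor-only data are therefore essential. In this work, we develop a machine learning approach for accurately detecting somatic mutations in next-generation sequencing data from a single tumor sample. A further consideration is the co-occurrence of several variant types: copy number variations (CNVs) and SNVs commonly appear together in tumor cells. We therefore propose svmSomatic, a machine learning model that distinguishes somatic from germline mutations in the genomic data of a single tumor sample. The novel features of svmSomatic are: 1) it accounts for the effect of CNVs on somatic variant detection; 2) for tumor-only data, it uses a trained support vector machine (SVM) as the classifier that separates somatic from germline variants. We tested svmSomatic on simulated and real genomic data and compared it with other methods of the same kind. Judged by F1-score, svmSomatic performed well relative to the other methods on both simulated and real data.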
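The abstract's point about CNVs can be made concrete with a small calculation. The sketch below computes the expected variant allele frequency of a heterozygous germline SNV and of a somatic SNV in a tumor-only sample. It uses a standard purity and copy-number mixture model, which is an assumption made here for illustration and is not taken from svmSomatic; the function name and all parameters are hypothetical.

```python
def expected_vaf(purity, tumor_cn, mutated_copies, germline, normal_cn=2):
    """Expected fraction of reads that carry the variant allele.

    Assumed mixture model (illustrative, not the svmSomatic formula):
    tumor cells contribute `tumor_cn` copies per cell, `mutated_copies`
    of which carry the variant; normal cells contribute `normal_cn`
    copies, one of which carries the variant only when the variant is
    a heterozygous germline variant.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must lie in [0, 1]")
    if not 0 <= mutated_copies <= tumor_cn:
        raise ValueError("mutated_copies must lie in [0, tumor_cn]")
    variant = purity * mutated_copies + (1 - purity) * (1 if germline else 0)
    total = purity * tumor_cn + (1 - purity) * normal_cn
    if total == 0:
        raise ValueError("no copies of the locus are present")
    return variant / total


# Copy-neutral region, tumor purity 0.6.
germline_neutral = expected_vaf(0.6, tumor_cn=2, mutated_copies=1, germline=True)
somatic_neutral = expected_vaf(0.6, tumor_cn=2, mutated_copies=1, germline=False)

# Same purity, but the region is amplified to four copies in the tumor.
germline_gain = expected_vaf(0.6, tumor_cn=4, mutated_copies=3, germline=True)
somatic_gain = expected_vaf(0.6, tumor_cn=4, mutated_copies=1, germline=False)
```

Under this assumed model, the copy-neutral germline variant sits near 0.5 and the somatic one near 0.3, while the amplification moves them to roughly 0.69 and 0.19. A fixed allele-frequency threshold would therefore behave differently in amplified regions, which is one way to see why copy number is a useful input to the classifier.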
  • Figure 1. Overview of svmSomatic method and performance comparison among five methods

    A: Overview of svmSomatic. Input is a tumor-only sample aligned to a human reference genome. Based on STIC and FREEC, five features related to somatic SNVs were selected for SVM training. Trained classifier was then used to distinguish between germline and somatic mutations; B: Performance comparisons of five methods based on F1-score using simulation datasets with tumor purity ranging from 0.2 to 0.8 and coverage of 30X; C: Overlap between methods in terms of total number of detected somatic SNVs using real dataset.
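Panel B compares callers by F1-score. As a minimal sketch of how that metric is computed for a set of somatic SNV calls against a known truth set (as is available for simulated data), consider the following; the variant tuples are made-up placeholders, not data from the study, and the function name is hypothetical.

```python
def f1_score(predicted, truth):
    """F1-score (harmonic mean of precision and recall) of a call set."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical calls, keyed by (chromosome, position, ref, alt).
truth_set = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"),
             ("chr2", 75, "C", "T"), ("chr2", 900, "T", "G"),
             ("chr3", 42, "G", "A")}
called_set = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"),
              ("chr2", 75, "C", "T"), ("chr4", 10, "A", "G")}

score = f1_score(called_set, truth_set)
```

Here three of four calls are correct (precision 0.75) and three of five true variants are found (recall 0.6), giving an F1-score of 2/3.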

    Table 1. Description of five extracted features

    Feature                 Description
    Read depth              Number of reads mapped to each site
    Mismatched reads        Number of mismatched reads
    Allele frequency        Ratio of a particular allele to total number of alleles
    Ave. mapping quality    Average mapping quality of reads matched to each site
    Copy number             Copy number of reads mapped to each site
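To illustrate how the five features in Table 1 could feed a classifier, here is a minimal, dependency-free sketch. It is not the svmSomatic implementation: it swaps in a linear soft-margin SVM trained by batch sub-gradient descent on the hinge loss, the class and field names are hypothetical, allele frequency is assumed to be mismatched reads divided by read depth, and the eight training sites are invented toy values.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class SiteFeatures:
    """The five per-site features of Table 1 (field names are hypothetical)."""
    read_depth: int
    mismatched_reads: int
    avg_mapping_quality: float
    copy_number: int

    @property
    def allele_frequency(self) -> float:
        # Assumption: allele frequency = mismatched reads / read depth.
        return self.mismatched_reads / self.read_depth if self.read_depth else 0.0

    def vector(self) -> list:
        return [float(self.read_depth), float(self.mismatched_reads),
                self.allele_frequency, float(self.avg_mapping_quality),
                float(self.copy_number)]


class LinearSVM:
    """Linear soft-margin SVM fitted by batch sub-gradient descent (a stand-in)."""

    def __init__(self, lam=0.01, lr=0.1, epochs=500):
        self.lam, self.lr, self.epochs = lam, lr, epochs

    def _scale(self, x):
        # Standardize each feature with the training mean and std.
        return [(v - m) / s for v, m, s in zip(x, self.means, self.stds)]

    def fit(self, sites, labels):
        rows = [s.vector() for s in sites]
        cols = list(zip(*rows))
        self.means = [mean(c) for c in cols]
        self.stds = [pstdev(c) or 1.0 for c in cols]  # guard constant columns
        X = [self._scale(r) for r in rows]
        n, d = len(X), len(cols)
        self.w, self.b = [0.0] * d, 0.0
        for _ in range(self.epochs):
            grad_w = [self.lam * wj for wj in self.w]  # L2 regularization term
            grad_b = 0.0
            for x, y in zip(X, labels):
                margin = y * (sum(wj * xj for wj, xj in zip(self.w, x)) + self.b)
                if margin < 1:  # only margin violators contribute to the hinge
                    for j in range(d):
                        grad_w[j] -= y * x[j] / n
                    grad_b -= y / n
            self.w = [wj - self.lr * g for wj, g in zip(self.w, grad_w)]
            self.b -= self.lr * grad_b
        return self

    def predict(self, site):
        x = self._scale(site.vector())
        score = sum(wj * xj for wj, xj in zip(self.w, x)) + self.b
        return 1 if score >= 0 else -1  # +1 = somatic, -1 = germline


# Invented toy sites: (read depth, mismatched reads, mapping quality, copy number).
germline = [SiteFeatures(30, 15, 58, 2), SiteFeatures(32, 17, 60, 2),
            SiteFeatures(28, 13, 59, 2), SiteFeatures(40, 26, 60, 3)]
somatic = [SiteFeatures(30, 6, 57, 2), SiteFeatures(31, 8, 60, 2),
           SiteFeatures(29, 5, 58, 2), SiteFeatures(41, 9, 59, 3)]
sites = germline + somatic
labels = [-1] * len(germline) + [1] * len(somatic)

model = LinearSVM().fit(sites, labels)
new_call = model.predict(SiteFeatures(30, 7, 59, 2))
```

In practice a kernel SVM from an established library with tuned hyper-parameters would be the more usual choice; the linear version is used here only to keep the example self-contained.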
  • ZR-2021-014 Supplementary Material.pdf
Publication history
  • Received: 2021-01-14
  • Accepted: 2021-03-10
  • Published online: 2021-03-12
  • Issue published: 2021-03-18