Pitfalls of barcodes in the study of worldwide SARS-CoV-2 variation and phylodynamics
Abstract: Analysis of SARS-CoV-2 genome variation using a minimal number of selected informative sites that make up a genetic barcode presents several drawbacks. We show that purely mathematical procedures for site selection should be supervised by known phylogeny (i) to ensure that solid tree branches are represented instead of mutational hotspots with poor phylogeographic properties, and (ii) to avoid phylogenetic redundancy. We propose a procedure that prevents information redundancy in site selection by considering the cumulative informativeness of previously selected sites (as a proxy for phylogenetic-based criteria). This procedure demonstrates that, for short barcodes (e.g., 11 sites), there are thousands of informative site combinations that improve on previous proposals. We also show that barcodes based on worldwide databases inevitably prioritize variants located at the basal nodes of the phylogeny, such that most representative genomes in these ancestral nodes are no longer in circulation. Consequently, coronavirus phylodynamics cannot be properly captured by universal genomic barcodes because most SARS-CoV-2 variation is generated in geographically restricted areas by the continuous introduction of domestic variants.
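The redundancy-avoiding selection described in the abstract can be illustrated with a minimal sketch. It assumes that the informativeness of a set of sites is the Shannon entropy (in bits) of the haplotypes those sites define, and that sites are added greedily, one at a time, so that each new site maximizes the cumulative entropy given the sites already chosen. Both the entropy definition and the greedy rule are assumptions made for illustration; the function names and the toy alignment are hypothetical and are not taken from the study.

```python
from collections import Counter
from math import log2


def joint_entropy(sequences, sites):
    """Shannon entropy (bits) of the haplotypes formed by the given sites."""
    counts = Counter(tuple(seq[i] for i in sites) for seq in sequences)
    total = len(sequences)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def greedy_select(sequences, n_sites):
    """Add sites one at a time, each maximizing the cumulative entropy."""
    candidates = set(range(len(sequences[0])))
    chosen, trace = [], []
    for _ in range(n_sites):
        # Ties are broken in favor of the lowest site index.
        best = max(sorted(candidates),
                   key=lambda s: joint_entropy(sequences, chosen + [s]))
        chosen.append(best)
        candidates.remove(best)
        trace.append(joint_entropy(sequences, chosen))
    return chosen, trace


# Hypothetical toy alignment: columns 0 and 1 are perfectly linked,
# column 2 splits the sample independently, column 3 is a rare variant.
toy = ["AAGT", "AAGT", "AACT", "AACT", "CCGT", "CCGT", "CCCT", "CCCA"]
chosen, trace = greedy_select(toy, 2)
# chosen is [0, 2] and trace is [1.0, 2.0]
```

In this toy case, ranking sites by individual entropy alone would pair columns 0 and 1, which carry the same information (joint entropy of 1 bit), whereas the cumulative criterion skips the redundant column and reaches 2 bits with the same number of sites.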
Key words:
- SARS-CoV-2
- COVID-19
- Phylogeny
- Phylodynamics
- Barcode
- Informative subtype markers
Figure 1. Skeleton of the SARS-CoV-2 phylogeny based on informative subtype marker (ISM) signatures, interpolated frequency maps of haplogroup sub-lineages with differential geographic distributions, and comparative entropy values for ISM signatures obtained with different strategies
A: Skeleton of the most parsimonious phylogenetic tree of SARS-CoV-2 variation based on ISM signatures. Above: Zhao et al. (2020) proposed an initial signature composed of 20 ISMs; those retained in their reduced 11-ISM signature are highlighted in blue. Signatures defined by Zhao et al. (2020) are indicated below the label of each clade (clade names according to Gómez-Carballa et al. (2020a)); clades with a purple background are those captured by the 11-ISM set. Bottom: Tree built on the 11-ISM set prioritized by the HE algorithm; gray indicates mutations that occurred in the same branches (according to Gómez-Carballa et al. (2020a)). Green stars indicate parallel mutations. Percentages below nodes indicate frequencies in the 90 K database.
B: Interpolated maps of haplogroup frequencies for haplogroup A2a4 (represented by signature CCCGCCAGGGA in Zhao et al. (2020)) and its two sub-lineages A2a4a3a and A2a4c1a, as well as haplogroup A2a5 (CCCGCCGGGGG) and its sub-lineage A2a5c.
C: Above: Entropy obtained with the HE algorithm for the 11 and 20 ISMs selected by Zhao et al. (2020) (red and purple, respectively; the curves do not match because the HE algorithm prioritizes the 20 ISMs differently; see also Table 1) and for the 11-ISM barcode proposed by Guan et al. (2020) (blue); dotted vertical lines indicate HE values for the 11-ISM and 20-ISM sets. The inset shows HE entropy values for signatures composed of 1 to 400 ISMs (green), calculated in the present study using the 90 K database. Bottom: Boxplot of HE values for 2×10^6 combinations of 11 ISMs drawn from the 50 ISMs with the highest individual entropy values; light green dots (n=12 751) in the dot cloud indicate combinations with HE values above that of the signature proposed by Zhao et al. (2020) (red dot); note that all random combinations fall below the signature obtained by the HE algorithm implemented in the present study (top green dot). The blue dot shows the HE value of the 11-site barcode of Guan et al. (2020) (95% of random site combinations fall above the HE value provided by this site combination).
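The comparison in panel C (random 11-site combinations drawn from the 50 sites with the highest individual entropy, set against a published signature) can be sketched as follows. This is a minimal illustration under the assumption that HE is the Shannon entropy in bits of the haplotypes defined by the selected sites; the function names, parameters, and toy alignment are hypothetical, and the toy uses 2-site sets from a 3-site pool rather than 2×10^6 draws of 11 sites from 50.

```python
import random
from collections import Counter
from math import log2


def haplotype_entropy(sequences, sites):
    """Shannon entropy (bits) of the haplotypes formed by the given sites."""
    counts = Counter(tuple(seq[i] for i in sites) for seq in sequences)
    total = len(sequences)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def fraction_above(sequences, reference_sites, pool_size, n_draws, seed=0):
    """Share of random site sets (same size as the reference, drawn from the
    pool_size sites with the highest individual entropy) whose entropy is
    strictly higher than that of the reference set."""
    rng = random.Random(seed)
    n_cols = len(sequences[0])
    ranked = sorted(range(n_cols),
                    key=lambda s: haplotype_entropy(sequences, [s]),
                    reverse=True)
    pool = ranked[:pool_size]
    ref = haplotype_entropy(sequences, reference_sites)
    k = len(reference_sites)
    hits = sum(haplotype_entropy(sequences, rng.sample(pool, k)) > ref
               for _ in range(n_draws))
    return hits / n_draws


# Hypothetical toy alignment: columns 0 and 1 are perfectly linked,
# column 2 is independent of them, column 3 is a rare variant.
toy = ["AAGT", "AAGT", "AACT", "AACT", "CCGT", "CCGT", "CCCT", "CCCA"]
redundant = fraction_above(toy, [0, 1], pool_size=3, n_draws=200)
optimal = fraction_above(toy, [0, 2], pool_size=3, n_draws=200)
```

A reference built from two linked sites is beaten by a sizeable share of random draws, while a non-redundant reference is beaten by none, which mirrors the contrast between the published signatures and the HE-selected one in the boxplot.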
Table 1. ISMs selected using the HE procedure described in the present study and the 20-ISM signature captured by Zhao et al. (2020)

90 K database, HE algorithm
      All database         Before 18 June 2020  After 17 June 2020
Rank  Site      HE         Site      HE         Site      HE
#1    28881     0.93       241       0.86       28881     0.99
#2    25563     1.58       25563     1.58       25563     1.58
#3    241       2.06       28881     2.07       241       1.97
#4    11083     2.37       11083     2.41       1163      2.35
#5    1163      2.61       1059      2.64       11083     2.61
#6    1059      2.83       8782      2.84       28854     2.83
#7    20268     3.02       20268     3.03       1059      3.03
#8    14805     3.17       14805     3.21       19839     3.21
#9    23731     3.31       15324     3.33       23731     3.37
#10   28854     3.45       27964     3.44       20268     3.52
#11   19839     3.58       10097     3.54       27964     3.65
#12   8782      3.70       28854     3.64       313       3.77
#13   27964     3.83       27046     3.73       14805     3.88
#14   15324     3.93       17747     3.81       11916     3.98
#15   313       4.03       25429     3.89       15324     4.07
#16   11916     4.12       11916     3.97       22480     4.15
#17   18877     4.19       313       4.04       8782      4.22
#18   25429     4.26       29553     4.11       21575     4.29
#19   18060     4.32       19839     4.18       18877     4.35
#20   21575     4.38       18877     4.24       13862     4.41

90 K database, Zhao et al. (2020) ISMs signature
      All database         Before 18 June 2020  After 17 June 2020
Rank  Site      HE         Site      HE         Site      HE
#1    28881*    0.93       241       0.86       28881*    0.99
#2    25563*    1.58       25563*    1.58       25563*    1.58
#3    241       2.06       28881*    2.07       241       1.97
#4    11083*    2.37       11083*    2.41       11083*    2.24
#5    1059*     2.59       1059*     2.64       1059*     2.45
#6    20268*    2.78       8782*     2.84       20268*    2.64
#7    14805*    2.93       20268*    3.03       14805*    2.75
#8    8782*     3.06       14805*    3.21       8782*     2.82
#9    18060*    3.12       17747     3.30       14408     2.87
#10   14408     3.18       2558*     3.36       18060*    2.92
#11   2558*     3.22       3037      3.42       23403*    2.95
#12   23403*    3.25       26144*    3.45       2558*     2.99
#13   3037      3.28       14408     3.48       3037      3.01
#14   26144*    3.31       28144     3.50       17747     3.03
#15   17747     3.32       18060*    3.52       26144*    3.04
#16   28144     3.33       23403*    3.54       28882     3.05
#17   28882     3.34       2480      3.54       2480      3.05
#18   2480      3.35       28882     3.55       28144     3.05
#19   17858     3.35       17858     3.55       17858     3.06
#20   28883     3.35       28883     3.56       28883     3.06

Sites common to all columns are in bold. The database used by Zhao et al. (2020) was downloaded on 17 June 2020; the table shows values obtained according to this timepoint. Asterisks indicate ISMs retained in the 11-ISM set by Zhao et al. (2020) out of the 20 initially selected by their algorithm. The HE algorithm prioritizes ISMs that Zhao et al. (2020) did not include among their top 20 candidates, and their signature in turn includes several sites that are not among the top 20 prioritized by the HE algorithm.
[1] Boni MF, Lemey P, Jiang XW, Lam TTY, Perry BW, Castoe TA, et al. 2020. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Nature Microbiology, 5(11): 1408–1417. doi: 10.1038/s41564-020-0771-4
[2] Forster P, Forster L, Renfrew C, Forster M. 2020. Phylogenetic network analysis of SARS-CoV-2 genomes. Proceedings of the National Academy of Sciences of the United States of America, 117(17): 9241–9243. doi: 10.1073/pnas.2004999117
[3] Galanter JM, Fernández-López JC, Gignoux CR, Barnholtz-Sloan J, Fernández-Rozadilla C, Via M, et al. 2012. Development of a panel of genome-wide ancestry informative markers to study admixture throughout the Americas. PLoS Genetics, 8(3): e1002554. doi: 10.1371/journal.pgen.1002554
[4] Gómez-Carballa A, Bello X, Pardo-Seco J, Martinón-Torres F, Salas A. 2020a. Mapping genome variation of SARS-CoV-2 worldwide highlights the impact of COVID-19 super-spreaders. Genome Research, 30(10): 1434–1448. doi: 10.1101/gr.266221.120
[5] Gómez-Carballa A, Bello X, Pardo-Seco J, Pérez Del Molino ML, Martinón-Torres F, Salas A. 2020b. Phylogeography of SARS-CoV-2 pandemic in Spain: a story of multiple introductions, micro-geographic stratification, founder effects, and super-spreaders. Zoological Research, 41(6): 605–620. doi: 10.24272/j.issn.2095-8137.2020.217
[6] Guan QT, Sadykov M, Mfarrej S, Hala S, Naeem R, Nugmanova R, et al. 2020. A genetic barcode of SARS-CoV-2 for monitoring global distribution of different clades during the COVID-19 pandemic. International Journal of Infectious Diseases, 100: 216–223. doi: 10.1016/j.ijid.2020.08.052
[7] Gudbjartsson DF, Helgason A, Jonsson H, Magnusson OT, Melsted P, Norddahl GL, et al. 2020. Spread of SARS-CoV-2 in the Icelandic population. The New England Journal of Medicine, 382(24): 2302–2315. doi: 10.1056/NEJMoa2006100
[8] Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, et al. 2018. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics, 34(23): 4121–4123. doi: 10.1093/bioinformatics/bty407
[9] Pardo-Seco J, Martinón-Torres F, Salas A. 2014. Evaluating the accuracy of AIM panels at quantifying genome ancestry. BMC Genomics, 15(1): 543. doi: 10.1186/1471-2164-15-543
[10] Rambaut A, Holmes EC, O'Toole Á, Hill V, McCrone JT, Ruis C, et al. 2020. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nature Microbiology, 5(11): 1403–1407. doi: 10.1038/s41564-020-0770-5
[11] Rockett RJ, Arnott A, Lam C, Sadsad R, Timms V, Gray KA, et al. 2020. Revealing COVID-19 transmission in Australia by SARS-CoV-2 genome sequencing and agent-based modeling. Nature Medicine, 26(9): 1398–1404. doi: 10.1038/s41591-020-1000-7
[12] Salas A, Amigo J. 2010. A reduced number of mtSNPs saturates mitochondrial DNA haplotype diversity of worldwide population groups. PLoS One, 5(5): e10218. doi: 10.1371/journal.pone.0010218
[13] Van Dorp L, Acman M, Richard D, Shaw LP, Ford CE, Ormond L, et al. 2020. Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infection, Genetics and Evolution, 83: 104351. doi: 10.1016/j.meegid.2020.104351
[14] Yu WB, Tang GD, Zhang L, Corlett RT. 2020. Decoding the evolution and transmissions of the novel pneumonia coronavirus (SARS-CoV-2 / HCoV-19) using whole genomic data. Zoological Research, 41(3): 247–257. doi: 10.24272/j.issn.2095-8137.2020.022
[15] Zhao ZQ, Sokhansanj BA, Malhotra C, Zheng K, Rosen GL. 2020. Genetic grouping of SARS-CoV-2 coronavirus sequences using informative subtype markers for pandemic spread visualization. PLoS Computational Biology, 16(9): e1008269. doi: 10.1371/journal.pcbi.1008269
ZR-2020-364 Supplementary Data and Table S1.zip