Computational Biology and Modelling
Since the start of the COVID-19 pandemic, the SARS-CoV-2 genome has been sequenced at an unprecedented scale. This has enabled the scientific community to detect many of the genetic mutations that arise during the virus's evolution, including those defining variants of concern that have spread quickly in the human population.
The tremendous amount of available viral genome sequencing data has also led to new computational challenges that brought viral genomics into the era of ‘big data’. Indeed, managing millions of sequences requires efficient storage solutions, methods for handling missing data, integration of data from different sources, and powerful visualization techniques. While phylogenetic analyses are the main approach used by the community, they require many pre- and post-processing steps, often performed by hand by data scientists, which makes the whole process time-consuming, somewhat arbitrary, and difficult to reproduce. In a fast-evolving data landscape, these limitations call for alternative strategies that allow quick observations based on all available data, for efficient viral genetic surveillance.
In this study, we present a set of population genomics approaches applied to SARS-CoV-2 sequence data from the GISAID database during the first year of the COVID-19 pandemic. We show how this toolbox, which can take advantage of all the data without reducing the size of the dataset, can be used to perform an in-depth analysis of the genetic diversity of SARS-CoV-2. The toolbox includes an imputation method to compensate for sequencing issues that lead to missing data. The evolutionary relationships between sequences are represented by a haplotype network, an efficient population genetics method for categorizing viral sequences based on a set of significant genetic markers. To understand the expansion dynamics of a lineage or variant, we use Tajima’s D, a widely used population genetics neutrality test. Finally, we show how Principal Component Analysis (PCA) of SARS-CoV-2 genetic variation is an insightful visualization technique for identifying evolutionary jumps in the viral genetic landscape.
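As an illustration of the neutrality test mentioned above, the following is a minimal sketch of Tajima's D computed from the standard textbook formula. It is not the authors' implementation: the function name `tajimas_d` and the toy alignments are hypothetical, and the sketch assumes equal-length aligned sequences with no gaps, missing calls, or ambiguous bases (that is, after any imputation step).

```python
from collections import Counter
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D for aligned, equal-length sequences (hypothetical helper).

    Assumes every character is a called base; missing data must be
    handled (for example, imputed) before calling this function.
    """
    n = len(sequences)
    if n < 4:
        raise ValueError("need at least 4 sequences: the variance term is zero for n < 4")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    n_pairs = n * (n - 1) / 2
    seg_sites = 0          # S: number of segregating (polymorphic) sites
    pairwise_diffs = 0.0   # total differences summed over all sequence pairs
    for column in zip(*sequences):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            same_pairs = sum(c * (c - 1) / 2 for c in counts.values())
            pairwise_diffs += n_pairs - same_pairs
    if seg_sites == 0:
        raise ValueError("no segregating sites: D is undefined")

    pi = pairwise_diffs / n_pairs  # mean number of pairwise differences

    # Constants from Tajima's formulation, depending only on sample size n.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Toy alignments (made up for illustration only).
rare_variants = ["AAAA", "TAAA", "ATAA", "AATA"]      # every variant is a singleton
balanced_variants = ["AAAA", "AAAA", "TTTA", "TTTA"]  # variants at 50% frequency

print(tajimas_d(rare_variants))      # negative: excess of rare variants
print(tajimas_d(balanced_variants))  # positive: excess of intermediate-frequency variants
```

A negative D reflects an excess of low-frequency variants, a pattern commonly associated with a recent expansion, while a positive D reflects an excess of intermediate-frequency variants. Real viral datasets would additionally need a policy for missing or ambiguous positions, which this sketch leaves out.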
In conclusion, our approaches enable real-time characterization of emerging SARS-CoV-2 lineages, allowing for the prompt implementation of dynamic, global, and up-to-date data analysis pipelines that help address the most pressing questions in viral variant research.
Mostefai Fatima, Gamache Isabel, N’Guessan Arnaud, Pelletier Justin, Huang Jessie, Murall Carmen Lia, Pesaranghader Ahmad, Gaonac’h-Lovejoy Vanda, Hamelin David J., Poujol Raphaël, Grenier Jean-Christophe, Smith Martin, Caron Etienne, Craig Morgan, Wolf Guy, Krishnaswamy Smita, Shapiro B. Jesse, Hussin Julie G. Population Genomics Approaches for Genetic Characterization of SARS-CoV-2 Lineages, Frontiers in Medicine, Vol 9, 2022. doi: 10.3389/fmed.2022.826746