Results from the funded study: Genomic Surveillance and Modelling for the COVID-19 Variant Endgame
Pillar 6
Computational Analysis, Modelling and Evolutionary Outcomes (CAMEO)
Since the start of the COVID-19 pandemic, the SARS-CoV-2 genome has been sequenced at an unprecedented scale. This has enabled the scientific community to detect many of the mutations that arise as the virus evolves, including those that define variants of concern that have spread quickly in the human population.
The tremendous amount of available viral genome sequencing data has also created new computational challenges, bringing viral genomics into the era of ‘big data’. Managing millions of sequences requires efficient storage, handling of missing data, integration of data from different sources, and powerful visualization techniques. While phylogenetic analyses are the main approach used by the community, they require many pre- and post-processing steps, often carried out by hand by data scientists, which makes the whole process time-consuming, somewhat arbitrary, and difficult to reproduce. In a fast-evolving data landscape, these limitations call for alternative strategies that allow quick observations based on all available data, for efficient viral genetic surveillance.
In this study, we present a set of population genomics approaches applied to SARS-CoV-2 sequence data from the GISAID database during the first year of the COVID-19 pandemic. We show how this toolbox, which can use all of the data without reducing the size of the dataset, supports an in-depth analysis of the genetic diversity of SARS-CoV-2. The toolbox includes an imputation method to compensate for sequencing issues that lead to missing data. The evolutionary relationships between sequences are represented by a haplotype network, an efficient population genetics method that categorizes viral sequences based on a set of significant genetic markers. To understand the expansion dynamics of a lineage or variant, we used Tajima’s D, a well-known population genetics neutrality test. Finally, we show how Principal Component Analysis (PCA) of SARS-CoV-2 genetic variation is an insightful visualization technique for identifying evolutionary jumps in the viral genetic landscape.
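To make the neutrality test concrete, the following is a minimal sketch of how Tajima’s D can be computed from a set of aligned sequences using the standard textbook formula. It is not the study's implementation: the function name `tajimas_d` is hypothetical, and the sketch assumes equal-length sequences with no missing or ambiguous bases (for example, after imputation).

```python
from collections import Counter
from math import nan, sqrt


def tajimas_d(seqs):
    """Tajima's D for aligned, equal-length sequences (hypothetical helper).

    Assumes every position is a called base: no gaps, no 'N'.
    """
    n = len(seqs)
    if n < 4:
        # The variance term of the statistic is zero for n < 4.
        raise ValueError("need at least 4 sequences")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    n_pairs = n * (n - 1) // 2
    seg_sites = 0    # S: number of polymorphic columns
    pair_diffs = 0   # total differences summed over all sequence pairs

    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # Pairs that differ here = all pairs minus pairs sharing an allele.
            same = sum(c * (c - 1) // 2 for c in counts.values())
            pair_diffs += n_pairs - same

    if seg_sites == 0:
        return nan  # D is undefined without segregating sites

    pi = pair_diffs / n_pairs  # mean pairwise differences

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Toy alignments, invented for illustration only.
rare_variants = ["TAAA", "ATAA", "AATA", "AAAT"]      # every mutation is a singleton
balanced_variants = ["AAAA", "AAAA", "TTTT", "TTTT"]  # every mutation is at 50%
print(tajimas_d(rare_variants), tajimas_d(balanced_variants))
```

An excess of rare variants, as in the first toy alignment, gives a negative D, the pattern usually read as a sign of recent expansion. An excess of intermediate-frequency variants, as in the second, gives a positive D.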
In conclusion, our approaches enable real-time characterization of emerging SARS-CoV-2 lineages, allowing for the prompt implementation of dynamic, global, and up-to-date data analysis pipelines that address the most pressing questions in viral variant research.