In a recent pre-print study posted to Research Square*, a group of researchers investigated the evolutionary trends of amino acid alterations at a population scale.
The study utilized this data to forecast both conserved and potentially mutable sites within the spike protein, guiding the development of vaccines and antibody treatments for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).
Study: Predicting the past and future evolutionary space of SARS-CoV-2. Image Credit: peterschreiber.media/Shutterstock.com
*Important notice: Research Square publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.
Background
The SARS-CoV-2 virus has undergone several mutations since its origin from an animal host at the Huanan seafood market, leading to its diversification. Significant mutations, like D614G in the spike protein and P323L in the viral ribonucleic acid (RNA) polymerase, increased the virus' transmissibility and replication.
SARS-CoV-2 genomes, whether dominant or minor variants, display both coding and non-coding differences. As the pandemic evolved, various dominant strains emerged and waned globally.
In some variants, like Alpha and Omicron, numerous mutations were introduced seemingly at once. It is hypothesized that these large genomic shifts might originate from long-term infections in immune-compromised individuals.
Further research is essential to decode these evolutionary patterns and predict future mutation sites on the virus' spike protein. Such studies can shape the development of vaccines and targeted therapies against SARS-CoV-2.
About the study
The present study used sequencing data from the coronavirus disease 2019 (COVID-19) Genomics UK Consortium (COG-UK), randomly selecting 100 samples per day from March 2020 to December 2022. A total of 96,558 samples were curated for analysis.
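The daily subsampling step can be illustrated with a short Python sketch. This is not the authors' code: the `(sample_id, collection_date)` record layout, the fixed random seed, and the choice to keep every sample on days with fewer than 100 available are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def sample_per_day(records, per_day=100, seed=0):
    """Pick up to `per_day` sample IDs at random for every collection date.

    `records` is an iterable of (sample_id, collection_date) pairs.
    The record layout and the seed are hypothetical.
    """
    rng = random.Random(seed)
    by_day = defaultdict(list)
    for sample_id, day in records:
        by_day[day].append(sample_id)

    selected = {}
    for day, ids in sorted(by_day.items()):
        # Days with fewer than `per_day` samples contribute all of them.
        k = min(per_day, len(ids))
        selected[day] = rng.sample(ids, k)
    return selected

# Toy input: 250 samples on one day, 40 on the next.
records = [(f"S{i:04d}", "2020-03-01") for i in range(250)]
records += [(f"T{i:04d}", "2020-03-02") for i in range(40)]
picked = sample_per_day(records)
```

Sampling without replacement within each day keeps every selected genome unique while spreading coverage evenly across the study period.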
After filtering out samples with inadequate genome coverage, 96,209 samples remained. These samples, previously trimmed by COG-UK, were sourced from the National Center for Biotechnology Information Sequence Read Archive (NCBI SRA) database.
The amino acid variations in the viral genes were determined using established methodologies. The reads were aligned to a reference SARS-CoV-2 genome, and various tools and scripts, including Bowtie2, SAMtools, BAMClipper, and QuasiRecomb, were applied for detailed sequence analysis.
After processing, mutations at each nucleotide site were identified, and low-frequency errors were distinguished from genuine mutations using specific P-value criteria.
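The summary does not state which statistical test or threshold the authors used. As one plausible illustration, the Python sketch below applies a one-sided binomial test: given the read depth at a site and an assumed per-base error rate, it asks how likely it is that errors alone would produce at least the observed number of variant reads. The error rate (`0.001`), the significance level (`0.01`), and the function names are hypothetical.

```python
from math import exp, lgamma, log, log1p

def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are evaluated in log space to avoid overflow at high read depth.
    """
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + i * log(p) + (n - i) * log1p(-p))
        total += exp(log_term)
    return min(total, 1.0)

def is_genuine_variant(alt_reads, depth, error_rate=0.001, alpha=0.01):
    """Flag a site as a genuine variant if errors alone are an unlikely explanation.

    `error_rate` and `alpha` are illustrative assumptions, not the study's values.
    """
    return binomial_upper_tail(alt_reads, depth, error_rate) < alpha

# At depth 1000 with a 0.1% error rate, about one erroneous read is expected.
few_reads = is_genuine_variant(alt_reads=2, depth=1000)    # consistent with noise
many_reads = is_genuine_variant(alt_reads=15, depth=1000)  # hard to explain by noise
```

In practice the error rate would need to be estimated for the specific amplification and sequencing protocol, and a multiple-testing correction would usually be applied across sites.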
The frequency of amino acid site variations was computed, and the Shapiro-Wilk test (W value) gauged the normality of these frequencies. Skewness helped measure the asymmetry of the distribution for each site monthly.
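To make these two statistics concrete, the following Python sketch computes the Shapiro-Wilk W value and the skewness for one site's variation frequencies using SciPy. The simulated frequencies and the `site_normality` helper are assumptions for illustration. The study itself reports using R, so this is an analogous sketch rather than the authors' pipeline.

```python
import numpy as np
from scipy import stats

def site_normality(frequencies):
    """Return (W, skewness) for one site's variation frequencies in one month."""
    values = np.asarray(frequencies, dtype=float)
    w_stat, _p_value = stats.shapiro(values)  # W close to 1 means near-normal
    return float(w_stat), float(stats.skew(values))

rng = np.random.default_rng(0)

# Hypothetical site driven only by symmetric technical noise.
noise_site = rng.normal(loc=0.01, scale=0.002, size=100)

# Hypothetical site where a minority of samples carry the variant at high
# frequency, giving a long right-hand tail.
selected_site = np.concatenate([
    rng.normal(loc=0.01, scale=0.002, size=90),
    rng.uniform(0.2, 0.9, size=10),
])

w_noise, skew_noise = site_normality(noise_site)
w_sel, skew_sel = site_normality(selected_site)
```

In this toy setup, the noise-only site yields a W value near 1 and skewness near 0, while the site with a high-frequency minority yields a lower W value and a strongly positive skew.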
Machine learning, implemented in the R language, was used to process the data. The W values of each amino acid site across selected months were assembled into a data matrix, and principal component analysis (PCA) was applied to this matrix.
The data were standardized, and a proportion of low-variance variables was removed. Clustering analysis was then conducted using the partitioning around medoids (PAM) algorithm, and the optimal number of clusters was determined using the within-cluster sum of squares (WSS) statistic.
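The clustering step can be sketched in plain Python as follows. This is a simplified, swap-based k-medoids in the spirit of PAM, not the R implementation used in the study. The toy W-value matrix (rows are sites, columns are months) is invented, the PCA and low-variance filtering steps are omitted, and the naive initialization with the first k rows is an assumption (PAM normally begins with a greedy build step).

```python
import math
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column; constant columns are left at 0."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    """Swap medoids with non-medoids for as long as the total cost decreases."""
    medoids = list(range(k))  # naive initialization (assumption)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(points, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda j: math.dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels

def wss(points, labels, k):
    """Within-cluster sum of squared distances to each cluster centroid."""
    total = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            centroid = [mean(c) for c in zip(*members)]
            total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total

# Toy matrix: three sites with consistently high W, three with low W.
w_matrix = [
    [0.97, 0.96, 0.98], [0.95, 0.97, 0.96], [0.96, 0.98, 0.97],
    [0.40, 0.35, 0.30], [0.45, 0.38, 0.33], [0.42, 0.30, 0.36],
]
scaled = standardize(w_matrix)
wss_by_k = {k: wss(scaled, pam(scaled, k)[1], k) for k in (1, 2, 3)}
_, labels2 = pam(scaled, 2)
```

Plotting `wss_by_k` against k and looking for the point where the curve flattens (the "elbow") is the usual way to choose the number of clusters from the WSS statistic.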
Study results
To understand the rise of minor variant genomes and pinpoint amino acid variation frequencies, the researchers analyzed sequence data provided by COG-UK. They chose sequences generated via the ARTIC amplification method and sequenced on the Illumina NovaSeq 6000.
Altogether, they obtained 96,559 SARS-CoV-2 consensus genomes and their related minor variant genomic data, covering roughly 100 genomes daily from March 2020 to December 2022. These nucleotide sequences were virtually translated into amino acid sequences.
The researchers then computed the average variation frequencies of each amino acid to study the mutation patterns in SARS-CoV-2 proteins over these three years.
Their analysis identified artificial low-frequency genetic variants, stemming from errors during amplicon creation and sequencing. To address this, they compared the average variation frequencies of amino acids at identical sites in varying months, revealing differing frequencies in amino acid alterations.
The study assessed the average variation frequencies of amino acid sites in each protein on a monthly basis.
Most proteins displayed an uptick in diversity from the pandemic's onset, followed by a minor dip due to the bottleneck during the Alpha variant's emergence in October 2020, and then a resurgence in genomic variant diversity between December 2020 and January 2021.
The authors also compared the monthly average variation of amino acid sites in SARS-CoV-2 from April 2020 to mid-October 2021 with March 2020 data.
Due to updates in the ARTIC primers by mid-October 2021, variations from November 2021 to December 2022 were contrasted with late October 2021. Notably, the spike protein showed the most variation, indicating significant selection pressure from host immunity. For instance, by April 2020, amino acid variations largely mirrored those from March.
A standout, the D614G substitution, persisted throughout the study. The Alpha variant mutations began appearing in October 2020, stabilizing by January 2021. Delta mutations emerged by April 2021, with both Alpha and Delta coexisting by May.
Typically, mutations transitioned similarly across the Alpha and Delta variants. Yet, specific substitutions like P681H (Alpha) and R158G (Delta) rose to dominance more slowly. The initial Omicron variant's substitutions rose to dominance from December 2021, stabilizing by February 2022. Subsequent mutations linked to newer Omicron sub-variants continuously emerged and receded.
The researchers explored evolutionary patterns in the spike protein of SARS-CoV-2 to predict amino acids under evolutionary pressure. They utilized the Shapiro-Wilk statistic to measure the evolutionary patterns.
If the virus experienced no evolutionary pressures, variations would result from amplification, sequencing errors, or stochastic errors during viral replication, producing a high W value. Such amino acids would cluster at the top in visual representations. However, evolutionary pressures would result in variations in amino acid sites and create a right-skewed distribution with lower W values.
The relationship between the average variation and W value was visualized for the spike protein's amino acid sites. Over the pandemic's three years, some amino acids maintained high W values, while others shifted. Notably, all defining substitutions for variants of concern (VoCs) passed through the cluster linked to low W values before becoming predominant in the population.
To predict sites in the spike protein under future evolutionary pressure, an artificial intelligence/machine learning (AI/ML) technique based on the k-medoids clustering algorithm was employed.
This method effectively clustered amino acid sites based on W values, with longer observation periods producing more accurate models. These models accurately predicted mutation sites and offer potential for future predictive models.
The models were particularly effective in identifying amino acids prone to change and those remaining conserved. The conserved sites could inform a universal vaccine against future variants.