In a recent study posted to the bioRxiv* preprint server, researchers presented a novel genomic sequence data-based mathematical framework for rapid detection of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants of interest (VoIs) and variants of concern (VoCs) in a viral multiple sequence alignment (MSA).
*Important notice: bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.
Background
Genomic surveillance plays a vital role in combating more immune-evasive and virulent ribonucleic acid (RNA) virus strains and is particularly necessary for effective coronavirus disease 2019 (COVID-19) mitigation. Genomic sequencing efforts have facilitated virological and epidemiological surveillance in near real time; however, efficiently identifying variants within SARS-CoV-2 sequences that could pose a threat remains challenging. Existing correlation analysis methods are not amenable to co-evolutionary analysis.
About the study
In the present study, researchers presented a novel framework for rapidly detecting SARS-CoV-2 VoIs and VoCs from an MSA time series of SARS-CoV-2 genomes. The building blocks of the framework were pairs of co-evolving motifs identified by co-evolutionary signals in the MSA.
The method was applied to SARS-CoV-2, considering pairwise relations for analyzing alerts issued between November 2020 and August 2021 with a weekly resolution for England, the United States of America, South America, and India. The aim was to identify maximal co-evolving motif pairs within the MSA instead of considering the emergence of specific mutations.
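To make the idea of a pairwise co-evolutionary signal concrete, the following is a minimal sketch that scores pairs of alignment columns by mutual information. Mutual information is a stand-in chosen for illustration; the article does not specify the measure the authors used. The function names, the toy alignment, and the threshold are all hypothetical.

```python
from collections import Counter
from math import log2


def column(msa, i):
    """Return the i-th column of an alignment given as equal-length strings."""
    return [seq[i] for seq in msa]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    NOTE: an illustrative stand-in for a co-evolution signal,
    not the measure used in the study.
    """
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_a[a] * count_b[b]))
    return mi


def coevolving_pairs(msa, threshold):
    """List column pairs (i, j, score) whose score reaches the threshold."""
    length = len(msa[0])
    pairs = []
    for i in range(length):
        for j in range(i + 1, length):
            score = mutual_information(column(msa, i), column(msa, j))
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


# Hypothetical toy alignment: columns 0 and 2 change together,
# columns 1 and 3 are fully conserved.
toy_msa = ["ACGT", "ACGT", "TCAT", "TCAT"]
pairs = coevolving_pairs(toy_msa, threshold=0.5)
```

In this toy alignment only columns 0 and 2 vary together, so they are the only pair reported. Conserved columns carry no signal under this measure.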
The team constructed high-dimensional simplices based on the co-evolutionary pair coupling distances. Clusters containing substantial fractions of newly emerging motifs triggered alerts, which the team then analyzed. The alerts were tested retrospectively using SARS-CoV-2 sequence data obtained from the Global Initiative on Sharing All Influenza Data (GISAID) database.
The framework issued alerts without a priori assumptions, with the exception of the Wuhan-Hu-1 (reference) sequence on which MSAs were built. The alerts were related to established SARS-CoV-2 variants and a posteriori knowledge of VoI/VoC strains was used for evaluating the issued alerts’ accuracy.
The main purpose of the alerts was to facilitate rapid biological analysis for a few critical sites. A substantial fraction of the included clusters comprised newly active sites that represented large additive fitness components of the underlying clusters and could indicate the emergence of a functional block or a mutational event. P-distance and J-distance with k-means and HCS-clustering were used for the analysis.
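The alert rule described above, in which a cluster triggers an alert when a substantial fraction of its sites are newly active, can be sketched as follows. This is only an illustration under stated assumptions: the 0.5 threshold, the site positions, and the function names are hypothetical and do not come from the study.

```python
def newly_active_fraction(cluster_sites, previously_active):
    """Fraction of a cluster's sites that were not active in earlier weeks."""
    cluster = set(cluster_sites)
    if not cluster:
        return 0.0
    return len(cluster - set(previously_active)) / len(cluster)


def issue_alerts(clusters, previously_active, min_fraction=0.5):
    """Return (cluster_index, fraction) for clusters dominated by new sites.

    min_fraction is a hypothetical cut-off; the article only says the
    fraction of newly emerging sites must be "substantial".
    """
    alerts = []
    for idx, sites in enumerate(clusters):
        fraction = newly_active_fraction(sites, previously_active)
        if fraction >= min_fraction:
            alerts.append((idx, fraction))
    return alerts


# Hypothetical site positions, for illustration only.
previously_active = {3, 7, 11}
clusters = [
    [3, 7, 20],        # one new site out of three
    [21, 22, 23, 11],  # three new sites out of four
    [3, 7],            # no new sites
]
alerts = issue_alerts(clusters, previously_active)
```

Here only the second cluster is flagged, since three of its four sites are new. A cluster made of sites that are already active in the population never triggers an alert under this rule, which is consistent with the article's explanation of why no alert was issued for the Mu variant.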
Results
An alert in England flagged the rise of Alpha more than four weeks before the WHO and Pango designations. The 28 Alpha-defining mutations were split into three blocks. Of the seven alerts in India, one corresponded to a cluster of 14 motifs, 13 of which were active sites confirmed to be among the 20 characteristic Delta VoC mutations. These mutations were organized into four co-evolutionary blocks that did not emerge at once; however, the 13 positions became inactive after 21 days.
In the USA, seven alerts were observed, three of which marked the emergence of Delta AY.3 and exhibited 31 characteristic mutations, 11 of which belonged exclusively to AY.3. In South America, 23 alerts were observed, five of which confirmed the emergence of a new variant. In March 2021, the Lambda variant had low prevalence and had not yet been assigned a Pango lineage; however, distinct co-evolution signals mapped to Lambda were identified six weeks ahead of time.
Five alerts were triggered in the first week of December, which were: (i) a newly emerging cluster of 13 sites not mapped to any VoI/VoC, (ii) a cluster of 20 sites, of which 17 were actual or (+) sites mapped to Alpha, (iii) a cluster of 15 sites, of which nine were actual sites, (iv) a cluster of 16 sites, of which 15 were actual sites mapped to Delta, and (v) an entirely new emerging cluster of 12 actual sites, mapped to the Delta AY.3 sub-lineage.
In the third week of December, a cluster of 23 actual sites emerged, mapped to the Gamma variant. The framework did not issue alerts for the emergence of the Mu variant, whose characteristic mutations were a recombination of mutations present in Alpha, Delta, and Gamma. Because these mutational sites were already active in the general population, they were not considered actual sites. The matrices were highly symmetrical, indicating that HCS-clustering and k-means clustering produced similar results for P- and J-distances and that both metrics captured SARS-CoV-2 co-evolution.
Conclusions
Overall, the study findings showed how motifs could provide insights into the organization of characteristic mutations of a VoI/VoC, organizing them as co-evolving blocks. The extraction of co-evolutionary signals using MSA analysis could improve understanding of the significance of SARS-CoV-2 mutations and could enable prompt detection of emerging SARS-CoV-2 variants.
The framework required no a priori phylogenetic knowledge or biological impact analysis and could be considered a guidance system that not only flagged sites where biological analysis should be performed but also quantified the co-evolution rate. The framework could provide important data to biologists, since identifying co-evolution relations could offer clues to the underlying biological mechanisms.