The rapid outbreak of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) caused the coronavirus disease 2019 (COVID-19) pandemic, which has infected more than 538 million people worldwide.
Scientists have worked relentlessly to characterize the virus and to formulate pharmaceutical and non-pharmaceutical strategies to contain the pandemic.
Background
SARS-CoV-2 is an RNA virus with a genome of around 30,000 nucleotides. The virus contains four structural proteins, namely, the spike (S), envelope (E), membrane (M), and nucleocapsid (N) proteins, which are typically targeted in the development of novel therapeutics and vaccines to protect individuals against infection.
Since its emergence in 2019, SARS-CoV-2 has undergone genomic evolution through mutations, giving rise to several variants. Compared to the original SARS-CoV-2 strain, some variants, such as Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2), and Omicron (B.1.1.529), exhibited higher transmissibility, virulence, and capacity to evade immune responses. For instance, the D614G mutation in the spike protein is responsible for an enhanced viral replication rate in human lung epithelial cells and airway tissues.
At present, the Delta and Omicron strains are the two most commonly circulating SARS-CoV-2 variants. Scientists stated that the Delta strain contains fewer mutations in the S protein compared to the Omicron variant.
With the continual emergence of new SARS-CoV-2 variants around the world, it is imperative to study them constantly to identify strains with altered characteristics. A better understanding of these mutations would help identify viral variants early, so that researchers can develop effective means to prevent their transmission.
Scientists observed the daily changes in the mutation dynamics of the SARS-CoV-2 proteome. They reported that not all protein sites mutate at the same rate in a population. The mutability of a protein site depends on intrinsic physicochemical parameters (e.g., residue composition, local stability, hydrophobicity, and surface accessibility), and it affects the virus’s transmission and survival.
A new study
In a new study, published in Computers in Biology and Medicine, scientists examined the mutational frequency of the SARS-CoV-2 proteome in relation to structural characteristics. They analyzed mutation information obtained from the “2019 Novel Coronavirus Resource”, covering 8,673 protein sites in the SARS-CoV-2 proteome that contain at least one mutation among 1,079,273 isolates. The study reported that sites with high and low mutation frequency could be distinguished on the basis of physicochemical parameters.
Initially, scientists examined the entire proteome to identify the physicochemical parameters that affect the mutability of protein sites. They considered the top 30% of high and low mutation sites, ranked by mutant isolate count (the selection threshold). A higher isolate count implied that the site mutated at an early phase of the COVID-19 pandemic and that the mutation was incorporated in all major SARS-CoV-2 variants.
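The selection step can be illustrated with a minimal Python sketch. It assumes that "top 30% of high and low mutation sites" means the 30% of sites with the highest mutant isolate counts and the 30% with the lowest; the function name, the site labels, and the counts are hypothetical and are not taken from the study.

```python
def select_extreme_sites(isolate_counts, threshold=0.30):
    """Split sites into high and low mutability groups.

    isolate_counts maps a site label to its mutant isolate count.
    threshold is the fraction of sites taken from each end of the ranking
    (an assumed reading of the article's "selection threshold").
    """
    # Rank sites from the highest to the lowest mutant isolate count.
    ranked = sorted(isolate_counts, key=isolate_counts.get, reverse=True)
    k = int(len(ranked) * threshold)
    high = ranked[:k]
    low = ranked[-k:] if k else []
    return high, low


# Made-up counts for ten hypothetical sites, for illustration only.
toy_counts = {
    "site_a": 900, "site_b": 850, "site_c": 700, "site_d": 400, "site_e": 300,
    "site_f": 120, "site_g": 60, "site_h": 9, "site_i": 4, "site_j": 1,
}
high_sites, low_sites = select_extreme_sites(toy_counts, threshold=0.30)
```

Lowering `threshold` keeps only the most extreme sites at each end, which leaves fewer sites but makes the two groups more clearly separated.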
Findings
To understand the role of sequence- and structure-based features in site mutability, researchers filtered the features primarily on the basis of low inter-property correlation and statistical significance. They categorized the features capable of differentiating between low and high mutability of protein sites into five classes.
In the SARS-CoV-2 proteome, scientists observed that residue type is one of the most important features for classifying the high and low mutability of protein sites. They stated that the percentage of bulky aromatic residues is considerably higher in low mutability sites. Additionally, positively charged residues are more prevalent at low mutability sites, in contrast to negatively charged residues, which are abundant at high mutability sites.
Scientists stated that the relative accessible surface area (rASA) feature also plays an important role in distinguishing sites with high and low mutation rates. For most of the larger proteins of the SARS-CoV-2 proteome, high mutability sites have higher rASA and low mutability sites have lower rASA. However, this is not the case for other proteins, such as E, M, ORF6, nsp4, and nsp8, which exhibit the opposite trend.
In the residue type analysis, bulky aromatic residues were associated with low mutability sites in the SARS-CoV-2 proteome. Scientists also calculated the local average stability around each mutation site. They observed that low mutability sites were more stable than high mutability sites. This study also revealed that residue conservation is directly associated with the mutability of amino acids.
Researchers reported that high mutability sites are strongly inclined to be replaced by other amino acids, whereas low mutability sites are prone to self-mutations and are relatively more conserved.
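A comparison of residue classes between the two groups of sites could be sketched as follows. The grouping used here (F, W, Y as aromatic; K, R as positively charged; D, E as negatively charged) is a common convention and an assumption on our part, since the article does not state how the authors assigned residues such as histidine. The residue strings are toy inputs, not data from the study.

```python
# Assumed residue classes (one-letter codes); histidine is left out here.
AROMATIC = set("FWY")
POSITIVE = set("KR")
NEGATIVE = set("DE")


def class_fractions(residues):
    """Return the fraction of residues that fall into each class."""
    n = len(residues)
    if n == 0:
        raise ValueError("no residues given")
    return {
        "aromatic": sum(r in AROMATIC for r in residues) / n,
        "positive": sum(r in POSITIVE for r in residues) / n,
        "negative": sum(r in NEGATIVE for r in residues) / n,
    }


# Toy residue strings standing in for the wild-type residues at each group of sites.
low_mutability = class_fractions("FWYKRAGL")
high_mutability = class_fractions("DEDESTAG")
```

With real data, the two dictionaries would be compared class by class to see which residue types are enriched in each group.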
In this study, scientists designed machine learning (ML) models that use physicochemical parameters. These models can categorize high and low mutation sites at different selection thresholds, ranging between 5 and 30%. The models’ accuracy ranged from 65% to 76.7%. The authors revealed that by lowering the selection threshold, or by increasing the confidence level of low and high mutability sites, the prediction performance of the models could be improved.
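The article does not say which ML algorithms the authors used. As a stand-in, the sketch below uses a simple nearest-centroid classifier to show what classifying sites as "high" or "low" from physicochemical features, and then measuring accuracy, can look like. The two features (rASA and a local stability score), all numeric values, and the function names are made up for illustration.

```python
from statistics import mean


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [mean(col) for col in zip(*rows)]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(features, high_centroid, low_centroid):
    """Label a site by whichever class centroid is closer."""
    d_high = squared_distance(features, high_centroid)
    d_low = squared_distance(features, low_centroid)
    return "high" if d_high < d_low else "low"


def accuracy(rows, labels, high_centroid, low_centroid):
    """Fraction of sites whose predicted label matches the true label."""
    hits = sum(
        predict(r, high_centroid, low_centroid) == y for r, y in zip(rows, labels)
    )
    return hits / len(rows)


# Toy training data: (rASA, local stability score) per site, values invented.
train_high = [(0.80, 0.20), (0.70, 0.30), (0.90, 0.10)]
train_low = [(0.20, 0.80), (0.10, 0.90), (0.30, 0.70)]
high_c = centroid(train_high)
low_c = centroid(train_low)

# Toy held-out sites; the last one sits near the class boundary.
test_rows = [(0.75, 0.25), (0.15, 0.85), (0.60, 0.50), (0.40, 0.45)]
test_labels = ["high", "low", "high", "high"]
acc = accuracy(test_rows, test_labels, high_c, low_c)
```

In this toy setup the borderline site is the one that gets misclassified, which mirrors the authors' observation that restricting the analysis to more extreme sites (a lower selection threshold) tends to improve prediction performance.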
Conclusion
The authors presented a better understanding of the mutability of the SARS-CoV-2 proteome based on intrinsic sequence-structure features. They advocated that the analysis could be used to detect variants of concern of SARS-CoV-2, as well as of other viruses that have the potential to cause pandemics.