In a recent study published in PNAS Nexus, researchers developed a risk evaluation model using machine learning to predict the future distribution trajectory of newly discovered severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants using genomic and epidemiological data.
Study: Predicting the spread of SARS-CoV-2 variants: An artificial intelligence-enabled early detection.
How are novel SARS-CoV-2 strains identified?
The United States Centers for Disease Control and Prevention (CDC) and World Health Organization (WHO) are monitoring the emergence of novel SARS-CoV-2 variants to inform pandemic preparedness. However, identifying the small proportion of mutations that cause a new wave remains difficult.
Academic researchers have developed various models to forecast the trajectory of the pandemic; however, none of these models has focused on variant-specific spread. Although the genetic evolution of SARS-CoV-2 strains is monitored, genetic characteristics have not been incorporated into current epidemiological models of the infection trajectory.
About the study
In the present study, researchers used an artificial intelligence (AI)-based approach to evaluate nine million SARS-CoV-2 genomic sequences from 30 countries and reveal temporal patterns of variants that produce large infection waves. The model used Pango lineage designations, sequences from the Global Initiative on Sharing All Influenza Data (GISAID), coronavirus disease 2019 (COVID-19) case counts, vaccination rates, and data on non-pharmaceutical interventions.
The analysis focused on 30 nations that reported the most SARS-CoV-2 genomic sequences in March 2022. These 30 nations account for nine million out of 9.5 million genomic sequences recorded in GISAID since the beginning of the pandemic.
By March 19, 2022, 1,151 unique variants had been consistently detected in the included nations, with a median of 72 variants identified in each country since the pandemic began. The technique is consistent with CDC and WHO wave classifications based on the variants responsible for infections.
Each new variant was distinguished by multiple alterations in SARS-CoV-2 proteins relative to the wild-type reference strain identified in Wuhan in early January 2020. The current study considered all types of changes in a genomic sequence, such as base substitutions, deletions, and insertions. The approach created a new distance measure between variants by combining the Jaccard distance metric with a variant-specific mutation ratio, computed by dividing the number of mutations unique to one variant by the number of mutations in another SARS-CoV-2 variant.
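To make the two ingredients of this distance measure concrete, the following is a minimal Python sketch that treats each variant as a set of mutation labels. The function names and the example mutation sets are hypothetical, and the article does not state how the study combines the two quantities, so they are computed separately here rather than merged into a single score.

```python
def jaccard_distance(mutations_a, mutations_b):
    """Jaccard distance: 1 minus (shared mutations / all mutations in either set)."""
    a, b = set(mutations_a), set(mutations_b)
    union = a | b
    if not union:
        return 0.0  # two empty mutation lists are treated as identical
    return 1.0 - len(a & b) / len(union)


def unique_mutation_ratio(mutations_a, mutations_b):
    """Mutations found only in A, divided by the number of mutations in B."""
    a, b = set(mutations_a), set(mutations_b)
    if not b:
        raise ValueError("the comparison variant must have at least one mutation")
    return len(a - b) / len(b)


# Hypothetical mutation lists, used only for illustration.
variant_x = {"S:D614G", "S:N501Y", "N:R203K"}
variant_y = {"S:D614G", "S:L452R", "N:R203K", "S:P681R"}

print(jaccard_distance(variant_x, variant_y))
print(unique_mutation_ratio(variant_x, variant_y))
```

A distance near 1 means the two variants share almost no mutations, whereas a distance of 0 means their mutation lists are identical.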
The researchers also provided two measures for characterizing variant diversity over time: variant entropy and heterogeneity. Variant entropy was motivated by the use of the thermodynamic concept of entropy in ecological systems, in which low- and high-entropy states are compared and correspond to the number of cocirculating variants.
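As an illustration of how an entropy-style diversity measure behaves, the sketch below computes the Shannon entropy of variant shares within a time window. This is an assumption made for illustration: the article does not give the exact formula used in the study, and the function name and example counts are hypothetical.

```python
import math


def variant_entropy(sequence_counts):
    """Shannon entropy (natural log) of variant shares in one time window.

    sequence_counts maps a variant name to the number of sequences observed.
    """
    total = sum(sequence_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in sequence_counts.values():
        if count > 0:
            share = count / total
            entropy -= share * math.log(share)
    return entropy


# Hypothetical counts: one dominant variant versus four equally common variants.
low_diversity = {"variant_a": 100}
high_diversity = {"variant_a": 25, "variant_b": 25, "variant_c": 25, "variant_d": 25}

print(variant_entropy(low_diversity))
print(variant_entropy(high_diversity))
```

Under this definition, entropy is zero when a single variant accounts for all sequences and grows as more variants cocirculate in similar proportions.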
The model aimed to detect SARS-CoV-2 variants that produced over 1,000 cases for every one million individuals within three months of their detection. The model incorporated 31 predictive factors that capture the genomic characteristics of novel variants, their early distribution trajectory, and the non-pharmaceutical interventions and vaccination initiatives in place during the period of variant transmission. These traits were used to estimate variant infectivity using machine learning.
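The prediction target described above can be written as a simple labeling rule. The sketch below is a hypothetical illustration of that rule only; the function names, parameter names, and example figures are assumptions and do not come from the study.

```python
def cases_per_million(cases, population):
    """Convert a raw case count into cases per one million individuals."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 1_000_000 / population


def is_wave_variant(cases_within_three_months, population, threshold=1000.0):
    """Label a variant as wave-causing if it exceeds the per-million threshold."""
    return cases_per_million(cases_within_three_months, population) > threshold


# Hypothetical example: a country of 10 million people.
print(is_wave_variant(12_000, 10_000_000))  # 1,200 cases per million
print(is_wave_variant(5_000, 10_000_000))   # 500 cases per million
```

Normalizing by population is what allows the same threshold to be applied across countries of very different sizes.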
Study findings
Risk scores were assigned to all SARS-CoV-2 variants and converted into binary predictions using thresholds chosen on the training datasets to balance model specificity and sensitivity. After one week of observation, the model can detect 73% of the variants that would trigger a COVID-19 wave of over 1,000 cases per million individuals in the following three months. With a two-week observation period, this performance rises to 80%.
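The step from risk scores to binary predictions can be sketched as follows. The selection rule used here, maximizing sensitivity plus specificity minus one (Youden's J statistic), is an assumption chosen for illustration; the article does not say which criterion the study used, and the scores and labels are made up.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are predicted positive."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if label and predicted:
            tp += 1
        elif label:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def best_threshold(scores, labels):
    """Pick the observed score that maximizes Youden's J on the training data."""
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        sensitivity, specificity = sensitivity_specificity(scores, labels, t)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t


# Hypothetical risk scores; label 1 marks a variant that later caused a wave.
train_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9]
train_labels = [0, 0, 0, 1, 0, 1, 1]

threshold = best_threshold(train_scores, train_labels)
print(threshold, sensitivity_specificity(train_scores, train_labels, threshold))
```

Sensitivity in this sketch corresponds to the share of wave-causing variants that are flagged, which is the quantity reported as 73% and 80% above.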
The out-of-sample area under the curve (AUC) values for the model were 86% for one-week predictions and 91% for two-week predictions. The top three dominant variants were generally responsible for most cases during the relevant wave and had a total share of 71% across all waves.
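For readers unfamiliar with AUC, it can be read as the probability that a randomly chosen wave-causing variant receives a higher risk score than a randomly chosen variant that does not cause a wave. The sketch below computes it directly from that definition on hypothetical held-out data; it is a generic illustration, not the study's evaluation code.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical held-out risk scores; label 1 marks a variant that caused a wave.
test_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9]
test_labels = [0, 0, 0, 1, 0, 1, 1]

print(roc_auc(test_scores, test_labels))
```

An AUC of 50% corresponds to random ranking and 100% to perfect separation, which gives context for the reported 86% and 91%.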
The spike, nucleocapsid (N), and non-structural proteins (NSPs) had the most mutations, with median numbers per variant in each nation of 10, three, and 14, respectively. With a median inter-wave distance of 0.9, the initial dominant variant in each wave contained highly distinct mutations compared to variants circulating in the preceding wave.
The waves were divided into three groups: Before-1 and Before-2, which ended before the nationwide vaccination campaign commenced; Transition, which began before the vaccination campaign but ended after it started; and After-1 and After-2, which began after the campaign started. Wave-entropy values increased by a small but statistically significant amount from Before-2 to Before-1 waves but remained comparable from Before-1 to Transition waves, with a median of 0.5.
Most variants, including those with the highest infectivity, cause only a small number of infections within two weeks of identification, with a median of 2.5 COVID-19 cases for every one million individuals. Furthermore, variants causing a similar number of infections in the first two weeks may have significantly different transmission trajectories after three months.
Conclusions
The study findings highlight the development of a prediction model based on nine million genomic sequences from 30 countries to anticipate the spread of novel SARS-CoV-2 variants. The model identified variants likely to cause a wave as early as one and two weeks after their detection, with AUC values of 86% and 91%, respectively.
These observations indicate that novel variants acquire mutations that allow them to reinfect previously immune individuals or target new population subsets. The improved prediction accuracy relative to standard models underscores the need to integrate genetic variables into more sensitive models.
Journal reference:
- Levi, R., Zerhouni, E. G., & Altuvia, S. (2024). Predicting the spread of SARS-CoV-2 variants: An artificial intelligence-enabled early detection. PNAS Nexus 3(1). doi:10.1093/pnasnexus/pgad424