The coronavirus responsible for the COVID-19 pandemic has spread across the globe with unprecedented speed and lethality, killing hundreds of thousands of people and forcing countries’ entire populations to self-quarantine.
The virus, technically termed severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is believed to be zoonotic, but its origin is still in doubt. A new paper reports the use of artificial intelligence (AI) to solve the puzzle of the virus’s origin. The paper was published on the preprint server bioRxiv in May 2020.
AI has been widely employed during the pandemic, with its uses ranging from rapid diagnostics to contact tracing to drug simulation. Its ability to rapidly compare, classify, and relate data has made it an invaluable tool. In fact, the researchers think AI may yet provide the key to developing a vaccine against the virus.
Using AI-aided cluster analysis to track SARS-CoV-2 origin
To find the origin of the virus, the team decided to compare its genome with those of preexisting organisms. They downloaded 334 complete genome sequences of the virus from the GenBank database, using samples taken across the world - 258 from the United States, 49 from China, and the remaining 27 from other countries. For each set, they used the first released complete mapping of the virus’s sequence from each country.
They also selected reference genomic sequences, such as those from alpha and beta coronaviruses, from GenBank and Virus-Host DB. Genome sequences of the Guangxi and Guangdong pangolin coronaviruses were downloaded from the GISAID database.
Study: Origin of Novel Coronavirus (COVID-19): A Computational Biology Study using Artificial Intelligence. Image Credit: 2630ben / Shutterstock
This news article was a review of a preliminary scientific report that had not undergone peer-review at the time of publication. Since its initial publication, the scientific report has now been peer reviewed and accepted for publication in a Scientific Journal. Links to the preliminary and peer-reviewed reports are available in the Sources section at the bottom of this article.
Altogether, three sets of reference genomes were selected at various taxonomic levels for use in a supervised decision tree method that has been recommended for the classification of novel pathogens. The method works down through the levels of classification from high to low, looking for the right slot for the SARS-CoV-2 genome at the genus and lower levels, along with its closest relatives.
The reference genomes at each taxonomic level were fed to the AI, along with the viral genome sequences.
The AI-based analysis was then carried out by unsupervised clustering methods, using a hierarchical clustering algorithm along with density-based spatial clustering of applications with noise (DBSCAN). Two steps are involved: first, using the algorithms to cluster the reference sequences alone, and then using the same parameter values to cluster a mix of both reference and SARS-CoV-2 genome sequences.
In other words, the method first shows the reference sequences with which the SARS-CoV-2 sequences group. Secondly, the settings are changed to observe corresponding changes in the groups formed. This helps pick out the nearest sequences so that the similarities between genomes can be compared.
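The two-step idea can be sketched in a few lines of Python. This is a minimal illustration, not the study's pipeline: the toy sequences, the 2-mer frequency profiles, the Euclidean distance, the single-linkage rule, and the `CUTOFF` value are all assumptions made for the example, since the article does not describe how the authors measured distances between genomes.

```python
import math
from collections import Counter
from itertools import combinations


def kmer_profile(seq, k=2):
    """Normalized k-mer frequencies of a sequence, stored as a dict."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def distance(p, q):
    """Euclidean distance between two sparse k-mer profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))


def cluster(profiles, cutoff):
    """Single-linkage clusters: names chained by distances <= cutoff.

    This matches cutting a single-linkage dendrogram at `cutoff`.
    Returns a sorted list of sorted name lists.
    """
    parent = {name: name for name in profiles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(profiles, 2):
        if distance(profiles[a], profiles[b]) <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for name in profiles:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Toy sequences invented for illustration only (not real viral genomes).
references = {
    "ref_A1": "ATATATATTATAATATATAT",
    "ref_A2": "ATATTATATATAATATATTA",
    "ref_B1": "GCGCGGCGCGCCGCGCGGCG",
    "ref_B2": "GCGGCGCGCGCGCCGCGCGC",
}
query_seq = "ATATATTATATATAATATAT"
CUTOFF = 0.3  # hypothetical value, chosen by hand for the toy data

# Step 1: cluster the reference sequences alone.
ref_profiles = {name: kmer_profile(s) for name, s in references.items()}
step1 = cluster(ref_profiles, CUTOFF)

# Step 2: reuse the same cut-off on references plus the query sequence.
all_profiles = dict(ref_profiles)
all_profiles["query"] = kmer_profile(query_seq)
step2 = cluster(all_profiles, CUTOFF)
query_group = next(g for g in step2 if "query" in g)

print("references alone:", step1)
print("query groups with:", [n for n in query_group if n != "query"])
```

Keeping the cut-off fixed between the two steps is what makes the result interpretable: the reference clusters are already known, so the cluster that the query joins indicates which reference group it most resembles.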
What Did the Study Find?
By progressively narrowing the search parameters, the team moved from high to low levels of taxonomic classification. Beginning with the first reference set, which comprises viruses from 12 major classes at the highest level, the team found that the virus belonged to the Riboviria cluster, represented by the MERS virus (responsible for the MERS outbreak in 2012). Based on this data, they concluded that the coronavirus probably belonged to the Riboviria realm.
At the next level, they analyzed the clustering of SARS-CoV-2 against 12 virus families within the Riboviria. The results show that the viral genome groups with the Coronaviridae family. This family has four genera - Alphacoronavirus, Betacoronavirus, Gammacoronavirus, and Deltacoronavirus. SARS-CoV-2 belongs to the Betacoronavirus genus.
Within this genus, among 37 reference sequences, SARS-CoV-2 clusters with the Sarbecovirus sub-genus. This contains mostly SARS coronaviruses and bat coronaviruses, but also five Guangxi pangolin sequences and one Guangdong pangolin sequence.
Interestingly, the study found that the amount of variation in the genetic code of the 334 samples, as compared to the reference samples, was practically constant for all the samples, which were collected across sixteen countries over a time period of three months.
With narrower cut-off parameters, SARS-CoV-2 continued to cluster with Sarbecovirus, even as this cluster itself separated into two. At a very low cut-off, SARS-CoV-2 clustered with only two viruses based on whole-genome analysis - bat CoV RaTG13 and Guangdong pangolin CoV.
On narrowing the search still further, the AI found only one virus that it could group with SARS-CoV-2 - the bat CoV RaTG13. This could mean that bats are the most likely reservoir host of SARS-CoV-2.
Greater horseshoe bat (Rhinolophus ferrumequinum). Image Credit: ATTILA Barsan / Shutterstock
However, with a still lower cut-off, the AI did not group the virus with any other organism. Does this mean that the virus could originate from neither bats nor pangolins?
The study says this is a “debatable question” because the difference between the genome sequences of SARS-CoV-2 and bat CoV RaTG13 (or Guangdong pangolin CoV, for that matter) is smaller than that between, for instance, bat coronaviruses originating from the same host.
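The effect of lowering the cut-off step by step can be illustrated with a short Python sketch. Everything in it is hypothetical: the names `near_1`, `near_2`, and `far_1` are placeholders, and the pairwise distances are invented numbers rather than values from the study. It only shows why a query sequence groups with fewer and fewer relatives as the cut-off shrinks, and eventually with none.

```python
def cluster_of(query, names, dist, cutoff):
    """Names reachable from `query` through links no longer than `cutoff`.

    This is the query's single-linkage cluster at that cut-off,
    returned without the query itself.
    """
    seen, stack = {query}, [query]
    while stack:
        current = stack.pop()
        for other in names:
            if other not in seen and dist[frozenset((current, other))] <= cutoff:
                seen.add(other)
                stack.append(other)
    return sorted(seen - {query})


# Invented pairwise distances for illustration only (not study data).
names = ["query", "near_1", "near_2", "far_1"]
dist = {
    frozenset(("query", "near_1")): 0.05,
    frozenset(("query", "near_2")): 0.12,
    frozenset(("near_1", "near_2")): 0.15,
    frozenset(("query", "far_1")): 0.60,
    frozenset(("near_1", "far_1")): 0.58,
    frozenset(("near_2", "far_1")): 0.55,
}

# Sweep from a loose cut-off to a very strict one.
sweep = {c: cluster_of("query", names, dist, c) for c in (0.7, 0.2, 0.1, 0.01)}
for cutoff, members in sweep.items():
    print(f"cut-off {cutoff}: query groups with {members}")
```

An empty group at the strictest cut-off does not by itself mean the query is unrelated to everything else. It only means no sequence lies within that distance, which is why the comparison with distances between viruses from the same host matters.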
SARS-CoV-2 Probably from Bat or Pangolin CoV
They conclude, “Therefore, SARS-CoV-2 is deemed very likely originated from the same host with bat CoV RaTG13 or Guangdong pangolin CoV, which is bat or pangolin, respectively.” The study showcases the ability of AI to make sense of large volumes of data to pick out meaningful and useful patterns. It raises hopes that the same power can be harnessed to develop an effective vaccine against SARS-CoV-2.
Journal references:
- Preliminary scientific report.
Nguyen, T. T. et al. (2020). Origin of Novel Coronavirus (COVID-19): A Computational Biology Study using Artificial Intelligence. bioRxiv preprint. doi: https://doi.org/10.1101/2020.05.12.091397. https://www.biorxiv.org/content/10.1101/2020.05.12.091397v1
- Peer reviewed and published scientific report.
Nguyen, Thanh Thi, Mohamed Abdelrazek, Dung Tien Nguyen, Sunil Aryal, Duc Thanh Nguyen, Sandeep Reddy, Quoc Viet Hung Nguyen, et al. 2022. “Origin of Novel Coronavirus Causing COVID-19: A Computational Biology Study Using Artificial Intelligence.” Machine Learning with Applications, May, 100328. https://doi.org/10.1016/j.mlwa.2022.100328. https://www.sciencedirect.com/science/article/pii/S266682702200041X
Article Revisions
- Mar 20 2023 - The preprint preliminary research paper that this article was based upon was accepted for publication in a peer-reviewed Scientific Journal. This article was edited accordingly to include a link to the final peer-reviewed paper, now shown in the sources section.