In a recent study posted to the bioRxiv* preprint server, researchers used machine learning (ML) tools to identify animal coronaviruses (CoVs), both alpha and beta CoVs, that are not known to infect humans but may have the potential to bind human receptors.
*Important notice: bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.
Background
It remains challenging to predict which animal CoVs might infect humans because their full host range is unknown. For instance, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) originated in an animal host, most likely bats. After a host expansion event, an essential step in viral evolution, SARS-CoV-2 spilled over into humans. Thus, it is crucial to survey all alpha and beta CoVs that infect animals living near humans (e.g., farm animals, such as pigs), since such proximity facilitates zoonotic transmission.
Both alignment-based and alignment-free approaches have shown promise for viral host prediction, but each has a drawback: the former becomes inefficient as sequence lengths increase, while the latter does not account for the relative position of amino acid (AA) residues across the sequence.
About the study
In the present study, researchers developed a novel machine-learning model to predict binding between the spike (S) protein of alpha and beta CoVs and a human receptor, such as human dipeptidyl-peptidase 4 (hDPP4) or human angiotensin-converting enzyme 2 (hACE2).
To this end, they first downloaded 28,368 S protein sequences of all alpha and beta CoVs from the National Center for Biotechnology Information (NCBI) Virus database. They used a skip-gram model to convert these data into vectors that encoded the associations between neighboring k-mers, i.e., protein subsequences of length k. Next, a classifier used these vectors to score each protein sequence for its potential to bind a human receptor, a score referred to as the human-Binding Potential (h-BiP).
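The first step of this pipeline, turning a protein sequence into k-mers and pairing each k-mer with its neighbors, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the choice of overlapping k-mers, k = 3, a context window of 2, and the toy sequence are all assumptions made for the example. A skip-gram model would then be trained on such (center, context) pairs to learn one vector per k-mer.

```python
def kmers(sequence, k=3):
    """Return the overlapping k-mers of a protein sequence, in order.

    Overlapping k-mers and k=3 are assumptions made for illustration.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for tokens within `window` positions.

    These pairs are the training examples a skip-gram model learns from.
    The window size is an assumption, not a value taken from the study.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Toy amino acid string used only to exercise the functions.
toy_fragment = "MFVFLVLLPLVSSQ"
tokens = kmers(toy_fragment, k=3)
pairs = skipgram_pairs(tokens, window=2)

print(tokens[:3])   # first three k-mers
print(pairs[:3])    # first three (center, context) pairs
```

Because the k-mers are taken in sequence order, the pairs retain local positional context, which is the property that purely count-based alignment-free methods lack.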
The final alpha and beta CoV dataset, spanning all their clades and variants, comprised 2,534 AA sequences: 1,705 viruses with positive and 829 with negative annotations for human binding. The researchers split these 2,534 sequences into a training set (85%) and a test set (15%).
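A split of this kind can be sketched with the standard library alone. The example below assumes a stratified split, i.e., one that samples 15% of each class separately so that the positive-to-negative ratio is preserved; the article does not state whether the study stratified, and the function name and seed are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.15, seed=0):
    """Return (train_idx, test_idx), sampling `test_fraction` of each class.

    Stratification is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class sizes as reported: 1,705 positive and 829 negative annotations.
labels = [1] * 1705 + [0] * 829
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
```

With the reported class sizes, this yields 2,154 training and 380 test sequences, the latter made up of 256 positives and 124 negatives.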
Further, the researchers used a subset of 424 sequences to generate a phylogenetic tree for the S protein of alpha and beta CoVs. For molecular dynamics (MD) simulations, the team used starting receptor-binding domain (RBD) structures of LYRa3 and LYRa11 generated with AlphaFold. They used the MD package YASARA to simulate protein-protein interactions, substituting individual AA residues and searching for minimum-energy conformations of the resulting modified candidate structures. The team also ran an energy minimization (EM) routine on all modified candidate structures until the free energy stabilized to within 50 J/mol.
To compare the h-BiP score with sequence similarity, the team computed the pairwise percent sequence identity between each S protein sequence in the study dataset and all seven human CoVs, and retained the maximum for each sequence. Consistent with the high accuracy of the classifier, the h-BiP score correlated with this percent sequence identity against human viruses. Notably, all viruses with ≥97% identity to previously known human CoVs had an h-BiP score >0.5.
Moreover, the h-BiP score detected binding in cases of low sequence identity and discriminated between the binding potential of viruses with nearly the same sequence identity.
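The maximum percent identity against a set of reference viruses can be illustrated with a short sketch. It assumes the sequences have already been aligned to equal length with '-' as the gap character, and it counts identity over all columns that are not gaps in both sequences; the article does not specify the alignment tool or the exact identity definition the study used, and the sequences and names below are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of compared columns in which both sequences share a residue.

    Columns that are gaps in both sequences are skipped. This is one of
    several common identity definitions and is an assumption here.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def max_identity(query, references):
    """Return (name, identity) for the reference closest to `query`."""
    scored = ((name, percent_identity(query, ref))
              for name, ref in references.items())
    return max(scored, key=lambda item: item[1])


# Hypothetical pre-aligned toy sequences, not real spike proteins.
references = {"ref_A": "ACDEFGHIKL", "ref_B": "ACDEYGHI-L"}
query = "ACDEFGHIKV"
best = max_identity(query, references)
print(best)
```

Comparing this single number with a classifier score makes the limitation of identity visible: two sequences can have the same maximum identity while differing at exactly the residues that contact the receptor.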
Results and conclusion
The researchers identified LYRa3 and Bt133, two viruses whose human binding properties are as yet unknown but which had high h-BiP scores. In support, phylogenetic analysis revealed that these two viruses were related to non-human CoVs previously known to bind to human receptors. The receptor binding motif (RBM) within the receptor binding domain (RBD) of the S protein comes in direct contact with the host receptor. Multiple sequence alignment of the RBMs of Bt133 and LYRa3 with those of related viruses showed that they conserve contact residues that interact with the human receptor(s).
For instance, Bt133 had conserved all eight contact residues that Tylonycteris bat CoV HKU4 (Ty-HKU4) uses to bind hDPP4, despite having 13 RBD mutations. Similarly, LYRa3, phylogenetically related to SARS-CoV Tor2, had conserved 12 of the 17 contact residues that bind hACE2. Moreover, except for residue 441, it had identical sequences at the RBD. MD simulations of the RBD further validated this binding and identified contact residues that bound human receptors.
Finally, the researchers tested whether the model could detect host expansion events. They emulated the conditions before the advent of SARS-CoV-2 by removing all SARS-CoV-2 S protein sequences from the training set. The re-trained ML model successfully predicted binding between a human receptor and the wild-type SARS-CoV-2 S protein, with an h-BiP score of 0.96. Overall, the proposed ML-based method could prove a valuable tool for identifying, from a vast pool of animal CoVs, viruses that could cross the species barrier to infect humans.
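The hold-out experiment described above amounts to filtering one lineage out of the training data before re-training and then scoring the withheld sequences. The sketch below shows only that data-handling step; the record layout, the lineage labels, and the 0.5 decision threshold are assumptions made for illustration, and no model is trained here.

```python
def hold_out(records, lineage):
    """Split records into (training, held_out) by a lineage label.

    `records` is a list of dicts with a hypothetical "lineage" key.
    """
    training = [r for r in records if r["lineage"] != lineage]
    held_out = [r for r in records if r["lineage"] == lineage]
    return training, held_out


def predicts_binding(score, threshold=0.5):
    """Interpret a score above `threshold` as predicted human binding.

    The 0.5 cutoff is an assumption for this sketch.
    """
    return score > threshold


# Hypothetical records; sequences are elided placeholders.
records = [
    {"id": "v1", "lineage": "SARS-CoV-2", "sequence": "..."},
    {"id": "v2", "lineage": "SARS-CoV", "sequence": "..."},
    {"id": "v3", "lineage": "MERS-CoV", "sequence": "..."},
    {"id": "v4", "lineage": "SARS-CoV-2", "sequence": "..."},
]
training, held_out = hold_out(records, "SARS-CoV-2")

# The article reports an h-BiP score of 0.96 for wild-type SARS-CoV-2.
print(len(training), len(held_out), predicts_binding(0.96))
```

Keeping the withheld lineage entirely out of training is what makes the resulting score a test of generalization rather than of memorization.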