In a recent study posted to the Research Square* preprint server, researchers used a machine-learning (ML)-based model to identify non-human coronaviruses (CoVs) that might cause human infections.
*Important notice: Research Square publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.
Background
Experimental data are considered ideal for determining whether a virus can infect a given host; however, the entire host range of viruses is unknown. Several in silico methods have been used to estimate viral hosts, including ML-based ones. Alignment-free methods are reportedly preferable for large datasets comprising recombined virus sequences; however, such methods do not consider the relative location of contact residues in the sequences.
Studies have reported signals shared among different viral families that can be used for host estimation; however, they included a limited number of viral taxa and did not consider distinctive virological characteristics, which prevents mechanistic studies of the pathways of host range expansion.
About the study
In the present study, researchers studied the infectivity of α-CoVs and β-CoVs in humans by estimating binding interactions between the CoV S (spike) protein and human (host) receptors.
An ML-based model was built with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) S protein sequences to estimate viral binding with host receptors. Skip-gram modeling was performed using artificial neural networks to convert the sequence data into vectors that encode associations between the adjacent protein sequences. The vectors were then converted, using a logistic regression classifier, to human-binding potential (h-BiP) scores for the binding interactions between the protein sequences and host receptors.
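The two steps described above can be loosely illustrated as follows. This is a minimal sketch, not the preprint's implementation: splitting sequences into overlapping k-mers, the k-mer size, the window size, the toy sequence fragment, and the function names are all assumptions made for illustration. It lists the (center, context) pairs a skip-gram model would learn from and shows how a logistic function maps a vector to a score between 0 and 1.

```python
import math

def kmers(seq, k=3):
    """Split a protein sequence into overlapping k-mers (k=3 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def skipgram_pairs(tokens, window=2):
    """List (center, context) pairs for tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def logistic_score(vector, weights, bias=0.0):
    """Map a vector to a 0..1 score, as a fitted logistic regression would."""
    z = sum(v * w for v, w in zip(vector, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy fragment for illustration only, not a real spike protein sequence.
tokens = kmers("MFVFLV")
pairs = skipgram_pairs(tokens, window=1)
```

In a real pipeline, the pairs would be used to train the embedding network, and the weights passed to `logistic_score` would come from fitting the classifier on labeled sequences rather than being supplied by hand.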
The model incorporated 2,534 distinct α-CoV and β-CoV S sequences. Phylogenetic and multiple sequence alignment (MSA) analyses were performed. Molecular dynamics (MD) simulations of the S RBD (receptor-binding domain) were prepared to evaluate virus-host receptor binding. The model was retrained to investigate its application in host range expansion surveillance for viruses that emerged before SARS-CoV-2. Pre-coronavirus disease 2019 (COVID-19) conditions were emulated by excluding SARS-CoV-2 sequences from the retrained model dataset.
The novel (re-trained) dataset comprised 1,369 CoVs, of which 540 showed human receptor binding. The training and testing datasets comprised human CoV (hCoV)-OC43, hCoV-NL63, hCoV-HKU1, the Middle East respiratory syndrome CoV (MERS-CoV), SARS-CoV-1, hCoV-229E, other MERS-associated viruses, other sarbecoviruses, other α-CoVs, other β-CoVs, and the porcine epidemic diarrhea virus. Viruses with h-BiP scores of ≥0.5 were categorized as likely to show human receptor binding.
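The 0.5 cutoff amounts to a simple thresholding step, sketched below. The virus names and scores are made-up placeholders, not values from the preprint, and the function name is hypothetical.

```python
H_BIP_THRESHOLD = 0.5  # cutoff stated in the article

def likely_binds_human_receptor(score, threshold=H_BIP_THRESHOLD):
    """Return True when a score meets or exceeds the cutoff."""
    return score >= threshold

# Placeholder names and scores for illustration only.
scores = {"virus_A": 0.93, "virus_B": 0.50, "virus_C": 0.12}
flagged = sorted(
    name for name, score in scores.items()
    if likely_binds_human_receptor(score)
)
```

Note that a score exactly at the cutoff is flagged, matching the "≥0.5" rule.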
Results
The ML model produced h-BiP scores, based on S protein-host receptor binding, that accurately reflected the binding potential of human CoVs. The team identified two viruses, Bat CoV BtCoV/133/2005 (Bt133, a MERS-associated virus) and Rhinolophus affinis CoV isolate LYRa3 (a SARS-associated virus), that showed elevated h-BiP scores and whose host receptor binding characteristics were previously unknown.
The findings indicated that the Bt133 virus and the LYRa3 virus were related to non-human viruses with known host receptor binding. The high sequence identity (97.0%) between Bt133 S and Ty-HKU4 S indicated that the Bt133 virus binds with the human dipeptidyl peptidase 4 (hDPP4) receptor. Likewise, the 99% S protein sequence identity between the LYRa3 virus and the LYRa11 virus indicated that the LYRa3 virus binds with the human angiotensin-converting enzyme 2 (hACE2) receptor.
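For readers unfamiliar with the metric, percent sequence identity between two aligned sequences can be computed roughly as shown below. This is a sketch under one common convention (columns where both sequences have a gap are skipped, and a gap opposite a residue counts as a mismatch); the preprint may define identity differently, and the sequences here are toy strings.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of compared alignment columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue  # skip columns that are gaps in both sequences
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared

# Toy aligned strings, not real spike protein sequences.
identity = percent_identity("ACDEFGHIKL", "ACDEFGHIKV")
```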
MSA analysis of the RBMs (receptor-binding motifs) of the Bt133 virus and the LYRa3 virus with related viruses indicated that LYRa3 and Bt133 conserve the residues that contact host receptors. MD simulations validated the findings and identified the residues in contact with the host receptors. For Bt133, binding interactions were observed between the E518 residue in the S RBD and the Q344 residue in the hDPP4 receptor, and between the N514 residue in the S RBD and the R317 residue in the hDPP4 receptor. In addition, the Q515 contact residue was detected in >70.0% of Bt133 simulations. Contact residues for LYRa3 included T490, G492, Y485, and G486, detected in ≥45.0% of MD simulations.
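Percentages such as "detected in >70.0% of simulations" can be read as contact frequencies across independent MD runs. The sketch below shows one way such a tally might be computed, assuming each run has already been reduced to the set of residues found in contact with the receptor; the residue labels and run data are placeholders, and how the preprint defined a contact is not stated here.

```python
from collections import Counter

def contact_frequency(simulations):
    """Return the percentage of simulations in which each residue is in contact."""
    counts = Counter()
    for contacts in simulations:
        counts.update(set(contacts))  # count each residue once per simulation
    total = len(simulations)
    return {residue: 100.0 * n / total for residue, n in counts.items()}

def frequent_contacts(simulations, min_percent):
    """List residues in contact in at least `min_percent` of simulations."""
    freq = contact_frequency(simulations)
    return sorted(r for r, pct in freq.items() if pct >= min_percent)

# Placeholder residue labels and runs for illustration only.
toy_runs = [
    {"res_A", "res_B", "res_C"},
    {"res_A", "res_C"},
    {"res_A", "res_B", "res_C"},
    {"res_A"},
]
frequencies = contact_frequency(toy_runs)
```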
In total, 16 viruses with no known host binding showed h-BiP scores of ≥0.5, indicating that the viruses might bind to receptors on human cells. Of these, 14 viruses were associated with MERS and originated from dromedary camels in Africa. Previous studies have indicated that MERS-CoV-associated viruses from camels in Africa might cause human infections.
The model classifier showed 99.5% accuracy, 99.6% sensitivity, and 98% specificity, and the h-BiP scores correlated well with the sequence identity findings. The h-BiP scores also discriminated between viruses with the same sequence identity, and the model detected host receptor binding in cases of low percent sequence identity. Retraining the model yielded similar accuracy.
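Accuracy, sensitivity, and specificity follow from the counts of true and false positives and negatives. The sketch below uses the standard definitions with made-up counts; the preprint's actual confusion matrix is not given in this article.

```python
def classifier_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, and specificity from confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "sensitivity": tp / (tp + fn),   # share of true binders that are detected
        "specificity": tn / (tn + fp),   # share of non-binders correctly rejected
    }

# Made-up counts for illustration only.
metrics = classifier_metrics(tp=90, fp=20, tn=80, fn=10)
```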
Overall, the study findings showed that the h-BiP score-based ML approach is an accurate method of estimating the human receptor-binding capability of CoVs, underscoring that ML models could be used to predict host expansion events.