In a recent study posted to the medRxiv* preprint server, researchers assessed human genome variants related to the susceptibility and severity of coronavirus disease 2019 (COVID-19).
Studies have reported that the genomic susceptibility of the host can increase the risk of severe SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) infections. Many studies have been conducted on host genetics for COVID-19 susceptibility; however, data on COVID-19-related variants are limited, and a database of variants stratified by confidence levels is lacking. In addition, computational tools to predict severe COVID-19-associated variants are currently unavailable.
Study: A comprehensive knowledgebase of known and predicted human genetic variants associated with COVID-19 susceptibility and severity. Image Credit: Orpheus FX / Shutterstock
*Important notice: medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.
About the study
In the present study, researchers explored the host genetic factors underlying susceptibility to SARS-CoV-2 infection and the severity of COVID-19.
The biological functions of SARS-CoV-2 infection susceptibility/severity genes were explored using gene enrichment, feature importance, network, and pathway analyses. In addition, the team conducted phenome-wide association studies (PheWAS) on 39,386 individuals genotyped by the Mount Sinai BioMe BioBank to evaluate the pleiotropic effects of SARS-CoV-2 infection-associated variants and identify physiological similarities between COVID-19 and associated disorders.
A machine learning-based classifier was developed to predict severe COVID-19-associated variants among 82,468,698 human genomic missense variants. Further, a website of SARS-CoV-2 infection-associated host genomic variants was created for searching, submitting, and downloading COVID-19 susceptibility-associated genetic variants. The classifier's predictions were interpreted using SHAP (SHapley Additive exPlanations, a Shapley-value-based explanation method) and feature importance analysis.
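To give a sense of what a Shapley-value-based explanation computes, the sketch below calculates exact Shapley values for a toy scoring function. It is not the study's classifier: the feature names, the weights, and the `toy_score` function are hypothetical, and real SHAP tooling approximates these values rather than enumerating every feature subset.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley value of each feature for a set-valued scoring function."""
    n = len(features)
    result = {}
    for feature in features:
        others = [f for f in features if f != feature]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal gain from adding the feature to the coalition.
                gain = value_fn(coalition | {feature}) - value_fn(coalition)
                total += weight * gain
        result[feature] = total
    return result


# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"conservation": 0.5, "maf": 0.3, "disorder": 0.1}


def toy_score(present):
    """Toy model: additive weights plus a bonus when two features co-occur."""
    score = sum(WEIGHTS[f] for f in present)
    if "conservation" in present and "maf" in present:
        score += 0.2  # hypothetical interaction term
    return score


contributions = shapley_values(toy_score, list(WEIGHTS))
print(contributions)
```

The interaction bonus is split evenly between the two features that produce it, and the contributions sum to the score of the full feature set, which is the property that makes Shapley values useful for attributing a prediction to individual features.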
COVID-19-associated genetic variants were grouped into four categories: (i) mild or asymptomatic SARS-CoV-2-associated variants; (ii) variants that could elevate symptomatic COVID-19 risks; (iii) known severe COVID-19-associated variants, e.g., those associated with critical COVID-19-associated pneumonia and ICU (intensive care unit) admissions; and (iv) variants involved in structural destabilization of proteins related to SARS-CoV-2 infection susceptibility.
Based on confidence levels, the variants were further classified as: (i) CAV (COVID-19-associated variants), (ii) CAV-FE (CAV with functional evidence), (iii) Allele frequency-FCP (COVID-19 prevalence correlation), (iv) IP (in silico prediction), and (v) Allele frequency-FCP + IP. CAV and CAV-FE category variants were identified through candidate gene approaches and association studies. In addition, the team identified FCP variants from studies investigating the association between the frequency of probable COVID-19-associated variants and the prevalence of SARS-CoV-2 infections in several populations.
IP category deleterious variants were identified from studies that exclusively used in silico approaches to estimate the effects of amino acid exchanges on susceptibility to SARS-CoV-2 infection. CAV-FE variants and known disease-causing pathological mutations from the HGMD (Human Gene Mutation Database) were used to create the machine-learning classifier of severe COVID-19-related variants. Further, PPI (protein-protein interaction) networks, biological functions, and diseases significantly enriched for high-confidence COVID-19 genes were evaluated. Finally, LD (linkage disequilibrium)-based clustering was performed to identify COVID-19-associated variants.
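The idea behind LD-based clustering can be sketched as follows: variants that are strongly correlated with one another are grouped, so that each group counts as a single independent signal. The article does not state which LD measure or threshold the authors used; the r² statistic, the 0.8 cutoff, the greedy grouping strategy, and the variant names below are all assumptions made for illustration.

```python
def ld_r2(hap_a, hap_b):
    """r^2 between two biallelic variants, given 0/1 alleles on phased haplotypes."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # monomorphic variant: LD is undefined, treat as unlinked
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / denom


def ld_clusters(haplotypes, threshold=0.8):
    """Greedy grouping: a variant joins the first cluster whose lead variant tags it."""
    clusters = []
    for name, hap in haplotypes.items():
        for cluster in clusters:
            lead = cluster[0]
            if ld_r2(haplotypes[lead], hap) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])  # no existing cluster tags this variant
    return clusters


# Hypothetical variants, each genotyped on eight phased haplotypes.
example = {
    "variant_1": [0, 0, 1, 1, 0, 1, 0, 1],
    "variant_2": [0, 0, 1, 1, 0, 1, 0, 1],
    "variant_3": [0, 1, 0, 1, 0, 1, 0, 1],
}
clusters = ld_clusters(example)
print(clusters)
```

Here the first two variants are in perfect LD and collapse into one cluster, while the third is only weakly correlated with them and forms its own.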
Results
Text mining yielded 1,977 relevant publications and 222 eligible studies, from which 820 COVID-19-associated host genetic variants reported to affect COVID-19 susceptibility were obtained, 719 of which were present in 295 genes, and 101 were present in intergenic sites. By confidence evaluation, 196 high-confidence variants were obtained. Conservation scores, MAF (minor allele frequency), SNVs (single nucleotide variants), and genome-level evolutionary pressures showed the most significant impacts on COVID-19 susceptibility/severity variant estimation.
Genes with high-confidence COVID-19 susceptibility variants shared networks, pathways, biological functions, and diseases, with the infectious disease and immunological system categories showing the highest significance. Pre-existing thromboembolism and chronic hepatic disease could elevate COVID-19 severity risks.
Compared to pathogenic variants not associated with COVID-19, CAV-FE variants were observed at significantly less conserved sites, with MAF > 0.1 variants within 100 to 1,000 base pairs. They also showed lower de novo mutational excess rates, lower indispensability scores, and lower H3K36me3 levels, and were less likely to be associated with a disordered protein segment.
In total, 117 pathways were significantly over-represented, among which the pathways for IFN-α/β (interferon-alpha/beta) signaling, toll-like receptor 4 (TLR4) signaling, and TBK1 (TANK-binding kinase 1)/IKK (IκB kinase) epsilon-mediated activation of interferon regulatory transcription factor (IRF)3/IRF7 were the most significantly over-represented. The pathways of hypercytokinemia/hyperchemokinemia in influenza pathogenesis, coronavirus pathogenesis, neuroinflammation signaling, and pathogen-induced cytokine storm signaling were the most significant.
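Pathway over-representation is commonly assessed with a hypergeometric (one-sided Fisher) test, which asks how likely it is to see at least the observed overlap between a gene list and a pathway by chance. The article does not specify the test or tool the authors used, so the following is a minimal sketch under that assumption; the background size, pathway size, and overlap are hypothetical, and only the 295-gene list size comes from the study.

```python
from math import comb


def enrichment_p_value(background_size, pathway_size, hit_list_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap)."""
    max_overlap = min(pathway_size, hit_list_size)
    # Number of ways to draw the gene list from the background.
    total = comb(background_size, hit_list_size)
    # Number of draws sharing at least `overlap` genes with the pathway.
    tail = sum(
        comb(pathway_size, k)
        * comb(background_size - pathway_size, hit_list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical inputs: 20,000 background genes, a 100-gene pathway,
# and 12 of the 295 study genes falling in that pathway.
p_value = enrichment_p_value(20_000, 100, 295, 12)
print(f"p = {p_value:.3g}")
```

In practice, a p-value is computed for every pathway tested and then corrected for multiple testing before a pathway is reported as significantly over-represented.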
The most significantly enriched human phenotype ontology (HPO) term was ‘recurrent viral infections.’ LD-based analysis showed that 285, 286, and 288 variants were independently associated with COVID-19 among African Americans, European Americans, and Hispanic Americans across 458, 466, and 629 phenotypes, respectively.
Overall, the study presented a comprehensive knowledge base of SARS-CoV-2 infection-related human genomics, together with a machine learning-based classifier and precomputed predictions for host genomic missense variants based on gene-, variant-, network-, and protein-level features.