New machine learning models that incorporate clinical severity data collected over time show that patient demographics and coexisting health conditions are key predictors of COVID-19 outcome.
Temporal dynamics of COVID-19 infection and the urgency of representative sampling
As of mid-July 2021, SARS-CoV-2 has infected over 187 million people and caused more than 4.03 million deaths worldwide. In response, the COVID-19 pandemic has prompted the global scale implementation of preventative measures and vaccination programs that have been developed over the course of the past year.
Nonetheless, insights into the temporal and clinical dynamics during the pandemic remain limited. In the USA, a key concern is the gathering of representative clinical datasets required by US practitioners, scientists, health care systems, and policymakers, which are used to inform critical strategies related to predictive and diagnostic measures.
To address this urgent need, an extensive collaborative effort has formed, led by Tellen D. Bennett, MD, from the Section of Informatics and Data Science at the Department of Pediatrics, University of Colorado School of Medicine, USA, working alongside numerous colleagues. The research was published in JAMA Network Open.
The research team has come together to form the National COVID Cohort Collaborative (N3C) to accelerate the understanding of SARS-CoV-2 and to develop new approaches for collaborative data sharing and analytics during the pandemic.
This collaboration is made up of scientists from the National Institutes of Health Clinical and Translational Science Awards Program and its Center for Data to Health, the IDeA Centers for Translational Research, the National Patient-Centered Clinical Research Network, the Observational Health Data Sciences and Informatics network, TriNetX, and the Accrual to Clinical Trials network.
The researchers published the first outcome of the collaboration in JAMA Network Open. The report provides detailed clinical descriptions of the largest cohort of US COVID-19 cases to date, including data from racially, ethnically, and geographically diverse patients.
The team used data from 1 926 526 US adults with SARS-CoV-2 infection (confirmed by polymerase chain reaction in >99% of cases or by antigen test in <1%), as well as adult patients without SARS-CoV-2 infection who served as controls, from 34 medical centers nationwide between January 1, 2020, and December 7, 2020.
First, these data provided key insights into how infection outcomes varied during 2020, showing that mortality among hospitalized patients decreased from 16.4% to 8.6% over the course of the year.
This decline was consistent across demographic and ethnic groups. The researchers then used the clinical data from these cases to identify specific factors that predict severity across patients.
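To put the two mortality figures quoted above in context, the following minimal sketch computes the absolute and relative change between them. Only the values 16.4% and 8.6% come from the article; the function name `relative_reduction` is a hypothetical helper introduced for illustration.

```python
def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Return the relative reduction, in percent, from before_pct to after_pct."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    return (before_pct - after_pct) / before_pct * 100.0


# Mortality figures quoted in the article (percent).
early_mortality = 16.4
late_mortality = 8.6

# Absolute change is in percentage points; relative change is a percentage
# of the starting value.
absolute_drop = early_mortality - late_mortality
relative_drop = relative_reduction(early_mortality, late_mortality)

print(f"Absolute drop: {absolute_drop:.1f} percentage points")
print(f"Relative drop: {relative_drop:.1f}%")
```

In other words, the later mortality figure is a little more than half the earlier one.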
Independent factors predictive of higher clinical severity following COVID-19 infection
Researchers used a series of random forest and XGBoost models to predict severe clinical course (death, discharge to hospice, invasive ventilatory support, or extracorporeal membrane oxygenation) in association with ethnic and demographic factors.
This analysis encompassed 174 568 adults who tested positive for SARS-CoV-2 and, for comparison, a control cohort of 1 133 848 adults who tested negative.
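The "severe clinical course" outcome described above is a composite: a patient counts as severe if any one of the four listed events occurred. A minimal sketch of how such a label could be derived before model training is shown below. The `Encounter` record and its field names are hypothetical and are not taken from the study's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Encounter:
    """Hypothetical per-patient record; field names are illustrative only."""
    died: bool = False
    discharged_to_hospice: bool = False
    invasive_ventilation: bool = False
    ecmo: bool = False  # extracorporeal membrane oxygenation


def severe_clinical_course(enc: Encounter) -> bool:
    """Return True if any component of the composite outcome occurred."""
    return any(
        (enc.died, enc.discharged_to_hospice, enc.invasive_ventilation, enc.ecmo)
    )


# Three made-up encounters, used only to exercise the function.
encounters = [
    Encounter(),
    Encounter(invasive_ventilation=True),
    Encounter(died=True, ecmo=True),
]
labels = [severe_clinical_course(e) for e in encounters]
severe_rate = sum(labels) / len(labels)
print(labels, f"{severe_rate:.2f}")
```

A binary label of this kind is what a classifier such as a random forest or XGBoost model would be trained to predict from demographic and clinical features.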
The machine learning models identified several key predictors of COVID-19 outcome. Specifically, the factors, listed from highest to lowest association with severity, included age, male sex, liver disease, dementia, African American and Asian race, and obesity. These factors were all independently associated with higher clinical severity following COVID-19 infection.
Due to the extensive sample size, prolonged temporal scale, and diversity in categorical variables, the findings presented are reliable and consistent.
However, several limitations associated with data collection require consideration.
Firstly, detailed respiratory support information, such as oxygen flow, levels of inspired oxygen, and ventilator settings, is not fully available, which could limit the accuracy of diagnosis and of clinical severity classification. Secondly, the exact time at which laboratory values were measured is not consistently provided by the different sites, because laboratory test results are standardized to the calendar day but not to the time of day.
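The second limitation can be illustrated with a short sketch. When results carry only a calendar date, they can be aligned to the day of admission, but two results from the same day cannot be ordered in time. The function names, the test name, and the values below are hypothetical and invented for illustration.

```python
from collections import defaultdict
from datetime import date


def day_offset(admission: date, result_day: date) -> int:
    """Whole calendar days between admission and a lab result."""
    return (result_day - admission).days


def group_by_day(admission, results):
    """Group (result_day, test_name, value) tuples by day offset from admission."""
    grouped = defaultdict(list)
    for result_day, name, value in results:
        grouped[day_offset(admission, result_day)].append((name, value))
    return dict(grouped)


# Made-up example data: two results share a calendar day.
admission = date(2020, 4, 1)
results = [
    (date(2020, 4, 1), "crp", 80.0),
    (date(2020, 4, 1), "crp", 95.0),
    (date(2020, 4, 3), "crp", 40.0),
]
by_day = group_by_day(admission, results)
print(by_day)
```

The two day-0 values appear in input order only. Without a time of day, there is no way to tell which was measured first, so any trend within a single day is lost.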
These challenges limit the standardization of data collection, but the methodology presented in the study provides a clinically useful machine learning-based predictor of SARS-CoV-2 severity. Ultimately, the study is particularly insightful because it is the first to use data from multiple health systems across the USA to evaluate COVID-19 severity and risk factors over time, yielding predictive models that could inform future risk-based strategies.
Journal reference:
- JAMA Network Open. 2021;4(7):e2116901. doi:10.1001/jamanetworkopen.2021.16901