In a recent study posted to Preprints with The Lancet*, researchers developed a machine learning approach to identify patients with long coronavirus disease (COVID).
This news article was a review of a preliminary scientific report that had not undergone peer-review at the time of publication. Since its initial publication, the scientific report has now been peer reviewed and accepted for publication in a Scientific Journal. Links to the preliminary and peer-reviewed reports are available in the Sources section at the bottom of this article.
The post-acute sequelae of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection are called long COVID. The variation and evolution of symptoms characterizing long COVID necessitate the investigation of this syndrome.
About the study
In the present study, researchers aimed to generate a robust clinical definition for long COVID using data related to long COVID patients.
The team utilized electronic health record data that were integrated and harmonized in the secure N3C Data Enclave. This allowed the team to identify unique patterns and clinical characteristics among patients with COVID-19. The base population of the study was defined as any individual aged 18 years or older who had either a positive SARS-CoV-2 polymerase chain reaction (PCR) or antigen test, or an International Classification of Diseases-10 (ICD-10) COVID-19 diagnosis code from an emergency or inpatient visit. Eligible participants also had to be at least 90 days past their COVID-19 index date, which was defined as the earliest date of COVID-19 diagnosis.
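The eligibility rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's code: the `Patient` record, its field names, and the `as_of` cutoff date are hypothetical, and the index date is taken as the earlier of the two evidence dates.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Patient:
    # Hypothetical record layout, for illustration only.
    age: int
    positive_test_date: Optional[date] = None   # earliest positive PCR or antigen test
    diagnosis_code_date: Optional[date] = None  # earliest qualifying ICD-10 COVID-19 code


def covid_index_date(p: Patient) -> Optional[date]:
    """Return the earliest date of COVID-19 evidence, or None if there is none."""
    dates = [d for d in (p.positive_test_date, p.diagnosis_code_date) if d is not None]
    return min(dates) if dates else None


def in_base_population(p: Patient, as_of: date, min_days: int = 90) -> bool:
    """Adult, with COVID-19 evidence, and at least `min_days` past the index date."""
    index = covid_index_date(p)
    if p.age < 18 or index is None:
        return False
    return as_of - index >= timedelta(days=min_days)
```

A patient aged 40 with a positive test on 1 March 2021 would qualify when evaluated on 1 October 2021, whereas the same patient with a positive test on 1 September 2021 would not yet have reached the 90-day mark.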
Three N3C sites provided lists of patients who had visited that site’s long COVID clinic at least once. These patients were referred to as long COVID clinic patients, while patients identified by the machine learning model as having long COVID were referred to as patients with potential long COVID.
To train and test the machine learning models, the team created a subset of the overall cohort comprising only patients from the three N3C sites that provided long COVID clinic lists, which included the long COVID clinic patients. This group was further divided into patients who were hospitalized due to acute COVID-19 and non-hospitalized patients. The subset was then restricted to patients who had at least one healthcare visit and at least one diagnosis or medication after COVID-19.
In the three-site subset, the team assessed patient demographics, details of healthcare visits, medical conditions, and prescription drugs ordered for each patient before and after the acute COVID-19 diagnosis. The team only took into account diagnoses that were either newly incident or occurred more frequently after the COVID-19 diagnosis than before, and drugs that were prescribed after the COVID-19 diagnosis.
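One way to picture the diagnosis filter described above is a simple before-and-after count per condition. The sketch below is an assumption-laden illustration rather than the study's feature pipeline: the function name and the condition labels are hypothetical, and it compares raw counts without adjusting for how long each observation window is.

```python
from collections import Counter
from datetime import date


def post_covid_diagnosis_features(diagnoses, index_date):
    """Keep conditions that are new, or recorded more often, after the index date.

    `diagnoses` is a list of (condition, date) pairs for one patient.
    Returns a dict mapping each retained condition to its post-index count.
    """
    before = Counter(cond for cond, when in diagnoses if when < index_date)
    after = Counter(cond for cond, when in diagnoses if when >= index_date)
    # A Counter returns 0 for unseen keys, so brand-new conditions pass the test.
    return {cond: n for cond, n in after.items() if n > before[cond]}
```

For example, a condition recorded once before and twice after the index date would be retained, while one recorded once in each period would be dropped.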
Because long COVID manifestations might differ with the severity of acute COVID-19, the team generated three distinct machine learning models from the three-site subset: (1) all patients; (2) patients hospitalized due to acute COVID-19; and (3) non-hospitalized patients. Each model identified patients with a higher chance of having long COVID, using a patient’s attendance at a long COVID clinic as a proxy for a long COVID diagnosis. The team randomly sampled patients to obtain equal numbers of long COVID clinic patients and patients who had not visited any long COVID clinic.
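The balanced sampling step can be illustrated as follows. This is a minimal sketch, assuming patients are represented by opaque identifiers and that clinic attendance is encoded as label 1; the function name and the fixed random seed are illustrative choices, not details reported by the study.

```python
import random


def balanced_sample(clinic_ids, non_clinic_ids, seed=0):
    """Draw equal numbers of clinic and non-clinic patients, labeled 1 and 0."""
    rng = random.Random(seed)  # local generator so results are reproducible
    n = min(len(clinic_ids), len(non_clinic_ids))
    positives = rng.sample(list(clinic_ids), n)
    negatives = rng.sample(list(non_clinic_ids), n)
    return [(pid, 1) for pid in positives] + [(pid, 0) for pid in negatives]
```

Sampling down to the smaller group keeps the two classes the same size, so a classifier cannot score well simply by predicting the more common label.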
The team assessed the performance of each model using the area under the receiver operating characteristic curve (AUROC), recall, precision, and F-score. Once trained, the all-patients model was run over the full base population of patients who had at least one healthcare visit and at least one diagnosis or medication after COVID-19.
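For readers unfamiliar with these metrics, the sketch below computes them from scratch. It rests on assumptions the article does not state: labels are 1 for long COVID clinic patients and 0 otherwise, the F-score is taken to be F1, and the 0.5 decision threshold is arbitrary. AUROC is computed here as the fraction of positive-negative pairs in which the positive patient receives the higher score, with ties counted as half.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall_f1(labels, scores, threshold=0.5):
    """Precision, recall, and F1 after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

The pairwise AUROC is quadratic in the number of patients, which is fine for a small illustration; a rank-based computation would be the usual choice for a cohort of realistic size.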
Results
The collective demographics of the long COVID clinic patients at the three N3C sites differed significantly from those of COVID-19 patients who did not visit any long COVID clinic. The team noted that a majority of the non-hospitalized long COVID clinic patients were female, while a majority of those hospitalized due to acute COVID-19 were Black.
Running the machine learning models against the three-site populations yielded an AUROC of 0.92 for the all-patients model, 0.90 for the hospitalized model, and 0.85 for the non-hospitalized model. To calculate the performance metrics, the team treated the long COVID clinic patients as true positives and the patients who had not visited any long COVID clinic as true negatives.
When the three models were validated against an independent dataset, the AUROC was 0.82 for the all-patients model, 0.79 for the hospitalized model, and 0.78 for the non-hospitalized model.
Conclusion
The study results showed that four themes were prominent across the three models and long COVID manifestations: (1) post-COVID-19 respiratory symptoms and related treatments; (2) non-respiratory symptoms that were reported to be a part of long COVID and related treatments; (3) pre-existing risk factors for higher severity of acute COVID symptoms; and (4) proxies for hospitalization. Several model features differed substantially between patients who potentially had long COVID and patients who had no evidence of long COVID.
Journal references:
- Preliminary scientific report.
Emily R Pfaff, Andrew T Girvin, Tellen D Bennett, Abhishek Bhatia, Ian M Brooks, Rachel R Deer, et al. (2022). Identifying who has long COVID in the USA: a machine learning approach using N3C data. doi: https://doi.org/10.1016/S2589-7500(22)00048-6 https://www.thelancet.com/journals/landig/article/PIIS2589-7500(22)00048-6/fulltext
- Peer reviewed and published scientific report.
Pfaff, Emily R, Andrew T Girvin, Tellen D Bennett, Abhishek Bhatia, Ian M Brooks, Rachel R Deer, Jonathan P Dekermanjian, et al. 2022. “Identifying Who Has Long COVID in the USA: A Machine Learning Approach Using N3C Data.” The Lancet Digital Health, May. https://doi.org/10.1016/s2589-7500(22)00048-6. https://www.thelancet.com/journals/landig/article/PIIS2589-7500(22)00048-6/fulltext.
Article Revisions
- May 13 2023 - The preprint preliminary research paper that this article was based upon was accepted for publication in a peer-reviewed Scientific Journal. This article was edited accordingly to include a link to the final peer-reviewed paper, now shown in the sources section.