As the coronavirus disease 2019 (COVID-19) pandemic has spread across the world, vast amounts of bioinformatics data have been created and analyzed, and logistic regression models have been key to many papers helping to illuminate important features of the disease, such as which mutations are tied to more severe disease outcomes.
*Important notice: medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.
Logistic regression models are used for binary classification, can be generalized to multiclass classification, and normally perform very well. Researchers from the US Air Force Medical Readiness Agency have been studying how logistic regression model training affects performance, and which features are best to include when examining datasets from individuals with COVID-19.
A preprint version of the study is available on the medRxiv* server, while the article undergoes peer review.
The Study
The initial export of raw Global Initiative on Sharing All Influenza Data (GISAID) data was curated using shell scripts, with FASTA sequences parsed from the export. Samples with patient outcome metadata attached were separated out, with ~30,000 samples with severe outcomes and ~25,000 samples with mild outcomes used in the analyses. Scikit-learn was used to fit logistic regression models, and a train/test split was created on the data, with the test data used only for evaluating the performance of the models. A total of five different logistic regression models were created with different input features.
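The fit-and-hold-out workflow described above can be sketched roughly as follows. This is a minimal illustration only: the data are synthetic, and the feature count, split ratio, and random seeds are assumptions rather than values taken from the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the study's GISAID data): 400 samples and
# 6 binary "mutation present" features. Features 0 and 1 drive the outcome.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 6)).astype(float)
logits = 2.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.5, size=400)
y = (logits > 0).astype(int)  # 1 = severe, 0 = mild (hypothetical encoding)

# Hold out a test split that is used only for evaluation, never for fitting.
# The 25% test size is an assumption made for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on the held-out test split.
test_accuracy = model.score(X_test, y_test)
```

Inspecting `model.coef_` after fitting shows which features push the prediction toward the severe or mild class, which is how a model of this kind can point to mutations tied to outcome.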
Initially, the researchers reproduced previous results using the same dataset to validate the accuracy and area under the curve (AUC) of the logistic regression models. AUC measures how well a classifier separates the two outcome classes across all decision thresholds. The model that used age, gender, region, and the variant of COVID-19 as features showed both the highest AUC at 0.91 and the highest accuracy at 91%. This was followed by models that used fewer features. The models identified the same mutations associated with disease severity as in the previous experiment.
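As a rough sketch of how AUC and accuracy might be compared across models with progressively more features, the example below fits one logistic regression per feature subset on synthetic data. The subset names ("A", "AG", "AGR", "AGRV"), the numeric encodings, and the simulated outcome are all assumptions for illustration and do not come from the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 600

# Hypothetical, already numerically encoded columns (synthetic, not study data).
age = rng.uniform(0, 90, n) / 90.0
gender = rng.integers(0, 2, n).astype(float)
region = rng.integers(0, 2, n).astype(float)
variant = rng.integers(0, 2, n).astype(float)
X = np.column_stack([age, gender, region, variant])

# Simulated outcome that depends on age and variant only.
signal = 3.0 * age + 1.5 * variant - 2.25 + rng.normal(0.0, 0.5, n)
y = (signal > 0).astype(int)

# Assumed nesting of feature subsets (column indices into X).
feature_sets = {"A": [0], "AG": [0, 1], "AGR": [0, 1, 2], "AGRV": [0, 1, 2, 3]}

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

results = {}
for name, cols in feature_sets.items():
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    prob_severe = clf.predict_proba(X_te[:, cols])[:, 1]
    results[name] = {
        # AUC uses predicted probabilities; accuracy uses hard class labels.
        "auc": roc_auc_score(y_te, prob_severe),
        "accuracy": accuracy_score(y_te, clf.predict(X_te[:, cols])),
    }
```

Note that AUC is computed from the predicted probabilities, so it is independent of any single decision threshold, whereas accuracy depends on the default 0.5 cut-off.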
Following this, the classification performance of the logistic regression models used in the previous experiment was examined using the newer dataset. The mutations included in the updated dataset were limited to match the feature space of the trained models, with no novel mutations that were not included in the original 2020 dataset present. Generally, the previous models showed a decline in performance when applied to the later dataset, especially for models that included the region feature.
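Restricting a newer dataset to the feature space of an already trained model could look something like the sketch below. The function name and the mutation labels are hypothetical and chosen only for illustration; the study's actual feature list is not given here.

```python
def align_to_feature_space(sample_mutations, trained_features):
    """Encode one sample as a 0/1 vector over the trained feature list.

    Mutations that the trained model never saw are dropped, so the
    resulting vector always has the length and order the model expects.
    """
    present = set(sample_mutations)
    return [1 if feature in present else 0 for feature in trained_features]


# Hypothetical feature order used when the original model was trained.
trained_features = ["S:D614G", "N:R203K", "ORF1a:T265I"]

# A newer sample carrying one known mutation and one unseen by the model.
new_sample = ["S:D614G", "S:N501Y"]

vector = align_to_feature_space(new_sample, trained_features)
```

Here the unseen mutation is silently ignored, which keeps the old model usable but also means any signal carried by newer mutations cannot contribute to its predictions.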
The nested logistic regression models were then retrained on the new dataset, with retraining performed using the train split of the expanded dataset and performance evaluated using the test split. The retrained models were then compared to the models trained on the original dataset. As expected, the models using the age, gender, region, and variants (AGRV) continued to show the best performance, and the models trained on the original dataset outperformed the models trained on the later dataset.
The decrease in the retrained model performance could indicate a reduction in power to distinguish between severe and mild outcomes in the expanded dataset, or could be explained by an inconsistent case severity definition between the two datasets. The mutations most often associated with severe and mild outcomes in the 2020 dataset were not identified in the 2021 dataset, with no overlap in the top 40 mutations. However, 10 of the top 20 mutations associated with severe outcomes in the previous study were also associated with severe outcomes in the 2021 dataset.
Other machine learning binary classifiers were also explored, including Random Forest, Naïve Bayes, and Neural Network algorithms. To compare their performance with that of the logistic regression model, 3,386 samples were used, of which 2,694 were associated with severe outcomes and 692 with mild outcomes.
The AGRV features were once again used for all of the models tested, with a stratified split of 67% training data and 33% test data. Five-fold cross-validation was performed to select the best parameters for each model before scikit-learn ensemble modules were used to run each of the models. The random forest model significantly outperformed all other models, including the logistic regression model that the entire paper focuses on, with an eventual AUC of 0.936 and an accuracy of 0.918.
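A stratified 67/33 split followed by 5-fold cross-validated parameter selection might be set up roughly as below. The data are synthetic, only two of the classifiers are shown for brevity, and the parameter grids and AUC-based selection criterion are assumptions; the study's actual search space is not described here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for an encoded AGRV feature matrix (not study data).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Stratified 67% train / 33% test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

# Hypothetical estimators and parameter grids chosen for this sketch.
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        RandomForestClassifier(n_estimators=50, random_state=0),
        {"max_depth": [3, None]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validation on the training split selects the parameters;
    # the test split is touched only once, for the final evaluation.
    search = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc")
    search.fit(X_train, y_train)
    best = search.best_estimator_
    prob_positive = best.predict_proba(X_test)[:, 1]
    results[name] = {
        "best_params": search.best_params_,
        "auc": roc_auc_score(y_test, prob_positive),
        "accuracy": accuracy_score(y_test, best.predict(X_test)),
    }
```

Stratifying the split matters for an imbalanced sample such as the one described (2,694 severe versus 692 mild), because it keeps the class ratio similar in the training and test portions.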
Conclusions
The researchers found that Random Forest was the best performing algorithm for classification, which could indicate the presence of non-linear interactions between features.
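To illustrate why a Random Forest can beat logistic regression when features interact non-linearly, the toy sketch below builds an outcome that depends only on the exclusive-or of two binary features. This is a deliberately extreme, synthetic case and is not drawn from the study's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Four binary features; only the interaction of features 0 and 1 matters.
X = rng.integers(0, 2, size=(400, 4))
y = X[:, 0] ^ X[:, 1]  # XOR: no single feature is informative on its own

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

# A linear decision boundary cannot represent XOR without interaction terms.
lr_acc = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(
    X_test, y_test
)

# Tree ensembles split on one feature, then the other, capturing the interaction.
rf_acc = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    X_train, y_train
).score(X_test, y_test)
```

A logistic regression model could recover this pattern only if an explicit interaction term (for example, the product of the two features) were added as an input.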
In addition, they have identified the most effective features for examining COVID-19 data with logistic regression models, which should be of help to bioinformaticians studying datasets where Random Forest is a suitable method of analysis.