In a recent study published in The Lancet Digital Health, a group of researchers developed and evaluated a scalable, privacy-preserving federated learning solution using low-cost microcomputing for coronavirus disease 2019 (COVID-19) screening in United Kingdom (UK) hospitals.
Background
Patient data use in medical artificial intelligence (AI) research faces ethical, legal, and technical challenges, including risks of misuse and privacy breaches. Federated learning offers a privacy-protecting approach by allowing AI model development without sharing data outside organizations. Each organization trains on its own data locally, in contrast to traditional centralized training, where data are pooled in one place.
In client-server federated learning in particular, sites share model weights, not patient data, with a central server that builds a global model. Real-world hospital implementations are rare and often require technical expertise and separation of data from clinical systems.
Further research is needed to refine and validate the federated learning approach in diverse healthcare settings and to address implementation challenges for wider adoption in real-world clinical environments.
About the study
The present study involved a detailed process to develop and test a federated learning solution for COVID-19 screening in UK hospitals. Researchers selected four National Health Service (NHS) hospital groups: Oxford University Hospitals (OUH), University Hospitals Birmingham (UHB), Bedfordshire Hospitals (BH), and Portsmouth Hospitals University (PUH). They used Raspberry Pi 4 Model B devices for full-stack federated learning. This setup allowed each hospital to train, calibrate, and evaluate AI models locally using de-identified patient data, ensuring privacy.
Inclusion and exclusion criteria were provided to NHS trusts for data extraction from electronic health records. Data de-identification was rigorously conducted by clinical teams or NHS informaticians. The study used a pre-pandemic control cohort and a COVID-19-positive cohort for training, with data including vital signs, demographics, and blood test results. Data extracts were loaded onto client devices for federated training, calibration, and evaluation.
The federated training employed logistic regression and deep neural network classifiers. Features were preprocessed into a common format, and missing data were imputed using local median values. The FedAvg algorithm facilitated training across hospital groups, with clients transmitting model parameters to the central server for aggregation. Calibration of local models aimed for a set sensitivity threshold, with evaluation results aggregated by the server.
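To make the two steps above concrete, here is a minimal sketch of local median imputation followed by FedAvg-style aggregation. It is an illustration under stated assumptions, not the study's code: the function names `impute_local_median` and `fedavg` are hypothetical, missing values are represented as `None`, and clients are weighted by their local sample counts, which is the usual FedAvg convention but is not spelled out in this article.

```python
from statistics import median


def impute_local_median(rows):
    """Fill None entries in each column with that column's median,
    computed only from this site's observed values.
    Assumes every column has at least one observed value."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        medians.append(median(observed))
    return [
        [medians[j] if r[j] is None else r[j] for j in range(n_cols)]
        for r in rows
    ]


def fedavg(client_params, client_sizes):
    """Average client parameter vectors on the server, weighting each
    client by its local sample count (assumed weighting scheme)."""
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(p[j] * n for p, n in zip(client_params, client_sizes)) / total
        for j in range(n_params)
    ]


# Toy data: one site with a missing value in the second feature.
site_rows = [[1.0, None], [3.0, 4.0], [5.0, 8.0]]
filled = impute_local_median(site_rows)  # None becomes median of [4.0, 8.0]

# Two hypothetical clients send parameters; only these leave the site.
global_params = fedavg([[0.0, 2.0], [4.0, 6.0]], client_sizes=[100, 300])
print(filled[0], global_params)
```

Only the parameter vectors reach the server in this sketch; the rows themselves stay on the client device, which mirrors the privacy argument made above.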
The federated evaluation involved using prospective cohorts from various hospitals. Calibration and imputation strategies varied depending on whether sites participated in both training and evaluation or evaluation only. Site-specific model tuning tested the global model's adaptability, and a centralized server-side evaluation verified federated evaluation fidelity. The study also examined the impact of individual features on model predictions.
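Calibrating a model to a set sensitivity amounts to choosing a decision threshold on a site's calibration data. The sketch below shows one simple way to do that; it is an assumption-laden illustration, not the study's procedure. The function name `calibrate_threshold` is hypothetical, and the target of 0.75 is chosen only for the toy example, since the study's target value is not stated here.

```python
import math


def calibrate_threshold(y_true, scores, target_sensitivity):
    """Return the highest threshold t such that classifying
    score >= t as positive reaches at least the target sensitivity
    on the given calibration data."""
    positives = sorted(
        (s for y, s in zip(y_true, scores) if y == 1), reverse=True
    )
    if not positives:
        raise ValueError("calibration data contains no positive cases")
    # Number of positive cases that must be flagged to meet the target.
    needed = max(1, math.ceil(target_sensitivity * len(positives)))
    return positives[needed - 1]


# Toy calibration set: four positives, two negatives.
labels = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.3, 0.65, 0.2]
threshold = calibrate_threshold(labels, scores, target_sensitivity=0.75)
print(threshold)  # 0.6: three of the four positives score at or above it
```

Because the threshold depends on the local score distribution, a site that only evaluates the model would need its own calibration data or a threshold supplied from elsewhere, which is consistent with the strategies differing by site role.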
Statistical analysis focused on comparing model performance across different configurations and training methods, using measures like AUROC, sensitivity, and specificity.
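For readers unfamiliar with these measures, the following sketch computes them on toy data. It assumes binary labels (1 for COVID-19 positive, 0 for negative) and that a score at or above the threshold counts as a positive prediction; the function names are hypothetical. AUROC is computed here as the fraction of positive-negative pairs in which the positive case receives the higher score, with ties counted as half.

```python
def auroc(y_true, scores):
    """Probability that a randomly chosen positive case scores higher
    than a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(y_true, scores, threshold):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))                         # 0.75
print(sensitivity_specificity(labels, scores, 0.5))  # (0.5, 1.0)
```

AUROC does not depend on a threshold, whereas sensitivity and specificity do, which is why the study reports them alongside a calibration step.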
Study results
In the study, comparing locally trained models with the federated global model revealed a notable increase in the AUROC of the logistic regression model: from 0.685 to 0.829 at OUH and from 0.731 to 0.865 at PUH. Deep neural network models showed even larger improvements, with AUROC values rising from 0.574 to 0.872 at OUH and from 0.622 to 0.876 at PUH.
Three NHS trusts (OUH, UHB, and PUH) participated in this federated training, contributing data from a large cohort of patients. The federated evaluation included data from patients admitted during the pandemic's second wave, with varying COVID-19 prevalence rates and median ages across participating sites.
When the final global models were externally evaluated, both logistic regression and deep neural network models demonstrated high classification performance. The federated calibration achieved impressive sensitivities, with the logistic regression model at 83.4% and the deep neural network model at 89.7%.
The performance of these models remained stable across the different evaluation sites. The deep neural network model, in particular, showed more marked improvement through federation compared to the logistic regression model, reaching a performance plateau after about 75-100 rounds.
Site-specific tuning of the global models resulted in a slight improvement in the deep neural network model at PUH, but no significant improvement was observed for the logistic regression model. This suggested a high level of generalizability of the global models and minimal shifts in predictor distributions between sites.
The analysis of the logistic regression global model highlighted several key predictors, such as granulocyte counts and albumin concentrations, aligning with previous studies emphasizing their roles in the inflammatory response. The deep neural network model's analysis using Shapley additive explanations revealed eosinophil count as a highly influential predictor.
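Shapley additive explanations attribute a prediction to individual features by averaging each feature's marginal contribution over all feature subsets. The sketch below computes exact Shapley values by brute force for a toy two-feature model. It is not the SHAP library and not the study's analysis: the function `shapley_values`, the toy model, and the use of a fixed baseline vector to stand in for "absent" features are all illustrative assumptions, and exhaustive enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one input x. Features outside a
    coalition are replaced by their baseline value (assumed convention)."""
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical linear score over two made-up features.
    return 2.0 * z[0] - 1.0 * z[1] + 0.5


phi = shapley_values(toy_model, x=[3.0, 4.0], baseline=[1.0, 1.0])
print(phi)  # [4.0, -3.0]
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the property that lets a ranking of features, such as the one reported above, be read as shares of the model's output.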