Revolutionizing missing data management in EHRs with machine learning

Researchers from the National Institute of Health Data Science at Peking University and the Department of Clinical Epidemiology and Biostatistics at Peking University People's Hospital have conducted a comprehensive systematic review evaluating strategies for addressing missing data in electronic health records (EHRs). Published in Health Data Science, the study highlights the growing importance of machine learning methods over traditional statistical approaches in managing missing data scenarios effectively.

Electronic health records have become a cornerstone in modern healthcare research, enabling analysis across clinical trials, treatment effectiveness studies, and genetic association research. However, missing data remains a persistent challenge, potentially introducing bias and undermining the reliability of findings. This study reviewed 46 research papers published between 2010 and 2024, systematically comparing the performance of traditional statistical methods, such as Multiple Imputation by Chained Equations (MICE), with modern machine learning approaches like Generative Adversarial Networks (GANs) and k-Nearest Neighbors (KNN).

The findings reveal that machine learning techniques, particularly GAN-based methods and context-aware time-series imputation (CATSI), consistently outperformed traditional statistical approaches in handling both longitudinal and cross-sectional datasets. For longitudinal data, Med.KNN and CATSI showed superior performance, while probabilistic principal component analysis (PCA) and MICE were more effective for cross-sectional datasets.
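To make the nearest-neighbor idea concrete, here is a minimal sketch of plain KNN imputation in standard-library Python. It is an illustration only, not the Med.KNN method evaluated in the review; the function name `knn_impute`, the toy patient rows, and the choice of k are all hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k most similar rows.

    Illustrative sketch only: a plain KNN imputer, not Med.KNN.
    Features are used unscaled here; real data would usually be
    standardized first so that no single column dominates the distance.
    """

    def distance(a, b):
        # Root-mean-square difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # nothing in common, so the rows are not comparable
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]  # copy so the input is left untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows that actually observed column j.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the gap if no comparable donor exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Hypothetical records: [age, systolic blood pressure, fasting glucose]
rows = [
    [50.0, 120.0, 5.0],
    [52.0, 124.0, None],  # glucose was not recorded for this patient
    [51.0, 122.0, 5.4],
    [80.0, 160.0, 9.0],
]
filled = knn_impute(rows, k=2)
print(filled[1])
```

In this toy example the missing glucose value is filled from the two patients with the most similar age and blood pressure, and the dissimilar fourth patient is ignored. Methods such as MICE instead model each incomplete variable from the others in repeated rounds.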

"Machine learning methods show significant promise for addressing missing data in EHRs. However, no single approach offers a universally applicable solution, underscoring the need for standardized benchmarking analyses across diverse datasets and missingness scenarios."

Dr. Huixin Liu, Associate Professor at Peking University People's Hospital

The study also identifies key challenges, including the heterogeneity of EHR datasets, the opacity of machine learning models, and the lack of universal benchmarks for assessing methodology performance. Future research aims to establish a standardized protocol for handling missing EHR data and develop benchmarking datasets for comprehensive evaluation.

"Our ultimate goal is to create a universally accepted protocol for handling missing data in electronic health records, ensuring more reliable and reproducible findings across medical research," added Dr. Shenda Hong, Assistant Professor at the National Institute of Health Data Science at Peking University.

This research marks a significant step toward addressing one of the most pressing challenges in digital healthcare research, offering insights that can help bridge the gap between data scarcity and robust analysis.

Journal reference:

Ren, W., et al. (2024). Moving Beyond Medical Statistics: A Systematic Review on Missing Data Handling in Electronic Health Records. Health Data Science. doi.org/10.34133/hds.0176.
