In a recent perspective article published in npj Digital Medicine, researchers discussed the possible benefits and limits of artificially generated data in the context of healthcare analytics.
Study: Harnessing the power of synthetic data in healthcare: innovation, application, and privacy.
Background
Data-based decision-making underlies predictive analytics and innovation in clinical research and public health. In banking and economics, synthetic information has demonstrated promising potential for improving algorithm development, risk assessment, and portfolio optimization.
In healthcare, on the other hand, higher risks, possible liabilities, and skepticism among health practitioners make clinical use of artificially generated information challenging.
About the perspective
In the present perspective, researchers reviewed synthetic data usage, applications, challenges, and limitations in the health sector.
Synthetic data: introduction and applications
Synthetic information is a viable alternative to standard healthcare data, providing a means of gaining access to high-quality datasets. It is generated using mathematical models or algorithms, including deep learning architectures such as generative adversarial networks (GANs) and variational autoencoders (VAEs), to tackle specific data science challenges.
In clinical contexts, synthetic data may be used to quantify the effectiveness of screening programs, enrich artificial intelligence algorithms, train machine learning-based models for particular patient groups, and enhance the performance of population health models that anticipate infectious disease outbreaks.
Synthetic data may also aid in studying the implications of health policies, especially concerning demographic aging, by generating a synthetic dataset and testing policy choices using micro-simulation techniques.
Further, synthetic data may be used to assess the influence of policies on health outcomes, including morbidity, community assistance, and physician behavior. Clinical problems that affect many people at once, such as the coronavirus disease 2019 (COVID-19) pandemic, might also benefit from synthetic data.
During the pandemic, synthetic data was used to increase the volume of information in imaging investigations, improving the accuracy of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) detection methods compared with using the original datasets alone.
Synthetic information may also benefit digital twins, that is, virtual replicas of physical processes or systems that are used for real-time behavior prediction.
Synthetic data may be used for simulating different hospital settings and predicting results, thereby improving patient outcomes and perhaps lowering expenses by constructing tailored models of patients.
Limitations and challenges of synthetic data use
Artificially generated information is useful for risk assessment in clinical scenarios. However, it also has drawbacks, such as modeling inaccuracy, poor interpretability, and a lack of effective tools for verifying data quality.
AI may help address these difficulties through automated methods, such as anomaly detection, that find records differing considerably from the training data distribution.
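As a rough illustration of the anomaly detection idea, the sketch below flags synthetic records whose values sit far from the distribution of the real training data, using a simple per-feature z-score. The function names, the threshold of three standard deviations, and the toy values are assumptions made for illustration; the perspective does not prescribe a specific method.

```python
from statistics import mean, stdev

def fit_feature_stats(real_rows):
    """Return (mean, standard deviation) for each column of the real data."""
    columns = list(zip(*real_rows))
    return [(mean(col), stdev(col)) for col in columns]

def flag_outliers(synthetic_rows, stats, threshold=3.0):
    """Return indices of synthetic rows with any feature beyond `threshold` z-scores."""
    flagged = []
    for i, row in enumerate(synthetic_rows):
        for value, (mu, sigma) in zip(row, stats):
            # Skip constant features, where a z-score is undefined.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                flagged.append(i)
                break
    return flagged

# Hypothetical toy data: columns are age and systolic blood pressure.
real = [(54, 120), (61, 135), (47, 118), (58, 128), (65, 140), (50, 122)]
synthetic = [(56, 125), (59, 131), (52, 260)]  # last row has an implausible pressure

stats = fit_feature_stats(real)
outlier_indices = flag_outliers(synthetic, stats)
print(outlier_indices)
```

A z-score check treats each feature independently, so it may miss records that are only implausible in combination; it is meant as a starting point rather than a complete quality check.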
However, black-box generation algorithms, the limitations of evaluation metrics, and the possibility of underfitting or overfitting can reduce trust in synthetic information, making it harder for researchers and health professionals to draw accurate conclusions or make informed decisions.
Although explainable AI (XAI) approaches can help determine whether synthetic data retains the input-output relationships seen in real data, the interpretations and explanations these methods offer can be context-dependent and subjective.
In cases where XAI approaches fail to evaluate data correctness and representativeness, robust auditing procedures are required. Machine learning-based models and advanced statistical approaches can effectively assess the similarities between real-world and synthetic datasets, improving data representativeness.
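One common statistical way to compare a real and a synthetic dataset, shown here only as a minimal sketch, is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical cumulative distributions of a single feature. The choice of this statistic, the function name, and the toy values are assumptions for illustration and are not taken from the perspective.

```python
from bisect import bisect_right

def ks_statistic(real_values, synthetic_values):
    """Largest gap between the two empirical cumulative distribution functions.

    Returns a value in [0, 1]; 0 means the two samples have identical
    empirical distributions, 1 means they do not overlap at all.
    """
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synthetic_values)
    n, m = len(real_sorted), len(synth_sorted)
    largest_gap = 0.0
    # Both step functions only change at observed values, so it is
    # enough to compare them at every observed value.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        largest_gap = max(largest_gap, abs(cdf_real - cdf_synth))
    return largest_gap

# Hypothetical toy values for a single numeric feature.
real_feature = [1, 2, 3, 4, 5]
identical_synthetic = [1, 2, 3, 4, 5]
shifted_synthetic = [11, 12, 13, 14, 15]

print(ks_statistic(real_feature, identical_synthetic))
print(ks_statistic(real_feature, shifted_synthetic))
```

A statistic like this compares one feature at a time, so it says nothing about whether relationships between features are preserved; a fuller audit would also need multivariate checks.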
Domain-specific assessment criteria and benchmark data are useful for comparing the performances of different synthetic data creation techniques.
When working with clinical data, a "privacy-by-design" mindset must be adopted to guarantee that artificial data generated from medical records does not inadvertently reveal identifiable information about individuals and lead to re-identification, thus infringing data security and privacy principles.
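A very basic leakage check, sketched below under stated assumptions, is to measure how many synthetic records are exact copies of real records. The function name and the toy records (age, sex, diagnosis code) are hypothetical, and this check is not described in the perspective.

```python
def exact_match_rate(real_rows, synthetic_rows):
    """Fraction of synthetic rows that are identical to some real row."""
    if not synthetic_rows:
        return 0.0
    real_set = set(real_rows)
    matches = sum(1 for row in synthetic_rows if row in real_set)
    return matches / len(synthetic_rows)

# Hypothetical toy records: (age, sex, diagnosis code).
real_records = [(54, "F", "I10"), (61, "M", "E11"), (47, "F", "J45")]
synthetic_records = [
    (55, "F", "I10"),
    (61, "M", "E11"),  # identical to a real record
    (48, "M", "J45"),
    (60, "F", "E11"),
]

leak_rate = exact_match_rate(real_records, synthetic_records)
print(leak_rate)
```

Exact matches are only the crudest signal. A rate of zero does not show that the data is safe, because near-copies and rare attribute combinations can still enable re-identification.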
Conclusions
Based on this perspective, artificially generated information can transform healthcare by enhancing research capacity and developing cost-efficient solutions. However, difficulties such as skewed information, data quality concerns, and privacy threats are critical.
To exploit the revolutionary power of synthetic information, the healthcare sector must actively participate in dialogues and partnerships with patients, regulatory agencies, and technology developers.
Synthetic data has real-world healthcare applications, such as improving data privacy, enriching datasets for predictive analytics, and fostering openness and accountability.
Regulatory bodies contribute to openness and accountability by offering risk-mitigation techniques, including differential privacy (DP) and a digital chain of custody for datasets. Protecting patient health and upholding ethical norms are critical to encouraging the safe use of artificially generated data.
Differential privacy emerges as a strong, dependable, and viable method, and the healthcare sector must put safeguards around the spread of synthetic datasets by adopting and enforcing suitable legislation.
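To give a sense of how differential privacy works in its simplest form, the sketch below applies the Laplace mechanism to a single count, such as the number of patients with a given diagnosis. It illustrates the basic idea only and is not a method for generating differentially private synthetic datasets; the function name, the epsilon value, and the example count are assumptions.

```python
import random

def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy (Laplace mechanism).

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rate = epsilon  # rate of each exponential = 1 / scale = epsilon
    # The difference of two independent exponential draws with the same
    # rate follows a Laplace distribution centred on zero.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise

rng = random.Random(0)  # fixed seed so the example is repeatable
noisy = private_count(true_count=100, epsilon=1.0, rng=rng)
print(noisy)
```

Smaller values of epsilon add more noise and give stronger privacy at the cost of accuracy, which is the trade-off any deployment would have to tune.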
It is critical to establish a strong digital chain of custody to maintain data privacy, integrity, and security throughout the data's lifespan.