In a recent study posted to the medRxiv* preprint server, researchers in Australia, Finland, and New Zealand reviewed the translational success of artificial intelligence (AI) models in healthcare, especially those used in coronavirus disease 2019 (COVID-19) studies.
Study: Application of comprehensive evaluation framework to Coronavirus Disease 19 studies: A systematic review of translational aspects of artificial intelligence in health care.
*Important notice: medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.
Background
Although studies have demonstrated the superiority of AI in experimental or pilot settings, and some AI applications are in clinical trials to determine their potential for integration into medical information systems, few studies have demonstrated that these tools improve clinical outcomes. Because these AI applications often perform worse on external validation and are poorly accepted by clinicians, they have yet to be integrated into existing clinical workflows.
About the study
In the present study, researchers evaluate COVID-19 AI models developed between December 2019 and December 2020 using translational evaluation of healthcare AI (TEHAI), a comprehensive framework for assessing the translational value of AI models.
TEHAI evaluates scientific studies across three components (inherent capability, utility, and adoption) that together comprise 15 subcomponents. This expert-driven, formalized framework mitigates an individual's subjectivity by substituting the consensus of several reviewers. Each subcomponent is scored from zero to three points, depending on study quality.
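To make the scoring scheme concrete, below is a minimal Python sketch of how such a rubric might be tallied. The subcomponent names and scores are hypothetical placeholders, not TEHAI's actual criteria; only the structure (three components, 15 subcomponents, each scored 0 to 3) follows the framework as described.

```python
# Hypothetical tally of a TEHAI-style rubric: three components whose
# 15 subcomponents are each scored from 0 to 3. The subcomponent names
# below are illustrative placeholders, not TEHAI's actual criteria.
scores = {
    "capability": {"objective": 3, "data_quality": 2, "method": 3,
                   "internal_validity": 2, "performance": 3},
    "utility":    {"generalizability": 1, "safety": 0, "ethics": 1,
                   "efficiency": 1, "external_validity": 0},
    "adoption":   {"integration": 0, "use_case": 1, "alignment": 1,
                   "monitoring": 0, "scalability": 0},
}

for component, subs in scores.items():
    assert all(0 <= s <= 3 for s in subs.values()), "each score must be 0-3"
    total, maximum = sum(subs.values()), 3 * len(subs)
    print(f"{component}: {total}/{maximum}")
```

A profile like this one, with high capability but low utility and adoption totals, mirrors the pattern the review reports for most of the evaluated studies.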
This systematic review used the Covidence software platform. Nine independent reviewers assessed the scientific literature for translational value, while two others collected descriptive data from each study. An additional reviewer compared evaluation scores and extracted data across all studies to resolve any discrepancies.
Fisher's exact test was used to evaluate associations between groupings of the scientific papers and the distributions of subcomponent scores. Finally, Kendall's tau rank correlation was used to quantify associations among all 15 subcomponents.
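As an illustration of these two tests, the following sketch applies them to toy data shaped like the study's score data, using SciPy. The contingency counts and score vectors below are invented for demonstration and are not drawn from the review.

```python
from scipy.stats import fisher_exact, kendalltau

# Toy 2x2 contingency table: study grouping (e.g., imaging vs. non-imaging)
# versus whether a subcomponent scored high (>=2) or low (<2).
table = [[40, 29],   # imaging studies: high, low
         [25, 8]]    # non-imaging studies: high, low
odds_ratio, p_value = fisher_exact(table)
print(f"Fisher's exact test: OR={odds_ratio:.2f}, p={p_value:.3f}")

# Kendall's tau between two subcomponent score vectors (0-3 per study).
scores_a = [3, 2, 2, 1, 0, 3, 2, 1]
scores_b = [2, 2, 1, 1, 0, 3, 3, 0]
tau, p_tau = kendalltau(scores_a, scores_b)
print(f"Kendall's tau={tau:.2f}, p={p_tau:.3f}")
```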
Results
Screening over this one-year timespan yielded more than 3,000 candidate studies, suggesting high activity in this area. However, only 102 studies met the criteria for full evaluation.
Most studies scored highly on the capability component of the TEHAI framework but poorly on the utility and service adoption components, indicating strong technical performance but weak clinical translatability. Most studies also fell short on translation-related parameters such as ethics, safety, external model validation, and quality of integration with medical systems.
Sixty-nine of the 102 studies were related to medical imaging analysis, with the convolutional neural network being the most popular machine learning model. This result was anticipated, as imaging techniques are now widely understood and readily applied in real-world clinical settings. Nevertheless, non-imaging studies scored higher on the adoption and utility subcomponents.
Surprisingly, studies with large datasets scored no better in the utility or adoption domains. However, as the number of studies available for analysis increases, differences between small and large datasets may yet become significant.
Only a few independent studies have tested claims that AI models produce more accurate and specific results in real time than human experts. Thus, despite their potential, most AI models are not yet suitable for clinical translation and, if deployed prematurely, could lead to undesirable outcomes, including additional strain on the healthcare system, redundant invasive procedures for patients, and deaths due to misdiagnoses.
Most studies gave inadequate consideration to the service adoption domain of the TEHAI framework, which concerns the real-world application of AI-based models in the medical industry. More pilot data from real-world tests of new AI-based tools are therefore needed to weigh the costs of misclassification and deployment from a patient safety perspective. There is also an urgent need for preliminary accounting of workload requirements.
Conclusions
The present review assessed 102 COVID-19 AI studies and revealed a notable gap in most of them that could adversely impact their clinical translation. These findings emphasize the importance of addressing the translatability challenge of AI in the medical information systems domain.
Researchers should also introduce appropriate interventions early in the AI development cycle to improve translatability. In this regard, the TEHAI evaluation framework could be beneficial, and the findings from its application could inform all stakeholders, including developers, researchers, and clinicians, in deploying more translatable AI models in healthcare.