New study reveals that large language models outperform physicians in diagnostic accuracy but require strategic integration to enhance clinical decision-making without replacing human expertise.
Study: Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial.
In a recent study published in JAMA Network Open, researchers investigated whether large language models (LLMs) could enhance physicians' diagnostic reasoning as compared to standard diagnostic resources. Notably, the LLM alone was found to perform better than physician groups that used the LLM to diagnose cases.
How can artificial intelligence improve clinical diagnoses?
Diagnostic errors, which can arise from systemic and cognitive issues, may cause significant harm to patients. Thus, improving diagnostic accuracy requires methods that address the cognitive challenges inherent in clinical reasoning. However, common approaches such as reflective practice, educational programs, and decision support tools have not effectively improved diagnostic accuracy.
Recent advances in artificial intelligence, especially LLMs, offer promising support by simulating human-like reasoning and responses. LLMs can also handle complex medical cases and assist in clinical decision-making, while interacting empathetically with the user.
The current use of LLMs in healthcare is largely supplementary, enhancing rather than replacing human expertise. Given that healthcare professionals receive limited training on integrating LLMs into clinical workflows, it is crucial to understand how the use of these tools affects patient care.
About the study
In the present study, researchers utilized a randomized, single-blind design to assess the diagnostic reasoning abilities of physicians using either LLMs or conventional resources. Physicians working in family, emergency, or internal medicine were recruited for the study, with all sessions conducted in person or remotely.
Physicians were provided with one hour to work through six moderately complex clinical cases presented in a survey tool. Study participants in the intervention group were provided access to ChatGPT Plus with GPT-4, whereas study participants in the control group used only conventional resources.
Clinical cases included detailed patient histories, examination findings, and test results. Case review and selection followed strict criteria involving four physicians; the selected cases covered a wide range of medical conditions, while overly simple and extremely rare cases were excluded.
Structured reflection was included as a conventional assessment tool. This required participants to list their top differential diagnoses, explain the supporting and opposing case factors, and choose the most likely diagnosis while proposing further treatment steps. Responses were graded for the accuracy of the final diagnosis, as well as for diagnostic reasoning.
The objective diagnostic performance of the LLM was evaluated using standardized prompts, each repeated three times for consistency. The responses were then scored, with points assigned for correct reasoning and diagnostic plausibility.
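As a rough illustration of this repeat-and-score procedure, the following Python sketch aggregates graded responses from repeated runs of the same standardized prompt. The function names and point values are illustrative assumptions, not the study's actual rubric.

```python
# Hypothetical sketch of scoring repeated LLM runs; the rubric fields and
# point values below are illustrative assumptions, not the study's rubric.

def score_response(reasoning_points: int, plausibility_points: int) -> int:
    """Total points for one graded response: correct reasoning plus
    diagnostic plausibility."""
    return reasoning_points + plausibility_points

def consistency_score(graded_runs: list[int]) -> float:
    """Average the scores across repeated runs of the same prompt."""
    return sum(graded_runs) / len(graded_runs)

# Each case is prompted three times for consistency, then each run is graded.
runs = [score_response(6, 2), score_response(5, 2), score_response(6, 1)]
print(consistency_score(runs))  # → 7.333333333333333
```

Averaging over repeated runs smooths out run-to-run variability in the model's answers, which is why the study repeated each standardized prompt rather than relying on a single response.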
Statistical analyses used mixed-effects models to account for intra-participant variability, with linear and logistic models applied to time metrics and diagnostic performance.
Study findings
Physicians' use of LLMs did not improve diagnostic reasoning for challenging cases as compared to the use of conventional resources. However, the LLM alone performed significantly better than either physician group in diagnosing cases.
These findings were consistent across physician experience levels, which suggests that simply providing access to LLMs is unlikely to enhance diagnostic reasoning on its own.
No significant differences were observed in case-solving evaluations between the groups. However, further studies using larger sample sizes are needed to determine whether LLM use improves efficiency.
The standalone performance of the LLM was better than that of both human groups, consistent with previously published findings on other LLM technologies. The superior standalone performance of the LLM is attributed to its sensitivity to prompt formulation, which underscores the importance of prompt strategies in maximizing the utility of LLMs.
Conclusions
LLMs show immense promise in efficient diagnostic reasoning. Despite successful diagnoses provided by LLMs in the current study, these results should not be interpreted to indicate that LLMs can provide diagnoses without clinician oversight.
As AI research progresses and nears clinical integration, it will become even more important to reliably measure diagnostic performance using the most realistic and clinically relevant evaluation methods and metrics.
Integrating LLMs into clinical practice requires effective strategies for structured prompt design, along with training physicians to use detailed prompts, which could optimize the performance of physician-LLM collaboration in diagnosis. Ultimately, using LLMs to enhance diagnostic reasoning means treating these tools as complements to, rather than replacements for, physician expertise in clinical decision-making.
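To make the structured-prompt-design point concrete, here is a minimal Python sketch of a diagnostic prompt template. The template wording and field names are assumptions for demonstration; the study's actual prompts are not reproduced here.

```python
# Illustrative structured diagnostic prompt builder; the wording and fields
# are hypothetical, not the study's actual standardized prompts.

def build_diagnostic_prompt(history: str, exam: str, tests: str) -> str:
    """Assemble case details into a detailed, structured prompt that asks
    for a differential, supporting/opposing factors, and next steps."""
    return (
        "You are assisting with a diagnostic case.\n"
        f"Patient history: {history}\n"
        f"Examination findings: {exam}\n"
        f"Test results: {tests}\n"
        "List the leading differential diagnoses, note the factors "
        "supporting and opposing each, then state the most likely diagnosis "
        "and recommended next steps."
    )

prompt = build_diagnostic_prompt(
    "45-year-old with three days of fever and productive cough",
    "crackles over the right lower lung field",
    "elevated white blood cell count",
)
print(prompt)
```

Because LLM output quality is sensitive to prompt formulation, a fixed template like this ensures every case is presented with the same structure, which is one practical form the "structured prompt design" recommendation could take.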
Journal reference:
- Goh, E., Gallo, R., Hom, J., et al. (2024). Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Network Open 7(10): e2440969. doi:10.1001/jamanetworkopen.2024.40969.