While large language models ace medical exams, their inability to recognize uncertainty highlights a critical flaw that could impact patient safety.
Research: Large Language Models lack essential metacognition for reliable medical reasoning. Image Credit: NicoElNino / Shutterstock
In a recent study published in the journal Nature Communications, researchers evaluated the metacognitive abilities of popular large language models (LLMs) to assess their suitability for deployment in clinical settings. They developed a novel benchmarking tool named "MetaMedQA" as a modification and enhancement of the MedQA-USMLE benchmark to evaluate LLM performance across missing answer recall, confidence-based accuracy, and unknown recall through multiple-choice medical questions.
Study findings revealed that despite scoring high on multiple-choice questions, LLMs were incapable of recognizing the limitations of their knowledge base, providing confident answers even when none of the options offered was factually correct. Exceptions such as GPT-4o exhibited relatively better self-awareness and confidence calibration, underscoring the variability across models. These findings reveal a disconnect between LLMs' perception of their capabilities and their actual medical abilities, which could prove disastrous in clinical settings. The study therefore identifies room for growth in LLM development, calling for enhanced metacognition to be incorporated before LLMs can be reliably deployed in clinical decision support systems.
Background
Large language models (LLMs) are artificial intelligence (AI) models that use deep learning techniques to understand and generate human language. Recent advances in LLMs have resulted in their extensive use across various industries, including defense and healthcare. Notably, several LLMs, including OpenAI's popular ChatGPT models, have been demonstrated to achieve expert-level performance in official medical board examinations across a wide range of medical specialties (pediatrics, ophthalmology, radiology, oncology, and plastic surgery).
While several evaluation methodologies (such as the current gold standard, "MultiMedQA") have been developed to assess LLM performance in medical applications, they share a common drawback: they are limited to evaluating model information recall and pattern recognition, giving no weight to metacognitive abilities. Recent studies have highlighted these limitations by revealing deficiencies in model safety, particularly LLMs' tendency to generate misleading information when accurate information is lacking.
About the Study
The present study aimed to develop a novel evaluation of the metacognitive capabilities of current and future LLMs. It developed and tested a framework titled "MetaMedQA" by incorporating fictional, malformed, and modified medical questions into the existing MedQA-USMLE benchmark. In addition to MultiMedQA's information recall and pattern recognition evaluations, the novel assessment adds uncertainty quantification and confidence scoring, thereby revealing LLMs' capacity (or lack thereof) for self-evaluation and knowledge-gap identification.
"This approach provides a more comprehensive evaluation framework that aligns closely with practical demands in clinical settings, ensuring that LLM deployment in healthcare can be both safe and effective. Moreover, it holds implications for AI systems in other high-stakes domains requiring self-awareness and accurate self-assessment."
MetaMedQA was developed using Python 3.12 alongside the Guidance framework. The tool comprises 1,373 multiple-choice questions (MCQs), each with six answer options, only one of which is correct. Questions included fictional scenarios, manually identified malformed questions, and items with altered correct answers to evaluate specific metacognitive skills.
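The benchmark code itself is not reproduced in this article. As a minimal sketch, and using hypothetical names throughout, the snippet below shows how one MetaMedQA-style item (six options, one intended answer, plus a requested 1-to-5 confidence score) might be represented and turned into a prompt. It is not the authors' implementation and does not use the Guidance API.

```python
# Minimal sketch of a MetaMedQA-style item and prompt builder.
# All names, fields, and the example content are illustrative placeholders,
# not the authors' code or the Guidance library's API.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Item:
    question: str            # clinical vignette / question stem
    options: list[str]       # six answer options, labelled A-F
    correct: Optional[str]   # letter of the correct option, or None if deliberately removed

def build_prompt(item: Item) -> str:
    """Render one multiple-choice question with a confidence request."""
    letters = "ABCDEF"
    lines = [item.question, ""]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(item.options)]
    lines += ["", "Answer with one letter and a confidence score from 1 (guess) to 5 (certain)."]
    return "\n".join(lines)

if __name__ == "__main__":
    example = Item(
        question="A 54-year-old presents with sudden-onset chest pain...",  # shortened stem
        options=["Aspirin", "Warfarin", "Observation", "Thrombolysis",
                 "None of the above", "I do not know"],
        correct=None,  # e.g., a modified item whose true answer was removed
    )
    print(build_prompt(example))
```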
Outcomes of interest in LLMs' metacognitive abilities included the following (a rough code sketch of how such metrics might be computed appears below):
- Overall model accuracy
- Impact of confidence
- Missing answer analysis
- Unknown analysis (a measure of LLMs' self-awareness)
- Prompt engineering analysis

Current LLMs evaluated through this novel framework included both proprietary (OpenAI's GPT-4o-2024-05-13 and GPT-3.5-turbo-0125) and open-weight models.
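The article describes these outcomes only informally. As a rough illustration, the following Python sketch shows one plausible way such metrics could be computed from a run's records; the field names, the normalised option strings, and the scoring rules are assumptions made for illustration, not the study's actual definitions.

```python
# Illustrative metric computation for a MetaMedQA-style run.
# The exact scoring rules are defined in the paper; the formulas below are
# plausible approximations based on the descriptions in this article.

def score_run(records):
    """records: list of dicts with keys
       'predicted': normalised text of the option the model chose,
       'gold': normalised text of the intended correct response
               ("none of the above" for items whose true answer was removed,
                "i do not know" for deliberately unanswerable items),
       'confidence': the model's self-reported 1-5 score."""
    n = len(records)
    accuracy = sum(r["predicted"] == r["gold"] for r in records) / n

    def recall(target):
        pool = [r for r in records if r["gold"] == target]
        return (sum(r["predicted"] == target for r in pool) / len(pool)
                if pool else float("nan"))

    missing_recall = recall("none of the above")   # missing-answer analysis
    unknown_recall = recall("i do not know")       # unknown analysis

    # High-confidence accuracy: accuracy restricted to answers rated 5/5.
    confident = [r for r in records if r["confidence"] == 5]
    high_conf_accuracy = (sum(r["predicted"] == r["gold"] for r in confident)
                          / len(confident) if confident else float("nan"))

    return {"accuracy": accuracy, "missing_recall": missing_recall,
            "unknown_recall": unknown_recall, "high_conf_accuracy": high_conf_accuracy}
```

In this framing, unknown recall is simply the fraction of deliberately unanswerable items on which the model declines to answer, which is where the study reports near-universal 0% scores.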
Study Findings
The study identified an association between model size and overall accuracy: larger models (e.g., Qwen2 72B; M = 64.3%) performed better than their smaller counterparts (e.g., Qwen2 7B; M = 43.9%). Similarly, more recently released models substantially outperformed their older counterparts. GPT-4o-2024-05-13 (M = 73.3%) was found to be the most accurate LLM overall.
Confidence analysis (scored from 1.0 to 5.0, where higher values indicate greater self-assessed confidence) revealed that most models consistently assigned high confidence (5) to their answers regardless of whether those answers were accurate. GPT-4o and Qwen2-72B were notable exceptions, showing variability in confidence that aligned with accuracy, a capability critical for clinical safety.
Missing-answer analysis (whether an LLM chooses 'none of the above' when the correct option is absent) revealed that larger and more recent models performed best. Unknown analysis (whether LLMs recognize that they are unequipped to answer a specific question) produced the worst outcomes of all the analyses: all but three models scored 0% in this evaluation. This pervasive inability to identify unanswerable questions underscores a fundamental gap in current LLM capabilities. GPT-4o-2024-05-13 was the best performer, with a score of just 3.7%.
Prompt engineering significantly improved outcomes, with tailored prompts enhancing confidence calibration, accuracy, and unknown recall. Explicitly informing models of potential pitfalls improved high-confidence accuracy and encouraged self-awareness, though these gains were context-dependent.
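The study's actual prompts are not quoted in this article. Purely as an illustration of the kind of pitfall-aware instruction described above, a prompt along the following lines might be used; the wording is hypothetical.

```python
# Illustrative system prompt of the kind described above; the wording is
# hypothetical and is not the prompt actually used in the study.
SYSTEM_PROMPT = (
    "You are answering medical multiple-choice questions. "
    "Some questions may be malformed, may describe fictional conditions, "
    "or may not list the correct answer among the options. "
    "If none of the options is correct, answer 'None of the above'. "
    "If you cannot determine the answer, say so rather than guessing, "
    "and report a confidence score from 1 (guess) to 5 (certain)."
)
```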
Conclusions
The present study devised a novel evaluation metric (MetaMedQA) to assess popular LLMs' metacognitive abilities and self-awareness. Testing 12 proprietary and open-weight models revealed that, while most models achieved expert-level overall accuracy, they struggled with missing-answer and unknown analyses, highlighting their lack of self-awareness. Prompt engineering showed promise but remains an incomplete solution for addressing these challenges. Notably, OpenAI's GPT-4o-2024-05-13 consistently outperformed other currently popular models and demonstrated the highest self-awareness.
These findings emphasize the gap between apparent expertise and actual self-assessment in LLMs, which poses significant risks in clinical contexts. Addressing this will require a focus on both improved benchmarks and fundamental enhancements in model architecture.