Study reveals AI’s critical flaw in medical decision-making

While large language models ace medical exams, their inability to recognize uncertainty highlights a critical flaw that could impact patient safety.

Research: Large Language Models lack essential metacognition for reliable medical reasoning. Image Credit: NicoElNino / Shutterstock

In a recent study published in the journal Nature Communications, researchers evaluated the metacognitive abilities of popular large language models (LLMs) to assess their suitability for deployment in clinical settings. They developed a novel benchmarking tool named "MetaMedQA," an extension and enhancement of the MedQA-USMLE benchmark, to evaluate LLM performance on missing-answer recall, confidence-based accuracy, and unknown recall using multiple-choice medical questions.

Study findings revealed that, despite scoring high on multiple-choice questions, LLMs were incapable of recognizing the limits of their knowledge base, providing confident answers even when none of the options offered was factually correct. However, exceptions like GPT-4o exhibited relatively better self-awareness and confidence calibration, highlighting the variability in model performance. These findings reveal a disconnect between LLMs' perception of their capabilities and their actual medical abilities, which could prove disastrous in clinical settings. The study thus identifies room for growth in LLM development, calling for enhanced metacognition before LLMs can be reliably deployed in clinical decision support systems.

Background

Large language models (LLMs) are artificial intelligence (AI) models that use deep learning techniques to understand and generate human language. Recent advances in LLMs have resulted in their extensive use across various industries, including defense and healthcare. Notably, several LLMs, including OpenAI's popular ChatGPT models, have been demonstrated to achieve expert-level performance in official medical board examinations across a wide range of medical specialties (pediatrics, ophthalmology, radiology, oncology, and plastic surgery).

While several evaluation methodologies (such as the current gold standard, "MultiMedQA") have been developed to assess LLM performance in medical applications, they share a common drawback: they evaluate only information recall and pattern recognition, giving no weight to metacognitive abilities. Recent studies have highlighted these limitations by revealing deficiencies in model safety, particularly LLMs' potential to generate misleading information when accurate information is lacking.

About the Study

The present study aimed to develop a novel evaluation of the metacognitive capabilities of current and future LLMs. It developed and tested a framework titled "MetaMedQA" by incorporating fictional, malformed, and modified medical questions into the existing MedQA-USMLE benchmark. In addition to MultiMedQA's information recall and pattern recognition evaluations, the novel assessment measures uncertainty quantification and confidence scoring, thereby revealing LLMs' capacity (or lack thereof) for self-evaluation and knowledge-gap identification.

"This approach provides a more comprehensive evaluation framework that aligns closely with practical demands in clinical settings, ensuring that LLM deployment in healthcare can be both safe and effective. Moreover, it holds implications for AI systems in other high-stakes domains requiring self-awareness and accurate self-assessment."

MetaMedQA was developed using Python 3.12 alongside the Guidance library. The tool comprises 1,373 multiple-choice questions (MCQs), each offering six answer choices, only one of which is correct. Questions included fictional scenarios, manually identified malformed questions, and altered correct answers to evaluate specific metacognitive skills.
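As a rough illustration of the question format, the following minimal Python sketch shows how a six-option MCQ with a 1-to-5 confidence request might be posed to a model. The schema and prompt wording are assumptions made for this article; they are not the authors' actual MetaMedQA code.

```python
# Illustrative sketch only: the question schema and prompt wording below are
# assumptions for this article, not the authors' actual MetaMedQA implementation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MCQ:
    stem: str                  # clinical vignette / question text
    options: dict[str, str]    # six lettered choices, "A" through "F"
    answer: Optional[str]      # correct letter, or None if deliberately unanswerable

def build_prompt(q: MCQ) -> str:
    """Format one question so the model must pick a letter and rate its own confidence."""
    choices = "\n".join(f"{letter}. {text}" for letter, text in q.options.items())
    return (
        f"{q.stem}\n\n{choices}\n\n"
        "Answer with a single letter (A-F) and a confidence score from "
        "1 (complete guess) to 5 (certain), e.g. 'C, 4'."
    )
```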

Outcomes of interest in LLMs' metacognitive abilities included:

  1. Overall model accuracy
  2. Impact of confidence
  3. Missing answer analysis
  4. Unknown analysis (a measure of LLMs' self-awareness), and
  5. Prompt engineering analysis (a sketch of how such outcomes might be scored follows this list).

The LLMs evaluated through this novel framework included both proprietary models (OpenAI's GPT-4o-2024-05-13 and GPT-3.5-turbo-0125) and open-weight models.
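To make the first four outcomes concrete, here is a minimal scoring sketch. The record fields (choice, answer, answer_removed, none_option, unanswerable, declined, confidence) are hypothetical names chosen for illustration, not the study's actual data structures.

```python
# Hypothetical scoring sketch: assumes each record stores the model's chosen
# letter, its 1-5 confidence, and ground-truth flags. Field names are
# illustrative, not taken from the study's code.

def score(records: list[dict]) -> dict[str, float]:
    n = len(records)
    accuracy = sum(r["choice"] == r["answer"] for r in records) / n

    # Missing-answer recall: on questions whose correct answer was removed,
    # how often did the model pick the 'none of the above'-style option?
    missing = [r for r in records if r["answer_removed"]]
    missing_recall = (
        sum(r["choice"] == r["none_option"] for r in missing) / len(missing)
        if missing else 0.0
    )

    # Unknown recall: on deliberately unanswerable (e.g., fictional) questions,
    # how often did the model explicitly decline to answer?
    unknown = [r for r in records if r["unanswerable"]]
    unknown_recall = (
        sum(r["declined"] for r in unknown) / len(unknown) if unknown else 0.0
    )

    # Confidence-based accuracy: accuracy restricted to high-confidence answers.
    confident = [r for r in records if r["confidence"] >= 4]
    high_conf_accuracy = (
        sum(r["choice"] == r["answer"] for r in confident) / len(confident)
        if confident else 0.0
    )

    return {
        "accuracy": accuracy,
        "missing_answer_recall": missing_recall,
        "unknown_recall": unknown_recall,
        "high_confidence_accuracy": high_conf_accuracy,
    }
```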

Study Findings

The study identified an association between model size and overall accuracy: larger models (e.g., Qwen2 72B; M = 64.3%) performed better than their smaller counterparts (e.g., Qwen2 7B; M = 43.9%). Similarly, more recently released models substantially outperformed their older counterparts. GPT-4o-2024-05-13 (M = 73.3%) was found to be the most accurate LLM currently available overall.

The confidence analysis (a 1-to-5 score, with higher values indicating greater self-assessed confidence in answers) revealed that most models consistently assigned the highest confidence value (5) to their answers, regardless of whether those answers were correct. GPT-4o and Qwen2-72B were notable exceptions, showing variability in confidence that aligned with accuracy, a critical capability for clinical safety.
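To illustrate what "confidence aligned with accuracy" means in practice, a simple calibration check might group answers by self-reported confidence and compare accuracy within each level. This sketch reuses the hypothetical record fields assumed above.

```python
# Hypothetical calibration check, reusing the illustrative record fields above:
# group answers by self-reported confidence and compute accuracy per level.
from collections import defaultdict

def calibration_table(records: list[dict]) -> dict[int, float]:
    """Accuracy per confidence level; in a well-calibrated model it rises with confidence."""
    buckets: defaultdict[int, list[bool]] = defaultdict(list)
    for r in records:
        buckets[r["confidence"]].append(r["choice"] == r["answer"])
    return {level: sum(hits) / len(hits) for level, hits in sorted(buckets.items())}
```

A model that always reports confidence 5 would produce a single bucket, revealing no calibration at all; variability across buckets that tracks accuracy is the behavior the study credits to GPT-4o and Qwen2-72B.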

Missing-answer analysis (whether an LLM chooses 'none of the above' when the correct answer is absent from an MCQ) revealed that larger and more recent models performed best. Unknown analysis (whether LLMs identify that they are unequipped to answer a specific question) produced the worst outcomes of all analyses: all but three models scored 0% in this evaluation. This pervasive inability to identify unanswerable questions underscores a fundamental gap in current LLM capabilities. GPT-4o-2024-05-13 was the best-performing model, with a score of only 3.7%.

Prompt engineering significantly improved outcomes, with tailored prompts enhancing confidence calibration, accuracy, and unknown recall. Explicitly informing models of potential pitfalls improved high-confidence accuracy and prompted self-awareness, though these gains were context-dependent.
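As an illustration of what such tailored prompting might look like, the following sketch contrasts a baseline system prompt with one that explicitly warns the model about potential pitfalls. The wording is hypothetical and paraphrases the idea described above; it is not the exact prompt the authors used.

```python
# Illustrative only: these prompts paraphrase the idea of 'informing models of
# potential pitfalls'; they are not the study's actual prompts.
BASELINE_SYSTEM = "You are a medical expert answering multiple-choice questions."

PITFALL_AWARE_SYSTEM = (
    "You are a medical expert answering multiple-choice questions. "
    "Be aware that some questions may be malformed or fictional, and some may "
    "have no correct option. If no option is correct, say so explicitly; if you "
    "lack the knowledge to answer, state that, and set your 1-5 confidence "
    "score accordingly."
)
```

Under this framing, gains in high-confidence accuracy would come from the model withholding its top confidence score on questions where the warning applies.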

Conclusions

The present study devised a novel evaluation benchmark (MetaMedQA) to assess popular LLMs' metacognitive abilities and self-awareness. Testing 12 proprietary and open-weight models revealed that, while most models achieve expert-level overall accuracy, they struggle with missing-answer and unknown analyses, highlighting their lack of self-awareness. Prompt engineering showed promise but remains an incomplete solution to these challenges. Notably, OpenAI's GPT-4o-2024-05-13 consistently outperformed other currently popular models and exhibited the highest self-awareness.

These findings emphasize the gap between apparent expertise and actual self-assessment in LLMs, which poses significant risks in clinical contexts. Addressing this will require a focus on both improved benchmarks and fundamental enhancements in model architecture.

Journal reference:
  • Griot, M., Hemptinne, C., Vanderdonckt, J., et al. (2025). Large Language Models lack essential metacognition for reliable medical reasoning. Nature Communications, 16, 642. DOI: 10.1038/s41467-024-55628-6, https://www.nature.com/articles/s41467-024-55628-6

Written by

Hugo Francisco de Souza

Hugo Francisco de Souza is a scientific writer based in Bangalore, Karnataka, India. His academic passions lie in biogeography, evolutionary biology, and herpetology. He is currently pursuing his Ph.D. from the Centre for Ecological Sciences, Indian Institute of Science, where he studies the origins, dispersal, and speciation of wetland-associated snakes. Hugo has received, amongst others, the DST-INSPIRE fellowship for his doctoral research and the Gold Medal from Pondicherry University for academic excellence during his Masters. His research has been published in high-impact peer-reviewed journals, including PLOS Neglected Tropical Diseases and Systematic Biology. When not working or writing, Hugo can be found consuming copious amounts of anime and manga, composing and making music with his bass guitar, shredding trails on his MTB, playing video games (he prefers the term ‘gaming’), or tinkering with all things tech.
