While large language models ace medical exams, their inability to recognize uncertainty highlights a critical flaw that could impact patient safety.
Research: Large Language Models lack essential metacognition for reliable medical reasoning. Image Credit: NicoElNino / Shutterstock
In a recent study published in the journal Nature Communications, researchers evaluated the metacognitive abilities of popular large language models (LLMs) to assess their suitability for deployment in clinical settings. They developed a novel benchmarking tool named "MetaMedQA" as a modification and enhancement of the MedQA-USMLE benchmark to evaluate LLM performance across missing answer recall, confidence-based accuracy, and unknown recall through multiple-choice medical questions.
Study findings revealed that despite scoring high on multiple-choice questions, LLMs were incapable of recognizing the limitations of their knowledge base, providing confident answers even when none of the options offered was factually correct. Exceptions such as GPT-4o exhibited relatively better self-awareness and confidence calibration, underscoring the variability across models. These findings reveal a disconnect between LLMs' perception of their capabilities and their actual medical abilities, which could prove disastrous in clinical settings. The study therefore identifies room for growth in LLM development, calling for enhanced metacognition to be incorporated before LLMs can be reliably deployed in clinical decision support systems.
Background
Large language models (LLMs) are artificial intelligence (AI) models that use deep learning techniques to understand and generate human language. Recent advances in LLMs have resulted in their extensive use across various industries, including defense and healthcare. Notably, several LLMs, including OpenAI's popular ChatGPT models, have been demonstrated to achieve expert-level performance in official medical board examinations across a wide range of medical specialties (pediatrics, ophthalmology, radiology, oncology, and plastic surgery).
While several evaluation methodologies (such as the current gold standard, "MultiMedQA") have been developed to assess LLM performance in medical applications, they share a common drawback: they are limited to evaluating model information recall and pattern recognition, giving no weight to metacognitive abilities. Recent studies have highlighted these limitations by revealing deficiencies in model safety, particularly LLMs' tendency to generate misleading information when accurate information is lacking.
About the Study
The present study aimed to develop a novel evaluation of the metacognitive capabilities of current and future LLMs. It developed and tested a framework titled "MetaMedQA" by incorporating fictional, malformed, and modified medical questions into the existing MedQA-USMLE benchmark. In addition to MultiMedQA's information recall and pattern recognition evaluations, the novel assessment adds uncertainty quantification and confidence scoring, thereby revealing LLMs' capacity (or lack thereof) for self-evaluation and knowledge-gap identification.
"This approach provides a more comprehensive evaluation framework that aligns closely with practical demands in clinical settings, ensuring that LLM deployment in healthcare can be both safe and effective. Moreover, it holds implications for AI systems in other high-stakes domains requiring self-awareness and accurate self-assessment."
MetaMedQA was developed using Python 3.12 alongside the Guidance framework. The tool comprises 1,373 multiple-choice questions (MCQs), each with six answer options, only one of which is correct. Questions included fictional scenarios, manually identified malformed questions, and items with altered correct answers to evaluate specific metacognitive skills.
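The benchmark code itself is not reproduced in this article. As a minimal sketch, and using hypothetical names throughout, the snippet below shows how one MetaMedQA-style item (six options, one intended answer, plus a requested 1-to-5 confidence score) might be represented and turned into a prompt. It is not the authors' implementation and does not use the Guidance API.

```python
# Minimal sketch of a MetaMedQA-style item and prompt builder.
# All names, fields, and the example content are illustrative placeholders,
# not the authors' code or the Guidance library's API.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Item:
    question: str            # clinical vignette / question stem
    options: list[str]       # six answer options, labelled A-F
    correct: Optional[str]   # letter of the correct option, or None if deliberately removed

def build_prompt(item: Item) -> str:
    """Render one multiple-choice question with a confidence request."""
    letters = "ABCDEF"
    lines = [item.question, ""]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(item.options)]
    lines += ["", "Answer with one letter and a confidence score from 1 (guess) to 5 (certain)."]
    return "\n".join(lines)

if __name__ == "__main__":
    example = Item(
        question="A 54-year-old presents with sudden-onset chest pain...",  # shortened stem
        options=["Aspirin", "Warfarin", "Observation", "Thrombolysis",
                 "None of the above", "I do not know"],
        correct=None,  # e.g., a modified item whose true answer was removed
    )
    print(build_prompt(example))
```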
Outcomes of interest in LLMs' metacognitive abilities included the following (a rough code sketch of how such metrics might be computed appears below):
- Overall model accuracy
- Impact of confidence
- Missing answer analysis
- Unknown analysis (a measure of LLMs' self-awareness)
- Prompt engineering analysis

Current LLMs evaluated through this novel framework included both proprietary (OpenAI's GPT-4o-2024-05-13 and GPT-3.5-turbo-0125) and open-weight models.
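The article describes these outcomes only informally. As a rough illustration, the following Python sketch shows one plausible way such metrics could be computed from a run's records; the field names, the normalised option strings, and the scoring rules are assumptions made for illustration, not the study's actual definitions.

```python
# Illustrative metric computation for a MetaMedQA-style run.
# The exact scoring rules are defined in the paper; the formulas below are
# plausible approximations based on the descriptions in this article.

def score_run(records):
    """records: list of dicts with keys
       'predicted': normalised text of the option the model chose,
       'gold': normalised text of the intended correct response
               ("none of the above" for items whose true answer was removed,
                "i do not know" for deliberately unanswerable items),
       'confidence': the model's self-reported 1-5 score."""
    n = len(records)
    accuracy = sum(r["predicted"] == r["gold"] for r in records) / n

    def recall(target):
        pool = [r for r in records if r["gold"] == target]
        return (sum(r["predicted"] == target for r in pool) / len(pool)
                if pool else float("nan"))

    missing_recall = recall("none of the above")   # missing-answer analysis
    unknown_recall = recall("i do not know")       # unknown analysis

    # High-confidence accuracy: accuracy restricted to answers rated 5/5.
    confident = [r for r in records if r["confidence"] == 5]
    high_conf_accuracy = (sum(r["predicted"] == r["gold"] for r in confident)
                          / len(confident) if confident else float("nan"))

    return {"accuracy": accuracy, "missing_recall": missing_recall,
            "unknown_recall": unknown_recall, "high_conf_accuracy": high_conf_accuracy}
```

In this framing, unknown recall is simply the fraction of deliberately unanswerable items on which the model declines to answer, which is where the study reports near-universal 0% scores.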
Study Findings
The study identified an association between model size and overall accuracy: larger models (e.g., Qwen2 72B; M = 64.3%) performed better than their smaller counterparts (e.g., Qwen2 7B; M = 43.9%). Similarly, more recently released models substantially outperformed their older counterparts. GPT-4o-2024-05-13 (M = 73.3%) was found to be the most accurate LLM overall.
Confidence analysis (scored from 1.0 to 5.0, where higher values indicate greater self-assessed confidence) revealed that most models consistently assigned high confidence (5) to their answers regardless of whether those answers were accurate. GPT-4o and Qwen2-72B were notable exceptions, showing variability in confidence that aligned with accuracy, a capability critical for clinical safety.
Missing-answer analysis (whether an LLM chooses 'none of the above' when the correct option is absent) revealed that larger and more recent models performed best. Unknown analysis (whether LLMs recognize that they are unequipped to answer a specific question) produced the worst outcomes of all the analyses: all but three models scored 0% in this evaluation. This pervasive inability to identify unanswerable questions underscores a fundamental gap in current LLM capabilities. GPT-4o-2024-05-13 was the best performer, with a score of just 3.7%.
Prompt engineering significantly improved outcomes, with tailored prompts enhancing confidence calibration, accuracy, and unknown recall. Explicitly informing models of potential pitfalls improved high-confidence accuracy and encouraged self-awareness, though these gains were context-dependent.
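The study's actual prompts are not quoted in this article. Purely as an illustration of the kind of pitfall-aware instruction described above, a prompt along the following lines might be used; the wording is hypothetical.

```python
# Illustrative system prompt of the kind described above; the wording is
# hypothetical and is not the prompt actually used in the study.
SYSTEM_PROMPT = (
    "You are answering medical multiple-choice questions. "
    "Some questions may be malformed, may describe fictional conditions, "
    "or may not list the correct answer among the options. "
    "If none of the options is correct, answer 'None of the above'. "
    "If you cannot determine the answer, say so rather than guessing, "
    "and report a confidence score from 1 (guess) to 5 (certain)."
)
```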
Conclusions
The present study devised a novel evaluation metric (MetaMedQA) to assess popular LLMs' metacognitive abilities and self-awareness. Testing 12 proprietary and open-weight models revealed that, while most models achieved expert-level overall accuracy, they struggled with missing-answer and unknown analyses, highlighting their lack of self-awareness. Prompt engineering showed promise but remains an incomplete solution for addressing these challenges. Notably, OpenAI's GPT-4o-2024-05-13 consistently outperformed other currently popular models and demonstrated the highest self-awareness.
These findings emphasize the gap between apparent expertise and actual self-assessment in LLMs, which poses significant risks in clinical contexts. Addressing this will require a focus on both improved benchmarks and fundamental enhancements in model architecture.