ChatGPT-4 passes UK medical licensing exam but falters in real-world clinical decision-making, study reveals

While ChatGPT-4 excels at multiple-choice medical exams, new research reveals its weaknesses in complex clinical decision-making, raising important questions about the future of AI-powered healthcare.

Study: Assessing ChatGPT 4.0’s Capabilities in the United Kingdom Medical Licensing Examination (UKMLA): A Robust Categorical Analysis. Image Credit: Collagery / Shutterstock

In a recent study published in the journal Scientific Reports, researchers evaluated ChatGPT-4’s capabilities on the United Kingdom Medical Licensing Assessment (UKMLA), highlighting both strengths and limitations across question formats and clinical domains.

Background

Artificial intelligence (AI) continues to reshape healthcare and education. With the UKMLA soon becoming a standardized requirement for new doctors in the UK, determining whether AI models like ChatGPT-4 can meet clinical benchmarks is increasingly important. While AI shows promise, questions remain about its ability to replicate human reasoning, empathy, and contextual understanding in real-world care.

Study Overview

The researchers tested ChatGPT-4 on 191 multiple-choice questions from the Medical Schools Council’s mock UKMLA exam. The questions spanned 24 clinical areas and were split across two 100-question papers. Nine image-based questions were excluded due to ChatGPT’s inability to interpret images, which the authors note as a limitation.

Each question was tested with and without multiple-choice options. Questions were further categorized by reasoning complexity (single-step vs. multi-step) and clinical focus (diagnosis, management, pharmacology, etc.). Responses were labeled as accurate, indeterminate, or incorrect. Statistical analysis included chi-squared and t-tests.
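To make the protocol concrete, the sketch below shows how such an evaluation could be tallied and tested. This is a minimal illustration under stated assumptions, not the authors' code: the record format and all counts are hypothetical placeholders chosen only to sit near the reported accuracy range, and SciPy's chi2_contingency stands in for whatever statistical software the authors actually used.

```python
# Minimal sketch of the evaluation protocol -- NOT the study's actual
# pipeline. Record format and all counts are hypothetical placeholders.
from collections import Counter
from scipy.stats import chi2_contingency

# One record per question: (clinical_focus, reasoning, outcome), where
# outcome is "accurate", "indeterminate", or "incorrect".
responses = [
    ("diagnosis", "single-step", "accurate"),
    ("management", "multi-step", "indeterminate"),
    ("pharmacology", "multi-step", "incorrect"),
    # ... 191 records in the real study ...
]

# Tally outcomes per category, mirroring the categorical analysis.
by_focus = Counter((focus, outcome) for focus, _, outcome in responses)
print(by_focus)

# Chi-squared test of accuracy with vs. without answer options.
# Rows: answer format; columns: [accurate, not accurate]. Placeholder
# counts only, picked to land near the reported 86-90% (with options)
# and 62-75% (without options) accuracy over 191 questions.
table = [
    [168, 23],   # with options (placeholder counts)
    [130, 61],   # without options (placeholder counts)
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
```

The study's headline with-versus-without-options difference (p = 0.007, reported below) is a comparison of this kind, applied per paper and per category.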

Key Findings

  • Overall Accuracy: ChatGPT-4 achieved 86.3% and 89.6% accuracy with multiple-choice options on the two papers. Without options, accuracy fell to 61.5% and 74.7%, respectively (p = 0.007).
  • Reasoning Complexity: Single-step questions were answered more accurately (90% with options, 73.1% without) than multi-step questions (83.6% with options, 57.4% without), a statistically significant difference (p = 0.025).
  • Clinical Competency: Diagnostic questions had the highest accuracy, at 91.2% with options and 84.2% without. Management questions performed poorly without options (51.2% accuracy), with a notable rate of indeterminate and incorrect responses.
  • Pharmacology Weaknesses: Pharmacology had the highest proportion of indeterminate answers, especially without options, highlighting the model's limitations in this domain.
  • Distractor Confusion: ChatGPT performed better without options on eight questions, suggesting that misleading distractors in multiple-choice formats can confuse the model.

Discussion

ChatGPT-4 demonstrated a broad knowledge base, especially in diagnostic tasks, and performed at or above the level expected of medical graduates in structured assessments. However, it struggled with contextual clinical reasoning, especially in open-ended or multi-step management scenarios. This suggests the model may support early-stage clinical assessments but lacks the nuance required for autonomous decision-making.

Limitations include a lack of training on UK-specific clinical guidelines, which may have influenced performance on specific UKMLA questions. Furthermore, "hallucinations" (fluent but incorrect outputs) pose a risk in clinical use. Ethical concerns include potential depersonalization of care and clinician deskilling due to overreliance on AI.

Conclusion

ChatGPT-4 performs well on structured medical licensing questions, particularly those centered on diagnosis. However, accuracy drops significantly in open-ended and multi-step clinical reasoning, especially in management and pharmacology. While LLMs show promise for supporting education and early-stage clinical support, their current limitations underscore the need for cautious integration, further training on clinical datasets, and ethical safeguards.

Journal reference:
  • Casals-Farre, O., Baskaran, R., Singh, A. et al. Assessing ChatGPT 4.0’s Capabilities in the United Kingdom Medical Licensing Examination (UKMLA): A Robust Categorical Analysis. Scientific Reports (2025). DOI: 10.1038/s41598-025-97327-2

Written by

Vijay Kumar Malesu

Vijay holds a Ph.D. in Biotechnology and has a deep passion for microbiology. His academic work has given him a thorough grounding in the world of microorganisms, with expertise spanning microbial genetics, microbial physiology, and microbial ecology. Vijay has six years of scientific research experience at renowned institutes, including the Indian Council of Agricultural Research and KIIT University, where he worked on diverse projects in microbiology, biopolymers, and drug delivery. These experiences have given him a comprehensive understanding of the field and the ability to tackle complex research challenges.

