ChatGPT-4 passes UK medical licensing exam but falters in real-world clinical decision-making, study reveals

While ChatGPT-4 excels at multiple-choice medical exams, new research reveals its weaknesses in complex clinical decision-making, raising important questions about the future of AI-powered healthcare.

Study: Assessing ChatGPT 4.0’s Capabilities in the United Kingdom Medical Licensing Examination (UKMLA): A Robust Categorical Analysis. Image Credit: Collagery / Shutterstock

In a recent study published in the journal Scientific Reports, researchers evaluated ChatGPT-4’s capabilities on the United Kingdom Medical Licensing Assessment (UKMLA), highlighting both strengths and limitations across question formats and clinical domains.

Background

Artificial intelligence (AI) continues to reshape healthcare and education. With the UKMLA soon becoming a standardized requirement for new doctors in the UK, determining whether AI models like ChatGPT-4 can meet clinical benchmarks is increasingly important. While AI shows promise, questions remain about its ability to replicate human reasoning, empathy, and contextual understanding in real-world care.

Study Overview

The researchers tested ChatGPT-4 on 191 multiple-choice questions from the Medical Schools Council’s mock UKMLA exam. The questions spanned 24 clinical areas and were split across two 100-question papers. Nine image-based questions were excluded due to ChatGPT’s inability to interpret images, which the authors note as a limitation.

Each question was tested with and without multiple-choice options. Questions were further categorized by reasoning complexity (single-step vs. multi-step) and clinical focus (diagnosis, management, pharmacology, etc.). Responses were labeled as accurate, indeterminate, or incorrect. Statistical analysis included chi-squared and t-tests.
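To make the protocol concrete, the sketch below shows how such an evaluation could be tallied and tested. This is a minimal illustration under stated assumptions, not the authors' code: the record format and all counts are hypothetical placeholders chosen only to sit near the reported accuracy range, and SciPy's chi2_contingency stands in for whatever statistical software the authors actually used.

```python
# Minimal sketch of the evaluation protocol -- NOT the study's actual
# pipeline. Record format and all counts are hypothetical placeholders.
from collections import Counter
from scipy.stats import chi2_contingency

# One record per question: (clinical_focus, reasoning, outcome), where
# outcome is "accurate", "indeterminate", or "incorrect".
responses = [
    ("diagnosis", "single-step", "accurate"),
    ("management", "multi-step", "indeterminate"),
    ("pharmacology", "multi-step", "incorrect"),
    # ... 191 records in the real study ...
]

# Tally outcomes per category, mirroring the categorical analysis.
by_focus = Counter((focus, outcome) for focus, _, outcome in responses)
print(by_focus)

# Chi-squared test of accuracy with vs. without answer options.
# Rows: answer format; columns: [accurate, not accurate]. Placeholder
# counts only, picked to land near the reported 86-90% (with options)
# and 62-75% (without options) accuracy over 191 questions.
table = [
    [168, 23],   # with options (placeholder counts)
    [130, 61],   # without options (placeholder counts)
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
```

The study's headline with-versus-without-options difference (p = 0.007, reported below) is a comparison of this kind, applied per paper and per category.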

Key Findings

  • Overall Accuracy: ChatGPT-4 achieved 86.3% and 89.6% accuracy with multiple-choice options on the two papers. Without options, accuracy fell to 61.5% and 74.7%, respectively (p = 0.007).
  • Reasoning Complexity: Single-step questions were answered more accurately (90% with options, 73.1% without) than multi-step questions (83.6% with options, 57.4% without), a statistically significant difference (p = 0.025).
  • Clinical Competency: Diagnostic questions had the highest accuracy, at 91.2% with options and 84.2% without. Management questions performed poorly without options (51.2% accuracy), with a notable rate of indeterminate and incorrect responses.
  • Pharmacology Weaknesses: Pharmacology had the highest proportion of indeterminate answers, especially without options, highlighting the model's limitations in this domain.
  • Distractor Confusion: ChatGPT performed better without options on eight questions, suggesting that misleading distractors in multiple-choice formats can confuse the model.

Discussion

ChatGPT-4 demonstrated a broad knowledge base, especially in diagnostic tasks, and performed at or above the level expected of medical graduates in structured assessments. However, it struggled with contextual clinical reasoning, especially in open-ended or multi-step management scenarios. This suggests the model may support early-stage clinical assessments but lacks the nuance required for autonomous decision-making.

Limitations include a lack of training on UK-specific clinical guidelines, which may have influenced performance on specific UKMLA questions. Furthermore, "hallucinations" (fluent but incorrect outputs) pose a risk in clinical use. Ethical concerns include potential depersonalization of care and clinician deskilling due to overreliance on AI.

Conclusion

ChatGPT-4 performs well on structured medical licensing questions, particularly those centered on diagnosis. However, accuracy drops significantly in open-ended and multi-step clinical reasoning, especially in management and pharmacology. While LLMs show promise for supporting education and early-stage clinical support, their current limitations underscore the need for cautious integration, further training on clinical datasets, and ethical safeguards.

Journal reference:
  • Casals-Farre, O., Baskaran, R., Singh, A. et al. Assessing ChatGPT 4.0’s Capabilities in the United Kingdom Medical Licensing Examination (UKMLA): A Robust Categorical Analysis. Scientific Reports (2025). DOI: 10.1038/s41598-025-97327-2

Written by

Vijay Kumar Malesu

Vijay holds a Ph.D. in Biotechnology and has a deep passion for microbiology. His academic work has given him a thorough grounding in the world of microorganisms, with expertise spanning microbial genetics, microbial physiology, and microbial ecology. Vijay has six years of scientific research experience at renowned institutes, including the Indian Council of Agricultural Research and KIIT University, where he worked on diverse projects in microbiology, biopolymers, and drug delivery. These experiences have given him a comprehensive understanding of the field and the ability to tackle complex research challenges.

