AI outperforms peers in medical oncology quiz, yet some mistakes could be harmful

In a recent study published in JAMA Network Open, researchers evaluated the accuracy and safety of large language models (LLMs) in answering medical oncology examination questions.

Study: Performance of Large Language Models on Medical Oncology Examination Questions. Image Credit: BOY ANTHONY/Shutterstock.com

Background 

LLMs have the potential to revolutionize healthcare by assisting clinicians with tasks and interacting with patients. These models, trained on vast text corpora, can be fine-tuned to answer questions with human-like responses.

LLMs encode extensive medical knowledge and have shown the ability to pass the United States (US) Medical Licensing Examination, demonstrating comprehension and reasoning. However, their performance varies across medical subspecialties.

With rapidly evolving knowledge and high publication volume, medical oncology presents a unique challenge.

Further research is needed to ensure that LLMs can reliably and safely apply their medical knowledge to dynamic and specialized fields like medical oncology, improving clinician support and patient care.

About the study 

The present study, conducted from May 28 to October 11, 2023, followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines and did not require ethics board approval or informed consent due to the lack of human participants.

The American Society of Clinical Oncology (ASCO) provided 52 multiple-choice questions from its publicly accessible question bank, each with one correct answer and explanatory references. Similarly, the European Society for Medical Oncology (ESMO) Examination Trial Questions from 2021 and 2022 provided 75 questions after excluding image-based ones, with answers developed by oncologists.

To guard against the possibility that the models had encountered published questions during training, oncologists also wrote 20 original questions in the same multiple-choice format.

Chat Generative Pre-trained Transformer (ChatGPT)-3.5 and ChatGPT-4, labeled consistently as proprietary LLM 1 and proprietary LLM 2 for comparison, were used to answer these questions. Six open-source LLMs, including Biomedical Mistral-7B Domain Adapted for Retrieval and Evaluation (BioMistral-7B DARE), tailored for biomedical domains, were also evaluated.

Responses were recorded, and explanations containing errors were classified on a four-level error scale. Statistical analysis, conducted in R version 4.3.0, tested accuracy, error distribution, and agreement between oncologists.
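Agreement between two raters on an ordinal scale, such as the study's four-level error scale, is typically summarized with a weighted κ statistic. A minimal sketch using scikit-learn; the ratings below are entirely hypothetical, since the paper's rating data are not reproduced here:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings: two oncologists scoring the same 8 LLM answers on
# a four-level error scale (0 = no error ... 3 = severe error).
# These values are illustrative, not data from the study.
rater_a = [0, 0, 1, 2, 3, 1, 0, 2]
rater_b = [0, 1, 1, 2, 3, 0, 0, 2]

# Linear weighting penalizes larger disagreements more heavily than
# near-misses, which suits an ordered severity scale.
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(f"weighted kappa: {kappa:.2f}")
```

A κ near 1 indicates strong agreement beyond chance; values near 0 indicate agreement no better than chance.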

Statistical tests included the binomial test, McNemar test, Fisher exact test, weighted κ, and Wilcoxon rank sum test, with a two-sided P value below .05 indicating statistical significance.
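As an illustration of the study's headline comparison, a binomial test can check whether the best model's reported 125/147 accuracy beats random guessing. A sketch using SciPy; the chance rate of 0.25 (i.e., four answer options per question) is an assumption for illustration, not a figure from the paper:

```python
from scipy.stats import binomtest

# Reported result: 125 of 147 questions answered correctly.
# Assumed chance rate: 0.25 (four answer options) -- hypothetical.
result = binomtest(k=125, n=147, p=0.25, alternative="greater")

print(f"observed accuracy: {125 / 147:.3f}")
print(f"one-sided binomial P value: {result.pvalue:.2e}")
```

With an observed accuracy of 85.0% against a 25% chance rate, the P value is far below any conventional significance threshold.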

Study results 

The evaluation of LLMs across 147 examination questions included 52 from ASCO, 75 from ESMO, and 20 original questions. Hematology was the most common category (15.0%), but the questions spanned various topics.

ESMO questions were more general, addressing mechanisms and toxic effects of systemic therapies. Notably, 27.9% of questions required knowledge from evidence published from 2018 onwards. LLMs provided prose answers to all questions, with proprietary LLM 2 needing prompts for specific answers in 22.4% of cases.

A selected ASCO question involved a 62-year-old woman with metastatic breast cancer presenting with symptoms of a pulmonary embolism. Proprietary LLM 2 correctly identified the best treatment as low molecular weight heparin or a direct oral anticoagulant, considering the patient's cancer and travel history.

Another ASCO question described a 61-year-old woman with metastatic colon cancer experiencing neuropathy from her chemotherapy regimen. The LLM recommended switching to targeted therapy with encorafenib and cetuximab, given the tumor's B-Raf proto-oncogene, serine/threonine kinase (BRAF) V600E mutation and the neuropathy caused by her current regimen.

Proprietary LLM 2 demonstrated the highest accuracy, correctly answering 85.0% of questions (125 out of 147), significantly outperforming random answering and other models. The performance was consistent across ASCO (80.8%), ESMO (88.0%), and original questions (85.0%).

When given a second attempt, proprietary LLM 2 corrected 54.5% of its initially incorrect answers. Proprietary LLM 1 and the best open-source LLM, Mixtral-8x7B-v0.1 (a mixture-of-experts model), had lower accuracies of 60.5% and 59.2%, respectively. BioMistral-7B DARE, despite its biomedical tuning, reached only 33.6%.
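The head-to-head model comparisons above rely on the McNemar test, which considers only the questions on which two models disagree. A minimal sketch of an exact (binomial) version; the discordant counts below are invented for illustration, as the article does not report them:

```python
from scipy.stats import binomtest

def mcnemar_exact(b: int, c: int) -> float:
    """Exact McNemar test on discordant pairs.

    b: questions only model A answered correctly
    c: questions only model B answered correctly
    Under the null of equal accuracy, b ~ Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    return binomtest(k=b, n=n, p=0.5, alternative="two-sided").pvalue

# Hypothetical counts: model A uniquely correct on 40 questions,
# model B uniquely correct on 5.
p = mcnemar_exact(40, 5)
print(f"McNemar exact P value: {p:.2e}")
```

Questions both models get right (or both get wrong) carry no information about which model is better, which is why only discordant pairs enter the test.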

Qualitative evaluation of the prose answers by clinicians showed that proprietary LLM 2 provided correct and error-free answers for 83.7% of the questions.

Incorrect answers were more frequent when questions required knowledge of recent publications, with errors in knowledge recall, reasoning, and reading comprehension identified.

Clinicians classified 63.6% of errors as having a medium likelihood of causing harm, with a high likelihood in 18.2% of cases. No hallucinations were observed in the LLM responses.

Conclusions 

In this study, LLMs performed exceptionally well on medical oncology exam-style questions intended for trainees nearing clinical practice. Proprietary LLM 2 correctly answered 85.0% of multiple-choice questions and provided accurate explanations, showcasing its substantial medical oncology knowledge and reasoning abilities.

However, incorrect answers, particularly those involving recent publications, raised significant safety concerns. Proprietary LLM 2 outperformed its predecessor, proprietary LLM 1, and demonstrated superior accuracy compared to other LLMs.

The study revealed that while LLMs' capabilities are improving, errors in information retrieval, especially with newer evidence, pose risks. Enhanced training and frequent updates are essential for maintaining up-to-date medical oncology knowledge in LLMs.


Written by

Vijay Kumar Malesu


Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Kumar Malesu, Vijay. (2024, June 21). AI outperforms peers in medical oncology quiz, yet some mistakes could be harmful. News-Medical. Retrieved on November 23, 2024 from https://www.news-medical.net/news/20240621/AI-outperforms-peers-in-medical-oncology-quiz-yet-some-mistakes-could-be-harmful.aspx.

