AI outperforms peers in medical oncology quiz, yet some mistakes could be harmful

In a recent study published in JAMA Network Open, researchers evaluated the accuracy and safety of large language models (LLMs) in answering medical oncology examination questions.

Study: Performance of Large Language Models on Medical Oncology Examination Questions. Image Credit: BOY ANTHONY/Shutterstock.com

Background 

LLMs have the potential to revolutionize healthcare by assisting clinicians with tasks and interacting with patients. These models, trained on vast text corpora, can be fine-tuned to answer questions with human-like responses.

LLMs encode extensive medical knowledge and have shown the ability to pass the United States (US) Medical Licensing Examination, demonstrating comprehension and reasoning. However, their performance varies across medical subspecialties.

With rapidly evolving knowledge and high publication volume, medical oncology presents a unique challenge.

Further research is needed to ensure that LLMs can reliably and safely apply their medical knowledge to dynamic and specialized fields like medical oncology, improving clinician support and patient care.

About the study 

The present study, conducted from May 28 to October 11, 2023, followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines and did not require ethics board approval or informed consent due to the lack of human participants.

The American Society of Clinical Oncology (ASCO) provided 52 multiple-choice questions from its publicly accessible question bank, each with one correct answer and explanatory references. Similarly, the European Society for Medical Oncology (ESMO) Examination Trial Questions from 2021 and 2022 provided 75 questions after excluding image-based ones, with answers developed by oncologists.

To guard against the possibility that the models had encountered published questions during training, oncologists also wrote 20 original questions in the same multiple-choice format.

Chat Generative Pre-trained Transformer (ChatGPT)-3.5 and ChatGPT-4, labeled consistently as proprietary LLM 1 and proprietary LLM 2 for comparison, were used to answer these questions. Six open-source LLMs, including Biomedical Mistral-7B Domain Adapted for Retrieval and Evaluation (BioMistral-7B DARE), tailored for biomedical domains, were also evaluated.

Responses were recorded, and explanations containing errors were classified on a four-level error scale. Statistical analysis, conducted in R version 4.3.0, tested accuracy, error distribution, and agreement between oncologists.
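Agreement between two raters on an ordinal scale, such as the study's four-level error scale, is typically summarized with a weighted κ statistic. A minimal sketch using scikit-learn; the ratings below are entirely hypothetical, since the paper's rating data are not reproduced here:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings: two oncologists scoring the same 8 LLM answers on
# a four-level error scale (0 = no error ... 3 = severe error).
# These values are illustrative, not data from the study.
rater_a = [0, 0, 1, 2, 3, 1, 0, 2]
rater_b = [0, 1, 1, 2, 3, 0, 0, 2]

# Linear weighting penalizes larger disagreements more heavily than
# near-misses, which suits an ordered severity scale.
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(f"weighted kappa: {kappa:.2f}")
```

A κ near 1 indicates strong agreement beyond chance; values near 0 indicate agreement no better than chance.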

Statistical tests included the binomial test, McNemar test, Fisher exact test, weighted κ, and Wilcoxon rank sum test, with a two-sided P value below .05 indicating statistical significance.
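As an illustration of the study's headline comparison, a binomial test can check whether the best model's reported 125/147 accuracy beats random guessing. A sketch using SciPy; the chance rate of 0.25 (i.e., four answer options per question) is an assumption for illustration, not a figure from the paper:

```python
from scipy.stats import binomtest

# Reported result: 125 of 147 questions answered correctly.
# Assumed chance rate: 0.25 (four answer options) -- hypothetical.
result = binomtest(k=125, n=147, p=0.25, alternative="greater")

print(f"observed accuracy: {125 / 147:.3f}")
print(f"one-sided binomial P value: {result.pvalue:.2e}")
```

With an observed accuracy of 85.0% against a 25% chance rate, the P value is far below any conventional significance threshold.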

Study results 

The evaluation of LLMs across 147 examination questions included 52 from ASCO, 75 from ESMO, and 20 original questions. Hematology was the most common category (15.0%), but the questions spanned various topics.

ESMO questions were more general, addressing mechanisms and toxic effects of systemic therapies. Notably, 27.9% of questions required knowledge from evidence published from 2018 onwards. LLMs provided prose answers to all questions, with proprietary LLM 2 needing prompts for specific answers in 22.4% of cases.

A selected ASCO question involved a 62-year-old woman with metastatic breast cancer presenting with symptoms of a pulmonary embolism. Proprietary LLM 2 correctly identified the best treatment as low molecular weight heparin or a direct oral anticoagulant, considering the patient's cancer and travel history.

Another ASCO question described a 61-year-old woman with metastatic colon cancer experiencing neuropathy from her chemotherapy regimen. The LLM recommended switching to targeted therapy with encorafenib and cetuximab, given the tumor's B-Raf proto-oncogene, serine/threonine kinase (BRAF) V600E mutation and the neuropathy caused by her current regimen.

Proprietary LLM 2 demonstrated the highest accuracy, correctly answering 85.0% of questions (125 out of 147), significantly outperforming random answering and other models. The performance was consistent across ASCO (80.8%), ESMO (88.0%), and original questions (85.0%).

When given a second attempt, proprietary LLM 2 corrected 54.5% of its initially incorrect answers. Proprietary LLM 1 and the best open-source LLM, Mixtral-8x7B-v0.1 (a mixture-of-experts model), had lower accuracies of 60.5% and 59.2%, respectively. BioMistral-7B DARE, despite its biomedical tuning, reached only 33.6%.
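The head-to-head model comparisons above rely on the McNemar test, which considers only the questions on which two models disagree. A minimal sketch of an exact (binomial) version; the discordant counts below are invented for illustration, as the article does not report them:

```python
from scipy.stats import binomtest

def mcnemar_exact(b: int, c: int) -> float:
    """Exact McNemar test on discordant pairs.

    b: questions only model A answered correctly
    c: questions only model B answered correctly
    Under the null of equal accuracy, b ~ Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    return binomtest(k=b, n=n, p=0.5, alternative="two-sided").pvalue

# Hypothetical counts: model A uniquely correct on 40 questions,
# model B uniquely correct on 5.
p = mcnemar_exact(40, 5)
print(f"McNemar exact P value: {p:.2e}")
```

Questions both models get right (or both get wrong) carry no information about which model is better, which is why only discordant pairs enter the test.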

Qualitative evaluation of the prose answers by clinicians showed that proprietary LLM 2 provided correct and error-free answers for 83.7% of the questions.

Incorrect answers were more frequent when questions required knowledge of recent publications, with errors in knowledge recall, reasoning, and reading comprehension identified.

Clinicians classified 63.6% of errors as having a medium likelihood of causing harm, with a high likelihood in 18.2% of cases. No hallucinations were observed in the LLM responses.

Conclusions 

In this study, LLMs performed exceptionally well on medical oncology exam-style questions intended for trainees nearing clinical practice. Proprietary LLM 2 correctly answered 85.0% of multiple-choice questions and provided accurate explanations, showcasing its substantial medical oncology knowledge and reasoning abilities.

However, incorrect answers, particularly those involving recent publications, raised significant safety concerns. Proprietary LLM 2 outperformed its predecessor, proprietary LLM 1, and demonstrated superior accuracy compared to other LLMs.

The study revealed that while LLMs' capabilities are improving, errors in information retrieval, especially with newer evidence, pose risks. Enhanced training and frequent updates are essential for maintaining up-to-date medical oncology knowledge in LLMs.


Written by

Vijay Kumar Malesu


Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Kumar Malesu, Vijay. (2024, June 21). AI outperforms peers in medical oncology quiz, yet some mistakes could be harmful. News-Medical. Retrieved on November 23, 2024 from https://www.news-medical.net/news/20240621/AI-outperforms-peers-in-medical-oncology-quiz-yet-some-mistakes-could-be-harmful.aspx.

