Study finds health care evaluations of large language models lacking in real patient data and bias assessment

A new systematic review reveals that only 5% of health care evaluations for large language models use real patient data, with significant gaps in assessing bias, fairness, and a wide range of tasks, underscoring the need for more comprehensive evaluation methods.

Study: Testing and Evaluation of Health Care Applications of Large Language Models. Image Credit: BOY ANTHONY/Shutterstock.com

In a recent study published in JAMA, researchers from the United States (U.S.) conducted a systematic review of existing evaluations of large language models (LLMs) in healthcare, examining the healthcare tasks assessed and the types of data used, in order to identify the areas of healthcare where LLMs could be most usefully applied.

Background

The use of artificial intelligence (AI) in healthcare has advanced rapidly, especially with the development of LLMs. Unlike predictive AI, which forecasts outcomes from existing data, generative AI based on LLMs can create a wide range of new content, such as images, sounds, and text.

Based on user inputs, LLMs can generate structured and largely coherent text responses, which makes them valuable in the healthcare field. In some U.S. health systems, LLMs are already being used for clinical notetaking and are being explored more broadly as tools to improve efficiency and patient care.

However, the sudden interest in LLMs has also resulted in unstructured testing of LLMs across various fields, and the performance of LLMs in clinical settings has been mixed. While some studies have found the responses from LLMs to be largely superficial and often inaccurate, others have found accuracy rates comparable to those of human clinicians.

This inconsistency highlights the need for a systematic evaluation of the performance of LLMs in the healthcare setting.

About the study

For this comprehensive systematic review, the researchers searched preprints and peer-reviewed studies on LLM evaluations in healthcare published between January 2022 and February 2024. This roughly two-year window was chosen to capture the wave of studies published after the launch of the AI chatbot ChatGPT in November 2022.

Three independent reviewers screened the studies, which were included in the review if they focused on LLM evaluations in healthcare. Studies on basic biological research or multimodal tasks were excluded.

The studies were then categorized based on the data type evaluated, the healthcare tasks, the natural language processing (NLP) and natural language understanding tasks, medical specialties, and evaluation dimensions. The framework for categorization was developed from an existing list of healthcare tasks, established evaluation models, and inputs from healthcare professionals.

The categorization framework considered whether real patient data was evaluated and examined 19 healthcare tasks, including caregiving and administrative functions. Additionally, six NLP tasks, including summarization and question answering, were included in the categorization.

Furthermore, seven dimensions of evaluation were identified, including aspects such as factuality, accuracy, and toxicity. The studies were also grouped by medical specialty into 22 categories. The researchers then used descriptive statistics to summarize the findings and calculate the percentages and frequencies for each category.
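To illustrate the kind of descriptive summary the reviewers describe, the sketch below tallies hypothetical study annotations and computes the frequency and percentage of studies falling into each category. The category labels and counts are invented for illustration and are not taken from the review itself.

```python
from collections import Counter

# Hypothetical annotations: each reviewed study is tagged with the
# evaluation dimensions it reports (labels invented for illustration).
study_dimensions = [
    ["accuracy"],
    ["accuracy", "comprehensiveness"],
    ["accuracy", "bias"],
    ["accuracy", "comprehensiveness"],
]

n_studies = len(study_dimensions)
counts = Counter(dim for dims in study_dimensions for dim in dims)

# Frequency and percentage of studies reporting each dimension.
for dimension, count in counts.most_common():
    print(f"{dimension}: {count}/{n_studies} studies ({100 * count / n_studies:.1f}%)")
```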

Results

The review found that the evaluation of LLMs in healthcare is heterogeneous, and there are significant gaps in task coverage and data usage. Among the 519 studies included in the review, only 5% used real patient data, and most of the studies relied on expert-generated snippets of data or medical examination questions.

Most of the studies focused on LLMs for medical knowledge tasks, especially through evaluations such as the U.S. Medical Licensing Examination.

Patient care tasks, such as diagnosing patients and making recommendations for treatment, were also relatively common among the LLM tasks. However, administrative tasks, including clinical notetaking and billing code assignments, were rarely explored among the LLM tasks.

Among the NLP tasks, most of the studies focused on question answering, which included generic inquiries. Approximately 25% of the studies evaluated LLMs on text classification and information extraction, but tasks such as conversational dialogue and summarization were not well explored.

Accuracy was the most frequently examined evaluation dimension (95.4% of studies), followed by comprehensiveness (47%). Very few studies assessed ethical considerations such as bias, toxicity, and fairness.

While more than 20% of the studies were not specific to any medical specialty, internal medicine, ophthalmology, and surgery were the most represented specialties in the LLM evaluation studies, whereas medical genetics and nuclear medicine were the least explored.

Conclusions

Overall, the review highlighted the need for standardized evaluation methods and a consensus framework for assessing LLM applications in healthcare.

The researchers stated that the use of real patient data in LLM evaluations should be promoted and that evaluating LLMs on administrative tasks and across a wider range of medical specialties would be highly beneficial.

Journal reference:
  • Bedi, S., Liu, Y., Orr-Ewing, L., Dash, D., Koyejo, S., Callahan, A., Fries, J. A., Wornow, M., Swaminathan, A., Lehmann, L. S., Hong, H. J., Kashyap, M., Chaurasia, A. R., Shah, N. R., Singh, K., Tazbaz, T., Milstein, A., Pfeffer, M. A., & Shah, N. H. (2024). Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA. doi:10.1001/jama.2024.21700. https://jamanetwork.com/journals/jama/fullarticle/2825147

Written by

Dr. Chinta Sidharthan

Chinta Sidharthan is a writer based in Bangalore, India. Her academic background is in evolutionary biology and genetics, and she has extensive experience in scientific research, teaching, science writing, and herpetology. Chinta holds a Ph.D. in evolutionary biology from the Indian Institute of Science and is passionate about science education, writing, animals, wildlife, and conservation. For her doctoral research, she explored the origins and diversification of blindsnakes in India, as a part of which she did extensive fieldwork in the jungles of southern India. She has received the Canadian Governor General’s bronze medal and Bangalore University gold medal for academic excellence and published her research in high-impact journals.

