In a recent study posted to the medRxiv* preprint server, researchers systematically evaluated the capabilities and limitations of large language models (LLMs), specifically ChatGPT, for zero-shot medical evidence summarization.
Study: Evaluating Large Language Models on Medical Evidence Summarization.
*Important notice: medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.
Background
Text summarization research has relied on fine-tuned pre-trained models as the primary approach. However, these models often require large training datasets, which may not be available in specialized domains such as medical literature.
Large language models (LLMs) have caused a shift in natural language processing (NLP) research due to their recent success in zero- and few-shot prompting.
Prompt-based models offer promise for medical evidence summarization because they can summarize by simply following human instructions, without updating model parameters. Yet, little research has systematically examined how well these models summarize medical evidence or how their outputs should be evaluated.
About the study
In the present study, researchers evaluated the effectiveness of LLMs, such as ChatGPT and GPT-3.5, in summarizing medical evidence across six clinical domains. The capabilities and limitations of these models were systematically examined.
The study utilized Cochrane Reviews from the Cochrane Library and concentrated on six clinical areas: Alzheimer's disease, esophageal cancer, kidney disease, skin disorders, neurological conditions, and heart failure. The team collected the ten most recent reviews published for these six domains.
Domain experts verified reviews to ensure they fulfilled important research objectives. The study focused on single-document summarization, specifically on the abstracts obtained from Cochrane Reviews.
Zero-shot performance on medical evidence summarization was evaluated using two models, GPT-3.5 and ChatGPT, under two experimental setups.
In the first setup, the model was given the complete abstract, excluding the Authors' Conclusions section (ChatGPT-Abstract). In the second setup, two models, ChatGPT-MainResult and GPT3.5-MainResult, were given the Objectives and Main Results sections of the abstract as input.
The Main Results section was chosen because it contains the key findings on benefits and harms and summarizes how risk of bias in trial design, conduct, and reporting affects the evidence.
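To illustrate what such a zero-shot setup looks like in practice, the sketch below issues a summarization prompt through the OpenAI Python client. The model name, prompt wording, and helper function are illustrative assumptions, not the authors' exact configuration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def summarize_zero_shot(objectives: str, main_results: str) -> str:
    """Request a zero-shot summary of selected Cochrane Review abstract sections.

    The prompt wording and model identifier below are illustrative assumptions,
    not the exact configuration used in the study.
    """
    prompt = (
        "Summarize the following sections of a Cochrane Review abstract "
        "in a few sentences for a clinical audience.\n\n"
        f"Objectives:\n{objectives}\n\nMain results:\n{main_results}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep outputs relatively stable for evaluation
    )
    return response.choices[0].message.content
```

Because no examples or parameter updates are involved, the quality of the output depends entirely on the instruction and the input sections supplied to the model.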
The quality of the generated summaries was first assessed with automatic metrics, such as ROUGE-L, METEOR, and BLEU, computed against a reference summary. Each metric yields a score between 0.0 and 1.0, where 1.0 indicates that the generated summary exactly matches the reference.
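As a rough illustration of how such metrics are computed, the sketch below scores a toy generated summary against a toy reference using the rouge-score and NLTK packages; the example texts are invented, and the study's exact scoring setup may differ.

```python
# Minimal sketch: score a generated summary against a reference summary.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The intervention reduced mortality but increased adverse events."
generated = "The treatment lowered mortality while raising adverse events."

# ROUGE-L F-measure (longest-common-subsequence overlap), in [0.0, 1.0]
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = rouge.score(reference, generated)["rougeL"].fmeasure

# BLEU (n-gram precision with a brevity penalty), also in [0.0, 1.0];
# smoothing avoids zero scores on short, single-sentence summaries
bleu = sentence_bleu(
    [reference.split()],
    generated.split(),
    smoothing_function=SmoothingFunction().method1,
)

print(f"ROUGE-L: {rouge_l:.2f}, BLEU: {bleu:.2f}")
```

Scores like these reward surface overlap with the reference, which is one reason the study complemented them with human evaluation.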
The model-generated summaries also underwent a thorough human evaluation designed to go beyond the limitations of automatic metrics. Summary quality was assessed along four dimensions: coherence, factual consistency, comprehensiveness, and harmfulness.
Each dimension was rated on a 5-point Likert scale, and participants were asked to explain low scores in a free-text box for the corresponding dimension. Participants were also asked to identify their most and least preferred summaries and give reasons for their choices.
Results
All models performed similarly on the ROUGE-L, METEOR, and BLEU metrics. LLM-generated summaries contained fewer novel n-grams and tended to be more extractive than those written by humans.
ChatGPT-MainResult displayed greater abstraction than GPT3.5-MainResult and ChatGPT-Abstract; however, it still fell short of the human-written references. Around 50% of the reviews were published in 2022 and 2023, beyond the training data cutoff of GPT-3.5 and ChatGPT, yet no significant differences in quality metrics were observed between reviews published before and after 2022.
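For intuition on what "novel n-grams" and extractiveness mean here, the sketch below computes the fraction of summary bigrams that do not appear in the source text; a higher fraction indicates a more abstractive summary. This is an illustrative measure and may not match the study's exact definition.

```python
# Rough sketch of n-gram novelty: the share of summary bigrams absent from the
# source document. Higher values suggest a more abstractive (less extractive)
# summary; the study's precise novelty measure may differ.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_fraction(source: str, summary: str, n: int = 2) -> float:
    source_ngrams = ngrams(source.lower().split(), n)
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    return len(summary_ngrams - source_ngrams) / len(summary_ngrams)
```

By this kind of measure, a summary that copies sentences verbatim from the source scores near 0.0, while heavily paraphrased, human-style writing scores higher.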
The ChatGPT-MainResult configuration produced the most preferred summaries, surpassing the other two configurations by a considerable margin.
It was favored for generating thorough summaries that captured important details, and its outputs showed few factual inconsistency errors and no harmful or misleading statements.
The team noted that omission of important information, fabricated content, and errors in interpretation were the main reasons certain summaries were rated least preferred.
Conclusion
The study findings revealed that the three model configurations, ChatGPT-Abstract, ChatGPT-MainResult, and GPT3.5-MainResult, produced comparable results when evaluated with automatic metrics. However, these metrics did not capture factual inconsistency, potential for medical harm, or human preference among LLM-generated summaries.
The researchers believe that human evaluation is crucial for assessing the accuracy and quality of medical evidence summaries produced by LLMs. However, there is a need for more efficient automatic evaluation methods in this area.