In a recent study published in the journal Nature Medicine, an international team of scientists identified the best-performing large language models and adaptation methods for summarizing clinical text from electronic health records and compared the performance of these models to that of medical experts.
Background
A laborious but essential aspect of medical practice is documenting patient health records, which contain progress reports, diagnostic tests, and treatment histories across specialists. Clinicians spend a substantial portion of their time compiling vast amounts of textual data, and even for very experienced physicians, this process carries a risk of introducing errors that can translate into serious medical and diagnostic problems.
The transition from paper records to electronic health records appears only to have expanded the clinical documentation workload: reports suggest that clinicians spend approximately two hours documenting each patient interaction, and nurses spend close to 60% of their time on clinical documentation. The temporal demands of this process often result in considerable stress and burnout, decreasing job satisfaction among clinicians and ultimately worsening patient outcomes.
Although large language models present a promising option for summarizing clinical data, and such models have been evaluated on general natural language processing tasks, their efficiency and accuracy in summarizing clinical data have not been evaluated extensively.
About the study
In the present study, the researchers evaluated eight large language models across four clinical summarization tasks, namely, patient questions, radiology reports, dialogue between doctor and patient, and progress notes.
They first used quantitative natural language processing metrics to determine which model and adaptation method performed the best across the four summarization tasks. Ten physicians then conducted a clinical reader study where they compared the best summaries from the large language models with those from medical experts along parameters such as conciseness, correctness, and completeness.
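Quantitative summarization metrics of this kind typically score the overlap between a model's summary and an expert-written reference. As a purely illustrative sketch (not the study's actual metric suite), a minimal ROUGE-1-style unigram F1 score can be computed as follows:

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference.

    A simplified stand-in for metrics such as ROUGE-1; production
    evaluations also use stemming, longest-common-subsequence variants
    (ROUGE-L), and embedding-based scores.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge1_f1("chest pain noted", "patient reports chest pain")` rewards the two shared words while penalizing the missing ones; identical strings score 1.0.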
Finally, the researchers assessed the safety aspects to determine the challenges, such as the fabrication of information and the potential for medical harm present in the summarization of clinical data by medical experts and large language models.
The eight large language models spanned two broad language-generation approaches: autoregressive and sequence-to-sequence (seq2seq) models. Training seq2seq models requires paired datasets because they use an encoder-decoder architecture that maps the input to the output. These models perform efficiently on tasks involving summarization and machine translation.
On the other hand, autoregressive models do not require paired datasets, and these models are suitable for tasks such as dialogue and question-answer interactions and text generation. The study evaluated open-sourced autoregressive and seq2seq large language models, as well as some proprietary autoregressive models and two techniques for adapting the general-purpose, pre-trained large language models to perform domain-specific tasks.
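The difference between the two approaches can be sketched with toy, rule-based stand-ins for trained networks (hypothetical functions, not real models): an autoregressive model predicts one token at a time from its own growing output, while a seq2seq model first encodes the whole input and then decodes a summary from that representation.

```python
def autoregressive_generate(prompt_tokens, next_token_fn, max_new=5):
    """Autoregressive decoding: each new token is predicted from the
    growing sequence and appended, so no paired dataset is needed."""
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        nxt = next_token_fn(tokens)
        if nxt is None:  # model signals end of sequence
            break
        tokens.append(nxt)
    return tokens


def seq2seq_generate(source_tokens, encoder_fn, decoder_fn):
    """Encoder-decoder (seq2seq): the encoder maps the full input to a
    representation, and the decoder maps that representation to an
    output sequence; the mapping is learned from paired examples."""
    representation = encoder_fn(source_tokens)
    return decoder_fn(representation)


# Hypothetical stand-in for a trained network: emits "done", then stops.
def tiny_lm(tokens):
    return "done" if tokens[-1] != "done" else None


summary = autoregressive_generate(["summarize:", "report"], tiny_lm)

condensed = seq2seq_generate(
    ["long", "radiology", "report"],
    encoder_fn=lambda toks: set(toks),       # toy "representation"
    decoder_fn=lambda rep: sorted(rep)[:2],  # toy "summary"
)
```

In a real system, `next_token_fn`, `encoder_fn`, and `decoder_fn` would be neural networks; here they only illustrate the control flow that distinguishes the two families.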
The four tasks used to evaluate the large language models were:

- summarizing radiology reports from detailed radiological findings and results;
- condensing patient questions into brief queries;
- producing a list of medical problems and diagnoses from progress notes; and
- summarizing doctor-patient interactions into an assessment-and-plan paragraph.
Results
The results showed that 45% of the summaries from the best-adapted large language models were rated equivalent to, and 36% superior to, those from medical experts. Furthermore, in the clinical reader study, the large language model summaries scored higher than the medical expert summaries across all three parameters: conciseness, correctness, and completeness.
Furthermore, the scientists found that 'prompt engineering', the process of tuning or modifying the input prompts, greatly improved model performance. This was especially apparent for the conciseness parameter, where prompts instructing the model to summarize patient questions into queries of a specific word count meaningfully condensed the information.
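This kind of prompt tuning can be as simple as making the length constraint explicit in the instruction. The sketch below is illustrative only; the wording is not the study's actual prompt, and `build_summary_prompt` is a hypothetical helper:

```python
def build_summary_prompt(patient_question: str, max_words: int = 15) -> str:
    """Compose an instruction that requests a summary of bounded length.

    Stating an explicit word budget is one way to steer a model toward
    more concise output, as described above.
    """
    return (
        f"Summarize the following patient question into a query of "
        f"{max_words} words or fewer. Preserve the medical intent.\n\n"
        f"Patient question: {patient_question}\n\nQuery:"
    )


prompt = build_summary_prompt(
    "I have been having headaches every morning for two weeks and "
    "ibuprofen no longer helps. Should I be worried?",
    max_words=10,
)
```

The resulting string would then be sent to whichever model API is in use; the key design choice is that the constraint lives in the prompt text itself rather than in a decoding parameter.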
Radiology reports were the one task on which the conciseness of the large language model summaries was lower than that of the medical experts, and the scientists attributed this to the vagueness of the input prompt, since the prompts for summarizing radiology reports did not specify a word limit. They also believe that incorporating checks from other large language models or model ensembles, as well as from human operators, could greatly improve the accuracy of this process.
Conclusions
Overall, the study found that adapted large language models summarized patient health record data as well as or better than medical experts. Most of these models scored higher than human experts on the natural language processing metrics, summarizing the data concisely, correctly, and completely. With further modifications and improvements, this approach could help clinicians save valuable time and improve patient care.
Journal reference:
- Van Veen, D., Van Uden, C., Blankemeier, L., Delbrouck, J., Aali, A., Bluethgen, C., Pareek, A., Polacin, M., Reis, E. P., Seehofnerová, A., Rohatgi, N., Hosamani, P., Collins, W., Ahuja, N., Langlotz, C. P., Hom, J., Gatidis, S., Pauly, J., & Chaudhari, A. S. (2024). Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine. DOI: 10.1038/s41591-024-02855-5, https://www.nature.com/articles/s41591-024-02855-5