A new study by investigators from Mass General Brigham demonstrates that large language models (LLMs), a type of generative AI, may help reduce physician workload and improve patient education when used to draft replies to patient messages. The study also identified limitations of LLMs that may affect patient safety, suggesting that vigilant oversight of LLM-generated communications is essential for safe use. The findings, published in The Lancet Digital Health, emphasize the need for a measured approach to LLM implementation.
Rising administrative and documentation responsibilities have contributed to increases in physician burnout. To help streamline and automate physician workflows, electronic health record (EHR) vendors have adopted generative AI algorithms to aid clinicians in drafting messages to patients; however, the efficiency, safety and clinical impact of their use had been unknown.
"Generative AI has the potential to provide a 'best of both worlds' scenario of reducing burden on the clinician and better educating the patient in the process. However, based on our team's experience working with LLMs, we have concerns about the potential risks associated with integrating LLMs into messaging systems. With LLM integration into EHRs becoming increasingly common, our goal in this study was to identify relevant benefits and shortcomings."
Danielle Bitterman, MD, corresponding author, faculty member in the Artificial Intelligence in Medicine (AIM) Program at Mass General Brigham and physician in the Department of Radiation Oncology at Brigham and Women's Hospital
For the study, the researchers used OpenAI's GPT-4, a foundational LLM, to generate 100 scenarios about patients with cancer, each with an accompanying patient question. No questions from actual patients were used in the study. Six radiation oncologists first responded to the queries manually; then, GPT-4 generated responses to the same questions. Finally, the same radiation oncologists were given the LLM-generated responses to review and edit. The radiation oncologists did not know whether GPT-4 or a human had written the responses, and in 31% of cases they believed that an LLM-generated response had been written by a human.
On average, physician-drafted responses were shorter than the LLM-generated responses. GPT-4 tended to include more educational background for patients but was less directive in its instructions. The physicians reported that LLM assistance improved their perceived efficiency, and they deemed the LLM-generated responses safe in 82.1% of cases and acceptable to send to a patient without further editing in 58.3% of cases. The researchers also identified some shortcomings: if left unedited, 7.1% of LLM-generated responses could pose a risk to the patient, and 0.6% could pose a risk of death, most often because GPT-4's response failed to urgently instruct the patient to seek immediate medical care.
Notably, the LLM-generated, physician-edited responses were more similar in length and content to the LLM-generated responses than to the manually drafted responses. In many cases, physicians retained the LLM-generated educational content, suggesting that they perceived it to be valuable. While this may promote patient education, the researchers emphasize that overreliance on LLMs may also pose risks, given their demonstrated shortcomings.
The emergence of AI tools in health has the potential to positively reshape the continuum of care and it is imperative to balance their innovative potential with a commitment to safety and quality. Mass General Brigham is leading the way in responsible use of AI, conducting rigorous research on new and emerging technologies to inform the incorporation of AI into care delivery, workforce support and administrative processes. Mass General Brigham is currently leading a pilot integrating generative AI into the electronic health record to draft replies to patient portal messages, testing the technology in a set of ambulatory practices across the health system.
Going forward, the study's authors are investigating how patients perceive LLM-based communications and how patients' racial and demographic characteristics influence LLM-generated responses, based on known algorithmic biases in LLMs.
"Keeping a human in the loop is an essential safety step when it comes to using AI in medicine, but it isn't a single solution," Bitterman said. "As providers rely more on LLMs, we could miss errors that could lead to patient harm. This study demonstrates the need for systems to monitor the quality of LLMs, training for clinicians to appropriately supervise LLM output, more AI literacy for both patients and clinicians, and on a fundamental level, a better understanding of how to address the errors that LLMs make."
Journal reference:
Chen, S., et al. (2024) The effect of using a large language model to respond to patient messages. The Lancet Digital Health. doi.org/10.1016/S2589-7500(24)00060-8.