In a recent study published in JAMA Internal Medicine, researchers evaluated the ability of ChatGPT, an artificial intelligence (AI)-based chatbot assistant, to respond to patient questions posted on a publicly accessible social media forum.
Background
Owing to the rapid expansion of digital health care, more and more patients are raising queries on social media forums. Answering these questions is time-consuming and often tedious for healthcare professionals. AI assistants, such as ChatGPT, could help absorb this additional work by drafting quality responses that clinicians could later review.
Study: Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum.
About the study
In the present cross-sectional study, researchers randomly drew 195 exchanges, each consisting of a patient question and a physician response, from Reddit's r/AskDocs, a publicly accessible social media forum, in October 2022. A new chatbot session, free of any prior questions that could bias the results, was created using the original full text of each question to generate a chatbot response. A team of licensed healthcare professionals then evaluated the anonymized physician and chatbot responses, rating each on a 1-to-5 scale for quality and empathy, with higher scores indicating better quality or greater empathy.
On r/AskDocs, subreddit moderators verify the credentials of healthcare professionals who post responses and display those credentials alongside their answers. The researchers also anonymized patient messages by removing identifying information to protect patients' identities and make the study Health Insurance Portability and Accountability Act (HIPAA) compliant.
In addition, the researchers compared the number of words in physician and chatbot responses and determined the proportion of exchanges for which evaluators preferred the chatbot response. Furthermore, they compared the rates at which responses fell above or below prespecified thresholds, e.g., 'less than adequate' quality, to compute prevalence ratios for chatbot versus physician responses.
Finally, the team reported the Pearson correlation between quality and empathy scores. They also evaluated the extent to which restricting the data to the longest physician replies (above the 75th percentile in length) changed evaluator preferences and the quality or empathy ratings.
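To make these analyses concrete, the sketch below (plain Python, not the authors' code; the toy ratings, word counts, and helper names are assumptions for illustration only) shows how threshold proportions, prevalence ratios, the Pearson correlation, and a 75th-percentile length subset of the kind described above could be computed from evaluator ratings.

    # Illustrative sketch only: toy data standing in for evaluator ratings.
    from statistics import correlation, quantiles

    # Each record: (quality 1-5, empathy 1-5, word count); values are hypothetical.
    physician = [(3, 2, 52), (4, 1, 17), (2, 2, 30), (5, 3, 110), (3, 1, 41)]
    chatbot = [(4, 4, 180), (5, 5, 211), (4, 3, 167), (5, 4, 195), (4, 4, 150)]

    def prop_at_least(scores, threshold=4):
        # Proportion of ratings at or above a prespecified threshold (e.g., >=4 = 'good').
        return sum(s >= threshold for s in scores) / len(scores)

    # Prevalence ratio: chatbot rate divided by physician rate at the same threshold.
    pr_quality = (prop_at_least([q for q, _, _ in chatbot])
                  / prop_at_least([q for q, _, _ in physician]))

    # Pearson correlation between quality and empathy scores for physician responses.
    r_physician = correlation([q for q, _, _ in physician],
                              [e for _, e, _ in physician])

    # Subset analysis: keep only physician replies above the 75th percentile in length.
    cutoff = quantiles([w for _, _, w in physician], n=4)[2]
    long_physician = [rec for rec in physician if rec[2] > cutoff]

    print(round(pr_quality, 2), round(r_physician, 2), len(long_physician))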
Results
Evaluators preferred the chatbot (ChatGPT) response over the physician response in 78.6% of the 585 evaluations. Strikingly, even when compared with the lengthiest physician-authored responses, ChatGPT responses were rated significantly higher for both quality and empathy.
The proportion of responses rated ≥4, indicating 'good' or 'very good' quality, was higher for the chatbot than for physicians (78.5% vs. 22.1%), equating to a 3.6-fold higher prevalence of good-or-better quality among chatbot responses.
Likewise, chatbot responses were rated as significantly more empathetic than physician responses (t = 18.9). The proportion of responses rated ≥4, indicating 'empathetic' or 'very empathetic', was 45.1% for the chatbot versus 4.6% for physicians, equating to a 9.8-fold higher prevalence of empathetic responses from the chatbot.
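These ratios follow directly from the reported proportions: 78.5% ÷ 22.1% ≈ 3.6 for quality and 45.1% ÷ 4.6% ≈ 9.8 for empathy.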
The Pearson correlation coefficients (r) between quality and empathy scores were 0.59 for physician-authored responses and 0.32 for chatbot responses.
Conclusion
Each new message in the electronic health record adds an estimated 2.3 minutes of after-hours work for a healthcare professional. Rising messaging volume thus translates into greater clinician burnout, with 62% of physicians reporting at least one burnout symptom, and increases the likelihood that patient messages go unanswered or receive unhelpful responses.
Some patient queries require more skill and time to answer; most, however, are generic, such as questions about appointments and test results, rather than requests for complex medical advice. This is largely uncharted territory in which AI assistants could be tested and, if successful, could help reduce or manage the extra burden that patient messages place on clinicians.
ChatGPT is well recognized for its ability to write human-like responses on varied topics, extending beyond basic health concepts. Using it to answer patients seeking medical advice on social media forums could therefore free up clinical staff time for more complex tasks, provide draft answers for physicians or support staff to edit later, and, most importantly, bring greater consistency to responses.
Additionally, if patients received quick responses to their queries, it might reduce unnecessary clinic visits and particularly help patients with mobility limitations or irregular work hours. For some patients, prompt messaging might also have collateral effects on health behaviors, e.g., stricter adherence to diet and medications.
Overall, this study yielded promising results, suggesting that AI assistants have the potential to improve outcomes for both clinicians and patients. Nevertheless, evaluating AI-based technologies in randomized clinical trials remains vital before they are implemented in real-world clinical settings. Such trials should also examine their effects on clinical staff and physician burnout in more detail.