In a recent study published in JAMA Oncology, researchers compared the replies of online conversational artificial intelligence (AI) chatbots to cancer-related questions with those of licensed physicians in terms of empathy, response quality, and readability.
Digital oncology solutions can help cut expenses, improve patient outcomes, and reduce physician burnout. AI has produced significant advances in healthcare delivery, notably conversational AI-based chatbots that inform cancer patients about clinical diagnoses and treatment options. However, the ability of AI chatbots to generate accurate, cancer-specific responses has yet to be validated. Interest in deploying these technologies in patient-facing roles is considerable, but their medical accuracy, empathy, and readability remain unknown. According to recent studies, chatbot replies to general medical inquiries posted online are more empathetic than physician replies.
Study: Brief Report: Physician and Artificial Intelligence Chatbot Responses to Cancer Questions From Social Media.
About the study
In the present equivalence study, researchers examined several cutting-edge chatbots using pilot measures of response quality, empathy, and readability to assess their competence in answering oncology-related patient concerns. They investigated the ability of three AI chatbots, GPT-3.5 (chatbot 1), GPT-4.0 (chatbot 2), and Claude AI (chatbot 3), to provide high-quality, empathetic, and readable replies to cancer-related inquiries from patients.
The researchers compared AI chatbot replies with responses from six verified physicians to 200 cancer-related questions posed by patients in a public online forum. The study exposures comprised 200 patient cancer questions posted online between January 1, 2018, and May 31, 2023, which were submitted to the three AI chatbots; chatbot data were collected on May 31, 2023.
The primary study outcomes were pilot ratings of quality, empathy, and readability on Likert scales ranging from 1.0 (very poor) to 5.0 (very good), graded by physicians from radiation oncology, medical oncology, and palliative and supportive care. The secondary outcome was objective readability, measured using the Flesch-Kincaid Grade Level (FKGL), the Gunning-Fog Index, and the Automated Readability Index.
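These readability indices are standard formulas over sentence, word, syllable, and character counts. As a rough illustration only (not the authors' exact tooling), the sketch below computes all three from raw text, using a simple vowel-group heuristic for syllable counting.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability_metrics(text: str) -> dict:
    """Compute FKGL, Gunning-Fog, and ARI scores from raw text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    characters = sum(len(w) for w in words)
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    n_sent, n_words = len(sentences), len(words)
    return {
        # Flesch-Kincaid Grade Level
        "fkgl": 0.39 * n_words / n_sent + 11.8 * syllables / n_words - 15.59,
        # Gunning-Fog Index
        "fog": 0.4 * (n_words / n_sent + 100 * complex_words / n_words),
        # Automated Readability Index
        "ari": 4.71 * characters / n_words + 0.5 * n_words / n_sent - 21.43,
    }

if __name__ == "__main__":
    sample = ("Chemotherapy can cause fatigue. "
              "Ask your oncology team about managing side effects.")
    print(readability_metrics(sample))
```

Higher scores on all three indices correspond to text that requires more years of schooling to read comfortably, which is why longer words and sentences push chatbot responses toward higher grade levels.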
The researchers assessed the cognitive load of reading comprehension using mean dependency distance (a measure of syntactic complexity) and textual lexical diversity. Chatbots were prompted to limit their responses to the average physician response length (125 words). Responses to each question were blinded and presented in random order. The team conducted one-way analyses of variance (ANOVA) with post-hoc tests to compare the 200 quality, empathy, and readability ratings and 90 readability metrics between chatbot and physician replies, and used Pearson correlation coefficients to assess the relationships between measures.
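The statistical comparison described here can be reproduced in outline with standard SciPy routines. The snippet below is a minimal sketch on simulated placeholder ratings, not the study's data; the Tukey HSD post-hoc test stands in for whichever post-hoc procedure the authors used.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated 1-5 Likert quality ratings for 200 questions per source
# (placeholder data; the study's ratings are not public).
physician = rng.integers(1, 6, size=200)
chatbot_1 = rng.integers(2, 6, size=200)
chatbot_2 = rng.integers(2, 6, size=200)
chatbot_3 = rng.integers(3, 6, size=200)

# One-way ANOVA across the four response sources.
f_stat, p_val = stats.f_oneway(physician, chatbot_1, chatbot_2, chatbot_3)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4f}")

# Pairwise post-hoc comparisons (Tukey HSD, available in recent SciPy releases).
print(stats.tukey_hsd(physician, chatbot_1, chatbot_2, chatbot_3))

# Pearson correlation between response word count and quality rating,
# analogous to the word-count associations reported in the study.
word_counts = rng.normal(loc=130, scale=25, size=200)
r, p = stats.pearsonr(word_counts, chatbot_3)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```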
Results
Physician raters consistently scored chatbot replies higher for empathy, quality, and readability of writing style. Responses generated by chatbots 1, 2, and 3 were consistently superior to physician responses on mean response-quality component measures, such as medical correctness, completeness, and focus, as well as overall quality. Similarly, chatbot replies scored higher than physician replies on both the component and overall empathy measures.
Responses to the 200 questions generated by chatbot 3, the highest-rated AI chatbot, were consistently rated higher than physician responses on the overall criteria of quality, empathy, and readability, with mean scores of 3.6 (vs. 3.0), 3.56 (vs. 2.4), and 3.8 (vs. 3.1), respectively. The mean Flesch-Kincaid grade level of physician replies (mean, 10.1) was not significantly different from that of the third chatbot's responses (mean, 10.3), although it was lower than those of the first (mean, 12.3) and second (mean, 11.3) chatbots.
Physician replies had lower FKGL scores, indicating greater estimated readability than chatbot responses and implying that chatbot responses may be harder to read because of longer words and sentences. The mean word count of the third chatbot's replies was higher than that of physician responses (136 vs. 125), whereas the word counts of the first chatbot (mean, 136) and second chatbot (mean, 140) did not differ significantly from those of physician replies. Word count was robustly associated with quality ratings for physician, first-chatbot, and second-chatbot replies, and with empathy ratings for physician and third-chatbot replies.
Despite the word-count constraint, only the third chatbot's responses had higher word counts than physician replies. The first (mean, 12) and second (mean, 11) chatbots' replies had considerably higher FKGL scores than physician replies (mean, 10), whereas the third chatbot's replies (mean, 10) were comparable to physician responses. Nevertheless, physician replies received a 19% lower readability rating (mean, 3.1) than chatbot 3, the best-performing chatbot (mean, 3.8).
Conclusions
The study showed that conversational AI chatbots may deliver high-quality, empathetic, and readable replies to patient inquiries, comparable to those provided by physicians. Future studies should examine the breadth of chatbot-mediated interactions, their integration into care processes, and their outcomes. Specialized AI chatbots trained on large medical text corpora might provide emotional support to cancer patients and improve oncology care. They may also serve as point-of-care digital health tools and offer information to vulnerable groups. Researchers must establish standards through future randomized controlled trials to ensure proper monitoring and outcomes for clinicians and patients. The greater empathy of chatbot replies may help strengthen healthcare partnerships.