In a recent study published in JAMA Network Open, a team of researchers from Vanderbilt University examined the potential role of the Chat Generative Pre-trained Transformer (ChatGPT) in providing medical information to patients and health professionals.
Study: Accuracy and Reliability of Chatbot Responses to Physician Questions.
Background
ChatGPT is now widely used for a broad range of purposes. This large language model (LLM) has been trained on articles, books, and other sources from across the web. ChatGPT interprets requests from human users and provides answers in text and, more recently, image formats. Unlike many earlier natural language processing (NLP) models, this chatbot can learn on its own through ‘self-supervised learning.’
ChatGPT synthesizes immense amounts of information rapidly, making it a potentially invaluable reference tool. Medical professionals could use the application to draw inferences from medical data and to inform complex clinical decisions. This could make healthcare more efficient, as physicians would not need to consult multiple references to obtain the information they need. Similarly, patients would be able to access medical information without relying solely on their doctor.
However, the utility of ChatGPT in medicine, for both doctors and patients, hinges on whether it can provide accurate and complete information. Many cases have been documented in which the chatbot ‘hallucinated,’ producing convincing responses that were entirely incorrect. It is therefore crucial to assess its accuracy in responding to health-related queries.
“Our study provides insights into model performance in addressing medical questions developed by physicians from a diverse range of specialties; these questions are inherently subjective, open-ended, and reflect the challenges and ambiguities that physicians and, in turn, patients encounter clinically.”
About the study
Thirty-three physicians, faculty members, and recent graduates from Vanderbilt University Medical Center devised a list of 180 questions spanning 17 pediatric, surgical, and medical specialties. Two additional question sets included queries on melanomas and immunotherapy and on common medical conditions. In total, 284 questions were chosen.
The questions were designed to have clear answers based on the medical guidelines available in early 2021 (the cutoff of the training data for version 3.5 of the chatbot). Questions could be binary (with yes/no answers) or descriptive and, based on difficulty, were classified as easy, medium, or hard.
An investigator entered each question into the chatbot, and the response was assessed by the physician who had designed the question. Accuracy and completeness were scored on Likert scales. Each answer was scored from 1 to 6 for accuracy, where 1 indicated ‘completely incorrect’ and 6 ‘completely correct.’ Completeness was graded from 1 to 3, where 3 was the most comprehensive and 1 the least. Completely incorrect answers were not assessed for completeness.
Scores were reported as medians with interquartile ranges (IQRs) and means with standard deviations (SDs). Differences between groups were assessed using Mann-Whitney U, Kruskal-Wallis, and Wilcoxon signed-rank tests. When more than one physician scored a particular question, interrater agreement was also assessed.
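For readers unfamiliar with these nonparametric tests, the sketch below shows how such comparisons are typically run in Python with SciPy. The scores are made up and the code is not the authors'; it only illustrates the kind of analysis described above.

```python
# Illustrative sketch with hypothetical Likert scores; not the study's code.
import numpy as np
from scipy import stats

# Hypothetical 1-6 accuracy scores for questions of different difficulty
easy = np.array([6, 5, 6, 4, 5, 6, 3, 5])
medium = np.array([5, 5, 6, 4, 6, 5, 4, 5])
hard = np.array([4, 6, 5, 2, 5, 4, 6, 3])

# Scores summarized as median [IQR] and mean (SD), as in the study
q1, q3 = np.percentile(easy, [25, 75])
print(f"easy: median {np.median(easy)} [IQR {q1}-{q3}], "
      f"mean {easy.mean():.1f} (SD {easy.std(ddof=1):.1f})")

# Mann-Whitney U test: compares two independent groups
u_stat, p_two_groups = stats.mannwhitneyu(easy, hard)

# Kruskal-Wallis test: extends the comparison to three or more groups
h_stat, p_three_groups = stats.kruskal(easy, medium, hard)

# Wilcoxon signed-rank test: paired comparison, e.g., the same questions
# scored at two different time points
first_pass = np.array([2, 1, 2, 2, 1, 2])
second_pass = np.array([4, 3, 5, 4, 2, 6])
w_stat, p_paired = stats.wilcoxon(first_pass, second_pass)

print(p_two_groups, p_three_groups, p_paired)
```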
Incorrectly answered questions were asked a second time, between one and three weeks later, to check whether the results were reproducible over time. All immunotherapy- and melanoma-related questions were also rescored to assess the performance of the most recent model, ChatGPT version 4.
Findings
In terms of accuracy, the chatbot had a median score of 5 (IQR: 1-6) for the first set of 180 multispecialty questions, indicating that the median answer was “nearly all correct.” However, the mean score was lower, at 4.4 (SD: 1.7). While the median completeness score was 3 (“comprehensive”), the mean score was lower, at 2.4 (SD: 0.7). Thirty-six answers were classified as inaccurate, having scored 2 or less.
For the first set, completeness and accuracy were also modestly correlated, with a correlation coefficient of 0.4. There were no significant differences in the completeness or accuracy of ChatGPT’s answers across easy, medium, and hard questions, or between binary and descriptive questions.
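The study's own code is not published, but for illustration, the sketch below shows how such a coefficient is typically computed for paired Likert scores, assuming a rank-based (Spearman) correlation; the scores here are hypothetical.

```python
# Illustrative only: hypothetical paired accuracy (1-6) and completeness (1-3)
# scores; a rank-based (Spearman) correlation is assumed here.
from scipy import stats

accuracy = [6, 5, 4, 6, 2, 5, 3, 6]
completeness = [3, 3, 2, 3, 1, 2, 2, 3]

rho, p_value = stats.spearmanr(accuracy, completeness)
print(f"correlation coefficient: {rho:.2f} (p = {p_value:.3f})")
```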
For the reproducibility analysis, 34 of the 36 inaccurately answered questions were re-asked and rescored. The chatbot’s performance improved markedly: 26 answers were more accurate, 7 were unchanged, and only 1 was less accurate than before. The median accuracy score increased from 2 to 4.
The immunotherapy- and melanoma-related questions were assessed twice. In the first round, the median score was 6 (IQR: 5-6), and the mean score was 5.2 (SD: 1.3). The chatbot performed better in the second round, improving its mean score to 5.7 (SD: 0.8). Completeness scores also increased, and the chatbot scored highly on the questions about common conditions.
“This study indicates that 3 months into its existence, chatbot has promise for providing accurate and comprehensive medical information. However, it remains well short of being completely reliable.”
Conclusions
Overall, ChatGPT performed well in terms of completeness and accuracy. However, the mean scores were noticeably lower than the median scores, suggesting that a few highly inaccurate answers (“hallucinations”) pulled the averages down. Because these hallucinations are delivered in the same convincing, authoritative tone as correct answers, they are difficult to distinguish from them.
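As a simple illustration of this effect, consider a hypothetical set of accuracy scores in which most answers are nearly perfect but two are completely wrong; the median stays high while the mean drops.

```python
# Hypothetical 1-6 accuracy scores: mostly correct answers plus two
# "hallucinations" scored 1 (completely incorrect)
import numpy as np

scores = np.array([6, 6, 5, 6, 5, 6, 1, 1])
print(np.median(scores))        # 5.5 -> the typical answer looks nearly all correct
print(round(scores.mean(), 1))  # 4.5 -> the two very low scores drag the mean down
```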
ChatGPT improved markedly over the short period between assessments. This highlights the importance of continuously updating and refining the underlying algorithms and of using repeated user feedback to reinforce factual accuracy and reliance on verified sources. Increasing and diversifying the training datasets (within medical sources) would also help ChatGPT parse the nuances of medical concepts and terminology.
Additionally, the chatbot cannot distinguish between ‘high-quality’ sources, such as PubMed-indexed journal articles and medical guidelines, and ‘low-quality’ sources, such as social media posts; it weighs them equally. With time, ChatGPT could become a valuable tool for medical practitioners and patients, but it is not there yet.