Study: Can AI chatbots accurately answer patient questions regarding vasectomies?
ChatGPT provided the most accurate and concise answers to frequently asked vasectomy questions compared with Gemini (formerly Bard) and Copilot (formerly Bing), suggesting it may serve as a reliable patient resource.
In a recent study published in the journal IJIR: Your Sexual Medicine Journal, researchers evaluated the efficacy and accuracy of three common generative artificial intelligence (AI) chatbots in answering basic healthcare questions. Specifically, they investigated ChatGPT-3.5, Bing Chat, and Google Bard's performance when answering questions related to vasectomies.
Critical assessment by a team of qualified urologists revealed that while all models performed satisfactorily across the ten common questions tested, ChatGPT attained the best (lowest) average score (1.367), significantly outperforming Bing Chat and Google Bard (p=0.03988 and p=0.00005, respectively). Encouragingly, with the exception of one 'unsatisfactory' response from Google Bard (now 'Gemini') to the question 'Does a vasectomy hurt?', all generative AI responses were rated either 'satisfactory' or 'excellent.' Together, these results highlight the benefits of generative AI in the healthcare industry, particularly for answering basic and common patient questions accurately and promptly.
However, the study authors caution that while these results are promising, they were based on responses reviewed by only three non-blinded urologists, which may have introduced bias into the ratings. Despite this limitation, the findings are a step forward in validating AI chatbots for patient education.
Background
Artificial intelligence (AI) is the collective name for a set of models and technologies that enable computers and machines to perform advanced tasks with human-like perception, comprehension, and iterative learning. Generative AI is a subset of these technologies that learns from large, human-supplied machine learning (ML) datasets to generate novel text, audio-visual media, and other types of informative data.
Recent progress in computational hardware (processing power), software (advanced algorithms), and expansive training datasets has driven unprecedented growth in AI's utility, especially in the healthcare sector. Bolstered by the coronavirus disease 2019 (COVID-19) pandemic, the number of patients seeking medical advice online is higher than ever.
AI chatbots are software applications that leverage generative AI models to respond to user queries in easily digestible language without the need for human agents. Numerous AI chatbots exist, with OpenAI's ChatGPT, Google's Bard (now 'Gemini'), and Microsoft's Bing Chat (now 'Copilot') among the most widely used. ChatGPT alone has been reported to have more than 200 million users and to generate more than 1.7 billion responses monthly, less than two years after its public release. While anecdotal evidence from both users and experts suggests that chatbots substantially outperform conventional search engine results in answering common medical questions, these claims had not previously been formally investigated.
About the study
The present study aims to fill this gap in the literature by using expert human judgment to evaluate chatbot responses to common urological questions regarding the vasectomy procedure. Given their widespread use (over 100 million users), the chatbots under investigation were ChatGPT-3.5, Google Bard, and Bing Chat.
Data for the study were obtained in a single session by having three registered urologists rate chatbot responses to 10 common vasectomy questions on a four-point scale. The questions were selected from an independently generated bank of 30 questions.
"Responses were rated as 1 (excellent response not requiring clarification), 2 (satisfactory requiring minimal clarification), 3 (satisfactory requiring moderate clarification), or 4 (unsatisfactory requiring substantial clarification). Scores of 1 were those that provided a level of detail and evidence that is comparable to what is reported in the current literature whereas scores of 4 were assigned if the answers were considered incorrect or vague enough to invite potential misinterpretation."
Following the ratings, statistical analyses, including a one-way analysis of variance (ANOVA) and Tukey's honestly significant difference (HSD) test, were used to elucidate differences between the chatbots' outcomes. The results showed that ChatGPT's scores differed significantly from both Bard's and Bing's (p=0.00005 and p=0.03988, respectively), while the difference between Bard and Bing was not statistically significant (p=0.09651).
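For readers unfamiliar with the method, a one-way ANOVA asks whether the mean ratings of the three chatbots differ more than within-group noise would explain. The minimal pure-Python sketch below computes the ANOVA F-statistic by hand; the rating lists are hypothetical placeholders (the study's per-item raw data were not published), so the resulting F value is illustrative only.

```python
def one_way_anova_f(groups):
    """Return the F-statistic for a one-way ANOVA across groups of ratings."""
    all_vals = [x for g in groups for x in g]
    n_total = len(all_vals)
    k = len(groups)
    grand_mean = sum(all_vals) / n_total

    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group sum of squares: spread of ratings around their own group mean
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
    )

    ms_between = ss_between / (k - 1)       # between-group mean square
    ms_within = ss_within / (n_total - k)   # within-group mean square
    return ms_between / ms_within

# Hypothetical 1-4 ratings: 30 per chatbot (3 raters x 10 questions), NOT the
# study's actual data
chatgpt = [1, 1, 2, 1, 1, 2, 1, 2, 1, 1] * 3
bing    = [2, 2, 1, 2, 3, 2, 1, 2, 2, 1] * 3
bard    = [2, 3, 2, 2, 4, 2, 3, 2, 1, 2] * 3

f_stat = one_way_anova_f([chatgpt, bing, bard])
```

A large F-statistic indicates that at least one chatbot's mean rating differs from the others; the study's follow-up Tukey HSD test (omitted here) then identifies which specific pairs differ, yielding the pairwise p-values reported above.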
Study findings
The ChatGPT model performed the best of the three evaluated, with a mean score of 1.367 (lower is better) from a total of 41 points across 30 ratings (three raters for each of the ten questions). In comparison, Bing achieved a mean score of 1.800 (total = 54), and Bard a mean score of 2.167 (total = 65). Notably, Bing's and Bard's scores were statistically indistinguishable.
Results were similar in consistency evaluations, where ChatGPT again topped the scores: it was the only chatbot to receive unanimous 'excellent' (score = 1) ratings from all three experts, and it did so for three separate questions. In contrast, the worst score received was a single expert rating one of Bard's responses 'unsatisfactory' (score = 4) for the question, 'Does a vasectomy hurt?'
"The question that received the highest score on average was "Do vasectomies affect testosterone levels?" (Mean score 2.22 ± 0.51) and the question that received the lowest score on average was "How effective are vasectomies as contraception?" (Mean score 1.44 ± 0.56)."
Conclusions
The present study is the first to scientifically evaluate the performance of three commonly used AI chatbots (with significant differences in their underlying ML models) in answering patients' medical questions. Herein, experts scored chatbot responses to frequently asked questions regarding the vasectomy procedure.
In contrast to the general advice of 'Do not Google your medical questions,' all evaluated AI chatbots received overall positive ratings, with mean scores ranging from 1.367 (ChatGPT) to 2.167 (Bard) on a four-point scale (1 = excellent, 4 = unsatisfactory; lower is better). ChatGPT performed the best of the three models and was the most consistently reliable, with three unanimous 'excellent' ratings. While Bard did receive an isolated 'unsatisfactory' rating for a single question, this occurred only once and may be considered an outlier.
Together, these findings highlight AI chatbots as accurate and effective sources of information for patients seeking educational advice on common medical conditions, potentially reducing the burden on medical practitioners and the monetary expenditure (consultation fees) for the general public. However, the study also highlights methodological concerns, particularly the non-blinded assessments and the small number of reviewers, which could have introduced bias into the results.