In a recent study published in the journal Eye, researchers from Canada evaluated the performance of two Google artificial intelligence (AI) chatbots, Gemini and Bard, on practice questions for the ophthalmology board certification examination.
They found that both tools achieved acceptable accuracy and performed well in the field of ophthalmology, with some variation in responses across countries.
Study: Google Gemini and Bard artificial intelligence chatbot performance in ophthalmology knowledge assessment.
Background
AI chatbots such as ChatGPT (Chat Generative Pre-trained Transformer), Bard, and Gemini are increasingly used in medical settings, and their performance on medical examinations continues to improve across disciplines.
While ChatGPT-3.5's accuracy reached up to 64% in Steps 1 and 2 of the AMBOSS and NBME (National Board of Medical Examiners) examinations, newer versions such as ChatGPT-4 showed improved performance.
Google's Bard and Gemini offer responses based on diverse cultural and linguistic training, potentially tailoring information to specific countries. However, their responses can vary across geographies, calling for further research to ensure consistency, particularly in medical applications where accuracy is crucial for patient safety.
In the present study, researchers aimed to evaluate the performance of Google Gemini and Bard on a set of practice questions designed for the ophthalmology board certification exam.
About the study
The performance of Google Gemini and Bard was assessed using 150 text-based multiple-choice questions obtained from “EyeQuiz,” an educational platform for medical professionals specializing in ophthalmology.
The portal provides practice questions for various exams, including the Ophthalmic Knowledge Assessment Program (OKAP), national board exams such as the American Board of Ophthalmology (ABO) exam, as well as certain postgraduate exams.
The questions were categorized manually, and data were collected using the Bard and Gemini versions available as of 30th November and 28th December 2023, respectively. The accuracy, provision of explanations, response time, and question length were assessed for both tools.
Secondary analyses evaluated the chatbots' performance in countries other than the United States (US), namely Vietnam, Brazil, and the Netherlands, accessed via virtual private networks (VPNs).
Statistical tests, including the chi-square and Mann-Whitney U tests, were conducted to compare performance across countries and chatbot models. Multivariable logistic regression was used to explore factors influencing correct responses.
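To illustrate how such an analysis might be reproduced from a per-question results table, a minimal sketch in Python is shown below. This is not the authors' code; the column names (chatbot, correct, resp_sec, q_length) and the example values are hypothetical.

```python
# Minimal sketch of the reported comparisons, assuming one row per question per chatbot.
import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu
import statsmodels.formula.api as smf

# Hypothetical per-question results (placeholder values, not study data).
df = pd.DataFrame({
    "chatbot":  ["Bard"] * 150 + ["Gemini"] * 150,
    "correct":  [1, 0, 1] * 50 + [1, 1, 0] * 50,              # 1 = correct answer
    "resp_sec": [7.0, 6.5, 8.2] * 50 + [7.4, 6.9, 7.8] * 50,  # response time in seconds
    "q_length": [420, 515, 380] * 100,                        # question length in characters
})

# Chi-square test: does the proportion of correct answers differ between chatbots?
table = pd.crosstab(df["chatbot"], df["correct"])
chi2, p_chi, _, _ = chi2_contingency(table)

# Mann-Whitney U test: do response times differ between chatbots?
bard_times = df.loc[df["chatbot"] == "Bard", "resp_sec"]
gemini_times = df.loc[df["chatbot"] == "Gemini", "resp_sec"]
u_stat, p_mw = mannwhitneyu(bard_times, gemini_times)

# Multivariable logistic regression: which factors predict a correct response?
model = smf.logit("correct ~ C(chatbot) + q_length", data=df).fit(disp=False)

print(f"chi-square p={p_chi:.3f}, Mann-Whitney p={p_mw:.3f}")
print(model.summary())
```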
Results and discussion
Bard and Gemini responded promptly and consistently to all 150 questions, with no interruptions due to high demand. In the primary analysis using the US versions, Bard took 7.1 ± 2.7 seconds to respond, while Gemini responded in 7.1 ± 2.8 seconds but produced longer responses on average.
Both Bard and Gemini achieved an accuracy of 71%, correctly answering 106 of the 150 questions. Bard provided explanations for 86% of its responses, while Gemini provided explanations for all of them.
Bard performed best in orbital & plastic surgery, while Gemini showed superior performance in general ophthalmology, orbital & plastic surgery, glaucoma, and uveitis. However, both tools struggled in the cataract & lenses and refractive surgery categories.
In the secondary analysis, Bard accessed from Vietnam answered 67% of questions correctly, comparable to the US version; however, it selected different answer choices for 21% of the questions.
Gemini accessed from Vietnam answered 74% of questions correctly, again comparable to the US version, with answer choices differing for 15% of the questions. In both cases, some questions answered incorrectly by the US versions were answered correctly by the Vietnam versions, and vice versa.
The Vietnam versions of Bard and Gemini explained 86% and 100% of their responses, respectively. Bard performed best in retina & vitreous and orbital & plastic surgery (80% accuracy), while Gemini performed better in cornea & external disease, general ophthalmology, and glaucoma (87% accuracy each).
Bard struggled most in cataract & lenses (40% accuracy), while Gemini faced challenges in pediatric ophthalmology & strabismus (60% accuracy). Gemini's performance in Brazil and the Netherlands was relatively inferior to that of the US and Vietnam versions.
Despite the promising findings, the study has limitations, including the small question sample size, reliance on an openly accessible question bank, the unexplored effects of user prompts, internet speed, and website traffic on response times, and occasional incorrect explanations provided by the chatbots.
Future studies could assess the chatbots' ability to interpret ophthalmic images, which remains relatively unexplored. Further research is warranted to address these limitations and investigate additional applications in the field.
Conclusion
In conclusion, although both the US and Vietnam versions of Bard and Gemini demonstrated satisfactory performance on ophthalmology practice questions, the study highlights potential response variability linked to user location.
Future evaluations tracking the improvement of AI chatbots, along with comparisons between AI chatbots and ophthalmology residents, could offer valuable insights into their efficacy and reliability.