A study published in JAMA Network Open claims that the quality of artificial intelligence (AI)-generated responses to patient eye care questions is comparable to that of responses written by certified ophthalmologists.
Background
Large language models, including bidirectional encoder representations from transformers (BERT) and generative pre-trained transformer 3 (GPT-3), have extensively transformed natural language processing by helping computers interact with texts and spoken words like humans. This has led to the generation of chatbots.
A large amount of text and spreadsheet data related to natural language processing tasks is used to train these models. In healthcare, these models are widely used for various purposes, including prediction of hospital stay duration, categorization of medical images, summarization of medical reports, and identification of patient-specific electronic health record notes.
ChatGPT is regarded as a powerful large language model. The model was designed to specifically generate natural and contextually appropriate responses in a conversational setting. Since its release in November 2022, the model has been used for simplifying radiology reports, writing hospital discharge summaries, and transcribing patient notes.
Given their enormous benefits, large language models are rapidly entering clinical settings. However, incorporating these models into routine clinical practice requires proper validation of model-generated data by physicians. This is particularly important to avoid delivering misleading information to patients and family members seeking healthcare advice.
In this study, scientists compared the efficacy of certified ophthalmologists and AI-based chatbots in generating accurate and useful responses to patient eye care questions.
Study design
The analysis used data collected from the Eye Care Forum, an online platform where patients can ask detailed eye care-related questions and receive answers from American Academy of Ophthalmology (AAO)-certified physicians.
The quality assessment of the collected dataset led to the selection of 200 question-answer pairs for the final analysis. The eye care responses (answers) included in the final analysis were provided by the top ten physicians in the forum.
ChatGPT (OpenAI) version 3.5 was used in the study to generate eye care responses in a style similar to that of human-written responses. The model was given explicit instructions about the task of responding to the selected eye care questions, in the form of a specially crafted input prompt, so that it could adapt its behavior accordingly.
This led to the generation of a question-answer dataset where each question had one ophthalmologist-provided response and one ChatGPT-generated response. The comparison between these two types of responses was done by a masked panel of eight AAO-certified ophthalmologists.
The panel members were also asked to determine whether the responses contained correct information, whether the responses could cause harm (and, if so, how severe that harm would be), and whether the responses were aligned with the perceived consensus in the medical community.
Important observations
The 200 questions included in the study had an average length of 101 words. The average length of ChatGPT responses (129 words) was significantly greater than that of physician responses (77 words).
Taken together, the members of the expert panel distinguished ChatGPT responses from physician responses with a mean accuracy of 61%. The accuracies of individual members ranged from 45% to 74%. A high percentage of responses were rated by the expert panel as “definitely ChatGPT-generated.” However, about 40% of these responses were actually written by physicians.
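The accuracy figures above are simple proportions: for each panel member, the share of responses whose source was guessed correctly, then the mean and range across members. The sketch below illustrates that arithmetic only. The function names, labels, and toy ratings are hypothetical and are not taken from the study, and it assumes every member rates every response, which may not match the study's actual protocol.

```python
from statistics import mean


def rater_accuracy(guesses, truth):
    """Fraction of responses whose source this rater identified correctly."""
    if len(guesses) != len(truth):
        raise ValueError("guesses and truth must have the same length")
    correct = sum(g == t for g, t in zip(guesses, truth))
    return correct / len(truth)


def panel_summary(panel_guesses, truth):
    """Per-rater accuracy plus the mean and range across the panel."""
    per_rater = {
        name: rater_accuracy(guesses, truth)
        for name, guesses in panel_guesses.items()
    }
    values = list(per_rater.values())
    return {
        "per_rater": per_rater,
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical toy data: the true source of four responses ...
truth = ["chatgpt", "physician", "chatgpt", "physician"]

# ... and the guesses of two made-up raters (not data from the study).
panel = {
    "rater_a": ["chatgpt", "physician", "physician", "physician"],  # 3 of 4 correct
    "rater_b": ["chatgpt", "chatgpt", "physician", "physician"],    # 2 of 4 correct
}

summary = panel_summary(panel, truth)
print(summary)
```

With two labels and equal numbers of each response type, a rater guessing at random would be expected to score about 50%, which is the baseline against which a mean accuracy of 61% can be read.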
According to the experts’ assessments, no significant difference was observed between ChatGPT and physician responses in terms of information accuracy, alignment with the perceived consensus in the medical community, and probability of causing harm.
Study significance
The study finds that ChatGPT is capable of analyzing long patient-written eye care questions and generating appropriate responses that are comparable to physician-written responses in terms of information accuracy, alignment with medical community standards, and probability of causing harm.
As the scientists note, despite these promising outcomes, large language models have potential disadvantages. These models are prone to generating incorrect information, commonly known as “hallucinations.” Some findings of this study also highlight the generation of hallucinated responses by ChatGPT. Such responses can be harmful to patients seeking eye care advice.
The scientists suggest that large language models should be used in clinical settings to assist physicians and not as a patient-facing AI that substitutes for their judgment.