In a recent study published in the journal Radiology, researchers performed a prospective exploratory analysis to assess the performance of the artificial intelligence (AI)-based chatbot ChatGPT on radiology board-style examination questions between February 25 and March 3, 2023.
Study: Performance of ChatGPT on a Radiology Board-style Examination: Insights into Current Strengths and Limitations.
Background
ChatGPT, based on GPT-3.5, is a general large language model (LLM) pre-trained on >45 terabytes of textual data using deep neural networks.
Though not trained specifically on medical data, ChatGPT has shown immense potential in medical writing and education. Accordingly, physicians are already using ChatGPT alongside search engines to look up medical information.
ChatGPT is under investigation for its potential to simplify radiology reports and aid clinical decision-making. It could also help educate radiology students, support differential and computer-aided diagnoses, and assist in disease classification.
ChatGPT recognizes relationships and patterns between words across its enormous training data to generate human-like responses.
Although it can generate factually incorrect responses, ChatGPT has so far performed exceptionally well on several professional examinations, e.g., the U.S. Medical Licensing Examination, without any domain-specific pretraining.
Though ChatGPT appears promising for applications in diagnostic radiology, including image analysis, its performance in the radiology domain remains unknown.
More importantly, radiologists must know the strengths and limitations of ChatGPT to use it confidently.
About the study
In the present study, researchers compiled 150 multiple-choice questions, each with one correct and three incorrect answer choices, matching the content, style, and difficulty level of the Canadian Royal College examination in diagnostic radiology and the American Board of Radiology Core and Certifying examinations.
These board examinations comprehensively assess conceptual knowledge of radiology and the ability to reason and make clinical judgments.
Two board-certified radiologists independently reviewed the questions and ensured they met specific criteria, e.g., the questions contained no images, and the incorrect answers were plausible and similar in length to the correct answer.
To ensure the questions comprehensively covered the field of radiology, each of nine topics listed by the Canadian Royal College accounted for at least 10% of the questions.
Two other board-certified radiologists classified the 150 multiple-choice questions by type, using Bloom taxonomy principles, as requiring lower-order or higher-order thinking.
The team entered all questions with their answer choices into ChatGPT to simulate real-world use and recorded all of its responses. The Royal College considers a score of at least 70% on all written components a pass.
Another two board-certified radiologists subjectively assessed the language of each ChatGPT response for its level of confidence on a one-to-four Likert scale, where a score of four indicated high confidence and one indicated no confidence.
Finally, the researchers also made qualitative observations of the behavior of ChatGPT when they prompted the model with the correct answer.
First, the researchers computed the overall performance of ChatGPT. Next, they used the Fisher exact test to compare its performance between question types and between topics (e.g., physics vs. clinical).
In addition, they performed subgroup analyses of the subclassifications of higher-order thinking questions, which the team had grouped into those involving description of imaging findings, clinical management, calculation and classification, application of concepts, and disease associations.
Lastly, they used the Mann-Whitney U test to compare confidence levels between correct and incorrect ChatGPT responses, with p-values less than 0.05 indicating a significant difference.
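As a rough illustration of how such comparisons can be carried out, the sketch below applies SciPy's Fisher exact and Mann-Whitney U tests to hypothetical counts and confidence ratings; the numbers are placeholders for illustration only and are not data from the study.

```python
# Illustrative sketch (not the study's code or data): running comparisons
# like these with SciPy. All counts and ratings below are hypothetical.
from scipy.stats import fisher_exact, mannwhitneyu

# Fisher exact test on a 2x2 contingency table of [correct, incorrect]
# counts for two question categories (e.g., lower- vs. higher-order thinking).
lower_order = [51, 10]   # hypothetical counts
higher_order = [53, 36]  # hypothetical counts
odds_ratio, p_type = fisher_exact([lower_order, higher_order])

# Mann-Whitney U test comparing 1-4 Likert confidence ratings between
# correct and incorrect responses (hypothetical ratings).
confidence_correct = [4, 4, 3, 4, 4, 3, 4]
confidence_incorrect = [4, 3, 4, 4, 3]
u_stat, p_conf = mannwhitneyu(confidence_correct, confidence_incorrect)

print(f"Fisher exact test: OR = {odds_ratio:.2f}, p = {p_type:.3f}")
print(f"Mann-Whitney U test: U = {u_stat:.1f}, p = {p_conf:.3f}")
```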
Study findings
In this study, ChatGPT scored 69% on radiology board-style examination questions without images, nearly passing the examination.
The model performed better on questions requiring lower-order thinking, involving knowledge recall and basic understanding, than on those requiring higher-order thinking (84% vs. 60%).
However, it performed well on higher-order questions related to clinical management (89%), likely because a large amount of disease-specific patient-facing data is available on the Internet.
It struggled with higher-order questions involving the description of imaging findings, calculation and classification, and application of concepts.
Also, ChatGPT performed poorly on physics questions relative to clinical questions (40% vs. 73%). ChatGPT used confident language consistently, even when incorrect (100%).
The tendency of ChatGPT to produce incorrect human-like responses with confidence is particularly dangerous if it is the sole source of information. This behavior limits the applicability of ChatGPT in medical education at present.
Conclusions
ChatGPT excelled on questions assessing basic knowledge and understanding of radiology, and without radiology-specific pretraining, it nearly passed (scored 69%) a radiology board–style examination without images.
However, radiologists must exercise caution and remain aware of the limitations of ChatGPT, including its tendency to present incorrect responses in confident language. In other words, the study findings do not support relying on ChatGPT for practice or education.
With future advancements in LLMs, the availability of applications built on LLMs with radiology-specific pretraining will increase. Overall, the study results are encouraging for the potential of LLM-based models like ChatGPT in radiology.