New study finds GPT-4 matches radiologists in diagnosing brain tumors from MRI reports, with impressive accuracy in differential diagnoses.
Study: Comparative analysis of GPT-4-based ChatGPT’s diagnostic performance with radiologists using real-world radiology reports of brain tumors.
A recent study published in European Radiology compared the diagnostic performance of Generative Pretrained Transformer 4 (GPT-4) with radiologists using brain tumor reports.
Background
Large language models (LLMs) have dominated global technology discourse, and the advent of ChatGPT has made it simple to use these models conversationally. Among LLMs, the GPT series has received particular attention, and its potential for diagnosis from imaging is notable.
Two studies have demonstrated the potential of GPT-4 for differential diagnosis in neuroradiology. Although these studies suggested a vital role for GPT-4 in radiological diagnosis, none has evaluated its performance using real-world radiology reports.
About the study
In the present study, researchers examined the diagnostic capability of GPT-4 using real-world radiology reports. ChatGPT (based on GPT-4) was prompted with imaging findings from real reports and asked to provide final and differential diagnoses.
For comparison, the same findings were presented to radiologists. Four general radiologists and three neuroradiologists participated; here, general radiologists refers to radiologists specializing in areas other than imaging diagnosis.
One general radiologist and one neuroradiologist reviewed the collected findings, while the others took the reading tests. Brain magnetic resonance imaging (MRI) findings of preoperative tumors were collected from two institutions.
The imaging findings were verified by a general radiologist and a neuroradiologist. Any diagnoses stated in the findings were removed, but information on the reporter type (general radiologist or neuroradiologist) was retained.
The MRI reports were translated from Japanese to English. ChatGPT was asked to provide three possible diagnoses from the imaging findings, and the top-listed of the three was treated as the final diagnosis.
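To make the setup concrete, a minimal sketch of how such a prompt might be issued through the OpenAI Python SDK is shown below; the prompt wording, model identifier, and example findings are assumptions for illustration, not the study's actual protocol.

```python
# Minimal sketch (not the study's actual code): submit imaging findings to a
# GPT-4 model and request three candidate diagnoses ranked by likelihood.
# The prompt wording, model name, and example findings are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

findings = (
    "Dural-based extra-axial mass along the left convexity with homogeneous "
    "contrast enhancement and a dural tail sign."  # hypothetical example
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a radiology assistant."},
        {
            "role": "user",
            "content": (
                "Based on the following brain MRI findings, list the three "
                "most likely diagnoses in order of likelihood:\n" + findings
            ),
        },
    ],
)

# Mirroring the study's scoring rule, the first-listed diagnosis would be
# treated as the final diagnosis and the full list as the differential.
print(response.choices[0].message.content)
```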
The same imaging findings were provided to two neuroradiologists and three general radiologists; these readers were different from the radiologists who had prepared the input reports.
Radiologists’ interpretations and LLM output were assessed against the pathological diagnosis of the tumor. McNemar’s test compared the diagnostic accuracy of differential and final diagnoses between GPT-4 and each radiologist.
In addition, separate analyses were performed based on whether a general radiologist or neuroradiologist prepared the input report. Fisher’s exact test compared the diagnostic accuracy between GPT-4 and all radiologists.
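These two comparisons can be illustrated with a short sketch; the per-case outcomes and counts below are made up for illustration, since the article reports only aggregate accuracies.

```python
# Illustrative sketch of the statistical comparisons described above, using
# fabricated per-case outcomes (True = correct diagnosis); the actual
# case-level data are not reported in the article.
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
gpt4_correct = rng.random(150) < 0.73         # hypothetical per-case results
radiologist_correct = rng.random(150) < 0.65  # hypothetical per-case results

# McNemar's test: paired comparison of GPT-4 against one radiologist on the
# same 150 cases, built from concordant/discordant correctness counts.
table = np.array([
    [np.sum(gpt4_correct & radiologist_correct), np.sum(gpt4_correct & ~radiologist_correct)],
    [np.sum(~gpt4_correct & radiologist_correct), np.sum(~gpt4_correct & ~radiologist_correct)],
])
print("McNemar p-value:", mcnemar(table, exact=True).pvalue)

# Fisher's exact test: GPT-4's overall accuracy versus the radiologists pooled
# together (illustrative counts of correct vs. incorrect diagnoses).
pooled = np.array([
    [110, 40],   # GPT-4: correct, incorrect
    [540, 210],  # all five radiologists combined: correct, incorrect
])
_, fisher_p = fisher_exact(pooled)
print("Fisher p-value:", fisher_p)
```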
Findings
In total, 150 radiology reports were included; 94 were of female subjects. Pathologies included meningioma, pituitary adenoma, angioma, schwannoma, high- and low-grade glioma, sarcoma, lymphoma, and hemangioblastoma, among others. The accuracy of the final diagnosis was comparable between GPT-4 and radiologists.
The accuracy rate of GPT-4 for final diagnosis was 73%; in comparison, accuracy rates were 65% for one neuroradiologist and two general radiologists, 73% for one neuroradiologist, and 79% for one general radiologist. Further, GPT-4 achieved an accuracy of 94% for differential diagnoses compared to radiologists, whose accuracies ranged from 73% to 89%.
Notably, GPT-4’s final-diagnosis accuracy differed significantly depending on whether a general radiologist or a neuroradiologist had prepared the imaging findings: it was 80% when the reporter was a neuroradiologist and 60% when the reporter was a general radiologist.
Conclusions
The study compared the diagnostic performance of GPT-4 and five radiologists using brain MRI findings from 150 cases. GPT-4 was 73% accurate in listing the final diagnosis, while radiologists’ accuracies ranged between 65% and 79%.
GPT-4 was 94% accurate for differential diagnoses, while radiologists achieved 73%–89% accuracy. Notably, GPT-4 had significantly higher accuracy for the final diagnosis when a neuroradiologist prepared the input reports.
However, there were no significant differences for differential diagnoses, regardless of the reporter type. The study used textual information only and did not assess the effect of including other information, such as MRI images and patient history. Further, GPT-4’s performance was evaluated in only one language; how it varies in different languages remains unknown.