In a recent study posted to the medRxiv* preprint server, researchers evaluated the diagnostic accuracy of ChatGPT.
Recent years have seen a significant increase in the number of people seeking medical advice online. Many individuals look for a probable diagnosis by searching the web for literature on the symptoms they experience. Chatbots built on generative pre-trained transformer (GPT) models, such as ChatGPT, could revolutionize the field of medicine and enable self-diagnosis by providing information on symptoms and the differential diagnoses of medical conditions.
Study: ChatGPT as a medical doctor? A diagnostic accuracy study on common and rare diseases. Image Credit: metamorworks / Shutterstock
*Important notice: medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice/health-related behavior, or treated as established information.
About the study
In the present study, researchers investigated whether ChatGPT could accurately diagnose various clinical cases.
The team used 50 clinical case vignettes, comprising 40 commonly observed cases and 10 rare cases. The 10 rare cases were generated by randomly selecting rare diseases with an orphan drug holding positive status from the European Medicines Agency (EMA). The names of the rare diseases were used as queries in the PubMed database, and the case description from the first matching article was used for the analysis.
For common complaints, 40 of the 45 initially obtained case vignettes were used; five cases in which the diagnosis was already named within the symptom description were excluded. The team queried ChatGPT for the 10 most probable diagnoses for each clinical case vignette, entered as full text; no symptom extraction was performed.
Each vignette was prompted three times in independent chat sessions. Two versions of ChatGPT were used, ChatGPT 3.5 and ChatGPT 4.0, yielding a total of 300 prompts (50 vignettes × 3 repetitions × 2 versions) and 3,000 suggested medical diagnoses. A human doctor compared the ChatGPT-suggested diagnoses with the correct diagnoses for the respective case vignettes.
Cases were considered correctly diagnosed in the case of a direct match (e.g., ‘acute otitis media’ diagnosed by the chatbot as ‘acute otitis media’) or if ChatGPT suggested a diagnosis with a direct hierarchical relation to the correct one (e.g., ‘acute pharyngitis’ for ‘pharyngitis’, ‘GM2 gangliosidosis’ for Tay-Sachs disease, and ‘ischemic stroke’ for ‘stroke’).
The precision of the indicated diagnoses was expressed as top-X accuracy, representing the percentage of cases solved within the first X indicated diagnoses. For example, a top-1 diagnostic accuracy of 100.0% would mean that every clinical case vignette was solved by the first suggested diagnosis. If seven of 10 cases were solved by the first indicated diagnosis and one additional case by the second, the top-1 and top-2 accuracies would be 70.0% and 80.0%, respectively. In addition, Fleiss' kappa tests were performed to determine the level of agreement between the diagnoses indicated by ChatGPT and the correct diagnoses.
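As a rough illustration of how a top-X metric of this kind can be computed, the following minimal Python sketch (not taken from the study; the exact-string matching and the toy data are hypothetical simplifications, since the study relied on a human doctor to judge matches) reproduces the worked example above:

from typing import List

def top_x_accuracy(ranked_suggestions: List[List[str]], correct: List[str], x: int) -> float:
    # Percentage of cases whose correct diagnosis appears among the first x suggestions.
    solved = sum(
        1 for suggestions, truth in zip(ranked_suggestions, correct)
        if truth in suggestions[:x]
    )
    return 100.0 * solved / len(correct)

# Hypothetical toy data mirroring the worked example: of 10 cases, 7 are solved by the
# first suggestion, 1 more by the second, and 2 remain unsolved within two suggestions.
suggestions = (
    [["acute otitis media", "sinusitis"]] * 7
    + [["sinusitis", "acute otitis media"]] * 1
    + [["migraine", "tension headache"]] * 2
)
truths = ["acute otitis media"] * 8 + ["cluster headache"] * 2

print(top_x_accuracy(suggestions, truths, 1))  # 70.0
print(top_x_accuracy(suggestions, truths, 2))  # 80.0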
Results
ChatGPT 4.0 solved all 40 commonly observed cases within two suggested diagnoses. For rare cases, ChatGPT 4.0 needed at least eight diagnostic suggestions to solve 90% of cases. For common cases, ChatGPT 4.0 performed consistently better than ChatGPT 3.5 across all prompts; the top-2 accuracy of ChatGPT 3.5 exceeded 90.0%, and the top-3 accuracy of ChatGPT 4.0 was 100.0%.
Thus, ChatGPT 3.5 solved more than 90.0% of common cases within two indicated diagnoses, and ChatGPT 4.0 solved all of them within three. The results for the 4.0 version were significantly better than those for the 3.5 version, and the diagnoses indicated by ChatGPT showed significant agreement with the correct medical diagnoses.
For rare cases, ChatGPT 3.5 was 60.0% accurate within the 10 diagnoses indicated by the chatbot (top-10 accuracy), and only 23.0% of the correct diagnoses were listed as the first result. The 4.0 version performed better than the 3.5 version; nevertheless, the diagnostic accuracy of ChatGPT 4.0 for rare cases remained far below that observed for common cases.
Among rare cases, 40.0% were solved by the first indicated diagnosis; however, at least eight diagnostic suggestions were required to reach a diagnostic accuracy of 90.0%. Neither model version reached 100% accuracy for rare cases in a single run. However, not a single case remained unsolved overall: prompting ChatGPT 4.0 three times yielded 3 × 10 = 30 diagnostic suggestions per case, and these included the correct diagnosis at least once for every case.
The findings indicated that running the models repeatedly on the same input prompt could improve diagnostic accuracy. Fleiss' kappa indicated good agreement for the common cases and moderate agreement for the rare cases. ChatGPT 4.0 stated the correct diagnosis both directly and indirectly in its initial and subsequent results, justified the indicated diagnoses by relating them to laboratory test values, and provided alternative diagnoses for the symptoms experienced.
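The pooling of repeated runs described above can be sketched in a few lines of Python. This is only an illustration under assumed names: query_model is a hypothetical stand-in for a call that returns 10 ranked diagnoses, and the exact-match check simplifies the human matching used in the study.

from typing import Callable, List

def pooled_hit(prompt: str, correct_diagnosis: str,
               query_model: Callable[[str], List[str]], runs: int = 3) -> bool:
    # Prompt the model several times in independent runs and check whether the
    # correct diagnosis appears in at least one run's suggestion list.
    pooled = set()
    for _ in range(runs):
        pooled.update(query_model(prompt))
    return correct_diagnosis in pooled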
To conclude, based on the study findings, ChatGPT could be a valuable tool to support human medical consultations in the diagnosis of complicated cases. The results suggest that ChatGPT 4.0 semantically interprets medical diagnoses rather than merely copying them from research papers, web pages, or books. Despite its good accuracy in diagnosing common cases, ChatGPT must be used cautiously by non-healthcare professionals, and medical doctors must be consulted before drawing conclusions about any clinical condition, as stated by the chatbot itself.