The ability of the AI chatbot GPT-4 to appropriately perform probabilistic reasoning in diagnosis vs a large survey of human clinicians

In a recent study published in JAMA Network Open, researchers evaluated the proficiency of the Generative Pre-trained Transformer 4 (GPT-4) artificial intelligence (AI) model in probabilistic reasoning, comparing its pretest and posttest probability estimates in diagnostic cases with those of human clinicians.

Study: Artificial Intelligence vs Clinician Performance in Estimating Probabilities of Diagnoses Before and After Testing. Image Credit: Rokas Tenys/Shutterstock.com

Background 

Diagnosing disease requires estimating the probability of different illnesses from the presenting symptoms (the pretest probability) and then updating those estimates in light of diagnostic test results (the posttest probability).
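
In quantitative terms, this updating step follows Bayes' theorem in its odds form: the pretest probability is converted to odds, multiplied by the test's likelihood ratio, and converted back to a probability. The short Python sketch below illustrates that standard calculation; the numbers in the example are purely illustrative and are not taken from the study.

def posttest_probability(pretest_prob: float, likelihood_ratio: float) -> float:
    """Update a pretest probability using a test's likelihood ratio (Bayes' theorem in odds form)."""
    pretest_odds = pretest_prob / (1.0 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1.0 + posttest_odds)

# Illustrative example (hypothetical values): a 20% pretest probability combined with a
# positive test whose likelihood ratio is 10 gives roughly a 71% posttest probability.
print(round(posttest_probability(0.20, 10.0), 2))  # 0.71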

Nevertheless, clinicians find it difficult to estimate pretest and posttest probabilities accurately, whether in statistical exercises or in real patient cases. Large language models (LLMs) have shown promise in clinical reasoning, including tackling intricate diagnostic problems, passing medical examinations, and conducting empathetic patient interactions.

Further research is needed to explore the full potential and limitations of AI in complex, real-world diagnostic scenarios, as current studies show varying levels of AI performance in probabilistic reasoning compared to human clinicians.

About the study 

The present study analyzed the probabilistic reasoning performance of 553 practitioners using data from a national survey conducted between June 2018 and November 2019. The practitioners were evaluated across five cases, each with a scientific reference standard.

To assess the AI's capabilities in this domain, the researchers presented each case from the survey to the model, using prompts specifically designed to elicit committed estimates of the pretest and posttest probabilities.

Given the stochastic nature of LLMs, the team took steps to ensure the reliability of their findings. They ran an identical prompt through the LLM's application programming interface (API) 100 times at the model's default temperature setting, which balances creativity and consistency in the responses. This process, conducted in October 2023, produced a distribution of the AI's output responses.
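
As a rough illustration of this repeated-sampling procedure, the sketch below sends the same prompt to a chat-completion API 100 times and collects the returned probability estimates. It is a minimal Python sketch assuming the openai client library; the case prompt and the parsing of the reply are hypothetical placeholders, since the study's exact prompts and response handling are not reproduced in this article.

import re

from openai import OpenAI  # assumes the openai Python client is installed

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

CASE_PROMPT = "..."  # placeholder for one of the survey vignettes plus the probability question
N_RUNS = 100         # the study repeated the identical prompt 100 times

estimates = []
for _ in range(N_RUNS):
    # Temperature is deliberately left at the model's default, as in the study.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": CASE_PROMPT}],
    )
    reply = response.choices[0].message.content
    # Hypothetical parsing step: take the first percentage figure in the reply.
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", reply)
    if match:
        estimates.append(float(match.group(1)))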

To quantify the AI's performance, the researchers calculated the median and interquartile range (IQR) of the LLM's estimates. They also determined the mean absolute error (MAE) and mean absolute percentage error (MAPE) for both the AI and the human participants. The analysis and plots were produced in R, version 4.3.0. The University of Maryland's institutional review board deemed the study exempt, as it did not involve human participants, and the study adhered to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.
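
The article does not reproduce the study's exact formulas, but the summary statistics named above are conventionally computed as in the minimal NumPy sketch below, which assumes a list of probability estimates (in percent) and a single reference answer for a case; the example values are made up.

import numpy as np

def summarize_estimates(estimates, reference):
    """Median, interquartile range, MAE, and MAPE of estimates against a reference probability."""
    estimates = np.asarray(estimates, dtype=float)
    median = np.median(estimates)
    q1, q3 = np.percentile(estimates, [25, 75])
    errors = np.abs(estimates - reference)
    mae = errors.mean()
    mape = (errors / reference).mean() * 100.0  # expressed as a percentage of the reference
    return median, (q1, q3), mae, mape

# Illustrative usage with made-up estimates and a made-up reference value of 5%:
median, (q1, q3), mae, mape = summarize_estimates([20.0, 25.0, 30.0], reference=5.0)
print(f"median={median}, IQR={q1}-{q3}, MAE={mae}, MAPE={mape:.0f}%")
# median=25.0, IQR=22.5-27.5, MAE=20.0, MAPE=400%

Because MAPE divides each error by the reference probability, cases whose correct answer is a very small probability can yield percentage errors in the thousands, which helps explain the large MAPE figures reported below.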

Study results 

The comparison between the human clinicians and the LLM across the five diagnostic cases yielded intriguing findings on the estimation of pretest and posttest probabilities. In particular, the LLM consistently demonstrated lower error rates than the human practitioners when predicting probabilities after a negative test result.

A notable example was the case involving asymptomatic bacteriuria. Here, the LLM's median pretest probability estimate was 26% (IQR, 20%-30%), while the human clinicians' median estimate was slightly lower at 20%, but with a much broader IQR of 10%-50%. Although the LLM's median estimate was further from the correct answer than the humans', the LLM had a lower MAE and MAPE, at 26.2 and 5,240%, respectively.

In contrast, the figures for the human clinicians were higher: 32.2 for MAE and 6,450% for MAPE. This difference could be attributed to the LLM's narrower distribution of responses, which provided a more consistent range of estimates than the wider variability seen in the human responses.

The LLM's estimation of posttest probability following a positive test result was also notable but inconsistent. For instance, in the breast cancer case and in a hypothetical testing scenario, the LLM surpassed the clinicians in accuracy, suggesting that it may have understood or handled these particular scenarios better.

In two other cases, the AI's performance was similar to that of the human clinicians, suggesting competence comparable to that of trained medical personnel. Nonetheless, in one case the LLM's accuracy was lower than that of the humans, highlighting areas in which its diagnostic capabilities could still be improved.

These findings underscore the potential of AI, specifically LLMs, in the realm of medical diagnostics. The LLM's ability to often match or exceed human performance in estimating diagnostic probabilities showcases the advances in AI technology and its applicability in healthcare. However, the varied performance across cases also indicates the need for continued refinement and a clearer understanding of AI's role and limitations in complex medical decision-making.

Journal reference:

Artificial Intelligence vs Clinician Performance in Estimating Probabilities of Diagnoses Before and After Testing. JAMA Network Open.

