A new study shows AI can match or exceed physicians on challenging diagnostic tasks. However, key questions remain about how these systems will perform in real clinical care and decision-making.
Study: Performance of a large language model on the reasoning tasks of a physician. Image credit: MUNGKHOOD STUDIO/Shutterstock.com
In a recent study published in Science, researchers conducted a comprehensive evaluation of the OpenAI o1 large language model (LLM) against hundreds of physicians to test its clinical reasoning on complex tasks. Data were gathered across five experimental benchmarks and a real-world emergency department study, ranging from "gold standard" medical puzzles to unstructured emergency room encounters.
Study findings revealed that the artificial intelligence (AI) model generally outperformed human physician baselines across multiple tasks, suggesting that advanced models may have now surpassed many established benchmark tests of clinical reasoning. This study suggests that, in the near future, AI could move beyond information retrieval to provide sophisticated, reliable clinical second opinions.
Earlier AI struggled with real-world patient complexity
Since the 1950s, the medical community has sought computational systems capable of the nuanced logic required to diagnose complex diseases. For more than 65 years, the New England Journal of Medicine (NEJM) clinicopathological case conference (CPC) series, a collection of complex, real-life diagnostic puzzles, has served as the ultimate test for such systems.
The modern age of AI has promised new generations of these clinical-reasoning-capable systems. However, reviews of the field show that early AI attempts relied on rigid, symbolic rules that struggled with the "messy" reality of patient care.
Furthermore, while previous generations of LLMs (AI systems trained on massive amounts of text to predict and generate human-like language) showed promise, their evaluations often lacked a human physician baseline for comparison. Now that novel LLMs are beginning to demonstrate "benchmark saturation," researchers aim to determine whether these models can truly reason through clinical uncertainty or merely regurgitate memorized facts.
Large-scale comparison of AI against physician performance
The present study aimed to investigate whether the latest generation of AI models (specifically OpenAI’s o1-preview model) could match or exceed the performance of human experts across multiple distinct clinical diagnostic and management challenges. The study’s methodologically diverse testing environments included 143 traditional diagnostic puzzles from the NEJM CPC series, used to evaluate diagnostic accuracy.
Similarly, 20 encounters from the NEJM Healer curriculum - a digital platform for assessing clinical logic - were used to score the model's reasoning process. Real-world performance was measured in a Boston-based, blinded study in which o1 was tested against two expert attending physicians using 76 unstructured patient records collected directly from a major academic emergency department (ED).
Notably, the model's performance was compared against datasets spanning hundreds of practitioners, including residents (doctors in training) and attending physicians (senior experts). Statistical analysis used the Bond score to measure diagnostic accuracy and the Revised-IDEA (R-IDEA) score, a validated 10-point scale for evaluating how well a clinician documents their clinical reasoning, to assess the quality of the model's thought process.
AI surpasses physician benchmarks across diverse clinical tasks
The study’s statistical analyses of the NEJM evaluation data revealed largely consistent findings: the AI repeatedly outperformed human baselines. In the NEJM CPC challenges, for example, o1-preview included the correct diagnosis in its differential 78.3% of the time. When compared head-to-head on the same 70 cases previously used to evaluate GPT-4, o1-preview achieved 88.6% accuracy, significantly higher than GPT-4’s 72.9% (P = 0.015).
The AI’s management reasoning - the ability to decide on the next best step for a patient - was particularly impressive. On a set of five complex vignettes, o1-preview achieved a median score of 89%. In contrast, physicians using conventional resources like search engines and medical databases scored a median of only 34% (P < 0.001).
In the real-world emergency department (ED) experiment, the gap between the o1 model and its human expert competitors was most pronounced at the "initial triage" stage. This stage is clinically considered a high-stakes moment, as it occurs when a patient first arrives, information is scarce, and quick decisions are vital.
Here, the o1 model identified the correct diagnosis 67.1% of the time, while the two expert physicians achieved 55.3% and 50.0%, respectively. Furthermore, in the NEJM Healer cases, the AI achieved a perfect R-IDEA score in 78 out of 80 instances, outperforming both residents and attendings (P < 0.0001).
However, not all comparisons showed statistically significant improvements, and in some tasks, performance was comparable to prior models or physicians. The authors also noted that both human and AI performance improved as more clinical information became available, and that model outputs still exhibited uncertainty.
AI reaches high-level performance on clinical reasoning benchmarks
The present study is likely the first to conclude that LLMs have now reached a level of computational and reasoning advancement that enables them to provide high-level diagnostic support on benchmark tasks.
However, the authors note important limitations: the study focused on text-only inputs, whereas real-world medicine is "multimodal," involving visual cues, physical exams, and the patient's voice. Additionally, the tests focused on internal and emergency medicine, so the results may not generalize to fields such as surgery. The authors also emphasize that some evaluations rely on curated or educational cases, which may overestimate performance compared to real-world clinical workflows.
Despite these caveats, the researchers argue that the rapid improvement of these tools underscores the urgent need for prospective clinical trials to test their clinical applicability in real-world patient care settings and to better understand how clinicians and AI systems may work together.