Autonomous medical AI outperforms doctors in simulated EHR cases


By Hugo Francisco de Souza | Reviewed by Susha Cheriyedath, M.Sc. | Jun 21 2026

A new study shows how MIRA turned clinical reasoning into structured EHR actions, outperforming physicians in simulated emergency cases while underscoring the safeguards needed before autonomous AI reaches real care.

MIRA is an autonomous medical AI agent that operates within an EHR sandbox, using a suite of tools to simulate clinical workflows: it can order tests, synthesize results and produce diagnoses and treatment plans while interacting through chat with a patient AI agent that is grounded in the documented HPI extracted from retrospective notes from real cases. Left, exemplary conversation between patient and MIRA with interleaved tool calls. Right, the FHIR-based architecture that executes tool calls and records medical outputs. Note: the data shown here are shortened and slightly modified to adhere to the privacy restrictions of the dataset.

In a recent study published in the journal Nature, researchers introduced MIRA, an autonomous artificial intelligence (AI) agent built to operate within sandboxed electronic health record (EHR) environments.

Unlike previous implementations, which predominantly comprised task-specific chat applications, MIRA was designed to independently ingest patient histories, order relevant diagnostic tests, and use these datasets to formulate diagnoses and treatment plans within a controlled simulation.

The study reports that MIRA achieved 88.9% diagnostic accuracy across 574 MIMIC-IV cases. In a matched 311-case physician comparison, it achieved 87.8% accuracy, significantly outperforming experienced physicians under identical simulated conditions while showing strong, though not perfect, safety and guideline performance.

Background

Large language models (LLMs) have already proven highly capable at passing standardized medical exams and answering complex clinical questions. However, reviews reveal that translating this raw clinical knowledge into the operational workflow of a hospital has remained a major challenge.

This discrepancy is attributed to the architectural design of traditional medical AI tools, which act as narrow, task-specific search or text-generation utilities rather than active partners in care.

In contrast, true clinical decision-making is characterized as an intricate, multi-step process in which doctors repeatedly interview patients, order blood tests or imaging, synthesize conflicting results, and update hypotheses before arriving at a final treatment plan.

Furthermore, nearly all of this clinical work happens within EHR systems that rely on complex, standardized coding protocols. Until now, it remained unproven whether an automated system could reliably handle this end-to-end clinical action space in a realistic EHR-style environment without unacceptable errors.

About the study

The present study aimed to address this functional gap by developing MIRA, a novel AI agent designed to autonomously access patients’ medical records, identify knowledge gaps, order diagnostic tests to fill those gaps, and then use the completed record to recommend clinical interventions.

The study subsequently tested MIRA’s capabilities in a sandboxed, virtual EHR environment compliant with standard healthcare protocols, including HL7 Fast Healthcare Interoperability Resources (FHIR). The sandboxed test was conducted on a curated benchmarking dataset of 574 real-world emergency department cases from the Medical Information Mart for Intensive Care (MIMIC-IV) database.

Included cases encompassed eight distinct diagnoses across surgery (for example, appendicitis), internal medicine (for example, pneumonia), and oncology (for example, pancreatic cancer), which MIRA navigated using 11 specialized digital tools with more than 85,000 operational choices. The agent was allowed to request physical examinations, order targeted laboratory values, look up medical histories, and generate medication orders within the simulated EHR, rather than in live patient care.
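To make the idea of a structured EHR action concrete, the following is a minimal sketch of how a laboratory order might be expressed as a FHIR ServiceRequest resource. The helper name `build_lab_order`, the patient identifier, and the test name are hypothetical and are not taken from the study. Only the resource type and the field names (`status`, `intent`, `code`, `subject`) follow the published FHIR R4 ServiceRequest structure.

```python
import json


def build_lab_order(patient_id: str, test_name: str) -> dict:
    """Return a minimal FHIR R4 ServiceRequest for a laboratory test.

    Hypothetical helper for illustration only; this is not MIRA's
    actual tool interface, which the article does not describe.
    """
    return {
        "resourceType": "ServiceRequest",
        "status": "active",  # the request is currently actionable
        "intent": "order",   # an order rather than a proposal or plan
        # Free-text label only; a production system would also attach
        # a coded entry from a standard terminology.
        "code": {"text": test_name},
        "subject": {"reference": f"Patient/{patient_id}"},
    }


# Hypothetical patient ID and test name, chosen purely for illustration.
order = build_lab_order("example-123", "C-reactive protein")
print(json.dumps(order, indent=2))
```

Representing each action as a structured resource, rather than as free text, is what would allow a sandbox to execute and audit the order in the same way as any other EHR entry.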

MIRA’s output was compared against two distinct human physician groups managing the exact same cases under identical conditions: (1) a cohort of four board-certified physicians and (2) a mixed-seniority team consisting of four residents and two board-certified doctors.

Additionally, a separate (conventional) text-based AI agent was used to simulate the patients under MIRA’s (or the human physician teams’) care. This agent was instructed to respond to questions posed by MIRA or its human counterparts solely based on authentic clinical histories, while resisting adversarial attempts to trick it into prematurely leaking information. The authors noted, however, that simulated patient speech may be more structured than real emergency department conversations.

Study findings

The study’s results showed that MIRA performed at or above the level of experienced human doctors. MIRA achieved 88.9% diagnostic accuracy across the full 574-case dataset and 87.8% accuracy in the matched 311-case physician comparison. In comparison, board-certified physicians reached an average accuracy of 78.1% (p < 0.001), while the mixed-seniority cohort averaged 71.1% (p < 0.001).
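The size of these gaps is easiest to read in percentage points. The short sketch below uses only the three accuracy figures reported above. The variable and function names are illustrative, and the results are absolute differences, not relative improvements or a re-analysis of the study's statistics.

```python
# Diagnostic accuracies reported in the article, in percent.
mira_matched = 87.8      # MIRA, matched 311-case physician comparison
board_certified = 78.1   # board-certified physicians (average)
mixed_seniority = 71.1   # mixed-seniority cohort (average)


def gap_in_points(a: float, b: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(a - b, 1)


gap_board = gap_in_points(mira_matched, board_certified)
gap_mixed = gap_in_points(mira_matched, mixed_seniority)

print(f"MIRA vs board-certified physicians: {gap_board} points")
print(f"MIRA vs mixed-seniority cohort: {gap_mixed} points")
```

On the reported figures, MIRA leads the board-certified group by 9.7 percentage points and the mixed-seniority group by 16.7 percentage points.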

Furthermore, MIRA excelled at identifying appendicitis and pancreatitis, achieving 100% recall for laparoscopic appendectomies. For pancreatic cancer, its diagnostic performance was equivalent to that of board-certified physicians, while pneumonia and urinary tract infections remained more challenging. Notably, MIRA did not achieve this superior accuracy by simply "ordering everything." Although it requested a broader, more comprehensive set of individual blood parameters than human doctors did, its overall test selection remained well below historical dataset baselines.

The study findings demonstrated that the AI model successfully avoided the systematic over-ordering of high-cost radiological imaging, matching or exceeding physicians in overall resource-alignment metrics.

Safety evaluations were similarly encouraging but still preliminary. An independent, blinded medical review of 56 patient-level outputs and a separate assessment of 468 prescriptions written by MIRA established that the agent caused zero high-severity drug-drug interactions, zero renal dosing incompatibilities, and zero medication-allergy mismatches. Route specification was the weakest prescription field, at 97% correctness.

Furthermore, when making critical hospital admission decisions for pneumonia and pulmonary embolism, MIRA achieved a perfect recall score of 1.00, indicating that the AI tool never missed a single patient who required inpatient care. However, the pulmonary embolism analysis also suggested a tendency toward over-admission, reflecting a cautious disposition strategy.
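A recall of 1.00 and a tendency toward over-admission can coexist because recall counts only missed patients, while unnecessary admissions are captured by precision. The sketch below illustrates this with made-up counts. The numbers are hypothetical and are not taken from the study.

```python
def recall(true_pos: int, false_neg: int) -> float:
    """Share of patients needing admission who were actually admitted."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos: int, false_pos: int) -> float:
    """Share of admitted patients who actually needed admission."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts for illustration only (NOT from the study):
# 20 patients needed admission and all 20 were admitted,
# while 5 further patients were admitted unnecessarily.
true_pos, false_neg, false_pos = 20, 0, 5

admission_recall = recall(true_pos, false_neg)
admission_precision = precision(true_pos, false_pos)

print(f"recall = {admission_recall:.2f}")
print(f"precision = {admission_precision:.2f}")
```

In this invented example, no patient who needed care is missed, so recall is 1.00, yet one in five admissions is unnecessary, so precision falls to 0.80. That is the pattern a cautious disposition strategy would produce.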

Conclusions

The present study introduces an integrated EHR AI agent (MIRA) that successfully translates clinical intents into structured, safe, and accurate operations, potentially supporting physicians. However, the authors caution that MIRA and similar AI agents are not replacements for expert human staff.

The model was not perfect in all treatment choices, such as specific antibiotic selections, highlighting the ongoing need for strict human supervision and patient-level safeguards. Future model iterations may improve performance by incorporating evidence from retrieval-based support, stronger governance, and prospective real-world validation before clinical deployment.


Journal reference:

Ferber, D., et al. (2026). Towards autonomous medical artificial intelligence agents. Nature. DOI: 10.1038/s41586-026-10675-5. https://www.nature.com/articles/s41586-026-10675-5





Citations

Please use one of the following formats to cite this article in your essay, paper or report:

APA
Francisco de Souza, Hugo. (2026, June 21). Autonomous medical AI outperforms doctors in simulated EHR cases. News-Medical. Retrieved on June 22, 2026 from https://www.news-medical.net/news/20260621/Autonomous-medical-AI-outperforms-doctors-in-simulated-EHR-cases.aspx.
MLA
Francisco de Souza, Hugo. "Autonomous medical AI outperforms doctors in simulated EHR cases". News-Medical. 22 June 2026. <https://www.news-medical.net/news/20260621/Autonomous-medical-AI-outperforms-doctors-in-simulated-EHR-cases.aspx>.
Chicago
Francisco de Souza, Hugo. "Autonomous medical AI outperforms doctors in simulated EHR cases". News-Medical. https://www.news-medical.net/news/20260621/Autonomous-medical-AI-outperforms-doctors-in-simulated-EHR-cases.aspx. (accessed June 22, 2026).
Harvard
Francisco de Souza, Hugo. 2026. Autonomous medical AI outperforms doctors in simulated EHR cases. News-Medical, viewed 22 June 2026, https://www.news-medical.net/news/20260621/Autonomous-medical-AI-outperforms-doctors-in-simulated-EHR-cases.aspx.