OpenAI's o1 outperforms ER doctors in Harvard Science study on real emergency cases
TECH


Strategic Overview

  • 01.
    A peer-reviewed study by Harvard Medical School and Beth Israel Deaconess Medical Center, with Stanford collaborators, was published in Science on April 30, 2026 (DOI: 10.1126/science.adz4433), evaluating OpenAI's o1-preview reasoning model against attending physicians on real emergency department cases.
  • 02.
    On 76 real ER cases from Beth Israel Deaconess in Boston, o1-preview reached 67.1% exact-or-near-exact diagnostic accuracy at initial triage, versus 55.3% and 50.0% for two attending physicians given the same raw, unstructured electronic health records.
  • 03.
    The most striking gap was on management reasoning using expert-scored clinical vignettes: the AI scored a median of 89% while 46 physicians with conventional resources scored 34%, and across vignettes o1 received a perfect clinical-reasoning score on 98% of cases versus 35% for attendings.
  • 04.
    The authors explicitly cautioned that the findings do not justify deploying o1 in real ERs, calling for prospective randomized clinical trials and noting that tests were text-only — the model received no medical imaging, audio, or visual cues such as patient distress.

Deep Analysis

The 89% vs 34% management-reasoning gap is the real story, not triage accuracy

Most coverage led with the headline triage number — 67.1% for o1-preview versus 55.3% and 50.0% for two attending physicians on 76 Beth Israel Deaconess cases. That gap is real but modest, and with richer clinical detail the spread compressed to 82% for o1 versus 70-79% for physicians, a difference the authors note was not statistically significant. The astonishing result lives one decision point deeper: on management reasoning using expert-scored clinical vignettes, o1 scored a median of 89% while 46 physicians with conventional resources scored 34%. As an emergency physician noted in published commentary on the result, 'That is not a typo.'
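
To see why a double-digit gap on 76 cases can still be fragile, a back-of-the-envelope significance check helps. The sketch below is not from the paper; it assumes each case is an independent binary outcome and reconstructs the implied case counts from the reported percentages:

```python
from scipy.stats import fisher_exact

n = 76
o1_correct = round(0.671 * n)    # ≈ 51 of 76 cases correct
doc_correct = round(0.553 * n)   # ≈ 42 of 76 cases correct

# 2x2 contingency table: [correct, incorrect] for each reader
table = [[o1_correct, n - o1_correct],
         [doc_correct, n - doc_correct]]

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio {odds_ratio:.2f}, p = {p_value:.2f}")
# The p-value lands well above 0.05: at this sample size, even a
# ~12-point accuracy gap is compatible with chance, consistent with
# the authors' caution about the non-significant 82% vs 70-79% spread.
```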

Management reasoning is the chain of choices after diagnosis — what to image, what to admit, which therapy to start, what to monitor. It is where attending physicians earn their salary, and it is where benchmark-style multiple-choice tests have historically failed to discriminate between strong and weak clinicians. The Science paper also reports that o1 received a perfect clinical-reasoning score on 98% of cases versus 35% for attending physicians. Whatever the eventual deployment story, the implication for medical AI evaluation is structural: the locus of automation pressure is shifting from 'name the disease' to 'plan the next 24 hours of care,' and that is the dimension where current attendings look least defensible on paper.

Why peer review in Science changes the conversation

An emergency physician writing about the result captured the shift bluntly: 'This was published in Science. Not a preprint. Not a company blog post. Peer-reviewed, in one of the two most prestigious scientific journals in the world.' Until April 30, 2026, the strongest AI-vs-physician numbers came from journal pieces using clean vignettes (the JAMA Network Open GPT-4 study at 90% versus 74-76%), from AI lab releases (Microsoft's MAI-DxO at 85% versus 20% for unaided physicians on 304 NEJM cases), or from preprints on NEJM clinicopathological conference cases. Each could be discounted by skeptics as too clean, too marketed, or too narrow.

The Harvard-BIDMC paper closes those escape hatches. It uses raw, unstructured electronic health records exactly as they appear in clinical practice. It evaluates the model at three sequential decision points — initial triage on arrival, first contact with a physician, and admission to the medical floor or ICU — rather than at a single artificial moment. It uses real attendings as the comparator, not residents or web-search baselines. And it ran the gauntlet of Science peer review. As co-author Peter Brodeur warns, existing multiple-choice and vignette benchmarks are saturating near 100%, which is precisely why this study's choice to anchor evaluation on attending physicians and raw EHRs matters: it is the harder test that can still tell models apart now that the old ceilings have been reached.
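
To make the three-decision-point design concrete, here is a minimal sketch of what such an evaluation loop looks like. It is not the study's actual pipeline: the prompt wording, the `ehr_chunks` structure, and the use of the OpenAI chat completions API are illustrative assumptions, and in the study the free-text outputs were graded by physician reviewers, not by code:

```python
from openai import OpenAI

client = OpenAI()

# The paper's three sequential decision points, in chronological order.
DECISION_POINTS = ["initial_triage", "physician_first_contact", "admission"]

def run_case(ehr_chunks: dict[str, str]) -> dict[str, str]:
    """Query the model at each decision point, showing it only the raw,
    unstructured EHR text that existed up to that moment."""
    record_so_far = ""
    outputs = {}
    for point in DECISION_POINTS:
        record_so_far += "\n\n" + ehr_chunks[point]  # cumulative raw notes
        response = client.chat.completions.create(
            model="o1-preview",
            messages=[{
                "role": "user",
                "content": (
                    f"Decision point: {point.replace('_', ' ')}.\n"
                    f"EHR so far:\n{record_so_far}\n\n"
                    "Give a ranked differential diagnosis and a management plan."
                ),
            }],
        )
        # Free-text output; expert physicians scored these in the study.
        outputs[point] = response.choices[0].message.content
    return outputs
```

The point of the cumulative `record_so_far` is that the model at triage sees only what would exist on arrival, which is what makes the 67.1% figure an early-information result rather than a full-chart one.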

The deployment debate: triadic care vs. cheaper-than-nothing access

The paper's authors are unusually loud about how the result should not be used. Adam Rodman has explicitly proposed a 'triadic care model … the doctor, the patient, and an artificial intelligence system' rather than replacement, and Arjun Manrai — while saying o1 'eclipsed both prior models' — pushes back on framing the result as AI replacing doctors. Coverage notes that both researchers publicly criticize companies marketing AI to 'cut doctors out of the loop.' Peter Brodeur reinforces the guardrail: 'Humans should be the ultimate baseline when it comes to evaluating performance and safety.' The accountability vacuum is also on display: reporters note 'there is not a formal framework right now for accountability,' and Rodman warns that a regulatory posture of 'just trust me' would limit adoption.

Reader communities split along a different axis: accessibility cheerleaders versus dystopia skeptics. Top voices in r/singularity frame the result as a path to cheaper care for people with no doctor at all — 'getting care from an error-prone AI is better than dying with 0 care' — while r/artificial commenters worry the next step is insurer-mandated AI gates implemented 'in the worst and cheapest way possible,' and flag that the model still cannot prescribe. Practitioner comments cut a third way: a pediatrics resident and a physician describe using o1-class models as 'instant professional consultants,' but one commenter advises walking the human doctor step-by-step toward the AI's conclusion without revealing AI was used, suggesting clinicians already navigate a quiet de facto deployment. Yujin Potter at UC Berkeley adds the alignment dimension that the accuracy numbers do not address: hallucinations, deception risk, and the gap between benchmark performance and safety in the loop.

What the model didn't see, and why the multimodal blind spot defines the next study

Every result in the Science paper rests on text. The study notes that real-life clinical medicine is 'multifaceted and awash with non-text inputs,' and that auditory cues such as patient distress and visual inputs such as medical imaging were not part of the evaluation. Thomas Buckley, a doctoral candidate at HMS Biomedical Informatics, is direct about the consequence: 'They're underperforming on most medical imaging benchmarks.' Coverage confirms parallel studies on image and signal performance are underway. Until those land, any reading of the paper that claims o1 can do an attending's full job goes further than the data support.

The text-only constraint also explains why the management reasoning gap is so wide. Management plans on a written EHR are heavily verbal artifacts — admit/discharge, order this lab, start this antibiotic — and a step-by-step reasoning model with no time pressure, no shift fatigue, and no patient interaction friction is well matched to that task. Strip the model of physical exam, imaging, and the noisy social environment of an ER, and you have removed the dimensions where physicians most clearly outperform. The press release's warning that the model may recommend 'unnecessary testing that could expose patients to harm' names the obvious failure mode of a text-strong, context-poor recommender. Future work — and the prospective trials the authors are calling for — has to test whether the gap survives when imaging, EKGs, and bedside cues come back into the loop.

Why this paper lands now: benchmark saturation and a regulatory vacuum

Three forces converged to make this study land hard. First, reasoning models architecturally fit the task: o1's step-by-step chain-of-thought design was engineered for multi-step problem solving, which maps onto how clinicians construct differentials and management plans. Second, the existing benchmark stack is saturating — top models score near 100% on multiple-choice and vignette tests, which is exactly why Brodeur and colleagues anchored on raw EHRs and attending-physician baselines. Without that pivot, the result would not have been distinguishable from prior preprints. Third, ER workflow itself is the highest-leverage entry point for text-strong AI: triage relies on parsing messy, noisy electronic health records under time pressure with sparse early data, the regime where LLM second-readers can plausibly add value.

The demand side is also already in place. Coverage cites that by 2025 roughly 20% of clinicians were consulting LLMs for second opinions, with UK survey figures at 16% daily and 15% weekly use. That is de facto deployment without trials, without accountability frameworks, and without integration into liability and reimbursement. The Harvard-BIDMC authors are reading the room: they want their result interpreted as a call for prospective randomized trials, human-computer interaction studies, and equity/cost/safety evaluation rather than as a green light. Hopkins and Cornelisse at Flinders University frame the same point externally: 'Accuracy on a defined task is only one dimension of deployment readiness. Clinical AI must also deliver equitable, cost-effective, and safe outcomes.' The takeaway for tech-adjacent readers is that the headline number (89% vs 34%) and the policy posture (no, do not deploy this in your ER) are coming from the same authors, on purpose, and the gap between the two is exactly what the next two years of medical AI regulation will be about.

Historical Context

2024-11-17
A widely cited trial of 50 doctors showed GPT-4 alone scored 90% on six diagnostic vignettes while doctors with or without AI scored 74-76%, establishing the modern AI-vs-physician baseline that the Harvard team wanted to push beyond.
2024-12-19
Earlier evaluations on NEJM clinicopathological conference cases reported o1-preview reaching roughly 80% diagnostic accuracy versus about 30% for human clinicians, foreshadowing the Harvard ER findings.
2025-06-30
Microsoft published its AI Diagnostic Orchestrator, reporting up to 85% accuracy on 304 NEJM cases versus around 20% for 21 unaided physicians — a parallel industry result preceding the Harvard peer-reviewed study.
2026-04-30
First peer-reviewed Science paper to evaluate a reasoning LLM on raw ER electronic health records, with o1-preview hitting 67.1% triage accuracy versus 50-55% for attending physicians and 89% versus 34% on management reasoning.

Power Map

Key Players

Harvard Medical School

Lead institution; senior co-authors Arjun Manrai (Assistant Professor of Biomedical Informatics) and Adam Rodman (Assistant Professor of Medicine and director of the HMS AI curriculum task force) drove the medical-AI research agenda.

Beth Israel Deaconess Medical Center (BIDMC)

Boston teaching hospital that supplied the 76 real ER cases and houses co-first author Peter Brodeur and senior co-author Adam Rodman; provides the real-world clinical setting being evaluated.

OpenAI

Developer of the evaluated o1-preview reasoning model; benefits from a Science-tier validation of its medical reasoning capabilities but did not run or fund the study according to reporting.

Stanford University

Collaborating institution; Jonathan Chen and Ethan Goh are among the co-authors, extending the Harvard-led medical-AI research network.

Science (AAAS journal)

Publisher of the peer-reviewed paper on April 30, 2026, lending the work top-tier scientific credibility and pushing the result well beyond a preprint or company blog post.

Emergency department physicians

Profession most directly affected; serve as the human baseline (50-55% triage accuracy, 34% management reasoning) and bear the workflow and accountability implications of any future deployment.

Source Articles


Analysts

"Pushes back against vendors trying to remove physicians from the loop and proposes a 'triadic care model' where doctor, patient, and AI work together rather than any replacement scenario."

Adam Rodman
Assistant Professor of Medicine, Harvard Medical School; Director of AI Programs, Carl J. Shapiro Center, BIDMC; senior co-author

"Says o1 surpassed essentially every benchmark and prior model — 'We tested the AI model against virtually every benchmark, and it eclipsed both prior models' — but rejects framing the result as AI replacing doctors."

Arjun (Raj) Manrai
Assistant Professor of Biomedical Informatics, Harvard Medical School Blavatnik Institute; Founding Deputy Editor, NEJM AI; senior co-author

"Warns existing benchmarks are saturating and human clinicians must remain the safety yardstick: 'Humans should be the ultimate baseline when it comes to evaluating performance and safety.'"

Peter Brodeur
Clinical Fellow in Medicine, Harvard Medical School / Beth Israel Deaconess; co-first author

"Cautions that even strong text reasoning leaves multimodal medical imaging as a major weak spot for current models: 'They're underperforming on most medical imaging benchmarks.'"

Thomas Buckley
Doctoral candidate, Harvard Medical School Department of Biomedical Informatics; Dunleavy Fellow

"Highlights model hallucinations, deception risk, and AI alignment as concerns that diagnostic-accuracy numbers alone do not address, arguing safety must be evaluated alongside performance."

Yujin Potter
AI researcher, UC Berkeley

The Crowd

"OpenAI o1-preview beats doctors in hard clinical reasoning, it's not even close, ~80% vs 30% on 143 hard NEJM CPC diagnoses"

@u/obvithrowaway34434672

"AI Outperforms ER Doctors in Diagnostic Cases, Study Points to Collaborative Care"

@u/PhoenixRising656244

"o1-preview is far superior to doctors on reasoning tasks and it's not even close"

@u/MetaKnowing83

Broadcast
Can OpenAI's o1 solve complex medical problems?

AI beats doctors in diagnoses, but human judgment still key: study

Can artificial intelligence outperform human doctors? Here's what new Stanford study found