OpenAI o1 outperforms ER physicians in diagnosis

Strategic Overview

  • 01.
    OpenAI's o1-preview produced the 'exact or very close' diagnosis in 67.1% of 76 real ER triage cases at Beth Israel Deaconess Medical Center, compared with 55.3% and 50.0% for the two attending internal medicine physicians evaluated on the same cases.
  • 02.
    The peer-reviewed paper, 'Performance of a large language model on the reasoning tasks of a physician,' was published in Science on April 30, 2026, led by Harvard Medical School and Beth Israel Deaconess with collaborators at Stanford.
  • 03.
    In one ER case, o1-preview flagged a rare flesh-eating infection in a transplant patient roughly 12-24 hours before the treating physician identified it — a window the senior author called the difference between survival and death.
  • 04.
    Two blinded reviewing physicians could not consistently distinguish whether ER assessments came from the AI or from the human attendings, and the researchers explicitly call for prospective randomized trials before any clinical deployment.

Deep Analysis

The Real Result Isn't 67% — It's That the Data Was Messy

Chart: OpenAI o1-preview vs. two attending internal medicine physicians on 76 real ER cases (% exact or very close diagnosis). Source: Harvard Medical School / Beth Israel Deaconess, Science (April 2026).

The headline number — 67.1% diagnostic accuracy versus 55.3% and 50.0% for human attendings — is striking, but the methodological choice underneath it is the more consequential finding. Unlike previous medical AI evaluations, the Harvard team did not curate, summarize, or vignette-ify the cases. Each ER patient was handed to o1-preview exactly as the chart appeared in the electronic health record: vital signs, intake nurse notes, free-text fragments, and all. That is the closest thing the literature has produced to an in-the-wild physician comparison.

The second mechanical pillar is the model itself. o1-preview was OpenAI's first 'step-by-step reasoning' system, and the authors attribute the leap over GPT-4o — its non-reasoning predecessor — to that architectural change. On a separate 70-case NEJM head-to-head, o1-series scored ~89% exact-or-very-close versus ~73% for GPT-4. In other words, the jump in real ER triage tracks the jump on hard published cases, suggesting it isn't an artifact of the 76-case sample. The story is: a reasoning model, fed raw clinical text, beat physicians who had access to the same record. That is a different claim than 'AI passes the USMLE,' which the field has heard for two years.
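One way to gauge how much weight a 12-point gap on 76 cases can bear is a quick back-of-envelope check. The sketch below is not from the paper; it assumes the published percentages correspond to integer counts of 51/76 (≈67.1%) and 42/76 (≈55.3%) and applies a standard Wilson score interval and pooled two-proportion z-test. The wide intervals it produces illustrate why the corroborating NEJM head-to-head matters: the in-the-wild sample alone is too small to settle the comparison.

```python
from math import sqrt, erf

def wilson_interval(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def two_prop_z(k1, n1, k2, n2):
    """Two-sided pooled two-proportion z-test; returns (z, p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    pool = (k1 + k2) / (n1 + n2)
    se = sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

n = 76
model, doc_a = 51, 42  # assumed counts: 51/76 ≈ 67.1%, 42/76 ≈ 55.3%
print(wilson_interval(model, n))        # roughly (0.56, 0.77) — a wide interval
print(two_prop_z(model, n, doc_a, n))   # z ≈ 1.5, p ≈ 0.13 on this sample alone
```

On these assumed counts the model-vs-attending difference does not reach conventional significance by itself, which is consistent with the article's point that the convergent result on the separate 70-case NEJM set, not the 76-case ER sample in isolation, is what makes the pattern credible.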

The ER Doctor Pushback That Most Coverage Skipped

The comparator group was not ER specialists. It was two attending internal medicine physicians being asked to make ER triage calls, and emergency physician Kristen Panthagani has been pointed about why that matters: 'If we're going to compare AI tools to physicians' clinical ability, we should start by comparing to physicians who actually practice that specialty.' Her deeper objection is that ER triage isn't really a guess-the-diagnosis task — it's a rule-out-lethal-conditions task, with very different success criteria than the study's exact-match scoring.

Reddit's medical commenters surfaced a parallel critique: doctors in real triage have information that isn't in the chart — the patient's color, breathing pattern, demeanor — and the study only handed the model text. The top community read across r/EverythingScience and r/Futurology was that the model was effectively producing a 'second opinion based on paperwork,' not replacing a clinician at the bedside. Even the senior authors agree the text-only modality is a real limitation, since real ER care also depends on imaging, exam findings, and patient interaction. The result is best read as a ceiling on what unprocessed-text reasoning can do, not as a verdict on ER medicine.

The Liability Trap: More Accurate Than Doctors, Legally Required to Ignore

The most uncomfortable thread running through both X and Reddit reactions is structural: a hospital can now have a system that demonstrably catches more diagnoses, and its general counsel will tell it not to use that system. There is no formal accountability framework for an AI-influenced diagnosis that turns out wrong — no clear malpractice standard, no FDA pathway tuned for general-purpose LLM-as-second-opinion, no precedent for who pays when the model is right and the doctor overrules it (or vice versa). One widely shared X thread distilled the asymmetry to a sentence: an AI that is right 67% of the time gets called dangerous, while a doctor who is right 55% of the time gets called board-certified.

This is why the authors' insistence on randomized controlled trials matters beyond academic hygiene. The first health system that figures out how to deploy an LLM overlay — passively flagging differential diagnoses against the EHR, with a clear decision-rights and indemnity model — captures a generational advantage. Yujin Potter at Berkeley reframes it as an AI safety problem rather than a regulatory one: outperforming doctors on aggregate doesn't tell you what the model does on the rare adversarial case, and the study didn't formally measure hallucination rates. Arya Rao's warning is the same in clinical language: LLM 'reasoning' is brittle precisely where uncertainty and nuance matter most. The deployment question is therefore less 'is the model good enough' and more 'what failure mode does the institution agree to own.'

Why the Sept-2024 Model Beating Doctors in 2026 Is the Bigger Tell

o1-preview shipped in September 2024. The Harvard paper landed in Science in late April 2026 — roughly nineteen months later, multiple frontier model generations downstream. The implicit question that drove the loudest social response, including from The Rundown AI's audience, was: what do the numbers look like for systems that aren't already two years behind? The study isn't a measurement of frontier capability; it's a measurement of a generation-old system that already exceeded the comparator.

That reframing connects to Thomas Buckley's other observation: multiple-choice medical exams are saturated near 100%, so the field can't use them to track progress anymore. The real research frontier is moving toward unprocessed-EHR evaluation of the kind this paper pioneered — harder, messier, closer to deployment reality, and resistant to the kind of memorization that inflated earlier benchmark scores. Combined with parallel results from Microsoft's MAI-DxO orchestrator (better diagnoses at ~20% lower cost) and the Elsevier survey finding ~20% of clinicians already quietly using LLMs for second opinions, the Harvard study reads less as a single startling result and more as confirmation that the bottleneck has migrated. It's no longer can the model do it. It's whether the legal, regulatory, and clinical workflow infrastructure can catch up before the technology overtakes it twice over.

Historical Context

1959
The challenging case sets used in this study trace back to benchmarks that have measured computer diagnostic ability since 1959, framing this as a six-decade arc.
2024
Earlier retrospective work showed GPT-4 outperforming GPT-3.5 and ED resident physicians on internal-medicine emergency diagnostic accuracy, foreshadowing the o1 result.
2024-09-12
OpenAI released o1-preview, its first reasoning model — the system the Harvard study evaluated, even though by publication it was already roughly nineteen months old.
2025
About 1 in 5 (~20%) clinicians worldwide reported already using LLMs for second opinions on complex cases, with more than half wanting to — real-world adoption preceded peer-reviewed validation.
2025-06
Microsoft reported its MAI-DxO orchestrator outperformed doctors on diagnostic accuracy at roughly 20% lower cost, part of the broader 2025-26 wave of AI-vs-physician evaluations.
2026-04-30
Publication of 'Performance of a large language model on the reasoning tasks of a physician' (DOI 10.1126/science.adz4433) in Science, the marquee peer-reviewed entry in this debate.

Power Map

Key Players
Harvard Medical School / Blavatnik Institute (Department of Biomedical Informatics)
Lead institution running the study; senior co-author Arjun Manrai's AI lab designed and ran the benchmarks against unprocessed EHR data.

Beth Israel Deaconess Medical Center
Clinical partner that supplied the 76 real ER cases and the attending physicians; Adam Rodman directs its Shapiro Center AI program and was senior author.

Stanford University (Jonathan Chen, Ethan Goh)
Co-authors and collaborators on study design and physician benchmarking, extending the Harvard-Stanford axis on medical LLM evaluation.

OpenAI
Provider of the evaluated o1-preview reasoning model; not directly involved in the study, but its product is the central subject and a downstream commercial beneficiary.

Science (AAAS journal)
Peer-reviewed publication venue, giving the findings academic legitimacy that prior preprints in this space lacked.

Source Articles


THE SIGNAL.

Analysts

"The model broadly eclipsed both prior systems and physician baselines, signaling a paradigm shift even as patients still want humans guiding life-or-death decisions: 'We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines.'"

Arjun K. Manrai
Senior co-author; Assistant Professor of Biomedical Informatics, Harvard Blavatnik Institute

"Was openly surprised by the magnitude of the result — 'I thought it was going to be a fun experiment but that it wouldn't work that well. That was not at all what happened' — but insists on a randomized controlled trial before any clinical deployment."

Adam Rodman
Senior author; internal medicine physician and Director of the AI Program at the Shapiro Center, Beth Israel Deaconess Medical Center

"Argues traditional medical AI evaluation has hit a ceiling: 'We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100%, and we can't track progress anymore because we're already at the ceiling' — motivating the shift to unprocessed EHR data."

Thomas Buckley
Doctoral candidate, Department of Biomedical Informatics, Harvard Medical School

"Calls the framing methodologically weak because the comparator wasn't ER specialists: 'If we're going to compare AI tools to physicians' clinical ability, we should start by comparing to physicians who actually practice that specialty.' ER triage is about ruling out lethal conditions, not guessing the final diagnosis."

Kristen Panthagani
Emergency physician

"Highlights that the strong showing on management decisions is the more meaningful result: 'Management reasoning is likely a more complex task than diagnostic reasoning.'"

Peter G. Brodeur
Clinical fellow in medicine and study co-author, Beth Israel Deaconess Medical Center

"Says the result is informative but reframes it as an AI safety mandate: 'This paper is informative. It's good. But also, this actually means that we also need to understand AI safety better.'"

Yujin Potter
AI research scientist, UC Berkeley

"Cautions that LLM 'reasoning' is not the same construct as clinical reasoning and is fragile in exactly the cases that matter most: 'Their reasoning is brittle precisely where uncertainty and nuance matter most.'"

Arya Rao
Researcher, Harvard Medical School

The Crowd

"Harvard just tested AI against ER doctors in real emergency triage. AI nailed diagnoses 67% of the time. Doctors: 50-55%. Especially strong with minimal info - exactly when every second counts. A "profound change" in medicine is here."

@stats_feed4800

"THIS HARVARD STUDY JUST PUT AN LLM AHEAD OF ER DOCTORS. Beth Israel gave o1 and real doctors the same 76 ER triage cases. o1: 67%. Doctors: 50-55%. We're in this weird moment where the AI outperforms the doctor but the doctor is still legally required to ignore it. The study itself says there's no accountability framework. Which means a hospital could have a tool that saves more lives and their lawyers would tell them not to use it. An AI that's right 67% of the time gets called dangerous. A doctor that's right 55% of the time gets called board certified. Whoever figures out how to use AI in healthcare and deal with the liability problem is sitting on a generational company."

@gregisenberg

"A new study from Harvard just found that AI diagnosed real ER patients more accurately than two attending physicians from elite med schools. The model used? OpenAI's o1-preview... Released in September 2024. The correct diagnosis at initial ER triage on 76 cases from a Boston hospital: AI: 67.1%. Doctor #1: 55.3%. Doctor #2: 50.0%. The two other physician reviewers tasked with scoring couldn't tell which diagnoses came from the model and which came from the humans. What will the results look like from models that aren't already two years behind?"

@TheRundownAI

"AI outperforms doctors in Harvard trial of emergency triage diagnoses"

u/Doener23235
Broadcast
AI shows promise in emergency room diagnosis

New study suggests AI is starting to outperform doctors, sheds light on growing capabilities

Can OpenAI's o1 solve complex medical problems?