OpenAI o1 outperforms ER doctors in diagnostic accuracy
TECH


Strategic Overview

  • 01.
    A Harvard-led study published in Science on April 30, 2026 found OpenAI's o1-preview reasoning model matched or exceeded attending physicians at diagnosing real emergency department patients across triage, admission, and long-term treatment planning stages.
  • 02.
    On 76 real Beth Israel ED cases, o1 produced the correct diagnosis 67.1% of the time at triage versus 55.3% and 50.0% for two attending physicians; the gap widened to 89% vs 34% on five complex long-term treatment scenarios developed by 25 experts.
  • 03.
    The model was tested only on text inputs (notes and EHR data), with no imaging, physical exam findings, audio cues, or in-person interaction — a constraint the authors flag as central to interpreting the results.
  • 04.
    Researchers framed the result as evidence for a 'triadic' patient + doctor + AI care model rather than AI replacement, and explicitly warned that no formal accountability framework exists for clinical AI deployment.

Deep Analysis

The accuracy gap widens exactly where doctors usually win

The most counterintuitive pattern in the Harvard data is not that o1 beat physicians — it's where the gap was largest. At triage, with the least information available, o1 hit 67.1% versus 55.3% and 50.0% for two attending physicians: a meaningful but not enormous lead. By the hospital admission stage, when more clinical information had accumulated, the human physicians closed much of the gap (78.9% and 69.7% versus 81.6% for o1). And on long-term treatment planning across five complex cases, the gap blew open: a median of 89% for o1 versus 34% for doctors using conventional resources including up-to-date Google searches. The intuitive story is that AI helps most when humans are overwhelmed by noisy partial data; the actual pattern is closer to the opposite. AI's lead is widest on the cognitively richest task (multi-step longitudinal planning across complex scenarios) and narrowest on the moment of acute uncertainty where ER doctors have spent careers building heuristics. That suggests the model's advantage is less about pattern-matching the unknown and more about systematically working through structured medical reasoning chains — exactly the domain where chain-of-thought reasoning models are designed to excel.

What the text-only design does and doesn't prove

Every critic of the study, from the r/Futurology top comment to the authors themselves, anchors on the same point: the model never saw a patient. No imaging, no physical exam, no audio of distress, no skin tone, no smell of ketones, no body language. The researchers acknowledged in the paper that current foundation models are more limited in reasoning over non-text inputs. Reddit user dragoon7201, identifying as an ER physician, made the sharper version of this point — that ER triage is 'relatively straightforward and algorithm based' and the difficult part is 'arranging consultant services… and distinguishing drug seekers from real pain patients,' tasks the benchmark didn't measure. But the text-only frame also matters in the other direction: the team explicitly used real, unprocessed Beth Israel records rather than cleaned vignettes ('We didn't pre-process the data at all'). That makes the result more impressive than a typical clean-benchmark medical AI demo, because messy real EHR text is a known failure mode for LLMs. The honest read: o1 is genuinely strong at the text-reasoning slice of clinical diagnosis, and that slice is a real and underserved part of ED workflow — but it is a slice.

The accountability vacuum that's now load-bearing

Arjun Manrai's call for prospective trials and Adam Rodman's flat statement that 'there is not a formal framework right now for accountability' both point at the same gap: the regulatory and liability infrastructure for clinical AI hasn't been built, and a strong peer-reviewed result like this will pressure it to be built fast. Peter Brodeur's warning specifies one mechanism by which harm enters even when accuracy is high — the model might recommend unnecessary testing, exposing patients to procedural risk and cost cascades. Wei Xing's framing is the most pointed: 'It does not demonstrate that AI is safe for routine clinical use.' Three distinct accountability questions are sitting unresolved: (1) when an AI-recommended diagnosis is wrong, who carries malpractice liability — the physician who deferred, the hospital that deployed, or the vendor; (2) what counts as adequate human verification when the model is right more often than the doctor; and (3) how to constrain over-testing recommendations without losing the diagnostic accuracy that motivated deployment. The Harvard team's posture — publish the result, push for prospective trials, refuse to endorse autonomous use — is essentially an attempt to keep the regulatory window open against vendor pressure to declare the question settled.

Why the comparator argument matters more than it looks

Kristen Panthagani's critique that the human baseline was internal medicine physicians rather than emergency medicine specialists could read as turf defense, but it has structural weight. ED triage is a specialty skill — it's not the same task as inpatient differential diagnosis. The skeptical Reddit thread on the related NEJM CPC paper captured the parallel concern: u/Craygen9 noted that 'NEJM clinical pathologic conferences showcase rare and complex cases that will be difficult for a general clinician to diagnose,' which inflates AI's apparent advantage. Combined, these critiques suggest the Harvard headline number compares o1 against a deliberately or accidentally weak human baseline — not the practitioner who would actually do the work. This doesn't invalidate the result, but it reframes it: o1 is meaningfully better than non-specialist physicians at text-based ED diagnosis, which is a much narrower claim than 'AI beats ER doctors.' For deployment decisions, the honest benchmark would be o1 versus board-certified emergency medicine attendings working in their normal workflow with normal information access, not internal medicine physicians constrained to the same text-only inputs as the model.

The triadic care model and the integration bottleneck

The researchers' preferred framing is a 'triadic' patient + doctor + AI care model, and it lines up with where actual clinician adoption is heading: an RCP survey cited alongside the study found 16% of UK doctors using AI daily and another 15% weekly. But the most interesting practitioner pushback in the social discussion came from r/ArtificialIntelligence commenter u/zeapha, who argued that 'diagnosis is one thing, but it's actually a smaller component of care than you think… this sort of technology tends to take more time to use not less' — and pointed at deep EHR integration as the real bottleneck for impact. That reframes the product question. A model with 67% triage accuracy that requires copy-pasting notes into a chat interface won't change ED throughput. The same model wired into the EHR with a verified ordering pathway, pre-populated differential, and clean audit trail could. The Harvard study proves the underlying capability exists in the model layer; whether it converts into care quality depends on workflow plumbing the study doesn't address. That's the unglamorous gap between a Science paper and a deployed system, and it is where most of the value — and most of the failure modes — actually live.

Historical Context

1959
An early Science paper anticipated computerized diagnosis; the Harvard study is framed as a long-arc fulfillment of that 67-year-old prediction in the same publication.
September 2024
OpenAI released o1-preview, the first widely available chain-of-thought reasoning model — the specific model evaluated in the Harvard study.
April 30, 2026
Study published in Science showing o1-preview outperforming attending physicians across triage (67.1% vs 50-55%), admission (81.6% vs 69.7-78.9%), and long-term treatment planning (89% vs 34%).

Power Map

Key Players
Subject

OpenAI o1 outperforms ER doctors in diagnostic accuracy

OP

OpenAI

Developer of the o1 (o1-preview) reasoning model evaluated in the study; gains credibility for medical-grade reasoning use cases via a peer-reviewed Science publication.

HA

Harvard Medical School

Lead institution; its biomedical informatics group ran the experiments and authored the Science paper, with Arjun Manrai as senior co-author.

BE

Beth Israel Deaconess Medical Center

Provided the 76 real ED cases used in the benchmark and clinical co-leadership; internist Adam Rodman directs AI integration there and co-led the study.

ST

Stanford University

Collaborator on the Harvard-led study contributing to model evaluation and analysis.

EM

Emergency-medicine clinicians

Framed as the clinicians the AI is benchmarked against; some EM physicians object that the human comparators were actually internal medicine attendings rather than ED specialists, complicating interpretation.

HE

Healthcare AI vendors

Stand to commercialize results into clinical decision support products; researchers explicitly cautioned against vendors overclaiming that AI can replace doctors based on this study.

Source Articles

Top 3

THE SIGNAL.

Analysts

"Reports that o1 cleared every benchmark and physician baseline they tested, while pressing for prospective trials and warning vendors against overclaiming. His operating posture is 'trust, but verify.' Quote: "We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines." And: "My mantra is still 'trust, but verify.'""

Arjun Manrai
Associate Professor of Biomedical Informatics, Harvard Medical School; senior co-author

"Highlights that no governance scaffolding exists for AI-driven clinical decisions. Quote: "There is not a formal framework right now for accountability.""

Adam Rodman
Internist, Beth Israel Deaconess Medical Center; co-senior author

"Warns that headline accuracy obscures workflow harms — particularly over-testing recommendations. Quote: "A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm.""

Peter Brodeur
Subspecialty fellow, Beth Israel Deaconess Medical Center; lead study author

"Argues accuracy on a benchmark is not the same as safety in deployment. Quote: "It does not demonstrate that AI is safe for routine clinical use.""

Wei Xing
Researcher, University of Sheffield

"Critiques the comparator design — measuring AI against internal medicine attendings rather than emergency medicine specialists who actually run the triage workflow being modeled."

Kristen Panthagani
Emergency room physician
The Crowd

"AI outperforms doctors in Harvard trial of emergency triage diagnoses"

@u/Doener230

"AI Outperforms Doctors in Emergency Room Tasks, New Harvard Study Shows"

@u/77thway13

"o1-preview is far superior to doctors on reasoning tasks and it's not even close"

@u/MetaKnowing83
Broadcast
AI shows promise in emergency room diagnosis

OpenAI o1 Outperforms ER Doctors in Harvard Trial

Can OpenAI's o1 solve complex medical problems?
