The 89% vs 34% management-reasoning gap is the real story, not triage accuracy
Most coverage led with the headline triage number: 67.1% for o1-preview versus 55.3% and 50.0% for two attending physicians on 76 Beth Israel Deaconess cases. That gap is real but modest, and with richer clinical detail the spread compressed to 82% for o1 versus 70-79% for physicians, a difference the authors note was not statistically significant. The astonishing result lives one decision point deeper: on management reasoning, scored by experts against clinical vignettes, o1 achieved a median of 89% while 46 physicians with conventional resources scored a median of 34%. As an emergency physician put it in published commentary on the result, "That is not a typo."
Management reasoning is the chain of choices that follows diagnosis: what to image, whom to admit, which therapy to start, what to monitor. It is where attending physicians earn their salary, and it is where benchmark-style multiple-choice tests have historically failed to discriminate between strong and weak clinicians. The Science paper also reports that o1 received a perfect clinical-reasoning score on 98% of cases versus 35% for attending physicians. Whatever the eventual deployment story, the implication for medical AI evaluation is structural: the locus of automation pressure is shifting from "name the disease" to "plan the next 24 hours of care," and that is the dimension where current attendings look least defensible on paper.


