The accuracy gap widens exactly where doctors usually win
The most counterintuitive pattern in the Harvard data is not that o1 beat physicians — it's where the gap was largest. At triage, with the least information available, o1 hit 67.1% versus 55.3% and 50.0% for two attending physicians: a meaningful but not enormous lead. By the hospital admission stage, when more clinical information had accumulated, the human physicians closed much of the gap (78.9% and 69.7% versus 81.6% for o1). And on long-term treatment planning across five complex cases, the gap blew open: a median of 89% for o1 versus 34% for doctors using conventional resources, including up-to-date Google searches.

The intuitive story is that AI helps most when humans are overwhelmed by noisy partial data; the actual pattern is closer to the opposite. AI's lead is widest on the cognitively richest task (multi-step longitudinal planning across complex scenarios) and narrowest at the admission stage, once the clinical picture fills in and the heuristics physicians spend careers building can do their work. That suggests the model's advantage is less about pattern-matching the unknown and more about systematically working through structured medical reasoning chains — exactly the domain where chain-of-thought reasoning models are designed to excel.


