A patient with a blood clot arrives at a Boston emergency room. Treatment isn’t working. The medical team suspects the anticoagulants are failing. An AI model reads the electronic health record and notices something the doctors missed: a history of lupus, which could explain lung inflammation. The AI is right.
That case wasn’t hypothetical. It came from 76 real emergency department visits at Beth Israel Deaconess Medical Center, part of a Harvard-led study published Thursday in Science that tested OpenAI’s o1 reasoning model against experienced physicians in live clinical conditions — not curated benchmarks.
The results are striking. At triage, when information is sparsest, the AI identified the correct or near-correct diagnosis 67% of the time, compared with 50% to 55% for two expert attending physicians. When more data became available, AI accuracy rose to 82% versus 70% to 79% for humans — though that gap was not statistically significant.
On treatment planning, the gap widened dramatically. Asked to develop management plans for five clinical case studies, the AI scored 89%, while 46 physicians using conventional resources like search engines scored 34%.
“This is the big conclusion for me — it works with the messy real-world data of the emergency department,” said Dr. Adam Rodman, a study co-author and clinical researcher at Beth Israel.
But the study’s limitations are real. The AI operated on text alone — no physical examination, no chest X-rays, no reading of a patient’s visible distress. Co-author Arjun Manrai noted that physicians routinely process EKGs, imaging, and physiological signals the model never saw. Rodman acknowledged the AI likely wouldn’t have performed as impressively with patients who’d spent extended time in the hospital.
None of the researchers frames this as a replacement story. “I think humans want humans to guide them through life-or-death decisions,” Manrai said. Nearly one in five US physicians already consults AI for diagnostic help. The open question, echoed by independent experts, is how to integrate these tools without cutting doctors out of the loop.