A patient with a blood clot arrives at a Boston emergency room. Treatment isn’t working. The medical team suspects the anticoagulants are failing. An AI model reads the electronic health record and notices something the doctors missed: a history of lupus, which could explain lung inflammation. The AI is right.
That case wasn’t hypothetical. It came from 76 real emergency department visits at Beth Israel Deaconess Medical Center, part of a Harvard-led study published Thursday in Science that tested OpenAI’s o1 reasoning model against experienced physicians in live clinical conditions — not curated benchmarks.
The results are striking. At triage, when information is sparsest, the AI identified the correct or near-correct diagnosis 67% of the time, compared with 50% to 55% for two expert attending physicians. When more data became available, AI accuracy rose to 82% versus 70% to 79% for humans — though that gap was not statistically significant.
On treatment planning, the gap widened dramatically. Asked to develop management plans for five clinical case studies, the AI scored 89%, while 46 physicians using conventional resources like search engines scored 34%.
“This is the big conclusion for me — it works with the messy real-world data of the emergency department,” said Dr. Adam Rodman, a study co-author and clinical researcher at Beth Israel.
But the study’s limitations are real. The AI operated on text alone — no physical examination, no chest X-rays, no reading of a patient’s visible distress. Co-author Arjun Manrai noted that physicians routinely process EKGs, imaging, and physiological signals the model never saw. Rodman acknowledged the AI likely wouldn’t have performed as impressively with patients who’d spent extended time in the hospital.
None of the researchers frames this as a replacement story. “I think humans want humans to guide them through life-or-death decisions,” Manrai said. Nearly one in five US physicians already consults AI for diagnostic help. The open question, echoed by independent experts, is how to integrate these tools without cutting doctors out of the loop.