In Harvard study, AI offered more accurate diagnoses than emergency room doctors
Source: TechCrunch
Study Overview
A new study examines how large language models perform across a variety of medical contexts, including real emergency room cases, where at least one model appeared to be more accurate than human doctors. The study was published this week in Science (https://www.science.org/doi/10.1126/science.adz4433) and comes from a research team led by physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center. The researchers ran a series of experiments to measure how OpenAI's models compared to human physicians.
Methodology
In one experiment, the researchers focused on 76 patients who came into the Beth Israel emergency room, comparing the diagnoses offered by two attending physicians with those generated by OpenAI's o1 and GPT-4o models. These diagnoses were assessed by two other attending physicians, who were blinded to the source (human vs. AI).
The study emphasized that the AI models were presented with the same information available in the electronic medical records at the time of each diagnosis, with no pre‑processing of the data.
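To make the blinding concrete, here is a minimal Python sketch of what source-blinded grading can look like. The case data, source names, and function are hypothetical (the paper does not publish its pipeline); the point is only that graders see diagnosis text with the labels stripped and the order shuffled.

```python
import random

def blind_case(case_diagnoses: dict[str, str], rng: random.Random) -> tuple[list[str], list[str]]:
    """Return the diagnoses in random order, plus the hidden key mapping positions back to sources."""
    order = list(case_diagnoses)                  # source names, e.g. "physician_A", "o1"
    rng.shuffle(order)                            # randomize presentation order
    blinded = [case_diagnoses[s] for s in order]  # diagnosis text only, no labels
    return blinded, order                         # the key stays sealed until grading is done

# Hypothetical example case with diagnoses from four sources.
rng = random.Random(0)
example = {
    "physician_A": "acute appendicitis",
    "physician_B": "gastroenteritis",
    "o1": "acute appendicitis",
    "gpt-4o": "ovarian torsion",
}
blinded, key = blind_case(example, rng)
print(blinded)  # graders score these without knowing which came from whom
```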
Results
- At each diagnostic touchpoint, o1 performed either nominally better than or on par with the two attending physicians and GPT-4o.
- The differences were especially pronounced at the first diagnostic touchpoint (initial ER triage), where information is scarce and urgency is high.
- Using the same triage information, the o1 model offered “the exact or very close diagnosis” in 67% of cases, compared with 55% for one physician and 50% for the other (a rough reading of these figures follows below).
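As a reader-side, back-of-the-envelope check of those triage figures (not the study's own analysis), the sketch below converts the reported rates into approximate case counts and runs a pooled two-proportion z-test, under the unverified assumption that all 76 cases were graded at the triage touchpoint.

```python
from math import sqrt

n = 76  # ER cases in the experiment
rates = {"o1": 0.67, "physician_1": 0.55, "physician_2": 0.50}

# Approximate case counts implied by the reported percentages.
for name, p in rates.items():
    print(f"{name}: ~{round(p * n)} of {n} cases")

# Pooled two-proportion z-test between o1 and the stronger physician.
p1, p2 = rates["o1"], rates["physician_1"]
pooled = (p1 + p2) / 2                           # equal sample sizes, so a simple average
z = (p1 - p2) / sqrt(pooled * (1 - pooled) * (2 / n))
print(f"z ≈ {z:.2f}")                            # ~1.5 under these assumptions
```

At this sample size the gap is suggestive rather than decisive, which is consistent with the authors' call for larger prospective trials.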
Arjun Manrai, who heads an AI lab at Harvard Medical School and is one of the study’s lead authors, said: “We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines.”
Implications
The study does not claim that AI is ready to make real life‑or‑death decisions in the emergency room. Instead, it highlights an urgent need for prospective trials to evaluate these technologies in real‑world patient‑care settings.
The researchers also noted that the study only examined performance with text‑based information, and that “existing studies suggest that current foundation models are more limited in reasoning over non‑text inputs.”
Commentary
Adam Rodman, a Beth Israel doctor and co‑author of the study, warned in The Guardian that there is “no formal framework right now for accountability” around AI diagnoses, and that patients still “want humans to guide them through life or death decisions and to guide them through challenging treatment decisions.” (https://www.theguardian.com/technology/2026/apr/30/ai-outperforms-doctors-in-harvard-trial-of-emergency-triage-diagnoses)
References
- Study publication: https://www.science.org/doi/10.1126/science.adz4433
- Harvard Medical School press release: https://hms.harvard.edu/news/study-suggests-ai-good-enough-diagnosing-complex-medical-cases-warrant-clinical-testing
- Guardian article: https://www.theguardian.com/technology/2026/apr/30/ai-outperforms-doctors-in-harvard-trial-of-emergency-triage-diagnoses