In Harvard study, AI offered more accurate emergency room diagnoses than two human doctors
Source: TechCrunch
A new study examines how large language models perform in a variety of medical contexts, including real emergency‑room cases, where at least one model seemed to be more accurate than human doctors.
The study was published this week in Science and comes from a research team led by physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center. The researchers conducted a variety of experiments to measure how OpenAI’s models compared to human physicians.
Experiment Overview
- Patients: 76 individuals who presented to the Beth Israel emergency room.
- Comparators: Two internal‑medicine attending physicians vs. OpenAI’s o1 and 4o models.
- Evaluation: Diagnoses were assessed by two additional attending physicians who were blinded to the source (human vs. AI).
Key Findings
- At each diagnostic touchpoint, o1 performed nominally better than or on par with the two attending physicians and 4o.
- The differences were most pronounced at the first diagnostic touchpoint (initial ER triage), when information is scarce and urgency is highest.
- Using the same electronic‑medical‑record data available at the time of each diagnosis (no pre‑processing), o1 offered “the exact or very close diagnosis” in 67% of triage cases, compared with 55% for one physician and 50% for the other.
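To give a rough sense of scale for the triage percentages above, here is a minimal sketch that converts them into approximate case counts and attaches a standard Wilson score interval to each. It assumes all 76 patients count as triage cases and that the reported rates are simple proportions of that total; neither is confirmed by the article. The counts are reconstructed by rounding, and the names `wilson_interval`, `N_CASES`, and `reported` are illustrative only. This is not the statistical method the researchers used.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumption: each of the 76 patients is one triage case.
N_CASES = 76

# Rates as reported in the article; the labels "physician A/B" are placeholders.
reported = {"o1": 0.67, "physician A": 0.55, "physician B": 0.50}

intervals = {}
for name, rate in reported.items():
    # Reconstructed by rounding; these counts are not figures from the study.
    approx_hits = round(rate * N_CASES)
    intervals[name] = wilson_interval(approx_hits, N_CASES)
    low, high = intervals[name]
    print(f"{name}: ~{approx_hits}/{N_CASES} cases, 95% interval {low:.2f} to {high:.2f}")
```

Under these assumptions the intervals for o1 and both physicians overlap, which fits the study's cautious "nominally better than or on par with" wording. Overlapping intervals are only a rough guide here, since every case was diagnosed by all three and a paired comparison would be the more appropriate test.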
“We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” said Arjun Manrai, who heads an AI lab at Harvard Medical School and is one of the study’s lead authors.
The researchers emphasize that the study does not claim AI is ready to make real life‑or‑death decisions in the emergency room. Instead, the findings highlight an “urgent need for prospective trials to evaluate these technologies in real‑world patient‑care settings.”
Limitations
- The study only examined performance with text‑based information; existing research suggests foundation models are more limited when reasoning over non‑text inputs.
- No formal accountability framework currently exists for AI‑generated diagnoses, a point underscored by one of the study’s lead authors, Adam Rodman, who warned that patients still “want humans to guide them through life‑or‑death decisions.”
Reactions and Commentary
- Kristen Panthagani, an emergency physician, noted that headlines about the study were “overhyped,” emphasizing that the AI was compared to internal‑medicine physicians rather than ER physicians. She argued that when an ER doctor sees a patient for the first time, the primary goal is not to guess the ultimate diagnosis but to determine whether the patient has a condition that could be fatal. “If we’re going to compare AI tools to physicians’ clinical ability, we should start by comparing to physicians who actually practice that specialty,” Panthagani said.
- In a Guardian interview, Rodman highlighted the lack of a formal accountability framework for AI diagnoses and the continued patient preference for human guidance in critical decisions.
The post and headline have been updated to reflect that the diagnoses in the study came from internal‑medicine attending physicians and to include commentary from Kristen Panthagani.