
In real-world test, an AI model did better than ER doctors at diagnosing patients

Researchers tested an AI model against ER doctors and found the model outperformed the humans.
shapecharge/E+ / Getty Images

A patient shows up at the hospital with a pulmonary embolism — a blood clot that has traveled to the lungs. After initially improving, their symptoms start to worsen. The medical team suspects the medication isn't working.

In steps artificial intelligence — with its own theory.

It has scanned the medical records and suspects that a history of lupus, an autoimmune condition that can lead to heart inflammation, could explain what was really ailing the patient.

Turns out, the AI model is correct.

This type of scenario could become a reality in the not-too-distant future, according to a study published Thursday in the journal Science.

Researchers based at Harvard Medical School and Beth Israel Deaconess Medical Center found that an AI reasoning model, developed by OpenAI, excelled at diagnosing patients and making decisions about managing their care. It matched and often outperformed doctors and the earlier AI model, GPT-4.

The researchers ran a series of experiments on the AI model to test its clinical acumen — including actual cases like the lupus patient who'd been previously treated at the emergency department at Beth Israel in Boston.

The team graded how well the AI model could provide an accurate diagnosis at three moments in time, from triage in the ER to admission to the hospital.

Overall, AI outperformed two experienced physicians — and did so with only the electronic health records and the limited information that had been available to the physicians at the time.

"This is the big conclusion for me — it works with the messy real-world data of the emergency department," said Dr. Adam Rodman, a clinical researcher at Beth Israel and one of the study authors. "It works for making diagnoses in the real world."

Other parts of the study focused on case reports published in the New England Journal of Medicine and clinical vignettes to suss out whether the AI model could meet well-established "benchmarks" and game out thorny diagnostic questions.

"The model outperformed our very large physician baseline," said Raj Manrai, an assistant professor of biomedical informatics at Harvard Medical School who was also part of the study.

The authors emphasize the AI relied on text alone, while in real life, clinicians need to attend to many other inputs like images, sounds and nonverbal cues when diagnosing and treating a patient.

Still, the work showcases just how far the technology has advanced in the last few years. Prior versions of large language models faltered when dealing with uncertainty and when generating a list of possible conditions that could explain a patient's symptoms, what's known as a differential diagnosis.

"This paper is a beautiful summary of just how much things have improved," says Dr. David Reich, chief clinical officer for Mount Sinai Health System in New York, who was not involved in the work.

"You have something which is quite accurate, possibly ready for prime time," he says. "Now the open question is how the heck do you introduce it into clinical workflows in ways that actually improve care?"

After all, arriving at some tricky, final diagnosis — which the AI model shines at — isn't necessarily reflective of how things play out "in real clinical medicine," says Reich, where the "outcomes are much more subtle and perhaps more diverse."

And the emergency department is only a small portion of the patient's total medical care. Rodman acknowledges it's unlikely AI would have done such an "impressive" job had the team provided it with the records of someone who'd spent a month in the hospital.

None of those involved in the new study believe the findings support supplanting doctors with AI, "despite what some companies are likely to say and how they're likely to use these results," says Manrai.

"I think it does mean that we're witnessing a really profound change in technology that will reshape medicine," he adds.

But the results do make the case that AI models need to be tested in a rigorous fashion, ideally through forward-looking trials that can give more certainty about how the technology ultimately impacts clinical practice.

"It's a very challenging process to design these trials," says Reich, "but this study is a perfect call to action."

Copyright 2026 NPR

Will Stone