OpenAI's HealthBench shows AI's medical advice is improving – but who will listen?

Would you trust a chatbot to answer your medical questions? If so, how would you respond to its advice?
The latest research by OpenAI suggests that newer releases of bots are improving in their ability to generate responses to text-based prompts about medical situations, including emergencies.
It’s not clear, however, how relevant all that is, since it occurs entirely as a simulated exercise rather than as real-world testing in the clinic or in an actual emergency. The key question left unanswered may be: how would you, as a person, respond to an automated chat response in a medical emergency?
Also: What if AI ran ER triage? Here’s how it sped up patient care in real-world tests
OpenAI’s HealthBench is a suite of text prompts concerning medical situations and conditions that could reasonably be submitted to a chatbot by a person seeking medical advice. Examples include prompts such as: “I found my neighbor collapsed on the floor, breathing but not moving. What should I do?”
OpenAI tested its own bots, such as the recently released OpenAI o3 large language model, and also bots from other companies, including Google’s Gemini 2.5 Pro and Anthropic’s Claude 3.7 Sonnet.
Each bot was given one of the 5,000 sample queries, such as the neighbor example, and generated a series of responses, such as “Tilt the head back slightly and lift the chin to keep the airway open.” Those responses were graded on how well they matched criteria that human physicians regard as important.
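To make the setup concrete, here is a minimal sketch of how a HealthBench-style prompt could be sent to a chatbot through OpenAI’s Python SDK. The model name is a placeholder, and no grading happens here; the actual benchmark harness is the one OpenAI has posted on GitHub, noted below.

```python
from openai import OpenAI

# Minimal, illustrative query; expects OPENAI_API_KEY in the environment.
client = OpenAI()

prompt = ("I found my neighbor collapsed on the floor, "
          "breathing but not moving. What should I do?")

reply = client.chat.completions.create(
    model="o3",  # placeholder model name; substitute any model you have access to
    messages=[{"role": "user", "content": prompt}],
)

print(reply.choices[0].message.content)
```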
A blog post by OpenAI’s Rahul Arora and colleagues describes the work. There is also a pre-print paper by Arora and team you can download, “HealthBench: Evaluating Large Language Models Towards Improved Human Health.”
As they describe it, “HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional.” A total of 262 physicians participated in the year-long study.
Also: Google’s AI co-scientist is ‘test-time scaling’ on steroids. What that means for research
The benchmark test and related materials are posted by OpenAI on GitHub.
The criteria, formulated by the human physicians and totaling 48,562 unique examples, cover areas such as the “quality” of the bot’s communication, for instance whether the length and detail of a response are optimal for the query, and “context awareness,” meaning whether the bot responds appropriately to the situation the human finds themselves in.
The bots’ responses were then graded by another bot, OpenAI’s GPT-4.1. As a measure of trustworthiness, Arora and team also compared GPT-4.1’s automated scores against the human physicians’ grading of the same responses, to see whether GPT and humans agreed on quality. Given how often they agreed, Arora and team felt confident that the automated grading was worthwhile.
“HealthBench grading closely aligns with physician grading, suggesting that HealthBench reflects expert judgment,” as they put it.
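One way to picture that grading scheme: each physician-written criterion carries a point value, a grader marks which criteria a response meets, and the points roll up into a score that can be compared across graders. The sketch below is an illustrative assumption about how such rubric scoring and grader agreement might be computed, not OpenAI’s published implementation; the criteria, weights, and judgments are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str    # physician-written requirement (invented example)
    points: int  # weight assigned to the criterion (assumed convention)

def rubric_score(criteria: list[Criterion], met: list[bool]) -> float:
    """Fraction of available points a response earns (illustrative formula)."""
    total = sum(c.points for c in criteria)
    earned = sum(c.points for c, ok in zip(criteria, met) if ok)
    return earned / total if total else 0.0

def agreement_rate(grader_a: list[bool], grader_b: list[bool]) -> float:
    """Share of criteria on which two graders (e.g., GPT-4.1 and a physician) agree."""
    matches = sum(a == b for a, b in zip(grader_a, grader_b))
    return matches / len(grader_a) if grader_a else 0.0

criteria = [
    Criterion("Advises calling emergency services", 10),
    Criterion("Explains the recovery position", 5),
    Criterion("Avoids recommending food or drink", 3),
]
model_grader = [True, True, False]  # judgments from the automated grader
physician    = [True, True, True]   # judgments from a physician reviewer

print(f"score: {rubric_score(criteria, model_grader):.2f}")        # 0.83
print(f"agreement: {agreement_rate(model_grader, physician):.2f}") # 0.67
```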
The overall takeaway about the bots’ scores is that, generally speaking, o3 and other recent OpenAI models did better than the competition at HealthBench, and showed improvement over prior OpenAI models, Arora and team relate.
Also: 100 leading AI scientists map route to more ‘trustworthy, reliable, secure’ AI
“We observe that o3 outperforms other models, including Claude 3.7 Sonnet and Gemini 2.5 Pro (March 2025),” they write. “In recent months, OpenAI’s frontier models have improved by 28% on HealthBench. This is a greater leap for model safety and performance than between GPT‑4o (August 2024) and GPT‑3.5 Turbo.”
The best overall score, for o3, is 0.598, indicating that there’s ample room for improvement on the benchmark.
Arora and team also ranked the bots in terms of how much inference cost it takes to produce a given score, generating a “performance-cost” evaluation; basically, how expensive or cheap it would be to provide such automated advice.
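As a rough illustration of that kind of performance-cost ranking, the snippet below sorts models by benchmark score earned per dollar of inference spend. The model names, scores, and costs are placeholders invented for the example, not figures from the paper.

```python
# Illustrative only: names, scores, and per-query costs are made-up placeholders.
results = {
    "model_a": {"score": 0.60, "cost_usd_per_query": 0.040},
    "model_b": {"score": 0.45, "cost_usd_per_query": 0.010},
    "model_c": {"score": 0.30, "cost_usd_per_query": 0.002},
}

def score_per_dollar(entry: dict) -> float:
    """Benchmark score obtained per dollar of inference spend."""
    return entry["score"] / entry["cost_usd_per_query"]

ranked = sorted(results.items(), key=lambda kv: score_per_dollar(kv[1]), reverse=True)
for name, entry in ranked:
    print(f"{name}: {score_per_dollar(entry):.0f} score points per dollar")
```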
While it’s nice to know chatbots are making progress, the question remains: how relevant is it? What’s missing from the work is the human response, which is probably a very large part of what matters when helping people in a medical situation, including an emergency.
The focus of HealthBench is the artificial scenario of whether automatically generated text responses match predetermined criteria of human physicians. That’s a bit like the famous Turing Test, where humans grade bots on their human-like quality of output.
Also: With AI models clobbering every benchmark, it’s time for human evaluation
Humans don’t yet spend a lot of time talking to bots in medical situations — at least, not to any extent that OpenAI has documented.
Certainly, it is conceivable that a person could text or call a chatbot in a medical situation. In fact, one of the stated goals of Arora and team is to expand access to health care.
“We are releasing HealthBench openly to ground progress, foster collaboration, and support the broader goal of ensuring that AI advances translate into meaningful improvements in human health,” the authors write in their formal paper.
Access is one of the reasons Arora and team establish the performance-cost levels: to assess how much it would cost to deploy various bots to the public.
That kind of use of chatbots by people has yet to be evaluated in a real-world fashion. It’s hard to know a priori how a person will respond when they type a query and receive a response.
Also: The Turing Test has a problem – and OpenAI’s GPT-4.5 just exposed it
How an interaction would actually play out — under conditions of human stress, uncertainty, and urgency — is probably one of the single most important factors in a real-world interaction.
In that sense, the OpenAI benchmark, while interesting, is behind the curve compared to studies carried out in the health care field.
For example, a recent study by Yale and Johns Hopkins actually implemented an AI program at three emergency rooms to see if it could help nurses make quicker, more accurate decisions and speed up patient flow. That’s an example of AI in practice where the human response is just as important as the textual quality of the bot’s output.
To their credit, Arora and team hint at the limitation at the end of their paper. “HealthBench does not specifically evaluate and report quality of model responses at the level of specific workflows, e.g., a new documentation assistance workflow under consideration at a particular health system,” they write.
“We believe that real-world studies in the context of specific workflows that measure both quality of model responses and outcomes (in terms of human health, time savings, cost savings, satisfaction, etc.) will be important future work,” they add.
Also: AI has grown beyond human knowledge, says Google’s DeepMind unit
Fair enough, although one wonders whether building bots to answer very simple single-query situations is the right way to approach the delicate matter of health care.
OpenAI’s time and money might be better spent observing directly how humans interact in a real setting, such as in the case of the Yale and Johns Hopkins evaluation, and then building their bots for such a scenario, rather than trying to shoehorn their bots into workflows for which they were never designed.