OpenAI’s New AI Models Outperform Doctors in Medical Knowledge Test

OpenAI has unveiled HealthBench, a comprehensive new benchmark designed to evaluate the medical knowledge and communication skills of language models. The benchmark was developed with 262 physicians who have practiced in 60 countries and span 26 medical specialties; together they created 5,000 realistic health conversations in 49 languages.

HealthBench evaluates performance across seven health themes, grading responses against more than 48,000 physician-written rubric criteria organized along five axes: accuracy, completeness, communication quality, context awareness, and instruction following.
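To make the scoring scheme concrete, here is a minimal sketch of how rubric-based grading of this kind can be computed. It is an illustration, not OpenAI's actual evaluation code: the criterion texts and point values are invented, and it assumes each conversation carries physician-written criteria with point values (negative for harmful behavior), that a grader judges whether each criterion is met, and that the example score is points earned over the maximum achievable, clipped to the 0–1 range.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    text: str      # physician-written criterion, e.g. "advises urgent care for chest pain"
    points: float  # positive for desirable behavior, negative for harmful behavior

def score_example(criteria: list[RubricCriterion], met: list[bool]) -> float:
    """Score one model response against its rubric.

    `met[i]` is a grader's judgment of whether the response satisfies
    criteria[i]. The score is points earned divided by the maximum
    achievable (the sum of positive points), clipped to [0, 1].
    """
    earned = sum(c.points for c, m in zip(criteria, met) if m)
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points <= 0:
        return 0.0
    return max(0.0, min(1.0, earned / max_points))

# Hypothetical example: two positive criteria and one penalty criterion.
rubric = [
    RubricCriterion("Recommends seeking emergency care", 7.0),
    RubricCriterion("Asks about symptom duration", 3.0),
    RubricCriterion("States a confident but wrong diagnosis", -6.0),
]
print(score_example(rubric, met=[True, False, False]))  # 0.7
```

In the published benchmark, a grader model makes the per-criterion judgments rather than a human, and the per-example scores are then averaged across the dataset.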


The results are striking. OpenAI’s latest models, GPT-4.1 and o3, outperformed physician-written responses across all five axes of the evaluation. This marks a significant leap over previous model generations: in September 2024, physicians could still improve on AI-generated answers from the models of the time, but by April 2025 the new models were producing high-quality medical responses that specialists could no longer improve upon.

The o3 model achieved a score of 0.60, a notable increase from GPT-4o’s 0.32 just six months earlier. It also outperformed other major AI competitors, including Grok 3 and Gemini 2.5 Pro.

It’s important to note that HealthBench only assesses language-based performance, not real-world clinical decision-making or hands-on patient care. Still, models like GPT-4.1 make significantly fewer errors in complex medical scenarios than their predecessors, and the lightweight GPT-4.1 nano outscores GPT-4o while being roughly 25 times cheaper to run.


OpenAI has made the entire HealthBench dataset and evaluation tools publicly available on GitHub, inviting further research and collaboration in advancing safe and effective AI use in healthcare.
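For readers who want to dig into the data, the following sketch shows one plausible way to iterate over a local copy of the benchmark. The filename and the JSONL field names (`prompt`, `rubrics`, `criterion`, `points`) are assumptions for illustration; the repository's README documents the actual schema and download links.

```python
import json

# Hypothetical local copy of the benchmark; see the HealthBench
# repository for the real download location and schema.
PATH = "healthbench.jsonl"

with open(PATH, encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        # Assumed fields: a multi-turn conversation and its physician-written rubric.
        conversation = example["prompt"]   # list of {"role": ..., "content": ...}
        rubrics = example["rubrics"]       # list of {"criterion": ..., "points": ...}
        print(f"{len(conversation)} turns, {len(rubrics)} rubric criteria")
```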
