Microsoft’s AI Doctor Gets 85.5% of Diagnoses Right — Human Doctors Get 20% [Study Results]

Key Takeaways

Microsoft’s AI Diagnostic Orchestrator (MAI-DxO) achieved 85.5% accuracy on 304 complex medical cases from the New England Journal of Medicine, compared to 20% for a group of 21 experienced physicians.
The system simulates a virtual panel of physicians with distinct diagnostic roles. It cut testing costs by about 20% relative to the physicians and by about 70% relative to OpenAI’s o3 model used alone.
MAI-DxO generalizes across models from OpenAI, Google, Anthropic, xAI, DeepSeek, and Meta, and Microsoft’s consumer AI products already handle over 50 million health-related sessions daily.
The study used only complex teaching cases, not routine diagnoses, and physicians were restricted from using textbooks, colleagues, or AI tools during the evaluation.

What Happened

A team of 15 researchers at Microsoft, led by Harsha Nori and Dominic King, published “Sequential Diagnosis with Language Models” in June 2025. The paper introduced MAI-DxO, a system that orchestrates multiple language models to diagnose complex medical cases through iterative questioning, test ordering, and cost-aware reasoning. When configured for maximum accuracy, MAI-DxO correctly diagnosed 85.5% of 304 clinicopathological conference cases drawn from the New England Journal of Medicine. A comparison group of 21 U.S. and U.K. physicians with 5 to 20 years of experience achieved 20% accuracy on the same cases.

“Doctors aren’t going anywhere. AI will help them arrive at diagnoses and effective care plans faster, but it can’t replace the human connection and understanding patients’ needs,” said Dominic King, VP of Health at Microsoft AI.

Why It Matters

The World Health Organization projects a global shortage of 11 million health workers by 2030, concentrated in low- and lower-middle-income countries. Diagnostic AI systems that can triage and assist with complex cases could help stretch limited specialist capacity, particularly in regions where patients may wait weeks for a referral.

Microsoft already operates at health-relevant scale. Across Bing, Copilot, and other consumer AI products, the company observes over 50 million health-related sessions every day. MAI-DxO is not yet deployed in any of these products, but the research signals Microsoft’s intent to move beyond general-purpose health search toward structured diagnostic assistance.

Technical Details

The researchers created the Sequential Diagnosis Benchmark (SDBench), which transforms each NEJM case into a stepwise diagnostic encounter. A physician or AI system begins with a short case abstract and must iteratively request additional clinical details from a gatekeeper model that reveals findings only when explicitly queried. Performance is measured on both diagnostic accuracy and the cumulative cost of physician visits and tests ordered.
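The encounter format can be sketched in a few lines of Python. This is an illustrative toy, not the SDBench implementation: the `Gatekeeper` class, the `run_encounter` helper, the example case, and every dollar amount are assumptions made for this sketch, and the simple string match at the end stands in for whatever scoring procedure the benchmark actually uses.

```python
from dataclasses import dataclass


@dataclass
class Gatekeeper:
    """Toy gatekeeper: holds the full case file, reveals items only on request."""
    findings: dict[str, str]      # hidden case details, keyed by query name
    test_costs: dict[str, float]  # hypothetical price of each test, in USD
    visit_cost: float = 300.0     # hypothetical flat fee per physician visit
    total_cost: float = 0.0       # running bill for the encounter

    def start_visit(self) -> None:
        self.total_cost += self.visit_cost

    def ask(self, question: str) -> str:
        # A finding is revealed only if it is explicitly asked for.
        return self.findings.get(question, "Not available.")

    def order_test(self, test: str) -> str:
        # Each ordered test adds its price to the cumulative cost.
        self.total_cost += self.test_costs[test]
        return self.findings.get(test, "Not available.")


def run_encounter(gatekeeper, actions, final_diagnosis, true_diagnosis):
    """Play a fixed list of (kind, name) actions, then score the diagnosis."""
    gatekeeper.start_visit()
    revealed = {}
    for kind, name in actions:
        if kind == "ask":
            revealed[name] = gatekeeper.ask(name)
        elif kind == "test":
            revealed[name] = gatekeeper.order_test(name)
        else:
            raise ValueError(f"unknown action kind: {kind}")
    # Stand-in for the benchmark's scoring: case-insensitive exact match.
    correct = final_diagnosis.lower() == true_diagnosis.lower()
    return correct, gatekeeper.total_cost, revealed


# Made-up case used only to exercise the loop.
gatekeeper = Gatekeeper(
    findings={
        "fever duration": "Daily fevers for two weeks.",
        "chest x-ray": "Bilateral hilar lymphadenopathy.",
    },
    test_costs={"chest x-ray": 120.0},
)
correct, cost, revealed = run_encounter(
    gatekeeper,
    actions=[("ask", "fever duration"), ("test", "chest x-ray")],
    final_diagnosis="Sarcoidosis",
    true_diagnosis="sarcoidosis",
)
print(correct, cost)
```

The point of the sketch is the scoring shape: every encounter yields both a correctness flag and a running bill, so a diagnostician that orders every available test can be accurate and still score poorly on cost.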

MAI-DxO’s architecture simulates a virtual panel of physicians with distinct diagnostic roles. The system generates differential diagnoses, estimates the marginal information value of each potential test, and ensembles multiple model responses before committing to a final diagnosis. When paired with OpenAI’s o3 model, MAI-DxO achieved 79.9% accuracy at a cost of $2,397 per case, compared with 78.6% accuracy at $7,850 per case for o3 on its own. In its maximum-accuracy configuration, MAI-DxO reached 85.5% accuracy at a cost of $7,184 per case.
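The cost and accuracy comparisons follow from simple arithmetic on the figures reported above. The short Python check below uses only those reported numbers; the helper `percent_reduction` and the variable names are illustrative choices, not anything taken from the paper.

```python
def percent_reduction(baseline: float, new: float) -> float:
    """Percentage by which `new` is lower than `baseline`."""
    return 100.0 * (baseline - new) / baseline


# Accuracy (%) and cost per case (USD) as reported in the article.
O3_ALONE = {"accuracy": 78.6, "cost": 7850.0}
MAI_DXO_O3 = {"accuracy": 79.9, "cost": 2397.0}
MAI_DXO_MAX = {"accuracy": 85.5, "cost": 7184.0}

for label, config in [
    ("MAI-DxO + o3", MAI_DXO_O3),
    ("MAI-DxO max accuracy", MAI_DXO_MAX),
]:
    saving = percent_reduction(O3_ALONE["cost"], config["cost"])
    gain = config["accuracy"] - O3_ALONE["accuracy"]
    print(f"{label}: {saving:.1f}% cheaper than o3 alone, {gain:+.1f} accuracy points")
```

The first line of output shows a 69.5% cost reduction, which is the source of the roughly 70% saving cited in the takeaways. The second shows that the maximum-accuracy configuration gains 6.9 accuracy points while still costing about 8.5% less than o3 alone.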

The system’s performance gains generalized across foundation models from OpenAI, Google (Gemini), Anthropic (Claude), xAI (Grok), DeepSeek, and Meta (Llama). Co-authors included Eric Horvitz, Mustafa Suleyman, Matthew P. Lungren, and Marco Tulio Ribeiro, among others.

Who’s Affected

Health systems facing specialist shortages could benefit most directly, though significant regulatory and validation barriers remain before clinical deployment. The study’s authors explicitly note that MAI-DxO is early-stage research and is not approved for clinical use. A 2025 survey cited by Microsoft found that 48% of U.S. patients and 63% of clinicians are optimistic about AI improving health outcomes.

The medical community has raised valid concerns about the study design. The 304 NEJM cases are deliberately difficult teaching cases that do not include healthy individuals, routine presentations, or mild conditions. Physicians in the comparison group were restricted from using textbooks, consulting colleagues, or accessing AI tools, all of which are routinely available in clinical practice. These constraints mean the 20% physician accuracy figure is not representative of real-world diagnostic performance.

What’s Next

Microsoft has stated that further real-world validation, governance frameworks, and regulatory review are required before any clinical deployment of MAI-DxO. The company’s research roadmap, outlined in its “Path to Medical Superintelligence” blog post, frames diagnostic AI as one component of a broader medical AI strategy. The SDBench benchmark and evaluation framework are available for independent researchers to replicate and extend the study’s findings using alternative models and case sets.
