A 2,500-question benchmark designed to defeat current AI systems is doing exactly what its creators intended: exposing how far large language models remain from genuine expert-level reasoning. “Humanity’s Last Exam” (HLE), created by the Center for AI Safety and Scale AI with contributions from nearly 1,000 researchers worldwide, was formally published in Nature in January 2026 and has become the most demanding public test of AI capabilities to date.
The benchmark’s construction was deliberately adversarial. Questions span highly specialized domains — from deciphering ancient Palmyrene inscriptions to identifying microanatomical structures in birds — and any question an AI model answered correctly during the development phase was removed from the final set. A $500,000 prize pool incentivized expert contributors, including Dr. Tung Nguyen of Texas A&M and neuroscientist Manuel Schottdorf of the University of Delaware, to submit questions that would stump the most capable systems available.
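The filtering code itself is not published, so the sketch below is only an illustration of the selection principle described above; every name in it (candidate_questions, frontier_models, ask_model, grade) is a hypothetical placeholder rather than part of any real HLE tooling. The idea is simply that a candidate question reaches the final set only if every frontier model tested gets it wrong.

```python
# Illustrative sketch only, not the benchmark's actual code. All names here
# (candidate_questions, frontier_models, ask_model, grade) are hypothetical.
# Principle from the article: any question a model answered correctly during
# development was dropped, so only questions that stump every model survive.

def adversarial_filter(candidate_questions, frontier_models, ask_model, grade):
    """Return the subset of questions that no tested model answers correctly."""
    surviving = []
    for q in candidate_questions:
        solved_by_any = any(
            grade(ask_model(model, q["prompt"]), q["answer"])
            for model in frontier_models
        )
        if not solved_by_any:
            surviving.append(q)  # stumped every model, so it stays in the set
    return surviving
```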
The results are stark. When first tested, GPT-4o scored 2.7 percent and Claude 3.5 Sonnet managed 4.1 percent. OpenAI’s o1 reasoning model reached 8 percent. More recent models have improved but remain far below human performance: Gemini 3 Pro Preview leads the official leaderboard at 37.5 percent, followed by Claude Opus 4.6 Thinking at 34.4 percent and GPT-5 Pro at 31.6 percent. Human experts, by comparison, achieve approximately 90 percent accuracy.
The benchmark addresses a practical measurement problem. Standard tests like MMLU have become saturated, with leading models scoring above 90 percent; once every frontier system clusters near the ceiling, the few remaining points of difference say little about which model is stronger or how fast the field is moving. Dan Hendrycks, director of the Center for AI Safety and the project’s lead architect, designed HLE to remain relevant as capabilities advance. A rolling update mechanism, HLE-Rolling, adds new questions to prevent the benchmark from going stale.
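The article describes HLE-Rolling only in that one sentence, so the following is a hypothetical sketch under the assumption that new submissions pass through the same kind of adversarial filter before being added; the names (benchmark, submissions, frontier_models, ask_model, grade) are placeholders, not a real API.

```python
# Hypothetical sketch of a rolling update in the spirit of HLE-Rolling as the
# article describes it. The real cadence, interfaces, and grading are not public;
# every name below is an assumption for illustration.

def rolling_update(benchmark, submissions, frontier_models, ask_model, grade):
    """Append newly submitted questions only when today's frontier models all
    miss them, restoring headroom as older questions begin to be solved."""
    for q in submissions:
        solved_by_any = any(
            grade(ask_model(model, q["prompt"]), q["answer"])
            for model in frontier_models
        )
        if not solved_by_any:
            benchmark.append(q)
    return benchmark
```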
For AI developers, HLE provides a concrete metric for capability gaps. For policymakers evaluating AI governance frameworks, it offers something rarer: an honest assessment of what these systems cannot do. The 50-plus percentage point gap between the best AI scores and human expert performance is one that incremental model improvements are unlikely to close quickly: the questions require deep domain expertise, multi-step reasoning, and knowledge that rarely appears in training data. The next global AI governance forum, planned by the UN for July 2026, will likely reference HLE scores as a grounding mechanism against overstated capability claims.
