On the OSWorld-Verified benchmark, which simulates real desktop productivity tasks — not academic tests but actual work like filling forms, navigating browsers, and managing spreadsheets — GPT-5.4 scored 75%. Human expert testers scored 72.4%. Released March 5, 2026, GPT-5.4 is the first AI model to surpass professional human performance on a realistic work simulation benchmark.
The Five-Variant Strategy
OpenAI released GPT-5.4 in five variants: Standard (optimized for throughput), Thinking (extended reasoning), Pro (maximum capability), Mini, and Nano. The Mini and Nano variants, released March 17, run 2x faster than the main model while approaching its performance; Mini scores comparably on SWE-bench Pro and OSWorld-Verified despite being significantly smaller.
The context window defaults to 272,000 tokens but is experimentally configurable up to 1.05 million tokens (922K input + 128K output) — the largest OpenAI has ever offered. Factual errors are reduced by 33% at the individual claim level and 18% at the full-response level compared to GPT-5.2.
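For readers who want to sanity-check whether a workload fits these budgets, the sketch below classifies a prompt against the default and extended windows. It is a minimal illustration, not an official tool: it assumes the o200k_base tiktoken encoding as a stand-in for GPT-5.4's unpublished tokenizer, and the budget constants simply restate the figures quoted above.

```python
# Rough token-budget check for the context windows described above.
# Assumption: o200k_base is a stand-in for GPT-5.4's actual tokenizer,
# which has not been published; the constants restate this article's figures.
import tiktoken

DEFAULT_CONTEXT = 272_000           # default window (input + output)
EXTENDED_INPUT = 922_000            # experimental extended input budget
EXTENDED_OUTPUT = 128_000           # experimental extended output budget
EXTENDED_CONTEXT = EXTENDED_INPUT + EXTENDED_OUTPUT  # 1.05M total

enc = tiktoken.get_encoding("o200k_base")

def fits(prompt: str, reserved_output: int = 8_000) -> str:
    """Report which context configuration a prompt would need."""
    n = len(enc.encode(prompt))
    if n + reserved_output <= DEFAULT_CONTEXT:
        return f"{n} tokens: fits the default 272K window"
    if n <= EXTENDED_INPUT and reserved_output <= EXTENDED_OUTPUT:
        return f"{n} tokens: needs the experimental 1.05M configuration"
    return f"{n} tokens: exceeds even the extended input budget"

if __name__ == "__main__":
    print(fits("Summarize the attached quarterly spreadsheet."))
```

The split matters in practice because the extended configuration caps output at 128K tokens regardless of how little input is used, so long-generation jobs hit a different ceiling than long-ingestion jobs.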
What 75% vs 72.4% Actually Means
The margin of 2.6 percentage points above humans is narrow but symbolically significant. Combined with its coding performance (SWE-bench Pro at 57.7%) and knowledge-work performance (GDPval at 83%), it makes GPT-5.4 the first model to credibly handle all three domains (desktop work, coding, and knowledge work) at frontier level simultaneously.
In practice, the 75% score means GPT-5.4 successfully completes roughly three out of four of the benchmark's desktop tasks. The remaining 25% failure rate (tasks where the model gets confused, takes wrong actions, or produces incorrect output) is the gap between a benchmark demo and a production-ready autonomous worker. For comparison, Claude Sonnet 4.6 scores 72.5% on the same benchmark, effectively tied with human performance.
