SPOTLIGHT

GPT-5.5 ‘Spud’ Just Dropped — Retakes the Crown From Claude on 14 of 18 Benchmarks

Ryan Matsuda · Apr 25, 2026 · 6 min read
Engine Score 10/10 — Critical

This story is rated critical because OpenAI's GPT-5.5 release reclaims benchmark leadership and reshapes the competitive landscape. Its novelty, industry-wide impact, and immediate actionability for developers and companies make it a top-tier news item.


On April 23, 2026, OpenAI released GPT-5.5 — internally codenamed “Spud” — the company’s first fully retrained base model since GPT-4.5. The model scores 82.7% on Terminal-Bench 2.0, a 13.3-point lead over Anthropic’s Claude Opus 4.7 (69.4%), and tops 14 of 18 standard benchmarks. The benchmark crown has changed hands again, six weeks after GPT-5.4.

The sweep is not total, and the exceptions matter to engineers who will actually deploy these models in production. GPT-5.5 leads on general-purpose agentic tasks; Claude Opus 4.7 keeps the lead on deep debugging and multilingual comprehension. Neither advantage is structural, and neither company has held the top position for more than one release cycle in 2026.

GPT-5.5 vs Claude Opus 4.7: Full Benchmark Breakdown

OpenAI’s release covers 18 evaluation tasks spanning agentic coding, real-world computer use, mathematical reasoning, and multilingual comprehension. GPT-5.5 leads on 14. Here are the numbers that define the race:

Benchmark                 GPT-5.5   Claude Opus 4.7   Leader
Terminal-Bench 2.0        82.7%     69.4%             GPT-5.5 (+13.3)
OSWorld-Verified          78.7%     —                 GPT-5.5
GDPval                    84.9%     —                 GPT-5.5
Expert-SWE                73.1%     —                 GPT-5.5
MRCR v2 (long-context)    74.0%     —                 GPT-5.5
SWE-bench Pro             58.6%     64.3%             Opus 4.7 (+5.7)
Multilingual MMLU         83.2%     91.5%             Opus 4.7 (+8.3)

(— indicates no Opus 4.7 score was published for that benchmark.)

The Terminal-Bench 2.0 margin is the widest single-metric lead either model has posted this generation. OSWorld-Verified at 78.7% and Expert-SWE at 73.1% reinforce the pattern: GPT-5.5's advantage is concentrated in long-horizon, real-world execution tasks, precisely the category driving the most enterprise adoption in 2026. The 84.9% GDPval score, which measures performance on economically valuable real-world professional tasks, adds a fourth data point in the same direction.

Long-Context Retrieval: The 37-Point Jump

The most dramatic internal improvement in GPT-5.5 is not agentic coding — it is long-context retrieval. On MRCR v2, a multi-document, multi-hop benchmark designed to stress-test model performance across very long contexts, GPT-5.5 scores 74.0%. GPT-5.4 scored 36.6% on the same evaluation. That 37.4-point gain effectively eliminates the context-dropout problem that has degraded GPT-series reliability on long-document workloads since GPT-4 Turbo.

OpenAI’s finance team demonstrated the practical ceiling: an internal GPT-5.5 deployment reviewed 24,771 K-1 tax forms totaling 71,637 pages in a single session, handling entity extraction, numerical cross-referencing, and multi-document reconciliation concurrently. Pre-GPT-5.5, that workload required structured extraction pipelines staffed by a dedicated data team. The benchmark gain and the K-1 example are measuring the same underlying capability from two directions.
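
OpenAI has not published the pipeline behind that demo, so any reconstruction is speculative. As a minimal sketch of what single-session, long-context extraction looks like against the API, assuming the model ships under a hypothetical "gpt-5.5" identifier and the existing chat-completions interface of the official openai Python SDK:

    # Hypothetical sketch of single-session extraction over many documents.
    # Assumes the openai Python SDK (>= 1.0) and a "gpt-5.5" model id;
    # OpenAI has not published the internals of its K-1 pipeline.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def review_forms(form_texts: list[str]) -> str:
        """Send a batch of form texts in one long-context request."""
        # Join forms with explicit separators so the model can
        # cross-reference entities and figures across documents.
        corpus = "\n\n--- FORM BREAK ---\n\n".join(form_texts)
        response = client.chat.completions.create(
            model="gpt-5.5",  # hypothetical identifier
            messages=[
                {"role": "system",
                 "content": "Extract each entity, validate reported figures, "
                            "and flag numbers that disagree across forms."},
                {"role": "user", "content": corpus},
            ],
        )
        return response.choices[0].message.content

The MRCR v2 gain is what makes a request like this viable: pre-5.5, the same corpus would have been split across many calls with retrieval glue in between, which is exactly the structured-pipeline work the finance team no longer needed.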

Where Claude Opus 4.7 Still Wins

On SWE-bench Pro — the extended software engineering benchmark weighted toward complex multi-file debugging — Anthropic’s Claude Opus 4.7 scores 64.3% against GPT-5.5’s 58.6%. That 5.7-point gap holds across task categories most correlated with real-world debugging performance on large, entangled codebases. For engineering teams whose primary use case is debugging rather than greenfield generation, Opus 4.7 remains the stronger choice today.

The multilingual gap is sharper. GPT-5.5 scores 83.2% on multilingual MMLU; Opus 4.7 scores 91.5%, an 8.3-point difference that is material for any enterprise operating across multiple languages. Anthropic has prioritized multilingual robustness in its models since Claude 3 Opus, and that three-generation investment shows clearly in the numbers.

The practical read: GPT-5.5 is the stronger general agent; Opus 4.7 is the stronger debugger and global-language model. Enterprises should select on the benchmark closest to their actual workload — the Terminal-Bench headline is the wrong heuristic for a team whose primary task is debugging a 400K-line Java monorepo in Portuguese.
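
That selection logic is mechanical enough to write down. A minimal routing sketch, with the benchmark splits encoded as comments; the model identifiers are assumptions for illustration, not confirmed API names:

    # Workload-based model router reflecting the current benchmark splits.
    # Model identifiers are illustrative assumptions, not confirmed API names.
    BENCHMARK_LEADERS = {
        "agentic": "gpt-5.5",         # Terminal-Bench 2.0: 82.7 vs 69.4
        "computer_use": "gpt-5.5",    # OSWorld-Verified: 78.7
        "long_context": "gpt-5.5",    # MRCR v2: 74.0 (vs 36.6 for GPT-5.4)
        "debugging": "claude-opus-4.7",     # SWE-bench Pro: 64.3 vs 58.6
        "multilingual": "claude-opus-4.7",  # Multilingual MMLU: 91.5 vs 83.2
    }

    def pick_model(workload: str) -> str:
        """Route a request to the current benchmark leader for its workload."""
        if workload not in BENCHMARK_LEADERS:
            raise ValueError(f"unknown workload: {workload!r}")
        return BENCHMARK_LEADERS[workload]

At a six-week leadership cadence, a mapping like this belongs in configuration rather than code; it will likely change before most enterprise contracts renew.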

Natively Omnimodal From Day One

Unlike GPT-4o, which added modalities through separate incremental deployments, GPT-5.5 is natively omnimodal across text, image, audio, and video from release. OpenAI has not published modality-specific benchmark comparisons, but the architectural shift matters for agent pipelines that need real-time video or audio input: GPT-5.4 deployments handled those tasks through two-model orchestration, a recurring latency and cost friction point.
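
The practical difference shows up in the request shape. Assuming GPT-4o's existing image-input message format carries over, and the same hypothetical "gpt-5.5" identifier, a mixed text-and-image task collapses to a single call (audio and video input formats are not yet documented):

    # One request mixing text and image input; no second model call needed.
    # Assumes GPT-4o's image-input message format carries over to GPT-5.5.
    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-5.5",  # hypothetical identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the failure in this dashboard screenshot."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/dashboard.png"}},
            ],
        }],
    )
    print(response.choices[0].message.content)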

The Price Just Doubled

GPT-5.5 is priced at $5 per million input tokens and $30 per million output tokens — exactly double GPT-5.4’s pricing at launch. For API teams with cost budgets built around GPT-5.4, this is a direct line-item shock with no grace period or migration discount announced as of April 25.
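
The arithmetic is worth sanity-checking against your own traffic before migrating. A back-of-envelope comparison at the published rates, with GPT-5.4's launch pricing inferred from the stated doubling:

    # Back-of-envelope cost comparison, rates in dollars per million tokens.
    # GPT-5.4 rates are inferred from the "exactly double" claim.
    PRICES = {
        "gpt-5.4": {"input": 2.50, "output": 15.00},
        "gpt-5.5": {"input": 5.00, "output": 30.00},
    }

    def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
        """Dollar cost for a month's traffic, token volumes in millions."""
        p = PRICES[model]
        return input_mtok * p["input"] + output_mtok * p["output"]

    # Example: 200M input and 40M output tokens per month.
    for model in PRICES:
        print(model, f"${monthly_cost(model, 200, 40):,.2f}")
    # gpt-5.4 $1,100.00
    # gpt-5.5 $2,200.00

Because the pricing is linear per token, the doubling carries through the entire bill at any volume.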

MegaOne AI tracks pricing across 139+ AI tools in 17 categories. A mid-year flagship doubling without a major model rename is unusual — only three other frontier foundation models have done this in the 2025–2026 cycle. At $30/M output tokens, GPT-5.5 is now in the same pricing band as Google’s Gemini 2.5 Ultra and Anthropic’s Opus 4.7. The premium tier has become a three-way competition on cost, not just capability. Whether the 37-point MRCR v2 improvement and Terminal-Bench leadership justify the price increase depends entirely on whether long-context retrieval or agentic execution is central to your production stack.

A Ramsey Number Proof, Verified in Lean

The most academically significant claim in OpenAI’s release materials: an internal GPT-5.5 variant produced a formal proof of a new Ramsey number result, subsequently verified by the Lean proof assistant. Ramsey numbers quantify the minimum structure size required to guarantee a given combinatorial property — a problem class that has resisted human mathematicians for decades, with core open problems dating to Frank Ramsey’s original 1930 work.

OpenAI had not published the specific Ramsey number or full proof details as of April 25, 2026. If the broader mathematics community validates the result, it marks the first formally verified, genuinely novel combinatorial discovery produced by a language model. That is a categorically different capability claim than any benchmark in the table above: benchmarks measure performance on known-answer tasks, while a verified Ramsey result means the model produced a proof of a previously unknown fact that both human reviewers and a machine checker confirmed.
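
Since neither the statement nor the proof is public, nothing below reflects the actual result. To make "verified in Lean" concrete, here is the shape such a claim takes: the classical R(3,3) ≤ 6, stated as a Lean 4 theorem skeleton whose placeholder a real verification would replace with a kernel-checked proof term:

    -- Illustrative only: the shape of a Ramsey-type statement in Lean 4.
    -- OpenAI's unpublished result concerns a previously open case; this is
    -- the classical R(3,3) ≤ 6: every symmetric 2-coloring of the edges of
    -- K₆ contains a monochromatic triangle.
    theorem ramsey_three_three
        (color : Fin 6 → Fin 6 → Bool)
        (symm : ∀ i j, color i j = color j i) :
        ∃ i j k : Fin 6, i ≠ j ∧ j ≠ k ∧ i ≠ k ∧
          color i j = color j k ∧ color j k = color i k := by
      sorry  -- a verified proof replaces this placeholder

What Lean verification buys is exactly this: once the placeholder is replaced, the kernel checks every step mechanically, so trusting the result does not require trusting the model that wrote it.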

Cyber Risk Rated ‘High’ Under the Preparedness Framework

OpenAI’s internal Preparedness Framework classifies GPT-5.5 at “High” cyber risk — the second-highest tier in its four-level scale (Low / Medium / High / Critical). The designation indicates the model can provide meaningful capability uplift to sophisticated threat actors in offensive cybersecurity domains, including vulnerability discovery and exploit development.

OpenAI has not specified which capability triggers pushed the classification above GPT-5.4's rating. The Terminal-Bench 2.0 score is a plausible factor: a model that completes autonomous terminal tasks at 82.7% accuracy on complex sequences is, by construction, capable of executing malicious shell operations with comparable reliability. The debate around AI safety thresholds has sharpened as frontier capabilities accelerated through 2026. Shipping a "High"-rated model without reaching "Critical" suggests that OpenAI's deployment controls and API access restrictions, not the risk label alone, now carry significant weight in the safety calculus.

85% Internal Adoption and 71,637 Pages

OpenAI disclosed in its release materials that 85% of the company’s employees use Codex at least once per week — a product signal, not a footnote. Codex runs on GPT-5.5. The statistic reflects the degree to which OpenAI has rebuilt its own workflows around the same agentic coding product it is positioning to enterprise customers, before those customers have had access to the model.

The K-1 form example makes the productivity claim concrete and auditable. OpenAI’s finance team processed 24,771 forms — 71,637 pages — through an internal GPT-5.5 deployment, performing entity extraction, numerical validation, and multi-document cross-referencing at a document volume that would require weeks of analyst time at most organizations. OpenAI’s broader enterprise positioning strategy has consistently relied on demonstrated internal adoption before external deployment. The K-1 example is that approach made numerically visible to prospective buyers.

Why Six Weeks?

The interval between GPT-5.4 and GPT-5.5 is the shortest inter-release gap in OpenAI’s frontier model history. Claude Opus 4.7 shipped in mid-March 2026 and immediately claimed benchmark leadership on SWE-bench Pro, multilingual MMLU, and three additional evaluations. OpenAI had GPT-5.5 in late-stage testing at that point; the Opus 4.7 release accelerated the deployment decision from a planned Q2 window to a late-April ship date.

The six-week cadence has structural implications beyond this release cycle. If OpenAI and Anthropic continue trading benchmark leadership at sub-quarter intervals, the stability assumptions embedded in enterprise API contracts — typically three-to-six-month pricing and capability windows — will require renegotiation. OpenAI’s enterprise deal velocity has been accelerating through 2026; the competitive pressure to ship benchmark-leading models faster is now part of the commercial calculus, not just a research competition between two labs.

Deploy GPT-5.5 for agent pipelines, long-document processing, and real-world computer-use tasks — those are the benchmarks where the 13.3-point Terminal-Bench lead and 37-point MRCR v2 improvement translate to production gains. Retain Claude Opus 4.7 in your evaluation set for complex multi-file debugging and multilingual workloads where the 5.7- and 8.3-point gaps are not rounding error. The model leading the next cycle is already in training at both companies.

