SPOTLIGHT

Claude Opus 4.7 Hits 87.6% on SWE-bench — The Highest Score on Record

Ryan Matsuda · Apr 21, 2026 · 5 min read
Engine Score 9/10 — Critical

This story reports a new record for Claude Opus 4.7 on a critical coding benchmark, positioning it as a clear leader against competitors like GPT-5.4. This has significant implications for software development and the competitive AI landscape.


Anthropic’s Claude Opus 4.7, released April 21, 2026, scored 87.6% on SWE-bench Verified — the highest mark ever publicly recorded on the benchmark, up 6.8 percentage points from Claude Opus 4.6’s 80.8%. On the harder SWE-bench Pro variant, Opus 4.7 scores 64.3% against GPT-5.4’s 57.7%. The coding benchmark race has a clear leader.

SWE-bench Verified is not a curated toy dataset. It consists of 500 real GitHub issues pulled from production software repositories — the kind of bugs developers spend hours debugging. An 87.6% pass rate means the model resolves roughly 438 of those 500 issues autonomously. That number would have been dismissed as a theoretical ceiling twelve months ago.

The 6.8-Point Jump That Defines the Generation Gap

Opus 4.6 scored 80.8% on SWE-bench Verified when it launched in late 2025. The 6.8 percentage point improvement on Opus 4.7 represents faster generational progress than the preceding model transition — Opus 4.5 to 4.6 gained roughly 5 points on the same benchmark. The acceleration is not incidental.

Anthropic’s investment in long-context reasoning and multi-step agent behavior shows up directly in these numbers. SWE-bench Verified rewards exactly those capabilities: understanding a repository’s structure, tracing a bug across multiple files, generating a minimal patch, and passing the existing test suite without human guidance. It penalizes models that hallucinate plausible-looking fixes. Anthropic’s source code leak earlier this year exposed how deeply the team had invested in agentic coding infrastructure — the benchmark results confirm those investments have compounded.

Claude Opus 4.7 vs. GPT-5.4: The Full Benchmark Scorecard

A single benchmark number obscures the full picture. Across six major evaluations, the Opus 4.7 versus GPT-5.4 comparison reveals two models with distinct strengths — not one clear winner across all domains.

Benchmark          | Claude Opus 4.7        | GPT-5.4 / GPT-5.4 Pro | Winner
SWE-bench Verified | 87.6%                  | —                     | Opus 4.7
SWE-bench Pro      | 64.3%                  | 57.7%                 | Opus 4.7 (+6.6 pts)
CursorBench        | 70.0% (prev. gen: 58%) | —                     | Opus 4.7
GPQA Diamond       | 94.2%                  | 94.4% (Pro)           | GPT-5.4 Pro (+0.2 pts)
BrowseComp         | 79.3%                  | 89.3%                 | GPT-5.4 (+10.0 pts)
Terminal-Bench 2.0 | 69.4%                  | 75.1%                 | GPT-5.4 (+5.7 pts)

The pattern is unambiguous: Opus 4.7 dominates code-centric benchmarks while GPT-5.4 leads in browser-based retrieval and terminal execution. Those are different product profiles serving different engineering workflows.

CursorBench: From 58% to 70% in One Generation

CursorBench evaluates AI performance in real IDE-integrated coding sessions — not algorithm puzzles, but editing existing codebases inside a live development environment. The benchmark mirrors the conditions developers encounter daily: incomplete context, tangled dependencies, legacy code without documentation. That context is what makes Opus 4.7's 12-point jump from 58% (Opus 4.6) to 70% notable.

That improvement carries commercial weight. Cursor's benchmark was designed specifically to cut through vendor claims about coding ability. A 70% score means the model completes approximately seven in ten representative IDE tasks correctly, a standard no model could reach a year ago. For enterprise teams evaluating which AI to embed in their development pipeline, that figure matters more than a synthetic algorithm score on a clean dataset.

Where GPT-5.4 Still Wins

GPT-5.4 is not losing across the board. On BrowseComp — a test of multi-step web research and information retrieval — it scores 89.3% against Opus 4.7’s 79.3%, a 10-point gap that reflects OpenAI’s sustained investment in web-connected agents. On Terminal-Bench 2.0, GPT-5.4 reaches 75.1% against Opus 4.7’s 69.4%.

These are meaningful margins, not rounding errors. For use cases involving autonomous research, competitive intelligence, or shell-based DevOps automation, GPT-5.4 is the stronger tool in April 2026. OpenAI’s aggressive acquisition strategy has systematically added capabilities that feed directly into browsing and terminal performance.

On GPQA Diamond — graduate-level science reasoning — Opus 4.7 scores 94.2% against GPT-5.4 Pro’s 94.4%. A 0.2-point gap is statistical noise. Both models are operating at the ceiling of what that benchmark can measure, and neither has a meaningful edge on scientific reasoning tasks.
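
A quick sanity check supports that reading. GPQA Diamond contains 198 questions, so the binomial standard error on a single run is several times larger than the observed gap. A minimal sketch (it ignores run-to-run sampling variance, which would only widen the noise band):

```python
# Binomial standard error of a pass rate on a ~200-question benchmark.
# GPQA Diamond has 198 questions; 0.943 is the midpoint of the two scores.
import math

n_questions = 198
accuracy = 0.943
std_err = math.sqrt(accuracy * (1 - accuracy) / n_questions)
print(f"One standard error: {std_err * 100:.1f} points")  # ~1.6 points vs. a 0.2-point gap
```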

The 2x Cost Premium — and Whether It Closes

Claude Opus 4.7 costs approximately twice as much per input token as GPT-5.4. At scale, that gap compounds fast. For a team running 10 million tokens per day, the premium translates to thousands of additional dollars per month in API costs before a single productivity gain is realized.

The counterargument is straightforward arithmetic. If a senior developer costs $200 per hour and an AI-resolved GitHub issue saves 30 minutes of debugging time, each closed issue is worth roughly $100 in recovered engineering capacity. The 6.6 percentage point improvement over GPT-5.4 on SWE-bench Pro means Opus 4.7 resolves approximately one additional issue per 15 attempts. For a team attempting 50 AI-assisted issue resolutions per week, that translates to roughly three additional successes, or about $330 in recovered time, every week. The cost premium is recoverable for teams at that velocity.
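
For teams that want to rerun that calculation with their own numbers, here is the same arithmetic as a minimal sketch. Every input is an assumption taken from this section, not a measured figure:

```python
# Worked version of the recovery arithmetic above. Replace each input
# with your own team's numbers; none of these are published figures.

dev_hourly_rate = 200.0        # fully loaded senior developer cost, $/hour
hours_saved_per_issue = 0.5    # 30 minutes of debugging avoided per resolution
value_per_issue = dev_hourly_rate * hours_saved_per_issue    # ~$100

swe_pro_gap = 0.643 - 0.577    # Opus 4.7 vs. GPT-5.4 on SWE-bench Pro (6.6 pts)
attempts_per_week = 50         # AI-assisted issue attempts per week

extra_resolutions = attempts_per_week * swe_pro_gap          # ~3.3 issues
recovered_per_week = extra_resolutions * value_per_issue     # ~$330

print(f"Extra resolutions/week: {extra_resolutions:.1f}")
print(f"Recovered value/week:   ${recovered_per_week:,.0f}")
# The premium pays for itself whenever the weekly API cost delta from
# the ~2x price difference stays below this recovered-value figure.
```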

MegaOne AI tracks 139+ AI tools across 17 categories, and cost-efficiency remains the primary adoption variable for teams under $50K annual AI spend. For that segment, GPT-5.4’s price-performance profile is stronger. For high-velocity engineering organizations where developer time is the binding constraint, Opus 4.7’s coding premium is defensible.

‘Claude Mania’ — What the HumanX Data Actually Means

At the HumanX conference in early 2026, enterprise AI attendees described their Claude adoption as a “religion” — a phrase that circulated widely and reflected organizational loyalty that goes beyond casual tool preference. The benchmark numbers explain the intensity: when a model consistently resolves bugs that previously required senior engineering time, it stops being software and becomes infrastructure.

The cultural signal is as important as the technical one. As AI’s role in knowledge work has deepened, the difference between a model that works reliably and one that almost works has widened proportionally. A 30-point improvement on CursorBench across two model generations — from below 40% two years ago to 70% today — is what converts skeptics into advocates and advocates into evangelists.

Anthropic’s enterprise momentum reflects a deliberate strategic bet: make coding the primary value proposition, and let every other capability benefit from the halo. That bet is paying off in benchmark scores and in conference rooms.

What 87.6% Means for the Developer Backlog

SWE-bench Verified draws from production repositories including Django, sympy, and scikit-learn. These are real bug reports filed by real users against real codebases — not sanitized toy examples. The model must read the issue, locate the relevant code across the repository, write a targeted patch, and pass the existing test suite. No scaffolding, no hints.

For a team triaging a 100-issue backlog of comparable difficulty, Opus 4.7's 87.6% rate on SWE-bench Verified implies autonomous resolution of roughly 88 issues. On the harder SWE-bench Pro set, the like-for-like comparison is 64.3% against GPT-5.4's 57.7%: roughly seven additional resolutions per hundred issues. That delta, multiplied across quarterly sprint cycles, is the calculation engineering leaders are running right now.
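
Treating a benchmark pass rate as a per-issue success probability is a simplification (real backlogs can be harder or easier than SWE-bench issues), but it turns the scores into rough expected ranges, as in this sketch:

```python
# Expected autonomous resolutions on a 100-issue backlog, treating each
# benchmark pass rate as a per-issue success probability. This is a
# simplifying assumption; it ignores issue-difficulty differences.
import math

def expected_resolutions(backlog_size: int, pass_rate: float) -> tuple[float, float]:
    """Return (expected resolved count, one binomial standard deviation)."""
    mean = backlog_size * pass_rate
    std = math.sqrt(backlog_size * pass_rate * (1 - pass_rate))
    return mean, std

for label, rate in [("Opus 4.7, Verified", 0.876),
                    ("Opus 4.7, Pro", 0.643),
                    ("GPT-5.4, Pro", 0.577)]:
    mean, std = expected_resolutions(100, rate)
    print(f"{label:20s} ~{mean:.0f} +/- {std:.0f} of 100 issues")
```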

The model is not replacing engineers. It is eliminating the low-complexity tail of the bug backlog — the stack traces, the off-by-one errors, the import failures — freeing senior developers for architecture decisions. At 87.6%, the practical bar for meaningful autonomous code contribution has been crossed. Teams evaluating AI coding assistants in Q2 2026 should test Opus 4.7 against their actual issue backlog. The SWE-bench number predicts real performance more accurately than any vendor demo, and right now, it predicts a wide margin over every alternative.

