A new coding benchmark called DeepSWE launched in late May 2026 and immediately revealed a gap the standard tests miss: GPT-5.5 leads Claude Opus 4.8 by 16 points on real long-horizon software engineering tasks.
The result complicates the prevailing narrative. On conventional benchmarks, Claude leads on coding, but DeepSWE measures something those tests do not.
What DeepSWE measures differently
DeepSWE tests multi-file, multi-step engineering workflows that take hours, not minutes — closer to how developers actually work. Standard benchmarks like SWE-bench and Terminal-Bench reward short, well-scoped tasks. DeepSWE rewards sustained planning across an extended session.
| Benchmark | Task horizon | Leader |
|---|---|---|
| SWE-bench Pro | Short, scoped | Claude Opus 4.8 |
| Terminal-Bench 2.1 | Short, scoped | Claude Opus 4.8 |
| DeepSWE | Multi-hour, multi-file | GPT-5.5 (+16 points) |
Why standard benchmarks miss the gap
SWE-bench and Terminal-Bench measure discrete fixes — the kind of task that completes in one reasoning pass. They do not capture whether a model can hold context, replan, and stay coherent across a multi-hour build. A model can top the short-task leaderboards and still drift on extended work.
Short tasks vs long horizons: pick your model
The practical reading is that the two leaders excel at different jobs. Claude Opus 4.8 — whose benchmark gains we covered at release — still leads on shorter, well-defined coding tasks. GPT-5.5’s planning architecture gives it the edge on extended engineering sessions.
- Short, well-scoped fixes: Claude Opus 4.8 leads.
- Multi-hour, multi-file builds: GPT-5.5 leads by 16 points.
Why this benchmark may rewrite the narrative
If long-horizon performance is what matters for real engineering, DeepSWE reframes the GPT-5.5 vs Opus 4.8 debate from “which model codes better” to “which model codes better for how long.” That distinction will shape enterprise model selection more than any single leaderboard.
The actionable step: match the model to the task horizon. Route quick fixes to the short-task leader and long, multi-file builds to the long-horizon leader, and re-run the comparison as each model ships its next version.
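One way to picture that routing rule is a small dispatcher that labels each task by horizon and picks a model accordingly. This is a minimal sketch, not a recommended implementation: the `Task` fields, the `pick_model` helper, the model label strings, and both thresholds are hypothetical placeholders. DeepSWE does not publish cut-offs like these, so they would need tuning against your own workload.

```python
from dataclasses import dataclass

# Hypothetical labels: substitute whatever identifiers your provider SDKs expect.
SHORT_TASK_LEADER = "claude-opus-4.8"
LONG_HORIZON_LEADER = "gpt-5.5"

# Assumed cut-offs, not taken from DeepSWE: tune them against your own workload.
LONG_HORIZON_MINUTES = 60
MULTI_FILE_THRESHOLD = 2


@dataclass(frozen=True)
class Task:
    """Illustrative task record; the fields are rough estimates made up front."""
    description: str
    estimated_minutes: int
    files_touched: int


def is_long_horizon(task: Task) -> bool:
    """Treat a task as long-horizon only if it is both lengthy and multi-file."""
    return (
        task.estimated_minutes >= LONG_HORIZON_MINUTES
        and task.files_touched >= MULTI_FILE_THRESHOLD
    )


def pick_model(task: Task) -> str:
    """Return the model label for the task's horizon."""
    return LONG_HORIZON_LEADER if is_long_horizon(task) else SHORT_TASK_LEADER


if __name__ == "__main__":
    tasks = [
        Task("Fix off-by-one error in pagination", 10, 1),
        Task("Migrate billing service to a new ORM", 240, 30),
    ]
    for task in tasks:
        print(f"{task.description} -> {pick_model(task)}")
```

Keeping the two leader labels as module-level constants means that re-running the comparison after a new model release only requires changing those two lines, not the routing logic.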