BENCHMARKS

SWE-Rebench February Update: GPT-5.4 and Qwen3.5 Lead on Decontaminated Coding Tasks

James Whitfield · Mar 23, 2026 · Updated Apr 7, 2026 · 4 min read
Engine Score 8/10 — Important

This story provides an important and highly novel update on cutting-edge AI model benchmarks, offering actionable insights for developers and researchers. While the Reddit sourcing slightly weakens reliability and verifiability, the primary benchmark data from swe-rebench.com is highly relevant and timely.

  • On the SWE-Rebench leaderboard, Claude Opus 4.6 leads with a 65.3% resolved rate, followed by GPT-5.2 at 64.4%; GLM-5 and GPT-5.4 are tied at 62.8%, ranking third and fourth respectively.
  • Qwen3.5-397B-A17B places ninth at 59.9% resolved rate but achieves 71.9% on Pass@5, the highest among all models tested.
  • SWE-Rebench distinguishes itself from the original SWE-Bench by using continuously evolving, decontaminated problem sets to prevent benchmark gaming.

What Happened

The SWE-Rebench leaderboard updated its rankings in March 2026 with results for several new models, including GPT-5.4, Qwen3.5-397B-A17B, Gemini 3.1 Pro Preview, and Claude Sonnet 4.6. The benchmark, designed as a continuously evolving alternative to the original SWE-Bench, tests AI models on their ability to resolve real-world software engineering problems pulled from open-source repositories.

Claude Opus 4.6 holds the top position at 65.3% resolved rate. GPT-5.4 ranks fourth at 62.8%, while Qwen3.5-397B-A17B sits ninth at 59.9%. The results reflect the current state of AI coding capabilities across models from Anthropic, OpenAI, Alibaba, and Zhipu AI.

Why It Matters

SWE-Rebench addresses a persistent problem in AI benchmarking: data contamination. The original SWE-Bench became less informative over time as model providers could potentially train on its fixed problem set, inflating scores without corresponding improvements in real-world coding ability. SWE-Rebench selects problems within specific time windows and flags evaluations where model release dates overlap with problem publication dates, making it harder to game the results.
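
The article does not reproduce SWE-Rebench's actual flagging logic, but the core of the check is a date comparison between model release and problem publication. A minimal sketch in Python; the class shapes, model names, and dates are all hypothetical:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    repo: str
    published: date  # when the issue became publicly visible

@dataclass
class Model:
    name: str
    released: date   # model release date, a proxy for training cutoff

def is_potentially_contaminated(model: Model, problem: Problem) -> bool:
    """Flag evaluations where the problem was public on or before the
    model's release, i.e. it could have leaked into training data."""
    return problem.published <= model.released

# Hypothetical dates: an older problem gets flagged, a newer one does not.
m = Model("example-model", date(2026, 2, 15))
assert is_potentially_contaminated(m, Problem("astropy/astropy", date(2025, 11, 3)))
assert not is_potentially_contaminated(m, Problem("django/django", date(2026, 3, 1)))
```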

The leaderboard reveals that raw resolution rate does not tell the full story. Qwen3.5-397B-A17B, despite ranking ninth on single-attempt resolution at 59.9%, achieves 71.9% on Pass@5, the highest score of any model tested. This means that given five attempts, Qwen3.5 finds correct solutions more reliably than models ranked above it on first-try performance. For teams deploying AI coding agents in production, where retry logic is standard, this multi-attempt reliability may matter more than single-shot accuracy.
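
SWE-Rebench's exact Pass@5 computation isn't specified in the update, but the standard unbiased pass@k estimator popularized by the HumanEval paper is a reasonable reference point: from n sampled attempts per problem, c of which pass, estimate the chance that at least one of k draws succeeds. A short sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples drawn from n total, c of them correct,
    solves the task: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failures than draws, so a success is guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical numbers: a problem solved in 3 of 10 sampled attempts
# still contributes ~0.917 to the model's pass@5 average.
print(round(pass_at_k(n=10, c=3, k=5), 3))  # -> 0.917
```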

Technical Details

A February 2026 methodology update removed in-context demonstrations and the previous strict 80-step limit, allowing models to work with larger contexts without artificial constraints. The benchmark now caps context at 128k tokens and includes auxiliary interface specifications for test functions. These changes were designed to better reflect how models are used in practice, where developers provide extensive context and expect models to reason over large codebases.
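
The article doesn't show the harness configuration itself; the sketch below is purely illustrative, restating the February changes as settings a hypothetical evaluation harness might expose:

```python
# Purely illustrative: SWE-Rebench's real harness configuration is not
# published in this article. This restates the February 2026 changes
# as settings a hypothetical evaluation harness might expose.
EVAL_CONFIG = {
    "max_steps": None,                 # strict 80-step limit removed
    "context_window_tokens": 128_000,  # hard context cap still applied
    "demonstrations": [],              # in-context demonstrations dropped
    "test_interface_specs": True,      # auxiliary specs for test functions
}
```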

GPT-5.4 stands out for token efficiency. It achieves its fourth-place ranking with the lowest token consumption among comparable top-five models, suggesting OpenAI has optimized the model for getting more done with fewer tokens. This efficiency has practical cost implications for teams running AI coding assistants at scale, where token usage directly translates to API costs. By contrast, models like Qwen3-Coder-Next average approximately 8.12 million tokens per problem, benefiting from very large working contexts at the cost of significant computational expense.
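
The cost implication is simple arithmetic. The blended per-token price below is an illustrative placeholder, not a published rate:

```python
# Rough per-problem cost from average token usage. The price is an
# illustrative placeholder, not a published API rate.
PRICE_PER_MTOK = 5.00  # hypothetical blended $ per 1M tokens

def cost_per_problem(avg_tokens: float, price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """Convert an average per-problem token count into dollars."""
    return avg_tokens / 1_000_000 * price_per_mtok

# At the ~8.12M tokens per problem cited above for Qwen3-Coder-Next,
# one attempt costs ~$40.60 at this placeholder rate; a model using
# a tenth of the tokens would cost ~$4.06.
print(f"${cost_per_problem(8_120_000):.2f}")  # -> $40.60
print(f"${cost_per_problem(812_000):.2f}")    # -> $4.06
```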

The top of the leaderboard consists of Claude Opus 4.6 (65.3%), GPT-5.2 (64.4%), GLM-5 (62.8%), and GPT-5.4 (62.8%), followed by a cluster of models in the 60-62% range that includes Gemini 3.1 Pro Preview.

Who’s Affected

Engineering teams evaluating AI coding assistants can use SWE-Rebench as a more reliable signal than the original SWE-Bench. The benchmark’s decontamination approach makes it harder for model providers to optimize specifically for benchmark scores without genuine capability improvements, giving procurement teams more confidence in the rankings.

The results also matter for the broader AI model market. Chinese models like Qwen3.5 and GLM-5 are now competing directly with U.S.-built models on software engineering tasks, with GLM-5 from Zhipu AI placing third overall ahead of GPT-5.4. For organizations in regions where data residency or geopolitical considerations influence model selection, having competitive Chinese alternatives on a trusted benchmark provides useful decision-making data.

What’s Next

The SWE-Rebench team continues to update the problem set to maintain decontamination as new models are released. Future evaluations are expected as model versions ship throughout 2026, including anticipated updates from Anthropic, OpenAI, and Google. The benchmark team has indicated interest in expanding its methodology to capture more dimensions of real-world coding performance beyond single-instance problem resolution.

The gap between single-attempt and multi-attempt scores across models suggests that retry strategies and agent scaffolding may matter as much as raw model capability for practical coding applications. Engineering teams building AI-assisted development workflows should consider evaluating models on Pass@5 and token efficiency alongside headline resolution rates, since production deployments rarely rely on a single inference call. Whether SWE-Rebench incorporates agent-level evaluation metrics in future iterations remains an open question, but the benchmark’s decontamination approach has already established it as a more trustworthy signal than its predecessor for tracking real progress in AI coding capabilities.
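
As a closing illustration, the retry pattern the Pass@5 numbers reward is straightforward to sketch; the generator and verifier callables below are hypothetical stand-ins for a model call and the repository's test suite:

```python
from typing import Callable, Optional

def solve_with_retries(
    generate_patch: Callable[[str, int], str],  # hypothetical model call; int seed varies sampling
    passes_tests: Callable[[str], bool],        # hypothetical test-suite verifier
    issue: str,
    max_attempts: int = 5,                      # mirrors the Pass@5 setting
) -> Optional[str]:
    """Sample up to max_attempts independent patches and return the
    first one that passes verification: the production retry loop that
    turns multi-attempt reliability into single-task success."""
    for seed in range(max_attempts):
        patch = generate_patch(issue, seed)
        if passes_tests(patch):
            return patch
    return None
```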
