BENCHMARKS

SWE-Rebench February Update: GPT-5.4 and Qwen3.5 Lead on Decontaminated Coding Tasks

megaone_admin · Mar 23, 2026 · 2 min read
Engine Score 8/10 — Important

This story provides an important and highly novel update on cutting-edge AI model benchmarks, offering actionable insights for developers and researchers. While the Reddit sourcing slightly weakens reliability and verification, the primary benchmark data from swe-rebench.com is highly relevant and timely.


The SWE-rebench leaderboard published its February 2026 update, evaluating frontier AI models on 57 fresh GitHub pull request tasks that were not present in any model’s training data. The benchmark, which draws from a dataset of over 21,000 interactive tasks across 3,400 GitHub repositories, provides one of the most rigorous assessments of AI coding ability by using continuously updated, real-world software engineering problems that models cannot have memorized.

The decontamination methodology is SWE-rebench’s core differentiator from older coding benchmarks like HumanEval and MBPP. Each evaluation batch uses recently created GitHub PRs — actual bug fixes, feature implementations, and refactoring tasks submitted by human developers — ensuring that models are tested on problems they have never seen during training. This eliminates the contamination problem that has undermined confidence in static coding benchmarks, where frontier models score increasingly high partly because benchmark problems leak into training data.
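To illustrate the core idea behind date-based decontamination (this is a sketch, not SWE-rebench's actual pipeline; the task schema and cutoff date below are invented):

```python
from datetime import date

# Hypothetical task records; SWE-rebench's real schema is not described
# in this article.
tasks = [
    {"repo": "acme/widgets", "pr": 101, "created": date(2026, 1, 15)},
    {"repo": "acme/widgets", "pr": 88, "created": date(2025, 6, 2)},
    {"repo": "beta/tools", "pr": 57, "created": date(2026, 2, 3)},
]

# Latest training-data cutoff among the evaluated models (assumed value).
latest_cutoff = date(2025, 12, 31)

# Keep only PRs created after every model's cutoff, so no model could
# have seen them during training.
fresh_tasks = [t for t in tasks if t["created"] > latest_cutoff]
print(len(fresh_tasks))  # → 2
```

Because each batch is drawn from PRs newer than every model's training cutoff, a high score cannot be explained by memorization.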

GPT-5.4 and Qwen3.5 emerged as leading performers in the February batch, with Gemini 3.1 also showing strong results. The specific pass rates and rankings vary across task categories — models that excel at bug fixing may underperform on feature implementation or test generation. SWE-rebench captures this variation by testing across diverse task types, providing a more nuanced picture of coding capability than single-score benchmarks.
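To see why per-category breakdowns give a more nuanced picture than a single score, here is a minimal sketch of computing pass rates per task type. The numbers are invented for illustration and are not the leaderboard's actual results:

```python
from collections import defaultdict

# Invented evaluation records: (task_category, resolved) pairs.
results = [
    ("bug_fix", True), ("bug_fix", True), ("bug_fix", False),
    ("feature", True), ("feature", False), ("feature", False),
    ("refactor", True), ("refactor", True),
]

# category -> [resolved_count, attempted_count]
totals = defaultdict(lambda: [0, 0])
for category, resolved in results:
    totals[category][1] += 1
    if resolved:
        totals[category][0] += 1

for category, (ok, n) in sorted(totals.items()):
    print(f"{category}: {ok}/{n} = {ok / n:.0%}")
# → bug_fix: 2/3 = 67%
# → feature: 1/3 = 33%
# → refactor: 2/2 = 100%
```

A single aggregate score would hide exactly this spread, which is what makes category-level reporting useful for comparing models.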

The 57-task February batch is smaller than the full dataset but representative of the distribution of real software engineering work. Tasks range from simple single-file fixes to multi-file refactoring that requires understanding project architecture, dependency relationships, and testing conventions. This complexity spectrum tests not just code generation ability but the agentic planning and context management that distinguish useful coding assistants from models that can only complete isolated function implementations.

For engineering teams evaluating AI coding tools, SWE-rebench results offer more actionable guidance than synthetic benchmarks. A model’s performance on real GitHub PRs correlates more closely with its utility in production development workflows than its score on algorithm puzzles. The continuously updated nature of the benchmark means that results remain relevant as models improve — unlike static benchmarks that become saturated and lose discriminative power.


MegaOne AI Editorial Team

MegaOne AI monitors 200+ sources daily to identify and score the most important AI developments. Every story is fact-checked, linked to primary sources, and rated using our six-factor Engine Score methodology.
