- Artificial Analysis and IBM Software Innovation Lab launched ITBench-AA, the first benchmark for agentic enterprise IT tasks, starting with Site Reliability Engineering.
- All frontier models score below 50%. Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%; GPT-5.5 (xhigh) at 46%; Qwen3.7 Max at 42%.
- Turn counts vary nearly 3x but longer trajectories don’t translate to higher accuracy — GPT-5.5 averages 31 turns at 46% while Gemini 3.1 Pro Preview averages 83 turns at 30%.
- GLM-5.1 leads open-weight models at 40%, tied with Gemini 3.5 Flash. DeepSeek V4 Pro at 38%; Gemma 4 31B at 37%.
What Happened
Artificial Analysis and IBM’s Software Innovation Lab launched ITBench-AA on Wednesday — the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks. The launch benchmark covers Site Reliability Engineering (SRE) work, where models must diagnose live Kubernetes incidents by reading logs, tracing dependencies, and identifying root-cause entities. All tested frontier models scored below 50%.
Why It Matters
Agentic enterprise-IT tasks are one of the most operationally consequential AI deployment surfaces. SRE work — diagnosing production incidents at companies running Kubernetes at scale — typically requires senior engineers, runs on tight timelines (incidents cost money per minute of downtime), and demands precise causal reasoning across many signals. ITBench-AA is among the least-saturated agentic benchmarks: for context, frontier models score considerably higher on the more general Terminal-Bench.
The benchmark’s structural design — 40 public tasks plus 19 brand-new held-out tasks — provides resistance to training-data contamination. Artificial Analysis worked closely with IBM over the past 6 months to adapt IBM’s underlying ITBench dataset for frontier-AI evaluation. Future expansions will cover Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks.
Technical Details
ITBench-AA SRE consists of 59 total tasks. Each task provides a Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology. The model must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident. Fault types span infrastructure, service, application, and chaos-injected incidents — resource-quota exhaustion, rollout failures, connection drops, and similar.
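To make the task format concrete, here is a minimal sketch of how a predicted set of root-cause entities could be compared against a ground-truth set. The article does not describe ITBench-AA's actual scoring rule, so the precision/recall/exact-match scheme below is an assumption for illustration only. The `Entity` type, the `score_prediction` function, and the example entity names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A Kubernetes object identified by kind, namespace, and name (hypothetical schema)."""
    kind: str
    namespace: str
    name: str


def score_prediction(predicted, truth):
    """Compare predicted root-cause entities with the ground-truth set.

    Assumed scoring, not the benchmark's documented metric: set-based
    precision and recall, plus a strict exact-match flag.
    """
    predicted, truth = set(predicted), set(truth)
    hits = predicted & truth
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(truth) if truth else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "exact_match": predicted == truth,
    }


# Hypothetical incident: a resource quota is the single root cause.
truth = {Entity("ResourceQuota", "shop", "compute-quota")}

# The model names the quota but also blames a Deployment that is only a symptom.
predicted = truth | {Entity("Deployment", "shop", "checkout")}

result = score_prediction(predicted, truth)
print(result)  # {'precision': 0.5, 'recall': 1.0, 'exact_match': False}
```

Under this assumed scheme, naming an extra symptom alongside the true cause halves precision and fails exact match, which is why a requirement for a minimal set penalizes over-reporting.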
The full leaderboard ordering: Claude Opus 4.7 at 47% (proprietary leader), GPT-5.5 (xhigh) at 46%, Qwen3.7 Max at 42%, GLM-5.1 Reasoning at 40% (open-weight leader, tied with Gemini 3.5 Flash (high)), DeepSeek V4 Pro Reasoning Max Effort at 38%, Gemma 4 31B Reasoning at 37%, Gemini 3.1 Pro Preview at 30%. The over-investigation pattern, with Gemini 3.1 Pro Preview using 83 turns to GPT-5.5’s 31, produces false positives: models that over-investigate tend to report upstream fault-injection mechanisms or co-occurring symptoms as root causes instead of the entity actually at fault.
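The arithmetic behind the turn-count comparison can be checked with a short sketch. The scores and the two average turn counts are the figures quoted above; the variable names are illustrative, and turn counts for the other models are not given in this article, so they are left out.

```python
# Accuracy (%) as reported in the article.
scores = {
    "Claude Opus 4.7": 47,
    "GPT-5.5 (xhigh)": 46,
    "Qwen3.7 Max": 42,
    "GLM-5.1 Reasoning": 40,
    "Gemini 3.5 Flash (high)": 40,
    "DeepSeek V4 Pro Reasoning": 38,
    "Gemma 4 31B Reasoning": 37,
    "Gemini 3.1 Pro Preview": 30,
}

# Average turns per task; the article reports only these two values.
turns = {
    "GPT-5.5 (xhigh)": 31,
    "Gemini 3.1 Pro Preview": 83,
}

# Rank models by score, highest first (ties keep insertion order).
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# How many times more turns Gemini 3.1 Pro Preview uses than GPT-5.5.
turn_ratio = turns["Gemini 3.1 Pro Preview"] / turns["GPT-5.5 (xhigh)"]

# Accuracy difference in percentage points, in favor of the shorter trajectories.
accuracy_gap = scores["GPT-5.5 (xhigh)"] - scores["Gemini 3.1 Pro Preview"]

for model, score in ranked:
    print(f"{model:28s} {score}%")
print(f"turn ratio: {turn_ratio:.2f}x, accuracy gap: {accuracy_gap} points")
```

With these figures, the longer trajectories use roughly 2.7 times as many turns while scoring 16 percentage points lower.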
Who’s Affected
Anthropic gains an external benchmark validating Claude Opus 4.7’s positioning as the leader on enterprise agentic tasks. OpenAI faces a 1-point gap on a benchmark designed by independent third parties. Open-weight model providers, including Zhipu (GLM-5.1), Alibaba (Qwen3.7), DeepSeek (V4 Pro), and Google (Gemma 4), gain a point of comparison showing that the open-weight gap is narrower than other benchmarks suggest. Enterprise SRE teams considering AI augmentation gain empirical data on capability ceilings: none of the frontier models is reliable enough for autonomous Kubernetes incident response. IBM gains a research-output credit from the dataset development.
What’s Next
Artificial Analysis and IBM plan to expand ITBench-AA into FinOps and CISO task domains. The benchmark’s sub-50% scores suggest substantial headroom for capability development. Anthropic, OpenAI, and the open-weight ecosystem will likely use ITBench-AA SRE as a target benchmark for the next model-release cycle. Independent enterprise-SRE adoption studies will be among the most-watched downstream data points through the second half of 2026.