- The ARC Prize Foundation analyzed 160 game runs of OpenAI’s GPT-5.5 and Anthropic’s Opus 4.7 on ARC-AGI-3, the interactive benchmark released in late March 2026.
- GPT-5.5 scored 0.43 percent at roughly $10,000 in compute; Opus 4.7 scored 0.18 percent. Both models remain below 1 percent on a benchmark that humans solve from scratch.
- Three systematic error patterns emerged: failure to build world models from observed mechanics; false analogies to familiar games in training data; and treating lucky wins as confirmation of incorrect theories.
- Behavioral profiles diverge: Opus 4.7 locks onto wrong rules early; GPT-5.5 struggles to commit to correct ones once found.
What Happened
The ARC Prize Foundation published an analysis of 160 reasoning traces from OpenAI’s GPT-5.5 and Anthropic’s Claude Opus 4.7 on the ARC-AGI-3 benchmark on May 2, 2026. The benchmark, released in late March 2026, evaluates AI agents in interactive turn-based game environments where the model must explore, form hypotheses, and execute action plans without instructions. Every frontier model tested so far has scored below 1 percent against human baselines.
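To make the task format concrete, here is a minimal sketch of the explore, hypothesize, act loop such an environment demands. The ToyEnv class and its single hidden mechanic are invented stand-ins; the actual ARC-AGI-3 API and action space are not documented here.

```python
"""Sketch of the explore / hypothesize / act loop an ARC-AGI-3-style
environment requires. ToyEnv is an invented stand-in, not the real API."""


class ToyEnv:
    """Invented turn-based environment: one hidden rule, no instructions."""

    ACTIONS = ["ACTION1", "ACTION2", "ACTION3"]

    def __init__(self):
        self.state = 0
        self.target = 3

    def step(self, action: str) -> int:
        # Hidden mechanic the agent must discover: only ACTION2 advances state.
        if action == "ACTION2":
            self.state += 1
        return self.state

    def solved(self) -> bool:
        return self.state >= self.target


def run_agent(env: ToyEnv, budget: int = 30) -> bool:
    effects: dict[str, int] = {}
    # Explore: probe each action once and record its observed local effect.
    for action in env.ACTIONS:
        before = env.state
        after = env.step(action)
        effects[action] = after - before
    # Hypothesize: prefer the action with the largest observed effect.
    best = max(effects, key=effects.get)
    # Act: commit to the hypothesis and execute until solved or out of budget.
    for _ in range(budget):
        if env.solved():
            return True
        env.step(best)
    return env.solved()


if __name__ == "__main__":
    print("solved:", run_agent(ToyEnv()))
```

The hard part for the models, as the analysis below shows, is not any single step of this loop but keeping the hypotheses honest across steps.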
Why It Matters
Most AI benchmarks report only pass-fail aggregate results. ARC-AGI-3’s design — recording every step of the model’s reasoning — lets researchers identify exactly where models fail rather than only that they fail. The Foundation’s analysis is the first detailed public characterization of how frontier reasoning models break on novel interactive tasks. The findings suggest the gap between current models and the kind of general intelligence ARC measures is not a matter of compute or fine-tuning but of architectural reasoning patterns that persist across labs.
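As a toy illustration of the difference, a pass-fail metric collapses a run to one bit, while step-level traces can be tagged and tallied by failure type. The record format and pattern labels below are invented for illustration; the Foundation's actual trace schema is not reproduced in this article.

```python
"""Why step-level traces beat pass-fail scores: per-step records can be
tagged with *where* a run went wrong. Trace format and labels are invented."""

from collections import Counter

# Hypothetical trace records: (game_id, step, note) tuples.
traces = [
    ("cd82", 4, "observed: ACTION3 rotates container"),
    ("cd82", 6, "observed: ACTION5 pours paint"),
    ("ls20", 2, "analogy: looks like Breakout"),
    ("ka59", 37, "win attributed to teleportation theory"),
]

# Invented labels loosely matching the error patterns described below.
PATTERNS = {
    "false_analogy": lambda note: note.startswith("analogy:"),
    "unverified_win": lambda note: "attributed" in note,
}

counts = Counter()
for _game, _step, note in traces:
    for label, matches in PATTERNS.items():
        if matches(note):
            counts[label] += 1

# A pass-fail metric reports one number; the tally says where runs break,
# e.g. Counter({'false_analogy': 1, 'unverified_win': 1})
print(counts)
```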
Technical Details
GPT-5.5 scored 0.43 percent at a total compute cost of roughly $10,000; Opus 4.7 reached 0.18 percent. The Foundation identified three recurring error patterns:
1. Local effects, missed world model. Models recognize specific cause-effect relationships but fail to assemble them into a coherent rule set (a sketch of the missing assembly step follows this list). In game cd82, Opus 4.7 knew by step 4 that ACTION3 rotates a container and by step 6 that ACTION5 pours paint, but never connected the two into the realization that the bucket had to be aligned before dipping to reproduce the target image.
2. False analogies from training data. Models repeatedly mistook unknown mechanics for Tetris, Frogger, Sokoban, Breakout, Pong, or Boulder Dash. GPT-5.5 interpreted the ls20 environment as Breakout when it was actually about key combinations: “Then again, it could be more like ‘Breakout,’ with bricks at the top and a paddle. The central object might be the ball,” it wrote in its trace, killing any chance of progress.
3. Lucky wins reinforce wrong theories. When a model solves a level, it does not check why the solution worked. In ka59, Opus 4.7 solved level 1 in 37 actions based on a false teleportation theory; the level’s simple structure happened to let the wrong mechanics succeed, and the model carried the misconception into level 2, where it failed.
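Patterns 1 and 3 share a remedy that classical agent designs make explicit: keep one world model and test it. Below is a minimal sketch under that framing, using the cd82 example; the WorldModel class, its observe/explains methods, and the rule labels are all invented for illustration, not anything from the Foundation's analysis.

```python
"""Minimal sketch of the two missing habits: (1) fold local cause-effect
observations into one explicit world model, and (2) accept a win as
evidence only if the model predicts each step. All names are invented."""

from dataclasses import dataclass, field


@dataclass
class WorldModel:
    # action -> (precondition, effect); rule strings are illustrative labels
    rules: dict = field(default_factory=dict)

    def observe(self, action: str, precondition: str, effect: str) -> None:
        """Pattern 1 fix: every local observation joins one shared model."""
        self.rules[action] = (precondition, effect)

    def explains(self, plan: list[str], start_facts: set[str]) -> bool:
        """Pattern 3 fix: a win counts as evidence only if the model
        predicts each step, not merely because the level ended."""
        facts = set(start_facts)
        for action in plan:
            if action not in self.rules:
                return False              # no rule covers this step
            precondition, effect = self.rules[action]
            if precondition not in facts:
                return False              # step fired without its setup
            facts.add(effect)
        return True


model = WorldModel()
# The two local effects Opus 4.7 found in cd82, recorded as rules:
model.observe("ACTION3", "container present", "bucket aligned")
model.observe("ACTION5", "bucket aligned", "paint poured")

# The composite plan it never assembled: align first, then pour.
print(model.explains(["ACTION3", "ACTION5"], {"container present"}))  # True
print(model.explains(["ACTION5"], {"container present"}))             # False
```

The point is not the toy code but the discipline it encodes: an observation store that is consulted before acting, and a win check that is able to say no.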
Behavioral profiles differ between the two models. Opus 4.7 picks up mechanics earlier — on ar25 it identified the mirror structure almost immediately and solved level 1 — but locks onto false rules and refuses to revise them. GPT-5.5 is the inverse: it identifies correct rules but then fails to commit to them, drifting into hallucinated alternatives.
Who’s Affected
OpenAI and Anthropic gain a detailed external diagnostic of where their frontier reasoning models break, useful for next-generation training. Researchers building agentic systems for production (coding assistants, scientific research agents, robotics policies) face a documented warning that current models can produce confident-looking action plans grounded in incorrect world models. Benchmark authors will likely adopt the Foundation’s diagnostic methodology, since it produces more actionable signal than single-number pass-fail metrics.
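For teams shipping such systems, that warning translates into a defensive pattern: treat a model-generated plan as a hypothesis and validate it independently before execution. The guarded_execute function and its simulate() hook below are hypothetical; simulate() stands in for whatever independent check a deployment has, such as a sandbox dry-run, a test suite, or a domain simulator.

```python
"""Defensive pattern suggested by the finding: a confident-looking plan
may rest on a wrong world model, so gate execution on an independent
check. All names here are invented for illustration."""

from typing import Callable


def guarded_execute(plan: list[str],
                    simulate: Callable[[list[str]], bool],
                    execute: Callable[[str], None]) -> bool:
    """Execute a plan only after an independent check accepts it."""
    if not simulate(plan):
        # Reject the plan here rather than discover the mismatch
        # between plan and reality in production.
        return False
    for action in plan:
        execute(action)
    return True


# Toy usage: this checker only accepts plans that end with "COMMIT".
ok = guarded_execute(["PREPARE", "COMMIT"],
                     simulate=lambda p: bool(p) and p[-1] == "COMMIT",
                     execute=print)
print("executed:", ok)
```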
What’s Next
The ARC Prize Foundation has not announced when it will release further analyses, but the publicly available reasoning-trace dataset enables independent research on the three error patterns. Watch for whether the next OpenAI and Anthropic frontier model releases, likely in summer 2026, show measurable improvement on ARC-AGI-3, and whether labs publish targeted training mitigations addressing world-model coherence and resistance to false analogies from training data.