- François Chollet and co-authors submitted ARC-AGI-3 to arXiv on March 24, 2026, introducing a turn-based interactive benchmark for agentic AI systems that must infer goals and plan actions without explicit instructions.
- As of March 2026, frontier AI systems scored below 1% on ARC-AGI-3, while human test-takers achieved a 100% solve rate across all environments.
- The benchmark evaluates fluid adaptive efficiency using only Core Knowledge priors, deliberately excluding language and external knowledge to eliminate shortcut learning pathways.
- Scoring is grounded in human action baselines and measures per-task efficiency rather than binary completion, rewarding agents that solve environments in fewer steps.
What Happened
Researcher François Chollet and colleagues submitted “ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence” to arXiv on March 24, 2026 (arXiv:2603.24621), with a revised version released April 17, 2026. The paper introduces an interactive benchmark in which AI agents navigate novel, abstract, turn-based environments with no explicit instructions. As of March 2026, every tested frontier system scored below 1%, while every human test-taker solved every environment.
Why It Matters
ARC-AGI-3 builds directly on its predecessors: ARC-AGI-1 revealed a large performance gap between humans and AI on abstract visual reasoning tasks, and ARC-AGI-2 raised the difficulty ceiling further. The third iteration shifts the challenge to multi-turn agentic settings, addressing a specific gap in existing evaluations that measure single-turn query performance rather than sustained exploration and planning across extended interaction sequences.
Technical Details
Each ARC-AGI-3 environment relies exclusively on Core Knowledge priors — innate cognitive systems such as object permanence, basic spatial reasoning, and elementary causality — while excluding language and external knowledge to block shortcut learning. Agents must complete four linked subtasks within each environment: exploration, goal inference, internal model construction of environment dynamics, and multi-step action planning, all without receiving an explicit goal description.
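The four subtasks imply a turn-based interaction loop: act, observe, refine an internal model, and stop once the inferred goal is reached. A minimal sketch of that loop in Python, where `ToyEnv`, `RandomAgent`, and their methods are hypothetical stand-ins, not the benchmark's actual interface:

```python
# Illustrative turn-based loop: the agent receives no goal description
# and must discover it through interaction. All names here are
# hypothetical, not the benchmark's real API.
import random

class ToyEnv:
    """A 1-D grid whose unstated goal is to reach position 3."""
    def reset(self):
        self.pos = 0
        return self.pos                 # initial observation only
    def step(self, action):             # action: -1 or +1
        self.pos = max(0, self.pos + action)
        return self.pos, self.pos == 3  # (observation, done flag)

class RandomAgent:
    """Explores via random actions; a real agent would also build an
    internal model of the dynamics and plan multi-step sequences."""
    def act(self, obs):
        return random.choice([-1, 1])
    def update(self, obs, action):
        pass                            # model construction would go here

def run_episode(env, agent, max_steps=1000):
    obs = env.reset()
    for step in range(1, max_steps + 1):
        action = agent.act(obs)         # exploration / planning decision
        obs, done = env.step(action)
        agent.update(obs, action)       # refine internal model
        if done:
            return step                 # steps taken to solve
    return None                         # unsolved within the budget
```

The step count returned here is what an efficiency-based score would be computed from, since completion alone is not the metric.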
The scoring framework is efficiency-based and calibrated against human action baselines: an agent completing a task in significantly more steps than a human receives a proportionally lower score. The authors describe the benchmark as designed to measure “fluid adaptive efficiency on novel tasks, while avoiding language and external knowledge” — distinguishing it from evaluations that can be gamed through memorization or retrieval. Difficulty was calibrated through extensive human testing, producing the 100% human solve rate reported in the paper. As of March 2026, no frontier AI system scored above 1% on the full benchmark.
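The efficiency idea can be illustrated with a simple ratio of the human baseline step count to the agent's step count. This is a sketch of the general principle only, assuming a proportional formula; the paper's exact scoring function may differ:

```python
def efficiency_score(agent_steps, human_steps):
    """Illustrative per-task efficiency score: full credit for matching
    or beating the human baseline, proportionally less for extra steps.
    Assumes a simple capped ratio; not the paper's actual formula."""
    if agent_steps is None:             # task left unsolved
        return 0.0
    return min(1.0, human_steps / agent_steps)

# An agent needing 40 steps where humans need 20 scores 0.5;
# solving in fewer steps than the baseline is capped at 1.0.
```

Under this framing, memorization offers no advantage: a high score requires actually navigating the environment in roughly as few actions as a human would.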
Who’s Affected
The benchmark is most directly relevant to research organizations with active agentic AI programs, including Google DeepMind, Anthropic, OpenAI, and Meta AI, where multi-step planning and tool-use systems are current development priorities. Developers building on long-context models and agent scaffolding will find that ARC-AGI-3 provides a concrete measurement of the remaining capability gap: existing agentic architectures, including those using chain-of-thought reasoning and tool-use pipelines, do not approach the human baseline on these tasks.
The below-1% result holds across frontier systems broadly, and the paper documents its environment construction and validation methodology in enough detail for organizations to independently evaluate their own systems.
What’s Next
The v2 revision published April 17, 2026 includes the full benchmark design, the efficiency-based scoring framework grounded in human action baselines, and the environment construction and calibration methodology. No publicly reported frontier AI score above 1% had been recorded as of that date. Independent evaluation is expected as labs incorporate ARC-AGI-3 alongside other standard capability benchmarks in 2026.