
ARC-AGI-3: Every Frontier Model Scores Under 1%

Zara Mitchell · Mar 31, 2026 · Updated Apr 7, 2026 · 4 min read
Engine Score 8/10 — Important

ARC-AGI-3, on which every frontier model scores under 1% while humans score 100%, is a critical benchmark of AGI progress.

  • ARC-AGI-3, released on March 25, 2026, is the first fully interactive benchmark in the ARC-AGI series, featuring hundreds of handcrafted turn-based games with thousands of levels and no instructions.
  • Frontier AI models scored just 0.26% on the benchmark while humans scored 100%, exposing a fundamental gap in adaptive reasoning capabilities.
  • ARC Prize 2026 offers over $2 million in prizes across two Kaggle competitions for open-source solutions to both ARC-AGI-2 and ARC-AGI-3.
  • The benchmark was created by Francois Chollet and announced at Y Combinator HQ with a fireside chat between Chollet and OpenAI CEO Sam Altman.

What Happened

The ARC Prize Foundation released ARC-AGI-3 on March 25, 2026, introducing a fundamentally new format that replaces the static grid puzzles of previous versions with interactive, turn-based game environments. The benchmark consists of hundreds of original environments, each handcrafted by a team of human game designers. There are no instructions, no rules, and no stated goals. To succeed, an AI agent must explore each environment on its own, figure out how it works, discover what winning looks like, and carry what it learns forward across increasingly difficult levels.

Greg Kamradt, who authored the announcement post, reported that every frontier model tested scored under 1%. The highest AI performance reached just 0.26%, while human participants scored 100%. The launch took place at Y Combinator headquarters in San Francisco, featuring a fireside conversation between Francois Chollet, the creator of ARC-AGI, and Sam Altman, CEO of OpenAI, on the topic of measuring intelligence on the path to AGI.

Why It Matters

Previous ARC-AGI benchmarks have accurately tracked real breakthroughs in AI capability. ARC-AGI-1 predicted the emergence of reasoning-focused models before they became mainstream. ARC-AGI-2 tracked the impact of coding agents on structured problem-solving tasks. ARC-AGI-3 now targets a capability that current AI systems demonstrably lack: the ability to explore an unfamiliar environment, infer rules without being told, and adapt behavior based on discovered patterns without any prior training on similar tasks.

The near-zero scores from frontier models suggest that scaling language models larger and adding chain-of-thought reasoning prompts have not produced systems capable of genuine open-ended learning. This makes ARC-AGI-3 a direct and measurable test of the gap between AI that can follow complex instructions and AI that can autonomously explore, hypothesize, and adapt in novel situations.

The benchmark also highlights a disconnect between industry claims about approaching AGI and measurable performance on tasks that require basic human-like adaptability. A 0.26% score on tasks that every tested human completed perfectly raises pointed questions about what current AI systems are actually learning during training.

Technical Details

ARC-AGI-3 environments are turn-based interactive games. An agent takes actions within a grid-based world, observes the results of each action, and must form and test hypotheses about how the environment operates. Each game contains multiple levels of increasing difficulty, requiring the agent to transfer learned concepts from easier levels to harder ones. The agent receives no documentation, labels, instruction text, or explicit reward signals beyond what it can observe through direct interaction.
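To make that interaction loop concrete, here is a minimal sketch of what playing one such environment might look like. Everything in it is hypothetical: the ToyGridGame class stands in for a handcrafted level, and the reset/step interface follows a generic gym-style convention, not the ARC Prize Foundation's actual agent API.

```python
# Hypothetical sketch of an ARC-AGI-3-style interaction loop.
# ToyGridGame is a stand-in level; the reset/step interface is an assumed,
# gym-style convention, not the real ARC-AGI-3 API.
from dataclasses import dataclass
import random

@dataclass
class ToyGridGame:
    """A stand-in interactive level: reach a hidden goal cell.

    The agent is told nothing about the rules; after each action it only
    sees the grid and a flag indicating whether the level is complete.
    """
    size: int = 5
    goal: tuple = (4, 4)
    pos: tuple = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return self.observe()

    def observe(self):
        # The observation is just a grid; no labels, rewards, or instructions.
        grid = [[0] * self.size for _ in range(self.size)]
        grid[self.pos[1]][self.pos[0]] = 1
        return grid

    def step(self, action):
        dx, dy = {"up": (0, -1), "down": (0, 1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        # Return the new observation and whether the undisclosed win
        # condition has been met -- nothing else.
        return self.observe(), self.pos == self.goal

def explore(env, max_turns=2000):
    """Random exploration that records observed transitions --
    the raw material for hypotheses about how the environment works."""
    actions = ["up", "down", "left", "right"]
    transitions = {}  # (state, action) -> next state, learned from experience
    obs = env.reset()
    for turn in range(max_turns):
        state = str(obs)
        action = random.choice(actions)
        obs, done = env.step(action)
        transitions[(state, action)] = str(obs)
        if done:
            return turn + 1, transitions
    return None, transitions

turns, learned = explore(ToyGridGame())
if turns is None:
    print(f"did not solve the level; recorded {len(learned)} transitions")
else:
    print(f"solved in {turns} turns; recorded {len(learned)} transitions")
```

A real submission would replace the random policy with a system that proposes and tests hypotheses about the rules; the loop itself (observe, act, and update a model of the world with no instructions or external reward) is the capability the benchmark is probing.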

The benchmark was specifically designed to resist memorization and prompt engineering. Because each environment has unique mechanics created by human designers, a model cannot rely on patterns it encountered during pretraining. This separates genuine reasoning from sophisticated pattern matching. A detailed technical paper describing the full design methodology, scoring criteria, and evaluation protocol is available for download on the ARC Prize website.

The two Kaggle competitions accept submissions through the end of 2026. The ARC-AGI-3 competition is structured as a new kind of agent-based challenge where submitted systems must play the games in real time. The ARC-AGI-2 Grand Prize guarantees an award to the best-performing open-source submission regardless of whether it exceeds a fixed threshold.

Who’s Affected

AI researchers working on reasoning, planning, reinforcement learning, and program synthesis are the primary audience. Labs building frontier models, including OpenAI, Anthropic, Google DeepMind, and Meta AI, now have a public benchmark that quantifies capabilities their current architectures do not possess. The $2 million prize pool is intended to incentivize open-source research rather than proprietary model development.

The benchmark also matters to AI investors and policymakers who rely on benchmark scores to assess the state of the field. ARC-AGI-3’s results provide a concrete counterpoint to narratives suggesting that current systems are close to human-level general intelligence.

What’s Next

The ARC Prize Foundation has positioned ARC-AGI-3 as a multi-year benchmark designed to track whether new architectures, training methods, or hybrid approaches can close the gap between human and machine performance on open-ended reasoning tasks. Kaggle submissions remain open through the end of 2026. The foundation has not announced plans for an ARC-AGI-4, but the interactive game framework is extensible to more complex environments and longer-horizon planning challenges.
