The ARC Prize Foundation launched ARC-AGI-3 on March 25, 2026, with over $2 million in prizes for any AI system that can match untrained human performance on abstract reasoning tasks. The results from frontier models are striking: Gemini 3.1 Pro scored 0.37%, GPT-5.4 scored 0.26%, Claude Opus 4.6 scored 0.25%, and Grok 4.20 scored 0.00%. Untrained humans score 100%.
What ARC-AGI-3 Tests
Unlike benchmarks that test knowledge retrieval or pattern matching against training data, ARC-AGI-3 is the first fully interactive benchmark in the series. It requires AI agents to explore environments, form hypotheses, figure out objectives with zero instructions, and execute multi-step plans. The tasks test genuine abstraction and reasoning — capabilities that current models simulate through pattern matching but do not actually possess.
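To make the "interactive" framing concrete, here is a minimal sketch of the kind of agent-environment loop such a benchmark implies. This is not the ARC-AGI-3 API: the ToyGridEnvironment, its action names, and the random_agent baseline are all hypothetical, meant only to illustrate an agent that must discover both the rules and the objective through its own actions, with no instructions and only a success signal as feedback.

```python
import random

class ToyGridEnvironment:
    """Hypothetical stand-in for an interactive task: the agent receives only
    a grid observation and a success signal, never the objective itself."""

    def __init__(self, size=4):
        self.size = size
        self.target = (random.randrange(size), random.randrange(size))
        self.position = (0, 0)

    def observe(self):
        # The agent sees the grid dimensions and its own position, not the goal.
        return {"grid_size": self.size, "position": self.position}

    def act(self, action):
        # Actions are opaque labels; the agent must learn what each one does.
        dx, dy = {"up": (0, -1), "down": (0, 1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        x, y = self.position
        self.position = (min(max(x + dx, 0), self.size - 1),
                         min(max(y + dy, 0), self.size - 1))
        solved = self.position == self.target
        return self.observe(), solved


def random_agent(env, max_steps=50):
    """Baseline that explores by blind trial and error, mirroring the
    'zero instructions' setting the benchmark describes."""
    for _ in range(max_steps):
        _, solved = env.act(random.choice(["up", "down", "left", "right"]))
        if solved:
            return True
    return False


if __name__ == "__main__":
    wins = sum(random_agent(ToyGridEnvironment()) for _ in range(100))
    print(f"Solved {wins}/100 episodes by blind exploration")
```

The point of the toy baseline is that aimless exploration rarely succeeds; an agent has to form and test hypotheses about what the environment rewards, which is exactly the capability the benchmark is probing.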
The sub-1% scores are not a failure of engineering. They reveal a fundamental architectural limitation: transformer-based models excel at interpolation within their training distribution but struggle with the kind of novel reasoning that humans perform effortlessly. A child can look at a visual pattern puzzle and deduce the rule after two examples. The most capable AI models in existence cannot do this reliably.
The AGI Reality Check
ARC-AGI-3 arrives at a moment when AI companies are making increasingly bold claims about approaching artificial general intelligence. OpenAI’s internal planning documents reference AGI timelines. MegaOne AI’s leaderboard shows models scoring above 90% on standard benchmarks — creating the impression that human-level AI is imminent.
The ARC-AGI-3 results demolish that impression. The gap between 0.26% and 100% is not one that scaling alone can close. The ARC Prize 2026 competition runs across three tracks on Kaggle from March through November. If no system comes close to human performance by November, it will be the strongest empirical evidence yet that current AI architectures, regardless of scale, are missing something fundamental about intelligence.
