ARC-AGI-3: Every Frontier Model Scores Under 1%

megaone_admin · Mar 31, 2026 · 2 min read
Engine Score 7/10 — Important
The ARC Prize Foundation launched ARC-AGI-3 on March 25, 2026, with over $2 million in prizes for any AI system that can match untrained human performance on abstract reasoning tasks. The results from frontier models are striking: Gemini 3.1 Pro scored 0.37%, GPT-5.4 scored 0.26%, Claude Opus 4.6 scored 0.25%, and Grok 4.20 scored 0.00%. Untrained humans score 100%.

What ARC-AGI-3 Tests

Unlike benchmarks that test knowledge retrieval or pattern matching against training data, ARC-AGI-3 is the first fully interactive benchmark in the series. It requires AI agents to explore environments, form hypotheses, figure out objectives with zero instructions, and execute multi-step plans. The tasks test genuine abstraction and reasoning — capabilities that current models simulate through pattern matching but do not actually possess.
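To make the explore-hypothesize-execute loop concrete, here is a minimal sketch of what an interactive agent faces. This is not the ARC-AGI-3 API; the `Environment` class and its hidden rule are hypothetical stand-ins for a task that gives no instructions, so the agent must infer from observations alone which actions make progress.

```python
# Illustrative sketch only -- not the real ARC-AGI-3 interface.
# The hidden rule: repeating "advance" reaches the goal at a threshold.
from dataclasses import dataclass


@dataclass
class Environment:
    """Toy stand-in for an interactive task with an unstated objective."""
    steps: int = 0
    threshold: int = 3

    def act(self, action: str) -> tuple[str, bool]:
        # The agent is never told what "advance" does; it must discover it.
        if action == "advance":
            self.steps += 1
        done = self.steps >= self.threshold
        return f"state:{self.steps}", done


def explore(env: Environment, actions: list[str], budget: int = 10) -> int:
    """Probe actions, keep whichever changes the observation, repeat it."""
    last_obs, _ = env.act("noop")
    useful: str | None = None
    for step in range(1, budget + 1):
        action = useful or actions[step % len(actions)]
        obs, done = env.act(action)
        if obs != last_obs:
            useful = action  # hypothesis: this action makes progress
        last_obs = obs
        if done:
            return step
    return -1  # budget exhausted without solving the task


steps_taken = explore(Environment(), ["noop", "advance"])
```

Even this trivial loop relies on forming and testing a hypothesis about an unstated rule; ARC-AGI-3 tasks demand the same behavior over far richer state spaces, which is where current models break down.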

The sub-1% scores are not a failure of engineering. They reveal a fundamental architectural limitation: transformer-based models excel at interpolation within their training distribution but struggle with the kind of novel reasoning that humans perform effortlessly. A child can look at a visual pattern puzzle and deduce the rule after two examples. The most capable AI models in existence cannot reliably do this even once.

The AGI Reality Check

ARC-AGI-3 arrives at a moment when AI companies are making increasingly bold claims about approaching artificial general intelligence. OpenAI’s internal planning documents reference AGI timelines. MegaOne AI’s leaderboard shows models scoring above 90% on standard benchmarks — creating the impression that human-level AI is imminent.

The ARC-AGI-3 results demolish that impression. The gap between 0.26% and 100% is not a gap that scaling alone can close. The ARC Prize 2026 competition runs across three tracks on Kaggle from March through November. If no system comes close to human performance by November, it will be the strongest empirical evidence yet that current AI architectures, regardless of scale, are missing something fundamental about intelligence.

MegaOne AI Editorial Team

MegaOne AI monitors 200+ sources daily to identify and score the most important AI developments. Every story is fact-checked, linked to primary sources, and rated using our six-factor Engine Score methodology.
