GPT-5.4, the most capable large language model OpenAI has shipped to date, now scores above 50% on Humanity’s Last Exam — a benchmark so hard it once stumped frontier models at single-digit percentages. IEEE Spectrum reported on April 15, 2026, that the same model scores just 50% on a task a ten-year-old handles in seconds: reading an analog clock. Claude Opus 4.6, Anthropic’s most powerful model, performs even worse — just 8.9% accuracy on the same benchmark.
This is not a quirk. It is a structural feature of how large language models are built, and it has a name: jagged intelligence.
What the AI Clock Reading Benchmark Actually Measures
The clock-reading task is deceptively simple: show a model an image of an analog clock and ask it to report the time. No trick angles, no ambiguous hand positions — standard clock faces that appear in elementary school math curricula worldwide.
GPT-5.4 answers correctly half the time. Claude Opus 4.6, Anthropic’s flagship reasoning model priced at $15 per million input tokens, manages 8.9%. A random guess among 12 possible hours yields approximately 8.3% accuracy. Claude Opus 4.6 is barely outperforming random selection on a task humans master before age eight.
The benchmark highlights a specific failure mode: spatial reasoning applied to visual inputs. Large language models process images through vision encoders trained on labeled data — captions, alt text, described images. The model learns “clock” as a concept but never develops the spatial parsing required to read where hands point relative to each other in pixel space.
Claude Opus 4.6’s 8.9%: Near Statistical Random Chance
Random guessing on a 12-hour clock face yields approximately 8.3%, assuming a uniform distribution of query times. Claude Opus 4.6 is performing at noise levels above chance — which means it is failing this task in every meaningful sense.
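The margin over chance is small enough to verify by hand. A quick check of the arithmetic, using the figures reported above:

```python
# Chance baseline for guessing the hour on a 12-hour dial,
# assuming query times are uniformly distributed across hours.
chance = 1 / 12                  # ≈ 0.0833
claude_accuracy = 0.089          # reported Claude Opus 4.6 score
margin = claude_accuracy - chance

print(f"chance baseline: {chance:.4f}")    # 0.0833
print(f"margin over chance: {margin:.4f}")  # 0.0057
```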
Anthropic has positioned Claude as the model built for reliability and safety-critical reasoning. The company’s Constitutional AI approach emphasizes calibrated, careful outputs. On clock reading, that calibration produces answers that are wrong 91.1% of the time.
This does not mean Claude Opus 4.6 is a weak model overall. On complex multi-step reasoning, legal analysis, and scientific derivation, it performs at or near the frontier. That is exactly the point of jagged intelligence: the gaps are not uniform, and they are not predictable from headline benchmark scores.
Humanity’s Last Exam: The Flip Side of the Jagged Profile
Humanity’s Last Exam (HLE) was created in 2024 as an AI-proof benchmark — 3,000 expert-level questions spanning mathematics, science, humanities, and professional domains. Early frontier models scored below 9%. As of April 2026, the best models exceed 50%.
That 41-percentage-point jump in roughly 14 months is without precedent in AI benchmark history. Models scoring above 50% on HLE work through multi-step graduate-level physics derivations, interpret complex legal reasoning chains, and solve advanced mathematical proofs. These are tasks that take humans years of specialized training.
The clock problem inverts this entirely. No specialized training required. No domain expertise. Just spatial pattern recognition that children develop passively by observing the world, and that frontier AI models largely cannot replicate.
Why Text Training Fails Spatial Tasks
Large language models are trained primarily on text. Even multimodal models that process images convert visual inputs into token representations that the language model component then reasons over. The core architecture — transformers operating on sequential token streams — is optimized for pattern recognition in discrete symbol sequences, not continuous spatial relationships.
Reading an analog clock requires:
- Identifying the angular position of each hand relative to 12 o’clock
- Distinguishing hour from minute hands by relative length
- Estimating the proportional position of hands between hour markers
- Mapping that angular position to a specific time value
Text descriptions of clocks don’t preserve spatial information accurately. When training data includes the phrase “a clock showing 3:45,” the model learns a text-to-concept association — not the pixel-level spatial parsing required to derive 3:45 from an image. MegaOne AI tracks spatial reasoning as a distinct capability axis across 139+ AI tools in its database, because text-heavy benchmarks systematically overstate model capability on vision-native tasks.
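To make the gap concrete: the final step in the list above, mapping hand angles to a value like 3:45, is trivial arithmetic once the angles are known. Everything upstream of it, recovering those angles from pixels, is what the models lack. A minimal sketch of the easy half (the function name and angle convention are illustrative, not drawn from any model’s pipeline):

```python
def angles_to_time(hour_angle_deg: float, minute_angle_deg: float) -> str:
    """Map hand angles (measured clockwise from 12 o'clock) to an H:MM string.

    The minute hand sweeps 6 degrees per minute; the hour hand sweeps
    30 degrees per hour plus 0.5 degrees for each elapsed minute.
    """
    minute = round(minute_angle_deg / 6) % 60
    # Remove the hour hand's intra-hour drift before dividing by 30.
    hour = round((hour_angle_deg - minute * 0.5) / 30) % 12
    return f"{12 if hour == 0 else hour}:{minute:02d}"

print(angles_to_time(112.5, 270.0))  # → 3:45
```

The arithmetic is the easy half; the benchmark scores reflect the perception half, where the angles must first be extracted from the image.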
Jagged Intelligence: The Concept Visualized in Data
The term “jagged intelligence” builds on the “jagged frontier” of AI capability described by Ethan Mollick of the Wharton School: capabilities don’t form a smooth boundary. A model that outperforms a PhD candidate on academic reasoning may fail tasks that a child handles trivially. Plotted as a capability profile, the result looks jagged, not uniformly advanced.
The benchmark data makes this concrete:
| Benchmark | GPT-5.4 | Claude Opus 4.6 | Human Baseline |
|---|---|---|---|
| Humanity’s Last Exam (expert reasoning) | >50% | Frontier-level | ~34% (domain experts) |
| Analog clock reading (spatial vision) | ~50% | 8.9% | >99% |
On tasks that take humans years to master, frontier models are now competitive or superior. On a task humans solve in milliseconds without conscious effort, the same models fail at rates that would be disqualifying in any deployed product. That is the jagged profile in numerical form.
Real-World Implications Beyond the Benchmark
For most enterprise AI deployments, reading a clock face is not a bottleneck. But the underlying failure mode — spatial reasoning from visual inputs — affects a wide range of practical applications: parsing charts in financial documents, interpreting radiology annotation overlays, reading architectural diagrams, extracting readings from industrial gauges, and navigating physical environments from camera feeds.
AI systems have proliferated across consumer applications — from AI weather apps that generate fluent forecasts but misread the spatial structure of radar imagery to scheduling tools that reason about time in text but struggle with visual representations of it. Clock reading is a clean, controlled version of the same underlying failure mode.
The Humans First movement’s core argument — that human cognitive capabilities retain specific, non-trivial advantages over AI systems — finds unexpected empirical support in spatial benchmarks like this one. The advantage isn’t mystical. It’s architectural: human visual systems evolved over millions of years for exactly this kind of spatial parsing, and no amount of text pretraining replicates it.
What Closes the Gap
The research community is actively working on the spatial reasoning problem. Approaches include training on synthetic spatial tasks, incorporating geometric reasoning modules, and fine-tuning on visual-spatial datasets with explicit angular relationships. The 14-month HLE trajectory — from 8.8% to 50%+ — suggests benchmark gaps close faster than expected once they become visible targets.
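What “training on synthetic spatial tasks” can look like is easy to sketch, because a clock’s geometry is fully parameterized. The generator below is a hypothetical illustration producing angle-label pairs with explicit angular relationships; a real pipeline would render each pair into an image before fine-tuning:

```python
import random

def synthetic_clock_example(rng: random.Random) -> dict:
    """One synthetic clock-reading example as (geometry, label).

    A real training pipeline would render these angles into a clock-face
    image; here the bare geometry stands in for the rendered image.
    """
    hour = rng.randrange(12)   # 0..11, where 0 renders as 12
    minute = rng.randrange(60)
    return {
        "hour_angle_deg": hour * 30.0 + minute * 0.5,
        "minute_angle_deg": minute * 6.0,
        "label": f"{12 if hour == 0 else hour}:{minute:02d}",
    }

rng = random.Random(0)
dataset = [synthetic_clock_example(rng) for _ in range(10_000)]
```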
But closing the clock gap won’t close all gaps. Every solved benchmark reveals the next unexpected failure mode. The jagged intelligence profile is not a temporary artifact of early-stage AI development — it is a predictable consequence of training paradigms that optimize heavily for some cognitive tasks while neglecting others. Spatial vision has been neglected.
Developers deploying AI on vision-native tasks should benchmark specifically against their actual use case. A 50%+ score on Humanity’s Last Exam tells you nothing about whether a model can reliably parse a scanned invoice, read a pressure gauge on industrial equipment, or interpret a radiology scan. The clock test is a reminder that “frontier model” is not a uniform capability guarantee — it is a jagged, uneven profile that requires task-specific evaluation before any production deployment.
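A task-specific evaluation need not be elaborate. The harness below is a sketch of the idea for the clock case; `model_fn` and the example set are placeholders for whatever inputs and labels your deployment actually provides:

```python
def evaluate(model_fn, examples, tolerance_min=1):
    """Score a model on the actual deployment task, not a proxy benchmark.

    model_fn maps an input (e.g. an image) to an "H:MM" string, and
    examples is a list of (input, gold "H:MM") pairs. Both are
    placeholders for the real use case.
    """
    def to_minutes(t: str) -> int:
        h, m = t.split(":")
        return (int(h) % 12) * 60 + int(m)

    correct = 0
    for x, gold in examples:
        diff = abs(to_minutes(model_fn(x)) - to_minutes(gold))
        # Circular distance on the dial, so 11:59 vs 12:00 is 1 minute off.
        diff = min(diff, 720 - diff)
        correct += diff <= tolerance_min
    return correct / len(examples)

# A stub "model" that always answers 12:00 scores 2/3 here:
# exact on 12:00 and within one minute of 11:59.
acc = evaluate(lambda x: "12:00", [(None, "12:00"), (None, "3:45"), (None, "11:59")])
```

Scoring within a small tolerance, rather than exact string match, keeps the metric aligned with what “reading the clock correctly” means in practice.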