METR Struggles to Measure Claude Mythos

AI evaluation org METR estimates Claude Mythos Preview’s 50% time horizon at 16+ hours (95% CI: 8.5-55 hours) — at the upper limit of METR’s existing test methodology.
Of METR’s 228 evaluation tasks, only 5 are 16 hours or longer, making measurements at this range “unstable and less meaningful.” METR is working on updated methods with longer tasks.
Palo Alto Networks said three weeks of model-based analysis with Mythos, GPT-5.5-Cyber, and Claude Opus 4.7 matched an entire year of manual penetration testing.
Palo Alto puts coding-efficiency improvement of current frontier models over predecessors at ~50% — “the threshold at which AI crosses from a helpful assistant into an autonomous operator.”

What Happened

Two independent assessments of Anthropic’s Claude Mythos surfaced on May 10, 2026. AI evaluation organization METR said its existing test methodology is hitting its measurement ceiling on Mythos. Cybersecurity firm Palo Alto Networks separately published an evaluation describing the latest frontier models as “a step-change in capability,” with the time from initial access to data exfiltration shrinking to as little as 25 minutes in AI-supported scenarios.

Why It Matters

METR’s measurement ceiling is structurally significant: when an evaluation organization can no longer reliably measure a frontier model’s capability, the public information environment shifts to vendor-published numbers. Palo Alto’s “step-change” framing — combined with the company’s revised assessment that the six-month window before attackers gain comparable capabilities has “accelerated significantly” — adds operational urgency to the cyber-capability discussion that the Palisade Research self-replication paper (covered separately today) makes concrete.

Technical Details

METR evaluated an early version of Claude Mythos Preview during a limited time window in March 2026. The organization estimates a 50 percent time horizon of at least 16 hours, with a 95 percent confidence interval of 8.5 to 55 hours. The time horizon is the task length, measured by how long the task would take a human, at which the model has a 50 percent chance of completing it. As reference points, METR cites training a classifier (~45 minutes) and training an adversarially robust image model (~4 hours).
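To make the metric concrete, here is a minimal sketch of how a 50 percent time horizon can be read off a fitted curve. It assumes a logistic model of success probability against the base-2 logarithm of human task time, which is our understanding of the general approach in METR's published time-horizon work. The task results below are invented for illustration and are not METR data, and the function names are hypothetical.

```python
import math

# Hypothetical (human_hours, model_succeeded) results. NOT METR data.
TASKS = [
    (0.25, 1), (0.5, 1), (1, 1), (2, 1), (4, 1), (4, 0), (8, 1),
    (8, 0), (16, 1), (16, 0), (32, 1), (32, 0), (64, 0), (64, 0),
]

def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit P(success) = sigmoid(a + b * x) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p          # derivative of log-likelihood w.r.t. a
            grad_b += (y - p) * x    # derivative of log-likelihood w.r.t. b
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def time_horizon_hours(a, b, p=0.5):
    """Solve sigmoid(a + b * log2(t)) = p for t, in hours."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - a) / b)

xs = [math.log2(hours) for hours, _ in TASKS]
ys = [succeeded for _, succeeded in TASKS]
a, b = fit_logistic(xs, ys)
horizon = time_horizon_hours(a, b)
print(f"Estimated 50% time horizon: {horizon:.1f} hours")
```

The slope `b` comes out negative because success becomes less likely as tasks get longer. The horizon is the point where the fitted curve crosses 50 percent.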

METR’s framing on the limit: “at the upper end of what we can measure without new tasks.” Of 228 tasks in the test suite, only 5 are 16 hours or longer. Measurements in this range are “unstable and less meaningful than at ranges with better task coverage.” METR therefore doesn’t provide precise estimates for models above this threshold. The organization notes its existing test suite “could still distinguish a much more capable model from current publicly-known state-of-the-art models” but the measurements aren’t robust enough for precise quantitative comparisons or extrapolations.
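A rough calculation shows why five tasks offer so little statistical footing. The sketch below treats each task as an independent pass/fail trial and uses the textbook binomial standard error. That is a simplifying assumption for illustration, not METR's actual statistical procedure, and the shorter tasks span many lengths, so the comparison is only indicative.

```python
import math

TOTAL_TASKS = 228  # tasks in METR's suite, per the article
LONG_TASKS = 5     # tasks of 16 hours or longer, per the article

def success_rate_std_error(p, n):
    """Standard error of an observed success rate over n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)

share_long = LONG_TASKS / TOTAL_TASKS
se_long = success_rate_std_error(0.5, LONG_TASKS)
se_rest = success_rate_std_error(0.5, TOTAL_TASKS - LONG_TASKS)

print(f"Share of tasks at 16+ hours: {share_long:.1%}")
print(f"Std. error at a 50% success rate, 5 tasks: {se_long:.3f}")
print(f"Std. error at a 50% success rate, 223 tasks: {se_rest:.3f}")
```

Under these assumptions, a success rate measured on five tasks carries an uncertainty of roughly 22 percentage points, several times that of a rate measured across the rest of the suite.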

Palo Alto Networks’ parallel assessment: the company says it recently had “early, unbounded access to the latest frontier AI models,” including Mythos, OpenAI’s GPT-5.5-Cyber, and Claude Opus 4.7. The models showed an “intuitive understanding of software vulnerabilities,” shifting AI’s role from assistant to autonomous agent “capable of discovering and chaining flaws at a scale that most defenders aren’t prepared for.” Three weeks of model-based analysis matched an entire year of manual penetration testing, with broader coverage. The models combined several individually low-rated vulnerabilities into critical attack paths. Time from initial access to data exfiltration can shrink to 25 minutes in AI-supported scenarios.

Palo Alto puts the coding-efficiency improvement of current frontier models over their predecessors at around 50 percent. “That number sounds incremental, but in practice, it’s the threshold at which AI crosses from a helpful assistant into an autonomous operator.” The company sees additional risk in the rapidly growing, unmonitored attack surface: “every desktop is effectively a server” as local AI agents become more common, and most organizations have no visibility into the code their employees are generating and deploying.

The independent UK AISI evaluation also confirmed a higher threat level. Together with METR and Palo Alto, three independent assessments now agree that frontier-AI cybersecurity capability has crossed a measurement and operational threshold — though the actual scope of the threat in the wild remains unclear.

Who’s Affected

Anthropic gains independent third-party evaluation supporting its restricted-access posture for Mythos. METR must update its methodology if its evaluation work is to remain useful for the most capable models. Palo Alto Networks gains commercial positioning as the AI-cyber-defense vendor with documented experience evaluating against frontier offensive capability. Cybersecurity buyers gain a clearer empirical picture: AI-driven offensive operations are no longer hypothetical. Anthropic’s Project Glasswing partners (the restricted Mythos cohort) and OpenAI’s GPT-5.5-Cyber Trusted Access defenders gain context for their own deployment decisions.

What’s Next

METR’s updated methodology with longer tasks is in development; its completion date will determine how quickly evaluations regain meaningful precision on frontier models. Palo Alto Networks is likely to expand its public evaluation series with comparable assessments of additional models. The claim that the six-month window has “accelerated significantly” implies Palo Alto expects attackers to gain frontier-equivalent capability in less than six months, a forecast worth tracking against actual incident data.
