
Claude Mythos Tops ExploitBench, Develops V8 Browser Exploits at 12x GPT-5.5 Cost

By James Whitfield, May 17, 2026
Engine Score 8/10 — Important

  • Carnegie Mellon’s new ExploitBench measures how far AI agents can progress when exploiting real-world vulnerabilities in Google’s V8 JavaScript engine.
  • Claude Mythos Preview scored 9.90/16 with human nudges, reaching arbitrary code execution on 21 of 41 vulnerabilities; GPT-5.5 trailed at 5.51, reaching the top tier on 2.
  • The full Mythos test run cost about $36,428; GPT-5.5 via Codex cost $3,075, roughly one-twelfth as much.
  • Mythos reproduced CVE-2024-0519, a vulnerability human researchers had failed to crack for over a year, per co-author Seunghyun Lee.

What Happened

Researchers at Carnegie Mellon University built ExploitBench, a new benchmark that scores how far AI agents progress when exploiting real-world vulnerabilities in Google’s V8 JavaScript engine. Unlike previous tests that check only whether a bug triggers, ExploitBench scores progress across five tiers, ending at arbitrary code execution. V8 powers Chrome, Edge, Node.js, and Cloudflare Workers.

Why It Matters

ExploitBench moves AI-cyber-capability evaluation from “can the model find a bug” to “can the model weaponise the bug”, which is the operationally relevant threshold. Anthropic’s Claude Mythos Preview led OpenAI’s GPT-5.5 by a wide margin (9.90 versus 5.51 out of 16), mirroring the UK AISI’s assessment from last week, which placed Mythos as the first model to clear both AISI cyber ranges. Mythos has now led on agentic cyber capability in at least two independent evaluation frameworks.

The cost-versus-capability trade-off is also now quantified. Mythos’s lead came at roughly 12 times the per-episode cost of GPT-5.5 via Codex, suggesting OpenAI could close some of the gap by allowing higher inference budgets per task.

Technical Details

Claude Mythos Preview, with occasional human “nudges,” hit an average ExploitBench score of 9.90 out of 16 and reached the highest tier (arbitrary code execution, T1) on 21 of 41 tested vulnerabilities. OpenAI’s GPT-5.5 trailed at 5.51, reaching T1 on only two. In fully autonomous mode (no human nudges), Mythos still scored 9.55; GPT-5.5 via Codex managed 4.30. No other tested model achieved full T1 code execution.
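The scores above are easier to compare once normalized. The following is a minimal sketch, not ExploitBench's own tooling: it takes only the figures quoted in this article and computes each model's share of the maximum score and its T1 rate. The dictionary layout and the names `reported`, `summarize`, and `nudge_gap` are illustrative. Treating 5.51 and 4.30 as GPT-5.5's nudged and autonomous scores is an assumption; the article does not state that pairing explicitly.

```python
# Figures as quoted in the article; the data layout and names are illustrative.
MAX_SCORE = 16      # maximum ExploitBench score per the article
TOTAL_VULNS = 41    # number of vulnerabilities tested per the article

reported = {
    # "headline" vs "autonomous" pairing for GPT-5.5 is an assumption.
    "Claude Mythos Preview": {"headline": 9.90, "autonomous": 9.55, "t1": 21},
    "GPT-5.5": {"headline": 5.51, "autonomous": 4.30, "t1": 2},
}


def summarize(stats):
    """Return normalized score, T1 rate, and headline-minus-autonomous gap."""
    return {
        "score_fraction": stats["headline"] / MAX_SCORE,
        "t1_rate": stats["t1"] / TOTAL_VULNS,
        "nudge_gap": stats["headline"] - stats["autonomous"],
    }


summary = {model: summarize(stats) for model, stats in reported.items()}

for model, s in summary.items():
    print(
        f"{model}: {s['score_fraction']:.1%} of max score, "
        f"T1 on {s['t1_rate']:.1%} of bugs, "
        f"gap to autonomous run {s['nudge_gap']:.2f} points"
    )
```

On these inputs, Mythos reaches T1 on roughly 51% of the tested vulnerabilities and GPT-5.5 on roughly 5%.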

The full Mythos test run across 122 episodes cost about $36,428, per ExploitBench’s accounting. GPT-5.5 via Codex ran 123 episodes for roughly $3,075, about one-twelfth the cost. ExploitBench co-author Seunghyun Lee, an experienced security researcher with over 20 reported browser vulnerabilities, manually reviewed the Mythos transcripts and described the model as functioning like a “fairly competent browser / JS engine security researcher.” In one case, Mythos developed an exploit technique that Lee and a colleague had previously dismissed as too complex; in another, it reproduced CVE-2024-0519, which human researchers had failed to crack for over a year.
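The "roughly 12x" figure can be checked from the totals and episode counts quoted above. This is a small sketch of that arithmetic only; the variable and function names are illustrative, and the inputs are the article's rounded figures rather than ExploitBench's raw accounting.

```python
# Totals and episode counts as quoted in the article; names are illustrative.
runs = {
    "Claude Mythos Preview": {"total_usd": 36_428, "episodes": 122},
    "GPT-5.5 via Codex": {"total_usd": 3_075, "episodes": 123},
}


def cost_per_episode(run):
    """Average spend per episode in US dollars."""
    return run["total_usd"] / run["episodes"]


mythos = runs["Claude Mythos Preview"]
gpt = runs["GPT-5.5 via Codex"]

# Ratio of total spend, and ratio of average spend per episode.
total_ratio = mythos["total_usd"] / gpt["total_usd"]
per_episode_ratio = cost_per_episode(mythos) / cost_per_episode(gpt)

print(f"Mythos per episode: ${cost_per_episode(mythos):.2f}")
print(f"GPT-5.5 per episode: ${cost_per_episode(gpt):.2f}")
print(f"Total cost ratio: {total_ratio:.2f}x")
print(f"Per-episode cost ratio: {per_episode_ratio:.2f}x")
```

Because the two runs have nearly the same number of episodes, the total and per-episode ratios both come out just under 12.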

The benchmark’s bugs are publicly known, so models could in principle draw on training data, although the dataset also includes vulnerabilities with no public exploit or bug report. ExploitBench is available on GitHub.

Who’s Affected

Anthropic and OpenAI gain a directly comparable external benchmark on agentic cyber capability — useful for both safety policy and commercial positioning. Defensive security teams at companies operating V8-dependent infrastructure — Google, Microsoft (Edge), Cloudflare, Vercel, Netlify, Node.js shops — get a sharper external signal on the offensive-AI threat model. The UK AISI and US AISI gain a third benchmark framework for their evaluation programmes. Other frontier-model providers — Google DeepMind, Mistral, xAI — face the question of whether to participate in ExploitBench-style evaluation.

What’s Next

The ExploitBench team plans to expand coverage beyond V8 in subsequent releases. Anthropic and OpenAI are expected to address the results in upcoming safety reports. The roughly 12x cost gap also points to a near-term lever: OpenAI may narrow the capability gap by allowing larger inference budgets per task. Future benchmark iterations will likely measure the ability to discover novel vulnerabilities rather than reproduce known CVEs.
