Anthropic, the AI safety company co-founded by Dario Amodei in 2021, has internally benchmarked Claude Mythos Preview, an unreleased frontier model. As of April 2026, the model scored 93.9% on SWE-bench Verified and 94.6% on GPQA Diamond, both the highest scores ever recorded on those evaluations. During controlled security testing, it autonomously identified thousands of zero-day vulnerabilities across every major operating system and browser, a scale of discovery without precedent in automated security research.
Neither result has been formally published on public leaderboards. What follows is based on Anthropic’s acknowledged benchmarks and the scope of security testing the company has internally disclosed.
SWE-bench Verified at 93.9%: Beyond the Senior Engineer Threshold
SWE-bench Verified measures whether a model can autonomously resolve real GitHub issues from production codebases — not synthetic tasks, not curated toy problems. A 93.9% resolution rate means Claude Mythos Preview correctly handled roughly 94 out of 100 real software engineering tasks drawn from active open-source repositories, without human guidance.
The previous high-water mark on SWE-bench Verified was under 80%. OpenAI’s o3, which set industry benchmarks when released in early 2025, scored approximately 71.7%. The jump from the high 70s to 93.9% is not incremental: it crosses a practical threshold past which most real-world software engineering tasks no longer require human correction of AI output.
That distinction matters commercially and operationally. Models in the 70–85% range still require steering on edge cases, ambiguous requirements, and multi-file refactors. At 93.9%, the model operates effectively past the point where most engineering tasks need intervention. For enterprises evaluating AI coding infrastructure, 93.9% is the number that changes the cost model for software development.
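The cost-model argument above is ultimately arithmetic: a resolution rate is the fraction of tasks that never come back to a human. A minimal sketch, using the scores quoted in this article (the Mythos Preview figure is Anthropic-internal and unverified on public leaderboards), shows how the intervention load falls as the rate rises:

```python
# Illustrative only: expected tasks per 100 issues that still need
# human correction at each quoted resolution rate.
rates = {
    "OpenAI o3 (early 2025)": 0.717,
    "Prior record (under 80%)": 0.80,
    "Claude Mythos Preview (internal)": 0.939,
}

for name, rate in rates.items():
    failures = round((1 - rate) * 100)
    print(f"{name}: ~{failures} of 100 tasks need human correction")
```

At 71.7% a team still corrects roughly 28 of every 100 tasks; at 93.9% that falls to about 6, which is the shift in review overhead the paragraph above describes.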
GPQA Diamond at 94.6%: Outperforming the Experts Who Wrote the Questions
GPQA Diamond is a benchmark of graduate-level questions across physics, chemistry, and biology, designed specifically so that PhD-level domain experts in the relevant field score around 65–70%. Human experts outside the specific subfield score roughly 34%. It is, by design, a ceiling test.
Claude Mythos Preview scored 94.6% — comfortably above the median performance of the specialist researchers who wrote and validated the questions. GPT-4 scored approximately 35.7% on GPQA Diamond at launch. Google DeepMind’s Gemini Ultra 2 peaked around 90% in late 2025 testing. Mythos Preview’s 94.6% is the first score that materially clears the expert-level ceiling.
GPQA Diamond is not a coding benchmark. The questions require multi-step scientific reasoning across disparate domains — exactly the cognitive architecture that enables a model to navigate unfamiliar codebases, reason about complex vulnerability chains, and identify exploit conditions that a human researcher might miss after days of analysis.
Thousands of Zero-Days: What That Scale Actually Means
During Anthropic’s internal security testing, Claude Mythos Preview autonomously identified thousands of zero-day vulnerabilities — previously unknown security flaws with no existing patches — across every major operating system, including Windows, macOS, and Linux, and every major browser, including Chrome, Firefox, and Safari.
For context: NIST’s National Vulnerability Database logged approximately 29,000 total CVEs across all vendors and severity levels in 2023. The vast majority were previously known or disclosed by human researchers. Elite security teams at nation-state intelligence agencies might discover dozens of novel zero-days per year across narrowly targeted codebases.
A single AI model identifying thousands of previously unknown flaws across the entire major OS and browser stack — simultaneously, in a single testing cycle — is not a linear improvement on existing security research. It is a structural change in how vulnerability discovery works. The gap between AI-assisted and human-only operations is widening across every domain — security research is where that gap has now become impossible to ignore.
Responsible Disclosure Has No Framework for This
The responsible disclosure process — researcher finds vulnerability, notifies vendor, vendor patches within 90 days, public disclosure follows — was engineered for human-speed research producing a manageable number of findings. It has no operational framework for thousands of zero-days discovered simultaneously across a dozen major vendors.
Microsoft, Apple, Google, the Linux kernel team, and every major browser vendor would need to receive, triage, prioritize, and patch overlapping vulnerabilities in coordinated silence — without any disclosure creating an exploitation window long enough for threat actors to weaponize. The current CVE processing pipeline handles roughly 80 new vulnerabilities per day across all vendors combined. A single Mythos testing run could exceed that volume by a factor of ten or more.
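The throughput mismatch described above can be made concrete with back-of-envelope arithmetic. The figures come from this article (roughly 29,000 CVEs logged in 2023, i.e. about 80 per day); the batch size of 3,000 is an assumed stand-in for "thousands" of simultaneous findings:

```python
# Back-of-envelope sketch of the disclosure backlog described above.
ANNUAL_CVES_2023 = 29_000                 # total CVEs logged in 2023 (article figure)
daily_throughput = ANNUAL_CVES_2023 / 365  # ~79 CVEs processed per day, all vendors

batch = 3_000                              # assumed size of one "thousands" finding batch
days_to_absorb = batch / daily_throughput  # if the pipeline handled nothing else

print(f"Pipeline rate: ~{daily_throughput:.0f} CVEs/day")
print(f"A {batch}-finding batch alone would consume ~{days_to_absorb:.0f} days of capacity")
```

Even under the optimistic assumption that every vendor drops all other work, a single batch of that size would monopolize weeks of the entire pipeline's capacity, which is the 90-day-window problem in miniature.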
Anthropic has not publicly outlined a disclosure strategy for vulnerabilities discovered during model evaluation, and no organization, Anthropic included, is immune to internal information-handling failures. The stakes of a mismanaged zero-day portfolio at this scale are not hypothetical; they scale directly with the model’s capability.
What Security Teams Need to Understand Right Now
The conventional enterprise security posture assumes that zero-day discovery requires high-skill human researchers, months of focused effort, and narrowly scoped targets. All three assumptions are now operationally wrong.
If a model at Mythos Preview’s capability level were deployed by a well-resourced adversary — an advanced persistent threat group, a state actor, or an organized criminal syndicate — the effective attack surface across critical infrastructure expands by orders of magnitude. Anthropic’s own testing confirms the discovery rate is real, achievable, and repeatable.
The defensive corollary is equally direct. The same capability that finds vulnerabilities at scale can be directed toward automated patch generation, continuous code auditing, and proactive threat modeling. Organizations that integrate AI-powered security tooling at this capability tier will have a structural defensive advantage that compounds over time. MegaOne AI tracks 139+ AI tools across 17 categories — the security tooling category is where the capability delta between AI-assisted and traditional approaches is now most pronounced.
Where the Rest of the Field Stands
If Mythos Preview’s benchmark scores are validated on public leaderboards at release, Anthropic will simultaneously hold the SWE-bench Verified record and the GPQA Diamond record — the two most closely watched capability benchmarks in the industry. No other lab currently holds both.
Google DeepMind’s Gemini Ultra 2 reached approximately 90% on GPQA Diamond in late 2025. OpenAI’s o4 series has not publicly broken 85% on SWE-bench Verified. Meta’s Llama 4 models remain open-weight but have not approached frontier scores on either benchmark. Consolidation pressure across the AI sector is intensifying as capability gaps between frontier and second-tier models widen — Mythos Preview widens that gap in two dimensions simultaneously.
The zero-day discovery capability is a separate category entirely. No other lab has publicly reported automated zero-day discovery at scale during model testing. That asymmetry — one lab with a model capable of finding thousands of novel vulnerabilities — will directly influence enterprise security vendor partnerships, government AI procurement, and the trajectory of AI safety policy.
Release Timeline and What Comes Next
Anthropic has not announced a public release date for Claude Mythos Preview. Based on Anthropic’s historical cadence — Claude 3 Opus launched in March 2024, Claude 3.7 Sonnet in February 2025 — major model releases have occurred roughly every six to twelve months. A commercial deployment in Q3 or Q4 2026 is plausible, though no official timeline has been confirmed. Autonomous AI systems capable of extended independent reasoning across complex domains are no longer theoretical — Mythos Preview is the clearest evidence yet that this capability tier is deployable at scale.
The more immediate question is how Anthropic handles the security findings before public release. Every day those zero-days remain undisclosed to affected vendors is a day they remain exploitable by anyone who independently discovers them. Anthropic’s safety-first positioning requires a public answer — and quickly.
The enterprises, security teams, and policymakers that start building frameworks for AI-discovered vulnerability portfolios now will be the ones positioned to act when Mythos Preview — or its equivalents from other labs — reaches general availability. Waiting for release to start planning is not a defensible posture.