- Lyptus Research found that AI offensive cybersecurity capability has doubled every 5.7 months since 2024, an acceleration from the 9.8-month doubling rate it measured over 2019–2024.
- Claude Opus 4.6 and GPT-5.3 Codex each achieved a 50% task success rate on assignments rated at roughly three hours of professional security-expert effort, using a two-million-token compute budget.
- GPT-5.3 Codex extended its effective time horizon from 3.1 to 10.5 hours when its token budget was raised fivefold, from two million to ten million tokens.
- Open-source models currently trail closed-source frontier models by approximately 5.7 months on the same benchmark tasks.
What Happened
AI safety research firm Lyptus Research published a benchmark study in April 2026 showing that frontier AI models’ offensive cybersecurity capabilities have been doubling every 5.7 months since 2024, as reported by The Decoder. The study applied the METR time-horizon method — a framework that quantifies AI task performance in terms of equivalent hours of skilled human effort — and was calibrated against ten professional security experts across 291 tasks. The full dataset and report are publicly available on GitHub and Hugging Face.
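The METR time-horizon approach can be sketched as fitting a logistic curve to task success versus log task length, then reading off the length at which success probability crosses 50%. The sketch below uses made-up illustrative data, not the Lyptus dataset, and a minimal pure-Python fit rather than whatever tooling the study actually used.

```python
import math

# Illustrative (task_length_hours, success) pairs -- NOT the Lyptus data.
tasks = [(0.25, 1), (0.5, 1), (1.0, 1), (1.5, 1), (2.0, 0),
         (3.0, 1), (3.0, 0), (4.0, 0), (6.0, 0), (8.0, 0)]

def fit_logistic(data, lr=0.1, steps=20000):
    """Fit P(success) = sigmoid(a + b * log2(hours)) by gradient ascent."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        ga = gb = 0.0
        for hours, y in data:
            x = math.log2(hours)
            p = 1 / (1 + math.exp(-(a + b * x)))
            ga += y - p          # gradient of log-likelihood w.r.t. a
            gb += (y - p) * x    # gradient w.r.t. b
        a += lr * ga / len(data)
        b += lr * gb / len(data)
    return a, b

a, b = fit_logistic(tasks)
# The 50% time horizon is where a + b * log2(h) = 0, i.e. h = 2**(-a/b).
horizon = 2 ** (-a / b)
print(f"estimated 50% time horizon: {horizon:.1f} hours")
```

The key property this illustrates is that the "time horizon" is a single summary statistic of a full success-vs-length curve, which is why it can shift so sharply when compute budgets change.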
Why It Matters
The 5.7-month doubling interval represents a meaningful acceleration from the longer-run trend: Lyptus Research found that AI offensive capability had been doubling every 9.8 months since 2019 before the pace quickened in 2024. The researchers stated that current benchmarking methodology likely underestimates the true rate of progress, because performance scales with token budget in ways the standard two-million-token test condition does not fully capture.
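As arithmetic, the change in doubling interval compounds quickly: a capability doubling every 5.7 months grows roughly 4.3x per year, versus about 2.3x per year at the earlier 9.8-month pace. A quick check:

```python
# Growth factor over one year implied by a doubling interval in months.
def annual_growth(doubling_months: float) -> float:
    return 2 ** (12 / doubling_months)

fast = annual_growth(5.7)   # post-2024 pace reported by Lyptus Research
slow = annual_growth(9.8)   # 2019-2024 pace
print(f"{fast:.1f}x per year vs {slow:.1f}x per year")
```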
Technical Details
At a two-million-token compute budget, both Claude Opus 4.6 and GPT-5.3 Codex achieved a 50% success rate on tasks rated at approximately three hours of expert effort under the METR time-horizon scale. When GPT-5.3 Codex’s token budget was increased fivefold to ten million tokens, its effective time horizon expanded from 3.1 hours to 10.5 hours — a 3.4x increase in demonstrated capability from a 5x increase in compute allocation. Open-source models performed at a level consistent with where closed-source models stood approximately 5.7 months prior, indicating a consistent but trailing capability trajectory.
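Treating the two reported GPT-5.3 Codex measurements as points on a power law, horizon ∝ tokens^α, gives a back-of-envelope scaling exponent. This fit is an assumption layered on just two data points, not a claim from the study, and the extrapolation at the end is purely illustrative.

```python
import math

# Reported GPT-5.3 Codex points: (token budget, time horizon in hours).
points = [(2_000_000, 3.1), (10_000_000, 10.5)]

# Assume horizon = c * tokens**alpha and solve alpha from the two points.
(t1, h1), (t2, h2) = points
alpha = math.log(h2 / h1) / math.log(t2 / t1)
print(f"implied scaling exponent alpha ~ {alpha:.2f}")

# Naive extrapolation to a hypothetical 50M-token budget (illustrative only).
h_50m = h2 * (50_000_000 / t2) ** alpha
print(f"extrapolated horizon at 50M tokens: {h_50m:.0f} hours")
```

An exponent below 1 means the time horizon grows sublinearly in tokens, which is consistent with the reported 3.4x capability gain from a 5x budget increase.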
Who’s Affected
Enterprise security teams and critical infrastructure operators face the most direct exposure, as the 291 benchmark tasks reflect offensive techniques applicable to real-world penetration testing and vulnerability exploitation. Anthropic and OpenAI are the primary model developers whose systems were measured, with Claude Opus 4.6 and GPT-5.3 Codex identified as the leading performers on the benchmark. Organizations relying on open-source models for security-adjacent workflows should account for the roughly six-month capability lag relative to the closed-source frontier.
What’s Next
Lyptus Research made the full dataset publicly available on GitHub and Hugging Face for independent replication and analysis. The team’s caveat that existing token-budget ceilings understate capability growth points toward follow-on benchmarking at higher compute allocations and longer task-complexity horizons. No specific timeline for a follow-up study was disclosed in the published materials.