- The UK’s AI Security Institute (AISI) tested frontier models across seven benchmarks and found that fixed compute-budget caps systematically understate what AI agents can do.
- Success rates rose by up to about 25 percent when models were given more test-time compute, with the largest gains on cybersecurity and software-engineering tasks.
- The token budget a model needed scaled with how long a human expert would take on the same task.
- Newer models benefited disproportionately from larger compute budgets, which complicates single-number model comparisons.
What Happened
The UK’s AI Security Institute (AISI) reported that standard AI evaluations systematically underestimate agent capabilities because they cap the compute a model may spend on each task. The study, covering seven benchmarks and first reported by The Decoder on July 3, 2026, found that raising the test-time compute limit lifted success rates by up to about 25 percent, with the largest gains on cybersecurity and software-engineering tasks.
AISI describes an agent’s performance as a curve that rises with test-time compute, meaning the processing an agent is allowed to spend while working on a task. If an evaluation stops the agent while that curve is still climbing, the measured score understates what the model can actually do. The institute’s argument is that most published benchmark numbers are snapshots taken at an arbitrary point on that curve.
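To make the curve idea concrete, here is a minimal Python sketch. It is not AISI's methodology: the per-task token requirements are invented, and the helper names (`success_rate_at`, `capability_curve`, `still_climbing`) are hypothetical. It assumes a task counts as solved only if the agent's required tokens fit inside the budget.

```python
from typing import List, Sequence, Tuple

INF = float("inf")  # marks a task the agent never solves, however large the budget


def success_rate_at(budget: int, tokens_needed: Sequence[float]) -> float:
    """Fraction of tasks solved when every attempt is cut off at `budget` tokens."""
    solved = sum(1 for need in tokens_needed if need <= budget)
    return solved / len(tokens_needed)


def capability_curve(
    budgets: Sequence[int], tokens_needed: Sequence[float]
) -> List[Tuple[int, float]]:
    """Success rate at each budget, i.e. the curve rather than a single score."""
    return [(b, success_rate_at(b, tokens_needed)) for b in sorted(budgets)]


def still_climbing(points: Sequence[Tuple[int, float]], min_gain: float = 0.01) -> bool:
    """True if the last step on the curve still added more than `min_gain`."""
    if len(points) < 2:
        return False
    return points[-1][1] - points[-2][1] > min_gain


# Hypothetical data: tokens one agent would need on each of ten tasks.
tokens_needed = [20_000, 45_000, 90_000, 150_000, 300_000,
                 600_000, 1_200_000, INF, INF, INF]
budgets = [50_000, 100_000, 250_000, 500_000, 1_000_000, 2_000_000]

points = capability_curve(budgets, tokens_needed)

# An evaluation capped at 250k tokens only sees the left part of the curve.
capped = [p for p in points if p[0] <= 250_000]

for budget, rate in points:
    print(f"{budget:>9,} tokens -> {rate:.0%} solved")
print("Curve still rising at the 250k cap:", still_climbing(capped))
```

In this toy data the score reported at the 250k cap is still rising when the evaluation stops, which is the situation the institute describes.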
Why It Matters
Benchmark scores are the AI industry’s main yardstick for comparing models and deciding what is safe to deploy, so a systematic measurement bias affects procurement decisions, safety sign-off, and public capability claims alike. If a model looks weak only because it was capped, an evaluator can wrongly conclude it is safe, or a buyer can wrongly conclude it is not useful.
The finding extends a trend visible since the reasoning-model wave of 2024 and 2025, when systems that spent more compute at inference time improved markedly on hard tasks. AISI’s contribution is to quantify how much a fixed cap distorts the picture across several benchmarks at once, rather than for a single model or task, and to do so from the vantage point of a government safety body rather than a lab marketing its own system.
Technical Details
AISI tested frontier models across seven benchmarks while varying the compute budget available to each model. Success rates increased by up to about 25 percent under larger budgets, with the largest gains on cybersecurity and software-engineering tasks, two categories that carry direct safety and security weight. The study also found that the number of tokens a model required scaled with how long a human expert would need to complete the same task, which gives evaluators a rough, principled way to size a fair budget instead of picking a round number. A further complication is that newer models gained disproportionately from additional compute. Two models compared under the same cap may therefore be separated as much by where the cap falls on each one’s curve as by their underlying ability, which undermines head-to-head tables that assume a fixed budget is neutral.
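One way an evaluator might turn the human-time relationship into a budget is sketched below in Python. The article does not give the functional form AISI found, so the power-law fit here is an assumption, as are the calibration numbers, the `headroom` multiplier, and the function names.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(minutes: Sequence[float], tokens: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of tokens ~= a * minutes**b in log-log space.

    ASSUMPTION: a power law is used purely for illustration.
    """
    xs = [math.log(m) for m in minutes]
    ys = [math.log(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space
    a = math.exp(mean_y - b * mean_x)   # intercept, mapped back from log space
    return a, b


def suggested_budget(human_minutes: float, a: float, b: float, headroom: float = 2.0) -> int:
    """Token budget for a task, padded by a hypothetical safety multiplier."""
    return math.ceil(headroom * a * human_minutes ** b)


# Hypothetical calibration set: human expert time vs. tokens used on solved tasks.
human_minutes = [5, 15, 60, 240]
tokens_used = [35_000, 130_000, 450_000, 2_100_000]

a, b = fit_power_law(human_minutes, tokens_used)
for m in (10, 30, 120):
    print(f"{m:>4} human-minutes -> budget of {suggested_budget(m, a, b):,} tokens")
```

The point of the sketch is only that the budget is derived from task difficulty rather than set to one round number for every task. Any real calibration would need measured data.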
Who’s Affected
The result matters most for the organizations that run and consume evaluations: AI safety institutes, frontier labs, and the policymakers and enterprise buyers who read benchmark tables as capability ceilings. Cybersecurity and software-engineering teams are directly implicated, since those were the task categories where measured performance moved most with extra compute. Governments that base deployment or export decisions on benchmark thresholds face the sharpest version of the problem, because a model that stays under a threshold when capped may exceed it once given more compute.
What’s Next
AISI’s framing points toward reporting capability as a function of compute — a curve — rather than as a single headline number. The practical limitation is cost: real deployments still cap compute per task for budget reasons, so the higher success rates describe potential under generous budgets rather than default behavior. Whether major evaluation suites adopt compute-scaled reporting will determine how much of this gap closes, and whether future safety cases are argued against a model’s ceiling rather than its capped score.
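A compute-scaled report of the kind described above could look something like the following Python sketch. The two models and all their scores are invented for illustration, and `leaders_by_budget` is a hypothetical helper rather than part of any evaluation suite.

```python
from typing import Dict, List, Tuple

# Hypothetical success rates at three token budgets for two made-up models.
curves: Dict[str, Dict[int, float]] = {
    "older-model": {100_000: 0.40, 500_000: 0.45, 2_000_000: 0.47},
    "newer-model": {100_000: 0.38, 500_000: 0.55, 2_000_000: 0.68},
}


def leaders_by_budget(curves: Dict[str, Dict[int, float]]) -> List[Tuple[int, str]]:
    """For each budget measured on every model, name the best-scoring model."""
    shared = set.intersection(*(set(c) for c in curves.values()))
    rows = []
    for budget in sorted(shared):
        leader = max(curves, key=lambda name: curves[name][budget])
        rows.append((budget, leader))
    return rows


rows = leaders_by_budget(curves)

# Print the full table instead of a single headline number per model.
for budget, leader in rows:
    scores = ", ".join(f"{name}={curves[name][budget]:.0%}" for name in curves)
    print(f"{budget:>9,} tokens: {scores} (leader: {leader})")

ranking_flips = len({leader for _, leader in rows}) > 1
print("Ranking depends on the budget:", ranking_flips)
```

With these made-up numbers the older model leads at the smallest budget and trails at the larger ones, so a single-budget table would tell only part of the story.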