A paper submitted to arXiv on March 31, 2026, by Aaditya Khanal, Yangyang Tao, and Junxiu Zhou introduces a new evaluation framework specifically designed to measure the reliability of long-horizon LLM agents. The paper, titled “Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents,” argues that the field’s standard benchmark metric — pass@1, which scores whether a model succeeds on a single attempt — is insufficient for assessing whether models can perform consistently over tasks that span many steps or significant duration.
- The paper introduces four new metrics — RDC, VAF, GDS, and MOP — to measure reliability separately from capability in long-horizon agent evaluations.
- Software engineering task reliability dropped sharply with task duration: GDS fell from 0.90 to 0.44, while document processing remained nearly flat (0.74 to 0.71).
- Frontier models recorded the highest meltdown rates — up to 19% — due to their tendency to pursue ambitious multi-step strategies that sometimes spiral.
- Memory scaffolds hurt long-horizon performance across all 10 models tested, a universal negative result.
What Happened
Khanal, Tao, and Zhou submitted the paper to arXiv on March 31, 2026, presenting both a conceptual argument and a large-scale empirical evaluation. Their central claim is that capability — whether a model can succeed on a task — and reliability — whether it succeeds consistently across repeated attempts as task duration varies — “diverge systematically as task duration grows.” The standard metric of pass@1, they argue, is “structurally blind to this divergence” when applied to short tasks.
To test this, the authors evaluated 10 models across 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains. At that scale, the evaluation averages roughly six episodes per model-task pair, giving the framework's metrics enough repeated trials to distinguish consistent success from occasional success, a distinction that single-attempt scoring cannot surface.
Why It Matters
The gap between benchmark performance and deployment behavior has long been a concern among AI practitioners. This paper provides a formal framework for measuring one specific aspect of that gap: how consistently a model completes tasks that require sustained, multi-step execution rather than a single response. Prior evaluations have focused predominantly on whether a model can perform a task at all, not whether it does so reliably across many attempts and varying task lengths.
The paper’s finding that capability and reliability rankings “diverge substantially, with multi-rank inversions at long horizons” has direct implications for how organizations select models for production use. A model that ranks first on a capability benchmark may rank significantly lower on a reliability benchmark — and the paper demonstrates this is not a marginal effect but a structural one.
Technical Details
The framework introduces four distinct metrics. The Reliability Decay Curve (RDC) tracks how a model’s success rate changes as task duration increases. The Variance Amplification Factor (VAF) measures how much outcome volatility grows with task length. The Graceful Degradation Score (GDS) captures whether performance declines smoothly rather than collapsing abruptly. The Meltdown Onset Point (MOP) identifies the task-duration threshold at which failure rates spike sharply.
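The paper does not publish formulas for these metrics, so the sketch below is one plausible reading of the prose descriptions, not the authors' actual definitions. It assumes episodes arrive as `(duration_bucket, succeeded)` pairs, treats the RDC as a per-bucket success rate, VAF as the ratio of Bernoulli outcome variance at the longest bucket to the shortest, GDS as one minus the steepest single-step drop, and MOP as the first bucket where the drop exceeds a spike threshold. All four interpretations, and the `spike` parameter, are assumptions.

```python
from collections import defaultdict

def reliability_decay_curve(episodes):
    """RDC (assumed definition): success rate per duration bucket.
    `episodes` is an iterable of (duration_bucket, succeeded) pairs,
    with bucket labels ordered by increasing task duration."""
    by_bucket = defaultdict(list)
    for bucket, ok in episodes:
        by_bucket[bucket].append(ok)
    return {b: sum(v) / len(v) for b, v in sorted(by_bucket.items())}

def variance_amplification_factor(curve):
    """VAF (assumed definition): Bernoulli outcome variance p*(1-p)
    at the longest bucket relative to the shortest bucket."""
    rates = list(curve.values())
    v_short = rates[0] * (1 - rates[0])
    v_long = rates[-1] * (1 - rates[-1])
    return v_long / v_short if v_short else float("inf")

def graceful_degradation_score(curve):
    """GDS (assumed definition): 1 minus the steepest single-step
    drop between adjacent buckets. 1.0 means a perfectly smooth
    decline; values near 0 indicate an abrupt collapse."""
    rates = list(curve.values())
    drops = [max(0.0, a - b) for a, b in zip(rates, rates[1:])]
    return 1.0 - max(drops, default=0.0)

def meltdown_onset_point(curve, spike=0.2):
    """MOP (assumed definition): first bucket whose success rate
    falls more than `spike` below the previous bucket's rate;
    None if no such spike occurs."""
    items = list(curve.items())
    for (_, prev_rate), (bucket, rate) in zip(items, items[1:]):
        if prev_rate - rate > spike:
            return bucket
    return None
```

For example, a model that succeeds on 9/10 short tasks, 8/10 medium tasks, and 4/10 long tasks yields an RDC of {0.9, 0.8, 0.4}, a GDS of 0.6 (steepest drop 0.4), and a meltdown onset at the longest bucket.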
Across domains, GDS behavior diverged substantially. In software engineering, GDS fell from 0.90 on shorter tasks to 0.44 on longer tasks — a decline of 0.46 points. Document processing remained nearly stable, with GDS dropping only from 0.74 to 0.71 across the same duration range. The authors describe this as “domain-stratified” reliability decay, indicating that certain task categories are structurally more vulnerable to performance collapse at longer horizons.
Frontier models — those with the highest capability scores — recorded the highest meltdown rates, reaching up to 19%. The authors attribute this to those models pursuing “ambitious multi-step strategies that sometimes spiral.” On the VAF metric, the paper found that high variance is a capability signature rather than an instability signal: stronger models show wider variance precisely because they attempt more complex solution paths, not because they are less controlled in execution.
A separate finding concerns memory scaffolds — external memory structures commonly used to extend agent context and recall. The study found that memory scaffolds universally degraded long-horizon performance across all 10 models evaluated. The paper does not test alternative scaffold architectures, so the result is limited to the scaffold configurations included in the study.
Who’s Affected
Developers building production agentic systems — particularly in software engineering automation and document processing — are most directly affected by the domain-specific findings. Organizations currently using pass@1 or similar short-task benchmarks to select production models may be making decisions that do not reflect long-horizon reliability. Evaluation infrastructure teams at AI labs will need to consider whether standard benchmark suites should incorporate metrics like GDS or MOP alongside existing capability scores.
What’s Next
The authors argue for reliability to be treated as “a first-class evaluation dimension alongside capability.” The paper does not specify whether the 396-task benchmark, evaluation code, or model outputs will be made publicly available, which would be necessary for independent replication of the findings. The universal negative result on memory scaffolds invites follow-up work: it remains unclear whether different scaffold designs or retrieval strategies would show the same degradation effect, or whether the pattern holds across task domains beyond the three included in this evaluation.