ANALYSIS

Claude Opus 4.6 Scores Drop 15 Points on BridgeBench Hallucination Test

Marcus Rivera · Apr 13, 2026 · 3 min read
Engine Score 7/10 — Important
  • BridgeMind reported on April 13, 2026, that Claude Opus 4.6’s accuracy on its BridgeBench hallucination benchmark fell from 83.3% to 68.3% between two test runs conducted approximately one week apart.
  • The decline moved the model from second to tenth place on BridgeBench’s public leaderboard.
  • The 15-percentage-point accuracy drop corresponds to an approximate doubling of the model’s error rate on the benchmark.
  • Anthropic has not publicly confirmed or explained the performance change; BridgeMind’s characterization of the model as “nerfed” is unverified.

What Happened

On April 13, 2026, BridgeMind — the team behind the BridgeBench hallucination evaluation platform — reported on X that Claude Opus 4.6 scored 68.3% accuracy on its most recent benchmark run, down from 83.3% in a test conducted approximately one week earlier. The result moved the model from second to tenth place on BridgeBench’s public leaderboard.

“CLAUDE OPUS 4.6 IS NERFED. BridgeBench just proved it,” the BridgeMind account posted, asserting the model had experienced “reduced reasoning levels.” Anthropic had not issued a public statement addressing the score discrepancy as of the time of publication.

Why It Matters

Benchmark consistency is a prerequisite for meaningful model comparisons. When a model’s scores on a fixed evaluation shift substantially between runs — without a documented version change — it creates ambiguity about whether the evaluation is testing the same underlying system, or whether the model itself has been modified.

Third-party evaluators have documented similar variability before. In 2023 and 2024, independent researchers reported capability shifts across versions of GPT-4, which eventually led OpenAI to clarify its model versioning and update disclosure practices. The BridgeBench findings revisit that question with a current-generation model.

Technical Details

BridgeBench, operated by BridgeMind and accessible at bridgebench.ai, measures model accuracy on hallucination-specific tasks. According to the posted results, Claude Opus 4.6’s accuracy declined from 83.3% to 68.3% — a 15-percentage-point drop — across two tests separated by approximately seven days.

In error-rate terms, this represents a shift from roughly 16.7% incorrect responses to 31.7%, an approximate doubling of measured hallucination frequency on the benchmark’s task set. BridgeMind’s original post did not specify the task categories, prompt counts, or sampling parameters used in either test run.
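The doubling claim is easy to check from the published figures alone. A minimal sketch (the two accuracy values come from BridgeMind’s post; the script itself is purely illustrative):

```python
# Verify the "approximate doubling" claim from the reported accuracies.
acc_before, acc_after = 0.833, 0.683   # BridgeMind's reported scores

err_before = 1 - acc_before            # ~16.7% incorrect responses
err_after = 1 - acc_after              # ~31.7% incorrect responses

print(f"error rate before: {err_before:.1%}")
print(f"error rate after:  {err_after:.1%}")
print(f"ratio: {err_after / err_before:.2f}x")  # ~1.90, i.e. close to double
```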

The leaderboard position change — from second to tenth — implies that multiple other models held their scores or improved over the same period, though BridgeMind did not identify those models in the published announcement.

Who’s Affected

Developers and enterprise users who selected Claude Opus 4.6 for its benchmark standing on factual accuracy face uncertainty about whether the model’s behavior has changed since they ran their own evaluations. Applications in legal research, medical information retrieval, and financial analysis — where hallucination rates are a primary selection criterion — are most sensitive to such variability.

Evaluation platform operators, including BridgeBench and competing leaderboards maintained by LMSYS and Hugging Face, are affected when widely used models show inconsistent scores between test periods, as this can reduce confidence in their rankings as stable comparative references.

What’s Next

Independent replication by other evaluation groups would clarify whether the performance gap BridgeMind observed is consistent across different test administrations. Without third-party confirmation, the source of the score change — whether a model-side modification, sampling variance, or a shift in evaluation infrastructure — cannot be determined from the available information.
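Sampling variance, at least, is quantifiable once prompt counts are known. The sketch below applies a standard two-proportion z-test to the reported scores; because BridgeMind published neither prompt counts nor sampling parameters, the figure of 300 prompts per run is a hypothetical assumption for illustration only:

```python
# Could a 15-point drop be run-to-run sampling noise? A two-proportion
# z-test gives a rough answer once the number of prompts is known.
from math import sqrt, erfc

n = 300                        # HYPOTHETICAL prompts per run; not published
p1, p2 = 0.833, 0.683          # reported accuracies, run 1 vs. run 2

p_pool = (p1 + p2) / 2         # pooled accuracy (equal n assumed in both runs)
se = sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (p1 - p2) / se
p_value = erfc(z / sqrt(2))    # two-sided p-value under a normal approximation

print(f"z = {z:.2f}, p = {p_value:.2e}")
# At n=300, z is roughly 4.3 (p << 0.001): far too large to be ordinary
# sampling noise. At much smaller n, the same gap would be less conclusive.
```

Replicators with access to the actual task set could run the same test with real counts; the point is that the verdict turns on a number BridgeMind has not disclosed.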

Anthropic does not currently maintain a formal public policy requiring disclosure of incremental capability changes within a named model release. As of publication, neither a response from the company nor a formal audit by other benchmark operators has been announced.
