BENCHMARKS

BullshitBench Results Show Anthropic Claude Models Dominate Top Seven Spots in Nonsense Detection Rankings

N Nikhil B Mar 29, 2026 Updated Apr 7, 2026 4 min read
Engine Score 7/10 — Important

This story presents a novel benchmark offering actionable insights into Claude's reliability, which is highly impactful for users and developers. However, the source reliability via Reddit and lack of external verification slightly temper its overall score.

Editorial illustration for: BullshitBench Results Show Anthropic Claude Models Dominate Top Seven Spots in Nonsense Detection
  • BullshitBench v2, a benchmark testing whether AI models reject nonsensical questions, shows Anthropic’s Claude models holding all seven top positions on the leaderboard.
  • Claude Sonnet 4.6 scored 91% on clear pushback against broken premises, with a “red rate” (confidently swallowing nonsense) of just 3%.
  • Google’s Gemini models scored between 10% and 20%, while the best non-Anthropic model was Alibaba’s Qwen 3.5 at 78%.
  • The benchmark tests 100 professionally framed questions with broken premises across five domains: coding, medical, legal, finance, and physics.

What Happened

Peter Gostev, AI Capability Lead at Arena.ai, released results from BullshitBench v2, a benchmark designed to measure whether AI models can detect and reject plausible-sounding nonsense. The results reveal a stark divide: Anthropic’s Claude models dominate the leaderboard, while several competing models confidently answer fundamentally unanswerable questions.

Claude Sonnet 4.6 on high reasoning mode scored 91%, meaning it correctly rejected nonsensical premises 91 times out of 100. Claude Opus 4.5 followed closely at 90%. All seven top positions on the leaderboard belong to Anthropic models.

The only non-Anthropic model to score above 60% was Alibaba’s Qwen 3.5 397B A17B, which reached 78% and landed at position eight. Google’s Gemini 2.5 Pro scored 20%, Gemini 2.5 Flash scored 19%, and Gemini 3 Flash Preview pushed back on just 10% of the nonsensical prompts.

Why It Matters

Most AI benchmarks test whether models can produce correct answers. BullshitBench tests something different: whether models know when not to answer. This distinction matters because confidently answering a nonsensical question can be more dangerous than admitting ignorance, particularly in high-stakes domains like medicine, law, and finance where users may act on AI-generated advice.

The gap between Claude and Gemini is particularly striking. Google’s models are widely used across consumer products — Search, Gmail, Android — where confident nonsense could reach billions of users. A 10% to 20% pushback rate means these models accepted invalid premises roughly four out of five times.

The benchmark’s creator, Peter Gostev, has documented cases where models provide detailed technical justifications for completely invalid premises. In one example, a prompt asked how different screw types affect pantry food flavor. Claude rejected the nonsensical connection, while other models provided multi-paragraph technical explanations linking screw material composition to food taste — authoritative-sounding answers to a question that should have been refused outright.

Technical Details

BullshitBench v2 comprises 100 questions across five domains: coding (40 questions), medical (15), legal (15), finance (15), and physics (15). Each prompt uses legitimate terminology and professional framing but contains broken premises or causally disconnected elements designed to look plausible.

A three-judge panel scores each response into one of three categories. “Green” means clear rejection of the nonsense. “Amber” indicates hedging while still engaging with the flawed premise. “Red” means the model accepted and elaborated on an invalid premise. Claude Sonnet 4.6’s red rate was just 3.0%, meaning it almost never confidently swallowed a lie.

The benchmark specifically targets a failure mode that standard accuracy tests miss. A model can score highly on factual benchmarks while still being prone to generating authoritative-sounding responses to questions that have no valid answer. The coding category, with 40 of the 100 questions, is the most heavily weighted domain, reflecting the widespread use of AI coding assistants in production environments where confidently wrong code can introduce bugs or security vulnerabilities.

Who’s Affected

Enterprise teams evaluating AI models for deployment in regulated industries — healthcare, legal, financial services — should pay attention to these results. A model that confidently answers nonsensical medical or legal questions poses a real liability risk, regardless of how well it performs on standard benchmarks.

The results also matter for AI safety researchers studying sycophancy and hallucination. BullshitBench measures a specific dimension of reliability — the willingness to say “this question doesn’t make sense” — that is difficult to capture with traditional evaluation methods. Model providers may also need to consider this metric when marketing their products for professional use cases.

What’s Next

The BullshitBench project is open-source and available on GitHub, allowing researchers to test additional models or expand the question set. As AI models are increasingly deployed in professional settings where confident nonsense can cause real harm, benchmarks measuring epistemic honesty rather than just accuracy are likely to gain wider adoption.

One limitation: the benchmark tests only English-language prompts, and nonsense detection may vary significantly across languages and cultural contexts. The 100-question test set, while carefully constructed, is also relatively small compared to benchmarks like MMLU (14,000+ questions) or HumanEval (164 problems), which means individual question design choices could disproportionately affect rankings.

Share

Enjoyed this story?

Get articles like this delivered daily. The Engine Room — free AI intelligence newsletter.

Join 500+ AI professionals · No spam · Unsubscribe anytime