BullshitBench, an independent benchmark that measures whether AI models detect and reject nonsensical prompts rather than confidently answering them, shows Anthropic’s Claude models occupying all seven top positions, with Claude Sonnet 4.6 achieving a 91 percent clear pushback rate on its highest reasoning setting.
The benchmark, created by researcher Peter Gostev, tests 100 prompts across five domains: software, finance, legal, medical, and physics. Each prompt is made to sound legitimate through professional framing and real terminology, but contains a fundamentally broken premise that makes it unanswerable. The benchmark does not test whether models make up facts; it tests whether they notice when the question itself is flawed.
Claude Opus 4.5 scored 90 percent, placing second. The remaining five top spots are all held by other Anthropic model variants. The only non-Anthropic entry above 60 percent is Alibaba’s Qwen 3.5 at 78 percent, which landed in eighth place.
OpenAI’s GPT-5.2 achieved a 38 percent clear pushback rate, with the GPT-5 family generally clustering between 20 and 50 percent. Google’s models performed the worst among major providers. Gemini 2.5 Pro scored 20 percent, Gemini 2.5 Flash scored 19 percent, and Gemini 3 Flash Preview pushed back on just 10 percent of nonsensical prompts.
Responses are categorized into three tiers. Clear pushback means the model explicitly rejects the flawed premise. Partial challenge means the model flags problems but still engages with the invalid assumption. Accepted nonsense means the model treats the broken prompt as legitimate and produces a confident answer. A three-judge panel consisting of Claude Sonnet 4.6, GPT-5.2, and Gemini 3.1 Pro evaluates each response.
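The article does not say how the three judges’ labels are combined into a single verdict. A minimal sketch of one plausible aggregation, a simple majority vote with a conservative fallback to the middle tier on a three-way split (both of these rules are assumptions, not the benchmark’s documented method), could look like:

```python
from collections import Counter

def aggregate_verdict(judge_labels):
    """Combine per-judge tier labels into one verdict by majority vote.

    `judge_labels` holds one tier string per judge, drawn from the three
    tiers described above: "clear_pushback", "partial_challenge", or
    "accepted_nonsense". With three judges, any tier chosen by at least
    two wins; on a three-way split we fall back to "partial_challenge"
    as a conservative middle ground (an assumption for this sketch).
    """
    tier, votes = Counter(judge_labels).most_common(1)[0]
    if votes >= 2:
        return tier
    return "partial_challenge"

# Example: two of three judges see a clear rejection of the premise.
verdict = aggregate_verdict(
    ["clear_pushback", "clear_pushback", "partial_challenge"]
)
# verdict == "clear_pushback"
```

A per-model score would then be the fraction of the 100 prompts whose aggregated verdict is "clear_pushback".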
The benchmark uses 13 nonsense techniques including plausible nonexistent frameworks, misapplied mechanisms, nested nonsense, and specificity traps. The gap between Anthropic’s models and competitors suggests a meaningful difference in how these model families handle epistemic uncertainty, with Claude models far more likely to refuse engagement with broken premises rather than generating plausible-sounding answers to impossible questions.
