SOOHAK: AI Confidently Solves Unsolvable Math

SOOHAK is a new math benchmark of 439 original tasks, written by 64 mathematicians from Carnegie Mellon, EleutherAI, Seoul National University, and others.
It splits into a 340-problem ‘Challenge’ set (graduate/research level) and a 99-problem ‘Refusal’ set of intentionally flawed problems.
Gemini 3 Pro leads the Challenge set at 30%; GPT-5 at 26%; Claude Opus 4.5 at 10%; open-weight models stay below 15%.
On the Refusal set, no model clears 50%; Qwen3 collapses to under 3% on flagging broken problems.

What Happened

A consortium of 64 mathematicians has released SOOHAK, a new math benchmark designed to expose two weaknesses in current frontier models: poor performance on research-level math and a limited ability to recognise unsolvable tasks. SOOHAK was developed at Carnegie Mellon University, EleutherAI, Seoul National University, and other institutions, and consists of 439 original tasks written from scratch.

Why It Matters

With frontier models reaching IMO Gold level and saturating existing math benchmarks, AI research has needed new ceilings. SOOHAK provides two: research-level mathematics, and the meta-skill of recognising when a problem has no solution. The second dimension is operationally important: in safety-critical deployments, a model that confidently produces a numerical answer to an unsolvable problem is more dangerous than one that admits uncertainty.

SOOHAK’s results also crystallise the gap between proprietary and open-weight models on research-level tasks. The authors attribute this to training-data coverage — open-weight models tend to lack training material in niche research areas where proprietary labs invest in curated datasets.

Technical Details

SOOHAK splits into two sections. The Challenge set has 340 problems at graduate and research level. The Refusal set has 99 intentionally flawed problems that contain contradictions or do not allow a clear answer; a model only earns credit if it correctly identifies the flaw rather than producing a confident wrong answer. Every problem was written from scratch by a team of 38 professors, 25 PhD students and postdocs, and five IMO medalists. Contributors had to confirm they worked without AI assistance; anyone caught using LLM-generated content was excluded.
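The Refusal-set rule described above can be sketched in a few lines. This is a minimal illustration, not the SOOHAK authors' grader: the `Response` schema, the `refusal_credit` and `refusal_accuracy` names, and the choice to withhold credit when a model flags the flaw but still commits to an answer are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Response:
    """One model response to an intentionally flawed problem (hypothetical schema)."""
    flagged_flaw: bool                  # True if the model said the problem is ill-posed
    final_answer: Optional[str] = None  # any answer the model committed to anyway


def refusal_credit(response: Response) -> int:
    """Return 1 only when the flaw is flagged and no answer is asserted."""
    # Assumption: flagging the flaw but still giving an answer earns no credit.
    return int(response.flagged_flaw and response.final_answer is None)


def refusal_accuracy(responses: Sequence[Response]) -> float:
    """Fraction of flawed problems on which the model earned credit."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(refusal_credit(r) for r in responses) / len(responses)


# Four made-up responses: only the first one earns credit.
responses = [
    Response(flagged_flaw=True),
    Response(flagged_flaw=False, final_answer="42"),
    Response(flagged_flaw=True, final_answer="7"),
    Response(flagged_flaw=False, final_answer="0"),
]
print(refusal_accuracy(responses))  # 0.25
```

Under a rule like this, a model that always produces a number scores zero no matter how plausible its answers look, which is the behaviour the Refusal set is meant to penalise.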

On the Challenge set, Google’s Gemini 3 Pro scored highest at 30%. The GPT-5 family (5.1, 5.2) followed at 26%. Anthropic’s Claude Opus 4.5 trailed at 10%. Open-weight models including Kimi-2.5, Qwen3-235B, and GPT-OSS-120B stayed below 15%. Of the 340 Challenge tasks, 124 (roughly 36%) were not solved by any tested model.

On the Refusal set, performance collapses across the board. No model clears 50%. The open-weight GLM-5 performs best at just under 50%, beating both GPT-5 and Gemini 3 Pro. The Qwen3 family collapses to under 3%, almost always failing to correctly flag a broken problem. The benchmark also has a SOOHAK-Mini companion set ranging from school olympiad to early college level, where scores cluster much higher and the proprietary–open gap narrows.

Who’s Affected

AI lab safety teams gain a benchmark that specifically tests epistemic humility, meaning the ability to refuse rather than fabricate. This matters operationally for any deployment where wrong-with-confidence answers cause measurable harm: education, scientific research assistance, automated theorem proving, formal verification. Google DeepMind, OpenAI, Anthropic, and the open-weight ecosystem (Meta, Mistral, Alibaba’s Qwen, Zhipu’s GLM, Moonshot’s Kimi) all face the same Refusal-set challenge, since no tested model clears 50%.

What’s Next

The SOOHAK authors describe detecting flawed problems as “a new optimization target that current models have not been trained for.” Expect lab safety and capability teams to begin training and fine-tuning runs aimed specifically at the Refusal-set skill. Subsequent versions of SOOHAK are anticipated, particularly as IMO Gold-level scores spread beyond proprietary frontier models into the open-weight space.
