SOOHAK Benchmark: AI Models Confidently Solve Math Problems That Have No Solution

James Whitfield, May 17, 2026
Engine Score 7/10 — Important

  • SOOHAK is a new math benchmark of 439 original tasks, written by 64 mathematicians from Carnegie Mellon, EleutherAI, Seoul National University, and other institutions.
  • It splits into a 340-problem ‘Challenge’ set (graduate/research level) and a 99-problem ‘Refusal’ set of intentionally flawed problems.
  • Gemini 3 Pro leads the Challenge set at 30%, followed by the GPT-5 family at 26% and Claude Opus 4.5 at 10%; open-weight models stay below 15%.
  • On the Refusal set, no model clears 50%; the Qwen3 family correctly flags fewer than 3% of the broken problems.

What Happened

A consortium of 64 mathematicians has released SOOHAK, a new math benchmark designed to expose two weaknesses in current frontier models: poor performance on research-level math and an inability to recognise unsolvable tasks. SOOHAK was developed at Carnegie Mellon University, EleutherAI, Seoul National University, and other institutions, and consists of 439 original tasks written from scratch.

Why It Matters

With frontier models hitting IMO Gold level on existing math benchmarks, AI research has needed new ceilings. SOOHAK provides two: research-level mathematics, and the meta-skill of recognising when a problem has no solution. The second dimension is operationally important. In safety-critical deployments, a model that confidently produces a numerical answer to an unsolvable problem is more dangerous than one that admits uncertainty.

SOOHAK’s results also highlight the gap between proprietary and open-weight models on research-level tasks. The authors attribute this to training-data coverage: open-weight models tend to lack training material in niche research areas where proprietary labs invest in curated datasets.

Technical Details

SOOHAK splits into two sections. The Challenge set has 340 problems at graduate and research level. The Refusal set has 99 intentionally flawed problems that contain contradictions or do not allow a clear answer; a model earns credit only if it correctly identifies the flaw rather than producing a confident wrong answer. Every problem was written from scratch. Contributors included 38 professors, 25 PhD students and postdocs, and five IMO medalists. They had to confirm that they worked without AI assistance, and anyone caught using LLM-generated content was excluded.
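The Refusal-set rule described above can be sketched in a few lines. This is an illustrative sketch only, not the benchmark's actual grading code: the `GradedResponse` type, its `identified_flaw` field, and the `refusal_score` function are hypothetical names, and the sketch assumes a grader has already judged whether each response correctly identified the flaw.

```python
from dataclasses import dataclass


@dataclass
class GradedResponse:
    """One model response to a flawed problem (hypothetical structure)."""
    problem_id: str
    # Assumption: a grader has already decided whether the response
    # correctly pointed out the flaw instead of answering the problem.
    identified_flaw: bool


def refusal_score(responses: list[GradedResponse]) -> float:
    """Return the fraction of flawed problems whose flaw was identified.

    A confident answer to a broken problem earns no credit, so only
    responses with identified_flaw=True count.
    """
    if not responses:
        return 0.0
    credited = sum(1 for r in responses if r.identified_flaw)
    return credited / len(responses)


# Made-up example: two of four flawed problems are correctly flagged.
sample = [
    GradedResponse("r1", True),
    GradedResponse("r2", False),
    GradedResponse("r3", False),
    GradedResponse("r4", True),
]
print(refusal_score(sample))  # 0.5
```

Under this kind of all-or-nothing rule, a model that always produces a numerical answer scores zero on the Refusal set regardless of how plausible its answers look.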

On the Challenge set, Google’s Gemini 3 Pro scored highest at 30%. The GPT-5 family (5.1, 5.2) followed at 26%. Anthropic’s Claude Opus 4.5 scored 10%. Open-weight models including Kimi-2.5, Qwen3-235B, and GPT-OSS-120B stayed below 15%. Of the 340 Challenge tasks, 124 were not solved by any tested model.
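The figures reported above imply a simple upper bound, worked through here as a short calculation. The variable names are illustrative, and the "ceiling" is a derived quantity rather than a number the authors report: it is the best score a hypothetical system could reach by solving every task that at least one tested model solved.

```python
CHALLENGE_SIZE = 340   # Challenge-set problems (from the article)
REFUSAL_SIZE = 99      # Refusal-set problems (from the article)
UNSOLVED_BY_ALL = 124  # Challenge tasks no tested model solved

total_tasks = CHALLENGE_SIZE + REFUSAL_SIZE
unsolved_share = UNSOLVED_BY_ALL / CHALLENGE_SIZE

# Tasks solved by at least one tested model, and the resulting ceiling
# for any combination of those models on the Challenge set.
solved_by_any = CHALLENGE_SIZE - UNSOLVED_BY_ALL
oracle_ceiling = solved_by_any / CHALLENGE_SIZE

print(total_tasks)               # 439
print(f"{unsolved_share:.1%}")   # 36.5%
print(solved_by_any)             # 216
print(f"{oracle_ceiling:.1%}")   # 63.5%
```

In other words, roughly 36.5% of the Challenge set is currently out of reach for every tested model, and even a perfect combination of their correct answers would top out at about 63.5%.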

On the Refusal set, performance collapses across the board. No model clears 50%. The open-weight GLM-5 performs best at just under 50%, beating both GPT-5 and Gemini 3 Pro. The Qwen3 family collapses to under 3%, almost always failing to correctly flag a broken problem. The benchmark also has a SOOHAK-Mini companion set ranging from school olympiad to early college level, where scores cluster much higher and the proprietary–open gap narrows.

Who’s Affected

AI lab safety teams gain a benchmark that specifically tests epistemic humility: the ability to refuse rather than fabricate. This matters operationally for any deployment where confidently wrong answers cause measurable harm, such as education, scientific research assistance, automated theorem proving, and formal verification. Google DeepMind, OpenAI, Anthropic, and the open-weight ecosystem (Meta, Mistral, Alibaba’s Qwen, Zhipu’s GLM, Moonshot’s Kimi) all face the same Refusal-set challenge, since no tested model clears 50%.

What’s Next

The SOOHAK authors describe detecting flawed problems as “a new optimization target that current models have not been trained for.” Expect lab safety and capability teams to begin training and fine-tuning runs aimed at the Refusal-set skill specifically. Subsequent versions of SOOHAK are anticipated, particularly as IMO Gold-level scores propagate beyond proprietary frontier models into open-weight space.
