BENCHMARKS | GENSTRAT Benchmark Generates 2,000 Card Games to Test LLM Strategy | 7/10 | 2 min read | 1 month ago
BENCHMARKS | SOOHAK Benchmark: AI Models Confidently Solve Math Problems That Have No Solution | 7/10 | 3 min read | 1 month ago
RESEARCH | GPT-5.2 and Claude Opus 4.6 Both Go Silent on ‘Ontologically Null’ Prompts, Study Finds | 8/10 | 2 min read | 3 months ago