BENCHMARKS

Kimi K2.6 and Xiaomi MiMo Beat Claude, GPT-5.5, Gemini in Word Gem Puzzle Coding Tournament

James Whitfield · May 3, 2026 · 3 min read
Engine Score 8/10 — Important

Kimi K2.6 beats Claude, GPT-5.5, Gemini on coding challenge

  • In the independent AI Coding Contest’s Day 12 Word Gem Puzzle, Kimi K2.6 (Moonshot AI) finished first with 22 match points, Xiaomi MiMo V2-Pro second with 20, and OpenAI’s GPT-5.5 third with 16.
  • Claude Opus 4.7 placed fifth, Gemini Pro 3.1 sixth, Grok Expert 4.2 seventh, and DeepSeek V4 eighth — every Western-lab entry finished below the top two.
  • Kimi K2.6 won by aggressively sliding to unlock new positive-value words, posting the tournament’s highest cumulative score (77).
  • The challenge, run by Rohana Rezel at thinkpol.ca, is one of a small set of independent tournaments using objective real-time scoring rather than vendor-published self-benchmarks.

What Happened

Rohana Rezel published the Day 12 results of the independent AI Coding Contest, the Word Gem Puzzle, on April 30, 2026. Ten frontier and open-weight models entered. Kimi K2.6 (Moonshot AI) won outright with 22 match points and a 7-1-0 record; Xiaomi MiMo V2-Pro placed second with 20 points; OpenAI’s GPT-5.5 placed third with 16 points. Anthropic’s Claude Opus 4.7 finished fifth, Google’s Gemini Pro 3.1 sixth, xAI’s Grok Expert 4.2 seventh, and DeepSeek V4 eighth.

Why It Matters

Vendor-published benchmarks are the dominant input to model-quality discussions in 2026 — but every major lab has incentives to publish only flattering numbers. Independent tournaments where multiple models compete in real time on objectively scored tasks they have not seen are a structurally different signal. The Word Gem Puzzle is purpose-built for this: scoring is mechanical, the task is novel enough that no model has pre-existing training data, and identical conditions apply to every entrant. The result — two Chinese open-weight models beating every Western frontier lab on an objective puzzle — is uncomfortable for vendor-led narratives, but it is also narrower than a clean “China leads AI” claim. As Rezel notes, this isn’t China-beats-West; it’s two specific models that won.

Technical Details

The Word Gem Puzzle is a sliding-tile letter puzzle on grids of varying sizes (10×10 through 30×30) where bots claim valid English words formed in straight horizontal or vertical lines. Scoring penalizes short words (3-letter words cost 3 points; 5-letter words cost 1 point) and rewards long ones (length minus six for words 7+). Each pair of models played five rounds (one per grid size) with a 10-second wall-clock limit per round.
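A minimal sketch of that scoring rule, assuming the penalty and reward sit on the same length-minus-six line (the article only gives the 3-, 5-, and 7-plus-letter cases):

```python
def word_score(word: str) -> int:
    """Score a claimed word under the reported Word Gem rules.

    Reported data points: 3-letter words cost 3 points, 5-letter words cost
    1 point, and words of 7+ letters earn (length - 6). Treating every length
    as (length - 6) is an assumption that fits those points; the article does
    not spell out the 4- and 6-letter cases.
    """
    return len(word) - 6


assert word_score("cat") == -3      # 3 letters: -3 points (reported)
assert word_score("slide") == -1    # 5 letters: -1 point (reported)
assert word_score("puzzles") == 1   # 7 letters: +1 point (length - 6)
```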

Final standings:
  1. Kimi K2.6 (22 pts, 7-1-0)
  2. MiMo V2-Pro (20 pts, 6-2-0)
  3. GPT-5.5 (16 pts, 5-1-2)
  4. GLM 5.1 (15 pts, 5-0-3)
  5. Claude Opus 4.7 (12 pts, 4-0-4)
  6. Gemini Pro 3.1 (9 pts, 3-0-5)
  7. Grok Expert 4.2 (9 pts, 3-0-5)
  8. DeepSeek V4 (3 pts, 1-0-7)
  9. Muse Spark (0 pts)

Nvidia’s Nemotron Super 3 produced code with a syntax error and never connected to the game server.

Strategic patterns: Kimi won by sliding aggressively, using a greedy approach that scored each possible move by the new positive-value words it unlocked and then executed the best one; its cumulative score of 77 was the highest in the tournament. MiMo V2-Pro’s threshold for sliding never triggered, so it scanned the initial grid for 7+ letter words and claimed them all in a single TCP packet, a fast but brittle approach. Claude Opus 4.7 also did not slide, holding up on 25×25 boards where seed words were largely intact but collapsing on 30×30 boards where sliding was necessary. GPT-5.5 played conservatively, capping its slides to avoid thrashing. GLM 5.1 was the most aggressive slider, with over 800,000 total slides. DeepSeek V4 sent malformed data every round.
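For a sense of what that greedy pattern looks like in practice, here is a rough Python sketch: scan the grid for straight-line words, then value each candidate slide by the positive-scoring words it would newly expose and play the best one. The helpers `legal_slides` and `apply_slide` are hypothetical stand-ins for slide mechanics the write-up doesn’t specify, and the word-delimiting logic is a guess; nothing below is taken from the actual submissions.

```python
Grid = list[list[str]]  # square grid of single letters


def find_words(grid: Grid, dictionary: set[str], min_len: int = 3) -> set[str]:
    """Collect dictionary words readable left-to-right or top-to-bottom.
    How the contest delimits words isn't described, so this simply checks
    every straight horizontal or vertical run of letters."""
    n = len(grid)
    lines = ["".join(row) for row in grid]                              # rows
    lines += ["".join(grid[r][c] for r in range(n)) for c in range(n)]  # columns
    found: set[str] = set()
    for line in lines:
        for i in range(len(line)):
            for j in range(i + min_len, len(line) + 1):
                if line[i:j] in dictionary:
                    found.add(line[i:j])
    return found


def best_slide(grid: Grid, legal_slides, apply_slide, dictionary: set[str]):
    """Greedy move choice in the spirit of the winning bot: try each legal
    slide, value it by the new positive-value (7+ letter) words it exposes,
    and return the best move, or None if no slide helps.
    `legal_slides(grid)` and `apply_slide(grid, move)` are caller-supplied,
    since the article doesn't describe the slide mechanics."""
    before = find_words(grid, dictionary)
    best_move, best_gain = None, 0
    for move in legal_slides(grid):
        new_words = find_words(apply_slide(grid, move), dictionary) - before
        gain = sum(len(w) - 6 for w in new_words if len(w) >= 7)  # positive-value words only
        if gain > best_gain:
            best_move, best_gain = move, gain
    return best_move
```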

Who’s Affected

Moonshot AI’s Kimi K2.6 — open-weights, available from Moonshot — gains independent validation on a task none of these models had seen. Xiaomi’s MiMo program gains independent validation alongside its self-published MiMo-V2.5-Pro release. Western frontier labs (Anthropic, OpenAI, Google, xAI) face an objective tournament result placing them below the Chinese open-weight cohort on this specific task. The broader takeaway for buyers — particularly those evaluating coding agents — is that open-weight Chinese models have crossed the threshold where they can win specific objective tasks against the best closed Western models, even if they remain behind on average across all categories.

What’s Next

Subsequent tournament rounds (the AI Coding Contest is ongoing) will test whether Kimi K2.6 and MiMo V2-Pro maintain their lead across other task types or whether the Word Gem Puzzle was an unusually favorable category. Watch for whether the independent tournament format gets adopted by larger evaluation organizations — Stanford HAI, MLCommons, or commercial benchmark providers. For Western labs, the result is a competitive signal that may pressure release schedules and pricing in the second half of 2026.
