- Xiaomi released MiMo-V2.5-Pro, a 1.02-trillion-parameter mixture-of-experts model with 42 billion active parameters, targeting hours-long autonomous coding workloads.
- The model finished a Peking University CS course compiler project in 4.3 hours across 672 tool calls, scoring 233/233 on the hidden test suite.
- Reported benchmarks: 78.9 on SWE-bench Verified, 57.2 on SWE-bench Pro, 68.4 on Terminal-Bench 2.0; 73.7 on Xiaomi’s MiMo Coding Bench (vs Claude Opus 4.6 at 77.1).
- Xiaomi claims 40-60% fewer tokens than Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 to reach comparable agent benchmarks; main version supports 1M-token context.
What Happened
Xiaomi released MiMo-V2.5-Pro, an open-weight mixture-of-experts model designed for sustained autonomous coding tasks measured in hours rather than minutes. The headline demo: the model completed a full compiler project from a Peking University computer-science course in 4.3 hours and 672 tool calls, scoring 233 of 233 on the hidden test suite. Xiaomi released three additional models alongside the flagship.
Why It Matters
Hours-long autonomous coding is the current frontier where agentic AI either closes the gap with senior engineers or reveals model-level failure modes. Anthropic’s Claude Opus and OpenAI’s reasoning models have been the public reference for sustained coding agents through 2025-2026. Xiaomi’s release lands with two distinct positioning angles: open weights (most direct competitors at this scale are closed) and aggressive token efficiency (40-60% fewer tokens to reach comparable scores, per Xiaomi). For deployments where inference cost dominates total cost of ownership, the efficiency gap is the more consequential claim.
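Why token efficiency dominates agent economics can be seen with simple arithmetic. A minimal sketch, using made-up prices and run sizes (only the 40-60% savings figure comes from Xiaomi's claim; everything else below is a hypothetical assumption):

```python
# Back-of-envelope cost of a long agent run. The price and token count are
# hypothetical illustrations, not published figures; only the 40-60% token
# savings is Xiaomi's (self-reported) claim.

def agent_run_cost(tokens: int, price_per_mtok: float) -> float:
    """Cost in dollars of one agent run: total tokens * $/million tokens."""
    return tokens / 1_000_000 * price_per_mtok

baseline_tokens = 10_000_000   # assume a 10M-token hours-long agent run
price = 5.0                    # assume $5 per million tokens (hypothetical)

baseline = agent_run_cost(baseline_tokens, price)
for savings in (0.40, 0.60):
    efficient = agent_run_cost(int(baseline_tokens * (1 - savings)), price)
    print(f"{savings:.0%} fewer tokens: ${efficient:.2f} vs ${baseline:.2f}")
```

At a fixed per-token price, token savings translate one-for-one into cost savings, which is why the efficiency claim matters more than a few benchmark points for high-volume deployments.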
Technical Details
MiMo-V2.5-Pro packs 1.02 trillion total parameters with 42 billion activated per token. The main version supports up to 1 million tokens of context; a base version without retraining caps at 256,000 tokens. Pre-training ran on 27 trillion tokens with the context window expanded in stages. Post-training uses a teacher-student setup: specialized models optimized separately for math, security, and tool use serve as teachers to a single student model that combines their skills. A mix of local and global attention reduces memory needs for long texts by nearly 7x; parallel token prediction triples output speed.
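The total-vs-active parameter split comes from mixture-of-experts routing: each token is sent to only a few experts, so most weights sit idle on any given forward pass. Xiaomi has not published MiMo-V2.5-Pro's expert count or routing scheme, so the sketch below is a generic top-k MoE layer with illustrative sizes, not the model's actual architecture:

```python
import numpy as np

# Generic top-k mixture-of-experts routing sketch. Expert count, top-k, and
# dimensions below are illustrative assumptions; MiMo-V2.5-Pro's real routing
# is unpublished. The point: only the top-k experts run per token, which is
# how ~42B of 1.02T total parameters can be "active" at a time.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 2

router = rng.normal(size=(d_model, n_experts))            # router projection
experts = rng.normal(size=(n_experts, d_model, d_model))  # one (simplified) FFN per expert

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route token vector x to its top-k experts and mix their outputs."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                      # k highest-scoring experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the chosen k
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
out = moe_layer(token)
print(out.shape, f"experts used per token: {top_k}/{n_experts}")
```

Compute and activation memory per token scale with the k selected experts rather than the full expert set, which is what lets a trillion-parameter model serve at the cost profile of a much smaller dense one.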
Three demos illustrate the hours-long capability. First: the compiler project completed in 4.3 hours / 672 tool calls (137 of 233 tests passed on first compile; the model self-diagnosed and fixed regressions during refactoring). Second: a desktop video editor of roughly 8,000 lines of code, built from a few prompts over 11.5 hours of autonomous runtime and 1,870 tool calls. Third: a voltage regulator design driven through a circuit simulator hooked up via Claude Code, hitting all six technical specs in under an hour, with four of the six exceeding the first-draft design by an order of magnitude.
Coding benchmarks: 78.9 on SWE-bench Verified, 57.2 on SWE-bench Pro, 68.4 on Terminal-Bench 2.0, 73.7 on Xiaomi’s in-house MiMo Coding Bench (vs 77.1 for Claude Opus 4.6 and 67.8 for Gemini 3.1 Pro). Agent tasks: 1,581 Elo points on GDPVal-AA, 72.9 on tau3-bench. On OpenAI’s GraphWalks long-context benchmark at 1M tokens, MiMo-V2.5-Pro scores 0.37 on breadth-first searches and 0.62 on parent-node queries, where the previous MiMo-V2-Pro dropped to zero.
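For context on what the GraphWalks scores measure: the benchmark embeds a large edge list in the context window and asks the model to answer traversal queries over it, such as which nodes lie within a few hops of a start node. A reference breadth-first search over a tiny made-up graph (the benchmark's actual edge lists are far larger) shows the task a 0.37 score corresponds to:

```python
from collections import deque

# Reference BFS of the kind GraphWalks asks a model to perform over an edge
# list buried in a long context. This five-node graph is an illustrative
# stand-in, not benchmark data.

edges = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": []}

def bfs_nodes_within(start: str, depth: int) -> set[str]:
    """All nodes reachable from `start` in at most `depth` hops."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # don't expand past the hop limit
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen

print(sorted(bfs_nodes_within("a", 2)))  # → ['a', 'b', 'c', 'd', 'e']
```

The traversal itself is trivial for code; the benchmark stresses whether the model can keep the full edge list retrievable across a 1M-token context, which is where the previous MiMo-V2-Pro scored zero.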
Xiaomi released three companion models: MiMo-V2.5 (310B total / 15B active multimodal supporting text, image, video, and audio with 1M-token context, 87.7 on Video-MME — open weights on Hugging Face); MiMo-V2.5-TTS (a three-variant text-to-speech family); and further launch variants that Xiaomi did not specify.
Who’s Affected
Anthropic, OpenAI, and Google face direct pressure on the hours-long agentic-coding category from a Chinese open-weight model with a self-reported efficiency advantage. Open-source AI deployment teams gain the largest open-weight coding-focused release of 2026 to date (its 1.02T total parameters sit below DeepSeek-V4-Pro’s 1.6T, but with fewer active parameters per token). Inference providers — Together, Fireworks, Hugging Face — gain a new flagship to host. Xiaomi itself makes a public move from consumer hardware into frontier-tier AI.
What’s Next
Independent benchmark validation of the 4.3-hour compiler claim, the 40-60% token-efficiency advantage, and the long-context performance figures will be the cleanest external test. Quantized versions and llama.cpp / Ollama support are likely within days from the open-source community. Watch for U.S. policy reactions given the increasingly visible Chinese open-weight cohort (DeepSeek V4, Kimi K2.6, Qwen, GLM, and now MiMo-V2.5-Pro), and for whether Xiaomi extends commercial offerings beyond the open-weight release.