- MiniMax released M2.7 on March 18, 2026, claiming the model handled 30-50% of its own reinforcement learning optimization during training
- M2.7 scored 56.22% on SWE-Pro, matching GPT-5.3-Codex, and achieved the highest Elo score (1495) among comparable models on GDPval-AA
- The model ran over 100 iterative self-improvement loops, autonomously analyzing failures and modifying its own training scaffold
- M2.7 remains closed-source with undisclosed architecture despite MiniMax’s history of open-weight releases
What Happened
Chinese AI company MiniMax released M2.7, its latest large language model, on March 18, 2026, with a claim that sets it apart from every other frontier model: M2.7 participated in 30 to 50 percent of its own reinforcement learning optimization during training. MiniMax describes this as “self-evolving” capability, where the model was used to build, monitor, and adjust the reinforcement learning harnesses that shaped its own behavior. The company’s internal research agent harness autonomously collected feedback, built evaluation sets, and iterated on its own architecture and memory mechanisms. “M2.7 is our first model deeply participating in its own evolution,” the company stated in its official announcement.
Why It Matters
The standard practice in AI training relies entirely on human-designed pipelines. Engineers specify reward functions, curate training data, and manually tune hyperparameters. MiniMax’s approach inverts part of this process by letting M2.7 execute an iterative loop of analyzing failure trajectories, planning changes, modifying scaffold code, running evaluations, and deciding whether to keep or revert changes. The model ran this cycle for over 100 rounds, achieving a 30 percent performance improvement on internal evaluation sets.
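MiniMax has not released the harness itself, but the cycle it describes maps onto a simple propose-evaluate-keep-or-revert loop. The sketch below is a hypothetical reconstruction under that reading; every function and variable name in it is illustrative, not part of any published MiniMax interface.

```python
import random

# Hypothetical reconstruction of the self-improvement cycle described above:
# analyze failures -> plan and apply a change -> evaluate -> keep or revert.
# All three functions are stubs standing in for model-driven steps.

def evaluate(scaffold) -> float:
    """Score the scaffold on a held-out evaluation set (stubbed)."""
    return scaffold["quality"] + random.gauss(0, 0.005)

def analyze_failures(scaffold) -> list[str]:
    """Have the model diagnose its worst failure trajectories (stubbed)."""
    return ["timeout on long-horizon tool calls"]

def propose_patch(scaffold, diagnoses) -> dict:
    """Have the model plan and apply a change to its own scaffold (stubbed)."""
    candidate = dict(scaffold)
    candidate["quality"] += random.uniform(-0.01, 0.02)  # not every patch helps
    return candidate

scaffold = {"quality": 0.50}  # stand-in for harness code plus configuration
best_score = evaluate(scaffold)

for _ in range(100):  # MiniMax reports running more than 100 rounds
    diagnoses = analyze_failures(scaffold)
    candidate = propose_patch(scaffold, diagnoses)
    score = evaluate(candidate)
    if score > best_score:
        scaffold, best_score = candidate, score  # keep the change
    # otherwise revert by discarding the candidate

print(f"final score: {best_score:.3f}")
```

The keep-or-revert gate is what keeps such a loop conservative: a change survives only if it measurably improves the evaluation score, which matches MiniMax's description of the model deciding whether to keep or revert its own modifications.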
This raises fundamental questions about reproducibility and oversight. If a model participates in designing its own reward signals and training loops, the standard assumption that humans fully specify the optimization objective no longer holds. MiniMax has not published technical details on how it bounded the model’s influence over its own training or what safeguards prevented reward hacking.
Technical Details
On the SWE-Pro benchmark, which tests software engineering capabilities across multiple programming languages, M2.7 scored 56.22 percent, matching GPT-5.3-Codex. On GDPval-AA, a professional task evaluation, it achieved an Elo score of 1495, the highest among comparable models. Additional benchmark results include 55.6 percent on VIBE-Pro for repository-level code generation, 57.0 percent on Terminal Bench 2, and 76.5 percent on SWE Multilingual.
The model maintained a 97 percent skill adherence rate across more than 40 complex skills, each exceeding 2,000 tokens. MiniMax also reported that M2.7 cut recovery time for production debugging incidents to under three minutes. On hallucination metrics, M2.7 scored +1 on the AA-Omniscience Index, a significant improvement over M2.5’s score of -40 on the same measure.
Who’s Affected
M2.7 is a proprietary model available through API access only, with closed weights and undisclosed architecture. This marks a departure from MiniMax’s history of releasing some models under open licenses. The company competes in a crowded field of Chinese AI labs alongside DeepSeek, Qwen from Alibaba, and Zhipu AI, all of which have released frontier models in 2026 with varying degrees of openness.
For enterprise users evaluating the model, the self-evolving training approach presents a novel due diligence challenge. Organizations accustomed to auditing training pipelines and reward specifications will find that M2.7’s training process is partially opaque by design, not just by proprietary choice. Regulated industries such as finance and healthcare, where model explainability is a compliance requirement, may face particular difficulty justifying the adoption of a model whose training loop is only partially human-directed.
Developers building agent-based systems may find M2.7’s capabilities compelling regardless. The model supports native multi-agent collaboration, complex dynamic tool search across more than 40 skills, and memory updating with self-feedback mechanisms. MiniMax reports that M2.7 can handle Word, Excel, and PowerPoint document processing within agent workflows, targeting enterprise productivity use cases where competing models have historically required separate integrations.
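Assuming M2.7 is served through an OpenAI-compatible interface, as is common for API-only models, a tool-calling request might look like the sketch below. The endpoint URL, model identifier, and read_xlsx_sheet skill are all assumptions for illustration, not MiniMax's published values.

```python
from openai import OpenAI  # assumes an OpenAI-compatible endpoint

# Hypothetical endpoint and credentials; consult MiniMax's docs for real values.
client = OpenAI(
    base_url="https://api.minimax.example/v1",
    api_key="YOUR_API_KEY",
)

# One illustrative tool definition; M2.7 is said to search across 40+ skills.
tools = [{
    "type": "function",
    "function": {
        "name": "read_xlsx_sheet",  # hypothetical document-processing skill
        "description": "Return the contents of one sheet in an Excel workbook.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "sheet": {"type": "string"},
            },
            "required": ["path", "sheet"],
        },
    },
}]

response = client.chat.completions.create(
    model="minimax-m2.7",  # hypothetical model identifier
    messages=[{"role": "user",
               "content": "Summarize Q3 revenue from report.xlsx"}],
    tools=tools,
)
print(response.choices[0].message)
```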
What’s Next
The competitive significance of self-evolving training depends on whether it produces compounding improvements across model generations. MiniMax has positioned M2.7 as a stepping stone, but the approach will require several more release cycles to validate. If subsequent models demonstrate accelerating gains from expanded self-optimization, the technique could become standard practice among frontier labs. If performance plateaus, it may remain a differentiating claim rather than a meaningful capability advantage.
The absence of published safeguard details for preventing reward hacking during self-optimization remains an open concern for AI safety researchers evaluating the technique. MiniMax has not indicated whether it plans to release a technical report detailing the boundaries placed on the model’s self-modification capabilities.