Three open-weight models — Moonshot AI’s Kimi K2.6, DeepSeek V4, and Meta’s Llama 4 — now define the practical frontier for teams that need frontier-class performance without frontier-class API bills. As of April 22, 2026, they sit within 2.7 percentage points of each other on MMLU, a gap Stanford’s Institute for Human-Centered AI (HAI) describes as the narrowest US-China performance gap on record. Which one you deploy depends on whether you’re optimizing for agent capacity, raw benchmark performance, or infrastructure cost.
The Three Models
Kimi K2.6 is Moonshot AI’s fourth-generation open-weight release and its most architecturally ambitious. The headline feature is an orchestration layer that coordinates up to 300 sub-agents, built directly into the model’s inference stack — trained as a native capability, not bolted on via prompting. Total parameter count sits at approximately 1.04 trillion in a mixture-of-experts configuration, with roughly 72B active per forward pass. The context window is 200,000 tokens, with a reported 97.3% retrieval accuracy at 128K per Moonshot’s internal evaluations.
DeepSeek V4 is the most geopolitically significant model on this list. Trained on Huawei Ascend 910B clusters according to multiple industry sources — an architecture choice forced by US export restrictions on Nvidia H100/H200 hardware — it nonetheless matches or exceeds H100-trained competitors on six of eight standard benchmarks. DeepSeek’s parent company High-Flyer closed a $300 million funding round at a $10 billion valuation in February 2026, and its engineering discipline is extraordinary: V4’s reported training budget was under $6 million, compared to an estimated $30–80 million for comparable US-frontier runs.
Meta’s Llama 4 arrived in two open variants: Llama 4 Scout (dense, 109B parameters) and Llama 4 Maverick (MoE, 400B total / 52B active). Scout targets edge and on-device deployment; Maverick competes directly with V4 and K2.6 for enterprise workloads. Both ship under the Llama 4 Community License, which permits commercial use for organizations below 700 million monthly active users.
Kimi K2.6 vs DeepSeek V4 vs Llama 4: Benchmark Table
Figures below aggregate publicly reported scores as of April 22, 2026. OpenRouter pricing reflects current spot rates; training cost figures are third-party approximations where official numbers are unavailable.
| Metric | Kimi K2.6 | DeepSeek V4 | Llama 4 Maverick |
|---|---|---|---|
| Parameter count | 1.04T total / 72B active | 671B total / 37B active | 400B total / 52B active |
| Architecture | MoE | MoE | MoE |
| Context window | 200K tokens | 128K tokens | 512K tokens |
| Training compute | ~4.2×10²⁴ FLOPs (est.) | ~2.8×10²⁴ FLOPs (est.) | ~3.5×10²⁴ FLOPs (est.) |
| Training cost | ~$22M (est.) | ~$5.8M (reported) | ~$45M (est.) |
| License | Kimi Open License v1 | MIT | Llama 4 Community |
| SWE-bench Verified | 68.4% | 72.1% | 64.9% |
| MMLU | 88.7% | 89.4% | 87.2% |
| Cost/M tokens (OpenRouter) | $0.45 in / $1.60 out | $0.14 in / $0.28 out | $0.20 in / $0.60 out |
| Self-host feasibility | 8×H100 minimum | 8×H100 minimum | 4×H100 minimum |
DeepSeek V4 leads on both SWE-bench and MMLU. Its 72.1% SWE-bench Verified score puts it within 4 points of GPT-4o’s April 2026 standing, at roughly one-fifth the API cost. Kimi K2.6 closes the gap on context-heavy retrieval tasks where its 200K window and sub-agent coordination deliver measurable gains. Llama 4 Maverick trails on code but leads on context: its 512K window is the widest of any open-weight model currently in production deployment.
Self-Hosting Reality
All three require a minimum of 4–8 H100 80GB GPUs for viable inference. At current cloud spot rates, that means $12,000–$18,000 per month before engineering overhead — a number most benchmark roundups omit.
Llama 4 Maverick is the most accessible: its 400B total parameters quantized to INT4 come to roughly 200 GB of weights, which fits a 4×H100 node via llama.cpp or vLLM. DeepSeek V4’s 37B active count is smaller, but MoE routing means naive setups end up loading the full 671B weights into VRAM; you need MoE-aware frameworks like SGLang or TensorRT-LLM. Kimi K2.6’s sub-agent orchestration layer introduces the highest operational complexity: the orchestration server is open-sourced by Moonshot but not yet stable for third-party production use.
Infrastructure providers are moving fast. Nebius’s $10 billion Finland data center, announced in early 2026, will offer MoE-optimized H100 clusters with SGLang pre-installed — a direct pitch to V4 and K2.6 operators. For teams that want open-weight performance without the infrastructure burden, OpenRouter remains the pragmatic bridge.
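For the OpenRouter route, the API is an OpenAI-compatible chat-completions endpoint. A minimal standard-library sketch; the model slug `deepseek/deepseek-v4` is an assumption, so verify it against OpenRouter’s model catalog before use:

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenRouter chat-completions request."""
    payload = {
        # Hypothetical slug; check OpenRouter's model list for the exact id.
        "model": "deepseek/deepseek-v4",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending is one call once you have a key:
# with urllib.request.urlopen(build_request(key, "Summarize this.")) as r:
#     reply = json.load(r)["choices"][0]["message"]["content"]
```

Because the endpoint is OpenAI-compatible, swapping between V4, Maverick, and K2.6 is a one-line model-string change, which makes OpenRouter a cheap way to benchmark all three on your own workload before committing to hardware.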
License Terms
DeepSeek V4’s MIT license is the cleanest of the three. Fine-tune it, ship derivatives, embed it commercially, redistribute weights — no user cap, no field-of-use restriction, no attribution clause beyond standard MIT boilerplate.
Llama 4’s Community License adds a 700M MAU ceiling for free commercial use and explicitly prohibits using Llama outputs to train competing foundation models. Apple, Microsoft, and Amazon have reportedly negotiated enterprise agreements above that threshold, but the clause creates real legal exposure for AI labs building on Llama outputs. Kimi Open License v1 is the most restrictive: it bars applications competing with Moonshot’s own products and requires a commercial license above 100 million monthly requests.
For most enterprise deployments, all three are operationally viable. Larger organizations and AI labs should have counsel review the Kimi and Llama terms before committing infrastructure. As debates about AI openness and control intensify, license terms are increasingly becoming competitive moats — not legal boilerplate.
Best For
- Agentic coding and multi-step workflows: Kimi K2.6. The 300 sub-agent architecture handles repository-level tasks that single-model inference struggles with. Teams building autonomous exploration and discovery systems will find K2.6’s native coordination the most production-ready of the three.
- Cost-sensitive high-volume API use: DeepSeek V4. At $0.14 per million input tokens, it is 30% cheaper than Llama 4 Maverick and 69% cheaper than Kimi K2.6. Document processing, RAG pipelines, and large-scale classification favor V4 unambiguously.
- Long-context document analysis: Llama 4 Maverick. The 512K context window is operational, not marketing: Maverick scores 94.1% on the HELMET long-context benchmark at 256K tokens. Legal document review and book-length summarization favor Maverick.
- Budget self-hosting: Llama 4 Scout (dense, 109B). Feasible on 4×A100 40GB with INT8 quantization — hardware most mid-sized organizations already own.
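The Scout claim is easy to sanity-check with weights-only arithmetic (KV cache and runtime overhead come out of the remaining headroom):

```python
# Dense 109B model at INT8: one byte per parameter.
params = 109e9
weights_gb = params * 1 / 1e9          # 109 GB of weights
cluster_gb = 4 * 40                    # 4 x A100 40GB = 160 GB
headroom_gb = cluster_gb - weights_gb  # ~51 GB left for KV cache etc.
print(f"{weights_gb:.0f} GB weights, {headroom_gb:.0f} GB headroom")
```

Because Scout is dense rather than MoE, the active and total parameter counts coincide, so there is no routing overhead to budget for.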
Verdict
DeepSeek V4 wins on benchmark-per-dollar. Its MIT license, 72.1% SWE-bench, and $0.28 output token cost make it the default recommendation for any team without a specific architectural requirement driving them elsewhere. Training a frontier-class model for $5.8 million on Huawei Ascend hardware — under active export control restrictions — should recalibrate assumptions about the relationship between compute access and model quality.
Kimi K2.6 earns a specialist win for agentic deployments. If your workflow requires coordinating parallel sub-tasks at scale — the kind of multi-agent orchestration that has redefined software development since agent architectures went mainstream in 2025 — K2.6’s native orchestration is meaningfully ahead of prompt-engineered equivalents on V4 or Maverick.
Llama 4 Maverick is the enterprise default where legal predictability matters more than benchmark rank. Meta’s support ecosystem is the largest of the three, the Llama license is understood by procurement teams, and Maverick’s 512K context lead is real. Maverick already leads enterprise adoption among Fortune 500 deployments; brand recognition still moves procurement in 2026.
FAQ
Which model performs best on coding tasks?
DeepSeek V4 leads with a 72.1% SWE-bench Verified score, followed by Kimi K2.6 at 68.4% and Llama 4 Maverick at 64.9%. For agentic coding — multi-file edits, repository-scale refactors — Kimi K2.6’s sub-agent architecture narrows the practical gap considerably.
What does the Stanford 2.7% US-China MMLU gap mean?
Stanford HAI found the average MMLU gap between top US-origin and China-origin frontier models compressed to 2.7 percentage points as of Q1 2026, down from 11.4 points in Q1 2024. DeepSeek V4 at 89.4% and Kimi K2.6 at 88.7% now sit above several US-origin models. On general reasoning benchmarks, the gap has effectively closed.
Can I self-host DeepSeek V4 without H100s?
With INT4 quantization via llama.cpp, DeepSeek V4 is reportedly runnable on 8×A100 40GB GPUs — hardware 40–60% cheaper to rent than H100s. Expect roughly 30% throughput degradation versus FP16 on H100s. The MIT license permits quantization and redistribution of modified weights, subject only to standard MIT attribution boilerplate.
Is Llama 4 truly open source?
No. The Llama 4 Community License is not OSI-approved open source. It prohibits using Llama outputs to train competing foundation models and caps free commercial use at 700M MAU. It is permissive for most deployments, but the “open source” label is a marketing claim that does not survive legal scrutiny.
Which model is cheapest for high-volume production?
DeepSeek V4 via OpenRouter: $0.14 per million input tokens, $0.28 per million output tokens. At ten million output tokens per day — a reasonable RAG pipeline volume — V4 costs approximately $84/month versus Llama 4 Maverick’s $180/month and Kimi K2.6’s $480/month. Against a $12,000/month self-hosted cluster, V4’s API pricing only breaks even at roughly 1.4 billion output tokens per day.
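The arithmetic behind those figures, as a sketch using the per-token prices quoted in this article:

```python
# $/M output tokens from the OpenRouter pricing table above.
PRICES = {"DeepSeek V4": 0.28, "Llama 4 Maverick": 0.60, "Kimi K2.6": 1.60}
CLUSTER_MONTHLY_USD = 12_000  # low end of the 8xH100 spot-rate estimate

def monthly_api_cost(m_tokens_per_day: float, price_per_m: float) -> float:
    """API spend per 30-day month at a given daily output volume (M tokens)."""
    return m_tokens_per_day * 30 * price_per_m

def breakeven_m_tokens_per_day(cluster_monthly: float, price_per_m: float) -> float:
    """Daily output volume (M tokens) where API spend equals the cluster bill."""
    return cluster_monthly / (30 * price_per_m)

for name, price in PRICES.items():
    print(f"{name}: ${monthly_api_cost(10, price):.0f}/month at 10M tokens/day")

print(f"V4 break-even vs self-hosting: "
      f"{breakeven_m_tokens_per_day(CLUSTER_MONTHLY_USD, 0.28):.0f}M tokens/day")
```

Input-token costs and engineering time are excluded, so real break-even volumes sit somewhat higher than the weights-only figure.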