ANALYSIS

DeepSeek Releases V4 Preview with 1M-Token Contexts via Compressed Attention

By Anika Patel · Apr 25, 2026 · 3 min read
  • DeepSeek-AI released a preview of two DeepSeek-V4 models on April 24, 2026, targeting one-million-token context windows.
  • DeepSeek-V4-Pro has 1.6 trillion total parameters with 49 billion activated per token; DeepSeek-V4-Flash has 284 billion total with 13 billion activated.
  • Both models introduce new Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) mechanisms to lower inference costs at long contexts.
  • The release is a preview; DeepSeek-AI has not announced a general availability date or API pricing.

What Happened

DeepSeek-AI published a preview release of its DeepSeek-V4 model series on April 24, 2026, introducing two Mixture-of-Experts (MoE) language models designed for long-context inference. The announcement was first covered by Marktechpost, which described the series as addressing “one core challenge: making one-million-token context windows practical and affordable at inference time.”

The series includes two variants: DeepSeek-V4-Pro, with 1.6 trillion total parameters and 49 billion activated per token, and DeepSeek-V4-Flash, with 284 billion total parameters and 13 billion activated per token. Both are preview releases; full production deployment timelines and benchmark disclosures have not yet been published.

Why It Matters

DeepSeek-V3, released in December 2024, supported a 128,000-token context with 671 billion total parameters and 37 billion activated per token. DeepSeek-V4 increases context length roughly eight-fold to one million tokens while keeping per-token activation counts in a similar range, even as total capacity in the Pro variant grows to more than double that of V3.
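
As a quick back-of-envelope check, here is a minimal Python sketch of those generational deltas, using only the figures cited in this article:

    # Scaling from V3 to V4-Pro, per the numbers reported above.
    v3 = {"total": 671e9, "active": 37e9, "context": 128_000}
    v4_pro = {"total": 1.6e12, "active": 49e9, "context": 1_000_000}

    print(f"context growth:      {v4_pro['context'] / v3['context']:.1f}x")  # ~7.8x
    print(f"total-param growth:  {v4_pro['total'] / v3['total']:.1f}x")      # ~2.4x
    print(f"active-param growth: {v4_pro['active'] / v3['active']:.1f}x")    # ~1.3x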

Google’s Gemini 1.5 Pro established one million tokens as a long-context benchmark target in early 2024, and subsequent models from Anthropic, OpenAI, and others have competed at or near that scale. DeepSeek’s contribution is bringing a sparse, MoE-based inference architecture to that context length — a design that reduces per-token compute costs relative to dense models of equivalent output quality.

Technical Details

The central architectural contribution in DeepSeek-V4 is the introduction of two new attention mechanisms: Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA reduces the memory footprint of long-context inference by applying learned sparsity patterns to the key-value (KV) cache, allowing the model to selectively attend over a one-million-token window without proportional growth in memory consumption. HCA applies more aggressive compression to the KV cache, targeting the lower-latency Flash variant.
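
DeepSeek has not published implementation details for either mechanism. The sketch below is a generic illustration of the idea the article describes, not DeepSeek’s code: score blocks of the cached keys cheaply, then attend only over the top-scoring blocks, so compute and memory traffic scale with the number of kept blocks rather than with sequence length. All names, shapes, and the block-scoring heuristic are assumptions.

    import torch
    import torch.nn.functional as F

    def block_sparse_attention(q, k, v, block_size=128, top_k_blocks=64):
        """Generic block-sparse attention sketch (not DeepSeek's CSA).

        q: (d,) query for the current token; k, v: (seq_len, d) KV cache.
        """
        seq_len, d = k.shape
        n_blocks = seq_len // block_size
        usable = n_blocks * block_size

        # Cheap per-block summary: the mean key of each block.
        k_blocks = k[:usable].view(n_blocks, block_size, d)
        block_scores = k_blocks.mean(dim=1) @ q                # (n_blocks,)

        # Keep only the highest-scoring blocks; cost now scales with
        # top_k_blocks * block_size instead of seq_len.
        keep = block_scores.topk(min(top_k_blocks, n_blocks)).indices
        k_sel = k_blocks[keep].reshape(-1, d)
        v_sel = v[:usable].view(n_blocks, block_size, d)[keep].reshape(-1, d)

        attn = F.softmax(k_sel @ q / d ** 0.5, dim=0)          # (kept_tokens,)
        return attn @ v_sel                                    # (d,)

A denser variant of the same selection-plus-compression pattern, applied more aggressively, would match the article’s description of HCA trading recall for latency in the Flash variant.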

DeepSeek-V4-Pro activates 49 billion of its 1.6 trillion parameters per token — an activation ratio of approximately 3 percent — lower than the roughly 5.5 percent ratio in DeepSeek-V3. DeepSeek-V4-Flash activates 13 billion of its 284 billion parameters per token, a ratio of approximately 4.6 percent. The MoE design keeps per-token compute bounded even as total parameter count scales.
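
The arithmetic behind those ratios is easy to reproduce from the cited parameter counts:

    # Activation ratios implied by the parameter counts above.
    for name, active_b, total_b in [
        ("DeepSeek-V3",       37,   671),
        ("DeepSeek-V4-Pro",   49,  1600),
        ("DeepSeek-V4-Flash", 13,   284),
    ]:
        print(f"{name}: {active_b}B / {total_b}B = "
              f"{100 * active_b / total_b:.1f}% active")
    # Prints roughly 5.5%, 3.1%, and 4.6% respectively.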

CSA and HCA appear to extend DeepSeek-V3’s Multi-Head Latent Attention (MLA) mechanism, which compressed key-value representations into a low-rank latent space. The V4 mechanisms add a sparsity dimension, enabling selective retrieval over sequences that are roughly eight times longer than V3 supported. Specific throughput figures, latency numbers, and benchmark comparisons had not been published as of April 25, 2026.
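
DeepSeek’s V4 formulation is unpublished, but the MLA idea it reportedly extends can be sketched in a few lines: project hidden states down to a narrow shared latent, cache only that latent, and expand it back into keys and values when attention runs. The class, dimensions, and layer names below are hypothetical, chosen only to make the compression ratio concrete.

    import torch.nn as nn

    class LowRankKVCache(nn.Module):
        """Minimal MLA-style sketch (not DeepSeek's implementation).

        The cache stores one latent of width latent_dim per token instead
        of full-width keys and values, shrinking cache memory by roughly
        2 * d_model / latent_dim before any sparsity is applied.
        """
        def __init__(self, d_model=4096, latent_dim=512):
            super().__init__()
            self.down = nn.Linear(d_model, latent_dim, bias=False)  # compress
            self.up_k = nn.Linear(latent_dim, d_model, bias=False)  # rebuild keys
            self.up_v = nn.Linear(latent_dim, d_model, bias=False)  # rebuild values

        def compress(self, h):     # h: (seq_len, d_model); cache the result
            return self.down(h)    # (seq_len, latent_dim)

        def expand(self, latent):  # called at attention time
            return self.up_k(latent), self.up_v(latent)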

Who’s Affected

Enterprise developers building retrieval-augmented generation pipelines, legal document analysis tools, and long-codebase processing applications stand to benefit most directly. At one million tokens, a single context window can hold approximately 750,000 words — enough for a full software repository, months of email correspondence, or multiple book-length documents processed in a single pass.
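
The capacity figures follow from the common rule of thumb of roughly 0.75 English words per token:

    # Rough capacity math behind the figures above (~0.75 words/token).
    context_tokens = 1_000_000
    words = int(context_tokens * 0.75)     # ≈ 750,000 words
    book = 90_000                          # a typical book-length manuscript
    print(f"~{words:,} words, or about {words // book} book-length documents")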

Organizations currently using GPT-4o, Claude, or Gemini 1.5 Pro for long-context workloads will likely evaluate DeepSeek-V4 as a cost comparison once API access is available. Cloud inference providers that distribute DeepSeek models — including platforms that offered DeepSeek-V3 via API — will need to update their model catalogs once V4 exits preview.

What’s Next

DeepSeek-AI has not disclosed a timeline for moving V4 from preview to general availability, nor has it published API pricing. Independent benchmark evaluations on standard long-context tasks — such as RULER, LongBench, or needle-in-a-haystack retrieval at one million tokens — have not yet appeared as of publication.

DeepSeek’s release of V3 in late 2024 prompted third-party evaluations within days once weights became accessible; a similar evaluation cycle is plausible for V4 given the lab’s track record of releasing model weights publicly. The DeepSeek-AI GitHub organization at github.com/deepseek-ai is the expected location for weight releases and technical reports.
