- Mistral Small 4 has 119 billion total parameters but activates only 6 billion per token through a mixture-of-experts architecture with 128 expert sub-networks, 4 active at a time.
- The model supports a 256,000-token context window and unifies fast instruction-following, configurable reasoning depth, and multimodal input in a single deployment.
- Mistral reports a 40 percent reduction in end-to-end completion time and 3x higher throughput compared to the previous Mistral Small 3.
- Released under the Apache 2.0 license, the model's weights are fully open and available for commercial use, fine-tuning, and self-hosted deployment.
What Happened
Mistral AI released Mistral Small 4, a mixture-of-experts language model with 119 billion total parameters that activates only 6 billion parameters per token. Including embedding and output layers, the active parameter count reaches approximately 8 billion. The model is available through Mistral’s API, Hugging Face, NVIDIA’s build.nvidia.com, and multiple open-source deployment frameworks.
The release continues Mistral’s strategy of building models that compete with much larger dense systems while requiring significantly less compute at inference time. The company, founded in Paris in 2023 by former researchers from Google DeepMind and Meta, has positioned itself as the leading European AI lab and a key proponent of open-weight model releases.
Why It Matters
The gap between a model’s total parameter count and its active parameters per token is the key to understanding Mistral Small 4’s efficiency. A 119-billion-parameter dense model would require enormous GPU clusters to serve. By using mixture-of-experts, where only 4 of 128 expert sub-networks activate for each token, the model achieves the knowledge capacity of a large model with the inference cost of a much smaller one.
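The routing step described above can be sketched in a few lines. This is a minimal, illustrative top-k router, not Mistral's published implementation: the random logits stand in for the output of a small learned routing layer that, in a real MoE block, is computed from each token's hidden state.

```python
import math
import random

NUM_EXPERTS = 128   # expert sub-networks in the MoE layer
TOP_K = 4           # experts activated per token

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_logits, k=TOP_K):
    """Pick the top-k experts for one token and renormalize their weights.

    In a real MoE layer the token's output is the weighted sum of the
    chosen experts' outputs; the other 124 experts do no compute at all.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

# One token's routing decision: only 4 of 128 experts run.
random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
chosen = route(logits)
print(len(chosen))  # 4 experts selected for this token
```

Because only the selected experts execute, per-token FLOPs scale with the 4 active experts rather than all 128, which is the source of the 6-billion-active-parameter figure.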
This architecture allows organizations to run a capable model on hardware that would be insufficient for a dense model of equivalent quality. Mistral reports that the model matches or surpasses GPT-OSS 120B, a comparable open-source model, across three key benchmarks while generating 20 percent shorter outputs. Shorter outputs translate directly to faster response times and lower per-query costs for production deployments.
Technical Details
The architecture uses 128 expert sub-networks with 4 active per token, routed dynamically based on the input. The context window extends to 256,000 tokens, enough to process book-length documents, entire codebases, or extended conversation histories in a single prompt. The model accepts both text and image inputs for multimodal use cases.
A configurable reasoning_effort parameter lets developers toggle between lightweight responses for simple everyday tasks and deep step-by-step chain-of-thought analysis for complex problems. This means a single model deployment can serve both quick chatbot responses and detailed technical reasoning without switching between different models or endpoints.
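In practice, switching depth would look something like the sketch below. The model identifier, the exact field name, and the accepted values ("low"/"high") are assumptions for illustration; consult Mistral's API reference for the shipped parameter names.

```python
import json

def build_request(prompt, effort="low"):
    """Assemble a chat-completion payload with a reasoning-depth switch.

    `reasoning_effort` values and the model id below are hypothetical,
    based on the parameter described for Mistral Small 4.
    """
    assert effort in ("low", "high")
    return {
        "model": "mistral-small-4",   # hypothetical API model id
        "reasoning_effort": effort,   # lightweight vs. step-by-step reasoning
        "messages": [{"role": "user", "content": prompt}],
    }

# Same endpoint, same deployment: only the effort flag changes.
quick = build_request("What is the capital of France?", effort="low")
deep = build_request("Prove that sqrt(2) is irrational.", effort="high")
print(json.dumps(quick, indent=2))
```

The point of the design is that both payloads target the same deployment; nothing else about the serving stack changes between the two modes.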
Performance benchmarks show the model scoring 0.72 on the LCR benchmark with only 1,600 characters of output, indicating high information density. On LiveCodeBench, it outperforms GPT-OSS 120B while producing 20 percent less output. Mistral reports a 40 percent reduction in end-to-end completion time in latency-optimized configurations and 3x more requests per second in throughput-optimized setups compared to Mistral Small 3.
Minimum hardware requirements are 4x NVIDIA HGX H100, 2x HGX H200, or 1x DGX B200. Recommended configurations double those minimums for production workloads. The model can be deployed through vLLM, llama.cpp, SGLang, and Hugging Face Transformers, and is available as an NVIDIA NIM container for standardized production deployment.
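The hardware floor follows from simple arithmetic: although only 4 experts run per token, all 128 must be resident in GPU memory. A back-of-envelope check, assuming 16-bit weights and 80 GB per H100 (KV cache and activations excluded):

```python
TOTAL_PARAMS = 119e9    # total parameters across all experts
BYTES_PER_PARAM = 2     # assuming bf16/fp16 weights
H100_MEMORY_GB = 80     # memory per HGX H100 GPU

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
print(f"weights alone: {weights_gb:.0f} GB")          # 238 GB
print(f"4x H100 capacity: {4 * H100_MEMORY_GB} GB")   # 320 GB
# The weights exceed any single GPU, but fit across four, leaving
# headroom for KV cache and activations. This is why the hardware
# floor tracks total parameters, not the 6B active per token.
```

The same arithmetic explains why mixture-of-experts lowers compute cost but not memory cost: serving still pays for every expert's weights.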
Who’s Affected
Organizations that want to self-host capable language models benefit the most from this release. The Apache 2.0 license allows commercial use, fine-tuning, and redistribution with no royalties; its only conditions are attribution and preservation of license notices. Companies in regulated industries like healthcare, finance, and defense, where data cannot leave internal servers due to compliance requirements, can deploy Mistral Small 4 entirely on their own infrastructure.
Cloud API users also benefit from lower per-token costs compared to dense models of equivalent capability. Developers already using Mistral Small 3 can expect a direct upgrade path with substantially better throughput and latency characteristics.
What’s Next
Mistral Small 4 is available for free prototyping on NVIDIA’s build.nvidia.com platform, lowering the barrier for evaluation. The model’s main limitation is its hardware floor: even with mixture-of-experts efficiency, running it requires at least four high-end enterprise GPUs, putting self-hosted deployment out of reach for individual developers and small teams without cloud GPU access. The next question for Mistral is whether the company will apply the same 128-expert architecture to a larger frontier model that could compete directly with the top systems from OpenAI, Anthropic, and Google at the highest capability tier.
