- Mistral AI released Voxtral TTS, a 4-billion-parameter open-weight text-to-speech model available on Hugging Face under a CC BY NC 4.0 license.
- Human evaluations show Voxtral TTS achieves superior naturalness compared to ElevenLabs Flash v2.5 while matching ElevenLabs v3 quality.
- The model supports nine languages, adapts to custom voices from as little as three seconds of reference audio, and achieves 70ms model latency.
- API pricing is set at $0.016 per 1,000 characters through Mistral Studio.
What Happened
Mistral AI released Voxtral TTS, the company’s first text-to-speech model, on March 26, 2026. The full model weights are available on Hugging Face, allowing companies to download Voxtral TTS and run it on their own servers or even on a smartphone without sending audio data to a third party.
The model is also available through Mistral’s API on Mistral Studio at $0.016 per 1,000 characters and is integrated into Le Chat, Mistral’s consumer-facing assistant.
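As a rough illustration of what that rate means in practice, the sketch below estimates API cost from character count. It assumes billing is prorated per character and that every character of input counts toward the total; neither detail is confirmed here, and the function name is purely illustrative.

```python
from decimal import Decimal

# Rate quoted in the article: $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS_USD = Decimal("0.016")


def estimate_cost_usd(num_chars: int) -> Decimal:
    """Estimate the API cost in USD for synthesizing `num_chars` characters.

    Assumption: cost scales linearly per character with no minimum charge
    and no rounding up to the next 1,000 characters.
    """
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return Decimal(num_chars) * PRICE_PER_1K_CHARS_USD / Decimal(1000)


if __name__ == "__main__":
    for chars in (500, 10_000, 1_000_000):
        print(f"{chars:>9,} characters: ${estimate_cost_usd(chars)}")
```

Under those assumptions, one million characters of synthesized text would cost about $16.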
Voxtral TTS enters a market dominated by proprietary services like ElevenLabs and is one of the first competitive open-weight alternatives for production speech synthesis. The release follows Mistral’s earlier Voxtral speech-to-text model and extends the company’s portfolio into the full audio pipeline.
Why It Matters
Open-weight text-to-speech models have lagged behind proprietary offerings in quality. Voxtral TTS narrows that gap significantly. Human evaluations conducted by native speakers found that Voxtral TTS achieves superior naturalness compared to ElevenLabs Flash v2.5 while maintaining similar time-to-first-audio latency. Against ElevenLabs v3, it reaches quality parity while adding emotion-steering capabilities.
The open-weight release under CC BY NC 4.0 means organizations can run the model entirely on their own infrastructure, although the license limits the downloadable weights to non-commercial use. Self-hosting matters for industries like healthcare and financial services, where sending audio to third-party APIs raises compliance concerns around patient data and financial records. Mistral explicitly positioned the model around this advantage, noting companies can operate it “without sending a single audio frame to a third party.”
The timing is notable as demand for voice AI accelerates. Customer support centers, real-time translation services, and voice agent platforms all require low-latency, natural-sounding speech synthesis. A competitive open-weight option could reduce vendor lock-in for organizations currently dependent on proprietary APIs.
Technical Details
Voxtral TTS is a 4-billion-parameter model built on a hybrid architecture combining autoregressive semantic generation with flow-matching for acoustic details. The architecture breaks down into three components: a 3.4-billion-parameter transformer decoder backbone based on Ministral 3B, a 390-million-parameter flow-matching acoustic transformer, and a 300-million-parameter symmetric neural audio codec.
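The three component sizes add up to roughly 4.09 billion parameters, which is where the rounded 4-billion figure comes from. The short sketch below does that arithmetic and adds a back-of-the-envelope estimate of weight storage. The 2-bytes-per-parameter figure is an assumption (16-bit weights), not something stated in the release, and the estimate ignores activations and any runtime overhead.

```python
# Component sizes as reported in the article (parameter counts).
COMPONENTS = {
    "transformer decoder backbone": 3_400_000_000,
    "flow-matching acoustic transformer": 390_000_000,
    "neural audio codec": 300_000_000,
}


def total_parameters() -> int:
    """Sum the reported parameter counts of all three components."""
    return sum(COMPONENTS.values())


def weight_size_gb(bytes_per_param: int = 2) -> float:
    """Approximate size of the weights in decimal gigabytes.

    Assumption: every parameter is stored at `bytes_per_param` bytes
    (2 bytes corresponds to 16-bit weights).
    """
    return total_parameters() * bytes_per_param / 1e9


if __name__ == "__main__":
    total = total_parameters()
    for name, count in COMPONENTS.items():
        print(f"{name}: {count / total:.1%} of parameters")
    print(f"total: {total:,} parameters")
    print(f"approx. weight size at 16-bit: {weight_size_gb():.2f} GB")
```

The backbone accounts for the large majority of the parameters, so the two audio-specific components are a comparatively small addition to the underlying language model.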
The model supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. It achieves a model latency of 70ms for a typical 10-second voice sample with 500 characters of input, with a real-time factor of approximately 9.7x. Native generation supports up to two minutes of continuous audio output.
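To put the real-time factor in concrete terms, the sketch below converts it into generation time. It assumes that "9.7x" means audio is produced 9.7 times faster than it plays back, which is one common reading of the term but is not spelled out here, and it treats the 70ms latency as a separate figure that is not included in the calculation.

```python
# Real-time factor quoted in the article, read here as a speed-up over playback.
REAL_TIME_SPEEDUP = 9.7


def generation_time_s(audio_seconds: float, speedup: float = REAL_TIME_SPEEDUP) -> float:
    """Estimate the seconds of compute needed to synthesize `audio_seconds` of audio.

    Assumption: generation time scales linearly with output duration.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


if __name__ == "__main__":
    # A 10-second sample and the two-minute native generation limit.
    for seconds in (10.0, 120.0):
        print(f"{seconds:.0f} s of audio: about {generation_time_s(seconds):.2f} s to generate")
```

Under that reading, a 10-second clip would take a little over one second to generate in full, and the two-minute maximum would take roughly 12 seconds.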
Voice adaptation requires as little as three seconds of reference audio. The model captures not just the voice but nuances including accent, inflections, intonations, and disfluencies. It also supports zero-shot cross-lingual voice adaptation, meaning a voice cloned from English audio can generate speech in any of the nine supported languages. This cross-lingual capability is particularly relevant for global customer support operations where a single branded voice needs to operate across multiple markets.
Who’s Affected
Developers building voice agents for customer support, real-time translation, and speech-to-speech systems gain a self-hosted alternative to proprietary TTS APIs. Companies in regulated industries that cannot send audio data to third-party servers have a new option for on-premises deployment.
ElevenLabs and other commercial TTS providers face increased competitive pressure, particularly in the developer and enterprise segments where open weights and self-hosting are priorities. For ElevenLabs specifically, the human evaluation results showing Voxtral TTS surpassing its Flash v2.5 tier in naturalness challenge a key selling point of proprietary models.
What’s Next
Mistral has published a research paper on arXiv (2603.25551) detailing the model’s architecture and evaluation methodology. The evaluation used three native-speaker annotators per language pair conducting side-by-side preference tests on naturalness, accent adherence, and acoustic similarity.
The company is hosting a webinar on building conversational AI with Voxtral TTS. The CC BY NC 4.0 license restricts commercial use of the open weights, meaning companies deploying Voxtral TTS in production will need to either use Mistral’s paid API at $0.016 per 1,000 characters or negotiate a separate commercial license. The non-commercial restriction limits the model’s immediate impact for startups and smaller companies that cannot afford licensing negotiations.