Mistral AI Releases Voxtral TTS Open-Weight Speech Model With 70 Millisecond Latency and Nine Language Support

Mistral AI has released Voxtral TTS, a 4-billion-parameter open-weight text-to-speech model that the company says matches or exceeds the quality of leading proprietary alternatives while allowing enterprises to run the model on their own infrastructure. The release, on March 26, 2026, marks Mistral’s entry into the speech generation market.

Voxtral TTS is built on top of Mistral’s existing Ministral 3B language model and uses an autoregressive flow-matching architecture. The model breaks down into a 3.4 billion parameter transformer decoder backbone, a 390 million parameter flow-matching acoustic transformer, and a 300 million parameter neural audio codec. It supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.
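As a quick check on the component figures above, the short Python sketch below sums the three reported sizes and estimates how much memory the weights alone would occupy. The 2-bytes-per-parameter figure is an assumption (16-bit weights); the article does not state the precision the model ships in, and the function names are illustrative only.

```python
# Component sizes as reported in the article, in parameters.
COMPONENTS = {
    "transformer decoder backbone": 3.4e9,
    "flow-matching acoustic transformer": 390e6,
    "neural audio codec": 300e6,
}


def total_parameters(components):
    """Sum the parameter counts of all components."""
    return sum(components.values())


def weight_memory_gb(n_params, bytes_per_param=2):
    """Estimate weight storage in gigabytes (1 GB = 1e9 bytes).

    Assumption: 2 bytes per parameter (16-bit weights). The article
    does not say which precision the released weights use.
    """
    return n_params * bytes_per_param / 1e9


total = total_parameters(COMPONENTS)
print(f"total: {total / 1e9:.2f}B parameters")
print(f"weights at 16-bit: {weight_memory_gb(total):.1f} GB")
```

The components add up to about 4.09 billion parameters, in line with the rounded 4 billion headline figure. Under the 16-bit assumption the weights alone would take roughly 8.2 GB, which leaves headroom within the 16 GB VRAM requirement quoted for the model, although actual runtime memory use also depends on caches and batch size.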

The model ships with 20 preset voices and can adapt to a custom voice from as little as a three-second reference clip, capturing accent, inflections, intonation, rhythm, and natural disfluencies. Cross-lingual voice adaptation works without additional training, meaning a French voice prompt can generate English speech with a natural French accent.

Performance benchmarks show a typical latency of 70 milliseconds and a time-to-first-audio of approximately 90 milliseconds for a standard request. On an Nvidia H200 GPU, the model generates audio at a real-time factor of roughly 9.7x at single concurrency, and throughput scales to 879 characters per second per GPU at 32 concurrent requests. The model requires a single GPU with 16 gigabytes or more of VRAM and runs through vLLM v0.18.0 or later.
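To put the throughput figures in practical terms, the sketch below converts them into rough processing times. It assumes the 9.7 real-time factor means about 9.7 seconds of audio produced per second of compute (the article does not define the term), and it treats both figures as constant rates. The function names are hypothetical and are not part of any Mistral or vLLM API.

```python
def synthesis_seconds(audio_seconds, speedup=9.7):
    """Compute time needed to generate `audio_seconds` of speech.

    Assumption: `speedup` is seconds of audio produced per second of
    compute, matching the single-concurrency H200 figure in the article.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def batch_seconds(total_chars, chars_per_second=879.0):
    """Time for one GPU to process `total_chars` of input text.

    Assumption: the 879 characters/second figure (32 concurrent
    requests on one GPU) holds as a steady rate.
    """
    if chars_per_second <= 0:
        raise ValueError("chars_per_second must be positive")
    return total_chars / chars_per_second


# One minute of speech at single concurrency.
print(f"{synthesis_seconds(60):.1f} s of compute for 60 s of audio")

# A 100,000-character document spread over 32 concurrent requests.
print(f"{batch_seconds(100_000):.0f} s for 100,000 characters")
```

Under these assumptions, a minute of speech takes a little over six seconds to generate, and a 100,000-character document takes just under two minutes on one GPU. Real workloads will vary with text content, voice settings, and hardware.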

In blind listening tests conducted by Mistral, evaluators preferred Voxtral TTS over ElevenLabs Flash v2.5 approximately 63 percent of the time on flagship voices and nearly 70 percent of the time on voice customization tasks. Against ElevenLabs v3, that company's highest-tier model, Voxtral TTS performed at parity.

The open-weight release is the key differentiator. Every major competitor in the text-to-speech market, including ElevenLabs, OpenAI, and Deepgram, operates on a proprietary API-first model in which enterprises must send their data to a third party. Voxtral TTS allows companies to download and run the full model locally, on anything from data center servers to devices with sufficient memory, without external data transmission. The model weights are available for commercial use.
