ANALYSIS

Google DeepMind Releases Gemini 3.1 Flash TTS, Scoring 1,211 Elo on Speech Benchmark

Anika Patel · Apr 16, 2026 · 2 min read
Engine Score 8/10 — Important
  • Gemini 3.1 Flash TTS launched April 16, 2026 in developer and enterprise preview via the Gemini API, Google AI Studio, and Vertex AI.
  • The model recorded an Elo score of 1,211 on the Artificial Analysis TTS leaderboard, a benchmark based on thousands of blind human preference evaluations.
  • A new audio tags feature lets developers embed natural language commands directly into text input to control voice style, pacing, and tone — including mid-sentence adjustments.
  • All audio output is watermarked with SynthID, Google’s imperceptible AI-content detection system, to enable reliable identification of AI-generated speech.

What Happened

Google DeepMind on April 16, 2026 released Gemini 3.1 Flash TTS, a new text-to-speech model featuring native multi-speaker dialogue, support for more than 70 languages, and a new audio tags system for granular vocal control. The model is rolling out in preview for developers via the Gemini API and Google AI Studio, for enterprise customers via Vertex AI, and for Google Workspace subscribers through Google Vids.

Why It Matters

The commercial text-to-speech market has become increasingly contested, with ElevenLabs, OpenAI’s TTS API, and Microsoft Azure Neural Voice competing for developer and enterprise contracts. Artificial Analysis — an independent AI benchmarking organization — placed Gemini 3.1 Flash TTS in what it calls its “most attractive quadrant,” citing the model’s combination of high output quality and low cost. Mandatory SynthID watermarking on all generated audio adds a provenance layer that addresses content-authenticity requirements increasingly demanded by enterprise buyers and platform operators.
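
For readers weighing the model's 1,211 Elo rating, the sketch below converts a rating gap into an expected head-to-head preference rate. It assumes the standard logistic Elo formula with a 400-point scale; the article does not describe the leaderboard's exact rating method, and the rival ratings are hypothetical values chosen only for illustration.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo formula.

    In a blind pairwise test this is roughly the share of comparisons
    in which listeners would be expected to prefer A over B.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


GEMINI_TTS_ELO = 1211  # figure reported in the article

# Hypothetical rival ratings, for illustration only (not leaderboard data).
HYPOTHETICAL_RIVALS = (1211, 1150, 1100, 1011)

for rival in HYPOTHETICAL_RIVALS:
    p = elo_expected_score(GEMINI_TTS_ELO, rival)
    print(f"vs a model rated {rival}: expected preference rate {p:.1%}")
```

Under these assumptions, equal ratings give a 50% expected preference rate and a 200-point lead gives roughly 76%, so gaps of a few dozen points imply only a modest edge in listener preference.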

Technical Details

Gemini 3.1 Flash TTS recorded an Elo score of 1,211 on the Artificial Analysis TTS leaderboard, which DeepMind describes as capturing “thousands of blind human preferences” across pairwise comparisons. The audio tags system works by embedding natural language commands directly inside the input text string, enabling developers to specify scene direction, accent, tone, and pacing at the sentence or sub-sentence level. Developers can assign unique Audio Profiles to individual speakers, then use inline tags to shift a character’s expression mid-sentence without resetting the overall scene context. Completed configurations export directly as Gemini API code, which DeepMind says ensures voice consistency across separate projects and platforms.
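
To make the idea of inline audio tags concrete, here is a minimal sketch of how a tagged, multi-speaker input string might be assembled. The bracket syntax, the `Speaker` class, and the helpers `tag`, `line`, and `build_script` are all hypothetical. The article says only that tags are natural language commands embedded in the input text, so the real format may differ. The sketch stops at building the string and does not call the Gemini API, since the model identifier and request shape for this preview are not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speaker:
    """A speaker plus a free-text voice description (hypothetical 'Audio Profile')."""
    name: str
    profile: str


def tag(direction: str) -> str:
    """Wrap a natural-language direction as an inline tag (bracket syntax is assumed)."""
    return f"[{direction}]"


def line(speaker: Speaker, *parts: str) -> str:
    """Join text fragments and inline tags into one dialogue line for a speaker."""
    return f"{speaker.name}: " + " ".join(parts)


def build_script(scene: str, speakers: list[Speaker], lines: list[str]) -> str:
    """Combine scene direction, per-speaker profiles, and dialogue into one input string."""
    header = [f"Scene: {scene}"]
    header += [f"{s.name} profile: {s.profile}" for s in speakers]
    return "\n".join(header + [""] + lines)


host = Speaker("Host", "warm, measured, mid-paced")
guest = Speaker("Guest", "energetic, slightly faster")

script = build_script(
    "Two people recording a tech podcast in a quiet studio",
    [host, guest],
    [
        # The tag sits mid-sentence, shifting delivery without restating the scene.
        line(host, "Welcome back.", tag("lower voice, slower"), "Today's topic is a big one."),
        line(guest, "I've been waiting for this", tag("excited"), "all week!"),
    ],
)
print(script)
```

Keeping scene, profiles, and dialogue in one builder mirrors the article's description of per-speaker profiles combined with mid-sentence adjustments; the resulting string would be passed as the text input to whichever TTS endpoint is used.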

Who’s Affected

Developers building voice interfaces, podcast tools, or audiobook applications can begin testing Gemini 3.1 Flash TTS immediately through the Google AI Studio Playground. Enterprise customers on Google Cloud’s Vertex AI gain preview access under existing Cloud terms. Early testers, as reported in DeepMind’s launch blog post, described audio tags as delivering “a new level of creative precision, transforming simple text into a high-fidelity vocal performance.”

What’s Next

DeepMind described the April 16 rollout as a preview phase across the Gemini API, Google AI Studio, and Vertex AI, with no disclosed timeline for general availability and no published pricing tiers. The company’s model card, linked from the launch post, covers its approach to safety and SynthID watermark detection for those evaluating the model for production use.
