ANALYSIS

xAI Releases Grok STT and TTS APIs, Claims 5% Phone-Call Error Rate

Anika Patel · Apr 19, 2026 · 3 min read
Engine Score 7/10 — Important
  • xAI launched standalone Speech-to-Text and Text-to-Speech APIs on April 18, 2026, built on the same infrastructure that powers Grok Voice across its mobile apps, Tesla vehicles, and Starlink customer support.
  • The STT API supports 25 languages, both batch and streaming modes, speaker diarization, word-level timestamps, and 12 audio formats with a 500 MB per-request cap.
  • xAI’s own benchmarks report a 5.0% entity recognition error rate on phone calls, compared to 12.0% for ElevenLabs, 13.5% for Deepgram, and 21.3% for AssemblyAI — figures that have not been independently verified.
  • Pricing is set at $0.10 per hour for batch transcription and $0.20 per hour for streaming.

What Happened

xAI released two standalone audio APIs — a Speech-to-Text (STT) API and a Text-to-Speech (TTS) API — on April 18, 2026, making both generally available to enterprise developers. According to xAI’s official announcement, both APIs run on the same infrastructure that already handles Grok Voice functionality on mobile applications, Tesla’s in-vehicle assistant, and Starlink customer support channels. The release moves xAI into direct competition with established speech API providers including ElevenLabs, Deepgram, and AssemblyAI.

Why It Matters

The speech API market has been dominated by a small set of specialized vendors. ElevenLabs has built a strong position in high-fidelity TTS, while Deepgram and AssemblyAI have focused on transcription throughput and developer tooling. xAI’s entry is notable because it is leveraging production infrastructure already deployed across consumer and enterprise-facing Grok products, rather than launching a greenfield API offering. The move follows a broader pattern of AI model providers vertically integrating into infrastructure that was previously served by point-solution vendors.

Technical Details

The Grok STT API supports transcription across 25 languages and offers two operational modes: batch processing for pre-recorded audio files and streaming for real-time transcription as audio is captured. The API includes word-level timestamps, speaker diarization — which attributes transcript segments to individual speakers — multichannel audio support, and Inverse Text Normalization, which converts spoken forms such as “one hundred sixty-seven thousand nine hundred eighty-three dollars and fifteen cents” into structured output like “$167,983.15.”
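The Inverse Text Normalization step described above can be sketched with a toy number-word parser. This is purely illustrative of what ITN does, not xAI's implementation; the dictionaries and the expected input shape are assumptions for the example.

```python
# Minimal sketch of Inverse Text Normalization (ITN): converting a spoken
# dollar amount into structured written form. Illustrative toy only --
# real ITN systems handle dates, ordinals, addresses, and far messier input.

UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
         "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
SCALES = {"thousand": 1_000, "million": 1_000_000}

def words_to_int(words: list[str]) -> int:
    """Accumulate a running group; 'thousand'/'million' close out a group."""
    total, current = 0, 0
    for w in words:
        if w in UNITS:
            current += UNITS[w]
        elif w in TENS:
            current += TENS[w]
        elif w == "hundred":
            current *= 100
        elif w in SCALES:
            total += current * SCALES[w]
            current = 0
    return total + current

def normalize_dollars(spoken: str) -> str:
    """Expected shape (an assumption): '<dollars> dollars and <cents> cents'."""
    words = spoken.replace("-", " ").lower().split()
    split = words.index("dollars")
    dollars = words_to_int(words[:split])
    cents = words_to_int([w for w in words[split + 1:]
                          if w not in ("and", "cents")])
    return f"${dollars:,}.{cents:02d}"

print(normalize_dollars(
    "one hundred sixty-seven thousand nine hundred "
    "eighty-three dollars and fifteen cents"))  # → $167,983.15
```

Production ITN is typically implemented with weighted finite-state transducers rather than hand-rolled parsers, but the input-output contract is the same as above.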

The API accepts 12 audio formats: nine container formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV) and three raw formats (PCM, µ-law, A-law), with a maximum file size of 500 MB per request. On company-reported benchmarks, xAI states the STT API achieved a 5.0% entity recognition error rate on phone call audio, compared to 12.0% for ElevenLabs, 13.5% for Deepgram, and 21.3% for AssemblyAI. On video and podcast content, xAI and ElevenLabs tied at a 2.4% error rate, with Deepgram at 3.0% and AssemblyAI at 3.2%. xAI also reports a 6.9% word error rate across general audio benchmarks. These figures are self-reported and have not been independently replicated.
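Given the published format list and the 500 MB per-request cap, a client might pre-flight files before upload. The file extensions for the raw formats and the byte interpretation of "MB" are assumptions here; xAI's API documentation should be the reference.

```python
# Client-side pre-flight check against the published limits: 12 accepted
# formats and a 500 MB per-request cap. Extension-to-format mapping and
# the decimal (10^6) reading of "MB" are assumptions for this sketch.
import os

CONTAINER_FORMATS = {".wav", ".mp3", ".ogg", ".opus", ".flac",
                     ".aac", ".mp4", ".m4a", ".mkv"}
RAW_FORMATS = {".pcm", ".ulaw", ".alaw"}   # assumed extensions for PCM, µ-law, A-law
MAX_BYTES = 500 * 1000 * 1000              # 500 MB per-request cap

def preflight(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in CONTAINER_FORMATS | RAW_FORMATS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, over the 500 MB cap")
    return problems
```

Files over the cap would need to be split or transcoded to a more compact container (e.g. Opus) before submission.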

Who’s Affected

Developers building voice agents, meeting transcription tools, call center analytics platforms, subtitle generation pipelines, and accessibility features are the primary target audience. The $0.10/hour batch pricing positions the API below some incumbent providers on cost for high-volume offline workloads, while the $0.20/hour streaming rate targets teams for whom latency and real-time throughput are the primary constraints. ElevenLabs, Deepgram, and AssemblyAI face a competitor that can point to existing deployments across Tesla and Starlink as production-scale validation.
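At the published rates, the cost arithmetic is straightforward. The following sketch uses the $0.10/hour batch and $0.20/hour streaming figures from the announcement; the monthly-volume scenario is a hypothetical for illustration.

```python
# Back-of-envelope cost model at the published per-hour rates:
# $0.10/hr for batch transcription, $0.20/hr for streaming.
BATCH_RATE_PER_HOUR = 0.10
STREAM_RATE_PER_HOUR = 0.20

def monthly_cost(audio_hours_per_day: float, days: int = 30,
                 streaming: bool = False) -> float:
    """Cost in USD for a month of transcription at the given daily volume."""
    rate = STREAM_RATE_PER_HOUR if streaming else BATCH_RATE_PER_HOUR
    return round(audio_hours_per_day * days * rate, 2)

# Hypothetical call-center workload: 100 hours of audio per day.
print(monthly_cost(100))                  # batch:     $300.00/month
print(monthly_cost(100, streaming=True))  # streaming: $600.00/month
```

Streaming costs exactly double batch at these rates, so teams that can tolerate delayed results have a clear incentive to queue audio for batch processing.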

What’s Next

xAI has not published a roadmap for additional language support or model versioning cadence. Independent benchmarking by third parties — comparable to evaluations conducted on OpenAI Whisper and competing STT systems — will be necessary before enterprise buyers can assess whether the company-reported accuracy margins hold across diverse real-world audio conditions. The TTS API’s capabilities, including voice selection and latency characteristics, will also require independent testing before direct comparisons with ElevenLabs’ established voice library are possible.
