- Google’s TurboQuant achieves up to 8x faster attention computation on H100 GPUs by quantizing attention key vectors to 4 bits from standard 32-bit precision.
- The method reduces key-value (KV) cache memory by 6x with no reported accuracy loss on long-context benchmarks.
- TurboQuant uses a two-stage approach: PolarQuant, a rotation-based transform into polar coordinates, and QJL, a 1-bit error-correction step.
- The technique was tested on Gemma and Mistral open-source models across five long-context benchmark suites.
What Happened
Google Research published TurboQuant, a compression algorithm that reduces the memory and compute requirements of large language models while maintaining their accuracy. According to the Google Research blog post, the method achieves up to 8x faster performance in attention logit computation on NVIDIA H100 GPUs by quantizing key vectors from 32 bits down to 4 bits.
The research team includes Amir Zandieh (Research Scientist) and Vahab Mirrokni (VP and Google Fellow), along with collaborators Praneeth Kacham, Majid Hadian, Insu Han, Majid Daliri, Lars Gottesbüren, and Rajesh Jayaram.
TurboQuant targets the key-value (KV) cache, a memory-intensive component that stores contextual information from previous tokens during text generation. By compressing this cache to 3 bits per parameter, the method reduces KV cache memory usage by 6x without requiring any additional model training or fine-tuning.
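The scale of this saving is easy to see with back-of-envelope arithmetic. The sketch below uses illustrative model dimensions (a hypothetical 7B-class model), not figures from the TurboQuant paper, and compares a 16-bit baseline against 3 bits per parameter:

```python
# Back-of-envelope KV cache sizing. The model dimensions below are
# illustrative assumptions, not figures from the TurboQuant work.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_param):
    # Keys and values are each stored per layer, per head, per token.
    params = 2 * layers * kv_heads * head_dim * seq_len
    return params * bits_per_param / 8

# A hypothetical 7B-class model: 32 layers, 8 KV heads, head dim 128.
full = kv_cache_bytes(32, 8, 128, seq_len=32_000, bits_per_param=16)
compressed = kv_cache_bytes(32, 8, 128, seq_len=32_000, bits_per_param=3)

print(f"16-bit cache: {full / 2**30:.1f} GiB")  # grows linearly with seq_len
print(f"3-bit cache:  {compressed / 2**30:.1f} GiB")
```

At a 32,000-token context the cache shrinks from several gibibytes to well under one, which is the difference between fitting on a phone and not.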
Why It Matters
Running large language models on phones, tablets, and edge devices is constrained primarily by memory and processing power. The KV cache grows linearly with sequence length, meaning longer conversations and documents consume proportionally more device memory. TurboQuant addresses this bottleneck directly by compressing the cache to a fraction of its original size.
An 8x speedup in attention computation and a 6x reduction in cache memory could make it practical to run capable language models on consumer hardware that currently cannot support them. This has implications for offline AI assistants, on-device translation, and privacy-sensitive applications where sending data to cloud servers is undesirable or impractical.
Unlike many compression techniques that trade quality for speed, TurboQuant achieves these gains without retraining the model or accepting accuracy degradation. The researchers report “zero accuracy loss” across their benchmark evaluations, which distinguishes this approach from methods that sacrifice output quality for reduced resource consumption.
Technical Details
TurboQuant operates in two stages. The first stage, called PolarQuant, randomly rotates data vectors and converts them to polar coordinates consisting of radius and angle values. This geometric transformation eliminates the expensive normalization step, and its associated memory overhead, that traditional quantization methods require.
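A minimal sketch of the rotate-then-polar idea is shown below. The pairing of coordinates, the 4-bit angle grid, and the decision to keep radii unquantized are all illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

# Illustrative sketch of the PolarQuant idea described above: randomly
# rotate a vector, then quantize coordinate pairs in polar form
# (radius, angle). Bit widths and pairing are assumptions.
rng = np.random.default_rng(0)

def random_rotation(d):
    # QR decomposition of a Gaussian matrix gives a random orthogonal matrix.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def polar_quantize(x, rotation, angle_bits=4):
    y = rotation @ x                      # rotate; vector norms are preserved
    pairs = y.reshape(-1, 2)
    radius = np.linalg.norm(pairs, axis=1)
    angle = np.arctan2(pairs[:, 1], pairs[:, 0])
    # Snap each angle to a small grid; radii could be quantized similarly.
    levels = 2 ** angle_bits
    code = np.round((angle + np.pi) / (2 * np.pi) * (levels - 1)).astype(int)
    return radius, code

def polar_dequantize(radius, code, rotation, angle_bits=4):
    levels = 2 ** angle_bits
    angle = code / (levels - 1) * 2 * np.pi - np.pi
    pairs = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
    return rotation.T @ pairs.reshape(-1)  # undo the rotation

d = 8
R = random_rotation(d)
x = rng.normal(size=d)
x_hat = polar_dequantize(*polar_quantize(x, R), R)
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))  # small relative error
```

Because rotations preserve norms, no separate normalization pass over the data is needed before quantizing, which is the efficiency point made above.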
The second stage applies the Quantized Johnson-Lindenstrauss Transform (QJL), which uses just 1 bit to correct residual errors introduced by the quantization process. The mathematical foundation of this transform guarantees that distances between vectors are approximately preserved even after compression, maintaining the model’s ability to attend to relevant context.
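The flavor of a 1-bit Johnson-Lindenstrauss sketch can be illustrated with sign random projections, where the fraction of agreeing sign bits estimates the angle between vectors. The projection dimension and the estimator form here are assumptions for illustration, not the exact QJL construction:

```python
import numpy as np

# Sketch of a 1-bit quantized Johnson-Lindenstrauss estimator (the
# general idea behind QJL): project with a random Gaussian matrix,
# keep only sign bits, and estimate similarity from the fraction of
# agreeing signs. Dimensions and estimator form are assumptions.
rng = np.random.default_rng(1)

d, m = 128, 4096          # input dimension, projection dimension
S = rng.normal(size=(m, d))

def sign_sketch(x):
    return np.sign(S @ x)  # 1 bit per projected coordinate

def estimate_cosine(bits_a, bits_b):
    # For sign random projections, P[signs agree] = 1 - theta / pi,
    # where theta is the angle between the original vectors.
    agree = np.mean(bits_a == bits_b)
    return np.cos(np.pi * (1 - agree))

a = rng.normal(size=d)
b = a + 0.5 * rng.normal(size=d)
true_cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
est = estimate_cosine(sign_sketch(a), sign_sketch(b))
print(true_cos, est)  # the 1-bit estimate tracks the true cosine
```

This is the distance-preservation property referred to above: even after collapsing each projected coordinate to a single bit, angular relationships between vectors survive approximately.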
The combination of PolarQuant and QJL preserves accuracy without adding meaningful computational overhead. The researchers tested TurboQuant on Gemma and Mistral, two open-source large language models, across five benchmark suites: LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval. The method achieved what the team describes as “perfect downstream performance” across all long-context benchmarks.
In vector search evaluations on 200-dimensional GloVe embeddings, TurboQuant showed higher recall than existing quantization methods, including Product Quantization (PQ) and RaBitQ.
Who’s Affected
Mobile device manufacturers and chipset designers stand to benefit most directly. If TurboQuant or similar techniques become standard, phone makers can offer on-device AI capabilities without requiring top-tier hardware specifications or dedicating excessive memory to model caches.
Cloud AI providers could reduce serving costs significantly. An 8x speedup in attention computation translates directly to lower GPU-hours per inference request, which affects pricing for API-based AI services and could make AI inference more accessible for smaller companies and developers.
AI researchers working on model efficiency now have a new baseline to compare against. TurboQuant’s claim of zero accuracy loss at extreme compression ratios sets a high bar that competing quantization approaches will need to match or exceed.
What’s Next
The research has been published but Google has not announced specific product integrations for TurboQuant. The technique’s applicability to Google’s proprietary Gemini models and its potential deployment in Android devices or Google Cloud services remain open questions. The method’s 8x benchmark figure was measured on NVIDIA H100 GPUs, so real-world performance on mobile processors with different architectures will likely differ from the published results.
