Google Research has introduced TurboQuant, a quantization algorithm designed to eliminate the memory overhead that vector quantization typically adds in large language models and vector search engines. The research, led by Research Scientist Amir Zandieh and Vahab Mirrokni, VP and Google Fellow at Google Research, will be presented at ICLR 2026.
Vector quantization traditionally introduces memory overhead because most methods require calculating and storing quantization constants for every small block of data. “This overhead can add 1 or 2 extra bits per number, partially defeating the purpose of vector quantization,” according to the researchers’ blog post published March 24, 2026.
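The "1 or 2 extra bits" figure follows directly from block-wise bookkeeping: if a method stores, say, a 32-bit scale and a 32-bit zero-point for every block of values, that fixed cost is amortized over the block. The helper below is a hypothetical illustration of that arithmetic, not code from the research.

```python
def overhead_bits_per_number(block_size, scale_bits=32, zero_point_bits=32):
    """Extra bits per quantized value from storing one scale and one
    zero-point constant per block, as typical block-wise quantizers do."""
    return (scale_bits + zero_point_bits) / block_size

# A quantizer using blocks of 32 values with float32 constants pays
# 64 / 32 = 2.0 extra bits per number on top of the payload bits:
print(overhead_bits_per_number(32))   # 2.0
print(overhead_bits_per_number(64))   # 1.0
```

For a 4-bit quantizer, 2 extra bits of bookkeeping is a 50% overhead, which is why avoiding per-block constants matters.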
TurboQuant addresses this challenge through a two-stage process. The first stage uses PolarQuant, which randomly rotates data vectors to simplify their geometry before applying standard quantization to each vector component individually. The second stage applies the Quantized Johnson-Lindenstrauss (QJL) algorithm, which spends just one additional bit to correct the residual error left by the first stage. The QJL technique “acts as a mathematical error-checker that eliminates bias, leading to a more accurate attention score.”
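The two-stage idea can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions, not the paper's actual construction: the rotation is a generic orthogonal matrix rather than a structured one, and the residual is coded with one sign bit per coordinate plus a single shared magnitude rather than a true QJL sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR decomposition; a stand-in for the
    # structured rotations a production implementation would use.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def quantize_coords(x, bits=4):
    # Uniform scalar quantization of each coordinate to 2**bits levels,
    # returning the dequantized (reconstructed) values.
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2**bits - 1)
    return lo + np.round((x - lo) / step) * step

d = 64
x = rng.standard_normal(d)
R = random_rotation(d)

# Stage 1 (PolarQuant-like): rotate, then quantize coordinates independently.
xr = R @ x
stage1 = quantize_coords(xr)

# Stage 2 (QJL-like, simplified): encode the residual with one sign bit per
# coordinate plus one shared magnitude, and add the correction back.
residual = xr - stage1
stage2 = stage1 + np.sign(residual) * np.abs(residual).mean()

# Undo the rotation and compare reconstruction error with and without stage 2.
err_stage1 = np.linalg.norm(x - R.T @ stage1)
err_both = np.linalg.norm(x - R.T @ stage2)
```

Because the rotation is orthogonal it preserves distances, so the error can be measured in either basis; the one-bit residual correction strictly reduces the squared reconstruction error whenever the stage-1 residual is nonzero.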
The algorithm targets two critical AI bottlenecks: key-value cache compression and vector search optimization. Key-value caches serve as high-speed storage for frequently accessed information, while vector search powers similarity lookups in large-scale AI systems. Traditional quantization methods struggle with memory overhead that can partially negate compression benefits.
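To see why the key-value cache is a bottleneck worth compressing, a back-of-the-envelope size calculation helps. The model dimensions below are hypothetical, chosen to resemble a 7B-class transformer with grouped-query attention.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    # Keys and values (hence the factor of 2) are cached for every layer,
    # KV head, token position, and head dimension.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return n_values * bits_per_value // 8

# Hypothetical configuration: 32 layers, 8 KV heads, head_dim 128,
# and an 8,192-token context.
fp16 = kv_cache_bytes(32, 8, 128, 8192, 16)   # 2**30 bytes = 1 GiB
int4 = kv_cache_bytes(32, 8, 128, 8192, 4)    # 2**28 bytes = 0.25 GiB
```

At 16 bits per value a single 8k-token sequence costs a gigabyte of cache; cutting each value to 4 bits quarters that, which is the scale of saving quantization is after.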
Google Research reports that TurboQuant achieves “high reduction in model size with zero accuracy loss” in testing. The technique will be presented alongside PolarQuant at AISTATS 2026, with both methods showing promise for reducing key-value cache bottlenecks without sacrificing AI model performance. The researchers indicate the work has “potentially profound implications for all compression-reliant use cases, including and especially in the domains of search and AI.”
