On March 25, 2026, Google Research released TurboQuant, a software-only algorithm suite that compresses the key-value (KV) cache in large language models by up to six times while delivering an eight-fold speedup in attention computation. The technique requires no retraining, works with existing models as-is, and can cut enterprise AI serving costs by more than 50 percent.
The KV cache is one of the primary bottlenecks in running large language models at scale. Every time an LLM generates text, it stores key-value pairs for all previous tokens in the conversation, consuming memory that grows linearly with context length. For models handling long documents or multi-turn conversations, this cache can consume more GPU memory than the model weights themselves. TurboQuant compresses the cache to just three bits per value, down from the standard 16 bits.
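The memory arithmetic behind that claim is easy to check. The sketch below uses illustrative Llama-style dimensions (layer count, KV heads, and head size are assumptions, not figures from the announcement):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_value):
    # K and V each store context_len vectors of size head_dim,
    # per KV head, per layer -- hence the factor of 2
    total_bits = 2 * n_layers * n_kv_heads * head_dim * context_len * bits_per_value
    return total_bits // 8

# Illustrative large-model dimensions at a 128K-token context:
fp16 = kv_cache_bytes(80, 8, 128, 131072, 16)  # 16-bit baseline: 40 GiB
q3 = kv_cache_bytes(80, 8, 128, 131072, 3)     # 3-bit cache: 7.5 GiB
```

At these dimensions the 16-bit cache alone exceeds the memory of a single 40 GB accelerator, while the 3-bit version leaves most of it free for weights and batching; the ratio, 16/3, is where the "up to six times" compression figure comes from.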
The algorithm combines two mathematical frameworks. PolarQuant converts data vectors into polar coordinates, eliminating the memory overhead from outlier values that normally forces quantization schemes to reserve extra bits. Quantized Johnson-Lindenstrauss, or QJL, applies a one-bit random projection whose distance-preserving guarantees keep the error from the extreme compression in check. Together, they maintain model quality while dramatically reducing the memory footprint.
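The one-bit projection idea can be sketched in a few lines. The snippet below follows the general QJL construction (a shared Gaussian projection, one stored sign bit per projected coordinate, plus the key's norm), but the dimensions, constants, and function names are illustrative, not Google's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 4096                 # key dimension and projection size (illustrative)
S = rng.standard_normal((m, d))  # shared Gaussian projection matrix

def encode_key(k):
    # keep only one sign bit per projected coordinate, plus the key's norm
    return np.sign(S @ k), np.linalg.norm(k)

def estimate_dot(q, key_bits, key_norm):
    # inner-product estimate: project the query at full precision,
    # correlate with the stored sign bits, and rescale; for a Gaussian
    # projection this estimator is unbiased for <q, k>
    return key_norm * np.sqrt(np.pi / 2) / m * ((S @ q) @ key_bits)

q, k = rng.standard_normal(d), rng.standard_normal(d)
bits, norm = encode_key(k)
approx, exact = estimate_dot(q, bits, norm), float(q @ k)
```

The point of the construction is that attention scores are query-key inner products, so a cache that supports accurate inner-product estimates from one bit per coordinate can stand in for the full-precision keys.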
On NVIDIA H100 accelerators, TurboQuant's four-bit implementation achieved an eight-fold speedup in attention computation. The practical impact for enterprises running AI inference at scale is substantial: the same GPU fleet can serve significantly more concurrent users, or organizations can achieve the same throughput with fewer GPUs. For companies spending millions monthly on AI infrastructure, a 50 percent cost reduction changes the economics of which use cases are viable.
The market registered the implications quickly. Stock prices for major memory suppliers including Micron, Western Digital, and SanDisk declined following the announcement, reflecting investor concern that algorithmic efficiency improvements could dampen demand for High Bandwidth Memory. If software can deliver throughput gains that previously required hardware upgrades, the growth trajectory for AI memory chips faces a new variable.
TurboQuant also enhances vector search — the similarity-matching technology that powers Google Search, YouTube recommendations, and advertising targeting. Community adoption has already begun, with developers porting the algorithm to MLX for Apple Silicon and llama.cpp for local AI inference. Both PolarQuant and QJL will be presented at AISTATS 2026 in Tangier, and TurboQuant itself will appear at ICLR 2026 in Rio de Janeiro.
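The connection to vector search rests on the same principle: similarity between vectors can be estimated from short binary sketches. The snippet below illustrates the classic random-hyperplane (SimHash-style) version of this idea, as a generic example rather than TurboQuant's actual method:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 2048                 # vector dimension and sketch length (illustrative)
P = rng.standard_normal((m, d))

def sketch(x):
    # one bit per random hyperplane: which side does x fall on?
    return (P @ x) > 0

def estimated_cosine(bits_a, bits_b):
    # the fraction of disagreeing bits estimates the angle between the vectors
    frac = np.mean(bits_a != bits_b)
    return np.cos(np.pi * frac)

a = rng.standard_normal(d)
b = a + 0.5 * rng.standard_normal(d)  # a noisy neighbor of a
true_cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
approx_cos = float(estimated_cosine(sketch(a), sketch(b)))
```

Because the sketches are compact bit vectors, candidate matches can be compared with cheap bitwise operations before any full-precision vectors are touched, which is why better quantization translates directly into faster, cheaper similarity search.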
