A developer working on an open-source TurboQuant implementation for KV cache compression in llama.cpp has achieved a 22.8 percent decode speed improvement at 32K context length by exploiting attention sparsity to skip unnecessary dequantization work. The optimization, tested on a Qwen3.5-35B-A3B model running on Apple’s M5 Max, required approximately three lines of kernel code.
The bottleneck was clear: at long context lengths, dequantization of the key-value cache was consuming roughly 40 percent of total decode time. The developer tested 14 different approaches to accelerate dequantization directly, including register lookup tables, SIMD tricks, fused kernels, and branchless math. None outperformed the baseline because the hardware was already operating at its computational limit.
The breakthrough came from rethinking the problem. Flash attention computes softmax weights before accessing the value cache, and at long context lengths, the vast majority of those weights are effectively zero. Instead of making dequantization faster, the optimization simply skips V dequantization entirely for positions where attention weights are negligible. The approach eliminates roughly 90 percent of dequantization work without affecting output quality.
Benchmarks show perplexity remained unchanged while Needle-in-a-Haystack accuracy improved from 7/9 to 9/9 with TurboQuant KV (turbo3). The technique also benefits standard q8_0 KV cache, where it delivered a 5 percent decode speedup with identical perplexity and NIAH scores. On an M2 Pro system, combining a 4-magnitude lookup table on the key side with the sparse value dequantization stack brought turbo3 performance from 0.45x to 0.73x relative to q8_0.
The optimization is not specific to TurboQuant. It leverages a general property of attention distributions at long context: most positions contribute negligibly to the output. That makes it applicable to any quantized KV cache implementation. A CUDA port is reportedly being tested independently, which would extend the benefit to GPU-based inference. The code and benchmarks are available in the turboquant_plus repository on GitHub.
