TOOL UPDATES

TurboQuant Optimization Achieves 22.8 Percent Decode Speedup in llama.cpp by Skipping Redundant KV Dequantization

Ryan Matsuda · Mar 27, 2026 · Updated Apr 7, 2026 · 3 min read
Engine Score 8/10 — Important

This story reports a significant performance optimization for llama.cpp, a widely used local LLM inference engine, offering a substantial speedup for users. The information is highly actionable for the local LLM community, despite originating from a Reddit post.

  • A developer achieved a 22.8% decode speed improvement in llama.cpp's TurboQuant KV cache compression by skipping dequantization for attention positions with negligible weights.
  • The optimization required approximately three lines of kernel code and eliminates roughly 90% of unnecessary dequantization work at long context lengths.
  • Perplexity remained unchanged while Needle-in-a-Haystack accuracy improved from 7/9 to 9/9, with the technique applicable to any quantized KV cache implementation.
  • TurboQuant itself, published by Google Research at ICLR 2026, compresses KV cache 3.8-6.4x using rotation and quantization with near-zero quality loss.

What Happened

A developer working on an open-source TurboQuant implementation for llama.cpp found a way to achieve a 22.8 percent decode speed improvement at 32K context length. The optimization exploits attention sparsity to skip unnecessary dequantization of the key-value cache during inference. Testing was performed on a Qwen3.5-35B-A3B model running on Apple’s M5 Max.

TurboQuant itself originates from a Google Research paper by Zandieh et al., published at ICLR 2026. The algorithm compresses KV cache vectors down to 3-4 bits per coordinate using a combination of randomized Walsh-Hadamard Transform rotation and Lloyd-Max scalar quantization, achieving 3.8x to 6.4x compression with near-zero quality loss and no calibration step.
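The core recipe is straightforward to sketch: rotate each cached vector with a Walsh-Hadamard transform so outlier energy is spread evenly across coordinates, then scalar-quantize each coordinate to a few bits. The C++ sketch below illustrates that rotate-then-quantize flow under stated assumptions; it uses plain uniform quantization in place of Lloyd-Max, assumes a power-of-two head dimension, and none of the names come from the paper or the llama.cpp fork.

```cpp
// Sketch of the rotate-then-quantize idea behind TurboQuant: apply a
// Walsh-Hadamard transform to flatten outliers, then scalar-quantize each
// coordinate to a few bits. Uniform quantization stands in for Lloyd-Max here,
// and the dimension is assumed to be a power of two. Illustrative only.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// In-place fast Walsh-Hadamard transform (orthonormal scaling).
void fwht(std::vector<float> &v) {
    const size_t n = v.size();                 // must be a power of two
    for (size_t len = 1; len < n; len <<= 1) {
        for (size_t i = 0; i < n; i += len << 1) {
            for (size_t j = i; j < i + len; ++j) {
                float a = v[j], b = v[j + len];
                v[j]       = a + b;
                v[j + len] = a - b;
            }
        }
    }
    const float norm = 1.0f / std::sqrt((float)n);
    for (float &x : v) x *= norm;
}

struct QuantizedVec {
    float scale;                // per-vector scale
    std::vector<int8_t> codes;  // one low-bit code per coordinate
};

// Rotate, then quantize to `bits` bits per coordinate with a per-vector scale.
QuantizedVec rotate_and_quantize(std::vector<float> v, int bits) {
    fwht(v);                                    // rotation spreads energy across coordinates
    float amax = 0.0f;
    for (float x : v) amax = std::max(amax, std::fabs(x));
    const int   qmax  = (1 << (bits - 1)) - 1;  // e.g. 3 bits -> codes in [-3, 3]
    const float scale = amax > 0 ? amax / qmax : 1.0f;
    QuantizedVec q{scale, {}};
    q.codes.reserve(v.size());
    for (float x : v) q.codes.push_back((int8_t)std::lround(x / scale));
    return q;
}
```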

Why It Matters

At long context lengths, dequantization of the key-value cache consumes roughly 40 percent of total decode time, making it a critical bottleneck for local LLM inference. The developer tested 14 different approaches to accelerate dequantization directly — including register lookup tables, SIMD tricks, fused kernels, and branchless math — but none outperformed the baseline because the hardware was already operating at its computational limit.

The breakthrough came from reconsidering the problem entirely. Flash attention computes softmax weights before accessing the value cache, and at long context lengths, the vast majority of those weights are effectively zero. Rather than making dequantization faster, the optimization simply skips value dequantization for positions where attention weights contribute nothing meaningful to the output. The fix required approximately three lines of kernel code — a minimal change that yielded disproportionate performance gains by eliminating work the hardware was doing for no measurable benefit.
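As a rough illustration of where those few lines would sit, the sketch below shows a scalar decode-time value loop that skips dequantization whenever a position's softmax weight falls below a small fraction of the maximum weight. This is not the actual llama.cpp kernel; the quantized block layout, the threshold, and all names are assumptions.

```cpp
// Sketch: skip value dequantization for positions whose attention weight is
// negligible. Illustrative only; not the actual llama.cpp kernel. The
// quantized block layout (scale + 8-bit codes) and the threshold are assumptions.
#include <algorithm>
#include <cstdint>
#include <vector>

struct QuantBlockQ8 {          // hypothetical q8-style block: one scale per vector
    float scale;
    std::vector<int8_t> codes; // head_dim quantized coordinates
};

// Accumulate softmax-weighted values into `out`, dequantizing a cached value
// vector only when its weight is large enough to matter.
void attend_values_sparse(const std::vector<float>        &weights,  // softmax weights, one per cached position
                          const std::vector<QuantBlockQ8> &v_cache,  // quantized value vectors
                          std::vector<float>              &out,      // head_dim accumulator, pre-zeroed
                          float rel_threshold = 1e-4f) {             // skip weights below this fraction of the max
    float w_max = 0.0f;
    for (float w : weights) w_max = std::max(w_max, w);
    const float cutoff = w_max * rel_threshold;

    for (size_t pos = 0; pos < weights.size(); ++pos) {
        const float w = weights[pos];
        if (w < cutoff) continue;               // the minimal change: skip negligible positions entirely

        const QuantBlockQ8 &blk = v_cache[pos]; // dequantize and accumulate only the surviving positions
        for (size_t d = 0; d < out.size(); ++d)
            out[d] += w * blk.scale * blk.codes[d];
    }
}
```

The appeal of the change is that it adds no new data structures: the softmax weights already exist by the time the value cache is read, so the only cost is one comparison per cached position.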

Technical Details

The sparse dequantization approach eliminates roughly 90 percent of value cache dequantization work without affecting output quality. Benchmarks show perplexity remained unchanged while Needle-in-a-Haystack accuracy actually improved from 7/9 to 9/9 with the TurboQuant turbo3 format.

The technique also benefits standard q8_0 KV cache quantization, where it delivered a 5 percent decode speedup with identical perplexity and NIAH scores. On an M2 Pro system, combining a 4-magnitude lookup table on the key side with sparse value dequantization brought turbo3 performance from 0.45x to 0.73x relative to the q8_0 baseline.
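The fork's exact key-side table layout is not spelled out in the post, but the general idea of a small magnitude lookup table can be sketched as follows: precompute scale-times-magnitude once per block so that each element becomes a table read instead of a multiply. The 3-bit sign/magnitude code layout and the magnitude values below are assumptions, not the fork's documented format.

```cpp
// Sketch of a small lookup-table dequantizer for 3-bit key codes: precompute
// scale * magnitude once per block, then turn each element into a table read
// instead of a per-element multiply. Layout and magnitudes are illustrative.
#include <array>
#include <cstdint>
#include <vector>

// Dequantize one block of 3-bit codes (packed one per byte here for clarity).
void dequant_keys_lut(const std::vector<uint8_t> &codes, // bit 2 = sign, bits 0-1 = magnitude index
                      float block_scale,
                      std::vector<float> &out) {
    // Four representative magnitudes per block; values are illustrative.
    static constexpr std::array<float, 4> kMagnitudes = {0.25f, 0.75f, 1.5f, 3.0f};

    std::array<float, 8> lut;                  // 8 entries: 4 magnitudes x 2 signs
    for (int m = 0; m < 4; ++m) {
        lut[m]     =  block_scale * kMagnitudes[m];
        lut[m + 4] = -block_scale * kMagnitudes[m];
    }

    out.resize(codes.size());
    for (size_t i = 0; i < codes.size(); ++i)
        out[i] = lut[codes[i] & 0x7];          // one table read per element, no multiply
}
```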

Community testing also revealed that modern LLMs exhibit extreme magnitude differences between keys and values (up to a 182x ratio in Qwen 2.5-1.5B), which calls for asymmetric bit allocation to get the best compression. Independent testing found that deterministic Walsh-Hadamard transforms outperform random rotations at 4-bit MSE quantization, improving perplexity by a factor of 59.
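One way to act on that observation, sketched below purely as an illustration (the thresholds and bit budget are assumptions, not taken from any fork), is to measure the RMS magnitude of keys versus values and hand the side with the larger dynamic range more bits.

```cpp
// Sketch: measure the RMS magnitude of cached keys vs values and allocate more
// bits to whichever side dominates, since a single shared bit width wastes
// precision when the ratio is extreme. Thresholds below are illustrative.
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

float rms(const std::vector<float> &x) {
    double acc = 0.0;
    for (float v : x) acc += (double)v * v;
    return x.empty() ? 0.0f : (float)std::sqrt(acc / x.size());
}

// Returns (key_bits, value_bits) for a fixed per-pair bit budget.
std::pair<int, int> pick_kv_bits(const std::vector<float> &keys,
                                 const std::vector<float> &values,
                                 int total_bits = 8) {
    const float ratio = rms(keys) / std::max(rms(values), 1e-12f);
    // Larger dynamic range on the key side -> give keys the extra bits.
    if (ratio > 16.0f) return {total_bits - 3, 3};
    if (ratio > 4.0f)  return {total_bits - 4, 4};
    return {total_bits / 2, total_bits - total_bits / 2};
}
```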

Who’s Affected

The optimization is most relevant to developers running large language models locally on consumer hardware, particularly Apple Silicon systems where memory bandwidth is the primary performance constraint. The technique is not specific to TurboQuant — it leverages a general property of attention distributions at long context, making it applicable to any quantized KV cache implementation.

Multiple community forks have already implemented working versions across Metal, CUDA, and Vulkan backends. Validated hardware includes Apple M5 Max and M1 Ultra, NVIDIA RTX 5090 and RTX 3090, the AMD Radeon 7900 XTX, and AMD Strix Halo processors. On the memory side, TurboQuant reduces a Qwen 3.5-35B model's KV buffer from 768 MB at FP16 to 216 MB at 3-bit precision, a 72 percent reduction that enables 700K-token contexts on hardware with just 32 GB of VRAM.
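Those memory numbers are easy to sanity-check: 216 MB over 768 MB is roughly a 72 percent reduction, which works out to about 4.5 effective bits per coordinate once per-block scales and metadata are counted (the exact overhead split is an assumption, not something the post breaks down). A minimal check:

```cpp
// Back-of-the-envelope check of the reported memory savings. The reported
// figures (768 MB FP16, 216 MB quantized) come from the article; the derived
// "effective bits" simply folds all metadata overhead into the per-coordinate cost.
#include <cstdio>

int main() {
    const double fp16_mb  = 768.0;  // reported FP16 KV buffer
    const double quant_mb = 216.0;  // reported 3-bit TurboQuant buffer
    const double reduction      = 1.0 - quant_mb / fp16_mb;
    const double effective_bits = 16.0 * quant_mb / fp16_mb;
    std::printf("reduction: %.1f%%, effective bits/coord: %.2f\n",
                reduction * 100.0, effective_bits);   // ~71.9%, ~4.50 bits
    return 0;
}
```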

What’s Next

A CUDA port is reportedly being tested independently, which would extend the speedup to GPU-based inference. The code and benchmarks are available in the turboquant_plus repository on GitHub. Mainline llama.cpp integration remains pending code review against the project’s contribution guidelines for new quantization types, which require registration of new GGML data formats (TQ3_0 and TQ4_0) and modifications to KV cache read/write paths.

Google’s official TurboQuant implementation is expected around Q2 2026. Until then, the community forks provide functional but unofficial support. The sparse dequantization technique itself is implementation-agnostic and could be adopted by other inference frameworks beyond llama.cpp.
