A proof-of-concept implementation posted to GitHub applies delta encoding — a technique from video compression — to large language model key-value caches, achieving perplexity scores within 0.11% of full-precision F16 performance at Q4_0 storage ratios. The delta-compress-llm repository, published by GitHub user cenconq25, is a fork of the llama.cpp inference framework and was benchmarked on Llama 3.1 70B running on AMD MI50 GPUs. Author details were not available at time of publication.
- Delta-KV stores full-precision keyframes at configurable intervals and quantizes only per-token differences to 4 bits, avoiding most of the quality loss caused by standard Q4 quantization.
- On WikiText-2, standard Q4_0 increased perplexity 5.98% over F16 baseline; Delta-KV at keyframe interval 16 reduced that gap to 0.11%.
- At 2,048 tokens, Q4_0 degradation reached 6.9% vs F16 baseline; Delta-KV held at 0.4%, demonstrating stronger advantages at longer context lengths.
- The project is explicitly a proof of concept; no peer-reviewed paper accompanies it as of April 2, 2026.
What Happened
GitHub user cenconq25 published delta-compress-llm, a fork of the llama.cpp project, demonstrating that delta encoding applied to the KV cache can deliver near-F16 output quality at Q4 storage cost. The repository, which had accumulated 41 stars as of early April 2026, frames the work as a proof of concept and includes benchmarking scripts for perplexity evaluation on WikiText-2. The approach borrows directly from video codec design, where P-frames store only the difference from a prior keyframe rather than a full image.
Why It Matters
KV cache memory consumption is a primary bottleneck in long-context LLM inference on constrained hardware. As context length increases, the KV cache grows linearly, pushing developers toward aggressive quantization. Standard Q4_0 quantization reduces memory use but degrades output quality — in the repository’s own benchmarks, it increased perplexity by 5.98% on WikiText-2, a gap that widens at longer sequences.
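The linear growth is straightforward to quantify. The back-of-envelope sketch below assumes Llama 3.1 70B's commonly published configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128); these numbers are illustrative context, not figures from the repository.

```python
# Rough KV cache size for a Llama-3.1-70B-like configuration.
# Two tensors (K and V) per layer, one vector per KV head per token.
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2.0):  # 2 bytes per value = F16
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

gib = 1024 ** 3
print(f"F16 at 8k context: {kv_cache_bytes(8192) / gib:.2f} GiB")
print(f"~Q4 at 8k context: {kv_cache_bytes(8192, bytes_per_value=0.5) / gib:.2f} GiB")
```

At F16 this works out to 320 KiB of cache per token, which is why context length, not model weights, often becomes the binding constraint on fixed-VRAM hardware.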
The llama.cpp ecosystem provides the practical deployment context. It is widely used to run large models on consumer and research hardware across both NVIDIA CUDA and AMD ROCm backends, meaning a technique that integrates as a llama.cpp fork reaches developers without requiring infrastructure changes.
Technical Details
The method exploits temporal coherence in autoregressive decoding. As the repository states: “During autoregressive decoding, consecutive tokens produce nearly identical KV cache values. The hidden state for ‘The cat sat on the mat’ differs from ‘The cat sat on the rug’ by only ~1% at most dimensions.”
Rather than quantizing absolute KV cache values, Delta-KV stores full-precision keyframes at configurable intervals and quantizes only the difference (delta) between each token's hidden state and the preceding one. Because deltas span roughly one-hundredth the range of the absolute values, applying 4-bit quantization to them produces proportionally smaller errors.
In the repository’s worked example, standard Q4_0 produces a mean absolute error of 0.0332; Delta Q4_0 produces 0.0002 — a 166x reduction in that specific case. The repository’s headline figure of 10,000x error reduction is claimed for more extreme compression scenarios; the directly reproduced benchmark result is a 166x error reduction.
Perplexity benchmarks on Llama 3.1 70B (Q4_K_M), run on four AMD MI50 GPUs with ROCm 6.3.3 against WikiText-2 (20 chunks), recorded:
- F16 baseline: 3.3389
- Delta-KV, keyframe interval 16: 3.3352 (−0.11% vs baseline)
- Standard Q4_0: 3.5385 (+5.98% vs baseline)
At 2,048 tokens, Q4_0 degradation against the F16 baseline reached 6.9% while Delta-KV remained at 0.4%.
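The reported percentages and the 166x figure follow directly from the raw numbers in the repository's benchmarks; the snippet below is a quick arithmetic check, not an independent measurement.

```python
# Percentage deltas from the reported perplexities.
f16, delta_kv, q4_0 = 3.3389, 3.3352, 3.5385
print(f"Delta-KV: {100 * (delta_kv - f16) / f16:+.2f}%")  # -0.11%
print(f"Q4_0:     {100 * (q4_0 - f16) / f16:+.2f}%")      # +5.98%

# Error-reduction ratio from the reported mean absolute errors.
mae_q4, mae_delta = 0.0332, 0.0002
print(f"error reduction: {mae_q4 / mae_delta:.0f}x")      # 166x
```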
Who’s Affected
Developers running models larger than available VRAM on a single GPU — and already using llama.cpp — are the most direct audience. The technique is embedded in the llama.cpp fork, so adoption does not require an architectural overhaul of existing inference pipelines. Researchers working on long-context tasks would see the most pronounced benefit, since the quality gap between Delta-KV and standard Q4 quantization grows with sequence length.
The benchmarks were conducted specifically on AMD MI50 GPUs with ROCm 6.3.3. The MI50 is a mid-range professional GPU released in 2019 and still common in research and academic clusters, making the hardware profile relevant to that community. Compatibility with NVIDIA hardware is not documented in the repository’s current state.
What’s Next
No peer-reviewed paper accompanies the repository as of April 2, 2026, and the project is explicitly described as a proof of concept. The 10,000x headline claim has not been independently verified in published benchmarks, and the delta encoding approach has been tested only on Llama 3.1 70B — performance across other model families, quantization schemes, or hardware backends is not addressed.
The repository includes perplexity benchmarking tooling built on WikiText-2, providing a defined path for independent replication. The keyframe interval is a tunable parameter, and the repository does not yet characterize the quality-versus-memory trade-off across different interval settings in a systematic sweep.
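The memory side of that trade-off can at least be modeled from the stated design alone: one F16 keyframe (16 bits per value) every k tokens, with 4-bit deltas for the remaining k − 1. The sketch below ignores per-block quantization scales and any other metadata, so it is a rough lower bound, not a measurement from the repository.

```python
# Effective storage cost per cached value as the keyframe interval k varies:
# one F16 keyframe (16 bits/value) per k tokens, 4-bit deltas for the rest.
def bits_per_value(k):
    return (16 + (k - 1) * 4) / k

for k in (4, 8, 16, 32):
    print(f"interval {k:2d}: {bits_per_value(k):.2f} bits/value")
```

At the benchmarked interval of 16 this model gives 4.75 bits per value, only modestly above Q4_0's nominal 4 bits, which suggests the quality recovery is not being bought with a large hidden memory cost; a systematic sweep would still need to confirm the quality axis.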