RESEARCH

Researchers Apply Video Compression to LLM KV Cache, Cut Quantization Error 10,000x

megaone_admin · Mar 23, 2026 · 2 min read
Engine Score 8/10 — Important

This story covers a novel research approach that aims to improve LLM inference efficiency by applying video compression techniques to the KV cache, promising a substantial reduction in quantization error. Its high actionability for developers and potential industry impact make it important, despite being early-stage research without external verification.


Researchers have developed a technique called Delta-KV that applies video compression principles to large language model inference, achieving what they claim is 10,000 times less quantization error at the same storage cost as standard Q4 quantization. The work, published in a GitHub repository by user cenconq25, demonstrates the approach on Llama 3.1 70B running on AMD MI50 GPUs.

The technique exploits temporal coherence in LLM inference by compressing differences between consecutive tokens rather than absolute KV cache values. “During autoregressive decoding, consecutive tokens produce nearly identical KV cache values,” the researchers write. “The hidden state for ‘The cat sat on the mat’ differs from ‘The cat sat on the rug’ by only ~1% at most dimensions.”
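As an illustration of the idea (this is a hypothetical sketch, not the authors' implementation), a keyframe/delta scheme over a stream of KV vectors can be written in a few lines; the `keyframe_interval` of 16 matches the setting used in the benchmarks, and the function names are invented for this example:

```python
import numpy as np

def delta_encode(kv_stream, keyframe_interval=16):
    """Keep a full 'keyframe' vector every N tokens; for every other
    token, store only the difference from the previous token."""
    encoded = []
    for i, v in enumerate(kv_stream):
        if i % keyframe_interval == 0:
            encoded.append(("key", v.copy()))                # absolute values
        else:
            encoded.append(("delta", v - kv_stream[i - 1]))  # small residual
    return encoded

def delta_decode(encoded):
    """Rebuild absolute vectors by accumulating deltas onto the last value."""
    out = []
    for kind, payload in encoded:
        out.append(payload if kind == "key" else out[-1] + payload)
    return out

# Toy stream: each token's KV values drift only slightly from the last,
# mimicking the temporal coherence the researchers describe.
rng = np.random.default_rng(0)
stream = [rng.normal(size=128)]
for _ in range(31):
    stream.append(stream[-1] + 0.01 * rng.normal(size=128))

decoded = delta_decode(delta_encode(stream))
assert all(np.allclose(a, b) for a, b in zip(stream, decoded))
```

Without quantization the round trip is lossless; the compression win comes from the next step, where only the small deltas are squeezed into 4 bits.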

In their benchmarks on Llama 3.1 70B (Q4_K_M) using 4x AMD MI50 GPUs with ROCm 6.3.3, Delta-KV achieved perplexity scores nearly identical to F16 baseline performance. On WikiText-2 with 20 chunks, F16 baseline scored 3.3389 perplexity, while Delta-KV with keyframe interval 16 scored 3.3352 (-0.11% vs baseline). Standard Q4_0 quantization scored 3.5385 (+5.98% vs baseline), representing significant quality degradation.

The approach works by storing keyframes at regular intervals and compressing only the small differences between consecutive tokens to 4 bits. The researchers demonstrate that quantization error is proportional to the range of values being quantized, and since deltas have “100x smaller range than absolute values,” the same 4 bits preserve significantly more information. In their example, standard Q4_0 produces 0.0332 error while Delta Q4_0 produces 0.0002 error—166 times less.
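The range argument can be checked with a toy simulation. The snippet below is a simplified sketch using symmetric 4-bit rounding (not llama.cpp's actual Q4_0 block format, and `quantize_q4` is an invented helper): quantizing the small token-to-token delta and adding it back to the previous token's values yields far less error than quantizing the absolute values.

```python
import numpy as np

def quantize_q4(x):
    """Simplified symmetric 4-bit quantization: scale values into the
    integer range [-7, 7], round, then scale back (dequantize)."""
    scale = max(np.abs(x).max() / 7.0, 1e-12)
    return np.clip(np.round(x / scale), -7, 7) * scale

rng = np.random.default_rng(0)
prev = rng.normal(size=4096)                # KV values for token t-1
curr = prev + 0.01 * rng.normal(size=4096)  # token t drifts by ~1%

# Quantize the absolute values directly.
err_abs = np.abs(quantize_q4(curr) - curr).mean()

# Quantize only the delta, then reconstruct from the previous token.
reconstructed = prev + quantize_q4(curr - prev)
err_delta = np.abs(reconstructed - curr).mean()

print(f"absolute Q4 error: {err_abs:.5f}")
print(f"delta Q4 error:    {err_delta:.5f}")
```

Because the rounding error is proportional to the range being quantized, the delta path wins by roughly the ratio of the two ranges; the exact 166x and 10,000x figures in the repository depend on the real value distributions and block format.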

Long context testing showed Delta-KV maintains performance advantages as context length increases. At 2048 tokens, standard Q4_0 showed 6.9% degradation from F16 baseline while Delta-KV showed only 0.4% degradation. The implementation is built as a fork of the llama.cpp project and includes benchmarking tools for perplexity evaluation on WikiText-2 datasets.


MegaOne AI Editorial Team

MegaOne AI monitors 200+ sources daily to identify and score the most important AI developments, with rigorous editorial oversight. Every story is fact-checked, linked to primary sources, and rated using our six-factor Engine Score methodology.
