Researchers have developed a technique called Delta-KV that applies video compression principles to large language model inference, which they claim achieves up to 10,000 times less quantization error at the same storage cost as standard Q4 quantization. The work, published in a GitHub repository by user cenconq25, demonstrates the approach on Llama 3.1 70B running on AMD MI50 GPUs.
The technique exploits temporal coherence in LLM inference by compressing differences between consecutive tokens rather than absolute KV cache values. “During autoregressive decoding, consecutive tokens produce nearly identical KV cache values,” the researchers write. “The hidden state for ‘The cat sat on the mat’ differs from ‘The cat sat on the rug’ by only ~1% at most dimensions.”
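The keyframe-plus-delta idea can be sketched in a few lines. This is an illustrative toy, not the authors' llama.cpp implementation; the function names and the choice of NumPy arrays standing in for per-token KV rows are assumptions for demonstration.

```python
import numpy as np

def compress(kv_rows, keyframe_interval=16):
    """Store a full row (keyframe) every `keyframe_interval` tokens;
    for all other tokens store only the difference from the previous
    token's row, which is typically much smaller in magnitude."""
    stored = []
    for i, row in enumerate(kv_rows):
        if i % keyframe_interval == 0:
            stored.append(("key", row.copy()))
        else:
            stored.append(("delta", row - kv_rows[i - 1]))
    return stored

def decompress(stored):
    """Reconstruct each row as the last keyframe plus cumulative deltas."""
    out = []
    for kind, payload in stored:
        if kind == "key":
            out.append(payload.copy())
        else:
            out.append(out[-1] + payload)
    return out

# Round-trip demo: 40 synthetic "tokens", each row drifting slightly
rng = np.random.default_rng(1)
rows = [rng.normal(size=8)]
for _ in range(39):
    rows.append(rows[-1] + rng.normal(scale=0.01, size=8))

recovered = decompress(compress(rows))
assert all(np.allclose(a, b) for a, b in zip(rows, recovered))
```

In this sketch the round trip is lossless because the deltas are kept in full precision; the actual technique quantizes the deltas to 4 bits, which is where the storage savings come from.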
In their benchmarks on Llama 3.1 70B (Q4_K_M) using 4x AMD MI50 GPUs with ROCm 6.3.3, Delta-KV achieved perplexity nearly identical to the F16 baseline. On WikiText-2 with 20 chunks, the F16 baseline scored 3.3389 perplexity, while Delta-KV with keyframe interval 16 scored 3.3352 (-0.11% vs baseline). Standard Q4_0 quantization scored 3.5385 (+5.98% vs baseline), a significant quality degradation.
The approach works by storing keyframes at regular intervals and compressing only the small differences between consecutive tokens to 4 bits. The researchers demonstrate that quantization error is proportional to the range of values being quantized, and since deltas have “100x smaller range than absolute values,” the same 4 bits preserve significantly more information. In their example, standard Q4_0 produces 0.0332 error while Delta Q4_0 produces 0.0002 error—166 times less.
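The range-dependence of quantization error can be checked numerically. The snippet below uses a simplified symmetric 4-bit quantizer (an assumption for illustration, not llama.cpp's exact Q4_0 layout) and synthetic data whose delta distribution is 100x narrower than the absolute values, mirroring the ratio described above; the specific error figures it prints will differ from the repository's.

```python
import numpy as np

def q4_roundtrip(x):
    """Simplified symmetric 4-bit quantization: one scale per block,
    values rounded to integers in [-8, 7], then dequantized."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale

rng = np.random.default_rng(0)

# Absolute KV values: spread on the order of ~1
absolute = rng.normal(0.0, 1.0, 4096)
# Token-to-token deltas: ~100x smaller spread
delta = rng.normal(0.0, 0.01, 4096)

err_abs = np.abs(absolute - q4_roundtrip(absolute)).mean()
err_delta = np.abs(delta - q4_roundtrip(delta)).mean()

print(f"mean error, absolute values: {err_abs:.4f}")
print(f"mean error, deltas:          {err_delta:.6f}")
print(f"ratio: {err_abs / err_delta:.0f}x")
```

Because the quantization step is the scale (proportional to the value range) divided by the number of levels, shrinking the range by ~100x shrinks the error by roughly the same factor at the same bit width.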
Long-context testing showed that Delta-KV's advantage holds as context length increases. At 2048 tokens, standard Q4_0 showed 6.9% degradation from the F16 baseline while Delta-KV showed only 0.4% degradation. The implementation is built as a fork of the llama.cpp project and includes benchmarking tools for perplexity evaluation on WikiText-2.
