---
datasets:
- nvidia/Nemotron-Pretraining-Dataset-sample
library_name: transformers
license: apache-2.0
pipeline_tag: other
tags:
- nvidia
- pytorch
track_downloads: true
---

# KVzap

[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![GitHub](https://img.shields.io/badge/GitHub-kvpress-blue?logo=github)](https://github.com/NVIDIA/kvpress/tree/main/kvzap)
[![KVzap collection](https://img.shields.io/badge/🤗%20Hugging%20Face-Collection-orange)](https://huggingface.co/collections/nvidia/kvzap)
[![arXiv](https://img.shields.io/badge/arXiv-2601.07891-b31b1b.svg)](https://huggingface.co/papers/2601.07891)

[KVzap](https://arxiv.org/abs/2601.07891) is a fast, adaptive, and faithful KV cache pruning method that aims to accelerate LLM inference during both prefilling and decoding. It applies a lightweight model to the hidden states to predict an importance score for every KV pair, then prunes the pairs whose score falls below a given threshold, following the Dynamic Memory Sparsification ([DMS](https://huggingface.co/papers/2506.05345)) inference strategy.
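The score-then-threshold idea described above can be illustrated with a small, dependency-free sketch. It is not the kvpress implementation: the linear scorer, its weights, and the toy hidden states are all made-up stand-ins for the trained KVzap model, and the sketch handles a single attention head only.

```python
from typing import List, Sequence


def score_tokens(hidden_states: Sequence[Sequence[float]],
                 weights: Sequence[float],
                 bias: float) -> List[float]:
    """Hypothetical linear scorer: one importance score per token."""
    return [sum(w * h for w, h in zip(weights, hidden)) + bias
            for hidden in hidden_states]


def prune_by_threshold(scores: Sequence[float], threshold: float) -> List[int]:
    """Return the positions whose score is at or above the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


# Toy inputs: 5 tokens with 3-dimensional hidden states (made-up values).
hidden_states = [
    [1.0, 0.0, 2.0],
    [-3.0, 1.0, 0.0],
    [0.5, 0.5, 0.5],
    [-2.0, -2.0, -1.0],
    [0.0, 3.0, 1.0],
]
weights = [1.0, 0.5, 1.0]  # made-up parameters, not trained KVzap weights
bias = -2.0
threshold = -4.0

scores = score_tokens(hidden_states, weights, bias)
kept = prune_by_threshold(scores, threshold)

# Fraction of KV pairs removed for this toy input.
compression_ratio = 1 - len(kept) / len(scores)
print(f"scores={scores}, kept={kept}, compression ratio={compression_ratio:.0%}")
```

Because pairs are dropped by comparing scores to a threshold rather than by filling a fixed budget, the number of retained pairs in this sketch depends on the input itself.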

The method was introduced in the paper [KVzap: Fast, Adaptive, and Faithful KV Cache Pruning](https://huggingface.co/papers/2601.07891).

KVzap is trained as a fast approximation of [KVzip+](https://arxiv.org/abs/2505.23416), using 1.2M samples from [Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample). Training code is available in the [kvpress repository](https://github.com/NVIDIA/kvpress/blob/main/kvzap).

## Usage

KVzap is used through the [kvpress](https://github.com/NVIDIA/kvpress) library and its custom `KVPressTextGenerationPipeline`, which is registered automatically as a transformers pipeline named `kv-press-text-generation` when `kvpress` is imported:

```python
import requests
from transformers import pipeline
from kvpress import KVzapPress, DMSPress

model = "Qwen/Qwen3-8B"
pipe = pipeline("kv-press-text-generation", model=model, device_map="auto", dtype="auto")
press = DMSPress(KVzapPress(model_type="mlp"), threshold=-4)

# Prefilling compression only, thinking disabled
press.decoding = False
# The context is the raw HTML of the arXiv abstract page
context = requests.get("https://arxiv.org/abs/2601.07891").text
question = "\n What is this article about in 2 sentences?"
answer = pipe(context, question=question, press=press)["answer"]
print(f"Compression ratio: {press.compression_ratio:.2%}\nAnswer: {answer}")

# Prefilling and decoding compression, thinking enabled
press.decoding = True
prompt = "What is the best hardware to run LLMs and why?"
answer = pipe(prompt, press=press, enable_thinking=True, max_new_tokens=2000)["answer"]
print(f"Compression ratio: {press.compression_ratio:.2%}\nAnswer: {answer}")
```
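To put the printed compression ratio in perspective, the following back-of-the-envelope sketch estimates the KV cache size before and after pruning. It rests on several assumptions that are not stated in this card: that `compression_ratio` is the fraction of KV pairs removed, that pruned pairs are actually freed from memory, and that the model has 36 layers, 8 KV heads, and a head dimension of 128 stored in 16-bit precision (check these against the model's `config.json`). The 32,768-token context and the 75% ratio are illustrative values, not measured results.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of an unpruned KV cache for one sequence, in bytes."""
    # Factor 2: one key vector and one value vector per token, head, and layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def to_gib(num_bytes: float) -> float:
    """Convert bytes to GiB."""
    return num_bytes / 2**30


# Assumed model shape; verify against the actual model configuration.
config = {"num_layers": 36, "num_kv_heads": 8, "head_dim": 128}

full_bytes = kv_cache_bytes(num_tokens=32768, **config)
compression_ratio = 0.75  # illustrative value, not a measured result
kept_bytes = full_bytes * (1 - compression_ratio)

print(f"Full cache:   {to_gib(full_bytes):.3f} GiB")
print(f"Pruned cache: {to_gib(kept_bytes):.3f} GiB")
```

Under these assumptions the cache costs 144 KiB per token, so the savings grow linearly with context length.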

## Citation

If you use KVzap in your research, please cite the following paper:

```bibtex
@article{jegou2025kvzap,
  title={KVzap: Fast, Adaptive, and Faithful KV Cache Pruning},
  author={Jegou, Simon and Jeblick, Maximilian},
  journal={arXiv preprint arXiv:2601.07891},
  year={2026},
  url={https://arxiv.org/abs/2601.07891}
}
```