---
datasets:
- nvidia/Nemotron-Pretraining-Dataset-sample
library_name: transformers
license: apache-2.0
pipeline_tag: other
tags:
- nvidia
- pytorch
track_downloads: true
---

# KVzap

[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![GitHub](https://img.shields.io/badge/GitHub-kvpress-blue?logo=github)](https://github.com/NVIDIA/kvpress/tree/main/kvzap)
[![KVzap collection](https://img.shields.io/badge/🤗%20Hugging%20Face-Collection-orange)](https://huggingface.co/collections/nvidia/kvzap)
[![arXiv](https://img.shields.io/badge/arXiv-2601.07891-b31b1b.svg)](https://huggingface.co/papers/2601.07891)

[KVzap](https://arxiv.org/abs/2601.07891) is a fast, adaptive, and faithful KV cache pruning method that aims to accelerate LLM inference during both prefilling and decoding. It applies a lightweight model to the hidden states to predict an importance score for every KV pair, then prunes the pairs whose score falls below a given threshold, following the Dynamic Memory Sparsification ([DMS](https://huggingface.co/papers/2506.05345)) inference strategy. The method was introduced in the paper [KVzap: Fast, Adaptive, and Faithful KV Cache Pruning](https://huggingface.co/papers/2601.07891).

KVzap is trained as a fast approximation of [KVzip+](https://arxiv.org/abs/2505.23416), using 1.2M samples from [Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample). Training code is available in the [kvpress repository](https://github.com/NVIDIA/kvpress/blob/main/kvzap).
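
To make the threshold-based pruning step concrete, here is a minimal, illustrative sketch in plain Python. It is not the kvpress implementation: the linear scorer stands in for the lightweight model, the helper names `linear_scores` and `prune_by_threshold` are hypothetical, and the hidden states, weights, and bias are toy values chosen by hand. It only shows the idea described above: score every KV pair from its hidden state, then drop the pairs whose score is below the threshold.

```python
from typing import List, Sequence


def linear_scores(
    hidden_states: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
) -> List[float]:
    """Hypothetical stand-in for the lightweight scorer: one score per token."""
    return [
        sum(h_i * w_i for h_i, w_i in zip(hidden, weights)) + bias
        for hidden in hidden_states
    ]


def prune_by_threshold(scores: Sequence[float], threshold: float) -> List[int]:
    """Return the indices of the KV pairs to keep (score >= threshold)."""
    return [i for i, score in enumerate(scores) if score >= threshold]


# Toy inputs (assumptions for illustration only): 4 tokens, hidden size 2.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [-2.0, -2.0], [0.5, 0.5]]
weights = [2.0, 1.0]
bias = -1.0

scores = linear_scores(hidden_states, weights, bias)
kept = prune_by_threshold(scores, threshold=-4.0)

# Fraction of KV pairs removed from the cache.
compression_ratio = 1 - len(kept) / len(scores)

print(f"Scores: {scores}")
print(f"Kept indices: {kept}")
print(f"Compression ratio: {compression_ratio:.2%}")
```

Because pruning is driven by a threshold rather than a fixed budget, the number of KV pairs that survive depends on the scores of the input at hand.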

## Usage

KVzap can be used with the [kvpress](https://github.com/NVIDIA/kvpress) library, through the custom `KVPressTextGenerationPipeline`, which is automatically registered as a transformers pipeline with the name `kv-press-text-generation` when `kvpress` is imported:

```python
import requests
from transformers import pipeline
from kvpress import KVzapPress, DMSPress

model = "Qwen/Qwen3-8B"
pipe = pipeline("kv-press-text-generation", model=model, device_map="auto", dtype="auto")
press = DMSPress(KVzapPress(model_type="mlp"), threshold=-4)

# Prefilling compression only, thinking disabled
press.decoding = False
context = requests.get("https://arxiv.org/abs/2601.07891").text
question = "\n What is this article about in 2 sentences ?"
answer = pipe(context, question=question, press=press)["answer"]
print(f"Compression ratio: {press.compression_ratio:.2%}\nAnswer: {answer}")

# Prefilling and decoding compression, thinking enabled
press.decoding = True
prompt = "What is the best hardware to run LLMs and why ?"
answer = pipe(prompt, press=press, enable_thinking=True, max_new_tokens=2000)["answer"]
print(f"Compression ratio: {press.compression_ratio:.2%}\nAnswer: {answer}")
```

## Citation

If you use KVzap in your research, please cite the following paper:

```bibtex
@article{jegou2025kvzap,
  title={KVzap: Fast, Adaptive, and Faithful KV Cache Pruning},
  author={Jegou, Simon and Jeblick, Maximilian},
  journal={arXiv preprint arXiv:2601.07891},
  year={2025},
  url={https://arxiv.org/abs/2601.07891}
}
```