---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# GlimpsePrune: Dynamic Visual Token Pruning for Large Vision-Language Models

**GlimpsePrune** is a dynamic visual token pruning framework designed for Large Vision-Language Models (LVLMs). This model was presented in the paper [A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models](https://huggingface.co/papers/2508.01548).

Existing methods for visual token compression typically adopt fixed compression ratios, which cannot adapt to scenes of varying complexity. The resulting imprecise pruning often discards informative visual tokens and degrades model performance. Inspired by human cognition, GlimpsePrune addresses this issue by taking a data-driven "glimpse" and pruning irrelevant visual tokens in a single forward pass before answer generation. This approach prunes 92.6% of visual tokens while, on average, fully retaining the baseline performance on free-form VQA tasks. The reduced computational cost also enables more effective fine-tuning: an enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate. Our work opens a new path toward more powerful and efficient LVLMs.

For the official code and more details, please refer to the [GitHub repository](https://github.com/HVision-NKU/GlimpsePrune).

<div align="center">
  <img src="https://github.com/HVision-NKU/GlimpsePrune/raw/main/assets/case1.png" width="80%">
  <img src="https://github.com/HVision-NKU/GlimpsePrune/raw/main/assets/case2.png" width="80%">
  <br>
  <em>GlimpsePrune dynamically prunes a large number of irrelevant visual tokens before answering questions, reducing the model's inference overhead.</em>
</div>

## ✨ Key Features

-   **High Pruning Rate**: Prunes over **90%** of visual tokens on average with almost no performance loss, effectively reducing computational and memory overhead.
-   **Robust Performance**: Maintains stable performance on high-resolution images and complex **free-form VQA** tasks.
-   **Lightweight Training**: Only a few extra parameters (the glimpse token and VIP) need to be trained; training completes in less than 1 hour on a single A100 GPU.
-   **Broad Compatibility**: Supports single and multi-image inputs, is compatible with KV-Cache and Flash Attention 2, and provides a fair comparison benchmark with other mainstream visual compression methods.

## 🖼️ Framework Overview

The core idea of GlimpsePrune is to introduce a **glimpse token** and a lightweight **Visual tokens Important Predictor (VIP)** that can quickly identify and retain the visual regions most relevant to the text prompt, pruning the remaining redundant information.

<div align="center">
  <img src="https://github.com/HVision-NKU/GlimpsePrune/raw/main/assets/framework.png" width="70%">
</div>
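
To illustrate how a dynamic criterion differs from a fixed compression ratio, the following minimal sketch prunes tokens by thresholding per-token importance scores. It is not the GlimpsePrune implementation: the scores are hypothetical stand-ins for what the VIP would predict, and the function names, the `0.5` threshold, and the `0.25` keep ratio are assumptions chosen purely for illustration.

```python
def prune_dynamic(scores, threshold=0.5):
    """Keep indices of tokens whose importance score reaches the threshold.

    The number of kept tokens depends on the scores themselves, so it adapts
    to each image-question pair. (Hypothetical helper, for illustration only.)
    """
    return [i for i, s in enumerate(scores) if s >= threshold]


def prune_fixed_ratio(scores, keep_ratio=0.25):
    """Keep a fixed fraction of tokens (the top-scoring ones), for contrast."""
    k = max(1, round(len(scores) * keep_ratio))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Hypothetical importance scores for 8 visual tokens in two scenes.
simple_scene = [0.9, 0.1, 0.05, 0.1, 0.0, 0.2, 0.1, 0.05]  # one relevant region
complex_scene = [0.9, 0.8, 0.1, 0.7, 0.6, 0.2, 0.75, 0.1]  # many relevant regions

print(prune_dynamic(simple_scene))       # [0]
print(prune_dynamic(complex_scene))      # [0, 1, 3, 4, 6]
print(prune_fixed_ratio(simple_scene))   # [0, 5]
print(prune_fixed_ratio(complex_scene))  # [0, 1]
```

In this toy setting, the fixed ratio keeps a low-scoring token in the simple scene and drops three high-scoring tokens in the complex one, whereas the threshold keeps one token and five tokens respectively.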

## 📊 Performance Results

We evaluated GlimpsePrune on multiple VQA benchmarks. The results show that it achieves a high pruning rate while maintaining performance on par with the original model, outperforming other visual compression methods.

<p align="center">
  <b>Free-form VQA Benchmarks</b><br>
  <img src="https://github.com/HVision-NKU/GlimpsePrune/raw/main/assets/freeform_results.png" width="90%">
</p>

<p align="center">
  <b>Short-form VQA Benchmarks</b><br>
  <img src="https://github.com/HVision-NKU/GlimpsePrune/raw/main/assets/shortform_results.png" width="90%">
</p>

## 📦 Models and Data

### Model Download
All models can be automatically downloaded from the Hugging Face Hub. `<new_module>` refers to the weights of the extra glimpse token and VIP modules we trained on top of `<base_model>`.

|`<base_model>`| `<new_module>` |
|:---:|:---:|
|[Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)|[ashun989/GlimpsePrune_Qwen2.5-VL-3B-Instruct](https://huggingface.co/ashun989/GlimpsePrune_Qwen2.5-VL-3B-Instruct)|
|[Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)|[ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct](https://huggingface.co/ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct)|
|[liuhaotian/llava-v1.5-7b](https://huggingface.co/liuhaotian/llava-v1.5-7b)|[ashun989/GlimpsePrune_LLaVA-1.5-7B](https://huggingface.co/ashun989/GlimpsePrune_LLaVA-1.5-7B)|
|[liuhaotian/llava-v1.5-13b](https://huggingface.co/liuhaotian/llava-v1.5-13b)|[ashun989/GlimpsePrune_LLaVA-1.5-13B](https://huggingface.co/ashun989/GlimpsePrune_LLaVA-1.5-13B)|

## ▶️ How to Use

You can use GlimpsePrune through the `transformers_gp` package, which is available in the [GitHub repository](https://github.com/HVision-NKU/GlimpsePrune).


```python
from transformers_gp.models.qwen2_5_vl import (
    Qwen2_5_VL_GP_ForConditionalGeneration,
    Qwen2_5_VL_GP_Processor
)
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

# Load the model and processor
base_model_name = "Qwen/Qwen2.5-VL-7B-Instruct"
new_model_name = "ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct"

model = Qwen2_5_VL_GP_ForConditionalGeneration.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map={"": "cuda:0"},
)
processor = Qwen2_5_VL_GP_Processor.from_pretrained(base_model_name)
# Load the trained glimpse token and VIP weights on top of the base model
model.load_new_modules(new_model_name)
model.eval()

# Prepare messages (image and text input)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "../examples/people.png", # Placeholder: replace with your image path
            },
            {"type": "text", "text": "What kind of a tie is the groom wearing?"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate output
model.reset_image_tokens_cache()  # NOTE: reset the cache before inference
with torch.inference_mode():
    # Enable GlimpsePrune with do_selection=True
    generated_ids = model.generate(**inputs, max_new_tokens=1024, do_selection=True)

# Decode and print the response
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
)
question = messages[0]["content"][1]["text"]  # the text part of the user message
print(f"User: {question}\nAssistant: {output_text[0]}")
```
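
The decoding step above slices each generated sequence so that only the newly generated tokens are decoded, because `generate` on a decoder-only model returns the prompt followed by the continuation. The sketch below reproduces that slicing on plain Python lists. The token IDs are made up for illustration, and `trim_prompt` is a hypothetical helper, not part of `transformers_gp`.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: each output starts with its prompt, followed by new tokens.
input_ids = [[101, 7, 8], [101, 9]]
generated_ids = [[101, 7, 8, 42, 43], [101, 9, 55]]

print(trim_prompt(input_ids, generated_ids))  # [[42, 43], [55]]
```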

## 🖊️ Citation

If you find our work helpful, please consider citing our paper:
```bibtex
@misc{zeng2025glimpseprune,
      title={A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models}, 
      author={Quan-Sheng Zeng and Yunheng Li and Qilong Wang and Peng-Tao Jiang and Zuxuan Wu and Ming-Ming Cheng and Qibin Hou},
      year={2025},
      eprint={2508.01548},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.01548}, 
}
```