---
library_name: transformers
license: mit
base_model: Qwen/Qwen2.5-VL-3B-Instruct
arxiv: 2602.21202
datasets:
- vidore/colpali_train_set
language:
- en
tags:
- visual-document-retrieval
- multi-vector
- late-interaction
- colbert
- index-compression
- memory-tokens
- text-to-image
- feature-extraction
---

# MemTok-Qwen2.5-VL-3B (ViDoRe)

This model uses **Memory Tokens (MemTok)** to compress multi-vector visual document representations for efficient ColBERT-style late interaction retrieval. Model weights are initialized from [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) and finetuned on the [ColPali train set](https://huggingface.co/datasets/vidore/colpali_train_set) for text-to-visual-document retrieval with bidirectional attention.

MemTok compresses ~1300 visual document token vectors into a fixed budget of **64 vectors** (95.1% compression) via learnable memory tokens that aggregate document information through attention.

[![arXiv](https://img.shields.io/badge/arXiv-2602.21202-b31b1b.svg)](https://arxiv.org/abs/2602.21202)
[![GitHub](https://img.shields.io/badge/GitHub-omni--col--press-blue?logo=github)](https://github.com/hanxiangqin/omni-col-press)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

## Method Overview

MemTok appends a set of *m* learnable memory tokens to the document token sequence. The concatenated sequence is encoded with a bidirectional transformer; after self-attention, each memory token has attended over the full document. The final hidden states of the *m* memory tokens form the compressed multi-vector representation used for ColBERT-style MaxSim retrieval.
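The MaxSim scoring step and the compression figure quoted above can be illustrated with a small pure-Python sketch. It is not the repository's implementation: `l2_normalize`, `maxsim_score`, and `compression_ratio` are hypothetical helper names, the toy vectors are invented for illustration, and 1297 is the uncompressed token count reported for the baseline in the results table.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(vec: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """ColBERT-style MaxSim: for each query vector, take the highest dot
    product against any document vector, then sum those maxima."""
    q = [l2_normalize(v) for v in query_vecs]
    d = [l2_normalize(v) for v in doc_vecs]
    return sum(max(sum(a * b for a, b in zip(qv, dv)) for dv in d) for qv in q)


def compression_ratio(num_original: int, budget: int) -> float:
    """Fraction of document vectors removed when only `budget` are kept."""
    return 1.0 - budget / num_original


# Toy example (assumed data): 2 query vectors scored against
# 3 stand-in "memory token" vectors of dimension 2.
query = [[1.0, 0.0], [0.0, 2.0]]
memory = [[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(maxsim_score(query, memory))
print(f"{compression_ratio(1297, 64):.1%}")
```

Because the score sums one maximum per query vector, its cost scales with the number of stored document vectors, which is why a fixed budget of 64 memory-token vectors reduces both index size and scoring work compared with keeping every visual token.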
![Method](https://cdn-uploads.huggingface.co/production/uploads/683d3519fabaa8b10395bf8a/eC_E_E9kwuj_hp-5xwc2B.png)

## Results on ViDoRe v2

| Method | Tokens | nDCG@5 (Avg) | Bio | Econ | ESG-R | ESG-H |
|---|---|---|---|---|---|---|
| ColPali | – | 53.3 | 56.5 | 49.9 | 55.7 | 51.1 |
| ColQwenOmni | – | 56.5 | 56.5 | 53.2 | 54.2 | 62.2 |
| MetaEmbed | 64 | 58.8 | 58.7 | 55.5 | 57.4 | 63.7 |
| Baseline (Ours, uncompressed) | 1297 | 60.0 | 61.4 | 53.9 | 57.0 | 67.6 |
| SeqResize | 64 | 51.7 | 54.7 | 53.5 | 45.2 | 53.5 |
| **MemTok (This model)** | **64** | **54.3** | **56.8** | **53.0** | **46.4** | **61.4** |
| H-Pool | 64 | 56.4 | 59.6 | 52.1 | 53.4 | 60.6 |
| AGC | 64 | 56.7 | 59.0 | 54.5 | 55.8 | 57.3 |

## Model Details

| | |
|---|---|
| **Initial weights** | Qwen2.5-VL-3B-Instruct |
| **Architecture** | Qwen2.5-VL with bidirectional attention |
| **Hidden dimension** | 2048 |
| **Compression method** | MemTok (memory tokens) |
| **Memory tokens** | 64 learned tokens (`<\|mem0\|>` – `<\|mem63\|>`) appended to document |
| **Budget** | 64 vectors per document |
| **Scoring** | ColBERT-style MaxSim (late interaction) |
| **Normalization** | L2-normalized embeddings |
| **Query prefix** | `"Query: "` |
| **Passage prefix** | `"Passage: "` |
| **Precision** | bfloat16 |
| **Max image tokens** | 1280 |

## Usage

```python
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

from src.arguments import ModelArguments
from src.encoder.multivec_encoder import MultiVecEncoder
from src.models.qwen2_5_vl_embed.qwen2_5_vl_embed import Qwen2_5ForEmbedding
from src.utils import get_appending_token_strings

MODEL_ID = "hltcoe/MemTok_qwen2.5-vl_colpali"
IMAGE_PATH = "PLACEHOLDER"
NUM_MEMORY_TOKENS = 64
APPENDING_SUFFIX = "".join(get_appending_token_strings(NUM_MEMORY_TOKENS))

# --- Setup ---
model_args = ModelArguments(
    model_name_or_path=MODEL_ID,
    pooling="memory",
    normalize=True,
    num_appending_token=NUM_MEMORY_TOKENS,
    use_parametric_appending_tokens=True,
    attn_implementation="flash_attention_2",
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = MultiVecEncoder.load(
    Qwen2_5ForEmbedding,
    model_args,
    attn_implementation=model_args.attn_implementation,
    dtype=torch.bfloat16,
)
model = model.to("cuda").eval()

# --- Encode an image document ---
passage_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Passage: "},
            {"type": "image", "image": IMAGE_PATH, "max_pixels": 1003520, "min_pixels": 614656},
        ],
    }
]
text = processor.apply_chat_template(passage_messages, tokenize=False, add_generation_prompt=False)
text += APPENDING_SUFFIX
image_inputs, video_inputs = process_vision_info(passage_messages)
passage_inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        doc_embeddings, doc_mask = model.encode(passage_inputs, is_query=False)

print(doc_embeddings.shape)
# doc_embeddings: (1, 64, 2048), i.e. 64 MemTok vectors

# --- Encode a text query ---
query_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Query: What types of tissues are unable to regenerate spontaneously?"}
        ],
    }
]
query_text = processor.apply_chat_template(query_messages, tokenize=False, add_generation_prompt=False)
query_inputs = processor(text=[query_text], padding=True, return_tensors="pt").to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        query_embeddings, query_mask = model.encode(query_inputs, is_query=True)

print(query_embeddings.shape)

# --- ColBERT MaxSim scoring ---
score = model.compute_similarity(query_embeddings, doc_embeddings, query_mask, doc_mask)
print(f"Similarity score: {score.item():.4f}")
```

## Command line usage

For running inference and evaluation from the command line, see the [Quick Start](https://github.com/HanxiangQin/omni-col-press?tab=readme-ov-file#quick-start-1) section.

## Citation

```bibtex
@misc{qin2026multivectorindexcompressionmodality,
  title={Multi-Vector Index Compression in Any Modality},
  author={Hanxiang Qin and Alexander Martin and Rohan Jha and Chunsheng Zuo and Reno Kriz and Benjamin Van Durme},
  year={2026},
  eprint={2602.21202},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2602.21202},
}
```