Image-Text-to-Text
Safetensors
ColPali
sauerkrautlm-colpali
qwen3_vl
document-retrieval
vision-language-model
multi-vector
late-interaction
visual-retrieval
qwen3-vl
mteb
vidore
Instructions to use VAGOsolutions/SauerkrautLM-ColQwen3-2b-v0.1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- ColPali
How to use VAGOsolutions/SauerkrautLM-ColQwen3-2b-v0.1 with ColPali:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- Notebooks
- Google Colab
- Kaggle
| language: | |
| - en | |
| - de | |
| - fr | |
| - es | |
| - it | |
| - pt | |
| license: apache-2.0 | |
| library_name: sauerkrautlm-colpali | |
| tags: | |
| - document-retrieval | |
| - vision-language-model | |
| - multi-vector | |
| - colpali | |
| - late-interaction | |
| - visual-retrieval | |
| - qwen3-vl | |
| - mteb | |
| - vidore | |
| base_model: Qwen/Qwen3-VL-2B | |
| pipeline_tag: image-text-to-text | |
| datasets: | |
| - vidore/colpali_train_set | |
| - openbmb/VisRAG-Ret-Train-In-domain-data | |
| - llamaindex/vdr-multilingual-train | |
| metrics: | |
| - ndcg_at_5 | |
| # SauerkrautLM-ColQwen3-2b-v0.1 | |
| <p align="center"> | |
| <img src="https://vago-solutions.ai/wp-content/uploads/2025/12/Sauerkrautlm-colpali-scaled.png" alt="VAGO Solutions Logo" width="75%"/> | |
| </p> | |
| **๐ฅ Best 128-dim Model in Medium (1-3B) Category** | **+1.01 over ColQwen2** | |
| SauerkrautLM-ColQwen3-2b-v0.1 achieves **90.24 NDCG@5** on ViDoRe v1, making it the **#1 in the Medium (1-3B) category** among 128-dim models - a significant **+1.01 improvement** over the baseline ColQwen2-v1.0. | |
| <p align="center"> | |
| <img src="https://raw.githubusercontent.com/VAGOsolutions/sauerkrautlm-colpali/main/assets/benchmark_128dim_v1.png" alt="ViDoRe v1 Benchmark - 128-dim Models" width="100%"/> | |
| </p> | |
| ## ๐ฏ Why Visual Document Retrieval? | |
| Traditional OCR-based retrieval **loses layout, tables, and visual context**. Our visual approach: | |
| - โ **No OCR errors** - Direct visual understanding | |
| - โ **Layout-aware** - Understands tables, forms, charts | |
| - โ **End-to-end** - Single model, no pipeline complexity | |
| ## ๐ Key Achievements | |
| | Benchmark | Score | Rank (128-dim) | | |
| |-----------|-------|----------------| | |
| | **ViDoRe v1** | **90.24** | **#5** | | |
| | MTEB v1+v2 | 81.02 | #6 | | |
| | ViDoRe v3 | 54.32 | #5 | | |
| ### Medium Category Comparison (1-3B, 128-dim) | |
| | Model | Params | Dim | ViDoRe v1 | MTEB v1+v2 | ViDoRe v3 | | |
| |-------|--------|-----|-----------|------------|-----------| | |
| | **SauerkrautLM-ColQwen3-2b-v0.1** โญ | 2.2B | 128 | **90.24** | **81.02** | **54.32** | | |
| | colqwen2-v1.0 | 2.2B | 128 | 89.23 | 79.74 | 44.18 | | |
| | SauerkrautLM-ColQwen3-1.7b-Turbo-v0.1 | 1.7B | 128 | 88.89 | 77.94 | 48.76 | | |
| *#1 in Medium category on all three benchmarks!* | |
| ### Detailed Benchmark Results | |
| <details> | |
| <summary><b>๐ ViDoRe v1 (NDCG@5) - Click to expand</b></summary> | |
| | Task | Score | | |
| |------|-------| | |
| | ArxivQA | 91.24 | | |
| | DocVQA | 65.06 | | |
| | InfoVQA | 93.14 | | |
| | ShiftProject | 88.74 | | |
| | SyntheticDocQA-AI | 99.63 | | |
| | SyntheticDocQA-Energy | 96.91 | | |
| | SyntheticDocQA-Gov | 96.08 | | |
| | SyntheticDocQA-Health | 99.26 | | |
| | TabFQuAD | 90.32 | | |
| | TATDQA | 82.06 | | |
| | **Average** | **90.24** | | |
| </details> | |
| <details> | |
| <summary><b>๐ MTEB v1+v2 (NDCG@5) - Click to expand</b></summary> | |
| **ViDoRe v1 Tasks:** | |
| | Task | Score | | |
| |------|-------| | |
| | ArxivQA | 91.24 | | |
| | DocVQA | 65.06 | | |
| | InfoVQA | 93.14 | | |
| | ShiftProject | 88.74 | | |
| | SyntheticDocQA-AI | 99.63 | | |
| | SyntheticDocQA-Energy | 96.91 | | |
| | SyntheticDocQA-Gov | 96.08 | | |
| | SyntheticDocQA-Health | 99.26 | | |
| | TabFQuAD | 90.32 | | |
| | TATDQA | 82.06 | | |
| **ViDoRe v2 Tasks (Multilingual):** | |
| | Task | Score | | |
| |------|-------| | |
| | ViDoRe-v2-2BioMed | 58.62 | | |
| | ViDoRe-v2-2Econ | 54.64 | | |
| | ViDoRe-v2-2ESG-HL | 68.13 | | |
| | ViDoRe-v2-2ESG | 50.40 | | |
| | **Combined Average** | **81.02** | | |
| </details> | |
| <details> | |
| <summary><b>๐ ViDoRe v3 (NDCG@10) - Click to expand</b></summary> | |
| | Task | Score | | |
| |------|-------| | |
| | ViDoRe-v3-CS | 73.70 | | |
| | ViDoRe-v3-Energy | 61.21 | | |
| | ViDoRe-v3-FinanceEn | 54.30 | | |
| | ViDoRe-v3-FinanceFr | 40.18 | | |
| | ViDoRe-v3-HR | 52.97 | | |
| | ViDoRe-v3-Industry | 44.01 | | |
| | ViDoRe-v3-Pharma | 60.64 | | |
| | ViDoRe-v3-Physics | 47.57 | | |
| | **Average** | **54.32** | | |
| </details> | |
| ### Improvement over Baseline | |
| | Metric | ColQwen3-2b | ColQwen2-v1.0 | Improvement | | |
| |--------|-------------|---------------|-------------| | |
| | ViDoRe v1 | **90.24** | 89.23 | **+1.01** | | |
| | MTEB v1+v2 | **81.02** | 79.74 | **+1.28** | | |
| | ViDoRe v3 | **54.32** | 44.18 | **+10.14** | | |
| ## ๐ Summary Tables | |
| ### 128-dim Models Comparison | |
| <p align="center"> | |
| <img src="https://raw.githubusercontent.com/VAGOsolutions/sauerkrautlm-colpali/main/assets/table_summary_128dim.png" alt="128-dim Models Summary" width="100%"/> | |
| </p> | |
| ### Comparison vs High-dim Models | |
| <p align="center"> | |
| <img src="https://raw.githubusercontent.com/VAGOsolutions/sauerkrautlm-colpali/main/assets/table_summary_highdim_comparison.png" alt="High-dim Comparison" width="100%"/> | |
| </p> | |
| ## โจ Key Features | |
| - **๐ฅ #1 in Medium Category**: Best 1-3B model among 128-dim models | |
| - **๐ +1.01 over ColQwen2**: Significant improvement over baseline | |
| - **๐พ Consumer GPU Ready**: Only ~4.4GB VRAM | |
| - **โก Compact Embeddings**: 128-dimensional | |
| - **๐ Multilingual**: 6 languages (EN, DE, FR, ES, IT, PT) | |
| ## Model Details | |
| | Property | Value | | |
| |----------|-------| | |
| | **Base Model** | [Qwen/Qwen3-VL-2B](https://huggingface.co/Qwen/Qwen3-VL-2B) | | |
| | **Parameters** | 2.2B | | |
| | **Embedding Dimension** | 128 | | |
| | **VRAM (bfloat16)** | ~4.4 GB | | |
| | **Max Context Length** | 262,144 tokens | | |
| | **License** | Apache 2.0 | | |
| ## Training | |
| ### Hardware & Configuration | |
| | Setting | Value | | |
| |---------|-------| | |
| | **GPUs** | 4x NVIDIA RTX 6000 Ada (48GB) | | |
| | **Effective Batch Size** | 256 | | |
| | **Precision** | bfloat16 | | |
| ### Datasets | |
| | Dataset | Type | Description | | |
| |---------|------|-------------| | |
| | [vidore/colpali_train_set](https://huggingface.co/datasets/vidore/colpali_train_set) | Public | ColPali training data | | |
| | [openbmb/VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data) | Public | Visual RAG training data | | |
| | [llamaindex/vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train) | Public | Multilingual document retrieval | | |
| | VAGO Multilingual Dataset 1 | **In-house** | Proprietary multilingual document-query pairs | | |
| | VAGO Multilingual Dataset 2 | **In-house** | Proprietary multilingual document-query pairs | | |
| ## Installation & Usage | |
| > โ ๏ธ **Important**: Install our package first before loading the model: | |
| ```bash | |
| pip install git+https://github.com/VAGOsolutions/sauerkrautlm-colpali | |
| ``` | |
| ```python | |
| import torch | |
| from PIL import Image | |
| from sauerkrautlm_colpali.models import ColQwen3, ColQwen3Processor | |
| model_name = "VAGOsolutions/SauerkrautLM-ColQwen3-2b-v0.1" | |
| model = ColQwen3.from_pretrained( | |
| model_name, | |
| torch_dtype=torch.bfloat16, | |
| attn_implementation="flash_attention_2", | |
| device_map="cuda:0", | |
| ).eval() | |
| processor = ColQwen3Processor.from_pretrained(model_name) | |
| images = [Image.open("document.png")] | |
| queries = ["What is the main topic?"] | |
| batch_images = processor.process_images(images).to(model.device) | |
| batch_queries = processor.process_queries(queries).to(model.device) | |
| with torch.no_grad(): | |
| image_embeddings = model(**batch_images) | |
| query_embeddings = model(**batch_queries) | |
| scores = processor.score(query_embeddings, image_embeddings) | |
| ``` | |
| ## ๐ Additional Benchmark Visualizations | |
| ### MTEB v1+v2 Benchmark (128-dim Models) | |
| <p align="center"> | |
| <img src="https://raw.githubusercontent.com/VAGOsolutions/sauerkrautlm-colpali/main/assets/benchmark_128dim_v1v2.png" alt="MTEB v1+v2 Benchmark - 128-dim Models" width="100%"/> | |
| </p> | |
| ### ViDoRe v3 Benchmark (128-dim Models) | |
| <p align="center"> | |
| <img src="https://raw.githubusercontent.com/VAGOsolutions/sauerkrautlm-colpali/main/assets/benchmark_128dim_v3.png" alt="ViDoRe v3 Benchmark - 128-dim Models" width="100%"/> | |
| </p> | |
| ### Our Models vs High-dim Models | |
| <p align="center"> | |
| <img src="https://raw.githubusercontent.com/VAGOsolutions/sauerkrautlm-colpali/main/assets/benchmark_ours_vs_highdim_v1.png" alt="ViDoRe v1 - Our Models vs High-dim" width="100%"/> | |
| </p> | |
| ## Citation | |
| ```bibtex | |
| @misc{sauerkrautlm-colpali-2025, | |
| title={SauerkrautLM-ColPali: Multi-Vector Vision Retrieval Models}, | |
| author={David Golchinfar}, | |
| organization={VAGO Solutions}, | |
| year={2025}, | |
| url={https://github.com/VAGOsolutions/sauerkrautlm-colpali} | |
| } | |
| ``` | |
| ## Contact | |
| - **VAGO Solutions**: [https://vago-solutions.ai](https://vago-solutions.ai) | |
| - **GitHub**: [https://github.com/VAGOsolutions](https://github.com/VAGOsolutions) |