---
license: apache-2.0
language:
- en
- ko
- multilingual
tags:
- vision-language
- embedding
- multimodal-embedding
- mmeb
- digital-forensics
library_name: transformers
pipeline_tag: feature-extraction
base_model:
- Qwen/Qwen3-VL-Embedding-2B
---
<p align="center">
<img src="logo.png" alt="Eddy-VL" width="100%">
</p>
# Eddy-VL Embedding 1.9B
[Urock-AI](https://huggingface.co/Urock-AI) · [urock.kr](https://urock.kr/) · License: Apache 2.0
> **Eddy-VL is a multimodal embedding model light enough to run on edge devices.** It keeps the retrieval quality of a 2B-class vision-language embedder in a lighter, faster package — built at Urock-AI Lab for real-world multimodal search over images and documents.
---
## The story behind Eddy
At Urock-AI, our work is focused on one thing: solving the real problems that forensic investigators face.
And those problems start with the environment. Investigative work is closed by nature — often offline, frequently air-gapped, running on networks deliberately isolated from the outside world. An investigator rarely has a clean dataset and a datacenter. They have a seized drive, thousands of images and scanned documents, and a question that sounds simple: *"find me everything that looks like this."*
Most strong multimodal embedders can answer that question well, but they're heavy: slow to run, expensive to serve, and impossible to use where the data can never leave the room. That's exactly the case in forensics.
That's why, at Urock-AI, we build dedicated forensic hardware — including offline, edge-deployable devices meant to run right next to the evidence. Eddy-VL is designed for exactly that setting: we wanted the retrieval quality of a 2B-class model in something light enough to live on an edge device, with nothing leaving the room.
Rather than train a new model from scratch, we started from one of the best open multimodal embedders available and carefully **slimmed it down**, teaching the smaller model to stay faithful to the original's sense of what's similar to what. The result is **Eddy-VL 1.9B**: lighter and faster, while landing within roughly 10% of the original on MMEB-V2 in our in-house evaluation.
---
## Why it might be a good fit
- **Runs on the edge.** Light and fast enough to deploy directly on edge devices, no datacenter required.
- **Faithful, not just smaller.** Tuned to preserve the original model's retrieval behavior rather than chase raw compression.
- **One space for everything.** Images, text, and document pages all map into a shared embedding space, so you can search across them with a single query.
- **Korean-aware.** Curated with Korean image–text data, so retrieval isn't limited to English.
---
## What it is (at a glance)
| | |
|---|---|
| **Type** | Multimodal embedding model (image · text · document → vector) |
| **Size** | 1.93B parameters · 3.85 GB checkpoint |
| **Speed** | ~1.1× faster than the base model |
| **Quality** | within ~10% of the base on [MMEB-V2](https://huggingface.co/datasets/TIGER-Lab/MMEB-V2) (in-house, identical settings) |
| **Embedding** | 2048-d, with shorter dimensions available for cheaper search |
| **Context** | up to 8192 tokens; flexible image resolution |
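
One way to read the "shorter dimensions" row is Matryoshka-style truncation: keep the first `dim` components of the 2048-d vector and normalize again. That reading is an assumption (this card does not spell out the mechanism), and `truncate_embedding` is a hypothetical helper, not part of the repo. A minimal sketch in plain Python:

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-normalize again.

    Assumes Matryoshka-style embeddings, where a prefix of the vector is
    itself a usable embedding. Hypothetical helper, not part of the repo.
    """
    if not 0 < dim <= len(vec):
        raise ValueError(f"dim must be in 1..{len(vec)}, got {dim}")
    head = [float(x) for x in vec[:dim]]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]

full = [0.6, 0.0, 0.8, 0.0]          # toy 4-d unit vector standing in for a 2048-d one
short = truncate_embedding(full, 2)  # keeps [0.6, 0.0], then rescales to unit length
```

Re-normalizing matters because the dot product only equals cosine similarity for unit-length vectors. Queries and indexed items should be truncated to the same `dim`.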
---
## How well it does
Eddy-VL is validated on **[MMEB-V2](https://huggingface.co/datasets/TIGER-Lab/MMEB-V2)** across image, video, and document retrieval. Selected per-task results:
**Image** (hit@1)
| Task | Score |
|:-----|------:|
| DocVQA | 92.4% |
| RefCOCO | 92.4% |
| RefCOCO-Matching | 91.9% |
| WebQA | 85.6% |
| GQA | 85.5% |
| TextVQA | 85.0% |
| ImageNet-R | 84.4% |
| Visual7W-Pointing | 84.0% |
| VOC2007 | 82.7% |
| EDIS | 82.0% |
**Video** (hit@1)
| Task | Score |
|:-----|------:|
| ActivityNetQA | 70.0% |
| MSVD | 66.9% |
| QVHighlight | 66.9% |
| UCF101 | 65.0% |
| HMDB51 | 60.3% |
| NExTQA | 59.8% |
| Something-Something V2 | 59.5% |
**VisDoc** (nDCG@5)
| Task | Score |
|:-----|------:|
| ViDoRe · Synthetic DocQA (AI) | 94.6% |
| ViDoRe · Synthetic DocQA (Healthcare) | 93.9% |
| ViDoRe · Synthetic DocQA (Gov. reports) | 93.8% |
| VisRAG · SlideVQA | 91.5% |
| ViDoRe · TabFQuAD | 90.0% |
| ViDoRe · Synthetic DocQA (Energy) | 89.9% |
| VisRAG · InfoVQA | 86.6% |
| ViDoRe · InfoVQA | 82.4% |
| ViDoSeek · doc | 82.2% |
| VisRAG · ChartQA | 81.0% |
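
For readers unfamiliar with the two metrics in the tables above, the sketch below shows their textbook definitions for a single query. It assumes binary relevance and is not the in-house evaluation code, which may differ in details; the benchmark score is the mean over all queries.

```python
import math

def hit_at_1(ranked_ids, relevant_ids):
    """1.0 if the top-ranked candidate is relevant, else 0.0."""
    return 1.0 if ranked_ids and ranked_ids[0] in relevant_ids else 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """nDCG@k with binary relevance (gain 1 for relevant, 0 otherwise)."""
    # Discounted gain of the actual ranking: rank 0 is divided by log2(2) = 1.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # Best achievable gain: all relevant items packed at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

ranking = ["d3", "d1", "d7", "d2", "d9"]  # toy ranking for one query
relevant = {"d1"}
h = hit_at_1(ranking, relevant)   # top result d3 is not relevant
n = ndcg_at_k(ranking, relevant)  # the single relevant doc sits in second place
```

hit@1 only rewards a correct top result, while nDCG@5 gives partial credit when a relevant page appears lower in the top five.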
**MMEB-V2 overall (mean across all datasets, in-house eval):**
| Model | Overall |
|:------|--------:|
| Qwen3-VL-Embedding-2B (public leaderboard) | 73.0 |
| Qwen3-VL-Embedding-2B (our environment) | 68.9 |
| **Eddy-VL 1.9B** | **63.2** |
The public leaderboard and our in-house pipeline differ in setup, so we compare Eddy against the teacher re-run in the same environment (68.9 → 63.2). That is roughly an **8% relative gap**, from a model that is smaller and ~1.1× faster.
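
The relative gap follows directly from the two in-house scores in the table. The arithmetic, using only those numbers:

```python
teacher_score = 68.9  # Qwen3-VL-Embedding-2B, in-house run
student_score = 63.2  # Eddy-VL 1.9B, same environment

absolute_gap = teacher_score - student_score  # in benchmark points
relative_gap = absolute_gap / teacher_score   # as a fraction of the teacher score
retained = student_score / teacher_score      # share of teacher quality kept

print(f"absolute gap: {absolute_gap:.1f} points")  # 5.7 points
print(f"relative gap: {relative_gap:.1%}")         # 8.3%
print(f"retained:     {retained:.1%}")             # 91.7%
```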
---
## Installation
```bash
pip install "transformers>=5.0" safetensors torch pillow torchvision huggingface_hub
pip install decord # video input (or `av`)
```
**Download the model** (recommended — no git-lfs required):
```python
from huggingface_hub import snapshot_download
path = snapshot_download("Urock-AI/Eddy-vl_embedding_1.9B_v1")
```
Or with git ([git-lfs](https://git-lfs.com/) required for weights):
```bash
git lfs install
git clone https://huggingface.co/Urock-AI/Eddy-vl_embedding_1.9B_v1
cd Eddy-vl_embedding_1.9B_v1
```
Inference code ships with the repo (not a pip package):
| File | Role |
|------|------|
| `vl_embedding_v1.py` | `VLEmbedder` — load weights, encode text / image / video |
| `processing_vl.py` | `VLProcessor` — multimodal tokenization & preprocessing |
| `vl_utils/` | Image & video loading / resizing (bundled) |
## How to use it
Clone the repo first, then run from inside the repo folder:
```python
import torch
from PIL import Image
from vl_embedding_v1 import VLEmbedder
instruction = "Represent this input for retrieval."
# load from local repo checkout (weights in ./model.safetensors)
embedder = VLEmbedder(".", torch_dtype=torch.bfloat16, default_instruction=instruction)
# text / image / video → 2048-d vectors (L2-normalized)
text_vec = embedder.process([{"text": "a photo of a cat"}])[0]
image_vec = embedder.process([{"image": Image.open("photo.jpg")}])[0]
video_vec = embedder.process([{"video": "clip.mp4"}])[0]
# cosine similarity = dot product
score = (text_vec @ image_vec.T).item()
```
You can also pass the Hub repo id (`"Urock-AI/Eddy-vl_embedding_1.9B_v1"`) to download weights automatically, but you still need the cloned Python files on `PYTHONPATH`.
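
If you prefer not to change the working directory or set `PYTHONPATH` in the shell, one option is to add the checkout to `sys.path` at runtime. The helper below is a hypothetical sketch (`add_repo_to_path` is not part of the repo); it only checks that `vl_embedding_v1.py` is present and makes the folder importable.

```python
import sys
from pathlib import Path

def add_repo_to_path(repo_dir):
    """Put a local checkout of the repo at the front of sys.path.

    Hypothetical helper: it verifies that `vl_embedding_v1.py` exists in the
    directory and makes it importable. It does not download anything.
    """
    repo = Path(repo_dir).expanduser().resolve()
    if not (repo / "vl_embedding_v1.py").is_file():
        raise FileNotFoundError(f"vl_embedding_v1.py not found in {repo}")
    if str(repo) not in sys.path:
        sys.path.insert(0, str(repo))
    return repo
```

After calling it with the path to your checkout, `from vl_embedding_v1 import VLEmbedder` should resolve from any working directory.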
> `trust_remote_code=True` is used internally for `processing_vl.py` (`VLProcessor`).
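
The single dot-product score above extends to ranking a whole collection against one query. The sketch below uses plain Python lists and toy 2-d vectors so it runs without the model; `rank_by_similarity` and the file names are illustrative, and it assumes every vector is already L2-normalized, as the embedder's outputs are described to be.

```python
def rank_by_similarity(query_vec, gallery, top_k=3):
    """Return the top_k (item_id, score) pairs, highest score first.

    `gallery` maps an item id to its embedding. All vectors are assumed to be
    L2-normalized, so the dot product equals cosine similarity.
    """
    scored = [
        (item_id, sum(q * v for q, v in zip(query_vec, vec)))
        for item_id, vec in gallery.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 2-d unit vectors standing in for 2048-d embeddings of mixed inputs.
gallery = {
    "photo_001.jpg": [1.0, 0.0],
    "scan_page_17.png": [0.6, 0.8],
    "clip_03.mp4": [0.0, 1.0],
}
query = [0.8, 0.6]  # e.g. the embedding of a text query
results = rank_by_similarity(query, gallery, top_k=2)
```

Because images, text, and documents share one space, the same loop works regardless of which modality produced each vector. For large collections, stacking the vectors into a matrix and using one matrix product, or an approximate nearest-neighbor index, is the usual next step.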
---
## Good to know before you rely on it
- **It finds, it doesn't decide.** Eddy-VL surfaces candidates for a human to review; it shouldn't be the sole basis for any high-stakes decision.
- **It's an embedder, not a reranker.** For the final ordering of results, pair it with a dedicated reranker.
- **English benchmarks, broader goals.** MMEB is English; Korean and domain-specific performance aren't separately measured in this release yet (that's where we're headed next).
- **Smaller has a cost.** A modest, expected quality dip versus the full base model comes with the smaller size.
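
The first two points describe a common two-stage pattern: the embedder recalls a short list cheaply, a reranker reorders it, and a person reviews the result. The sketch below illustrates that flow under stated assumptions: `rerank_fn` is a hypothetical callable standing in for whatever reranker you choose, and the function is not part of this repo.

```python
def retrieve_then_rerank(query, query_vec, corpus, rerank_fn,
                         n_candidates=50, top_k=10):
    """Two-stage search: cheap embedding recall, then a costlier reranker.

    corpus: list of (item_id, embedding) with L2-normalized embeddings.
    rerank_fn: hypothetical callable (query, item_id) -> float standing in
        for a dedicated reranker model; higher means more relevant.
    Returns item ids for a human to review, best first.
    """
    # Stage 1: rank the whole corpus by dot product and keep a short list.
    by_embedding = sorted(
        corpus,
        key=lambda item: sum(q * v for q, v in zip(query_vec, item[1])),
        reverse=True,
    )
    candidates = [item_id for item_id, _ in by_embedding[:n_candidates]]
    # Stage 2: reorder only the short list with the reranker.
    candidates.sort(key=lambda item_id: rerank_fn(query, item_id), reverse=True)
    return candidates[:top_k]
```

Only the `n_candidates` items that survive stage 1 are ever seen by the reranker, so the embedder's recall sets the ceiling for the final list.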
---
## What's next
This is the first step. The next release pushes Eddy-VL toward **Korean and domain specialization**, with evaluation built around fine-grained, reasoning-heavy retrieval. More to come.
---
## Training data
Curated by Urock-AI Lab from public sources — including **MS COCO**, **ko-coco** ([kms7530/ko-coco-bal](https://huggingface.co/datasets/kms7530/ko-coco-bal)), **SUN**, **RVL-CDIP**, **CORD v2**, and **AI Hub** multimodal datasets — alongside an internally curated image set.
---
## Citation & license
Eddy-VL is derived from **[Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)**; please honor its license terms (Apache 2.0) and cite Urock-AI when you use this model. Evaluation uses the MMEB benchmark (Jiang et al., *VLM2Vec*, ICLR 2025).
```bibtex
@misc{eddy_vl_embedding_19b,
title = {Eddy-VL Embedding 1.9B},
author = {Urock-AI Lab},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Urock-AI/Eddy-vl_embedding_1.9B_v1}}
}
```
---
**Urock-AI Lab** — Digital Forensic AI · [huggingface.co/Urock-AI](https://huggingface.co/Urock-AI) · [urock.kr](https://urock.kr/)