---
license: apache-2.0
language:
- en
- ko
- multilingual
tags:
- vision-language
- embedding
- multimodal-embedding
- mmeb
- digital-forensics
library_name: transformers
pipeline_tag: feature-extraction
base_model:
- Qwen/Qwen3-VL-Embedding-2B
---

# Eddy-VL Embedding 1.9B

[Urock-AI](https://huggingface.co/Urock-AI) · [urock.kr](https://urock.kr/) · License: Apache 2.0

> **Eddy-VL is a multimodal embedding model light enough to run on edge devices.** It keeps the retrieval quality of a 2B-class vision-language embedder in a lighter, faster package — built at Urock-AI Lab for real-world multimodal search over images and documents.

---

## The story behind Eddy

At Urock-AI, our work is focused on one thing: solving the real problems that forensic investigators face.

And those problems start with the environment. Investigative work is closed by nature — often offline, frequently air-gapped, running on networks deliberately isolated from the outside world. An investigator rarely has a clean dataset and a datacenter. They have a seized drive, thousands of images and scanned documents, and a question that sounds simple: *"find me everything that looks like this."*

Most strong multimodal embedders can answer that question well, but they're heavy: slow to run, expensive to serve, and impossible to use where the data can never leave the room. That's exactly the case in forensics.

That's why, at Urock-AI, we build dedicated forensic hardware — including offline, edge-deployable devices meant to run right next to the evidence. Eddy-VL is designed for exactly that setting: we wanted the retrieval quality of a 2B-class model in something light enough to live on an edge device, with nothing leaving the room.

Rather than train a new model from scratch, we started from one of the best open multimodal embedders available and carefully **slimmed it down**, teaching the smaller model to stay faithful to the original's sense of what's similar to what. The result is **Eddy-VL 1.9B**: lighter and faster, while landing within roughly 10% of the original on MMEB-V2 in our in-house evaluation.

---

## Why it might be a good fit

- **Runs on the edge.** Light and fast enough to deploy directly on edge devices, no datacenter required.
- **Faithful, not just smaller.** Tuned to preserve the original model's retrieval behavior rather than chase raw compression.
- **One space for everything.** Images, text, and document pages all map into a shared embedding space, so you can search across them with a single query.
- **Korean-aware.** The training data includes Korean image–text pairs, so retrieval isn't limited to English.

---

## What it is (at a glance)

| | |
|---|---|
| **Type** | Multimodal embedding model (image · text · document → vector) |
| **Size** | 1.93B parameters · 3.85 GB checkpoint |
| **Speed** | ~1.1× faster than the base model |
| **Quality** | within ~10% of the base on [MMEB-V2](https://huggingface.co/datasets/TIGER-Lab/MMEB-V2) (in-house, identical settings) |
| **Embedding** | 2048-d, with shorter dimensions available for cheaper search |
| **Context** | up to 8192 tokens; flexible image resolution |
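
The card does not spell out how the shorter embedding dimensions are produced. A minimal sketch, assuming they are obtained Matryoshka-style by keeping the leading components of the 2048-d vector and L2-normalizing again (check the repo before relying on this). The helper name `truncate_embeddings` is hypothetical, and random unit vectors stand in for real model outputs:

```python
import numpy as np


def truncate_embeddings(vectors: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row, then L2-normalize again.

    Assumption: shorter dimensions are prefixes of the full vector.
    """
    if not 0 < dim <= vectors.shape[-1]:
        raise ValueError(f"dim must be in 1..{vectors.shape[-1]}, got {dim}")
    head = vectors[..., :dim]
    norms = np.linalg.norm(head, axis=-1, keepdims=True)
    # Guard against division by zero for degenerate all-zero prefixes.
    return head / np.clip(norms, 1e-12, None)


# Random unit vectors stand in for real 2048-d Eddy-VL embeddings.
rng = np.random.default_rng(0)
full = rng.standard_normal((4, 2048)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

short = truncate_embeddings(full, 512)
print(short.shape)  # (4, 512)
```

Re-normalizing matters because a prefix of a unit vector is no longer unit length, so dot products would stop being cosine similarities. If the embeddings come back as `torch.bfloat16` tensors, they need a `.float()` before conversion to NumPy.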

---

## Installation

```bash
pip install "transformers>=5.0" safetensors torch pillow torchvision
pip install decord   # video input (or `av`)
```

Clone this repo and run from its root, or add the cloned directory to `PYTHONPATH`. The inference code ships with the model:

| File | Role |
|------|------|
| `vl_embedding_v1.py` | `VLEmbedder` — load weights, encode text / image / video |
| `processing_vl.py` | `VLProcessor` — multimodal tokenization & preprocessing |
| `vl_utils/` | Image & video loading / resizing (bundled) |

## How to use it

```python
import torch
from PIL import Image
from vl_embedding_v1 import VLEmbedder

model_id = "Urock-AI/Eddy-vl_embedding_1.9B_v1"
instruction = "Represent this input for retrieval."

embedder = VLEmbedder(model_id, torch_dtype=torch.bfloat16, default_instruction=instruction)

# text / image / video → 2048-d vectors (L2-normalized)
text_vec = embedder.process([{"text": "a photo of a cat"}])[0]
image_vec = embedder.process([{"image": Image.open("photo.jpg")}])[0]
video_vec = embedder.process([{"video": "clip.mp4"}])[0]

# cosine similarity = dot product (vectors are already L2-normalized)
# both vectors are 1-D here, so no transpose is needed
score = (text_vec @ image_vec).item()
```

> `trust_remote_code=True` is required for `processing_vl.py` (`VLProcessor`) in this repo.
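
Because text, images, and document pages share one embedding space, searching a collection reduces to a dot product followed by a sort. A minimal sketch of that step, assuming the corpus embeddings have already been computed and stacked into a float array of unit vectors. The function `top_k` is a hypothetical helper, and the 3-d toy vectors stand in for real 2048-d embeddings:

```python
import numpy as np


def top_k(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 5):
    """Return (indices, scores) of the k most similar corpus rows, best first.

    Assumes every vector is L2-normalized, so the dot product is the cosine.
    """
    scores = corpus_vecs @ query_vec
    k = min(k, scores.shape[0])
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Toy unit vectors: rows could be images, text, or document pages.
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
])
query = np.array([0.8, 0.6, 0.0])

idx, scores = top_k(query, corpus, k=2)
print(idx.tolist())  # [2, 0]
```

For large collections, a full sort is usually replaced by an approximate nearest-neighbor index; the brute-force version above is only meant to show the scoring.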

---

## How well it does

Eddy-VL is validated on **[MMEB-V2](https://huggingface.co/datasets/TIGER-Lab/MMEB-V2)** across image, video, and document retrieval. Selected per-task results:

**Image** (hit@1)

| Task | Score |
|:-----|------:|
| DocVQA | 92.4% |
| RefCOCO | 92.4% |
| RefCOCO-Matching | 91.9% |
| WebQA | 85.6% |
| GQA | 85.5% |
| TextVQA | 85.0% |
| ImageNet-R | 84.4% |
| Visual7W-Pointing | 84.0% |
| VOC2007 | 82.7% |
| EDIS | 82.0% |

**Video** (hit@1)

| Task | Score |
|:-----|------:|
| ActivityNetQA | 70.0% |
| MSVD | 66.9% |
| QVHighlight | 66.9% |
| UCF101 | 65.0% |
| HMDB51 | 60.3% |
| NExTQA | 59.8% |
| Something-Something V2 | 59.5% |

**VisDoc** (nDCG@5)

| Task | Score |
|:-----|------:|
| ViDoRe · Synthetic DocQA (AI) | 94.6% |
| ViDoRe · Synthetic DocQA (Healthcare) | 93.9% |
| ViDoRe · Synthetic DocQA (Gov. reports) | 93.8% |
| VisRAG · SlideVQA | 91.5% |
| ViDoRe · TabFQuAD | 90.0% |
| ViDoRe · Synthetic DocQA (Energy) | 89.9% |
| VisRAG · InfoVQA | 86.6% |
| ViDoRe · InfoVQA | 82.4% |
| ViDoSeek · doc | 82.2% |
| VisRAG · ChartQA | 81.0% |

**On reading the numbers:** the public MMEB leaderboard lists the base model around 73.0, but re-running it ourselves in the same environment we use for Eddy-VL gives 68.9. Measured that way, apples to apples, Eddy-VL lands within about 10% of the base model overall, while being smaller and faster. We report the in-house baseline (68.9) so the comparison is fair rather than flattering.
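
To make the arithmetic concrete, the sketch below uses only the two baseline figures quoted above. It assumes "within about 10%" means a relative gap rather than 10 absolute points, which the card does not state explicitly, and it does not report an Eddy-VL score:

```python
BASE_IN_HOUSE = 68.9     # base model, re-run in-house (from this card)
BASE_LEADERBOARD = 73.0  # base model, public leaderboard (approximate)


def relative_gap(score: float, baseline: float) -> float:
    """Fraction by which `score` falls short of `baseline`."""
    return (baseline - score) / baseline


# Assumption: "within 10%" is relative, so this is the lowest overall
# score that would still satisfy the claim against the in-house baseline.
floor = BASE_IN_HOUSE * (1 - 0.10)

# How far the in-house re-run sits below the leaderboard figure.
leaderboard_vs_inhouse = relative_gap(BASE_IN_HOUSE, BASE_LEADERBOARD)

print(round(floor, 2))                   # 62.01
print(round(leaderboard_vs_inhouse, 3))  # 0.056
```

The second number shows why the choice of baseline matters: the same base model scores about 5.6% lower in the in-house run than on the leaderboard.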

---

## Good to know before you rely on it

- **It finds, it doesn't decide.** Eddy-VL surfaces candidates for a human to review; it shouldn't be the sole basis for any high-stakes decision.
- **It's an embedder, not a reranker.** For the final ordering of results, pair it with a dedicated reranker.
- **English benchmarks, broader goals.** MMEB is English; Korean and domain-specific performance aren't separately measured in this release yet (that's where we're headed next).
- **Smaller has a cost.** The smaller size comes with a modest, expected quality dip versus the full base model.
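
The embedder-plus-reranker pairing mentioned above is usually arranged as two stages: shortlist by embedding similarity, then reorder only the shortlist. A minimal sketch of that pattern follows. The function `retrieve_then_rerank` and the toy scores are hypothetical, and `rerank_fn` stands in for whatever dedicated reranker you choose (none ships with this model):

```python
from typing import Callable, Hashable, List, Sequence


def retrieve_then_rerank(
    candidate_ids: Sequence[Hashable],
    embedding_scores: Sequence[float],
    rerank_fn: Callable[[Hashable], float],
    shortlist_size: int = 50,
    final_k: int = 10,
) -> List[Hashable]:
    """Shortlist by embedding similarity, then reorder with a reranker."""
    by_embedding = sorted(
        zip(candidate_ids, embedding_scores),
        key=lambda pair: pair[1],
        reverse=True,
    )
    shortlist = [cid for cid, _ in by_embedding[:shortlist_size]]
    # The (slower) reranker only ever sees the shortlist.
    reranked = sorted(shortlist, key=rerank_fn, reverse=True)
    return reranked[:final_k]


# Toy data: embedding scores pick the shortlist, reranker scores set the order.
ids = ["a", "b", "c", "d"]
emb_scores = [0.91, 0.88, 0.40, 0.86]
toy_reranker = {"a": 0.2, "b": 0.9, "c": 1.0, "d": 0.7}

result = retrieve_then_rerank(
    ids, emb_scores, toy_reranker.get, shortlist_size=3, final_k=2
)
print(result)  # ['b', 'd']
```

Note that `"c"` never reaches the reranker despite its high reranker score, because the embedding stage did not shortlist it. The shortlist size is therefore a recall budget, and the final list still goes to a human reviewer.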

---

## What's next

This is the first step. The next release pushes Eddy-VL toward **Korean and domain specialization**, with evaluation built around fine-grained, reasoning-heavy retrieval. More to come.

---

## Training data

Curated by Urock-AI Lab from public sources — including **MS COCO**, **ko-coco** ([kms7530/ko-coco-bal](https://huggingface.co/datasets/kms7530/ko-coco-bal)), **SUN**, **RVL-CDIP**, **CORD v2**, and **AI Hub** multimodal datasets — alongside an internally curated image set.

---

## Citation & license

Eddy-VL is derived from **[Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)**; please honor its license terms (Apache 2.0) and cite Urock-AI when you use this model. Evaluation uses the MMEB benchmark (Jiang et al., *VLM2Vec*, ICLR 2025).

```bibtex
@misc{eddy_vl_embedding_19b,
  title        = {Eddy-VL Embedding 1.9B},
  author       = {Urock-AI Lab},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Urock-AI/Eddy-vl_embedding_1.9B_v1}}
}
```

---

**Urock-AI Lab** — Digital Forensic AI · [huggingface.co/Urock-AI](https://huggingface.co/Urock-AI) · [urock.kr](https://urock.kr/)