---
license: apache-2.0
language:
- en
- ko
- multilingual
tags:
- vision-language
- embedding
- multimodal-embedding
- mmeb
- digital-forensics
library_name: transformers
pipeline_tag: feature-extraction
base_model:
- Qwen/Qwen3-VL-Embedding-2B
---

<p align="center">
  <img src="logo.png" alt="Eddy-VL" width="100%">
</p>

# Eddy-VL Embedding 1.9B

[Urock-AI](https://huggingface.co/Urock-AI) · [urock.kr](https://urock.kr/) · License: Apache 2.0

> **Eddy-VL is a multimodal embedding model light enough to run on edge devices.** It keeps most of the retrieval quality of a 2B-class vision-language embedder in a lighter, faster package, built at Urock-AI Lab for real-world multimodal search over images and documents.

---

## The story behind Eddy

At Urock-AI, our work is focused on one thing: solving the real problems that forensic investigators face.

And those problems start with the environment. Investigative work is closed by nature — often offline, frequently air-gapped, running on networks deliberately isolated from the outside world. An investigator rarely has a clean dataset and a datacenter. They have a seized drive, thousands of images and scanned documents, and a question that sounds simple: *"find me everything that looks like this."*

Most strong multimodal embedders can answer that question well, but they're heavy: slow to run, expensive to serve, and impractical where the data can never leave the room, which is exactly the situation in forensics.

That's why, at Urock-AI, we build dedicated forensic hardware — including offline, edge-deployable devices meant to run right next to the evidence. Eddy-VL is designed for exactly that setting: we wanted the retrieval quality of a 2B-class model in something light enough to live on an edge device, with nothing leaving the room.

Rather than train a new model from scratch, we started from one of the best open multimodal embedders available, [Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B), and carefully **slimmed it down**, teaching the smaller model to stay faithful to the original's sense of what's similar to what. The result is **Eddy-VL 1.9B**: lighter and faster, while landing within roughly 10% of the original on MMEB-V2 (about 8% relative in our in-house evaluation).

---

## Why it might be a good fit

- **Runs on the edge.** Light and fast enough to deploy directly on edge devices, no datacenter required.
- **Faithful, not just smaller.** Tuned to preserve the original model's retrieval behavior rather than chase raw compression.
- **One space for everything.** Images, text, and document pages all map into a shared embedding space, so you can search across them with a single query.
- **Korean-aware.** The training data includes curated Korean image–text pairs, so retrieval isn't limited to English.

---

## What it is (at a glance)

| | |
|---|---|
| **Type** | Multimodal embedding model (image · text · document → vector) |
| **Size** | 1.93B parameters · 3.85 GB checkpoint |
| **Speed** | ~1.1× faster than the base model |
| **Quality** | within ~10% of the base on [MMEB-V2](https://huggingface.co/datasets/TIGER-Lab/MMEB-V2) (in-house, identical settings) |
| **Embedding** | 2048-d, with shorter dimensions available for cheaper search |
| **Context** | up to 8192 tokens; flexible image resolution |
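
The table mentions shorter embedding dimensions without saying how they are produced. A minimal sketch, assuming Matryoshka-style prefix truncation (keep the first `dim` components, then re-normalize): the helper name `truncate_embedding`, the target width of 256, and the random stand-in vectors are illustrative and not part of the Eddy-VL API. Check the repo for the widths the model actually supports.

```python
import numpy as np


def truncate_embedding(vecs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and re-normalize to unit length.

    Assumption: the model's embeddings remain useful under prefix truncation.
    """
    if not 0 < dim <= vecs.shape[-1]:
        raise ValueError(f"dim must be in 1..{vecs.shape[-1]}, got {dim}")
    head = vecs[..., :dim].astype(np.float32)
    norms = np.linalg.norm(head, axis=-1, keepdims=True)
    # Guard against division by zero for an all-zero prefix.
    return head / np.clip(norms, 1e-12, None)


# Stand-in for model output: 4 random unit vectors of width 2048.
rng = np.random.default_rng(0)
full = rng.standard_normal((4, 2048)).astype(np.float32)
full /= np.linalg.norm(full, axis=-1, keepdims=True)

small = truncate_embedding(full, 256)
print(small.shape)  # (4, 256)
```

Re-normalizing after truncation matters: a prefix of a unit vector is no longer unit length, so without it the dot product stops being a cosine similarity.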

---

## How well it does

Eddy-VL is validated on **[MMEB-V2](https://huggingface.co/datasets/TIGER-Lab/MMEB-V2)** across image, video, and document retrieval. Selected per-task results:

**Image** (hit@1)

| Task | Score |
|:-----|------:|
| DocVQA | 92.4% |
| RefCOCO | 92.4% |
| RefCOCO-Matching | 91.9% |
| WebQA | 85.6% |
| GQA | 85.5% |
| TextVQA | 85.0% |
| ImageNet-R | 84.4% |
| Visual7W-Pointing | 84.0% |
| VOC2007 | 82.7% |
| EDIS | 82.0% |

**Video** (hit@1)

| Task | Score |
|:-----|------:|
| ActivityNetQA | 70.0% |
| MSVD | 66.9% |
| QVHighlight | 66.9% |
| UCF101 | 65.0% |
| HMDB51 | 60.3% |
| NExTQA | 59.8% |
| Something-Something V2 | 59.5% |

**VisDoc** (nDCG@5)

| Task | Score |
|:-----|------:|
| ViDoRe · Synthetic DocQA (AI) | 94.6% |
| ViDoRe · Synthetic DocQA (Healthcare) | 93.9% |
| ViDoRe · Synthetic DocQA (Gov. reports) | 93.8% |
| VisRAG · SlideVQA | 91.5% |
| ViDoRe · TabFQuAD | 90.0% |
| ViDoRe · Synthetic DocQA (Energy) | 89.9% |
| VisRAG · InfoVQA | 86.6% |
| ViDoRe · InfoVQA | 82.4% |
| ViDoSeek · doc | 82.2% |
| VisRAG · ChartQA | 81.0% |

**MMEB-V2 overall (mean across all datasets, in-house eval):**

| Model | Overall |
|:------|--------:|
| Qwen3-VL-Embedding-2B (public leaderboard) | 73.0 |
| Qwen3-VL-Embedding-2B (our environment) | 68.9 |
| **Eddy-VL 1.9B** | **63.2** |

The public leaderboard and our in-house pipeline differ in setup, so we compare Eddy-VL against the teacher re-run in the same environment (68.9 → 63.2). That is a drop of 5.7 points, or roughly an **8% relative gap** (5.7 / 68.9 ≈ 8.3%), while being smaller and ~1.1× faster.

---

## Installation

```bash
pip install "transformers>=5.0" safetensors torch pillow torchvision huggingface_hub
pip install decord   # video input (or `av`)
```

**Download the model** (recommended — no git-lfs required):

```python
from huggingface_hub import snapshot_download
path = snapshot_download("Urock-AI/Eddy-vl_embedding_1.9B_v1")
```

Or with git ([git-lfs](https://git-lfs.com/) required for weights):

```bash
git lfs install
git clone https://huggingface.co/Urock-AI/Eddy-vl_embedding_1.9B_v1
cd Eddy-vl_embedding_1.9B_v1
```

Inference code ships with the repo (not a pip package):

| File | Role |
|------|------|
| `vl_embedding_v1.py` | `VLEmbedder` — load weights, encode text / image / video |
| `processing_vl.py` | `VLProcessor` — multimodal tokenization & preprocessing |
| `vl_utils/` | Image & video loading / resizing (bundled) |

## How to use it

Clone the repo first, then run from inside the repo folder:

```python
import torch
from PIL import Image
from vl_embedding_v1 import VLEmbedder

instruction = "Represent this input for retrieval."

# load from local repo checkout (weights in ./model.safetensors)
embedder = VLEmbedder(".", torch_dtype=torch.bfloat16, default_instruction=instruction)

# text / image / video → 2048-d vectors (L2-normalized)
text_vec = embedder.process([{"text": "a photo of a cat"}])[0]
image_vec = embedder.process([{"image": Image.open("photo.jpg")}])[0]
video_vec = embedder.process([{"video": "clip.mp4"}])[0]

# cosine similarity = dot product
score = (text_vec @ image_vec.T).item()
```
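
Because the vectors are L2-normalized, ranking a whole collection reduces to one matrix-vector product followed by a sort. The sketch below illustrates that step with NumPy and random stand-in vectors so it runs without the model; `top_k` is a hypothetical helper, not something shipped in the repo, and in practice the rows of `corpus` would be stacked outputs of `embedder.process(...)`.

```python
import numpy as np


def top_k(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 5):
    """Return (indices, scores) of the k corpus rows most similar to the query.

    Assumes all vectors are unit length, so dot product equals cosine similarity.
    """
    scores = corpus_vecs @ query_vec
    order = np.argsort(-scores)[:k]  # highest score first
    return order, scores[order]


# Stand-in corpus: 100 random unit vectors of width 2048.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((100, 2048)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Stand-in query: a slightly perturbed copy of item 7, re-normalized.
query = corpus[7] + 0.01 * rng.standard_normal(2048).astype(np.float32)
query /= np.linalg.norm(query)

idx, scores = top_k(query, corpus, k=3)
print(idx[0])  # 7
```

A full sort is fine for a few thousand items; for larger collections an approximate nearest-neighbor index is the usual replacement for the `argsort` step.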

You can also pass the Hub repo id (`"Urock-AI/Eddy-vl_embedding_1.9B_v1"`) to download weights automatically, but you still need the cloned Python files on `PYTHONPATH`.

> `trust_remote_code=True` is used internally for `processing_vl.py` (`VLProcessor`).
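
If you used `snapshot_download` instead of `git clone`, one way to satisfy the `PYTHONPATH` requirement is to add the downloaded folder to `sys.path` at runtime. This is a sketch under two assumptions: that the snapshot folder contains `vl_embedding_v1.py` (the card says the inference code ships with the repo), and that `VLEmbedder` accepts that folder path the same way it accepts `"."`. The helper names `add_repo_to_path` and `load_embedder_from_hub` are hypothetical.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir: str) -> Path:
    """Make the bundled inference modules in `repo_dir` importable."""
    repo = Path(repo_dir).resolve()
    if not (repo / "vl_embedding_v1.py").is_file():
        raise FileNotFoundError(f"vl_embedding_v1.py not found in {repo}")
    if str(repo) not in sys.path:
        sys.path.insert(0, str(repo))
    return repo


def load_embedder_from_hub(repo_id: str = "Urock-AI/Eddy-vl_embedding_1.9B_v1"):
    """Download the repo snapshot, then import and build VLEmbedder from it."""
    # Deferred imports: the path helper above works without these packages.
    import torch
    from huggingface_hub import snapshot_download

    repo = add_repo_to_path(snapshot_download(repo_id))
    from vl_embedding_v1 import VLEmbedder  # importable after the path update

    return VLEmbedder(
        str(repo),
        torch_dtype=torch.bfloat16,
        default_instruction="Represent this input for retrieval.",
    )


# Usage (downloads the checkpoint on first call):
# embedder = load_embedder_from_hub()
```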

---

## Good to know before you rely on it

- **It finds, it doesn't decide.** Eddy-VL surfaces candidates for a human to review; it shouldn't be the sole basis for any high-stakes decision.
- **It's an embedder, not a reranker.** For the final ordering of results, pair it with a dedicated reranker.
- **English benchmarks, broader goals.** MMEB is English; Korean and domain-specific performance aren't separately measured in this release yet (that's where we're headed next).
- **Smaller has a cost.** A modest, expected quality dip versus the full base model comes with the smaller size.
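
The embedder-plus-reranker pairing mentioned above usually takes the form of a two-stage pipeline: the embedding similarity builds a shortlist cheaply, then a slower scorer reorders only that shortlist. The sketch below shows the control flow with toy data. `retrieve_then_rerank` and `word_overlap` are hypothetical; `word_overlap` is a placeholder standing in for a real reranker model, and the 3-d vectors stand in for Eddy-VL embeddings.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np


def retrieve_then_rerank(
    query: str,
    query_vec: np.ndarray,
    items: Sequence[str],
    corpus_vecs: np.ndarray,
    rerank_score: Callable[[str, str], float],
    shortlist: int = 50,
    final_k: int = 10,
) -> List[Tuple[int, float]]:
    """Stage 1: shortlist by embedding similarity. Stage 2: reorder by reranker score."""
    sims = corpus_vecs @ query_vec
    shortlist_idx = np.argsort(-sims)[:shortlist]
    rescored = [
        (int(i), float(rerank_score(query, items[int(i)]))) for i in shortlist_idx
    ]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:final_k]


def word_overlap(query: str, item: str) -> float:
    """Toy scorer (placeholder for a real reranker): count of shared words."""
    return float(len(set(query.split()) & set(item.split())))


items = ["red car", "blue car", "red apple"]
corpus_vecs = np.eye(3, dtype=np.float32)  # toy unit vectors, one per item
query_vec = np.array([0.5, 0.6, 0.4], dtype=np.float32)
query_vec /= np.linalg.norm(query_vec)

# The embedding stage alone would rank "blue car" first; the reranker promotes "red car".
results = retrieve_then_rerank(
    "red car", query_vec, items, corpus_vecs, word_overlap, shortlist=3, final_k=2
)
print(results)  # [(0, 2.0), (1, 1.0)]
```

The shortlist size trades recall for cost: anything the embedding stage leaves out cannot be recovered by the reranker.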

---

## What's next

This is the first step. The next release pushes Eddy-VL toward **Korean and domain specialization**, with evaluation built around fine-grained, reasoning-heavy retrieval. More to come.

---

## Training data

Curated by Urock-AI Lab from public sources — including **MS COCO**, **ko-coco** ([kms7530/ko-coco-bal](https://huggingface.co/datasets/kms7530/ko-coco-bal)), **SUN**, **RVL-CDIP**, **CORD v2**, and **AI Hub** multimodal datasets — alongside an internally curated image set.

---

## Citation & license

Eddy-VL is derived from **[Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)**; please honor its license terms (Apache 2.0) and cite Urock-AI when you use this model. Evaluation uses the MMEB benchmark (Jiang et al., *VLM2Vec*, ICLR 2025).

```bibtex
@misc{eddy_vl_embedding_19b,
  title        = {Eddy-VL Embedding 1.9B},
  author       = {Urock-AI Lab},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Urock-AI/Eddy-vl_embedding_1.9B_v1}}
}
```

---

**Urock-AI Lab** — Digital Forensic AI · [huggingface.co/Urock-AI](https://huggingface.co/Urock-AI) · [urock.kr](https://urock.kr/)