Eddy-VL Embedding 1.9B

Urock-AI · urock.kr · License: Apache 2.0

Eddy-VL is a multimodal embedding model light enough to run on edge devices. It keeps the retrieval quality of a 2B-class vision-language embedder in a lighter, faster package — built at Urock-AI Lab for real-world multimodal search over images, video, and documents.

The story behind Eddy

At Urock-AI, our work is focused on one thing: solving the real problems that forensic investigators face.

And those problems start with the environment. Investigative work is closed by nature — often offline, frequently air-gapped, running on networks deliberately isolated from the outside world. An investigator rarely has a clean dataset and a datacenter. They have a seized drive, thousands of images and scanned documents, and a question that sounds simple: "find me everything that looks like this."

Most strong multimodal embedders can answer that question well, but they're heavy: slow to run, expensive to serve, and impossible to use where the data can never leave the room. That's exactly the case in forensics.

That's why, at Urock-AI, we build dedicated forensic hardware — including offline, edge-deployable devices meant to run right next to the evidence. Eddy-VL is designed for exactly that setting: we wanted the retrieval quality of a 2B-class model in something light enough to live on an edge device, with nothing leaving the room.

Rather than train a new model from scratch, we started from one of the best open multimodal embedders available and carefully slimmed it down, teaching the smaller model to stay faithful to the original's sense of what's similar to what. The result is Eddy-VL 1.9B: lighter and faster, while landing within roughly 10% of the original on MMEB-V2.

Why it might be a good fit

Runs on the edge. Light and fast enough to deploy directly on edge devices, no datacenter required.
Faithful, not just smaller. Tuned to preserve the original model's retrieval behavior rather than chase raw compression.
One space for everything. Images, video, text, and document pages all map into a shared embedding space, so you can search across them with a single query.
Korean-aware. Curated with Korean image–text data, for retrieval that isn't limited to English-only models.

What it is (at a glance)


Type	Multimodal embedding model (image · video · text · document → vector)
Size	1.93B parameters · 3.85 GB checkpoint
Speed	~1.1× faster than the base model
Quality	within ~10% of the base on MMEB-V2 (in-house, identical settings)
Embedding	2048-d, with shorter dimensions available for cheaper search
Context	up to 8192 tokens; flexible image resolution
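
The card does not say how the shorter embedding sizes are produced. A common convention, assumed here purely for illustration, is Matryoshka-style truncation: keep the first k coordinates and re-normalize so that dot products remain cosine similarities. The helper names `truncate_embedding` and `index_bytes` and the toy vector are hypothetical, not part of the repo.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and L2-normalize the result.

    Assumption: shorter embeddings are prefixes of the full vector
    (Matryoshka-style). Check the repo before relying on this.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

def index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for a flat index of float32 vectors."""
    return num_vectors * dim * bytes_per_value

full = [3.0, 4.0, 0.0, 0.0]           # toy stand-in for a 2048-d embedding
short = truncate_embedding(full, 2)   # unit-length 2-d prefix

# Storage for one million vectors at full and reduced width.
full_size = index_bytes(1_000_000, 2048)
small_size = index_bytes(1_000_000, 512)
```

Under these assumptions, a 512-d index takes a quarter of the storage of a 2048-d one, at whatever quality cost the truncation carries for your data.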

How well it does

Eddy-VL is validated on MMEB-V2 across image, video, and document retrieval. Selected per-task results:

Image (hit@1)

Task	Score
DocVQA	92.4%
RefCOCO	92.4%
RefCOCO-Matching	91.9%
WebQA	85.6%
GQA	85.5%
TextVQA	85.0%
ImageNet-R	84.4%
Visual7W-Pointing	84.0%
VOC2007	82.7%
EDIS	82.0%

Video (hit@1)

Task	Score
ActivityNetQA	70.0%
MSVD	66.9%
QVHighlight	66.9%
UCF101	65.0%
HMDB51	60.3%
NExTQA	59.8%
Something-Something V2	59.5%

VisDoc (nDCG@5)

Task	Score
ViDoRe · Synthetic DocQA (AI)	94.6%
ViDoRe · Synthetic DocQA (Healthcare)	93.9%
ViDoRe · Synthetic DocQA (Gov. reports)	93.8%
VisRAG · SlideVQA	91.5%
ViDoRe · TabFQuAD	90.0%
ViDoRe · Synthetic DocQA (Energy)	89.9%
VisRAG · InfoVQA	86.6%
ViDoRe · InfoVQA	82.4%
ViDoSeek · doc	82.2%
VisRAG · ChartQA	81.0%
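
For readers less familiar with the two metrics in the tables above, here is a minimal sketch of how hit@1 and nDCG@k are usually computed for a single query. The function names are illustrative, and the exact nDCG variant used by the MMEB-V2 evaluation code is not stated in this card; with binary relevance labels, the linear-gain form shown here and the exponential-gain form give the same value.

```python
import math

def hit_at_1(ranked_ids, relevant_ids):
    """1.0 if the top-ranked item is relevant, else 0.0."""
    return 1.0 if ranked_ids and ranked_ids[0] in relevant_ids else 0.0

def ndcg_at_k(ranked_ids, relevance, k=5):
    """nDCG@k with linear gains: DCG = sum(rel / log2(position + 1))."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    # enumerate is 0-based, so position = rank + 1 and the discount is log2(rank + 2)
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy query: the only relevant document is ranked second.
ranked = ["d3", "d1", "d2"]
relevance = {"d1": 1.0}

h1 = hit_at_1(ranked, set(relevance))   # miss at rank 1
ndcg = ndcg_at_k(ranked, relevance)     # partial credit for rank 2
```

The benchmark score is the mean of the per-query value over all queries in a task.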

MMEB-V2 overall (mean across all datasets, in-house eval):

Model	Overall
Qwen3-VL-Embedding-2B (public leaderboard)	73.0
Qwen3-VL-Embedding-2B (our environment)	68.9
Eddy-VL 1.9B	63.2

The public leaderboard and our in-house pipeline differ in setup, so we compare Eddy against the teacher re-run in the same environment (68.9 → 63.2). That is an absolute gap of 5.7 points, or roughly 8% relative, for a model that is smaller and ~1.1× faster.
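
The relative gap follows directly from the two in-house overall scores in the table. A short check, using only those numbers:

```python
teacher_overall = 68.9   # Qwen3-VL-Embedding-2B, re-run in the same environment
student_overall = 63.2   # Eddy-VL 1.9B

absolute_gap = teacher_overall - student_overall
relative_gap = absolute_gap / teacher_overall
retained = student_overall / teacher_overall

print(f"absolute gap: {absolute_gap:.1f} points")   # 5.7 points
print(f"relative gap: {relative_gap:.1%}")          # 8.3%
print(f"retained:     {retained:.1%}")              # 91.7%
```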

Fine-grained understanding (retained from the base model)

Compression usually costs compositional ability — telling "the door left of the shirt" from "the shirt left of the door." We checked this on four benchmarks that probe attribute / relation / binding and reasoning-driven retrieval, comparing Eddy against the model it was distilled from (in-house eval):

Benchmark	What it probes	Eddy-VL 1.9B	Base 2B
SugarCrepe	attribute / object / relation discrimination	86.1	86.4
ARO	attribute & relation matching	59.5	60.4
MR²-Bench (nDCG@10)	reasoning-intensive retrieval	24.5	24.7
Winoground (group)	text–image binding (hardest)	6.8	8.5

On attribute/relation and reasoning retrieval, Eddy stays within ~1 point of the base model, so the compression preserves fine-grained understanding, not just coarse retrieval. Winoground's group score (a deliberately hard 2×2 binding test where most models land in the single digits) shows the largest drop: 8.5 → 6.8, about 20% relative.
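
To make the 2×2 binding test concrete: each Winoground example has two captions and two images, and a model earns the group score only if every caption prefers its own image and every image prefers its own caption. The sketch below follows the published Winoground scoring rules; the function name and the similarity values are made up for illustration.

```python
def winoground_scores(sim):
    """Score one example, where sim[c][i] is the similarity of caption c and image i.

    Caption 0 belongs with image 0, caption 1 with image 1.
    """
    # Text score: for each image, the matching caption must score higher.
    text_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # Image score: for each caption, the matching image must score higher.
    image_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}

# Hypothetical similarities: close, but caption 1 slightly prefers the wrong image.
example = [
    [0.62, 0.58],
    [0.60, 0.59],
]
scores = winoground_scores(example)
```

Here the text check passes but the image check fails, so the example counts as a group miss. Requiring all four comparisons at once is why group scores sit so far below the other benchmarks in the table.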

Installation

pip install "transformers>=5.0" safetensors torch pillow torchvision huggingface_hub
pip install decord   # video input (or `av`)

Download the model (recommended — no git-lfs required):

from huggingface_hub import snapshot_download
path = snapshot_download("Urock-AI/Eddy-vl_embedding_1.9B_v1")

Or with git (git-lfs required for weights):

git lfs install
git clone https://huggingface.co/Urock-AI/Eddy-vl_embedding_1.9B_v1
cd Eddy-vl_embedding_1.9B_v1

Inference code ships with the repo (not a pip package):

File	Role
`vl_embedding_v1.py`	`VLEmbedder` — load weights, encode text / image / video
`processing_vl.py`	`VLProcessor` — multimodal tokenization & preprocessing
`vl_utils/`	Image & video loading / resizing (bundled)

How to use it

Download or clone the repo first, then run from inside the repo folder:

import torch
from PIL import Image
from vl_embedding_v1 import VLEmbedder

instruction = "Represent this input for retrieval."

# load from local repo checkout (weights in ./model.safetensors)
embedder = VLEmbedder(".", torch_dtype=torch.bfloat16, default_instruction=instruction)

# text / image / video → 2048-d vectors (L2-normalized)
text_vec = embedder.process([{"text": "a photo of a cat"}])[0]
image_vec = embedder.process([{"image": Image.open("photo.jpg")}])[0]
video_vec = embedder.process([{"video": "clip.mp4"}])[0]

# cosine similarity = dot product
score = (text_vec @ image_vec.T).item()

You can also pass the Hub repo id ("Urock-AI/Eddy-vl_embedding_1.9B_v1") to download weights automatically, but you still need the cloned Python files on PYTHONPATH.

trust_remote_code=True is used internally for processing_vl.py (VLProcessor).
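
Going from a single similarity score to a ranked result list is a small step. The sketch below uses plain Python lists and toy unit vectors so it runs without the model; in practice the vectors would come from `embedder.process(...)`. The `top_k` helper and the file names in the toy corpus are illustrative, not part of the repo.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, corpus, k=3):
    """Rank (doc_id, vector) pairs by dot product with the query.

    For L2-normalized vectors the dot product equals cosine similarity.
    """
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy unit vectors standing in for embedder outputs (hypothetical data).
corpus = [
    ("photo.jpg", [1.0, 0.0]),
    ("clip.mp4", [0.6, 0.8]),
    ("scan.png", [0.0, 1.0]),
]
query = [0.8, 0.6]

results = top_k(query, corpus, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.2f}")
```

Because images, video, text, and document pages share one embedding space, the same ranking code covers mixed-media corpora. For large collections, a vector index would replace the linear scan.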

Good to know before you rely on it

It finds, it doesn't decide. Eddy-VL surfaces candidates for a human to review; it shouldn't be the sole basis for any high-stakes decision.
It's an embedder, not a reranker. For the final ordering of results, pair it with a dedicated reranker.
English benchmarks, broader goals. MMEB is English; multilingual and domain-specific performance aren't separately measured in this release yet (that's where we're headed next).
Smaller has a cost. Expect a modest quality dip versus the base model: about 8% relative on MMEB-V2 overall (68.9 → 63.2 in our environment).
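
The embedder-plus-reranker pairing mentioned above usually takes the form of a two-stage pipeline: the embedder produces a shortlist cheaply, then a slower scorer reorders only that shortlist. This is a minimal sketch of the pattern, not code from the repo. `rerank_fn` is a placeholder for whatever reranker you choose, and `toy_reranker` is a deliberately crude word-overlap stand-in.

```python
def retrieve_then_rerank(query, query_vec, corpus, rerank_fn, k=10):
    """Stage 1: embedding shortlist. Stage 2: reorder the shortlist.

    corpus: list of (doc, vector) pairs with L2-normalized vectors.
    rerank_fn(query, doc) -> float is a placeholder for a real reranker.
    """
    shortlist = sorted(
        corpus,
        key=lambda item: sum(q * v for q, v in zip(query_vec, item[1])),
        reverse=True,
    )[:k]
    docs = [doc for doc, _ in shortlist]
    return sorted(docs, key=lambda doc: rerank_fn(query, doc), reverse=True)

def toy_reranker(query, doc):
    """Hypothetical stand-in: count words shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

# Toy corpus with made-up unit vectors; the embedding stage ranks the
# less specific caption first, and the reranker corrects the order.
corpus = [
    ("red car parked outside", [0.8, 0.6]),
    ("a red sports car", [0.6, 0.8]),
    ("blue bicycle", [0.0, 1.0]),
]
ranked = retrieve_then_rerank("red sports car", [1.0, 0.0], corpus, toy_reranker, k=2)
```

The shortlist size k trades recall against reranking cost: the reranker can only reorder what the embedding stage surfaced.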

What's next

This is the first step. Where we're taking Eddy-VL from here:

Broader language coverage — moving beyond English-centric evaluation toward strong retrieval across many languages.
Ontology-aware retrieval — richer semantic search grounded in structured object / attribute / relation understanding.
Domain specialization — tuning for the kinds of fine-grained, reasoning-heavy search real-world workflows demand.
On-device efficiency — pushing further on size and speed so the model runs comfortably on edge hardware.

More to come.

Training data

Curated by Urock-AI Lab from public sources — including MS COCO, ko-coco (kms7530/ko-coco-bal), SUN, RVL-CDIP, CORD v2, and AI Hub multimodal datasets — alongside an internally curated image set.

Citation & license

Eddy-VL is derived from Qwen3-VL-Embedding-2B; please honor its license terms (Apache 2.0) and cite Urock-AI when you use this model. Evaluation uses the MMEB benchmark (Jiang et al., VLM2Vec, ICLR 2025).

@misc{eddy_vl_embedding_19b,
  title        = {Eddy-VL Embedding 1.9B},
  author       = {Urock-AI Lab},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Urock-AI/Eddy-vl_embedding_1.9B_v1}}
}

Urock-AI Lab — Digital Forensic AI · huggingface.co/Urock-AI · urock.kr
