---
license: apache-2.0
language:
- en
- ko
- multilingual
tags:
- vision-language
- embedding
- multimodal-embedding
- mmeb
- digital-forensics
library_name: transformers
pipeline_tag: feature-extraction
base_model:
- Qwen/Qwen3-VL-Embedding-2B
---


# Eddy-VL Embedding 1.9B

[Urock-AI](https://huggingface.co/Urock-AI) · [urock.kr](https://urock.kr/) · License: Apache 2.0

> **Eddy-VL is a multimodal embedding model light enough to run on edge devices.** It keeps most of the retrieval quality of a 2B-class vision-language embedder in a lighter, faster package, built at Urock-AI Lab for real-world multimodal search over images and documents.

---

## The story behind Eddy

At Urock-AI, our work is focused on one thing: solving the real problems that forensic investigators face. And those problems start with the environment. Investigative work is closed by nature: often offline, frequently air-gapped, running on networks deliberately isolated from the outside world.

An investigator rarely has a clean dataset and a datacenter. They have a seized drive, thousands of images and scanned documents, and a question that sounds simple: *"find me everything that looks like this."*

Most strong multimodal embedders can answer that question well, but they're heavy: slow to run, expensive to serve, and impossible to use where the data can never leave the room. That's exactly the case in forensics.

That's why, at Urock-AI, we build dedicated forensic hardware, including offline, edge-deployable devices meant to run right next to the evidence. Eddy-VL is designed for exactly that setting: we wanted the retrieval quality of a 2B-class model in something light enough to live on an edge device, with nothing leaving the room.

Rather than train a new model from scratch, we started from one of the best open multimodal embedders available and carefully **slimmed it down**, teaching the smaller model to stay faithful to the original's sense of what's similar to what. The result is **Eddy-VL 1.9B**: lighter and faster, while landing within roughly 10% of the original on the standard benchmark.

---

## Why it might be a good fit

- **Runs on the edge.** Light and fast enough to deploy directly on edge devices, no datacenter required.
- **Faithful, not just smaller.** Tuned to preserve the original model's retrieval behavior rather than chase raw compression.
- **One space for everything.** Images, text, and document pages all map into a shared embedding space, so you can search across them with a single query.
- **Korean-aware.** Curated with Korean image–text data, for retrieval that isn't limited to English-only models.

---

## What it is (at a glance)

| | |
|---|---|
| **Type** | Multimodal embedding model (image · text · document → vector) |
| **Size** | 1.93B parameters · 3.85 GB checkpoint |
| **Speed** | ~1.1× faster than the base model |
| **Quality** | within ~10% of the base on [MMEB-V2](https://huggingface.co/datasets/TIGER-Lab/MMEB-V2) (in-house, identical settings) |
| **Embedding** | 2048-d, with shorter dimensions available for cheaper search |
| **Context** | up to 8192 tokens; flexible image resolution |

---

## How well it does

Eddy-VL is validated on **[MMEB-V2](https://huggingface.co/datasets/TIGER-Lab/MMEB-V2)** across image, video, and document retrieval. Selected per-task results:

**Image** (hit@1)

| Task | Score |
|:-----|------:|
| DocVQA | 92.4% |
| RefCOCO | 92.4% |
| RefCOCO-Matching | 91.9% |
| WebQA | 85.6% |
| GQA | 85.5% |
| TextVQA | 85.0% |
| ImageNet-R | 84.4% |
| Visual7W-Pointing | 84.0% |
| VOC2007 | 82.7% |
| EDIS | 82.0% |

**Video** (hit@1)

| Task | Score |
|:-----|------:|
| ActivityNetQA | 70.0% |
| MSVD | 66.9% |
| QVHighlight | 66.9% |
| UCF101 | 65.0% |
| HMDB51 | 60.3% |
| NExTQA | 59.8% |
| Something-Something V2 | 59.5% |

**VisDoc** (nDCG@5)

| Task | Score |
|:-----|------:|
| ViDoRe · Synthetic DocQA (AI) | 94.6% |
| ViDoRe · Synthetic DocQA (Healthcare) | 93.9% |
| ViDoRe · Synthetic DocQA (Gov. reports) | 93.8% |
| VisRAG · SlideVQA | 91.5% |
| ViDoRe · TabFQuAD | 90.0% |
| ViDoRe · Synthetic DocQA (Energy) | 89.9% |
| VisRAG · InfoVQA | 86.6% |
| ViDoSeek · doc | 82.2% |
| ViDoRe · InfoVQA | 82.4% |
| VisRAG · ChartQA | 81.0% |

**MMEB-V2 overall (mean across all datasets):**

| Model | Overall |
|:------|--------:|
| Qwen3-VL-Embedding-2B (public leaderboard) | 73.0 |
| Qwen3-VL-Embedding-2B (our environment) | 68.9 |
| **Eddy-VL 1.9B** | **63.2** |

The public leaderboard and our in-house pipeline differ in setup, so we compare Eddy against the teacher re-run in the same environment (68.9 → 63.2). That is roughly an **8% relative gap** ((68.9 − 63.2) / 68.9 ≈ 8.3%), in exchange for a model that is smaller and ~1.1× faster.

---

## Installation

```bash
pip install "transformers>=5.0" safetensors torch pillow torchvision huggingface_hub
pip install decord  # video input (or `av`)
```

**Download the model** (recommended, no git-lfs required):

```python
from huggingface_hub import snapshot_download

path = snapshot_download("Urock-AI/Eddy-vl_embedding_1.9B_v1")
```

Or with git ([git-lfs](https://git-lfs.com/) required for weights):

```bash
git lfs install
git clone https://huggingface.co/Urock-AI/Eddy-vl_embedding_1.9B_v1
cd Eddy-vl_embedding_1.9B_v1
```

Inference code ships with the repo (not a pip package):

| File | Role |
|------|------|
| `vl_embedding_v1.py` | `VLEmbedder`: load weights, encode text / image / video |
| `processing_vl.py` | `VLProcessor`: multimodal tokenization & preprocessing |
| `vl_utils/` | Image & video loading / resizing (bundled) |

## How to use it

Clone the repo first, then run from inside the repo folder:

```python
import torch
from PIL import Image
from vl_embedding_v1 import VLEmbedder

instruction = "Represent this input for retrieval."

# load from local repo checkout (weights in ./model.safetensors)
embedder = VLEmbedder(".", torch_dtype=torch.bfloat16, default_instruction=instruction)

# text / image / video → 2048-d vectors (L2-normalized)
text_vec = embedder.process([{"text": "a photo of a cat"}])[0]
image_vec = embedder.process([{"image": Image.open("photo.jpg")}])[0]
video_vec = embedder.process([{"video": "clip.mp4"}])[0]

# cosine similarity = dot product
score = (text_vec @ image_vec.T).item()
```

You can also pass the Hub repo id (`"Urock-AI/Eddy-vl_embedding_1.9B_v1"`) to download weights automatically, but you still need the cloned Python files on `PYTHONPATH`.

> `trust_remote_code=True` is used internally for `processing_vl.py` (`VLProcessor`).

---

## Good to know before you rely on it

- **It finds, it doesn't decide.** Eddy-VL surfaces candidates for a human to review; it shouldn't be the sole basis for any high-stakes decision.
- **It's an embedder, not a reranker.** For the final ordering of results, pair it with a dedicated reranker.
- **English benchmarks, broader goals.** MMEB is English; Korean and domain-specific performance aren't separately measured in this release yet (that's where we're headed next).
- **Smaller has a cost.** A modest, expected quality dip versus the full base model comes with the smaller size.

---

## What's next

This is the first step. The next release pushes Eddy-VL toward **Korean and domain specialization**, with evaluation built around fine-grained, reasoning-heavy retrieval. More to come.

---

## Training data

Curated by Urock-AI Lab from public sources, including **MS COCO**, **ko-coco** ([kms7530/ko-coco-bal](https://huggingface.co/datasets/kms7530/ko-coco-bal)), **SUN**, **RVL-CDIP**, **CORD v2**, and **AI Hub** multimodal datasets, alongside an internally curated image set.
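
---

## Sketch: ranking candidates with the embeddings

The usage snippet in "How to use it" says the vectors are L2-normalized, so cosine similarity reduces to a dot product, and the at-a-glance table mentions shorter embedding dimensions for cheaper search. The sketch below shows one way these two ideas could fit together in a small top-k search. It is a minimal illustration, not code from the repo: the helper names (`l2_normalize`, `truncate`, `top_k`) are hypothetical, random vectors stand in for real model outputs, and keeping the first `dim` components followed by re-normalization is an assumption about how the shorter dimensions are meant to be used.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def truncate(x: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components, then re-normalize.

    Assumption: shorter embeddings are obtained by prefix truncation.
    The model card does not spell out the mechanism.
    """
    return l2_normalize(x[..., :dim])


def top_k(query: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Return (index, score) pairs for the k corpus rows most similar to query."""
    scores = corpus @ query            # cosine similarity for unit vectors
    order = np.argsort(-scores)[:k]    # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Stand-in data: 100 random "document" vectors and a query near document 42.
# With the real model these would be embedder outputs converted to NumPy,
# e.g. vec.float().cpu().numpy() for a bfloat16 torch tensor.
rng = np.random.default_rng(0)
corpus = l2_normalize(rng.standard_normal((100, 2048)))
query = l2_normalize(corpus[42] + 0.01 * rng.standard_normal(2048))

hits_full = top_k(query, corpus)                                  # 2048-d search
hits_small = top_k(truncate(query, 512), truncate(corpus, 512))   # 512-d search

print("full :", hits_full)
print("small:", hits_small)
```

A brute-force matrix product like this is usually enough for a few hundred thousand vectors on a single device. As noted under "Good to know", the resulting order is a candidate list for review or for a dedicated reranker, not a final decision.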
---

## Citation & license

Eddy-VL is derived from **[Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)**; please honor its license terms (Apache 2.0) and cite Urock-AI when you use this model. Evaluation uses the MMEB benchmark (Jiang et al., *VLM2Vec*, ICLR 2025).

```bibtex
@misc{eddy_vl_embedding_19b,
  title        = {Eddy-VL Embedding 1.9B},
  author       = {Urock-AI Lab},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Urock-AI/Eddy-vl_embedding_1.9B_v1}}
}
```

---

**Urock-AI Lab** · Digital Forensic AI · [huggingface.co/Urock-AI](https://huggingface.co/Urock-AI) · [urock.kr](https://urock.kr/)