---
license: apache-2.0
language:
- en
- ko
- multilingual
tags:
- vision-language
- embedding
- multimodal-embedding
- mmeb
- digital-forensics
library_name: transformers
pipeline_tag: feature-extraction
base_model:
- Qwen/Qwen3-VL-Embedding-2B
---

# Eddy-VL Embedding 1.9B

[Urock-AI](https://huggingface.co/Urock-AI) · [urock.kr](https://urock.kr/) · License: Apache 2.0

> **Eddy-VL is a multimodal embedding model light enough to run on edge devices.** It keeps the retrieval quality of a 2B-class vision-language embedder in a lighter, faster package, built at Urock-AI Lab for real-world multimodal search over images and documents.

---

## The story behind Eddy

At Urock-AI, our work is focused on one thing: solving the real problems that forensic investigators face. And those problems start with the environment. Investigative work is closed by nature: often offline, frequently air-gapped, running on networks deliberately isolated from the outside world.

An investigator rarely has a clean dataset and a datacenter. They have a seized drive, thousands of images and scanned documents, and a question that sounds simple: *"find me everything that looks like this."*

Most strong multimodal embedders can answer that question well, but they're heavy: slow to run, expensive to serve, and impossible to use where the data can never leave the room. That's exactly the case in forensics.

That's why, at Urock-AI, we build dedicated forensic hardware, including offline, edge-deployable devices meant to run right next to the evidence. Eddy-VL is designed for exactly that setting: we wanted the retrieval quality of a 2B-class model in something light enough to live on an edge device, with nothing leaving the room.

Rather than train a new model from scratch, we started from one of the best open multimodal embedders available and carefully **slimmed it down**, teaching the smaller model to stay faithful to the original's sense of what's similar to what.
The result is **Eddy-VL 1.9B**: lighter and faster, while landing within roughly 10% of the original on the standard benchmark.

---

## Why it might be a good fit

- **Runs on the edge.** Light and fast enough to deploy directly on edge devices, no datacenter required.
- **Faithful, not just smaller.** Tuned to preserve the original model's retrieval behavior rather than chase raw compression.
- **One space for everything.** Images, text, and document pages all map into a shared embedding space, so you can search across them with a single query.
- **Korean-aware.** Curated with Korean image–text data, for retrieval that isn't limited to English-only models.

---

## What it is (at a glance)

| | |
|---|---|
| **Type** | Multimodal embedding model (image · text · document → vector) |
| **Size** | 1.93B parameters · 3.85 GB checkpoint |
| **Speed** | ~1.1× faster than the base model |
| **Quality** | within ~10% of the base on [MMEB-V2](https://huggingface.co/datasets/TIGER-Lab/MMEB-V2) (in-house, identical settings) |
| **Embedding** | 2048-d, with shorter dimensions available for cheaper search |
| **Context** | up to 8192 tokens; flexible image resolution |

---

## Installation

```bash
pip install "transformers>=5.0" safetensors torch pillow torchvision
pip install decord  # video input (or `av`)
```

Clone this repo (or add it to `PYTHONPATH`); the inference code ships with the model:

| File | Role |
|------|------|
| `vl_embedding_v1.py` | `VLEmbedder`: load weights, encode text / image / video |
| `processing_vl.py` | `VLProcessor`: multimodal tokenization & preprocessing |
| `vl_utils/` | Image & video loading / resizing (bundled) |

## How to use it

```python
import torch
from PIL import Image
from vl_embedding_v1 import VLEmbedder

model_id = "Urock-AI/Eddy-vl_embedding_1.9B_v1"
instruction = "Represent this input for retrieval."
embedder = VLEmbedder(model_id, torch_dtype=torch.bfloat16, default_instruction=instruction)

# text / image / video → 2048-d vectors (L2-normalized)
text_vec = embedder.process([{"text": "a photo of a cat"}])[0]
image_vec = embedder.process([{"image": Image.open("photo.jpg")}])[0]
video_vec = embedder.process([{"video": "clip.mp4"}])[0]

# cosine similarity = dot product
score = (text_vec @ image_vec.T).item()
```

> `trust_remote_code=True` is required for `processing_vl.py` (`VLProcessor`) in this repo.

---

## How well it does

Eddy-VL is validated on **[MMEB-V2](https://huggingface.co/datasets/TIGER-Lab/MMEB-V2)** across image, video, and document retrieval. Selected per-task results:

**Image** (hit@1)

| Task | Score |
|:-----|------:|
| DocVQA | 92.4% |
| RefCOCO | 92.4% |
| RefCOCO-Matching | 91.9% |
| WebQA | 85.6% |
| GQA | 85.5% |
| TextVQA | 85.0% |
| ImageNet-R | 84.4% |
| Visual7W-Pointing | 84.0% |
| VOC2007 | 82.7% |
| EDIS | 82.0% |

**Video** (hit@1)

| Task | Score |
|:-----|------:|
| ActivityNetQA | 70.0% |
| MSVD | 66.9% |
| QVHighlight | 66.9% |
| UCF101 | 65.0% |
| HMDB51 | 60.3% |
| NExTQA | 59.8% |
| Something-Something V2 | 59.5% |

**VisDoc** (nDCG@5)

| Task | Score |
|:-----|------:|
| ViDoRe · Synthetic DocQA (AI) | 94.6% |
| ViDoRe · Synthetic DocQA (Healthcare) | 93.9% |
| ViDoRe · Synthetic DocQA (Gov. reports) | 93.8% |
| VisRAG · SlideVQA | 91.5% |
| ViDoRe · TabFQuAD | 90.0% |
| ViDoRe · Synthetic DocQA (Energy) | 89.9% |
| VisRAG · InfoVQA | 86.6% |
| ViDoRe · InfoVQA | 82.4% |
| ViDoSeek · doc | 82.2% |
| VisRAG · ChartQA | 81.0% |

**On reading the numbers:** the public MMEB leaderboard lists the base model around 73.0, but re-running it ourselves in the same environment gives 68.9. Measured that way, apples to apples, Eddy-VL lands within about 10% of the base model overall, while being smaller and faster. We report the in-house baseline (68.9) so the comparison is fair rather than flattering.
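
For readers unfamiliar with the two metrics in the tables above, here is a minimal sketch of how hit@1 and nDCG@5 can be computed from L2-normalized embeddings. It is not the MMEB-V2 evaluation harness: the function names (`l2_normalize`, `hit_at_1`, `ndcg_at_k`) and the toy vectors are hypothetical, and the nDCG variant shown uses the linear-gain form, which may differ in detail from the official scorer.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so that dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def hit_at_1(query_vecs, cand_vecs, gold):
    """Fraction of queries whose top-scoring candidate is the gold candidate."""
    scores = query_vecs @ cand_vecs.T  # (num_queries, num_candidates)
    return float(np.mean(scores.argmax(axis=1) == gold))

def ndcg_at_k(scores, relevance, k=5):
    """nDCG@k for a single query, with gain rel / log2(rank + 1)."""
    order = np.argsort(-scores)[:k]                 # top-k candidate indices
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # ranks 1..k
    dcg = float(np.sum(relevance[order] * discounts[: len(order)]))
    ideal = np.sort(relevance)[::-1][:k]            # best possible ordering
    idcg = float(np.sum(ideal * discounts[: len(ideal)]))
    return dcg / idcg if idcg > 0 else 0.0

# Toy data (made up for illustration; real vectors would come from the embedder).
queries = l2_normalize(np.array([[1.0, 0.0], [0.0, 1.0]]))
cands = l2_normalize(np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]]))
gold = np.array([0, 1])  # index of the correct candidate for each query

print("hit@1:", hit_at_1(queries, cands, gold))
print("nDCG@5:", ndcg_at_k(np.array([0.2, 0.9, 0.5]), np.array([1.0, 0.0, 0.0])))
```

hit@1 only asks whether the single best match is correct, which suits the image and video tasks; nDCG@5 also rewards relevant pages that appear lower in the top five, which is why it is the usual choice for document retrieval.
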

---

## Good to know before you rely on it

- **It finds, it doesn't decide.** Eddy-VL surfaces candidates for a human to review; it shouldn't be the sole basis for any high-stakes decision.
- **It's an embedder, not a reranker.** For the final ordering of results, pair it with a dedicated reranker.
- **English benchmarks, broader goals.** MMEB is English; Korean and domain-specific performance aren't separately measured in this release yet (that's where we're headed next).
- **Smaller has a cost.** A modest, expected quality dip versus the full base model comes with the smaller size.

---

## What's next

This is the first step. The next release pushes Eddy-VL toward **Korean and domain specialization**, with evaluation built around fine-grained, reasoning-heavy retrieval. More to come.

---

## Training data

Curated by Urock-AI Lab from public sources, including **MS COCO**, **ko-coco** ([kms7530/ko-coco-bal](https://huggingface.co/datasets/kms7530/ko-coco-bal)), **SUN**, **RVL-CDIP**, **CORD v2**, and **AI Hub** multimodal datasets, alongside an internally curated image set.

---

## Citation & license

Eddy-VL is derived from **[Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)**; please honor its license terms (Apache 2.0) and cite Urock-AI when you use this model. Evaluation uses the MMEB benchmark (Jiang et al., *VLM2Vec*, ICLR 2025).

```bibtex
@misc{eddy_vl_embedding_19b,
  title = {Eddy-VL Embedding 1.9B},
  author = {Urock-AI Lab},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Urock-AI/Eddy-vl_embedding_1.9B_v1}}
}
```

---

**Urock-AI Lab** · Digital Forensic AI · [huggingface.co/Urock-AI](https://huggingface.co/Urock-AI) · [urock.kr](https://urock.kr/)