---
license: apache-2.0
language:
- en
- ko
- multilingual
tags:
- vision-language
- embedding
- multimodal-embedding
- mmeb
- digital-forensics
library_name: transformers
pipeline_tag: feature-extraction
base_model:
- Qwen/Qwen3-VL-Embedding-2B
---

# Eddy-VL Embedding 1.9B

[Urock-AI](https://huggingface.co/Urock-AI) · [urock.kr](https://urock.kr/) · License: Apache 2.0

> **Eddy-VL is a multimodal embedding model light enough to run on edge devices.** It keeps most of the retrieval quality of a 2B-class vision-language embedder in a lighter, faster package, built at Urock-AI Lab for real-world multimodal search over images and documents.

---

## The story behind Eddy

At Urock-AI, our work is focused on one thing: solving the real problems that forensic investigators face.

And those problems start with the environment. Investigative work is closed by nature: often offline, frequently air-gapped, running on networks deliberately isolated from the outside world. An investigator rarely has a clean dataset and a datacenter. They have a seized drive, thousands of images and scanned documents, and a question that sounds simple: *"find me everything that looks like this."*

Most strong multimodal embedders can answer that question well, but they're heavy: slow to run, expensive to serve, and impossible to use where the data can never leave the room. That's exactly the case in forensics.

That's why, at Urock-AI, we build dedicated forensic hardware, including offline, edge-deployable devices meant to run right next to the evidence. Eddy-VL is designed for exactly that setting: we wanted the retrieval quality of a 2B-class model in something light enough to live on an edge device, with nothing leaving the room.

Rather than train a new model from scratch, we started from one of the best open multimodal embedders available and carefully **slimmed it down**, teaching the smaller model to stay faithful to the original's sense of what's similar to what. The result is **Eddy-VL 1.9B**: lighter and faster, while landing within roughly 10% of the original on MMEB-V2 (about an 8% relative gap in our in-house evaluation).

---

## Why it might be a good fit

- **Runs on the edge.** Light and fast enough to deploy directly on edge devices, no datacenter required.
- **Faithful, not just smaller.** Tuned to preserve the original model's retrieval behavior rather than chase raw compression.
- **One space for everything.** Images, text, and document pages all map into a shared embedding space, so you can search across them with a single query.
- **Korean-aware.** Curated with Korean image–text data, so retrieval isn't limited to English.

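
To make the shared-space idea concrete, here is a minimal sketch of one query ranked against a mixed index of image, text, and document-page vectors. The tiny 3-d vectors, the item names, and the `search` helper are all made up for illustration; in practice the vectors would come from the model and have 2048 dimensions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def search(query_vec, index, top_k=2):
    """Rank every indexed item, whatever its modality, by cosine similarity."""
    q = l2_normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, l2_normalize(vec))), name, modality)
        for name, modality, vec in index
    ]
    return sorted(scored, reverse=True)[:top_k]

# Hypothetical mixed index: (name, modality, embedding)
index = [
    ("cat_photo.jpg", "image", [0.9, 0.1, 0.0]),
    ("invoice_p3.png", "document", [0.0, 0.2, 0.9]),
    ("a sleeping kitten", "text", [0.8, 0.3, 0.1]),
]

results = search([1.0, 0.2, 0.0], index)
for score, name, modality in results:
    print(f"{score:.3f}  {modality:8s}  {name}")
```
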

---

## What it is (at a glance)

| Property | Details |
|---|---|
| **Type** | Multimodal embedding model (image · text · document → vector) |
| **Size** | 1.93B parameters · 3.85 GB checkpoint |
| **Speed** | ~1.1× faster than the base model |
| **Quality** | within ~10% of the base on [MMEB-V2](https://huggingface.co/datasets/TIGER-Lab/MMEB-V2) (in-house, identical settings) |
| **Embedding** | 2048-d, with shorter dimensions available for cheaper search |
| **Context** | up to 8192 tokens; flexible image resolution |

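
The "shorter dimensions" row deserves a small sketch. Assuming the shorter embeddings are obtained Matryoshka-style, by keeping the first `dim` components of the 2048-d vector and re-normalizing (an assumption; this card does not spell out the mechanism), truncation could look like the following. `truncate_embedding` is a hypothetical helper, not part of the repo. The re-normalization matters because a dot product only equals cosine similarity for unit-length vectors.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length.

    Assumption: shorter embeddings are prefixes of the full vector
    (Matryoshka-style). Check the repo before relying on this.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]          # toy stand-in for a 2048-d unit vector
short = truncate_embedding(full, 2)  # half the storage per vector
print(short)
```
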

---

## Installation

```bash
git clone https://huggingface.co/Urock-AI/Eddy-vl_embedding_1.9B_v1
cd Eddy-vl_embedding_1.9B_v1
pip install "transformers>=5.0" safetensors torch pillow torchvision
pip install decord  # video input (or `av`)
```

Inference code ships with the repo (not a pip package):

| File | Role |
|------|------|
| `vl_embedding_v1.py` | `VLEmbedder`: load weights, encode text / image / video |
| `processing_vl.py` | `VLProcessor`: multimodal tokenization & preprocessing |
| `vl_utils/` | Image & video loading / resizing (bundled) |

## How to use it

Clone the repo first, then run from inside the repo folder:

```python
import torch
from PIL import Image

from vl_embedding_v1 import VLEmbedder

instruction = "Represent this input for retrieval."

# Load from the local repo checkout (weights in ./model.safetensors)
embedder = VLEmbedder(".", torch_dtype=torch.bfloat16, default_instruction=instruction)

# text / image / video → 2048-d vectors (L2-normalized)
text_vec = embedder.process([{"text": "a photo of a cat"}])[0]
image_vec = embedder.process([{"image": Image.open("photo.jpg")}])[0]
video_vec = embedder.process([{"video": "clip.mp4"}])[0]

# Vectors are L2-normalized, so cosine similarity is just the dot product
score = (text_vec @ image_vec.T).item()
```

You can also pass the Hub repo id (`"Urock-AI/Eddy-vl_embedding_1.9B_v1"`) to download weights automatically, but you still need the cloned Python files on `PYTHONPATH`.

> `trust_remote_code=True` is used internally for `processing_vl.py` (`VLProcessor`).

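
If you would rather not run from inside the checkout, one option is to put the cloned folder on the import path yourself. This is a minimal sketch: `add_repo_to_path` is a hypothetical helper, and the directory you pass is wherever you cloned the repo. It only checks that `vl_embedding_v1.py` is present; it does not load the model.

```python
import sys
from pathlib import Path

def add_repo_to_path(repo_dir):
    """Make the cloned repo importable; return True if the loader file is there."""
    repo = Path(repo_dir).expanduser().resolve()
    if str(repo) not in sys.path:
        sys.path.insert(0, str(repo))
    return (repo / "vl_embedding_v1.py").is_file()

if add_repo_to_path("."):  # replace "." with your local checkout
    print("vl_embedding_v1.py found; `from vl_embedding_v1 import VLEmbedder` should resolve")
else:
    print("vl_embedding_v1.py not found; check the checkout path")
```
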

---

## How well it does

Eddy-VL is validated on **[MMEB-V2](https://huggingface.co/datasets/TIGER-Lab/MMEB-V2)** across image, video, and document retrieval. Selected per-task results:

**Image** (hit@1)

| Task | Score |
|:-----|------:|
| DocVQA | 92.4% |
| RefCOCO | 92.4% |
| RefCOCO-Matching | 91.9% |
| WebQA | 85.6% |
| GQA | 85.5% |
| TextVQA | 85.0% |
| ImageNet-R | 84.4% |
| Visual7W-Pointing | 84.0% |
| VOC2007 | 82.7% |
| EDIS | 82.0% |

**Video** (hit@1)

| Task | Score |
|:-----|------:|
| ActivityNetQA | 70.0% |
| MSVD | 66.9% |
| QVHighlight | 66.9% |
| UCF101 | 65.0% |
| HMDB51 | 60.3% |
| NExTQA | 59.8% |
| Something-Something V2 | 59.5% |

**VisDoc** (nDCG@5)

| Task | Score |
|:-----|------:|
| ViDoRe · Synthetic DocQA (AI) | 94.6% |
| ViDoRe · Synthetic DocQA (Healthcare) | 93.9% |
| ViDoRe · Synthetic DocQA (Gov. reports) | 93.8% |
| VisRAG · SlideVQA | 91.5% |
| ViDoRe · TabFQuAD | 90.0% |
| ViDoRe · Synthetic DocQA (Energy) | 89.9% |
| VisRAG · InfoVQA | 86.6% |
| ViDoRe · InfoVQA | 82.4% |
| ViDoSeek · doc | 82.2% |
| VisRAG · ChartQA | 81.0% |

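
For readers unfamiliar with the two metrics used in these tables, here is a rough sketch of how hit@1 and nDCG@5 are commonly computed for a single query. It uses the standard textbook definitions with binary relevance; the actual MMEB-V2 evaluation code may differ in details, and the function names and toy data are illustrative only.

```python
import math

def hit_at_1(ranked_ids, relevant_ids):
    """1.0 if the top-ranked candidate is relevant, else 0.0."""
    return 1.0 if ranked_ids and ranked_ids[0] in relevant_ids else 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """Binary-relevance nDCG@k: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

ranking = ["page_7", "page_2", "page_9"]   # toy system output, best first
relevant = {"page_2"}                      # toy ground truth

print(hit_at_1(ranking, relevant))
print(round(ndcg_at_k(ranking, relevant), 4))
```

Reported scores are these per-query values averaged over all queries in a task. In the toy case the relevant page sits in second place, so hit@1 gives no credit while nDCG@5 gives partial credit.
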
**MMEB-V2 overall (mean across all datasets, in-house eval):**

| Model | Overall |
|:------|--------:|
| Qwen3-VL-Embedding-2B (public leaderboard) | 73.0 |
| Qwen3-VL-Embedding-2B (our environment) | 68.9 |
| **Eddy-VL 1.9B** | **63.2** |

The public leaderboard and our in-house pipeline differ in setup, so we compare Eddy against the teacher re-run in the same environment (68.9 → 63.2). That is a 5.7-point drop, or roughly an **8% relative gap**, for a model that is smaller and ~1.1× faster.

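
The relative gap follows directly from the two in-house scores. As a quick check, using only the numbers in the table:

```python
teacher = 68.9   # Qwen3-VL-Embedding-2B, re-run in the same environment
student = 63.2   # Eddy-VL 1.9B

absolute_gap = teacher - student        # difference in overall score
relative_gap = absolute_gap / teacher   # fraction of the teacher's score lost

print(f"absolute gap: {absolute_gap:.1f} points")
print(f"relative gap: {relative_gap:.1%}")
```
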

---

## Good to know before you rely on it

- **It finds, it doesn't decide.** Eddy-VL surfaces candidates for a human to review; it shouldn't be the sole basis for any high-stakes decision.
- **It's an embedder, not a reranker.** For the final ordering of results, pair it with a dedicated reranker.
- **English benchmarks, broader goals.** MMEB is an English benchmark; Korean and domain-specific performance aren't separately measured in this release yet (that's where we're headed next).
- **Smaller has a cost.** The smaller size comes with a modest, expected quality dip versus the full base model.

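
The embedder-plus-reranker pairing from the list above can be sketched as a two-stage pipeline: a cheap dot-product pass over all embeddings to build a shortlist, then a slower scorer applied only to that shortlist. Everything here is illustrative: the toy vectors and texts, the `retrieve` and `rerank` helpers, and especially `toy_rerank_score`, which is a word-overlap placeholder standing in for whatever dedicated reranker you choose.

```python
def retrieve(query_vec, corpus, top_k):
    """Stage 1: rank all items by dot product (cosine for unit vectors)."""
    scored = [(sum(q * d for q, d in zip(query_vec, vec)), doc_id)
              for doc_id, vec in corpus.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:top_k]]

def rerank(query_text, candidate_ids, texts, score_fn):
    """Stage 2: reorder only the shortlisted candidates with a stronger scorer."""
    return sorted(candidate_ids,
                  key=lambda doc_id: score_fn(query_text, texts[doc_id]),
                  reverse=True)

def toy_rerank_score(query, text):
    """Placeholder scorer (word overlap); a real reranker model goes here."""
    return len(set(query.lower().split()) & set(text.lower().split()))

corpus = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}   # toy unit vectors
texts = {"a": "red car parked outside",
         "b": "receipt for a red car rental",
         "c": "weather report"}

shortlist = retrieve([1.0, 0.0], corpus, top_k=2)
final = rerank("red car rental receipt", shortlist, texts, toy_rerank_score)
print(shortlist, final)
```
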

---

## What's next

This is the first step. The next release pushes Eddy-VL toward **Korean and domain specialization**, with evaluation built around fine-grained, reasoning-heavy retrieval. More to come.

---

## Training data

Curated by Urock-AI Lab from public sources, including **MS COCO**, **ko-coco** ([kms7530/ko-coco-bal](https://huggingface.co/datasets/kms7530/ko-coco-bal)), **SUN**, **RVL-CDIP**, **CORD v2**, and **AI Hub** multimodal datasets, alongside an internally curated image set.

---

## Citation & license

Eddy-VL is derived from **[Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)**; please honor its license terms (Apache 2.0) and cite Urock-AI when you use this model. Evaluation uses the MMEB benchmark (Jiang et al., *VLM2Vec*, ICLR 2025).

```bibtex
@misc{eddy_vl_embedding_19b,
  title = {Eddy-VL Embedding 1.9B},
  author = {Urock-AI Lab},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Urock-AI/Eddy-vl_embedding_1.9B_v1}}
}
```

---

**Urock-AI Lab** · Digital Forensic AI · [huggingface.co/Urock-AI](https://huggingface.co/Urock-AI) · [urock.kr](https://urock.kr/)