jina-embeddings-v5-omni-nano · ONNX
ONNX exports of jinaai/jina-embeddings-v5-omni-nano for in-browser inference via transformers.js v4 with WebGPU.
jina-embeddings-v5-omni-nano is a ~1.04 B-parameter multimodal embedding model that maps text, images, audio, and video into a single shared 768-dimensional L2-normalized space, enabling cross-modal retrieval without reindexing.
Architecture
The base model is a LLaVA-style composition of three frozen encoders plus small trainable projectors, with per-task LoRA adapters. This repo's exports have the retrieval task adapter merged into the static weights.
| Tower | Backbone | Role |
|---|---|---|
| Text | EuroBERT-210m (loaded as a bidirectional Llama) | Text encoder |
| Vision | Qwen3-VL vision tower + spatial merger | Image and video frames |
| Audio | Whisper-large-v3 encoder + Qwen2.5-Omni audio adapter | Audio (16 kHz mono) |
Each tower is bundled with its retrieval projector AND a copy of the language model (which performs the final cross-modal fusion + last-token pooling), so the three ONNX graphs are independently loadable.
Files
Three ONNX graphs, three quantization variants each. Every variant ships as a small .onnx schema + a large .onnx.data or .onnx_data external-weight sidecar (HF Hub-friendly layout, no 2 GB protobuf limits).
| Modality | fp32 | fp16 | q4f16 | Verified parity vs fp32 (fp16) |
|---|---|---|---|---|
| Text | 849 MB | 424 MB | 263 MB | cos = 1.000000 |
| Vision | 1247 MB | 622 MB | 460 MB | cos = 0.999998 |
| Audio | 3400 MB | 1700 MB | 1465 MB | (fp16 doesn't load on CPU EP; verify on WebGPU) |
For desktop WebGPU, use q4f16: it's the smallest and runs natively on shader-int4 hardware. Use fp16 if you need higher numerical fidelity or your GPU lacks int4 paths.
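Because every variant is a small graph plus an external-weight sidecar, a session created from a URL may need to be told where the sidecar lives. The sketch below builds the options object for ort.InferenceSession.create using onnxruntime-web's externalData session option. The helper name buildSessionOptions and the sidecar file name are assumptions for illustration; the path has to match the name recorded inside the .onnx file, so check the repository's file listing first.

```javascript
// Hypothetical helper (not part of this repo): builds session options that
// point onnxruntime-web at the external-weight sidecar next to the graph.
function buildSessionOptions(baseUrl, sidecarName) {
  return {
    executionProviders: ["webgpu", "wasm"],
    externalData: [
      // `path` must equal the external-data location stored in the .onnx file;
      // `data` is where the bytes are fetched from.
      { path: sidecarName, data: `${baseUrl}/${sidecarName}` },
    ],
  };
}

const base =
  "https://huggingface.co/onnx-community/jina-embeddings-v5-omni-nano-ONNX/resolve/main/onnx";
// ASSUMPTION: the sidecar name below is a guess (.onnx_data vs .onnx.data varies per file).
const options = buildSessionOptions(base, "text_model_q4f16.onnx_data");
// Then: await ort.InferenceSession.create(`${base}/text_model_q4f16.onnx`, options);
```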
Every graph outputs a single tensor sentence_embedding of shape [batch, 768], already L2-normalized, so cosine similarity reduces to a dot product.
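A minimal sketch of consuming that output: the flat buffer behind a [batch, 768] tensor is split into one vector per row, and each row's L2 norm is checked. The helper names splitRows and l2Norm are hypothetical, and the buffer here is synthetic stand-in data rather than real model output.

```javascript
const EMBED_DIM = 768;

// Split the flat data of a [batch, 768] tensor into one Float32Array view per row.
function splitRows(data, dims) {
  const [batch, dim] = dims;
  if (dim !== EMBED_DIM || data.length !== batch * dim) {
    throw new Error(`unexpected shape [${dims}] for ${data.length} values`);
  }
  const rows = [];
  for (let b = 0; b < batch; b++) {
    rows.push(data.subarray(b * dim, (b + 1) * dim));
  }
  return rows;
}

// Euclidean length of a vector; should be close to 1 for every row.
function l2Norm(v) {
  let sum = 0;
  for (let i = 0; i < v.length; i++) sum += v[i] * v[i];
  return Math.sqrt(sum);
}

// Synthetic stand-in for sentence_embedding.data with batch = 2.
const fake = new Float32Array(2 * EMBED_DIM);
fake[0] = 1;             // row 0 is the unit vector e0
fake[EMBED_DIM + 1] = 1; // row 1 is the unit vector e1
const rows = splitRows(fake, [2, EMBED_DIM]);
const norms = rows.map(l2Norm);
```

With a real session, pass sentence_embedding.data and sentence_embedding.dims instead of the synthetic buffer.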
Input constraints
Text (text_model*.onnx): inputs input_ids and attention_mask, both [batch, seq] with seq fully dynamic. Apply the asymmetric retrieval prefix convention before tokenizing: Query: … for queries, Document: … for corpus items.
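The prefix convention can be captured in a small helper so that queries and corpus items are never mixed up. This is a sketch: withRetrievalPrefix is a hypothetical name, and the exact prefix strings (a colon followed by one space) are inferred from the query example used in this card.

```javascript
// ASSUMPTION: prefixes are "Query: " and "Document: " with a single trailing space.
const RETRIEVAL_PREFIXES = { query: "Query: ", document: "Document: " };

// Prepend the asymmetric retrieval prefix before the text is tokenized.
function withRetrievalPrefix(text, kind) {
  const prefix = RETRIEVAL_PREFIXES[kind];
  if (prefix === undefined) {
    throw new Error(`kind must be "query" or "document", got ${kind}`);
  }
  return prefix + text.trim();
}

const queryText = withRetrievalPrefix("a saxophone solo", "query");
const docText = withRetrievalPrefix("  Live jazz recording, 1959.  ", "document");
```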
Vision (vision_model*.onnx): the graph is traced at a fixed image-patch layout. image_grid_thw is folded in as a constant (it drives torch.linspace inside Qwen3-VL's fast_pos_embed_interpolate, which dynamo cannot trace symbolically). Resize every image to 224×224 before passing it through LlavaEuroBertProcessor, which yields the exact shapes the graph expects:
input_ids [batch, 271] int64
attention_mask [batch, 271] int64
pixel_values [1024, 1536] float32
image_grid_thw is NOT an ONNX input (it's a constant). Don't pass it.
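Since the vision graph only accepts these exact shapes, it can help to validate the processor output before calling the session. The following is a sketch under the assumption that each input exposes a dims array (as ONNX Runtime tensors do); toVisionFeeds is a hypothetical helper, and the mock objects stand in for real tensors.

```javascript
// Expected dims for the vision graph; null marks the free batch dimension.
const VISION_SHAPES = {
  input_ids: [null, 271],
  attention_mask: [null, 271],
  pixel_values: [1024, 1536],
};

// Validate shapes and return only the three inputs the graph declares.
// Anything else (for example image_grid_thw) is dropped.
function toVisionFeeds(inputs) {
  const feeds = {};
  for (const [name, expected] of Object.entries(VISION_SHAPES)) {
    const tensor = inputs[name];
    const ok =
      tensor !== undefined &&
      tensor.dims.length === expected.length &&
      expected.every((e, i) => e === null || tensor.dims[i] === e);
    if (!ok) {
      const want = expected.map((e) => e ?? "batch").join(", ");
      const got = tensor ? tensor.dims.join(", ") : "missing";
      throw new Error(`${name}: expected [${want}], got [${got}]`);
    }
    feeds[name] = tensor;
  }
  return feeds;
}

// Mock processor output (dims only) standing in for real tensors.
const mockInputs = {
  input_ids: { dims: [1, 271] },
  attention_mask: { dims: [1, 271] },
  pixel_values: { dims: [1024, 1536] },
  image_grid_thw: { dims: [1, 3] },
};
const visionFeeds = toVisionFeeds(mockInputs);
```

A shape error at this point usually means the image was not resized to 224×224 before preprocessing.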
Audio (audio_model*.onnx): traced with a 5-second 16 kHz mono clip. The graph expects exactly:
input_ids [batch, 125] int64 (audio_token_id placeholders)
attention_mask [batch, 125] int64
input_features [1, 128, 3000] float32 (Whisper log-mel, 30s padded)
feature_attention_mask [1, 3000] int64 (frame-level, 500 ones for 5s of real audio)
Pad or truncate every clip to 5 s. Longer-clip chunking (sliding 5 s windows, average pooled embeddings) is a v2 follow-up.
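The pad-or-truncate step and the frame-level mask can be sketched as follows. This assumes the clip is already 16 kHz mono in a Float32Array, pads with zeros (silence), and always marks 500 frames as real to match the traced shapes; the helper names are hypothetical, and computing the log-mel features themselves is left to Whisper's feature extractor.

```javascript
const SAMPLE_RATE = 16000;                        // Hz, mono
const CLIP_SECONDS = 5;
const CLIP_SAMPLES = SAMPLE_RATE * CLIP_SECONDS;  // 80000 samples
const MEL_FRAMES = 3000;                          // Whisper's 30 s padded frame count
const REAL_FRAMES = 500;                          // frames covering 5 s of audio

// Force a clip to exactly 5 s: longer clips are cut, shorter ones zero-padded.
function padOrTruncate(samples) {
  const out = new Float32Array(CLIP_SAMPLES);     // zero-filled by default
  out.set(samples.subarray(0, CLIP_SAMPLES));
  return out;
}

// Build the [1, 3000] int64 mask data: 500 ones followed by zeros.
function buildFeatureAttentionMask() {
  const mask = new BigInt64Array(MEL_FRAMES);     // zero-filled by default
  mask.fill(1n, 0, REAL_FRAMES);
  return mask;
}

const shortClip = new Float32Array(SAMPLE_RATE).fill(0.5); // 1 s of dummy audio
const fixedClip = padOrTruncate(shortClip);
const frameMask = buildFeatureAttentionMask();
```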
Use with transformers.js v4
Text
import { AutoTokenizer } from "@huggingface/transformers";
import * as ort from "onnxruntime-web";
const REPO = "onnx-community/jina-embeddings-v5-omni-nano-ONNX";
const tok = await AutoTokenizer.from_pretrained(REPO);
const sess = await ort.InferenceSession.create(
  `https://huggingface.co/${REPO}/resolve/main/onnx/text_model_q4f16.onnx`,
  { executionProviders: ["webgpu", "wasm"] },
);
// Queries carry the "Query: " prefix; corpus items use "Document: ".
const { input_ids, attention_mask } = await tok(
  "Query: a saxophone solo",
  { return_tensors: "ort" },
);
const { sentence_embedding } = await sess.run({ input_ids, attention_mask });
// sentence_embedding.data: Float32Array of length 768, already L2-normalized.
Vision
import { AutoProcessor } from "@huggingface/transformers";
import * as ort from "onnxruntime-web";
// REPO is the repository id defined in the Text example.
const proc = await AutoProcessor.from_pretrained(REPO);
const sess = await ort.InferenceSession.create(
  `https://huggingface.co/${REPO}/resolve/main/onnx/vision_model_q4f16.onnx`,
  { executionProviders: ["webgpu", "wasm"] },
);
// imageBlob must already be resized to 224×224 (see Input constraints).
const inputs = await proc.apply_chat_template(
  [{ role: "user", content: [{ type: "image", image: imageBlob }] }],
  { add_generation_prompt: false, tokenize: true, return_dict: true, return_tensors: "ort" },
);
// image_grid_thw is a graph constant, so only these three inputs are passed.
const { sentence_embedding } = await sess.run({
  input_ids: inputs.input_ids,
  attention_mask: inputs.attention_mask,
  pixel_values: inputs.pixel_values,
});
Audio
Audio preprocessing isn't bundled in LlavaEuroBertProcessor; use Whisper's feature extractor directly and stamp audio_token_id placeholders into input_ids. See the reference implementation for the exact mel-spec → placeholder-count plumbing.
Cross-modal retrieval
All three towers project into the same 768-dim space, so a text query can rank images / audio / video corpus items (and vice versa) without re-indexing. Embeddings are L2-normalized, so cosine similarity is a dot product:
const score = textVec.reduce((s, v, i) => s + v * imageVec[i], 0);
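Extending the one-liner above, a small ranking helper scores a query embedding against a mixed-modality corpus and returns the best matches. This is a brute-force sketch; dotProduct and rankByDot are hypothetical names, and the 3-dimensional vectors are toy stand-ins for real 768-dimensional embeddings.

```javascript
// Dot product of two equal-length vectors (equals cosine for unit vectors).
function dotProduct(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Score every corpus item against the query and keep the k best.
function rankByDot(queryVec, corpus, k = 5) {
  return corpus
    .map((item) => ({ id: item.id, score: dotProduct(queryVec, item.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy unit vectors; real embeddings come from the three ONNX graphs.
const toyQuery = new Float32Array([1, 0, 0]);
const toyCorpus = [
  { id: "image-a", vec: new Float32Array([0.6, 0.8, 0]) },
  { id: "audio-b", vec: new Float32Array([1, 0, 0]) },
  { id: "video-c", vec: new Float32Array([0, 1, 0]) },
];
const top2 = rankByDot(toyQuery, toyCorpus, 2);
// top2 ids, best first: "audio-b", then "image-a"
```

For large corpora a vector index would replace the linear scan, but the scoring function stays the same.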
How these were exported
- torch 2.11, transformers 5.8, onnx 1.21, onnxruntime 1.26
- Text: torch.onnx.export(..., dynamo=True). The LlamaModel-based encoder exports cleanly through the dynamo path
- Vision and audio: torch.onnx.export(..., dynamo=False) (legacy TorchScript tracer). Dynamo refuses to specialize Qwen3-VL's data-dependent torch.linspace, and the GQA-aware SDPA is monkey-patched with a manual MatMul+Softmax for the trace's duration
- PEFT LoRA fused via merge_and_unload(safe_merge=True) so the retrieval task adapter is baked in
- fp16 cast via onnxruntime.transformers.float16.convert_float_to_float16 (the onnxconverter_common path mishandles dynamo's _to_copy nodes)
- 4-bit quant via onnxruntime.quantization.matmul_nbits_quantizer.MatMulNBitsQuantizer(bits=4, block_size=32, accuracy_level=4)
- Two graph post-patches required for the audio tower's ORT-loadability: Cast(to=int64) inserted before every Slice index input (366 inserts), and Unsqueeze(0)/Squeeze(0) wrapped around the rank-2-input AveragePool (1 site)
License
Inherited from the base model: CC BY-NC 4.0. Commercial use requires reaching out to sales@jina.ai.
Citation
@misc{jina-embeddings-v5-omni-nano,
author = {Jina AI},
title = {jina-embeddings-v5-omni-nano},
year = {2025},
url = {https://huggingface.co/jinaai/jina-embeddings-v5-omni-nano}
}