jina-embeddings-v5-omni-nano · ONNX

ONNX exports of jinaai/jina-embeddings-v5-omni-nano for in-browser inference via transformers.js v4 with WebGPU.

jina-embeddings-v5-omni-nano is a ~1.04 B-parameter multimodal embedding model that maps text, images, audio, and video into a single shared 768-dimensional L2-normalized space, enabling cross-modal retrieval without reindexing.

Architecture

The base model is a LLaVA-style composition of three frozen encoders plus small trainable projectors, with per-task LoRA adapters. This repo's exports have the retrieval task adapter merged into the static weights.

Tower    Backbone                                                Role
Text     EuroBERT-210m (loaded as a bidirectional Llama)         Text encoder
Vision   Qwen3-VL vision tower + spatial merger                  Image and video frames
Audio    Whisper-large-v3 encoder + Qwen2.5-Omni audio adapter   Audio (16 kHz mono)

Each tower is bundled with its retrieval projector AND a copy of the language model (which performs the final cross-modal fusion + last-token pooling), so the three ONNX graphs are independently loadable.

Files

Three ONNX graphs, three quantization variants each. Every variant ships as a small .onnx schema + a large .onnx.data or .onnx_data external-weight sidecar (HF Hub-friendly layout, no 2 GB protobuf limits).

Modality   fp32      fp16      q4f16     Verified parity vs fp32 (fp16)
Text       849 MB    424 MB    263 MB    cos = 1.000000
Vision     1247 MB   622 MB    460 MB    cos = 0.999998
Audio      3400 MB   1700 MB   1465 MB   (fp16 doesn't load on CPU EP; verify on WebGPU)

For desktop WebGPU, use q4f16: it's the smallest and runs natively on shader-int4 hardware. Use fp16 if you need higher numerical fidelity or your GPU lacks int4 paths.

Every graph outputs a single tensor sentence_embedding of shape [batch, 768], already L2-normalized, so cosine similarity reduces to a dot product.
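Because each variant keeps its weights in a sidecar file, a session created from a URL generally needs to be told where that sidecar lives. The following is a minimal sketch, assuming onnxruntime-web's `externalData` session option. The helper name `sessionOptionsFor` is hypothetical, and the sidecar file name used here is an assumption: check the repo's onnx/ folder for the actual `.onnx.data` or `.onnx_data` name.

```javascript
// Hypothetical helper: builds a model URL plus InferenceSession options for a
// graph whose weights live in an external sidecar file next to the .onnx file.
function sessionOptionsFor(repo, graphFile, sidecarFile) {
  const base = `https://huggingface.co/${repo}/resolve/main/onnx`;
  return {
    modelUrl: `${base}/${graphFile}`,
    options: {
      executionProviders: ["webgpu", "wasm"],
      // `path` should match the sidecar location recorded inside the .onnx
      // file; `data` is the URL the bytes are fetched from.
      externalData: [{ path: sidecarFile, data: `${base}/${sidecarFile}` }],
    },
  };
}

// The sidecar name below is assumed for illustration only.
const textSession = sessionOptionsFor(
  "shreyask/jina-embeddings-v5-omni-nano-ONNX",
  "text_model_q4f16.onnx",
  "text_model_q4f16.onnx_data",
);
// In the browser:
// await ort.InferenceSession.create(textSession.modelUrl, textSession.options);
```

Keeping the URL construction in one place makes it easy to switch between the fp32, fp16, and q4f16 variants without touching the inference code.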

Input constraints

Text (text_model*.onnx): inputs input_ids and attention_mask, both [batch, seq] with seq fully dynamic. Apply the asymmetric retrieval prefix convention before tokenizing: Query: … for queries, Document: … for corpus items.
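The prefix convention is easy to get wrong when queries and corpus items flow through the same code path. A small sketch of one way to enforce it follows; `withRetrievalPrefix` is a hypothetical helper, not part of transformers.js or this repo, and trimming whitespace before prefixing is an assumption rather than a documented requirement.

```javascript
// Hypothetical helper: prepends the retrieval prefix expected by the text tower.
function withRetrievalPrefix(text, role) {
  if (role !== "query" && role !== "document") {
    throw new Error(`role must be "query" or "document", got "${role}"`);
  }
  const prefix = role === "query" ? "Query: " : "Document: ";
  return prefix + text.trim();
}

const queryText = withRetrievalPrefix("a saxophone solo", "query");
const docText = withRetrievalPrefix("Live jazz quartet recording", "document");
console.log(queryText); // Query: a saxophone solo
console.log(docText);   // Document: Live jazz quartet recording
```

The returned string is what gets passed to the tokenizer.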

Vision (vision_model*.onnx): the graph is traced at a fixed image-patch layout. image_grid_thw is folded as a constant (it drives torch.linspace inside Qwen3-VL's fast_pos_embed_interpolate, which dynamo cannot symbolicate). Resize every image to 224×224 before passing through LlavaEuroBertProcessor; that yields the exact shapes the graph expects:

input_ids        [batch, 271]    int64
attention_mask   [batch, 271]    int64
pixel_values     [1024, 1536]    float32

image_grid_thw is NOT an ONNX input (it's a constant). Don't pass it.
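Since the vision graph accepts only this one layout, a shape check before `sess.run` can turn an opaque runtime error into a readable message. The sketch below assumes the feeds are tensor-like objects exposing a `dims` array (as onnxruntime-web tensors do); `VISION_SHAPES` and `checkShapes` are hypothetical names.

```javascript
// Expected dims for the fixed-layout vision graph; null marks the batch axis.
const VISION_SHAPES = {
  input_ids: [null, 271],
  attention_mask: [null, 271],
  pixel_values: [1024, 1536],
};

// Returns a list of human-readable mismatches; empty when everything fits.
function checkShapes(feeds, expected) {
  const problems = [];
  for (const [name, want] of Object.entries(expected)) {
    const tensor = feeds[name];
    if (!tensor) {
      problems.push(`${name}: missing`);
      continue;
    }
    const got = Array.from(tensor.dims, Number);
    const ok =
      got.length === want.length &&
      want.every((d, i) => d === null || d === got[i]);
    if (!ok) {
      problems.push(`${name}: got [${got}], want [${want.map((d) => d ?? "batch")}]`);
    }
  }
  // Flag extras such as image_grid_thw, which is a constant inside the graph.
  for (const name of Object.keys(feeds)) {
    if (!(name in expected)) problems.push(`${name}: not a graph input`);
  }
  return problems;
}

// Stand-in objects with only `dims`, mimicking an image that was not resized.
const problems = checkShapes(
  {
    input_ids: { dims: [1, 271] },
    attention_mask: { dims: [1, 271] },
    pixel_values: { dims: [196, 1536] },
    image_grid_thw: { dims: [1, 3] },
  },
  VISION_SHAPES,
);
console.log(problems);
```

Here the check reports two problems: the wrong pixel_values shape and the stray image_grid_thw entry.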

Audio (audio_model*.onnx): traced with a 5-second 16 kHz mono clip. The graph expects exactly:

input_ids                [batch, 125]      int64   (audio_token_id placeholders)
attention_mask           [batch, 125]      int64
input_features           [1, 128, 3000]    float32 (Whisper log-mel, 30s padded)
feature_attention_mask   [1, 3000]         int64   (frame-level, 500 ones for 5s of real audio)

Pad or truncate every clip to 5 s. Longer-clip chunking (sliding 5 s windows, average pooled embeddings) is a v2 follow-up.
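The pad-or-truncate step and the frame-level mask can be sketched as follows. This assumes raw PCM arrives as a Float32Array at 16 kHz mono, and that Whisper's feature extractor produces 100 mel frames per second (consistent with the 500-ones mask above). `fitClip` and `frameMask` are hypothetical names, and the log-mel computation itself is left to Whisper's feature extractor.

```javascript
const SAMPLE_RATE = 16000;     // Hz, mono
const CLIP_SECONDS = 5;        // length the audio graph was traced with
const MEL_FRAMES = 3000;       // Whisper's 30 s padded window
const FRAMES_PER_SECOND = 100; // assumption: 10 ms hop

// Zero-pads or truncates raw PCM to exactly 5 s (80,000 samples).
function fitClip(samples) {
  const out = new Float32Array(SAMPLE_RATE * CLIP_SECONDS);
  out.set(samples.subarray(0, out.length));
  return out;
}

// Frame-level mask: ones over the 5 s clip, zeros over the 30 s padding.
function frameMask() {
  const mask = new BigInt64Array(MEL_FRAMES);
  mask.fill(1n, 0, CLIP_SECONDS * FRAMES_PER_SECOND);
  return mask;
}

const shortClip = fitClip(new Float32Array(16000).fill(0.5)); // 1 s in
const longClip = fitClip(new Float32Array(160000));           // 10 s in
console.log(shortClip.length, longClip.length); // 80000 80000
```

The mask from `frameMask()` is what would back the feature_attention_mask tensor of shape [1, 3000].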

Use with transformers.js v4

Text

import { AutoTokenizer } from "@huggingface/transformers";
import * as ort from "onnxruntime-web";

const REPO = "shreyask/jina-embeddings-v5-omni-nano-ONNX";

const tok = await AutoTokenizer.from_pretrained(REPO);
const sess = await ort.InferenceSession.create(
  `https://huggingface.co/${REPO}/resolve/main/onnx/text_model_q4f16.onnx`,
  { executionProviders: ["webgpu", "wasm"] },
);

const { input_ids, attention_mask } = await tok(
  "Query: a saxophone solo",
  { return_tensors: "ort" },
);
const { sentence_embedding } = await sess.run({ input_ids, attention_mask });
// sentence_embedding.data is a Float32Array of length 768 (batch of 1), already L2-normalized.

Vision

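import { AutoProcessor } from "@huggingface/transformers";
import * as ort from "onnxruntime-web";

// Same repo as in the text example; repeated so this snippet runs on its own.
const REPO = "shreyask/jina-embeddings-v5-omni-nano-ONNX";

const proc = await AutoProcessor.from_pretrained(REPO);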
const sess = await ort.InferenceSession.create(
  `https://huggingface.co/${REPO}/resolve/main/onnx/vision_model_q4f16.onnx`,
  { executionProviders: ["webgpu", "wasm"] },
);

// Resize to 224x224 before passing in.
const inputs = await proc.apply_chat_template(
  [{ role: "user", content: [{ type: "image", image: imageBlob }] }],
  { add_generation_prompt: false, tokenize: true, return_dict: true, return_tensors: "ort" },
);
const { sentence_embedding } = await sess.run({
  input_ids: inputs.input_ids,
  attention_mask: inputs.attention_mask,
  pixel_values: inputs.pixel_values,
});

Audio

Audio preprocessing isn't bundled in LlavaEuroBertProcessor; use Whisper's feature extractor directly and stamp audio_token_id placeholders into input_ids. See the reference implementation for the exact mel-spec → placeholder count plumbing.

Cross-modal retrieval

All three towers project into the same 768-dim space, so a text query can rank images / audio / video corpus items (and vice versa) without re-indexing. Embeddings are L2-normalized, so cosine similarity is a dot product:

const score = textVec.reduce((s, v, i) => s + v * imageVec[i], 0);
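Extending the one-liner to a whole corpus, a brute-force ranking might look like the sketch below. `dot` and `topK` are hypothetical helpers, the `{ id, vec }` corpus layout is an assumption, and the toy 3-dimensional unit vectors stand in for real 768-dimensional embeddings.

```javascript
// Dot product of two equal-length embeddings
// (equals cosine similarity when both are unit-norm).
function dot(a, b) {
  if (a.length !== b.length) throw new Error("embedding sizes differ");
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

// Returns the k best corpus entries for a query, highest score first.
function topK(queryVec, corpus, k = 5) {
  return corpus
    .map(({ id, vec }) => ({ id, score: dot(queryVec, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Mixed-modality corpus: every item lives in the same embedding space.
const hits = topK(
  Float32Array.of(1, 0, 0),
  [
    { id: "sax.wav", vec: Float32Array.of(0.8, 0.6, 0) },
    { id: "cat.jpg", vec: Float32Array.of(0, 1, 0) },
    { id: "jazz.mp4", vec: Float32Array.of(0.6, 0, 0.8) },
  ],
  2,
);
console.log(hits.map((h) => h.id)); // [ 'sax.wav', 'jazz.mp4' ]
```

A linear scan like this is fine for small in-browser corpora; larger collections would normally sit behind an approximate nearest-neighbor index.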

How these were exported

  • torch 2.11, transformers 5.8, onnx 1.21, onnxruntime 1.26
  • Text: torch.onnx.export(..., dynamo=True). The LlamaModel-based encoder exports cleanly through the dynamo path
  • Vision and audio: torch.onnx.export(..., dynamo=False) (legacy TorchScript tracer). Dynamo refuses to specialize Qwen3-VL's data-dependent torch.linspace, and the GQA-aware SDPA is monkey-patched with a manual MatMul+Softmax for the trace's duration
  • PEFT LoRA fused via merge_and_unload(safe_merge=True) so the retrieval task adapter is baked in
  • fp16 cast via onnxruntime.transformers.float16.convert_float_to_float16 (the onnxconverter_common path mishandles dynamo's _to_copy nodes)
  • 4-bit quant via onnxruntime.quantization.matmul_nbits_quantizer.MatMulNBitsQuantizer (bits=4, block_size=32, accuracy_level=4)
  • Two graph post-patches required for the audio tower's ORT-loadability: Cast(to=int64) inserted before every Slice index input (366 inserts), and Unsqueeze(0)/Squeeze(0) wrapped around the rank-2-input AveragePool (1 site)

License

Inherited from the base model: CC BY-NC 4.0. Commercial use requires reaching out to sales@jina.ai.

Citation

@misc{jina-embeddings-v5-omni-nano,
  author = {Jina AI},
  title  = {jina-embeddings-v5-omni-nano},
  year   = {2025},
  url    = {https://huggingface.co/jinaai/jina-embeddings-v5-omni-nano}
}