jina-embeddings-v5-omni-nano on AXERA NPU

Ready-to-run AX650 multi-task embedding package for jinaai/jina-embeddings-v5-omni-nano.

This package uses one shared LLM base graph, one shared media tower per modality, and task-specific adapter/mapper patches. The validated tasks are:

Task ID	Intended use	Status
`retrieval`	Query/document embeddings for retrieval and similarity search	Default, validated
`clustering`	Embeddings for clustering workloads	Validated

The upstream model also contains classification and text-matching adapters. This package revision does not include validated AX650 artifacts for those two adapters.

The service exposes an OpenAI-compatible /v1/embeddings endpoint for text, image, 8-second audio, and frame-directory video inputs. The output embedding shape is always [1, 768]. Runtime inference does not require the original Hugging Face safetensors files.

Supported Platform

AX650 / NPU3
Board runtime with AXEngine shared-weight adapter patch support.
Required AXEngine symbols: AX_ENGINE_CreateHandleV4 and AX_ENGINE_PatchHandleWBT.

Because the task adapters are applied as patches on top of the shared LLM base graph, this package cannot run with an older /soc/lib/libax_engine.so that does not export the shared-weight patch APIs.

Download

Run on a Linux host with the Hugging Face CLI installed:

mkdir -p AXERA-TECH/jina-embeddings-v5-omni-nano
cd AXERA-TECH/jina-embeddings-v5-omni-nano
hf download AXERA-TECH/jina-embeddings-v5-omni-nano --local-dir .

Package Layout

.
├── README.md
├── config.json
├── bin/axllm
├── runtime/
│   └── lib/
│       └── libax_engine.so
├── delta/
│   ├── llama_p128_l0_together.base_weight.bin
│   ├── llama_p128_l0_together.adapter_delta.bin
│   ├── llama_p128_l0_together.adapter_delta.json
│   └── ...
├── delta_clustering/
│   ├── llama_p128_l0_together.adapter_delta.bin
│   ├── llama_p128_l0_together.adapter_delta.json
│   └── ...
├── llama_p128_l0_together.axmodel
├── ...
├── llama_p128_l11_together.axmodel
├── llama_post.axmodel
├── model.embed_tokens.weight.bfloat16.bin
├── jina_v5_omni_tokenizer/
├── jina_v5_omni_tokenizer.txt
├── jina_v5_omni_nano_vision_tower_256x256.axmodel
├── jina_v5_omni_nano_vision_merger_retrieval_256x256.axmodel
├── jina_v5_omni_nano_vision_merger_clustering_256x256.axmodel
├── jina_v5_omni_nano_audio_tower_8s.axmodel
├── jina_v5_omni_nano_audio_projector_retrieval_8s.axmodel
├── jina_v5_omni_nano_audio_projector_clustering_8s.axmodel
├── python/
└── assets/

The package contains one LLM graph set at the package root. Task-specific LLM differences are stored as adapter delta files under delta/ and delta_clustering/.

The vision and audio towers are shared by retrieval and clustering. The small merger/projector AXModels are task-specific.

Start the Service

Run on the AX650 board from the package root. The start script takes the service port as its only argument:

chmod +x ./start_axllm.sh
./start_axllm.sh 8000

start_axllm.sh sets the required runtime environment automatically:

AXENGINE_SHARED_WEIGHT_LIB_DIR=$(pwd)/runtime/lib
LD_LIBRARY_PATH=$(pwd)/runtime/lib:/soc/lib:...
AXLLM_RELEASE_AXMODEL_BUFFER_AFTER_INIT=1

It also checks that the active libax_engine.so exports the required shared-weight symbols before starting the service:

AX_ENGINE_CreateHandleV4
AX_ENGINE_PatchHandleWBT

This package includes the compatible shared-weight AXEngine runtime at runtime/lib/libax_engine.so. Do not overwrite /soc/lib just for this package. The package-local LD_LIBRARY_PATH method is process-local and is the recommended deployment method when the board image has not yet integrated the shared-weight AXEngine runtime.
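
The symbol check described above can be reproduced by hand before starting the service. The sketch below is not the logic inside start_axllm.sh (which is not shown here); it assumes GNU nm is available on the board or host, and the helper names missing_symbols and check_library are hypothetical.

```python
import subprocess

# Symbols this package requires from libax_engine.so (taken from the text above).
REQUIRED_SYMBOLS = ("AX_ENGINE_CreateHandleV4", "AX_ENGINE_PatchHandleWBT")


def missing_symbols(nm_output, required=REQUIRED_SYMBOLS):
    """Return the required symbols that do not appear in `nm -D` style output."""
    exported = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if fields:
            # The symbol name is the last column; drop any @VERSION suffix.
            exported.add(fields[-1].split("@")[0])
    return [name for name in required if name not in exported]


def check_library(path="runtime/lib/libax_engine.so"):
    """Run GNU nm on the library and report which required symbols are missing."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", path],
        check=True, capture_output=True, text=True,
    )
    return missing_symbols(result.stdout)
```

An empty list from check_library() means both shared-weight symbols are exported by the package-local runtime.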

Health checks:

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models

Expected model id:

AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047
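
The same check can be scripted. This is a minimal sketch that assumes /v1/models returns an OpenAI-style list object with a data array of entries carrying an id field; that response shape is an assumption, not something confirmed by this README. The function names are hypothetical.

```python
import json
import urllib.request

EXPECTED_MODEL_ID = "AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047"


def model_ids(models_payload):
    """Extract model ids from an OpenAI-style /v1/models payload (assumed shape)."""
    return [entry["id"] for entry in models_payload.get("data", []) if "id" in entry]


def fetch_model_ids(base_url="http://127.0.0.1:8000"):
    """Query the running service and return the model ids it advertises."""
    with urllib.request.urlopen(base_url + "/v1/models", timeout=10) as resp:
        return model_ids(json.load(resp))


def service_has_expected_model(base_url="http://127.0.0.1:8000"):
    """True when the expected model id is listed by the service."""
    return EXPECTED_MODEL_ID in fetch_model_ids(base_url)
```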

OpenAI-Compatible Examples

Use task_id to select the adapter. If omitted, the service uses the default retrieval task.

The active adapter is process-global. Do not send mixed task_id requests concurrently to the same service process. Use serial task switching, or run separate service processes if your application needs concurrent retrieval and clustering traffic.

Run the example commands below serially. The default packaged service is validated with max_concurrency = 1.
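
One way to honor the serial requirement from a multi-threaded client is to funnel every request through a single lock. The sketch below is an illustration only: SerialEmbeddingClient is a hypothetical wrapper, and send stands for whatever function your application uses to POST a request body to /v1/embeddings. It serializes requests within one client process; it does not coordinate several independent clients.

```python
import threading


class SerialEmbeddingClient:
    """Wrap a request function so that only one request is in flight at a time."""

    def __init__(self, send):
        # `send` is any callable that takes a request dict and returns the reply.
        self._send = send
        self._lock = threading.Lock()

    def embed(self, payload):
        # Holding the lock for the whole round trip keeps mixed task_id
        # requests from overlapping on the process-global adapter.
        with self._lock:
            return self._send(payload)
```

If the application needs retrieval and clustering traffic at the same time, separate service processes remain the option described above.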

Task switching is request-driven. Start axllm serve once, then set one of these fields in each /v1/embeddings request body:

task_id: recommended field.
adapter_task: accepted alias.
task: accepted alias.

Supported values in this package are retrieval and clustering. The runtime switches the active adapter before executing the request. The next request may use a different task_id; no service restart is required.

For text-only embedding requests, task switching patches only the LLM adapter. For image, audio, or video requests, the runtime also switches the task-specific media mapper/projector while reusing the shared media tower.

Minimal retrieval request:

curl http://127.0.0.1:8000/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047",
    "task_id": "retrieval",
    "prompt_name": "query",
    "input": "Which planet is known as the Red Planet?",
    "encoding_format": "float"
  }'

Minimal clustering request:

curl http://127.0.0.1:8000/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047",
    "task_id": "clustering",
    "prompt_name": "document",
    "input": "Text embeddings convert sentences into dense vectors for clustering.",
    "encoding_format": "float"
  }'
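
The two curl requests above can also be issued from Python with only the standard library. This sketch mirrors the request fields shown in the curl bodies; reading the vector from data[0].embedding assumes an OpenAI-style response, which is not spelled out in this README. build_payload and embed are hypothetical helper names, not part of the packaged python/ scripts.

```python
import json
import urllib.request

MODEL_ID = "AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047"
SUPPORTED_TASKS = ("retrieval", "clustering")


def build_payload(text, task_id="retrieval", prompt_name="query"):
    """Build the same JSON body as the curl examples."""
    if task_id not in SUPPORTED_TASKS:
        raise ValueError("unsupported task_id for this package: %r" % (task_id,))
    return {
        "model": MODEL_ID,
        "task_id": task_id,
        "prompt_name": prompt_name,
        "input": text,
        "encoding_format": "float",
    }


def embed(text, task_id="retrieval", prompt_name="query",
          api_url="http://127.0.0.1:8000/v1"):
    """POST one text input and return its embedding as a list of floats."""
    body = json.dumps(build_payload(text, task_id, prompt_name)).encode("utf-8")
    request = urllib.request.Request(
        api_url + "/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        reply = json.load(resp)
    # Assumed OpenAI-style layout: {"data": [{"embedding": [...]}]}
    return reply["data"][0]["embedding"]
```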

Text:

python3 python/openai_embedding_demo.py \
  --api-url http://127.0.0.1:8000/v1 \
  --model AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047 \
  --task-id retrieval \
  --prompt-name query \
  --input "Which planet is known as the Red Planet?"

Image:

python3 python/openai_multimodal_embedding_demo.py \
  --api-url http://127.0.0.1:8000/v1 \
  --model AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047 \
  --task-id retrieval \
  --prompt-name query \
  --media-type image \
  --media-path assets/sample.png

Audio:

python3 python/openai_multimodal_embedding_demo.py \
  --api-url http://127.0.0.1:8000/v1 \
  --model AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047 \
  --task-id retrieval \
  --prompt-name query \
  --media-type audio \
  --media-path assets/audio_test_chunk0_8s.wav

Video:

python3 python/openai_multimodal_embedding_demo.py \
  --api-url http://127.0.0.1:8000/v1 \
  --model AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047 \
  --task-id retrieval \
  --prompt-name query \
  --media-type video \
  --media-path assets/red-panda-openai.frames

Clustering:

python3 python/openai_embedding_demo.py \
  --api-url http://127.0.0.1:8000/v1 \
  --model AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047 \
  --task-id clustering \
  --prompt-name document \
  --input "Text embeddings convert sentences into dense vectors for retrieval, clustering, and similarity search."

The validated video path is a directory of pre-extracted frames. If you want to use a video file, extract frames first and pass the frame directory to the API.

Image Retrieval and Clustering Evaluation

The model returns embeddings. Retrieval and clustering are client-side operations built from those embeddings.
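
As an illustration of what "client-side" means here, the sketch below ranks stored embeddings by cosine similarity to a query embedding. It is a generic nearest-neighbor routine, not the code used by the packaged evaluation script; cosine and top_k are hypothetical names, and the vectors are plain Python lists such as the 768 floats returned by the API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k=3):
    """Return (index, score) pairs for the k corpus vectors closest to query."""
    scored = sorted(
        ((cosine(query, vec), idx) for idx, vec in enumerate(corpus)),
        reverse=True,
    )
    return [(idx, score) for score, idx in scored[:k]]
```

For retrieval, the query would be embedded with prompt_name=query and the corpus items with prompt_name=document, as described under Input Notes.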

The evaluation script accepts either:

one subdirectory per class, or
a flat demo directory with label-prefixed filenames such as cat_0.jpg and dog_1.jpg

Example directory layouts:

animal_test/
├── cat/
│   ├── cat_001.jpg
│   └── cat_002.jpg
├── dog/
│   ├── dog_001.jpg
│   └── dog_002.jpg
└── rabbit/
    ├── rabbit_001.jpg
    └── rabbit_002.jpg

animals_imgs/
├── cat_0.jpeg
├── cat_1.jpeg
├── dog_0.jpeg
└── ...
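
Both layouts carry the class label in the path. The sketch below shows one plausible way to recover it; how evaluate_image_retrieval_clustering.py actually parses labels is not documented here, so label_for is a hypothetical helper and the rule "text before the last underscore" is an assumption based on the example filenames.

```python
from pathlib import PurePosixPath


def label_for(image_path, dataset_dir):
    """Derive a class label from either supported dataset layout."""
    rel = PurePosixPath(image_path).relative_to(PurePosixPath(dataset_dir))
    if len(rel.parts) > 1:
        # Layout 1: one subdirectory per class, e.g. animal_test/cat/cat_001.jpg
        return rel.parts[0]
    # Layout 2: flat directory with label-prefixed names, e.g. dog_1.jpg
    return rel.stem.rsplit("_", 1)[0]
```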

Run after starting axllm serve:

python3 python/evaluate_image_retrieval_clustering.py \
  --api-url http://127.0.0.1:8000/v1 \
  --dataset-dir animals_imgs \
  --output-json tmp/animals_imgs_eval.json

The script evaluates:

Image nearest-neighbor retrieval with task_id=retrieval.
Text-to-image retrieval with queries such as a photo of a cat.
K-means cluster purity with task_id=clustering.

Use at least two images per class. More images per class make the result more meaningful.
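
Cluster purity, the third metric above, can be computed from the true labels and the cluster assignments alone. The sketch below uses the standard definition (the share of samples that belong to the majority class of their cluster); whether the packaged script uses exactly this formula is an assumption, and cluster_purity is a hypothetical name.

```python
from collections import Counter, defaultdict


def cluster_purity(labels, cluster_ids):
    """Fraction of samples whose label is the majority label of their cluster."""
    if not labels or len(labels) != len(cluster_ids):
        raise ValueError("labels and cluster_ids must be non-empty and equal length")
    members = defaultdict(list)
    for label, cluster in zip(labels, cluster_ids):
        members[cluster].append(label)
    # For each cluster, count how many samples carry its most common label.
    majority_total = sum(
        Counter(group).most_common(1)[0][1] for group in members.values()
    )
    return majority_total / len(labels)
```

With a single image per class, every one-image cluster is trivially pure, which is one reason the metric needs several images per class to be informative.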

Board Precision

The tables below compare board-side axllm serve embeddings with the packaged Hugging Face reference embeddings under python/testdata/service_cases/<task_id>/*/torch_embedding.npy.

Run retrieval validation after starting the service:

python3 python/compare_openai_api_vs_hf_multimodal.py \
  --api-url http://127.0.0.1:8000/v1 \
  --api-package-root . \
  --task-id retrieval

Run clustering validation:

python3 python/compare_openai_api_vs_hf_multimodal.py \
  --api-url http://127.0.0.1:8000/v1 \
  --api-package-root . \
  --task-id clustering

The board-side comparison script does not run the original Hugging Face model. It compares the API output with the packaged HF reference embeddings included in this repository.
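
The three metrics reported in the result tables (cosine, mean absolute difference, max absolute difference) are simple to recompute for any pair of vectors. The sketch below works on plain Python lists and is not the code of compare_openai_api_vs_hf_multimodal.py; compare_embeddings is a hypothetical name.

```python
import math


def compare_embeddings(api_vec, ref_vec):
    """Return cosine similarity and absolute-difference statistics."""
    if not api_vec or len(api_vec) != len(ref_vec):
        raise ValueError("vectors must be non-empty and the same length")
    dot = sum(a * b for a, b in zip(api_vec, ref_vec))
    norm_api = math.sqrt(sum(a * a for a in api_vec))
    norm_ref = math.sqrt(sum(b * b for b in ref_vec))
    diffs = [abs(a - b) for a, b in zip(api_vec, ref_vec)]
    return {
        "cosine": dot / (norm_api * norm_ref),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "max_abs_diff": max(diffs),
    }
```

To compare against a packaged reference, the torch_embedding.npy file would first be loaded (for example with NumPy) and flattened from [1, 768] to 768 values.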

Validated retrieval results:

Modality	Case	Output shape	Soft tokens	Cosine vs HF	Mean abs diff	Max abs diff
Text document	`embedding_doc`	`[1, 768]`	`-`	`0.999655`	`0.000760`	`0.003988`
Text query	`red_planet_query`	`[1, 768]`	`-`	`0.999539`	`0.000835`	`0.009185`
Image	`vision_sample`	`[1, 768]`	`64`	`0.996820`	`0.002271`	`0.012364`
Audio 8s	`audio_test_chunk0_8s_wav`	`[1, 768]`	`200`	`0.992794`	`0.003383`	`0.015184`
Video 3 frames	`video_visual_red_panda_openai_mp4`	`[1, 768]`	`192`	`0.996833`	`0.002273`	`0.015725`

Validated clustering results:

Modality	Case	Output shape	Soft tokens	Cosine vs HF	Mean abs diff	Max abs diff
Text document	`embedding_doc`	`[1, 768]`	`-`	`0.999692`	`0.000719`	`0.002805`
Text query	`red_planet_query`	`[1, 768]`	`-`	`0.999579`	`0.000842`	`0.003600`
Image	`vision_sample`	`[1, 768]`	`64`	`0.998207`	`0.001708`	`0.009309`
Audio 8s	`audio_test_chunk0_8s_wav`	`[1, 768]`	`200`	`0.998418`	`0.001614`	`0.006731`
Video 3 frames	`video_visual_red_panda_openai_mp4`	`[1, 768]`	`192`	`0.988590`	`0.004383`	`0.016320`

The clustering video case is the lowest-precision validated case in this package. Text, image, audio, and retrieval video are closer to the packaged HF references.

Performance

This model returns embeddings and does not run token-by-token decoding. The useful runtime metric is media preparation plus LLM prefill.

The default config.json uses a shared 256x256 vision tower, shared 8s audio tower, and task-specific retrieval/clustering mapper/projector AXModels. The audio examples use the shipped assets/audio_test_chunk0_8s.wav, which is 16kHz mono PCM WAV. This package validates and recommends 16kHz mono PCM WAV for audio input.

Split media AXModels in the default package:

AXModel	Role	Output tokens	File size
`jina_v5_omni_nano_vision_tower_256x256.axmodel`	shared vision tower	raw `256`	`96,659,343 bytes`
`jina_v5_omni_nano_vision_merger_retrieval_256x256.axmodel`	retrieval vision mapper	`64`	`12,767,060 bytes`
`jina_v5_omni_nano_vision_merger_clustering_256x256.axmodel`	clustering vision mapper	`64`	`12,767,092 bytes`
`jina_v5_omni_nano_audio_tower_8s.axmodel`	shared audio tower	raw `200`	`702,128,779 bytes`
`jina_v5_omni_nano_audio_projector_retrieval_8s.axmodel`	retrieval audio projector	`200`	`1,093,363 bytes`
`jina_v5_omni_nano_audio_projector_clustering_8s.axmodel`	clustering audio projector	`200`	`1,093,363 bytes`

Representative retrieval latency from the AX650 OpenAI API path:

Scenario	Prompt	LLM input tokens	Soft tokens	Output shape	Media prepare	LLM prefill	Runtime total
Text document	`document`	`19`	`-`	`[1, 768]`	`-`	`120.13 ms`	`121.30 ms`
Text query	`query`	`12`	`-`	`[1, 768]`	`-`	`93.95 ms`	`94.34 ms`
Image	`query`	`85`	`64`	`[1, 768]`	typically `~52-181 ms` after warmup	typically `~70 ms`	typically `~0.12-0.35 s` after warmup
Audio 8s	`query`	`223`	`200`	`[1, 768]`	included in total	included in total	typically `~0.8-1.0 s` after warmup
Video 3 frames	`query`	`213`	`192`	`[1, 768]`	included in total	included in total	typically `~0.35 s` after warmup

Task switching latency measured with the packaged shared-base runtime:

Switch case	Typical latency	Notes
Text `retrieval` -> `clustering`	about `1-2 s`	LLM adapter patch only
Image `retrieval` -> `clustering`	about `1.5 s`	Reuses shared vision tower and switches the small mapper
Same-task image request after mapper is active	typically `~0.13-0.47 s`	No adapter or tower reload
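
Because a switch costs on the order of a second while a same-task request costs a fraction of that, a client with a mixed batch may want to group requests by task before sending them. The sketch below is one simple way to do that; count_switches and group_by_task are hypothetical helpers, and requests without a task_id are treated as retrieval, matching the service default.

```python
DEFAULT_TASK = "retrieval"


def count_switches(task_ids, active=DEFAULT_TASK):
    """Count how many adapter switches a sequence of task ids would trigger."""
    switches = 0
    for task in task_ids:
        if task != active:
            switches += 1
            active = task
    return switches


def group_by_task(requests, active=DEFAULT_TASK):
    """Reorder requests so those for the currently active task run first.

    sorted() is stable, so the original order is kept within each task.
    """
    return sorted(requests, key=lambda r: r.get("task_id", DEFAULT_TASK) != active)
```

For a batch that alternates clustering and retrieval four times, the original order triggers four switches and the grouped order triggers one.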

The first audio request initializes the shared audio tower lazily. That one-time audio initialization cost is not part of steady-state same-task inference latency.

Runtime Footprint

The values below were measured on AX650 with the default split-media 8s audio profile and release_axmodel_buffer_after_init enabled. They reflect a steady-state process after the shared audio tower has been initialized.

Item	Value
Estimated additional AXERA CMM over idle board baseline	`537872 KB` (`~525 MiB`)
AXERA CMM total used in the measured board state	`812876 KB` (`~794 MiB`)
Linux process RSS	`160052 KB` (`~156 MiB`)
Linux process PSS	`159925 KB` (`~156 MiB`)
Service ready time	`~6 s` in the measured run

CMM means AXERA contiguous multimedia memory. RSS and PSS are the resident set size and proportional set size of the axllm serve process.

Token Layout and Static Shapes

The final embedding output is always [1, 768].

Default encoder profiles:

Input	Static input profile	Soft tokens	Encoder output
Image	`256x256`	`64`	shared tower `[256, 768]`, then mapper `[1, 64, 768]`
Audio	`8.0s`, `16kHz`, mono PCM WAV, `800` mel frames	`200`	shared tower `[200, 1280]`, then projector `[1, 200, 768]`
Video	frame directory, `256x256` per frame, default `3` frames	`64 x frame_count`	per frame tower `[256, 768]`, then mapper `[1, 64, 768]`

The packaged text backbone is compiled with:

prefill_len = 128
effective prefill_max_token_num = 1024
max_token_len = 2047

Choose the video frame count according to your application and the compiled prefill budget.
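
The arithmetic behind that choice can be sketched from the numbers in this README: each frame contributes 64 soft tokens, and the validated 3-frame case used 213 LLM input tokens, of which 192 were soft tokens, leaving 21 text tokens for that prompt. The sketch assumes the whole input must fit within prefill_max_token_num = 1024; how the runtime enforces the limit is not stated here, only the 3-frame configuration is validated, and the function names are hypothetical.

```python
SOFT_TOKENS_PER_FRAME = 64      # from the encoder profile table
PREFILL_MAX_TOKEN_NUM = 1024    # effective prefill budget of the packaged backbone


def llm_input_tokens(frame_count, text_tokens):
    """Total LLM input tokens for a video request."""
    return frame_count * SOFT_TOKENS_PER_FRAME + text_tokens


def max_frames(text_tokens, budget=PREFILL_MAX_TOKEN_NUM):
    """Largest frame count whose tokens still fit in the prefill budget."""
    return max(0, (budget - text_tokens) // SOFT_TOKENS_PER_FRAME)
```

With 21 text tokens, llm_input_tokens(3, 21) gives the 213 tokens reported in the latency table, and max_frames(21) gives 15 as an upper bound under the stated assumption.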

Input Notes

Use task_id=retrieval for retrieval embeddings and task_id=clustering for clustering embeddings.
document and query are separate text prompt modes exposed by the upstream model. Use document for corpus/document embeddings and query for search queries.
Audio input must be 8s, 16kHz, mono PCM WAV for the default package profile. Convert audio offline on a host with ffmpeg if needed: ffmpeg -i input.wav -ac 1 -ar 16000 -sample_fmt s16 output_16k_mono.wav.
Video input should be a frame directory. Retrieval and clustering validation both use 3 frames from the shipped sample directory.
Arbitrary image resolution, audio duration, video frame count, or token budget requires rebuilding the corresponding encoder or LLM configuration.
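
A quick pre-flight check of an audio file against the default profile can be done with the standard wave module. This is a sketch, not part of the package: check_wav is a hypothetical helper, the 16-bit sample width check is inferred from the -sample_fmt s16 option in the ffmpeg command above, and the duration tolerance is an arbitrary illustrative value.

```python
import wave


def check_wav(path, expected_seconds=8.0, tolerance=0.05):
    """Return a list of problems; an empty list means the file matches the profile."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        width = wav.getsampwidth()
        duration = wav.getnframes() / rate
    if channels != 1:
        problems.append("expected mono, got %d channels" % channels)
    if rate != 16000:
        problems.append("expected 16000 Hz, got %d Hz" % rate)
    if width != 2:
        # Assumption: 16-bit PCM, as produced by `-sample_fmt s16`.
        problems.append("expected 16-bit samples, got %d-bit" % (width * 8))
    if abs(duration - expected_seconds) > tolerance:
        problems.append("expected %.1f s, got %.2f s" % (expected_seconds, duration))
    return problems
```

Note that the wave module only reads uncompressed PCM WAV files, so a file in another encoding raises wave.Error instead of returning a problem list.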

Conversion References

Upstream model: https://huggingface.co/jinaai/jina-embeddings-v5-omni-nano
AXERA runtime: https://github.com/AXERA-TECH/ax-llm

If you rebuild the LLM AXModels from the original Hugging Face checkpoint, make sure the build config contains the flattened field text_config.rope_theta = 1000000.0. The upstream nano config stores this value under text_config.rope_parameters.rope_theta, while the AXERA llama build route reads text_config.rope_theta.
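
The flattening step can be expressed as a small transformation on the loaded config dictionary. The sketch below copies text_config.rope_parameters.rope_theta to text_config.rope_theta when the flattened field is missing; flatten_rope_theta is a hypothetical helper, and where the build tool reads its config from is not covered here.

```python
import copy


def flatten_rope_theta(config):
    """Return a copy of config with text_config.rope_theta filled in if absent."""
    patched = copy.deepcopy(config)
    text_config = patched.get("text_config", {})
    nested = text_config.get("rope_parameters", {}).get("rope_theta")
    if "rope_theta" not in text_config and nested is not None:
        # Copy the nested upstream value to the flat field the build route reads.
        text_config["rope_theta"] = nested
    return patched
```

After patching, it is worth confirming that the resulting value is 1000000.0 before starting the build.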

Discussion

This package validates retrieval and clustering task switching on AX650 with a shared LLM base and task-specific adapter patches. Additional upstream adapters need their own AX650 adapter patches and media encoder validation before they are enabled in this package.
