jina-embeddings-v5-omni-nano on AXERA NPU

Ready-to-run AX650 multi-task embedding package for jinaai/jina-embeddings-v5-omni-nano.

This package uses one shared LLM base graph, one shared media tower per modality, and task-specific adapter/mapper patches. The validated tasks are:

| Task ID | Intended use | Status |
| --- | --- | --- |
| retrieval | Query/document embeddings for retrieval and similarity search | Default, validated |
| clustering | Embeddings for clustering workloads | Validated |

The upstream model also contains classification and text-matching adapters. This package revision does not include validated AX650 artifacts for those two adapters.

The service exposes an OpenAI-compatible /v1/embeddings endpoint for text, image, 8-second audio, and frame-directory video inputs. The output embedding shape is always [1, 768]. Runtime inference does not require the original Hugging Face safetensors files.

Supported Platform

  • AX650 / NPU3
  • Board runtime with AXEngine shared-weight adapter patch support.
  • Required AXEngine symbols: AX_ENGINE_CreateHandleV4 and AX_ENGINE_PatchHandleWBT.

Task switching depends on these shared-weight patch APIs, so the package cannot run with an older /soc/lib/libax_engine.so that does not export them.

Download

Run on a Linux host with the Hugging Face CLI installed:

mkdir -p AXERA-TECH/jina-embeddings-v5-omni-nano
cd AXERA-TECH/jina-embeddings-v5-omni-nano
hf download AXERA-TECH/jina-embeddings-v5-omni-nano --local-dir .

Package Layout

.
├── README.md
├── config.json
├── bin/axllm
├── runtime/
│   └── lib/
│       └── libax_engine.so
├── delta/
│   ├── llama_p128_l0_together.base_weight.bin
│   ├── llama_p128_l0_together.adapter_delta.bin
│   ├── llama_p128_l0_together.adapter_delta.json
│   └── ...
├── delta_clustering/
│   ├── llama_p128_l0_together.adapter_delta.bin
│   ├── llama_p128_l0_together.adapter_delta.json
│   └── ...
├── llama_p128_l0_together.axmodel
├── ...
├── llama_p128_l11_together.axmodel
├── llama_post.axmodel
├── model.embed_tokens.weight.bfloat16.bin
├── jina_v5_omni_tokenizer/
├── jina_v5_omni_tokenizer.txt
├── jina_v5_omni_nano_vision_tower_256x256.axmodel
├── jina_v5_omni_nano_vision_merger_retrieval_256x256.axmodel
├── jina_v5_omni_nano_vision_merger_clustering_256x256.axmodel
├── jina_v5_omni_nano_audio_tower_8s.axmodel
├── jina_v5_omni_nano_audio_projector_retrieval_8s.axmodel
├── jina_v5_omni_nano_audio_projector_clustering_8s.axmodel
├── python/
└── assets/

The package contains one LLM graph set at the package root. Task-specific LLM differences are stored as adapter delta files under delta/ and delta_clustering/.

The vision and audio towers are shared by retrieval and clustering. The small merger/projector AXModels are task-specific.

Start the Service

Run on the AX650 board from the package root. The launch script takes the service port as its only argument:

chmod +x ./start_axllm.sh
./start_axllm.sh 8000

start_axllm.sh sets the required runtime environment automatically:

  • AXENGINE_SHARED_WEIGHT_LIB_DIR=$(pwd)/runtime/lib
  • LD_LIBRARY_PATH=$(pwd)/runtime/lib:/soc/lib:...
  • AXLLM_RELEASE_AXMODEL_BUFFER_AFTER_INIT=1

It also checks that the active libax_engine.so exports the required shared-weight symbols before starting the service:

  • AX_ENGINE_CreateHandleV4
  • AX_ENGINE_PatchHandleWBT
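The same export check can be reproduced by hand when debugging a board image. The sketch below is not the logic of start_axllm.sh; it is a minimal illustration that assumes the library can be loaded with ctypes and that a missing export surfaces as a failed attribute lookup. The helper name missing_symbols is hypothetical.

```python
import ctypes

# Symbol names taken from the list above.
REQUIRED_SYMBOLS = ("AX_ENGINE_CreateHandleV4", "AX_ENGINE_PatchHandleWBT")


def missing_symbols(lib, required=REQUIRED_SYMBOLS):
    """Return the required symbol names that `lib` does not export.

    `lib` is any object that exposes exported symbols as attributes,
    for example a ctypes.CDLL handle.
    """
    return [name for name in required if not hasattr(lib, name)]


def check_runtime(path="runtime/lib/libax_engine.so"):
    """Load the library at `path` (assumed package-local location) and report gaps."""
    lib = ctypes.CDLL(path)
    return missing_symbols(lib)


# Example, run on the board from the package root:
#   print(check_runtime() or "all required symbols exported")
```

An empty list means both exports were found. Loading may still fail on a host machine, since the library depends on other board-side libraries.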

This package includes the compatible shared-weight AXEngine runtime at runtime/lib/libax_engine.so. Do not overwrite the system copy under /soc/lib just for this package. Setting LD_LIBRARY_PATH in the launch script affects only the service process, and it is the recommended deployment method when the board image has not yet integrated the shared-weight AXEngine runtime.

Health checks:

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models

Expected model id:

AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047
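A client can run the same check before sending traffic. The sketch below assumes that /v1/models returns the usual OpenAI-style list, an object with a data array whose entries carry an id field; this README does not show the response body, so treat that layout as an assumption. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen

EXPECTED_MODEL_ID = "AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047"


def model_ids(payload):
    """Extract ids from an OpenAI-style model list: {"data": [{"id": ...}, ...]}."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def service_has_model(base_url="http://127.0.0.1:8000", expected=EXPECTED_MODEL_ID):
    """Fetch /v1/models and report whether the expected model id is listed."""
    with urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return expected in model_ids(json.load(resp))


# Example, with the service running on the board:
#   print(service_has_model())
```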

OpenAI-Compatible Examples

Use task_id to select the adapter. If omitted, the service uses the default retrieval task.

The active adapter is process-global. Do not send mixed task_id requests concurrently to the same service process. Use serial task switching, or run separate service processes if your application needs concurrent retrieval and clustering traffic.

Run the example commands below serially. The default packaged service is validated with max_concurrency = 1.
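Because the adapter is process-global and each switch has a cost, a client that holds a queue of mixed requests may prefer to group them by task before sending them one at a time. The sketch below is one way to do that on the client side. It assumes responses do not need to come back in submission order and that the service starts with the retrieval adapter active; both helper names are hypothetical.

```python
def order_by_task(requests, first_task="retrieval"):
    """Stable-sort request bodies so that each task_id forms one contiguous run."""
    def key(body):
        task = body.get("task_id", "retrieval")  # an omitted task_id means retrieval
        return (task != first_task, task)
    return sorted(requests, key=key)


def count_switches(requests, active="retrieval"):
    """Count adapter changes if the bodies are sent serially in the given order."""
    switches = 0
    for body in requests:
        task = body.get("task_id", "retrieval")
        if task != active:
            switches += 1
            active = task
    return switches


queue = [
    {"task_id": "clustering", "input": "a"},
    {"input": "b"},
    {"task_id": "clustering", "input": "c"},
    {"task_id": "retrieval", "input": "d"},
]
ordered = order_by_task(queue)
print(count_switches(queue), "switches as submitted,", count_switches(ordered), "after grouping")
```

Send the grouped bodies one after another from a single worker so that no two requests overlap.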

Task switching is request-driven. Start axllm serve once, then set one of these fields in each /v1/embeddings request body:

  • task_id: recommended field.
  • adapter_task: accepted alias.
  • task: accepted alias.

Supported values in this package are retrieval and clustering. The runtime switches the active adapter before executing the request. The next request may use a different task_id; no service restart is required.

For text-only embedding requests, task switching patches only the LLM adapter. For image, audio, or video requests, the runtime also switches the task-specific media mapper/projector while reusing the shared media tower.

Minimal retrieval request:

curl http://127.0.0.1:8000/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047",
    "task_id": "retrieval",
    "prompt_name": "query",
    "input": "Which planet is known as the Red Planet?",
    "encoding_format": "float"
  }'

Minimal clustering request:

curl http://127.0.0.1:8000/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047",
    "task_id": "clustering",
    "prompt_name": "document",
    "input": "Text embeddings convert sentences into dense vectors for clustering.",
    "encoding_format": "float"
  }'
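The two curl requests above can also be issued from Python with only the standard library. This is a sketch, not the packaged python/openai_embedding_demo.py: the request fields mirror the curl bodies, while the response layout (data[0].embedding) is assumed to follow the OpenAI embeddings schema. The function names are hypothetical.

```python
import json
from urllib.request import Request, urlopen

MODEL_ID = "AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047"
SUPPORTED_TASKS = ("retrieval", "clustering")  # tasks validated in this package


def build_embedding_body(text, task_id="retrieval", prompt_name="query"):
    """Build a /v1/embeddings request body with the same fields as the curl examples."""
    if task_id not in SUPPORTED_TASKS:
        raise ValueError(f"unsupported task_id: {task_id!r}")
    return {
        "model": MODEL_ID,
        "task_id": task_id,
        "prompt_name": prompt_name,
        "input": text,
        "encoding_format": "float",
    }


def embed(text, task_id="retrieval", prompt_name="query",
          api_url="http://127.0.0.1:8000/v1"):
    """POST one text input and return its embedding as a list of floats."""
    data = json.dumps(build_embedding_body(text, task_id, prompt_name)).encode("utf-8")
    req = Request(f"{api_url}/embeddings", data=data,
                  headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=60) as resp:
        payload = json.load(resp)
    # Assumed OpenAI-style response: {"data": [{"embedding": [...]}]}
    return payload["data"][0]["embedding"]


# Example, with the service running:
#   vec = embed("Which planet is known as the Red Planet?")
#   print(len(vec))
```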

Text:

python3 python/openai_embedding_demo.py \
  --api-url http://127.0.0.1:8000/v1 \
  --model AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047 \
  --task-id retrieval \
  --prompt-name query \
  --input "Which planet is known as the Red Planet?"

Image:

python3 python/openai_multimodal_embedding_demo.py \
  --api-url http://127.0.0.1:8000/v1 \
  --model AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047 \
  --task-id retrieval \
  --prompt-name query \
  --media-type image \
  --media-path assets/sample.png

Audio:

python3 python/openai_multimodal_embedding_demo.py \
  --api-url http://127.0.0.1:8000/v1 \
  --model AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047 \
  --task-id retrieval \
  --prompt-name query \
  --media-type audio \
  --media-path assets/audio_test_chunk0_8s.wav

Video:

python3 python/openai_multimodal_embedding_demo.py \
  --api-url http://127.0.0.1:8000/v1 \
  --model AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047 \
  --task-id retrieval \
  --prompt-name query \
  --media-type video \
  --media-path assets/red-panda-openai.frames

Clustering:

python3 python/openai_embedding_demo.py \
  --api-url http://127.0.0.1:8000/v1 \
  --model AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047 \
  --task-id clustering \
  --prompt-name document \
  --input "Text embeddings convert sentences into dense vectors for retrieval, clustering, and similarity search."

The validated video path is a directory of pre-extracted frames. If you want to use a video file, extract frames first and pass the frame directory to the API.

Image Retrieval and Clustering Evaluation

The model returns embeddings. Retrieval and clustering are client-side operations built from those embeddings.
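As an illustration of such a client-side operation, the sketch below ranks stored embeddings by cosine similarity to a query embedding. It is a generic nearest-neighbor search in plain Python, not code taken from the packaged evaluation script; the short vectors stand in for real 768-dimensional embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, corpus, k=3):
    """Return (index, score) pairs for the k corpus vectors most similar to the query."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-dimensional vectors in place of real embeddings.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for index, score in top_k([1.0, 0.0], corpus, k=2):
    print(index, round(score, 4))
```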

The evaluation script accepts either:

  • one subdirectory per class, or
  • a flat demo directory with label-prefixed filenames such as cat_0.jpg and dog_1.jpg

Example directory layouts:

animal_test/
├── cat/
│   ├── cat_001.jpg
│   └── cat_002.jpg
├── dog/
│   ├── dog_001.jpg
│   └── dog_002.jpg
└── rabbit/
    ├── rabbit_001.jpg
    └── rabbit_002.jpg

animals_imgs/
├── cat_0.jpeg
├── cat_1.jpeg
├── dog_0.jpeg
└── ...

Run after starting axllm serve:

python3 python/evaluate_image_retrieval_clustering.py \
  --api-url http://127.0.0.1:8000/v1 \
  --dataset-dir animals_imgs \
  --output-json tmp/animals_imgs_eval.json

The script evaluates:

  • Image nearest-neighbor retrieval with task_id=retrieval.
  • Text-to-image retrieval with queries such as a photo of a cat.
  • K-means cluster purity with task_id=clustering.

Use at least two images per class. More images per class make the result more meaningful.
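For readers unfamiliar with the purity metric, the sketch below uses the standard definition: each cluster is credited with its most frequent true label, and the credited counts are divided by the total number of items. Whether the packaged script computes purity in exactly this way is an assumption.

```python
from collections import Counter


def cluster_purity(labels, assignments):
    """Fraction of items whose true label is the majority label of their cluster."""
    if not labels or len(labels) != len(assignments):
        raise ValueError("labels and assignments must be non-empty and equal in length")
    by_cluster = {}
    for label, cluster in zip(labels, assignments):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


labels = ["cat", "cat", "dog", "dog", "dog"]
assignments = [0, 0, 0, 1, 1]  # cluster ids, e.g. from k-means
print(cluster_purity(labels, assignments))
```

With a single image per class, every cluster is trivially pure, which is one reason small datasets give uninformative scores.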

Board Precision

The tables below compare board-side axllm serve embeddings with the packaged Hugging Face reference embeddings under python/testdata/service_cases/<task_id>/*/torch_embedding.npy.

Run retrieval validation after starting the service:

python3 python/compare_openai_api_vs_hf_multimodal.py \
  --api-url http://127.0.0.1:8000/v1 \
  --api-package-root . \
  --task-id retrieval

Run clustering validation:

python3 python/compare_openai_api_vs_hf_multimodal.py \
  --api-url http://127.0.0.1:8000/v1 \
  --api-package-root . \
  --task-id clustering

The board-side comparison script does not run the original Hugging Face model. It compares the API output with the packaged HF reference embeddings included in this repository.
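The three metrics reported in the result tables can be computed from any pair of embeddings. The sketch below shows the conventional definitions in plain Python; it is not the packaged comparison script, and the assumption is that the script uses the same formulas.

```python
import math


def compare_embeddings(api_vec, ref_vec):
    """Return cosine similarity, mean absolute diff, and max absolute diff."""
    if not api_vec or len(api_vec) != len(ref_vec):
        raise ValueError("vectors must be non-empty and equal in length")
    diffs = [abs(a - r) for a, r in zip(api_vec, ref_vec)]
    dot = sum(a * r for a, r in zip(api_vec, ref_vec))
    norm = math.sqrt(sum(a * a for a in api_vec)) * math.sqrt(sum(r * r for r in ref_vec))
    return {
        "cosine": dot / norm if norm else 0.0,
        "mean_abs_diff": sum(diffs) / len(diffs),
        "max_abs_diff": max(diffs),
    }


# Toy vectors; real inputs would be the 768 values from the API response
# and from a torch_embedding.npy reference file.
print(compare_embeddings([1.0, 2.0], [1.0, 2.5]))
```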

Validated retrieval results:

| Modality | Case | Output shape | Soft tokens | Cosine vs HF | Mean abs diff | Max abs diff |
| --- | --- | --- | --- | --- | --- | --- |
| Text document | embedding_doc | [1, 768] | - | 0.999655 | 0.000760 | 0.003988 |
| Text query | red_planet_query | [1, 768] | - | 0.999539 | 0.000835 | 0.009185 |
| Image | vision_sample | [1, 768] | 64 | 0.996820 | 0.002271 | 0.012364 |
| Audio 8s | audio_test_chunk0_8s_wav | [1, 768] | 200 | 0.992794 | 0.003383 | 0.015184 |
| Video 3 frames | video_visual_red_panda_openai_mp4 | [1, 768] | 192 | 0.996833 | 0.002273 | 0.015725 |

Validated clustering results:

| Modality | Case | Output shape | Soft tokens | Cosine vs HF | Mean abs diff | Max abs diff |
| --- | --- | --- | --- | --- | --- | --- |
| Text document | embedding_doc | [1, 768] | - | 0.999692 | 0.000719 | 0.002805 |
| Text query | red_planet_query | [1, 768] | - | 0.999579 | 0.000842 | 0.003600 |
| Image | vision_sample | [1, 768] | 64 | 0.998207 | 0.001708 | 0.009309 |
| Audio 8s | audio_test_chunk0_8s_wav | [1, 768] | 200 | 0.998418 | 0.001614 | 0.006731 |
| Video 3 frames | video_visual_red_panda_openai_mp4 | [1, 768] | 192 | 0.988590 | 0.004383 | 0.016320 |

The clustering video case is the lowest-precision validated case in this package. Text, image, audio, and retrieval video are closer to the packaged HF references.

Performance

This model returns embeddings and does not run token-by-token decoding. The useful runtime metric is media preparation plus LLM prefill.

The default config.json uses a shared 256x256 vision tower, shared 8s audio tower, and task-specific retrieval/clustering mapper/projector AXModels. The audio examples use the shipped assets/audio_test_chunk0_8s.wav, which is 16kHz mono PCM WAV. This package validates and recommends 16kHz mono PCM WAV for audio input.

Split media AXModels in the default package:

| AXModel | Role | Output tokens | File size |
| --- | --- | --- | --- |
| jina_v5_omni_nano_vision_tower_256x256.axmodel | shared vision tower | raw 256 | 96,659,343 bytes |
| jina_v5_omni_nano_vision_merger_retrieval_256x256.axmodel | retrieval vision mapper | 64 | 12,767,060 bytes |
| jina_v5_omni_nano_vision_merger_clustering_256x256.axmodel | clustering vision mapper | 64 | 12,767,092 bytes |
| jina_v5_omni_nano_audio_tower_8s.axmodel | shared audio tower | raw 200 | 702,128,779 bytes |
| jina_v5_omni_nano_audio_projector_retrieval_8s.axmodel | retrieval audio projector | 200 | 1,093,363 bytes |
| jina_v5_omni_nano_audio_projector_clustering_8s.axmodel | clustering audio projector | 200 | 1,093,363 bytes |

Representative retrieval latency from the AX650 OpenAI API path:

| Scenario | Prompt | LLM input tokens | Soft tokens | Output shape | Media prepare | LLM prefill | Runtime total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Text document | document | 19 | - | [1, 768] | - | 120.13 ms | 121.30 ms |
| Text query | query | 12 | - | [1, 768] | - | 93.95 ms | 94.34 ms |
| Image | query | 85 | 64 | [1, 768] | typically ~52-181 ms after warmup | typically ~70 ms | typically ~0.12-0.35 s after warmup |
| Audio 8s | query | 223 | 200 | [1, 768] | included in total | included in total | typically ~0.8-1.0 s after warmup |
| Video 3 frames | query | 213 | 192 | [1, 768] | included in total | included in total | typically ~0.35 s after warmup |

Task switching latency measured with the packaged shared-base runtime:

| Switch case | Typical latency | Notes |
| --- | --- | --- |
| Text retrieval -> clustering | about 1-2 s | LLM adapter patch only |
| Image retrieval -> clustering | about 1.5 s | Reuses shared vision tower and switches the small mapper |
| Same-task image request after mapper is active | typically ~0.13-0.47 s | No adapter or tower reload |

The first audio request initializes the shared audio tower lazily. That one-time audio initialization cost is not part of steady-state same-task inference latency.

Runtime Footprint

Measured on AX650 with the default split-media 8s audio profile and release_axmodel_buffer_after_init enabled. The sample below reflects a steady-state process after the shared audio tower has been initialized.

| Item | Value |
| --- | --- |
| Estimated additional AXERA CMM over idle board baseline | 537872 KB (~525 MiB) |
| AXERA CMM total used in the measured board state | 812876 KB (~794 MiB) |
| Linux process RSS | 160052 KB (~156 MiB) |
| Linux process PSS | 159925 KB (~156 MiB) |
| Service ready time | ~6 s in the measured run |

CMM means AXERA contiguous multimedia memory. RSS and PSS are the resident and proportional set sizes of the axllm serve process as reported by Linux.

Token Layout and Static Shapes

The final embedding output is always [1, 768].

Default encoder profiles:

| Input | Static input profile | Soft tokens | Encoder output |
| --- | --- | --- | --- |
| Image | 256x256 | 64 | shared tower [256, 768], then mapper [1, 64, 768] |
| Audio | 8.0s, 16kHz, mono PCM WAV, 800 mel frames | 200 | shared tower [200, 1280], then projector [1, 200, 768] |
| Video | frame directory, 256x256 per frame, default 3 frames | 64 x frame_count | per frame tower [256, 768], then mapper [1, 64, 768] |

The packaged text backbone is compiled with:

  • prefill_len = 128
  • effective prefill_max_token_num = 1024
  • max_token_len = 2047

Choose the video frame count according to your application and the compiled prefill budget.
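A rough way to size the frame count is to add up the tokens a video request would occupy. The sketch below rests on two assumptions that this README does not state outright: that the whole request must fit within the effective prefill_max_token_num of 1024, and that the text prompt around the frames costs about 21 tokens, inferred from the 3-frame latency row (213 LLM input tokens minus 192 soft tokens). The runtime may apply further limits.

```python
SOFT_TOKENS_PER_FRAME = 64  # video row of the encoder profile table
PREFILL_MAX_TOKENS = 1024   # effective prefill_max_token_num


def video_llm_tokens(frame_count, text_tokens):
    """Total LLM input tokens for a video request with the given prompt overhead."""
    return SOFT_TOKENS_PER_FRAME * frame_count + text_tokens


def max_frames(text_tokens, budget=PREFILL_MAX_TOKENS):
    """Largest frame count whose soft tokens plus prompt fit in the budget."""
    return max(0, (budget - text_tokens) // SOFT_TOKENS_PER_FRAME)


print(video_llm_tokens(3, 21))  # 3 frames with a 21-token prompt
print(max_frames(21))           # upper bound under the stated assumptions
```

Only the 3-frame configuration is validated in this package, so treat larger counts as candidates to test rather than supported settings.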

Input Notes

  • Use task_id=retrieval for retrieval embeddings and task_id=clustering for clustering embeddings.
  • document and query are separate text prompt modes exposed by the upstream model. Use document for corpus/document embeddings and query for search queries.
  • Audio input must be 8s, 16kHz, mono PCM WAV for the default package profile. Convert audio offline on a host with ffmpeg if needed: ffmpeg -i input.wav -ac 1 -ar 16000 -sample_fmt s16 output_16k_mono.wav.
  • Video input should be a frame directory. Retrieval and clustering validation both use 3 frames from the shipped sample directory.
  • Arbitrary image resolution, audio duration, video frame count, or token budget requires rebuilding the corresponding encoder or LLM configuration.
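Before sending audio, a client can confirm that a file matches the default profile. The sketch below uses the standard-library wave module. The 16-bit sample width is an assumption drawn from the s16 flag in the ffmpeg command above, and the 10 ms duration tolerance is arbitrary; the function name is hypothetical.

```python
import wave


def wav_profile_issues(path, seconds=8.0, rate=16000, tolerance=0.01):
    """Return a list of mismatches against the 8s, 16kHz, mono, 16-bit PCM profile."""
    issues = []
    with wave.open(path, "rb") as wav:  # raises wave.Error for non-PCM files
        if wav.getnchannels() != 1:
            issues.append(f"channels: {wav.getnchannels()} (expected 1)")
        if wav.getframerate() != rate:
            issues.append(f"sample rate: {wav.getframerate()} Hz (expected {rate})")
        if wav.getsampwidth() != 2:
            issues.append(f"sample width: {wav.getsampwidth()} bytes (expected 2)")
        duration = wav.getnframes() / wav.getframerate()
        if abs(duration - seconds) > tolerance:
            issues.append(f"duration: {duration:.3f} s (expected {seconds})")
    return issues


# Example:
#   print(wav_profile_issues("assets/audio_test_chunk0_8s.wav") or "profile ok")
```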

Conversion References

  • Upstream model: https://huggingface.co/jinaai/jina-embeddings-v5-omni-nano
  • AXERA runtime: https://github.com/AXERA-TECH/ax-llm

If you rebuild the LLM AXModels from the original Hugging Face checkpoint, make sure the build config contains the flattened field text_config.rope_theta = 1000000.0. The upstream nano config stores this value under text_config.rope_parameters.rope_theta, while the AXERA llama build route reads text_config.rope_theta.
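The flattening step can be scripted when preparing the build config. The sketch below works on an already-loaded config dictionary and copies the nested value to the flat key only when the flat key is absent. Which file the build route reads, and how it is loaded, is left open here; the function name is hypothetical.

```python
import copy


def flatten_rope_theta(config):
    """Return a copy of `config` with text_config.rope_theta set from rope_parameters."""
    patched = copy.deepcopy(config)
    text_config = patched.get("text_config", {})
    nested = text_config.get("rope_parameters", {}).get("rope_theta")
    if "rope_theta" not in text_config and nested is not None:
        text_config["rope_theta"] = float(nested)
    return patched


upstream = {"text_config": {"rope_parameters": {"rope_theta": 1000000.0}}}
print(flatten_rope_theta(upstream)["text_config"]["rope_theta"])
```

The input dictionary is left unmodified, so the original and patched configs can be compared before writing anything to disk.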

Discussion

This package validates retrieval and clustering task switching on AX650 with a shared LLM base and task-specific adapter patches. Additional upstream adapters need their own AX650 adapter patches and media encoder validation before they are enabled in this package.
