jina-embeddings-v5-omni-nano on AXERA NPU
Ready-to-run AX650 multi-task embedding package for jinaai/jina-embeddings-v5-omni-nano.
This package uses one shared LLM base graph, one shared media tower per modality, and task-specific adapter/mapper patches. The validated tasks are:
| Task ID | Intended use | Status |
|---|---|---|
| retrieval | Query/document embeddings for retrieval and similarity search | Default, validated |
| clustering | Embeddings for clustering workloads | Validated |
The upstream model also contains classification and text-matching adapters. This package revision does not include validated AX650 artifacts for those two adapters.
The service exposes an OpenAI-compatible /v1/embeddings endpoint for text, image, 8-second audio, and frame-directory video embeddings. The output embedding shape is always [1, 768]. Runtime inference does not require the original Hugging Face safetensors files.
Supported Platform
- AX650 / NPU3
- Board runtime with AXEngine shared-weight adapter patch support.
- Required AXEngine symbols: AX_ENGINE_CreateHandleV4 and AX_ENGINE_PatchHandleWBT.
This multi-task package uses one shared LLM base graph plus task-specific adapter patches. It cannot run with an older /soc/lib/libax_engine.so that does not export the shared-weight patch APIs.
Download
Run on a Linux host with the Hugging Face CLI installed:
mkdir -p AXERA-TECH/jina-embeddings-v5-omni-nano
cd AXERA-TECH/jina-embeddings-v5-omni-nano
hf download AXERA-TECH/jina-embeddings-v5-omni-nano --local-dir .
Package Layout
.
├── README.md
├── config.json
├── bin/axllm
├── runtime/
│   └── lib/
│       └── libax_engine.so
├── delta/
│   ├── llama_p128_l0_together.base_weight.bin
│   ├── llama_p128_l0_together.adapter_delta.bin
│   ├── llama_p128_l0_together.adapter_delta.json
│   └── ...
├── delta_clustering/
│   ├── llama_p128_l0_together.adapter_delta.bin
│   ├── llama_p128_l0_together.adapter_delta.json
│   └── ...
├── llama_p128_l0_together.axmodel
├── ...
├── llama_p128_l11_together.axmodel
├── llama_post.axmodel
├── model.embed_tokens.weight.bfloat16.bin
├── jina_v5_omni_tokenizer/
├── jina_v5_omni_tokenizer.txt
├── jina_v5_omni_nano_vision_tower_256x256.axmodel
├── jina_v5_omni_nano_vision_merger_retrieval_256x256.axmodel
├── jina_v5_omni_nano_vision_merger_clustering_256x256.axmodel
├── jina_v5_omni_nano_audio_tower_8s.axmodel
├── jina_v5_omni_nano_audio_projector_retrieval_8s.axmodel
├── jina_v5_omni_nano_audio_projector_clustering_8s.axmodel
├── python/
└── assets/
The package contains one LLM graph set at the package root. Task-specific LLM differences are stored as adapter delta files under delta/ and delta_clustering/.
The vision and audio towers are shared by retrieval and clustering. The small merger/projector AXModels are task-specific.
Start the Service
Run on the AX650 board from the package root. The start script takes one argument, the service port:
chmod +x ./start_axllm.sh
./start_axllm.sh 8000
start_axllm.sh sets the required runtime environment automatically:
- AXENGINE_SHARED_WEIGHT_LIB_DIR=$(pwd)/runtime/lib
- LD_LIBRARY_PATH=$(pwd)/runtime/lib:/soc/lib:...
- AXLLM_RELEASE_AXMODEL_BUFFER_AFTER_INIT=1
It also checks that the active libax_engine.so exports the required shared-weight symbols before starting the service:
- AX_ENGINE_CreateHandleV4
- AX_ENGINE_PatchHandleWBT
This package includes the compatible shared-weight AXEngine runtime at runtime/lib/libax_engine.so. Do not overwrite /soc/lib just for this package. The package-local LD_LIBRARY_PATH method is process-local and is the recommended deployment method when the board image has not yet integrated the shared-weight AXEngine runtime.
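The symbol check described above can also be reproduced from Python. The following is a minimal sketch, not the logic of start_axllm.sh: it assumes the library can be loaded with ctypes on the board, and the helper names missing_symbols and check_library are hypothetical.

```python
import ctypes

# Symbol names taken from the requirements listed in this README.
REQUIRED_SYMBOLS = ("AX_ENGINE_CreateHandleV4", "AX_ENGINE_PatchHandleWBT")


def missing_symbols(lib, names=REQUIRED_SYMBOLS):
    """Return the required names that `lib` does not export.

    With a ctypes.CDLL, looking up a symbol that is not exported raises
    AttributeError, so hasattr() is sufficient for the check.
    """
    return [name for name in names if not hasattr(lib, name)]


def check_library(path="runtime/lib/libax_engine.so"):
    """Load the shared library and list any missing shared-weight symbols.

    Raises OSError if the library or one of its dependencies cannot be loaded.
    """
    return missing_symbols(ctypes.CDLL(path))
```

An empty list from check_library() suggests the runtime exports both shared-weight APIs; a non-empty list names the ones that are absent.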
Health checks:
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
Expected model id:
AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047
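A readiness check can compare the reported model id against the expected one. This sketch assumes /v1/models returns an OpenAI-style list of the form {"data": [{"id": ...}]}; that response shape and the helper names are assumptions, not something this README specifies.

```python
import json
import urllib.request

EXPECTED_MODEL_ID = "AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047"


def model_ids(models_response):
    """Extract model ids from an OpenAI-style model list (assumed shape)."""
    return [entry["id"] for entry in models_response.get("data", []) if "id" in entry]


def fetch_json(url, timeout=10.0):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def service_is_ready(base_url="http://127.0.0.1:8000"):
    """Return True when the expected model id is listed by the service."""
    return EXPECTED_MODEL_ID in model_ids(fetch_json(base_url + "/v1/models"))
```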
OpenAI-Compatible Examples
Use task_id to select the adapter. If omitted, the service uses the default retrieval task.
The active adapter is process-global. Do not send mixed task_id requests concurrently to the same service process. Use serial task switching, or run separate service processes if your application needs concurrent retrieval and clustering traffic.
Run the example commands below serially. The default packaged service is validated with max_concurrency = 1.
Task switching is request-driven. Start axllm serve once, then set one of these fields in each /v1/embeddings request body:
- task_id: recommended field.
- adapter_task: accepted alias.
- task: accepted alias.
Supported values in this package are retrieval and clustering. The runtime switches the active adapter before executing the request. The next request may use a different task_id; no service restart is required.
For text-only embedding requests, task switching patches only the LLM adapter. For image, audio, or video requests, the runtime also switches the task-specific media mapper/projector while reusing the shared media tower.
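Because the active adapter is process-global, a multi-threaded client needs to serialize its own requests. One way to do that is sketched below, under assumptions: the class name SerialEmbeddingClient and the injected send callable are hypothetical, and the precedence between task_id and its aliases when several are present is a guess.

```python
import threading

SUPPORTED_TASKS = ("retrieval", "clustering")
# Recommended field first; the precedence between aliases is an assumption.
TASK_FIELDS = ("task_id", "adapter_task", "task")


def requested_task(payload, default="retrieval"):
    """Return the task a request body selects, or the default when omitted."""
    for field in TASK_FIELDS:
        if field in payload:
            return payload[field]
    return default


class SerialEmbeddingClient:
    """Send one request at a time so task switches never overlap."""

    def __init__(self, send):
        # `send` is any callable that takes a request dict and returns the
        # service response (for example an HTTP POST to /v1/embeddings).
        self._send = send
        self._lock = threading.Lock()

    def submit(self, payload):
        task = requested_task(payload)
        if task not in SUPPORTED_TASKS:
            raise ValueError(f"unsupported task: {task!r}")
        with self._lock:  # only one request in flight per client
            return self._send(payload)
```

A lock inside one client process does not protect against other clients talking to the same service; for concurrent retrieval and clustering traffic, separate service processes remain the safer choice.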
Minimal retrieval request:
curl http://127.0.0.1:8000/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{
"model": "AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047",
"task_id": "retrieval",
"prompt_name": "query",
"input": "Which planet is known as the Red Planet?",
"encoding_format": "float"
}'
Minimal clustering request:
curl http://127.0.0.1:8000/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{
"model": "AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047",
"task_id": "clustering",
"prompt_name": "document",
"input": "Text embeddings convert sentences into dense vectors for clustering.",
"encoding_format": "float"
}'
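The same requests can be issued from Python with only the standard library. This is a sketch rather than the packaged demo script: it assumes the response follows the OpenAI embeddings format with the vector under data[0].embedding, and the function names are hypothetical.

```python
import json
import urllib.request

MODEL_ID = "AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047"


def build_request(text, task_id="retrieval", prompt_name="query",
                  api_url="http://127.0.0.1:8000/v1"):
    """Build a POST request that mirrors the curl examples above."""
    body = {
        "model": MODEL_ID,
        "task_id": task_id,
        "prompt_name": prompt_name,
        "input": text,
        "encoding_format": "float",
    }
    return urllib.request.Request(
        api_url + "/embeddings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding(response):
    """Return the first embedding from an OpenAI-style response (assumed shape)."""
    return response["data"][0]["embedding"]


def embed(text, timeout=30.0, **kwargs):
    """Send one text to the service and return its embedding as a list of floats."""
    with urllib.request.urlopen(build_request(text, **kwargs), timeout=timeout) as resp:
        return parse_embedding(json.load(resp))
```

For example, embed("Which planet is known as the Red Planet?") would send the retrieval query shown above, and embed(text, task_id="clustering", prompt_name="document") the clustering one.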
Text:
python3 python/openai_embedding_demo.py \
--api-url http://127.0.0.1:8000/v1 \
--model AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047 \
--task-id retrieval \
--prompt-name query \
--input "Which planet is known as the Red Planet?"
Image:
python3 python/openai_multimodal_embedding_demo.py \
--api-url http://127.0.0.1:8000/v1 \
--model AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047 \
--task-id retrieval \
--prompt-name query \
--media-type image \
--media-path assets/sample.png
Audio:
python3 python/openai_multimodal_embedding_demo.py \
--api-url http://127.0.0.1:8000/v1 \
--model AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047 \
--task-id retrieval \
--prompt-name query \
--media-type audio \
--media-path assets/audio_test_chunk0_8s.wav
Video:
python3 python/openai_multimodal_embedding_demo.py \
--api-url http://127.0.0.1:8000/v1 \
--model AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047 \
--task-id retrieval \
--prompt-name query \
--media-type video \
--media-path assets/red-panda-openai.frames
Clustering:
python3 python/openai_embedding_demo.py \
--api-url http://127.0.0.1:8000/v1 \
--model AXERA-TECH/jina-embeddings-v5-omni-nano-AX650-P128-CTX2047 \
--task-id clustering \
--prompt-name document \
--input "Text embeddings convert sentences into dense vectors for retrieval, clustering, and similarity search."
The validated video path is a directory of pre-extracted frames. If you want to use a video file, extract frames first and pass the frame directory to the API.
Image Retrieval and Clustering Evaluation
The model returns embeddings. Retrieval and clustering are client-side operations built from those embeddings.
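As an illustration of the client-side step, retrieval can be as simple as ranking document embeddings by cosine similarity to a query embedding. The sketch below uses plain Python lists; the function names are hypothetical and it is not taken from the packaged evaluation script.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine(query, doc) for doc in documents]
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
```

With real data, query and each entry of documents would be the 768-value vectors returned by /v1/embeddings.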
The evaluation script accepts either:
- one subdirectory per class, or
- a flat demo directory with label-prefixed filenames such as cat_0.jpg and dog_1.jpg
Example directory layouts:
animal_test/
├── cat/
│   ├── cat_001.jpg
│   └── cat_002.jpg
├── dog/
│   ├── dog_001.jpg
│   └── dog_002.jpg
└── rabbit/
    ├── rabbit_001.jpg
    └── rabbit_002.jpg

animals_imgs/
├── cat_0.jpeg
├── cat_1.jpeg
├── dog_0.jpeg
└── ...
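One way a class label could be derived from either layout is sketched below. This is an illustration only: the packaged script may parse names differently, and treating everything before the last underscore as the label is an assumption.

```python
from pathlib import PurePosixPath


def label_for(relative_path):
    """Derive a class label from a path relative to the dataset directory.

    "cat/cat_001.jpg" -> "cat"  (name of the class subdirectory)
    "cat_0.jpeg"      -> "cat"  (filename prefix before the last underscore)
    """
    path = PurePosixPath(relative_path)
    if len(path.parts) > 1:
        return path.parts[0]
    prefix, sep, _ = path.stem.rpartition("_")
    return prefix if sep else path.stem
```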
Run after starting axllm serve:
python3 python/evaluate_image_retrieval_clustering.py \
--api-url http://127.0.0.1:8000/v1 \
--dataset-dir animals_imgs \
--output-json tmp/animals_imgs_eval.json
The script evaluates:
- Image nearest-neighbor retrieval with task_id=retrieval.
- Text-to-image retrieval with queries such as a photo of a cat.
- K-means cluster purity with task_id=clustering.
Use at least two images per class. More images per class make the result more meaningful.
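For reference, cluster purity is commonly computed as the fraction of samples that belong to the majority class of their assigned cluster. The sketch below shows that standard definition; whether the packaged script uses exactly this formula is an assumption.

```python
from collections import Counter, defaultdict


def cluster_purity(labels, cluster_ids):
    """Fraction of samples whose label is the majority label of their cluster."""
    if len(labels) != len(cluster_ids):
        raise ValueError("labels and cluster_ids must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    counts = defaultdict(Counter)
    for label, cluster in zip(labels, cluster_ids):
        counts[cluster][label] += 1
    majority_total = sum(max(counter.values()) for counter in counts.values())
    return majority_total / len(labels)
```

With one image per class, every cluster can trivially be pure, which is one reason a larger set per class gives a more meaningful score.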
Board Precision
The tables below compare board-side axllm serve embeddings with the packaged Hugging Face reference embeddings under python/testdata/service_cases/<task_id>/*/torch_embedding.npy.
Run retrieval validation after starting the service:
python3 python/compare_openai_api_vs_hf_multimodal.py \
--api-url http://127.0.0.1:8000/v1 \
--api-package-root . \
--task-id retrieval
Run clustering validation:
python3 python/compare_openai_api_vs_hf_multimodal.py \
--api-url http://127.0.0.1:8000/v1 \
--api-package-root . \
--task-id clustering
The board-side comparison script does not run the original Hugging Face model. It compares the API output with the packaged HF reference embeddings included in this repository.
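The three precision columns reported below can be computed from a pair of vectors as follows. This is a sketch of the usual definitions, not the packaged script: for instance, whether that script normalizes the vectors before taking differences is not stated here, and this version does not.

```python
import math


def compare(api_vec, ref_vec):
    """Return cosine similarity, mean absolute diff, and max absolute diff."""
    if len(api_vec) != len(ref_vec) or not api_vec:
        raise ValueError("vectors must be non-empty and of equal length")
    dot = sum(a * r for a, r in zip(api_vec, ref_vec))
    norm_api = math.sqrt(sum(a * a for a in api_vec))
    norm_ref = math.sqrt(sum(r * r for r in ref_vec))
    diffs = [abs(a - r) for a, r in zip(api_vec, ref_vec)]
    return {
        # Defined as 0.0 when either vector has zero length.
        "cosine": dot / (norm_api * norm_ref) if norm_api and norm_ref else 0.0,
        "mean_abs_diff": sum(diffs) / len(diffs),
        "max_abs_diff": max(diffs),
    }
```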
Validated retrieval results:
| Modality | Case | Output shape | Soft tokens | Cosine vs HF | Mean abs diff | Max abs diff |
|---|---|---|---|---|---|---|
| Text document | embedding_doc | [1, 768] | - | 0.999655 | 0.000760 | 0.003988 |
| Text query | red_planet_query | [1, 768] | - | 0.999539 | 0.000835 | 0.009185 |
| Image | vision_sample | [1, 768] | 64 | 0.996820 | 0.002271 | 0.012364 |
| Audio 8s | audio_test_chunk0_8s_wav | [1, 768] | 200 | 0.992794 | 0.003383 | 0.015184 |
| Video 3 frames | video_visual_red_panda_openai_mp4 | [1, 768] | 192 | 0.996833 | 0.002273 | 0.015725 |
Validated clustering results:
| Modality | Case | Output shape | Soft tokens | Cosine vs HF | Mean abs diff | Max abs diff |
|---|---|---|---|---|---|---|
| Text document | embedding_doc | [1, 768] | - | 0.999692 | 0.000719 | 0.002805 |
| Text query | red_planet_query | [1, 768] | - | 0.999579 | 0.000842 | 0.003600 |
| Image | vision_sample | [1, 768] | 64 | 0.998207 | 0.001708 | 0.009309 |
| Audio 8s | audio_test_chunk0_8s_wav | [1, 768] | 200 | 0.998418 | 0.001614 | 0.006731 |
| Video 3 frames | video_visual_red_panda_openai_mp4 | [1, 768] | 192 | 0.988590 | 0.004383 | 0.016320 |
The clustering video case is the lowest-precision validated case in this package. Text, image, audio, and retrieval video are closer to the packaged HF references.
Performance
This model returns embeddings and does not run token-by-token decoding. The useful runtime metric is media preparation plus LLM prefill.
The default config.json uses a shared 256x256 vision tower, shared 8s audio tower, and task-specific retrieval/clustering mapper/projector AXModels. The audio examples use the shipped assets/audio_test_chunk0_8s.wav, which is 16kHz mono PCM WAV. This package validates and recommends 16kHz mono PCM WAV for audio input.
Split media AXModels in the default package:
| AXModel | Role | Output tokens | File size |
|---|---|---|---|
| jina_v5_omni_nano_vision_tower_256x256.axmodel | shared vision tower | raw 256 | 96,659,343 bytes |
| jina_v5_omni_nano_vision_merger_retrieval_256x256.axmodel | retrieval vision mapper | 64 | 12,767,060 bytes |
| jina_v5_omni_nano_vision_merger_clustering_256x256.axmodel | clustering vision mapper | 64 | 12,767,092 bytes |
| jina_v5_omni_nano_audio_tower_8s.axmodel | shared audio tower | raw 200 | 702,128,779 bytes |
| jina_v5_omni_nano_audio_projector_retrieval_8s.axmodel | retrieval audio projector | 200 | 1,093,363 bytes |
| jina_v5_omni_nano_audio_projector_clustering_8s.axmodel | clustering audio projector | 200 | 1,093,363 bytes |
Representative retrieval latency from the AX650 OpenAI API path:
| Scenario | Prompt | LLM input tokens | Soft tokens | Output shape | Media prepare | LLM prefill | Runtime total |
|---|---|---|---|---|---|---|---|
| Text document | document | 19 | - | [1, 768] | - | 120.13 ms | 121.30 ms |
| Text query | query | 12 | - | [1, 768] | - | 93.95 ms | 94.34 ms |
| Image | query | 85 | 64 | [1, 768] | typically ~52-181 ms after warmup | typically ~70 ms | typically ~0.12-0.35 s after warmup |
| Audio 8s | query | 223 | 200 | [1, 768] | included in total | included in total | typically ~0.8-1.0 s after warmup |
| Video 3 frames | query | 213 | 192 | [1, 768] | included in total | included in total | typically ~0.35 s after warmup |
Task switching latency measured with the packaged shared-base runtime:
| Switch case | Typical latency | Notes |
|---|---|---|
| Text retrieval -> clustering | about 1-2 s | LLM adapter patch only |
| Image retrieval -> clustering | about 1.5 s | Reuses shared vision tower and switches the small mapper |
| Same-task image request after mapper is active | typically ~0.13-0.47 s | No adapter or tower reload |
The first audio request initializes the shared audio tower lazily. That one-time audio initialization cost is not part of steady-state same-task inference latency.
Runtime Footprint
Measured on AX650 with the default split-media 8s audio profile and release_axmodel_buffer_after_init enabled. The sample below reflects a steady-state process after the shared audio tower has been initialized.
| Item | Value |
|---|---|
| Estimated additional AXERA CMM over idle board baseline | 537872 KB (~525 MiB) |
| AXERA CMM total used in the measured board state | 812876 KB (~794 MiB) |
| Linux process RSS | 160052 KB (~156 MiB) |
| Linux process PSS | 159925 KB (~156 MiB) |
| Service ready time | ~6 s in the measured run |
CMM means AXERA contiguous multimedia memory. Linux process memory is resident process memory for the axllm serve process.
Token Layout and Static Shapes
The final embedding output is always [1, 768].
Default encoder profiles:
| Input | Static input profile | Soft tokens | Encoder output |
|---|---|---|---|
| Image | 256x256 | 64 | shared tower [256, 768], then mapper [1, 64, 768] |
| Audio | 8.0s, 16kHz, mono PCM WAV, 800 mel frames | 200 | shared tower [200, 1280], then projector [1, 200, 768] |
| Video | frame directory, 256x256 per frame, default 3 frames | 64 x frame_count | per frame tower [256, 768], then mapper [1, 64, 768] |
The packaged text backbone is compiled with:
- prefill_len = 128
- effective prefill_max_token_num = 1024
- max_token_len = 2047
Choose the video frame count according to your application and the compiled prefill budget.
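The frame-count arithmetic can be sketched as follows. The 64 soft tokens per frame come from the profile table above. Treating the effective prefill_max_token_num of 1024 as the limit on soft tokens plus text tokens is an assumption, and only the 3-frame case is validated in this package.

```python
SOFT_TOKENS_PER_FRAME = 64  # from the default video profile


def video_soft_tokens(frame_count):
    """Soft tokens contributed by a video of `frame_count` frames."""
    return SOFT_TOKENS_PER_FRAME * frame_count


def max_frames(token_budget, text_tokens):
    """Largest frame count whose soft tokens plus text tokens fit the budget."""
    remaining = token_budget - text_tokens
    return max(remaining // SOFT_TOKENS_PER_FRAME, 0)
```

For the validated sample, video_soft_tokens(3) gives 192, matching the tables. That sample's 213 LLM input tokens imply 21 non-media tokens, so under the stated assumption max_frames(1024, 21) evaluates to 15.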
Input Notes
- Use task_id=retrieval for retrieval embeddings and task_id=clustering for clustering embeddings.
- document and query are separate text prompt modes exposed by the upstream model. Use document for corpus/document embeddings and query for search queries.
- Audio input must be 8 s, 16 kHz, mono PCM WAV for the default package profile. Convert audio offline on a host with ffmpeg if needed: ffmpeg -i input.wav -ac 1 -ar 16000 -sample_fmt s16 output_16k_mono.wav. This command resamples and downmixes only; it does not change the duration.
- Video input should be a frame directory. Retrieval and clustering validation both use 3 frames from the shipped sample directory.
- Arbitrary image resolution, audio duration, video frame count, or token budget requires rebuilding the corresponding encoder or LLM configuration.
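Before sending audio, a client can verify that a file matches the default profile using Python's standard wave module. This is a minimal sketch: the function name check_wav and the duration tolerance are hypothetical, and how strictly the service enforces the 8 s length is not stated here.

```python
import wave


def check_wav(source, expected_rate=16000, expected_channels=1,
              expected_seconds=8.0, tolerance=0.01):
    """Return a list of problems; an empty list means the file matches.

    `source` is a path or a binary file object. wave.open raises wave.Error
    for files that are not PCM WAV.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != expected_channels:
            problems.append(f"channels: {wav.getnchannels()}")
        if wav.getframerate() != expected_rate:
            problems.append(f"sample rate: {wav.getframerate()} Hz")
        duration = wav.getnframes() / wav.getframerate()
        if abs(duration - expected_seconds) > tolerance:
            problems.append(f"duration: {duration:.3f} s")
    return problems
```

For the shipped sample, check_wav("assets/audio_test_chunk0_8s.wav") would be expected to return an empty list.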
Conversion References
- Upstream model: https://huggingface.co/jinaai/jina-embeddings-v5-omni-nano
- AXERA runtime: https://github.com/AXERA-TECH/ax-llm
If you rebuild the LLM AXModels from the original Hugging Face checkpoint, make sure the build config contains the flattened field text_config.rope_theta = 1000000.0. The upstream nano config stores this value under text_config.rope_parameters.rope_theta, while the AXERA llama build route reads text_config.rope_theta.
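The flattening step can be applied to a loaded config dictionary as sketched below. The function name flatten_rope_theta is hypothetical, and the sketch touches only the two fields named above; the rest of the build configuration is outside its scope.

```python
import copy


def flatten_rope_theta(config):
    """Return a copy of `config` with text_config.rope_theta filled in.

    The value is copied from text_config.rope_parameters.rope_theta when the
    flattened field is not already present; the input dict is left unchanged.
    """
    result = copy.deepcopy(config)
    text_config = result.get("text_config", {})
    nested = text_config.get("rope_parameters", {}).get("rope_theta")
    if "rope_theta" not in text_config and nested is not None:
        text_config["rope_theta"] = nested
    return result
```

In practice the dictionary would come from json.load on the checkpoint's config.json, and the result would be written back with json.dump before running the build.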
Discussion
This package validates retrieval and clustering task switching on AX650 with a shared LLM base and task-specific adapter patches. Additional upstream adapters need their own AX650 adapter patches and media encoder validation before they are enabled in this package.