---
license: mit
language:
- en
- zh
base_model:
- Qwen/Qwen3-VL-2B-Instruct
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Qwen3-VL
- Qwen3-VL-2B-Instruct
- Qwen3-VL-4B-Instruct
- Int4
- VLM
- GPTQ
- AX630C
- axllm
---

# Qwen3-VL-2B-Instruct-GPTQ-Int4-AX630C-P320-CTX448

This repository contains the AX630C deployment of `Qwen3-VL-2B-Instruct-GPTQ-Int4`.

The model assets have been flattened to match the `axllm` model directory convention, so the repository root can be passed directly to `axllm run` or `axllm serve`.

Compatible with Pulsar2 version: 5.0

## Conversion References

If you are interested in model conversion, start from the original models and the official runtime and conversion references:

- https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct
- https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct

[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)

[AXERA-TECH/ax-llm](https://github.com/AXERA-TECH/ax-llm)

## Supported Platforms

- AX630C
  - SDK >= v3.0.0
  - AX630C DEMO Board

## Performance Reference

The benchmark placeholders and demo references below are kept from the original repository because they help when comparing deployment variants.

**Image Process**

| Chip | input size | image num | image encoder | TTFT (168 tokens) | w4a16 | CMM | Flash |
|--|--|--|--|--|--|--|--|
| AX630C | 384*384 | 1 | ms | ms | tokens/sec | 2.0 GB | 2.7 GiB |

**Video Process**

| Chip | input size | image num | image encoder | TTFT (600 tokens) | w4a16 | CMM | Flash |
|--|--|--|--|--|--|--|--|
| AX630C | 384*384 | 8 | ms | ms | tokens/sec | 2.0 GB | 2.7 GB |

The CMM column gives the CMM (DDR) memory that the model consumes. Ensure that the CMM memory allocation on the development board is greater than this value.

## Repository Layout

```text
.
├── config.json
├── post_config.json
├── qwen3_tokenizer.txt
├── model.embed_tokens.weight.bfloat16.bin
├── qwen3_vl_text_p64_l0_together.axmodel
├── ...
├── qwen3_vl_text_p64_l27_together.axmodel
├── qwen3_vl_text_post.axmodel
├── Qwen3-VL-2B-Instruct_vision_u8_320_ax630c.axmodel
├── Qwen3-VL-2B-Instruct_vision_u8_384_ax630c.axmodel
├── images/
├── video/
├── qwen3-vl-tokenizer/
├── main_ax630c
├── main_ax630c_api
└── run_*.sh
```

All model files that `axllm` needs are now at the repository root. The legacy AX630C demo binaries and helper scripts are kept for compatibility.

## How To Use

Download all files from this repository to the device. The repository root can be used in two ways:

1. Directly with `axllm`
2. With the original AX630C demo binaries plus the tokenizer service

## Direct Inference with `axllm`

> Note: the `axllm` inference flow is still being restructured, so the instructions and scripts below are expected to change. Check back for updates.

### Installation

Option 1: Clone the repository and run the install script:

```shell
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
```

Option 2: Install with a single command (default branch `axllm`):

```shell
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
```

Option 3: Download the executable exported by GitHub Actions CI.

If you do not have a build environment, download the latest CI-exported `axllm` from
`https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm`, then run:

```shell
chmod +x axllm
sudo mv axllm /usr/bin/axllm
```

### Download Model from Hugging Face

```shell
mkdir -p AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4-AX630C-P320-CTX448
cd AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4-AX630C-P320-CTX448
hf download AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4-AX630C-P320-CTX448 --local-dir .
```

### Run

```shell
cd AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4-AX630C-P320-CTX448
axllm run .
```

Common interaction examples (the prompts are Chinese for "Who are you?", "Describe this image", and "Describe this video"):

```text
prompt >> 你是谁
image >>
```

```text
prompt >> 描述这张图片
image >> ./images/demo.jpg
```

```text
prompt >> 描述这个视频
image >> video:./video
```

### Serve

```shell
# Start the server on port 8000
axllm serve . --port 8000
# From another shell, query the health and model-list endpoints
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
```

### Default Encoder Selection

`config.json` defaults to `Qwen3-VL-2B-Instruct_vision_u8_320_ax630c.axmodel`, which matches the `P320` repository variant.

To use the 384 encoder with `axllm`, set `filename_image_encoder_axmodel` in `config.json` to:

```json
"Qwen3-VL-2B-Instruct_vision_u8_384_ax630c.axmodel"
```

`axllm` infers the actual vision input size from the encoder model, so no extra width or height changes are required in `config.json`.

## Legacy AX630C Demo Flow

The original demo entry point is still available. After the repository was flattened, the scripts were adjusted to load model files from the repository root, so they no longer depend on the old nested model directory.

### Install Python Dependencies

```shell
pip install -r requirements.txt
```

### Prepare Tokenizer Server

```shell
python3 qwen3_tokenizer.py --port 8080
```

### Image Understanding Demo

#### Input text

```text
描述这张图片
```

(Chinese for "Describe this image".)

#### Input image

![](./images/recoAll_attractions_1.jpg)

Run:

```shell
bash run_image_ax630c.sh
```

Sample output preserved from the original repository:

```text
root@AX630C ~/Qwen3-VL-2B-Instruct-GPTQ-Int4 # bash run_image_ax630c.sh
[I][ Init][ 156]: LLM init start
[I][ Init][ 158]: Total CMM:4353 MB
[I][ Init][ 34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151655
3% | ██ | 1 / 31 [0.01s<0.46s, 66.67 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap
6% | ███ | 2 / 31 [0.02s<0.34s, 90.91 count/s] embed_selector init ok[I][ Init][ 201]: attr.axmodel_num:28
103% | ██████████████████████████████████ | 32 / 31 [34.03s<32.96s, 0.94 count/s] init vpm axmodel ok,remain_cmm(854 MB)[I][ Init][ 266]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652
[I][ Init][ 309]: image encoder output float32
[I][ Init][ 339]: max_token_len : 2047
[I][ Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047
[I][ Init][ 352]: prefill_token_num : 128
[I][ Init][ 356]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 356]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 356]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 356]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 356]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 356]: grp: 6, prefill_max_token_num : 640
[I][ Init][ 356]: grp: 7, prefill_max_token_num : 768
[I][ Init][ 356]: grp: 8, prefill_max_token_num : 896
[I][ Init][ 356]: grp: 9, prefill_max_token_num : 1024
[I][ Init][ 356]: grp: 10, prefill_max_token_num : 1152
[I][ Init][ 360]: prefill_max_token_num : 1152
[I][ Init][ 372]: LLM init ok
[I][ Init][ 374]: Left CMM:854 MB
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这张图片
image >> images/recoAll_attractions_1.jpg
[I][ EncodeImage][ 440]: pixel_values size 1
[I][ EncodeImage][ 441]: grid_h 24 grid_w 24
[I][ EncodeImage][ 489]: image encode time : 237.778000 ms, size : 1
[I][ Encode][ 532]: input_ids size:168
[I][ Encode][ 540]: offset 15
[I][ Encode][ 569]: img_embed.size:1, 294912
[I][ Encode][ 583]: out_embed size:344064
[I][ Encode][ 584]: input_ids size 168
[I][ Encode][ 586]: position_ids size:168
[I][ Run][ 607]: input token num : 168, prefill_split_num : 2
[I][ Run][ 641]: input_num_token:128
[I][ Run][ 641]: input_num_token:40
[I][ Run][ 865]: ttft: 313.60 ms
这是一张在埃及沙漠中拍摄的风景照片。画面中,三座巨大的金字塔在晴朗的天空下矗立,它们是古埃及文明的象征。这些金字塔由巨大的石块堆叠而成,表面因岁月侵蚀而显得斑驳。在金字塔的前方,有几个人影在沙地上行走,这为整个场景提供了比例感和尺度感。整个场景充满了历史的厚重感和神秘的氛围。

[N][ Run][ 992]: hit eos,avg 14.14 token/s
```

The reply, in Chinese, describes three large pyramids in the Egyptian desert under a clear sky, with a few people walking across the sand in front of them.

For the 384 image encoder variant:

```shell
bash run_image_ax630c_384.sh
```

### Video Understanding Demo

#### Input text

```text
描述这个视频
```

(Chinese for "Describe this video".)

#### Input video

```text
./video
```

Run:

```shell
bash run_video_ax630c.sh
```

Sample output preserved
from the original repository:

```text
root@AX630C ~/Qwen3-VL-2B-Instruct-GPTQ-Int4 # bash run_video_ax630c.sh
[I][ Init][ 156]: LLM init start
[I][ Init][ 158]: Total CMM:7884 MB
[I][ Init][ 34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151656
3% | ██ | 1 / 31 [0.01s<0.34s, 90.91 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap
6% | ███ | 2 / 31 [0.01s<0.23s, 133.33 count/s] embed_selector init ok[I][ Init][ 201]: attr.axmodel_num:28
103% | ██████████████████████████████████ | 32 / 31 [32.37s<31.36s, 0.99 count/s] init vpm axmodel ok,remain_cmm(4385 MB)[I][ Init][ 266]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652
[I][ Init][ 309]: image encoder output float32
[I][ Init][ 339]: max_token_len : 2047
[I][ Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047
[I][ Init][ 352]: prefill_token_num : 128
[I][ Init][ 356]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 356]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 356]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 356]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 356]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 356]: grp: 6, prefill_max_token_num : 640
[I][ Init][ 356]: grp: 7, prefill_max_token_num : 768
[I][ Init][ 356]: grp: 8, prefill_max_token_num : 896
[I][ Init][ 356]: grp: 9, prefill_max_token_num : 1024
[I][ Init][ 356]: grp: 10, prefill_max_token_num : 1152
[I][ Init][ 360]: prefill_max_token_num : 1152
[I][ Init][ 372]: LLM init ok
[I][ Init][ 374]: Left CMM:4385 MB
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这个视频
video >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][ EncodeImage][ 440]: pixel_values size 4
[I][ EncodeImage][ 441]: grid_h 24 grid_w 24
[I][ EncodeImage][ 489]: image encode time : 751.481018 ms, size : 4
[I][ Encode][ 532]: input_ids size:600
[I][ Encode][ 540]: offset 15
[I][ Encode][ 569]: img_embed.size:4, 294912
[I][ Encode][ 574]: offset:159
[I][ Encode][ 574]: offset:303
[I][ Encode][ 574]: offset:447
[I][ Encode][ 583]: out_embed size:1228800
[I][ Encode][ 584]: input_ids size 600
[I][ Encode][ 586]: position_ids size:600
[I][ Run][ 607]: input token num : 600, prefill_split_num : 5
[I][ Run][ 641]: input_num_token:128
[I][ Run][ 641]: input_num_token:128
[I][ Run][ 641]: input_num_token:128
[I][ Run][ 641]: input_num_token:128
[I][ Run][ 641]: input_num_token:88
[I][ Run][ 865]: ttft: 843.36 ms
这是一段关于两只山地旱獭(也称“山地土拨鼠”)在山地环境中互动的视频。

在画面中,两只山地旱獭正站在布满碎石的山坡上,背景是连绵起伏的山脉和蓝天。它们的毛色以灰、棕、黑相间,脸部和耳朵周围有明显的黑白条纹,显得非常可爱。

这两只旱獭正在进行一场激烈的“拳击”或“格斗”游戏。它们的前爪高高举起,像在互相击打,但它们的姿势和动作表明它们可能是在进行一场激烈的“拳击”或“格斗”游戏。它们的嘴巴和前爪在空中挥舞,似乎在互相攻击或展示力量。

整个场景充满了动感和活力,展现了这些小动物在自然环境中充满活力和趣味的一面。

[N][ Run][ 992]: hit eos,avg 14.16 token/s
```

The reply, in Chinese, describes two marmots sparring with raised front paws on a rocky mountain slope, with mountains and blue sky in the background.

For the 384 video encoder variant:

```shell
bash run_video_ax630c_384.sh
```

### Gradio Demo

Start the tokenizer server for the demo. The current `run_ax_api.sh` defaults to `http://127.0.0.1:8080`:

```shell
python3 qwen3_tokenizer.py --port 8080 --host 0.0.0.0
```

Start the OpenAI-style API server:

```shell
bash run_ax_api.sh
```

If the tokenizer server is not running on the same machine, update the tokenizer server address in `run_ax_api.sh`.

Start the Gradio UI:

```shell
python3 gradio_demo.py
```

If the API server is not running on the same machine, update the API URL in the Gradio web UI.

![image](https://cdn-uploads.huggingface.co/production/uploads/64b7837c17570fdff9b906b9/Og9fPNi0chg768gicse7M.png)
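
### Reading the Sample Logs

The numbers in the image and video sample logs above are consistent with each other, and a few lines of arithmetic show how. The sketch below is only a cross-check of the logged values, not the runtime's actual code. The hidden size of 2048 and the 2x2 patch merge are assumptions inferred from the logs (344064 / 168 and a 24x24 grid yielding 144 tokens per image), and the names `prefill_chunks` and `tokens_per_image` are hypothetical helpers introduced for illustration.

```python
# Values copied from the sample logs; HIDDEN_SIZE and MERGE are assumptions.
PREFILL_CHUNK = 128   # "prefill_token_num : 128"
HIDDEN_SIZE = 2048    # assumption: 344064 / 168 from the image log
MERGE = 2             # assumption: 2x2 patch merge, so a 24x24 grid gives 144 tokens
FIRST_OFFSET = 15     # "offset 15": position of the first image token


def prefill_chunks(n_tokens, chunk=PREFILL_CHUNK):
    """Split n_tokens into full chunks plus one shorter tail chunk."""
    full, rest = divmod(n_tokens, chunk)
    return [chunk] * full + ([rest] if rest else [])


def tokens_per_image(grid_h, grid_w, merge=MERGE):
    """Number of embedding tokens one encoded image contributes."""
    return (grid_h // merge) * (grid_w // merge)


per_image = tokens_per_image(24, 24)
image_chunks = prefill_chunks(168)
video_chunks = prefill_chunks(600)
video_offsets = [FIRST_OFFSET + i * per_image for i in range(4)]

print(per_image)                 # 144
print(per_image * HIDDEN_SIZE)   # 294912, the img_embed size in the logs
print(image_chunks)              # [128, 40]
print(video_chunks)              # [128, 128, 128, 128, 88]
print(video_offsets)             # [15, 159, 303, 447]
print(600 * HIDDEN_SIZE)         # 1228800, the video out_embed size
```

Under these assumptions, the 168-token image prompt is 144 image tokens plus 24 text tokens, and the 600-token video prompt is 4 x 144 image tokens plus the same 24 text tokens, which matches `prefill_split_num` of 2 and 5 in the logs.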