HY-MT1.5-1.8B_GPTQ_INT4-AX620E

This version of HY-MT1.5-1.8B_GPTQ_INT4 has been converted to run on the Axera NPU using w4a16 quantization.

This model has been optimized with the following LoRA:

Compatible with Pulsar2 version: > 5.1-patch1-dirty.

Please note that the context of the model is 2k and the maximum prefill length is 1k.

Convert tools links:

For those who are interested in model conversion, you can try to export axmodel through the original repo:

https://huggingface.co/tencent/HY-MT1.5-1.8B

How to Convert LLM from Huggingface to axmodel

AXera NPU HOST LLM Runtime

AXera NPU AXCL LLM Runtime

Support Platform

  • AX620E
    • AX620E DEMO Board
Chips ttft w4a16
AX620E 11538.6 ms (512 prefill) 4.05 tokens/sec

How to use

Download all files from this repository to the device

$ tree -L 1
.
├── assets
├── config.json
├── gradio_demo.py
├── hymt1-5_1k_ax620e_axmodel
├── hymt1-5_tokenizer
├── infer_axmodel.py
├── infer_torch.py
├── README.md
└── utils

5 directories, 5 files

Install transformer

pip install transformers==4.57.1

Inference with AX620E Demo Board

Start the OpenAI-compatible API with axllm serve:

axllm serve . --port 8000

本仓库也附带一个 aarch64 axllm 二进制,可直接在本仓库目录下尝试运行:

chmod +x ./bin/axllm
./bin/axllm serve . --port 8000

该二进制与 AX650 仓库中的打包产物同源,来源和校验信息记录在 bin/axllm.version.json 中。当前已完成 AX650 上的 HY-MT OpenAI API 验证,AX620E 板端请结合实机环境继续确认。

Interactive translation using the C++ Gradio Demo:

python3 gradio_cpp_backend.py --api_base http://127.0.0.1:8000 --model AXERA-TECH/HY-MT1.5-1.8B_GPTQ_INT4-AX620E

English Translate to Chinese:

demo_1

Chinese Translate to Japanese:

demo_2

If you want to run translation tasks from the command-line terminal, you can run the following command:

$ ./run_hymt1-5_1.8b_ax620e.sh
[I][                            Init][ 267]: LLM init ok
[I][                            Init][ 269]: Left CMM:3711 MB
Type "q" to exit, Ctrl+c to stop current running
prompt(输入q退出) >> 今天是个好日子,适合读书和运动.
[I][                             Run][ 349]: input token num : 23, prefill_split_num : 1
[I][                             Run][ 388]: input_num_token:23
[I][                             Run][ 581]: ttft: 157.15 ms
Today is a great day. It’s the perfect time to read and exercise.

[N][                             Run][ 719]: hit eos,avg 13.61 token/s

[I][                             Run][ 724]: decode profile: infer 58.079 ms/token, cache_copy 0.110, post 14.071, callback 0.018, tokens 17

Interactive conversations using the Python Gradio Demo:

$ python3 gradio_demo.py --axmodel_path hymt1-5_1k_ax620e_axmodel --max_seq_len 1023

English Translate to Chinese:

demo_1

Chinese Translate to Japanese:

demo_2


Run the following command on the Axera board to start a chat conversation:

$ python3 infer_axmodel.py -q "It’s on the house."

# output
Init InferenceSession: 100%|██████████████████████████████████████████████████████████| 32/32 [00:02<00:00, 14.55it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 43f8606b-dirty
Model loaded successfully!
slice_indices: [0]
Slice prefill done: 0
answer >> 这是免费的。

If you are testing on an AX620E demo board, run the command below:

python3 gradio_demo.py --axmodel_path hymt1-5_1k_ax620e_axmodel --max_seq_len 1023
Downloads last month
5
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for AXERA-TECH/HY-MT1.5-1.8B_GPTQ_INT4-AX620E

Finetuned
(43)
this model