--- library_name: transformers license: bsd-3-clause base_model: - tencent/HY-MT1.5-1.8B tags: - HY-MT1.5 - HY-MT1.5-1.8B - HY-MT1.5-1.8B_GPTQ_INT4 - Int4 - translation language: - zh - en - fr - pt - es - ja - tr - ru - ar - ko - th - it - de - vi - ms - id - tl - hi - pl - cs - nl - km - my - fa - gu - ur - te - mr - he - bn - ta - uk - bo - kk - mn - ug --- # HY-MT1.5-1.8B_GPTQ_INT4-AX620E This version of HY-MT1.5-1.8B_GPTQ_INT4 has been converted to run on the Axera NPU using **w4a16** quantization. This model has been optimized with the following LoRA: Compatible with Pulsar2 version: > 5.1-patch1-dirty. Please note that the context of the model is 2k and the maximum prefill length is 1k. ## Convert tools links: For those who are interested in model conversion, you can try to export axmodel through the original repo: https://huggingface.co/tencent/HY-MT1.5-1.8B [How to Convert LLM from Huggingface to axmodel](https://github.com/AXERA-TECH/HY-MT1.5-1.8B_GPTQ_INT4.axera/tree/main/model_convert) [AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/ax-llm/tree/ax-internvl) [AXera NPU AXCL LLM Runtime](https://github.com/AXERA-TECH/ax-llm/tree/axcl-internvl) ## Support Platform - AX620E - AX620E DEMO Board |Chips|ttft|w4a16| |--|--|--| |AX620E| 11538.6 ms (512 prefill) | 4.05 tokens/sec| ## How to use Download all files from this repository to the device ```sh $ tree -L 1 . ├── assets ├── config.json ├── gradio_demo.py ├── hymt1-5_1k_ax620e_axmodel ├── hymt1-5_tokenizer ├── infer_axmodel.py ├── infer_torch.py ├── README.md └── utils 5 directories, 5 files ``` ### Install transformer ``` pip install transformers==4.57.1 ``` ### Inference with AX620E Demo Board Start the OpenAI-compatible API with `axllm serve`: ```sh axllm serve . --port 8000 ``` 本仓库也附带一个 aarch64 `axllm` 二进制,可直接在本仓库目录下尝试运行: ```sh chmod +x ./bin/axllm ./bin/axllm serve . --port 8000 ``` 该二进制与 AX650 仓库中的打包产物同源,来源和校验信息记录在 `bin/axllm.version.json` 中。当前已完成 AX650 上的 HY-MT OpenAI API 验证,AX620E 板端请结合实机环境继续确认。 Interactive translation using the `C++ Gradio Demo`: ```sh python3 gradio_cpp_backend.py --api_base http://127.0.0.1:8000 --model AXERA-TECH/HY-MT1.5-1.8B_GPTQ_INT4-AX620E ``` English Translate to Chinese: ![demo_1](assets/gradio_cpp_demo_0.png) Chinese Translate to Japanese: ![demo_2](assets/gradio_cpp_demo_1.png) If you want to run translation tasks from the command-line terminal, you can run the following command: ```sh $ ./run_hymt1-5_1.8b_ax620e.sh [I][ Init][ 267]: LLM init ok [I][ Init][ 269]: Left CMM:3711 MB Type "q" to exit, Ctrl+c to stop current running prompt(输入q退出) >> 今天是个好日子,适合读书和运动. [I][ Run][ 349]: input token num : 23, prefill_split_num : 1 [I][ Run][ 388]: input_num_token:23 [I][ Run][ 581]: ttft: 157.15 ms Today is a great day. It’s the perfect time to read and exercise. [N][ Run][ 719]: hit eos,avg 13.61 token/s [I][ Run][ 724]: decode profile: infer 58.079 ms/token, cache_copy 0.110, post 14.071, callback 0.018, tokens 17 ``` --- Interactive conversations using the `Python Gradio Demo`: ```bash $ python3 gradio_demo.py --axmodel_path hymt1-5_1k_ax620e_axmodel --max_seq_len 1023 ``` English Translate to Chinese: ![demo_1](assets/gradio_demo_0.png) Chinese Translate to Japanese: ![demo_2](assets/gradio_demo_1.png) --- Run the following command on the Axera board to start a chat conversation: ```sh $ python3 infer_axmodel.py -q "It’s on the house." # output Init InferenceSession: 100%|██████████████████████████████████████████████████████████| 32/32 [00:02<00:00, 14.55it/s] [INFO] Using provider: AxEngineExecutionProvider [INFO] Model type: 2 (triple core) [INFO] Compiler version: 5.1-patch1-dirty 43f8606b-dirty Model loaded successfully! slice_indices: [0] Slice prefill done: 0 answer >> 这是免费的。 ``` If you are testing on an `AX620E` demo board, run the command below: ```sh python3 gradio_demo.py --axmodel_path hymt1-5_1k_ax620e_axmodel --max_seq_len 1023 ```