drizzlezyk committed on
Commit
77d7b3a
·
verified ·
1 Parent(s): 464f0f4

Upload inference/vllm_ascend_for_openpangu_embedded_1b.md with huggingface_hub

inference/vllm_ascend_for_openpangu_embedded_1b.md ADDED
@@ -0,0 +1,124 @@
1
+ ## Deployment Guide of openPangu Embedded 1B Based on [vllm-ascend](https://github.com/vllm-project/vllm-ascend)
2
+
3
+ ### Deployment Environment Description
4
+
5
+ The Atlas 800T A2 (64 GB) can serve openPangu Embedded 1B (bf16) on a single card. This guide uses the vllm-ascend community image v0.9.1-dev; pull it on every node you deploy to.
6
+ ```bash
7
+ docker pull quay.io/ascend/vllm-ascend:v0.9.1-dev
8
+ ```
9
+
10
+ ### Docker Boot and Inference Code
11
+
12
+ Perform the following operations on all nodes.
13
+
14
+ Run the following command to start the Docker container:
15
+ ```bash
16
+ # Update the vllm-ascend image
17
+ export IMAGE=quay.io/ascend/vllm-ascend:v0.9.1-dev # Use correct image id
18
+ export NAME=vllm-ascend # Custom docker name
19
+
20
+ # Run the container using the defined variables
21
+ # Note: if Docker runs with a bridge network, expose the ports needed for multi-node communication in advance
22
+ # To prevent device interference from other Docker containers, add the argument "--privileged"
23
+ docker run --rm \
24
+ --name $NAME \
25
+ --network host \
26
+ --device /dev/davinci0 \
27
+ --device /dev/davinci1 \
28
+ --device /dev/davinci2 \
29
+ --device /dev/davinci3 \
30
+ --device /dev/davinci4 \
31
+ --device /dev/davinci5 \
32
+ --device /dev/davinci6 \
33
+ --device /dev/davinci7 \
34
+ --device /dev/davinci_manager \
35
+ --device /dev/devmm_svm \
36
+ --device /dev/hisi_hdc \
37
+ -v /usr/local/dcmi:/usr/local/dcmi \
38
+ -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
39
+ -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
40
+ -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
41
+ -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
42
+ -v /etc/ascend_install.info:/etc/ascend_install.info \
43
+ -v /mnt/sfs_turbo/.cache:/root/.cache \
44
+ -it $IMAGE bash
45
+ ```
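The eight `--device /dev/davinciN` flags above can also be generated rather than typed by hand. The following is a minimal sketch, not part of the original guide: `davinci_device_args` is a hypothetical helper name, and the optional directory argument exists only so the function can be tried on a machine without NPUs.

```shell
# Hypothetical helper: print one "--device <path>" flag for every davinciN
# device node found in a directory (defaults to /dev).
# Control nodes such as davinci_manager are not matched by the pattern.
davinci_device_args() {
  local dev_dir="${1:-/dev}"
  local dev
  for dev in "$dev_dir"/davinci[0-9]*; do
    # When nothing matches, the pattern stays literal; skip it.
    [ -e "$dev" ] || continue
    printf '%s %s ' --device "$dev"
  done
}

# Possible usage (assumes device paths contain no spaces):
#   docker run --rm --name $NAME --network host \
#     $(davinci_device_args) \
#     --device /dev/davinci_manager ... -it $IMAGE bash
```

The remaining flags and volume mounts stay exactly as in the command above; only the per-card device list is replaced.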
46
+ If not inside the container, enter the container as the root user:
47
+ ```bash
48
+ docker exec -itu root $NAME /bin/bash
49
+ ```
50
+
51
+ Install vllm (v0.9.2) with pip to replace the vllm build that ships with the image.
52
+ ```bash
53
+ pip install --no-deps vllm==0.9.2 pybase64==1.4.1
54
+ ```
55
+
56
+ Download [vllm-ascend (v0.9.2rc1)](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.9.2rc1) and replace the built-in vllm-ascend code in the image (/vllm-workspace/vllm-ascend/). For example, download [Source code (tar.gz)](https://github.com/vllm-project/vllm-ascend/archive/refs/tags/v0.9.2rc1.tar.gz) from Assets, then extract it over the built-in code. The command below expects the archive to be named `vllm-ascend-0.9.2rc1.tar.gz`; if your download tool saved it under a different name (for example `v0.9.2rc1.tar.gz`), use that name instead:
57
+
58
+ ```bash
59
+ tar -zxvf vllm-ascend-0.9.2rc1.tar.gz -C /vllm-workspace/vllm-ascend/ --strip-components=1
60
+ export PYTHONPATH=/vllm-workspace/vllm-ascend/:${PYTHONPATH}
61
+ ```
62
+
63
+ Use the Pangu model-adapted vllm-ascend code from the current repository to replace parts of the code in `/vllm-workspace/vllm-ascend/vllm_ascend/`:
64
+
65
+ ```bash
66
+ yes | cp -r inference/vllm_ascend/* /vllm-workspace/vllm-ascend/vllm_ascend/
67
+ ```
68
+
69
+ ### openPangu Embedded Inference
70
+
71
+ Perform the following operations on all nodes.
72
+
73
+ Configuration:
74
+ ```bash
75
+ export VLLM_USE_V1=1
76
+ # Specifying HOST=127.0.0.1 (localhost) means the server can only be accessed from the node it runs on.
77
+ # Specifying HOST=0.0.0.0 allows the vLLM server to be accessed from other devices on the same network or even from the internet, provided proper network configuration (e.g., firewall rules, port forwarding) is in place.
78
+ HOST=xxx.xxx.xxx.xxx
79
+ PORT=8080
80
+ ```
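Because `HOST` is shown as a placeholder above, it can help to check both values before launching the server. This is a minimal sketch under that assumption; `check_host_port` is a hypothetical helper and is not part of vLLM or of this repository.

```shell
# Hypothetical helper: fail while HOST is empty or still the "xxx" placeholder,
# or while PORT is not an integer in the range 1-65535.
check_host_port() {
  local host="$1" port="$2"
  case "$host" in
    ""|*xxx*)
      echo "HOST is unset or still a placeholder: '$host'" >&2
      return 1 ;;
  esac
  case "$port" in
    ""|*[!0-9]*)
      echo "PORT must be a number: '$port'" >&2
      return 1 ;;
  esac
  if [ "$port" -lt 1 ] || [ "$port" -gt 65535 ]; then
    echo "PORT out of range (1-65535): $port" >&2
    return 1
  fi
  return 0
}

# Possible usage before starting the server:
#   check_host_port "$HOST" "$PORT" || exit 1
```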
81
+
82
+ Command to launch openPangu Embedded 1B:
83
+ ```bash
84
+ export ASCEND_RT_VISIBLE_DEVICES=0
85
+ LOCAL_CKPT_DIR=/root/.cache/pangu_embedded_1b # The pangu_embedded_1b bf16 weight
86
+ SERVED_MODEL_NAME=pangu_embedded_1b
87
+
88
+ vllm serve $LOCAL_CKPT_DIR \
89
+ --served-model-name $SERVED_MODEL_NAME \
90
+ --tensor-parallel-size 1 \
91
+ --trust-remote-code \
92
+ --host $HOST \
93
+ --port $PORT \
94
+ --max-num-seqs 32 \
95
+ --max-model-len 32768 \
96
+ --max-num-batched-tokens 4096 \
97
+ --tokenizer-mode "slow" \
98
+ --dtype bfloat16 \
99
+ --distributed-executor-backend mp \
100
+ --gpu-memory-utilization 0.93 \
101
+ --no-enable-prefix-caching \
102
+ --no-enable-chunked-prefill
103
+ ```
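Loading the weights takes some time, so a client may want to wait until the server answers before sending requests. The sketch below polls the `/health` route that vLLM's OpenAI-compatible server exposes; `wait_for_server` is a hypothetical helper, and the default timeout of 300 seconds is an assumption rather than a measured value.

```shell
# Hypothetical helper: poll http://HOST:PORT/health once per second until it
# responds successfully or the timeout (in seconds) runs out.
wait_for_server() {
  local host="$1" port="$2" timeout_s="${3:-300}"
  local waited=0
  while [ "$waited" -lt "$timeout_s" ]; do
    # -s: silent, -f: treat HTTP errors as failure, --max-time: per-try limit.
    if curl -sf --max-time 2 "http://${host}:${port}/health" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}

# Possible usage from a second shell on the same node:
#   wait_for_server 127.0.0.1 8080 600 && echo "server is ready"
```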
104
+
105
+ ### Test Request
106
+
107
+ After the server has launched, send a test request from the master node or any other node:
108
+
109
+ ```bash
110
+ MASTER_NODE_IP=xxx.xxx.xxx.xxx # server node ip
111
+ curl http://${MASTER_NODE_IP}:${PORT}/v1/chat/completions \
112
+ -H "Content-Type: application/json" \
113
+ -d '{
114
+ "model": "'$SERVED_MODEL_NAME'",
115
+ "messages": [
116
+ {
117
+ "role": "user",
118
+ "content": "Who are you?"
119
+ }
120
+ ],
121
+ "max_tokens": 512,
122
+ "temperature": 0
123
+ }'
124
+ ```
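The request above returns a JSON document in the OpenAI-compatible chat completions format, where the generated text sits at `choices[0].message.content`. A minimal sketch for printing only that text follows; `extract_reply` is a hypothetical helper, and it assumes `python3` is available on the machine that sends the request.

```shell
# Hypothetical helper: read a chat completions response on stdin and print
# only the assistant's reply text.
extract_reply() {
  python3 -c 'import json, sys
response = json.load(sys.stdin)
print(response["choices"][0]["message"]["content"])'
}

# Possible usage: append "| extract_reply" to the curl command above, e.g.
#   curl -s http://${MASTER_NODE_IP}:${PORT}/v1/chat/completions ... | extract_reply
```

Adding `-s` to `curl` keeps its progress output from being mixed into the printed reply.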