Commit 7c22b90 (verified), committed by drizzlezyk
1 Parent(s): 54dd322

Upload inference/README_CN.md with huggingface_hub

Files changed (1)
  1. inference/README_CN.md +130 -0
inference/README_CN.md ADDED
@@ -0,0 +1,130 @@
+ ## Deployment Guide for openPangu-R-7B-2512 on [vllm-ascend](https://github.com/vllm-project/vllm-ascend)
+
+ ### Deployment Environment
+
+ openPangu-R-7B-2512 can be deployed on Atlas 800T A2 (64GB).
+
+ ### Building and Starting the A2 Image
+
+ Pull the base image:
+
+ ```
+ docker pull quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11
+ ```
+
+ Build the image from the Dockerfile:
+
+ ```
+ IMAGE=quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11-vllm0.11
+ docker build -t $IMAGE -f ./Dockerfile .
+ ```
+
+ Start the container:
+
+ ```
+ export IMAGE=quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11-vllm0.11 # Use the correct image ID
+ export NAME=XXX # Custom container name
+
+ # Run the container using the variables defined above.
+ # Note: if Docker runs on a bridge network, expose the ports needed for multi-node communication in advance.
+ # To prevent device interference from other Docker containers, add the argument "--privileged".
+ docker run -itd \
+     --privileged \
+     --ipc=host \
+     --name $NAME \
+     --network host \
+     --device /dev/davinci0 \
+     --device /dev/davinci1 \
+     --device /dev/davinci2 \
+     --device /dev/davinci3 \
+     --device /dev/davinci4 \
+     --device /dev/davinci5 \
+     --device /dev/davinci6 \
+     --device /dev/davinci7 \
+     --device /dev/davinci_manager \
+     --device /dev/devmm_svm \
+     --device /dev/hisi_hdc \
+     -v /usr/local/dcmi:/usr/local/dcmi \
+     -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
+     -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+     -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+     -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+     -v /etc/ascend_install.info:/etc/ascend_install.info \
+     -v /mnt/:/mnt/ \
+     -v /data:/data \
+     -v /home/work:/home/work \
+     --entrypoint /bin/bash \
+     $IMAGE
+ ```
+
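The eight `--device /dev/davinciN` flags in the command above follow a regular pattern, so they can be generated instead of typed by hand. The sketch below is one possible way to do that. The helper name `build_device_args` is hypothetical and not part of this project, and it assumes the NPUs are numbered consecutively from 0.

```shell
# Hypothetical helper (not part of this repository): print one
# "--device /dev/davinciN" flag per NPU, for N = 0 .. count-1.
build_device_args() {
    count="${1:-8}"   # number of NPUs to expose; defaults to 8
    i=0
    args=""
    while [ "$i" -lt "$count" ]; do
        args="$args --device /dev/davinci$i"
        i=$((i + 1))
    done
    # Strip the leading space added by the first iteration.
    printf '%s\n' "${args# }"
}

DEVICE_ARGS="$(build_device_args 8)"
echo "$DEVICE_ARGS"
```

The result would be spliced into the command unquoted, for example `docker run -itd $DEVICE_ARGS ... $IMAGE`, so that the shell splits it back into separate arguments.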
+ Make sure the model weights and this project's code are accessible inside the container. If you are not yet inside the container, enter it as the root user, then install the dependencies and custom operators:
+
+ ```
+ docker exec -itu root $NAME /bin/bash
+ # The remaining commands run inside the container.
+ cd inference
+ pip install -r requirements.txt
+ bash ./cann910B-omni_inference_custom_ops-0.7.0-8.3.RC1-linux-aarch64.run --install-path=/usr/local/Ascend/ascend-toolkit/latest/opp
+ source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/omni_custom_ops/bin/set_env.bash
+ pip install omni_inference_ascendc_custom_ops-0.7.0+8.3.rc1.pta2.7.1-cp311-cp311-linux_aarch64.whl --force-reinstall
+ ```
+
+ ### openPangu-R-7B-2512 Inference
+
+ Launch script: inference/launch.sh
+
+ Run:
+
+ ```
+ export LOAD_CKPT_DIR=XXX/checkpoint/ # Path to the pangu_7b bf16 weights
+ bash inference/launch.sh
+ ```
+
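Because the weights must be visible inside the container, it can help to verify the checkpoint path before launching. The following is a minimal sketch of such a check. `check_ckpt_dir` is a hypothetical helper, not part of `launch.sh`, and it only tests that the directory exists and is non-empty; it does not validate the weight files themselves.

```shell
# Hypothetical pre-flight check (not part of launch.sh): succeed only if
# the given path is an existing, non-empty directory.
check_ckpt_dir() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "checkpoint directory is not set" >&2
        return 1
    fi
    if [ ! -d "$dir" ]; then
        echo "checkpoint directory not found: $dir" >&2
        return 1
    fi
    if [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "checkpoint directory is empty: $dir" >&2
        return 1
    fi
    return 0
}

# Example: report the result without aborting the current shell.
if check_ckpt_dir "${LOAD_CKPT_DIR:-}"; then
    echo "checkpoint directory looks usable"
else
    echo "fix LOAD_CKPT_DIR before running inference/launch.sh"
fi
```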
+ Example launch script:
+
+ ```
+ # HOST=127.0.0.1 (localhost) makes the server reachable only from the host machine itself.
+ # HOST=0.0.0.0 allows access to the vLLM server from other devices on the same network, or even from the internet, provided the network is configured accordingly (e.g., firewall rules, port forwarding).
+ HOST=xxx.xxx.xxx.xxx
+
+ # LOCAL_CKPT_DIR is the weight directory. The run command above exports LOAD_CKPT_DIR, so check which name your copy of launch.sh reads.
+ python $SCRIPT_DIR/vllm_register.py \
+     --model $LOCAL_CKPT_DIR \
+     --served-model-name ${SERVED_MODEL_NAME:=pangu_7b} \
+     --tensor-parallel-size ${TENSOR_PARALLEL_SIZE:=8} \
+     --trust-remote-code \
+     --host $HOST \
+     --port ${PORT:=8000} \
+     --max-num-seqs ${MAX_NUM_SEQS:=256} \
+     --max-model-len ${MAX_MODEL_LEN:=40960} \
+     --tokenizer-mode "slow" \
+     --dtype bfloat16 \
+     --enable-log-requests \
+     --distributed-executor-backend mp \
+     --gpu-memory-utilization 0.9 \
+     --max-num-batched-tokens ${MAX_NUM_BATCHED_TOKENS:=4096} \
+     --no-enable-prefix-caching \
+     --enforce-eager \
+     --reasoning-parser pangu
+ ```
+
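The launch script relies on the shell's `${VAR:=default}` expansion, which lets each option be overridden from the environment. The snippet below is a small standalone illustration of that behaviour, independent of vLLM; the default values are the ones shown in the script excerpt, and the override value 32768 is made up for the example.

```shell
# ${VAR:=default} expands to VAR if it is set and non-empty;
# otherwise it assigns the default to VAR and expands to that.
unset PORT
MAX_MODEL_LEN=32768   # pretend the caller supplied an override

echo "port: ${PORT:=8000}"                     # prints "port: 8000"
echo "max model len: ${MAX_MODEL_LEN:=40960}"  # prints "max model len: 32768"
```

Assuming `launch.sh` matches the excerpt, a call such as `PORT=9000 TENSOR_PARALLEL_SIZE=4 bash inference/launch.sh` would therefore override those two options and leave the rest at their defaults.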
+ ### Sending a Test Request
+
+ Once the service is up, send a test request:
+
+ ```
+ MASTER_NODE_IP=xxx.xxx.xxx.xxx # IP of the server node
+ PORT=8000 # Must match the port used by the launch script
+ SERVED_MODEL_NAME=pangu_7b # Must match --served-model-name
+ curl http://${MASTER_NODE_IP}:${PORT}/v1/chat/completions \
+     -H "Content-Type: application/json" \
+     -d '{
+         "model": "'"$SERVED_MODEL_NAME"'",
+         "messages": [
+             {
+                 "role": "user",
+                 "content": "Who are you?"
+             }
+         ],
+         "max_tokens": 512,
+         "temperature": 0
+     }'
+ ```
+
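The inline JSON in the request above depends on careful quote juggling to splice `$SERVED_MODEL_NAME` into a single-quoted string. As an alternative sketch, the body can be assembled by a small function first. `build_payload` is a hypothetical helper, and it assumes the model name and prompt contain no double quotes or backslashes, since it performs no JSON escaping.

```shell
# Hypothetical helper: build a chat-completions request body.
#   $1: served model name
#   $2: user prompt (no double quotes or backslashes; nothing is escaped)
build_payload() {
    printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": 512, "temperature": 0}' "$1" "$2"
}

payload="$(build_payload "${SERVED_MODEL_NAME:-pangu_7b}" "Who are you?")"
echo "$payload"
```

The payload can then be sent to the same endpoint with `curl ... -H "Content-Type: application/json" -d "$payload"`.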