Xiao-AMD commited on
Commit
a064c3b
·
verified ·
1 Parent(s): 0592f74

Delete docs

Browse files
Files changed (1) hide show
  1. docs/deploy_guidance.md +0 -93
docs/deploy_guidance.md DELETED
@@ -1,93 +0,0 @@
1
- # Kimi-K2.7-Code Deployment Guide
2
-
3
- > [!Note]
4
- > This guide only provides some examples of deployment commands for Kimi-K2.7-Code, which may not be the optimal configuration. Since inference engines are still being updated frequently, please continue to follow the guidance from their homepage if you want to achieve better inference performance.
5
-
6
- > [!Note]
7
- > Kimi-K2.7-Code has the same architecture as Kimi-K2.5/Kimi-K2.6, and the deployment method can be directly reused.
8
- ## vLLM Deployment
9
-
10
- You can refer to https://recipes.vllm.ai/moonshotai/Kimi-K2.6 for the newest deployment guide.
11
-
12
- This model is available in nightly vLLM wheel:
13
- ```
14
- uv pip install -U vllm \
15
- --torch-backend=auto \
16
- --extra-index-url https://wheels.vllm.ai/nightly
17
- ```
18
-
19
- Nightly wheels may be unstable and are considered experimental. For stable production use, we recommend vLLM 0.19.1, which has been manually verified.
20
-
21
- Here is the example to serve this model on a H200 single node with TP8 via vLLM:
22
- ```bash
23
- vllm serve $MODEL_PATH -tp 8 --mm-encoder-tp-mode data --trust-remote-code --tool-call-parser kimi_k2 --reasoning-parser kimi_k2
24
- ```
25
- **Key notes**
26
- - `--tool-call-parser kimi_k2`: Required for enabling tool calling
27
- - `--reasoning-parser kimi_k2`: Kimi-K2.7-Code supports thinking mode only. Make sure to pass this for correct reasoning processing.
28
-
29
- ## SGLang Deployment
30
-
31
- You can refer to https://cookbook.sglang.io/autoregressive/Moonshotai/Kimi-K2.6 for the newest deployment guide.
32
-
33
- This model is supported in SGLang v0.5.10 and later stable releases (no nightly / main build required). `uv` is preferred:
34
-
35
- ```
36
- uv pip install "sglang>=0.5.10.post1" --prerelease=allow
37
- ```
38
-
39
- Here is the example for it to run with TP8 on H200 in a single node via SGLang:
40
- ``` bash
41
- sglang serve --model-path $MODEL_PATH --tp 8 --trust-remote-code --tool-call-parser kimi_k2 --reasoning-parser kimi_k2
42
- ```
43
- **Key parameter notes:**
44
- - `--tool-call-parser kimi_k2`: Required when enabling tool usage.
45
- - `--reasoning-parser kimi_k2`: Required for correctly processing reasoning content.
46
-
47
- ## KTransformers Deployment
48
- ### KTransformers+SGLang Inference Deployment
49
- Launch with KTransformers + SGLang for CPU+GPU heterogeneous inference:
50
-
51
- ```
52
- python -m sglang.launch_server \
53
- --host 0.0.0.0 \
54
- --port 31245 \
55
- --model /path/to/kimi-k2.7-code \
56
- --kt-weight-path /path/to/kimi-k2.7-code \
57
- --kt-cpuinfer 96 \
58
- --kt-threadpool-count 2 \
59
- --kt-num-gpu-experts 30 \
60
- --kt-method RAWINT4 \
61
- --kt-gpu-prefill-token-threshold 400 \
62
- --trust-remote-code \
63
- --mem-fraction-static 0.94 \
64
- --served-model-name Kimi-K2.7-Code \
65
- --enable-mixed-chunk \
66
- --tensor-parallel-size 4 \
67
- --enable-p2p-check \
68
- --disable-shared-experts-fusion \
69
- --chunked-prefill-size 32658 \
70
- --max-total-tokens 50000 \
71
- --attention-backend flashinfer
72
- ```
73
-
74
- Achieves 640.12 tokens/s Prefill and 24.51 tokens/s Decode (48-way concurrency) on 8× NVIDIA L20 + 2× Intel 6454S.
75
-
76
- More details: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/Kimi-K2.5.md .
77
-
78
- ### KTransformers+LLaMA-Factory Fine-tuning Deployment
79
-
80
- You can use below command to run LoRA SFT with KT+llamafactory.
81
-
82
- ```
83
- # For LoRA SFT
84
- USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml
85
- # For Chat with model after LoRA SFT
86
- llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml
87
- # For API with model after LoRA SFT
88
- llamafactory-cli api examples/inference/kimik2_lora_sft_kt.yaml
89
- ```
90
-
91
- This achieves end-to-end LoRA SFT Throughput: 44.55 token/s on 2× NVIDIA 4090 + Intel 8488C with 1.97T RAM and 200G swap memory.
92
-
93
- More details refer to https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/SFT_Installation_Guide_KimiK2.5.md .