0xSero committed on
Commit 7f2e7a3 · verified · 1 Parent(s): 818e1ab

Update DGX Spark 200K serving recipe

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -15,11 +15,12 @@ tags:
  base_model: deepseek-ai/DeepSeek-V4-Flash
  ---
 
- # Deepseek-V4-Flash-162B-REAP
 
- This is the 162B / K144 REAP-pruned DeepSeek V4 Flash model. The validated single-DGX Spark serving recipe is maintained here:
 
- - GitHub: https://github.com/0xSero/deepseek-v4-flash-spark-200k
  - Docker registry target: `ghcr.io/0xsero/deepseek-v4-flash-spark-vllm:cutlass451-g27`
  - Validated local Docker image: `vllm-node-dsv4-cutlass451:latest` / `sha256:5df60ebb9c10dfb86d5946cae8244adfe65a7fd405401bd542ecf22d5c497a4a`
  - Model repo used by the recipe: `0xSero/DeepSeek-V4-Flash-162B-codex-K144-REAP`
@@ -30,7 +31,7 @@ This is the 162B / K144 REAP-pruned DeepSeek V4 Flash model. The validated singl
  Run this on the DGX Spark. `HF_TOKEN` is only required if the model repo is private or not already cached on the machine.
 
  ```bash
- HF_TOKEN=... bash -lc 'set -euo pipefail; cd /home/sero/spark; rm -rf deepseek-v4-flash-spark-200k; git clone https://github.com/0xSero/deepseek-v4-flash-spark-200k.git; cd deepseek-v4-flash-spark-200k; ./install.sh --profile k144-nospec-200k --launch'
  ```
 
  Do not commit tokens into the repo or a model card. Pass them only through the environment for the one command above.
@@ -42,7 +43,7 @@ The profile lives at `configs/k144-nospec-200k.env` in the GitHub repo.
  ```bash
  MODEL_REPO=0xSero/DeepSeek-V4-Flash-162B-codex-K144-REAP
  MODEL_REVISION=d663e8fb16809f6619000648b187b257249ed824
- SERVED_MODEL_NAME=deepseek-v4-flash-k144-g27-cutlass451
  CONTEXT_LENGTH=200000
  KV_CACHE_MEMORY_BYTES=14G
  MAX_NUM_BATCHED_TOKENS=8192
@@ -50,14 +51,13 @@ MAX_NUM_SEQS=1
  GPU_MEMORY_UTILIZATION=0.88
  WATCHDOG_MIN_AVAILABLE_KB=8388608
  KV_CACHE_DTYPE=fp8
- ENFORCE_EAGER=0
- THINKING=false
  SPECULATIVE_CONFIG=
  VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP=0
  VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1
  ```
 
- The launcher enables DeepSeek V4 tokenizer, reasoning parser, tool-call parser, prefix caching, FP8 KV, and CUDA graphs. Do not add `--enforce-eager`; this profile was validated with CUDA graph capture enabled.
 
  ## Docker runtime
 
  base_model: deepseek-ai/DeepSeek-V4-Flash
  ---
 
+ # DeepSeek-V4-Flash-Spark-Mini
 
+ This is the 162B / K144 REAP-pruned DeepSeek V4 Flash model, served as `DeepSeek-V4-Flash-Spark-Mini`. The validated single-DGX Spark serving recipe is maintained here:
 
+ - One-command Spark wrapper: https://github.com/0xSero/deepseek-spark
+ - Runtime module: https://github.com/0xSero/deepseek-v4-flash-spark-200k
  - Docker registry target: `ghcr.io/0xsero/deepseek-v4-flash-spark-vllm:cutlass451-g27`
  - Validated local Docker image: `vllm-node-dsv4-cutlass451:latest` / `sha256:5df60ebb9c10dfb86d5946cae8244adfe65a7fd405401bd542ecf22d5c497a4a`
  - Model repo used by the recipe: `0xSero/DeepSeek-V4-Flash-162B-codex-K144-REAP`
 
  Run this on the DGX Spark. `HF_TOKEN` is only required if the model repo is private or not already cached on the machine.
 
  ```bash
+ HF_TOKEN=... bash -lc 'set -euo pipefail; cd /home/sero/spark; rm -rf deepseek-spark; git clone https://github.com/0xSero/deepseek-spark.git; cd deepseek-spark; ./setup.sh full k144'
  ```
 
  Do not commit tokens into the repo or a model card. Pass them only through the environment for the one command above.
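
The warning about committed tokens can be backed by a quick scan before pushing. The sketch below is illustrative only: `scan_for_hf_tokens` is a hypothetical helper, not part of either repository, and the `hf_` prefix plus the 30-character minimum is a heuristic assumption about what a Hugging Face token looks like.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list files under a directory that appear to contain
# a Hugging Face token. The pattern is a heuristic assumption ("hf_" followed
# by at least 30 alphanumeric characters), not an official token format.
scan_for_hf_tokens() {
  local dir="${1:-.}"
  # -r: recurse, -I: skip binary files, -l: print matching file names only,
  # -E: extended regex. The .git directory is skipped.
  grep -rIlE --exclude-dir=.git 'hf_[A-Za-z0-9]{30,}' "$dir"
}
```

`grep` exits with status 0 when it finds a match, so `if scan_for_hf_tokens . ; then echo "possible token found" >&2; fi` flags a suspicious checkout, and the function prints the offending file names.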
 
  ```bash
  MODEL_REPO=0xSero/DeepSeek-V4-Flash-162B-codex-K144-REAP
  MODEL_REVISION=d663e8fb16809f6619000648b187b257249ed824
+ SERVED_MODEL_NAME=DeepSeek-V4-Flash-Spark-Mini
  CONTEXT_LENGTH=200000
  KV_CACHE_MEMORY_BYTES=14G
  MAX_NUM_BATCHED_TOKENS=8192
 
  GPU_MEMORY_UTILIZATION=0.88
  WATCHDOG_MIN_AVAILABLE_KB=8388608
  KV_CACHE_DTYPE=fp8
+ THINKING=true
  SPECULATIVE_CONFIG=
  VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP=0
  VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1
  ```
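
`WATCHDOG_MIN_AVAILABLE_KB=8388608` is 8 × 1024 × 1024 KiB, that is, an 8 GiB floor. How the recipe's watchdog enforces this floor is not shown in this diff. The sketch below only illustrates one plausible check, assuming the value is compared against `MemAvailable` in `/proc/meminfo`. Both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers. The real watchdog lives in the recipe repo and may
# work differently. Assumption: the floor is compared with MemAvailable.

# Print MemAvailable in KiB from a meminfo-style file (default: /proc/meminfo).
mem_available_kb() {
  awk '/^MemAvailable:/ {print $2}' "${1:-/proc/meminfo}"
}

# Succeed (exit 0) when available memory is below the given floor in KiB.
below_watchdog_floor() {
  local floor_kb="$1" meminfo="${2:-/proc/meminfo}"
  local avail_kb
  avail_kb="$(mem_available_kb "$meminfo")"
  [ "$avail_kb" -lt "$floor_kb" ]
}
```

For example, `below_watchdog_floor 8388608 && echo "low memory"` would report when less than 8 GiB is available on the host.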
 
+ The launcher enables DeepSeek V4 tokenizer, reasoning parser, tool-call parser, prefix caching, FP8 KV, and CUDA graph capture.
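
To make the profile concrete, the sketch below shows how a few of its variables could map onto standard `vllm serve` options. The launcher's actual mapping is not shown in this diff, so treat it as an assumption. `build_vllm_args` is a hypothetical name. The tokenizer, parser, `KV_CACHE_MEMORY_BYTES`, `THINKING`, and `VLLM_*` settings are left out because their handling is not documented here. `MAX_NUM_SEQS=1` is taken from the hunk header.

```bash
#!/usr/bin/env bash
# Values copied from the profile in this diff.
MODEL_REPO=0xSero/DeepSeek-V4-Flash-162B-codex-K144-REAP
MODEL_REVISION=d663e8fb16809f6619000648b187b257249ed824
SERVED_MODEL_NAME=DeepSeek-V4-Flash-Spark-Mini
CONTEXT_LENGTH=200000
MAX_NUM_BATCHED_TOKENS=8192
MAX_NUM_SEQS=1
GPU_MEMORY_UTILIZATION=0.88
KV_CACHE_DTYPE=fp8

# Hypothetical helper: print one argument per line for a `vllm` invocation.
# The variable-to-flag mapping is an assumption, not the launcher's code.
build_vllm_args() {
  local -a args=(
    serve "$MODEL_REPO"
    --revision "$MODEL_REVISION"
    --served-model-name "$SERVED_MODEL_NAME"
    --max-model-len "$CONTEXT_LENGTH"
    --max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS"
    --max-num-seqs "$MAX_NUM_SEQS"
    --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION"
    --kv-cache-dtype "$KV_CACHE_DTYPE"
    --enable-prefix-caching
  )
  # No --enforce-eager: the profile relies on CUDA graph capture.
  printf '%s\n' "${args[@]}"
}

build_vllm_args
```

Printing the arguments instead of running them keeps the sketch safe to run on any machine. On the Spark itself, the supported path remains the `setup.sh` command from the model card.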
 
  ## Docker runtime