Commit 2b799ae (verified), committed by 0xSero · 1 Parent(s): d663e8f

Update DGX Spark 200K serving recipe

Files changed (1): README.md +52 -175
README.md CHANGED
@@ -1,215 +1,92 @@
1
  ---
2
  license: mit
3
  library_name: transformers
4
- base_model: deepseek-ai/DeepSeek-V4-Flash
5
  tags:
6
  - deepseek-v4
7
  - mixture-of-experts
8
  - reap
9
  - experimental
10
- - text-generation
11
- private: true
12
  ---
13
 
14
- # DeepSeek-V4-Flash-162B-codex-K144-REAP
15
-
16
- **Experimental checkpoint, not ready for production use. Keep private unless explicitly approved for release.**
17
-
18
- This is a Routing-Enhanced Activation Pruning (REAP) derivative of
19
- [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash).
20
- It keeps **144 / 256 routed experts per routed layer** in a compact MoE layout.
21
-
22
- The goal of this checkpoint is to fit a higher-quality DeepSeek-V4-Flash REAP in a tighter serving envelope than K160 while preserving long-context, tool/JSON, and formatting behavior.
23
-
24
- ## Status
25
 
26
- - Repo visibility: private
27
- - Readiness: experimental validation checkpoint
28
- - Main known issue: long visual-format generations can still trigger conservative repetition detectors, mostly on diagram separator patterns
29
- - No observed `U+FFFD` replacement-character corruption in the latest targeted probe
30
- - Recommended decoding for the targeted formatting probe: `temperature=1.0`, `top_p=1.0`, `repetition_penalty=1.03`
31
 
32
- ## Build Summary
33
 
34
- | Field | Value |
35
- |---|---:|
36
- | Base model | `deepseek-ai/DeepSeek-V4-Flash` |
37
- | Target label | `162B-codex-K144` |
38
- | Estimated total parameters | `162.801B` |
39
- | Original routed experts / layer | `256` |
40
- | Kept routed experts / layer | `144` |
41
- | Routed MoE layers | `43` |
42
- | Per-expert params | `25,165,824` |
43
- | Estimated routed expert params | `155.827B` |
44
- | Estimated static params | `6.975B` |
45
- | HF-layout artifact size | ~`88G` |
46
- | Indexed tensor bytes | `93,684,970,120` |
47
 
48
- The checkpoint uses the compact-K layout:
49
 
50
- ```text
51
- config.json: n_routed_experts = 144
52
- inference/config.json: n_routed_experts = 144
53
- inference/config.json: n_activated_experts = 6
54
  ```
55
 
56
- Hash-routed layers use `tid2eid` remapping by gate-weight cosine similarity rather than arbitrary fallback remapping. This avoids the earlier hash-layer route-collapse bug seen in smaller/older REAP builds.
57
-
58
- ## Observation Source
59
-
60
- Expert selection used the available rows from:
61
-
62
- [`0xSero/deepseek-v4-flash-reap-observations-v2`](https://huggingface.co/datasets/0xSero/deepseek-v4-flash-reap-observations-v2)
63
 
64
- Snapshot metadata:
65
 
66
- ```text
67
- snapshot_label: partial-v2-21289
68
- available_rows_used: 21289
69
- all_experts_observed: true
70
- categories:
71
- unicode_stress: 200
72
- v2_combined: 21289
73
- ```
74
-
75
- ## Serving Configuration Tested
76
-
77
- Tested on a single B200 using Dockerized vLLM:
78
 
79
  ```bash
80
- VLLM_IMAGE=vllm/vllm-openai:deepseekv4-cu130
81
- REAP_DIR=/home/ubuntu/ds4-flash-reap/reaps/DeepSeek-V4-Flash-162B-codex-K144-REAP
82
- SERVED_NAME=ds4-flash-k144-codex-vllm
83
- MAX_MODEL_LEN=204800
84
  MAX_NUM_SEQS=1
85
- MAX_NUM_BATCHED_TOKENS=4096
86
- GPU_MEMORY_UTILIZATION=0.96
87
- PORT=8000
88
  ```
89
 
90
- Core vLLM flags:
91
 
92
- ```bash
93
- vllm serve $REAP_DIR \
94
- --served-model-name $SERVED_NAME \
95
- --trust-remote-code \
96
- --kv-cache-dtype fp8 \
97
- --block-size 256 \
98
- --tensor-parallel-size 1 \
99
- --enable-expert-parallel \
100
- --gpu-memory-utilization 0.96 \
101
- --max-model-len 204800 \
102
- --max-num-batched-tokens 4096 \
103
- --max-num-seqs 1 \
104
- --tokenizer-mode deepseek_v4 \
105
- --tool-call-parser deepseek_v4 \
106
- --enable-auto-tool-choice \
107
- --reasoning-parser deepseek_v4 \
108
- --no-enable-flashinfer-autotune
109
- ```
110
 
111
- Measured load/runtime notes:
112
 
113
  ```text
114
- vLLM model memory: 86.03 GiB
115
- server VRAM after warmup: ~176.6 GiB / 183.4 GiB on B200
116
- advertised max_model_len: 204800
117
  ```
118
 
119
- ## Validation Summary
120
 
121
- Validation run root:
122
 
123
- ```text
124
- /home/ubuntu/ds4-flash-reap/runs/k144-bench-20260527T131905Z
125
- ```
126
 
127
- Uploaded validation artifacts:
128
 
129
  ```text
130
- validation/200k_smoke.json
131
- validation/stream_bench.json
132
- validation/targeted_baseline.compact.json
133
- validation/targeted_repetition_penalty_1p03.compact.json
134
  ```
135
 
136
- ### 200K Smoke
137
-
138
- | Check | Result | Time | Notes |
139
- |---|---:|---:|---|
140
- | Loader sanity | pass | `4.49s` | exact `{"ok": true, "n": 3}` |
141
- | 200K context echo | pass | `16.78s` | observed `182,974` prompt tokens, exact context JSON |
142
- | Needle-in-haystack | pass | `7.62s` | exact `BLUE-OWL-38DD5B49D231` |
143
-
144
- Overall smoke status: **pass**.
145
-
146
- ### Stream Bench
147
 
148
- | Shape | TTFT p50 | E2E p50 | Output tok/s p50 | Health |
149
- |---|---:|---:|---:|---|
150
- | `latency_1k_out128_c1` | `0.145s` | `1.232s` | `84.38` | no replacement/repetition flags |
151
- | `decode_1k_out512_c1` | `0.145s` | `1.442s` | `88.07` | no replacement/repetition flags |
152
- | `prefill_32k_out64_c1` | `0.249s` | `0.845s` | `93.92` | no replacement/repetition flags |
153
- | `ctx_180k_out64_c1` | `10.222s` | `10.817s` | `95.83` | no replacement/repetition flags |
154
- | `conc_1k_out128_c4` | `3.246s` | `4.227s` | `86.87` | no replacement/repetition flags |
155
 
156
- ### Targeted Formatting / Tool Probe
157
-
158
- Baseline targeted probe:
159
-
160
- ```text
161
- replacement_ids: []
162
- health_fail_ids:
163
- - ascii_only_bars
164
- - mermaid_ascii_stress
165
- - unicode_box_request
166
- - long_context_visual_recall
167
- ```
168
-
169
- Manual inspection showed these were mostly conservative detector triggers on diagram syntax such as repeated `-`, `─`, `|`, and repeated layer blocks, not `U+FFFD` corruption or full sentence-loop collapse.
170
-
171
- With `repetition_penalty=1.03`:
172
-
173
- ```text
174
- replacement_ids: []
175
- health_fail_ids:
176
- - mermaid_ascii_stress
177
- - unicode_box_request
178
- - long_context_visual_recall
179
- ```
180
-
181
- `ascii_only_bars` passed with `repetition_penalty=1.03`. A stronger penalty (`1.10`) was not recommended because it damaged multilingual/instruction behavior.
182
-
183
- ## Intended Use
184
-
185
- This checkpoint is intended for internal REAP/OPSD validation and serving-envelope experiments:
186
-
187
- - long-context serving smoke tests
188
- - tool/JSON behavior checks
189
- - Unicode and formatting stress testing
190
- - comparison against K132/K160 REAP candidates
191
- - future OPD/OPSD repair experiments
192
-
193
- Do not treat this as a final public model release.
194
-
195
- ## Limitations
196
-
197
- - Experimental pruned MoE checkpoint, not a fully validated model
198
- - Long visual-format generations still need targeted stability work
199
- - Health detectors are intentionally conservative and can flag valid diagrams
200
- - Benchmark coverage is smoke/targeted validation, not a complete public benchmark suite
201
- - This is a derivative of DeepSeek-V4-Flash and inherits base-model licensing/usage constraints
202
-
203
- ## Provenance Files
204
-
205
- Important included files:
206
-
207
- ```text
208
- reap_plan.json
209
- config.json
210
- inference/config.json
211
- model.safetensors.index.json
212
- model-00001-of-00046.safetensors ... model-00046-of-00046.safetensors
213
- ```
214
 
215
- The `reap_plan.json` file records the kept-expert maps, parameter estimate, and observation snapshot.
 
1
  ---
2
  license: mit
3
  library_name: transformers
4
+ pipeline_tag: text-generation
5
  tags:
6
  - deepseek-v4
7
  - mixture-of-experts
8
  - reap
9
+ - dgx-spark
10
+ - vllm
11
+ - long-context
12
+ - fp8
13
+ - mxfp4
14
  - experimental
15
+ base_model: deepseek-ai/DeepSeek-V4-Flash
 
16
  ---
17
 
18
+ # Deepseek-V4-Flash-162B-REAP
19
 
20
+ This is the 162B / K144 REAP-pruned DeepSeek V4 Flash model. The validated single-DGX Spark serving recipe is maintained here:
21
 
22
+ - GitHub: https://github.com/0xSero/deepseek-v4-flash-spark-200k
23
+ - Docker image: `ghcr.io/0xsero/deepseek-v4-flash-spark-vllm:cutlass451-g27`
24
+ - Model repo used by the recipe: `0xSero/DeepSeek-V4-Flash-162B-codex-K144-REAP`
25
+ - Validated revision: `d663e8fb16809f6619000648b187b257249ed824`
26
 
27
+ ## One-command Spark install
28
 
29
+ Run this on the DGX Spark. `HF_TOKEN` is only required if the model repo is private or not already cached on the machine.
30
 
31
+ ```bash
32
+ HF_TOKEN=... bash -lc 'set -euo pipefail; cd /home/sero/spark; rm -rf deepseek-v4-flash-spark-200k; git clone https://github.com/0xSero/deepseek-v4-flash-spark-200k.git; cd deepseek-v4-flash-spark-200k; ./install.sh --profile k144-nospec-200k --launch'
33
  ```
34
 
35
+ Do not commit tokens to the repo or to a model card. Pass them only through the environment of the single command above.
36
 
37
+ ## Exact working profile
38
 
39
+ The profile lives at `configs/k144-nospec-200k.env` in the GitHub repo.
40
 
41
  ```bash
42
+ MODEL_REPO=0xSero/DeepSeek-V4-Flash-162B-codex-K144-REAP
43
+ MODEL_REVISION=d663e8fb16809f6619000648b187b257249ed824
44
+ SERVED_MODEL_NAME=deepseek-v4-flash-k144-g27-cutlass451
45
+ CONTEXT_LENGTH=200000
46
+ KV_CACHE_MEMORY_BYTES=14G
47
+ MAX_NUM_BATCHED_TOKENS=8192
48
  MAX_NUM_SEQS=1
49
+ GPU_MEMORY_UTILIZATION=0.88
50
+ WATCHDOG_MIN_AVAILABLE_KB=8388608
51
+ KV_CACHE_DTYPE=fp8
52
+ ENFORCE_EAGER=0
53
+ THINKING=false
54
+ SPECULATIVE_CONFIG=
55
+ VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP=0
56
+ VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1
57
  ```
58
 
59
+ The launcher enables the DeepSeek V4 tokenizer, reasoning parser, and tool-call parser, along with prefix caching, FP8 KV cache, and CUDA graphs. Do not add `--enforce-eager`; this profile was validated with CUDA graph capture enabled (`ENFORCE_EAGER=0`).
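The profile is a flat `KEY=VALUE` file, so its values and units can be sanity-checked with a few lines of Python. The sketch below is illustrative only: it parses an excerpt of the values listed in the profile and converts the watchdog threshold to GiB. The `parse_env` helper is hypothetical and is not part of the installer, and reading `WATCHDOG_MIN_AVAILABLE_KB` as KiB is an assumption based on the variable name.

```python
# Excerpt of the profile values quoted in this README (not the full file).
PROFILE_TEXT = """\
CONTEXT_LENGTH=200000
KV_CACHE_MEMORY_BYTES=14G
MAX_NUM_BATCHED_TOKENS=8192
MAX_NUM_SEQS=1
GPU_MEMORY_UTILIZATION=0.88
WATCHDOG_MIN_AVAILABLE_KB=8388608
SPECULATIVE_CONFIG=
"""


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    profile = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        profile[key] = value
    return profile


profile = parse_env(PROFILE_TEXT)

context_length = int(profile["CONTEXT_LENGTH"])
# Assumption: the threshold is expressed in KiB, so divide by 1024**2 for GiB.
watchdog_gib = int(profile["WATCHDOG_MIN_AVAILABLE_KB"]) / (1024 ** 2)
# An empty SPECULATIVE_CONFIG means speculative decoding is off in this profile.
speculative_enabled = bool(profile["SPECULATIVE_CONFIG"])

print(f"context={context_length} watchdog={watchdog_gib:.1f} GiB "
      f"speculative={speculative_enabled}")
```

Under that reading, the watchdog value matches the 8 GiB threshold referred to in the validation notes.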
60
 
61
+ ## Docker runtime
62
 
63
+ The expected public image is:
64
 
65
  ```text
66
+ ghcr.io/0xsero/deepseek-v4-flash-spark-vllm:cutlass451-g27
67
  ```
68
 
69
+ The image lineage is the DGX Spark DeepSeek V4 vLLM build `vllm-node-dsv4:latest` with vLLM `0.1.dev17016+g27fd665bd.d20260526` and `nvidia-cutlass-dsl[cu13]==4.5.1`. The installer tags the pulled image as `vllm-node-dsv4-cutlass451:latest`.
70
 
71
+ The repo also carries the runtime patcher used during validation. It applies the nonstandard REAP expert-count router fallback, MXFP4 memory hygiene, optional cute-dsl override hook, and a FlashInfer CUDA IPC `libcudart` fix. It does not modify model weights.
72
 
73
+ ## Validation
74
 
75
+ Validation was run on `spark-2822`, a single DGX Spark (GB10, SM121) machine, on May 27, 2026.
76
 
77
  ```text
78
+ run_dir: /home/sero/spark/benchmarks/deepseek-reap/single-server-sweep/k144-nospec-200k-mnbt8192-20260527T190139Z
79
+ prompt_tokens: 186,390
80
+ TTFT: 345.834 s
81
+ prefill: 538.958 tok/s
82
+ decode: 13.899 tok/s
83
+ needle_retained: true
84
  ```
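The reported figures are internally consistent: dividing the prompt length by the prefill rate reproduces the time to first token. The short check below uses only the numbers from the block above. The 128-token output length in the decode estimate is an arbitrary illustration, not a measured run.

```python
# Values copied from the validation summary above.
prompt_tokens = 186_390
prefill_tok_per_s = 538.958
decode_tok_per_s = 13.899
reported_ttft_s = 345.834

# At 200K context, TTFT is dominated by prefill: tokens / prefill rate.
estimated_ttft_s = prompt_tokens / prefill_tok_per_s


def decode_seconds(output_tokens: int) -> float:
    """Rough decode time for a given output length at the measured rate."""
    return output_tokens / decode_tok_per_s


print(f"estimated TTFT: {estimated_ttft_s:.3f} s (reported {reported_ttft_s} s)")
print(f"decode time for 128 tokens: {decode_seconds(128):.2f} s")
```

In practice this means a full-context request spends nearly six minutes in prefill before the first token arrives, so client timeouts need to be set accordingly.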
85
 
86
+ Task coverage at 200K included smoke, ASCII, Unicode, Mermaid, code explanation, religion/philosophy prompts, tool-call fidelity, and a long-needle retrieval test. The 200K sweep completed and retained the needle, but the watchdog logged a low-memory kill at final teardown near the 8 GiB threshold. Treat this as proof that K144 can serve 200K on one Spark, not as the most comfortable always-on daemon profile.
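To make the watchdog threshold concrete, here is a minimal sketch of the kind of check such a watchdog might perform. It assumes the watchdog compares the Linux `MemAvailable` value from `/proc/meminfo` against `WATCHDOG_MIN_AVAILABLE_KB`; the recipe's actual watchdog implementation is not shown in this README, and both function names and the sample text are hypothetical.

```python
WATCHDOG_MIN_AVAILABLE_KB = 8_388_608  # value from the published profile


def mem_available_kb(meminfo_text: str) -> int:
    """Extract MemAvailable (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:   <number> kB"
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")


def below_threshold(meminfo_text: str,
                    min_kb: int = WATCHDOG_MIN_AVAILABLE_KB) -> bool:
    """Return True when available memory has dropped under the threshold."""
    return mem_available_kb(meminfo_text) < min_kb


# Made-up sample text for illustration; not taken from the validation run.
SAMPLE = "MemFree:         1000000 kB\nMemAvailable:    8000000 kB\n"
print(below_threshold(SAMPLE))
```

On a real machine the text would come from `open("/proc/meminfo").read()`; the sample string keeps the sketch runnable anywhere.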
 
 
 
 
 
 
 
 
 
 
87
 
88
+ K144 MTP2 improved short-context decode in testing, but it was not long-context safe at the tested watchdog thresholds. The published 200K profile is therefore the no-speculative-decoding profile.
89
 
90
+ ## Intended use
91
 
92
+ This model card is for experimental local inference and reproducibility of the DGX Spark REAP serving recipe. The model is a pruned/quantized DeepSeek V4 Flash derivative; evaluate behavior and license obligations before production use.