---
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- deepseek
- deepseek-v4
- dgx-spark
- experimental
- fp8
- long-context
- mixture-of-experts
- mxfp4
- reap
- vllm
base_model:
- deepseek-ai/DeepSeek-V4-Flash
---
> [!TIP]
> **[Support this work →](https://donate.sybilsolutions.ai)** · [X](https://x.com/0xsero) · [GitHub](https://github.com/0xsero) · [REAP paper](https://arxiv.org/abs/2510.13999) · [Cerebras REAP](https://huggingface.co/collections/cerebras/cerebras-reap)
# DeepSeek-V4-Flash-162B
REAP-pruned [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash).
## At a glance
| | |
|---|---|
| Base model | [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) |
| Format | BF16 |
| Total params | **162B** |
| Active / token | — |
| Experts / layer | 144 |
| Layers | 43 |
| Hidden size | 4096 |
| Context | 1,048,576 |
| On-disk size | 94 GB |
## Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
| `DeepSeek-V4-Flash-162B` **(this)** | BF16 | [link](https://huggingface.co/0xSero/DeepSeek-V4-Flash-162B) |
| `DeepSeek-V4-Flash-162B-GGUF` | GGUF | [link](https://huggingface.co/0xSero/DeepSeek-V4-Flash-162B-GGUF) |
| `DeepSeek-V4-Flash-180B` | BF16 | [link](https://huggingface.co/0xSero/DeepSeek-V4-Flash-180B) |
| `DeepSeek-V4-Flash-180B-GGUF` | GGUF | [link](https://huggingface.co/0xSero/DeepSeek-V4-Flash-180B-GGUF) |
| `DeepSeek-V4-Flash-213B` | BF16 | [link](https://huggingface.co/0xSero/DeepSeek-V4-Flash-213B) |
**162B parameters | K144 REAP-pruned | 200K context | no speculative decoding**
This is a smaller pruned DeepSeek V4 Flash that runs on a single DGX Spark. It trades some model capacity for higher prefill speed and a more conservative memory profile. It is the fallback option when you want 200K context with a bit more headroom.
## What this is
- Base: `deepseek-ai/DeepSeek-V4-Flash`
- Pruning: REAP (Router-weighted Expert Activation Pruning) at K144
- Final size: ~162B total parameters
- Quantization: NVFP4 / MXFP4 expert weights with FP8 KV cache
- Serving: vLLM with DeepSeek V4 tokenizer, reasoning parser, and tool-call parser
- Context: 200,000 tokens validated end-to-end
- Hardware target: single NVIDIA DGX Spark / GB10 / SM121
K144 was the smaller checkpoint that still reached 200K on one Spark. It prefills faster than K160 (about 539 tok/s vs 514 tok/s) but decodes slower (about 14 tok/s vs 24 tok/s) because it lacks MTP speculative decoding. The watchdog also logged a low-memory kill at final teardown, so treat this as proof-of-concept rather than a comfortable always-on daemon.
## How the REAP checkpoint was made
REAP (Router-weighted Expert Activation Pruning) is the Cerebras Research one-shot MoE compression method: https://github.com/CerebrasResearch/reap.
Short version: take DeepSeek V4 Flash, measure which MoE experts actually matter under real prompts, keep the most useful routed experts, delete the colder ones, remap the router/expert tables, then pack the surviving model into the low-bit format we serve.
Step by step:
**1. Start from DeepSeek V4 Flash.** DeepSeek V4 Flash is a sparse MoE model. Every token does not use every expert; the router picks a small top-k subset per token. That sparsity is what makes expert pruning viable. The served K144 checkpoint keeps this structure: `model_type=deepseek_v4`, 43 hidden layers, hidden size 4096, 1 shared expert, 6 routed experts active per token, and `max_position_embeddings=1048576` from the base.
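To make the sparsity concrete, here is a minimal sketch of top-k routing for a single token. The softmax gate, the random logits, and the name `route_token` are illustrative assumptions, not DeepSeek's actual router; only the counts (144 routed experts in this checkpoint, 6 active per token) come from this card.

```python
import math
import random

def route_token(logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    # Softmax over all routed experts (an assumption; the real gate may differ).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k largest probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(144)]  # 144 routed experts
gates = route_token(logits, k=6)                         # 6 active per token
print(len(gates), round(sum(gates.values()), 6))
```

Every expert outside the returned dictionary does no work for this token, which is why experts that are rarely selected are candidates for removal.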
**2. Run calibration prompts through the original model.** A calibration corpus is passed through the unpruned model. For each token and each MoE layer, REAP records router scores, which experts the top-k selected, how strongly the router weighted them, and how large the expert activations were. The useful signal is roughly `router_probability * topk_selected * activation_strength * frequency`. This is the "router-weighted activation" part of the name.
**3. Rank experts per layer.** Each MoE layer gets its own ranking. Hot experts are ones the router actually depends on; cold experts are rarely picked or contribute little.
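```python
from collections import defaultdict

# Pseudocode sketch: model.router, select_experts, estimate_activation_strength,
# and top_k are placeholders for the real pipeline helpers.
keep_experts = {}
for layer in moe_layers:
    scores = defaultdict(float)  # expert id -> accumulated saliency
    for batch in calibration_data:
        router_output = model.router(layer, batch.hidden_states)
        topk_experts, gate_weights = select_experts(router_output)
        for token in batch.tokens:
            # Only experts the router selected for this token contribute.
            for expert, weight in topk_experts[token]:
                activation = estimate_activation_strength(layer, expert, token)
                scores[expert] += weight * activation
    keep_experts[layer] = top_k(scores, K)  # keep the K highest-scoring experts
```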
For this checkpoint, `K=144` routed experts per MoE layer are kept. The shared expert is always kept.
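The ranking logic can be exercised on synthetic data. This is a toy sketch, not the Cerebras implementation: the routing records are randomly generated, and `reap_scores` and `select_experts_to_keep` are hypothetical names. It accumulates gate weight times activation strength per expert, as in the pseudocode above, and keeps the top K.

```python
import random
from collections import defaultdict

def reap_scores(records):
    """Sum gate_weight * activation_norm per expert over calibration records.

    Each record is (expert_id, gate_weight, activation_norm) for one
    (token, selected expert) pair in a single MoE layer.
    """
    scores = defaultdict(float)
    for expert_id, gate_weight, activation_norm in records:
        scores[expert_id] += gate_weight * activation_norm
    return scores

def select_experts_to_keep(scores, k):
    """Return the ids of the k highest-scoring experts, sorted ascending."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:k])

# Synthetic calibration data: 8 experts, experts 0-3 are picked 4x as often.
random.seed(0)
records = []
for _ in range(2000):
    expert = random.choice([0, 1, 2, 3] * 4 + [4, 5, 6, 7])
    records.append((expert, random.random(), random.uniform(0.5, 1.5)))

kept = select_experts_to_keep(reap_scores(records), k=4)
print(kept)
```

In the real pipeline the records would come from forward passes over a calibration corpus, and the ranking would be computed independently for each MoE layer.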
**4. Physically prune the expert weights.** This is structural surgery on the MoE expert tensors, not LoRA, prompt tuning, or fine-tuning. Embeddings, attention, norms, router, shared expert, selected routed experts, and the LM head all stay. Low-ranked routed experts are removed and expert IDs are remapped so the model has a compact expert table. That is why the config now reports `n_routed_experts: 144` instead of the larger original count.
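The ID remapping can be pictured as follows. This is a minimal sketch on plain Python lists; `prune_layer`, the list-of-rows router representation, and the toy sizes are assumptions for illustration, not the actual checkpoint surgery code.

```python
def prune_layer(expert_weights, router_rows, keep_ids):
    """Drop pruned experts and renumber survivors to 0..K-1.

    expert_weights: list indexed by old expert id (any per-expert payload).
    router_rows:    list indexed by old expert id (one gate row per expert).
    keep_ids:       old expert ids to retain.
    """
    keep_ids = sorted(keep_ids)
    old_to_new = {old: new for new, old in enumerate(keep_ids)}
    new_experts = [expert_weights[old] for old in keep_ids]
    new_router = [router_rows[old] for old in keep_ids]
    return new_experts, new_router, old_to_new

# Toy layer with 6 experts; keep 4 of them.
experts = [f"expert_{i}_weights" for i in range(6)]
router = [[float(i)] * 3 for i in range(6)]
new_experts, new_router, old_to_new = prune_layer(experts, router, {5, 0, 3, 2})

print(old_to_new)
print(len(new_experts))
```

Keeping the expert table and the router rows in the same order is what lets the pruned model report a compact expert count with no gaps in the ID space.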
**5. Update router metadata.** Because experts were deleted, the router cannot point at old expert IDs. REAP rewrites the routing metadata and the token-to-expert mapping used by the runtime. This is why vLLM needed a router patch: K144 and K160 are valid checkpoints but use nonstandard routed-expert counts that some fused CUDA router kernels do not template-instantiate. The patch forces the general fallback router path. It does not change weights or model behavior.
**6. Quantize and pack.** The pruned checkpoint is packed into the low-bit format the runtime serves: MXFP4/NVFP4-style packed expert weights with FP8 MLA KV cache. That is how K144 lands in a memory range that fits on one DGX Spark with extra prefill headroom.
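A quick back-of-the-envelope check shows how much the low-bit packing matters. The sketch below uses only the 162B total parameter count and the 94 GB on-disk size from the table above, and assumes 1 GB = 1e9 bytes.

```python
total_params = 162e9      # "Total params" from the table
on_disk_bytes = 94e9      # "On-disk size", assuming 1 GB = 1e9 bytes

bits_per_param = on_disk_bytes * 8 / total_params
bf16_size_gb = total_params * 2 / 1e9  # 2 bytes per parameter in BF16

print(f"{bits_per_param:.2f} bits/param on disk")
print(f"{bf16_size_gb:.0f} GB if every parameter were BF16")
```

An average well under 16 bits per parameter is consistent with low-bit expert weights plus a smaller set of tensors kept at higher precision.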
**7. Validate quality and fit.** Multiple sizes were tested on one Spark: 213B was too large, 200B failed readiness, 180B/K160 was the best balance, and 162B/K144 is the smaller fallback profile published here.
### What REAP changes vs. preserves
Changes: number of routed experts, expert tensors, expert ID mapping, checkpoint size, runtime memory footprint.
Preserves: context length, tokenizer, attention architecture, number of layers, hidden size, number of experts used per token, base chat format.
### What we did in this project
We did not recreate the REAP pipeline ourselves. We downloaded the already-created REAP checkpoints, inspected their configs and expert counts, patched vLLM to accept the nonstandard expert counts, built and validated the DGX Spark runtime, found working one-Spark profiles, and published the serving repos, configs, and model cards.
The end-to-end artifact:
```text
DeepSeek V4 Flash
-> router-weighted expert pruning (REAP)
-> K144 expert-retained checkpoint
-> low-bit packed checkpoint
-> vLLM Spark runtime
-> 200K context serving recipe
```
## How we got here
See the [DeepSeek-V4-Flash-Spark](https://huggingface.co/0xSero/DeepSeek-V4-Flash-180B) model card for the full story. The short version: we tested every REAP checkpoint from 148B through 213B on a single DGX Spark. Most failed before the API came up. K160 was the largest that survived with speculative decoding. K144 was the next viable option without it.
The same runtime patches apply: native ARM64 vLLM build, Cutlass 4.5.1 workaround, REAP expert-count fallback, MXFP4 memory hygiene, and FlashInfer CUDA IPC fix.
## One-command install
Run this on the DGX Spark. `HF_TOKEN` is only needed if the model repo is private or not already cached.
```bash
HF_TOKEN=... bash -lc 'set -euo pipefail; cd /home/sero/spark; rm -rf deepseek-spark; git clone https://github.com/0xSero/deepseek-spark.git; cd deepseek-spark; ./setup.sh full k144'
```
Do not commit tokens. Pass them only through the environment for this one command.
## Exact working profile
The profile lives at `configs/k144-nospec-200k.env` in the GitHub repo.
```bash
MODEL_REPO=0xSero/DeepSeek-V4-Flash-162B
MODEL_REVISION=d663e8fb16809f6619000648b187b257249ed824
SERVED_MODEL_NAME=DeepSeek-V4-Flash-Spark-Mini
CONTEXT_LENGTH=200000
KV_CACHE_MEMORY_BYTES=14G
MAX_NUM_BATCHED_TOKENS=8192
MAX_NUM_SEQS=1
GPU_MEMORY_UTILIZATION=0.88
WATCHDOG_MIN_AVAILABLE_KB=8388608
KV_CACHE_DTYPE=fp8
THINKING=true
SPECULATIVE_CONFIG=
VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP=0
VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1
```
The launcher enables DeepSeek V4 tokenizer, reasoning parser, tool-call parser, prefix caching, FP8 KV, and CUDA graph capture. No speculative decoding.
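For a quick sanity check of the profile values, the sketch below parses a subset of the `KEY=VALUE` lines and converts a few of them into more readable units. The parser and the derived quantities are illustrative assumptions; the actual launcher in the GitHub repo may read the file differently.

```python
def parse_env_profile(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    profile = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        profile[key] = value
    return profile

PROFILE = """\
CONTEXT_LENGTH=200000
KV_CACHE_MEMORY_BYTES=14G
MAX_NUM_BATCHED_TOKENS=8192
MAX_NUM_SEQS=1
GPU_MEMORY_UTILIZATION=0.88
WATCHDOG_MIN_AVAILABLE_KB=8388608
SPECULATIVE_CONFIG=
"""

cfg = parse_env_profile(PROFILE)
# Watchdog threshold in GiB (value is given in KiB).
watchdog_gib = int(cfg["WATCHDOG_MIN_AVAILABLE_KB"]) / (1024 * 1024)
# Minimum number of batches needed to prefill a full-length prompt (ceiling division).
prefill_chunks = -(-int(cfg["CONTEXT_LENGTH"]) // int(cfg["MAX_NUM_BATCHED_TOKENS"]))
# An empty SPECULATIVE_CONFIG means speculative decoding is off.
spec_enabled = bool(cfg["SPECULATIVE_CONFIG"])

print(watchdog_gib, prefill_chunks, spec_enabled)
```

The watchdog value works out to 8 GiB of available memory, which matches the threshold mentioned in the validation notes.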
## Docker runtime
The runtime Docker image is published at:
```text
ghcr.io/0xsero/deepseek-v4-flash-spark-vllm:cutlass451-g27
```
The image lineage is the DGX Spark DeepSeek V4 vLLM build `vllm-node-dsv4:latest` with vLLM `0.1.dev17016+g27fd665bd.d20260526` and `nvidia-cutlass-dsl[cu13]==4.5.1`. The final local tag is `vllm-node-dsv4-cutlass451:latest`.
Exact image validated on `spark-2822`:
```text
vllm-node-dsv4-cutlass451:latest
sha256:5df60ebb9c10dfb86d5946cae8244adfe65a7fd405401bd542ecf22d5c497a4a
```
The installer pulls the published image automatically. Pass `IMAGE_REF=...` only when testing a different runtime image.
The runtime patcher applies the nonstandard REAP expert-count router fallback, MXFP4 memory hygiene, optional cute-dsl override hook, and a FlashInfer CUDA IPC `libcudart` fix. It does not modify model weights.
## Validation
Run on `spark-2822`, a single DGX Spark / GB10 / SM121, on May 27, 2026.
200K long-needle benchmark:
```text
run_dir: /home/sero/spark/benchmarks/deepseek-reap/single-server-sweep/k144-nospec-200k-mnbt8192-20260527T190139Z
prompt_tokens: 186,390
TTFT: 345.834 s
prefill: 538.958 tok/s
decode: 13.899 tok/s
needle_retained: true
```
Task coverage at 200K included smoke, ASCII, Unicode, and Mermaid diagrams; code explanation; religion and philosophy prompts; tool-call fidelity; and long-needle retrieval. All passed. The watchdog logged a low-memory kill at final teardown near the 8 GB threshold, so this is proven but not the most comfortable always-on profile.
K144 with MTP2 was tested but was not long-context safe at the tested watchdog thresholds. The published 200K profile is therefore the no-speculative-decoding profile.
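The benchmark figures above are internally consistent, and they translate directly into expected wait times. The sketch below assumes prefill throughput is prompt tokens divided by TTFT and that decode proceeds at the measured steady-state rate; `total_latency_s` is a hypothetical helper, not part of the benchmark harness.

```python
prompt_tokens = 186_390
ttft_s = 345.834
decode_tok_s = 13.899

# Prefill throughput implied by the prompt length and time to first token.
prefill_tok_s = prompt_tokens / ttft_s

def total_latency_s(output_tokens):
    """TTFT plus steady-state decode time for the given number of output tokens."""
    return ttft_s + output_tokens / decode_tok_s

print(f"prefill: {prefill_tok_s:.3f} tok/s")
print(f"1,000-token reply after a 186K prompt: {total_latency_s(1000) / 60:.1f} min")
```

At this prompt length the time to first token dominates, so prefix caching and prefill speed matter more for perceived latency than decode speed.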
## Why K144 without speculative decoding
K144 without MTP is the conservative option. It uses a larger KV cache (14 GB vs 6 GB) and bigger prefill chunks (8192 vs 4096), which gives it the highest prefill speed of the tested single-Spark profiles. The tradeoff is lower decode speed and a tighter memory margin at teardown.
Choose this if you value prefill throughput over decode speed, or if you want a simpler profile without speculative decoding.
## Limitations
- This is a pruned model. It is not the full DeepSeek V4 Flash. Evaluate quality against your own tasks before trusting it for production work.
- 200K context works, but memory is tight. The watchdog killed the process at teardown during validation.
- The public 200K success path for the full model remains dual-Spark TP=2. This single-Spark pruned profile is a compromise.
- The Docker image and patches are experimental. They are not upstream vLLM and may break on newer commits.
## Links
- One-command wrapper: https://github.com/0xSero/deepseek-spark
- Runtime module (configs, patcher, evidence): https://github.com/0xSero/deepseek-spark/tree/main/runtime
- Base model: https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash
- Larger single-Spark profile: https://huggingface.co/0xSero/DeepSeek-V4-Flash-180B
## License
MIT for the serving recipe and tooling. The base model weights follow the DeepSeek V4 Flash license. Review it before use.
## Citation
```bibtex
@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
```
## Sponsors
Made possible by **NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle**.