DeepSeek-V4-Flash-DSpark
Stock DeepSeek-V4-Flash with DSpark speculative decoding. Faster than MTP on 2-4x RTX Pro 6000. By Fraser Price.
The official DeepSeek-V4-Flash (FP8 weights + FP4 experts, unmodified) with DeepSeek's DSpark draft module bolted on. DSpark (from DeepSpec) is block-wise speculative decoding: it drafts several tokens per step through a small Markov head, so single-stream decode is ~1.2-1.4x faster than the model's stock MTP at identical quality (speculative decoding is lossless). It runs on a Blackwell / CUDA-13.2 vLLM image that ships the dspark speculative method (voipmonitor/vllm:eldritch-enlightenment-v2226f26-b12x15cd38c-cu132-20260629), the image from local-inference-lab/rtx6kpro.
Update
The serving image has moved from the original fraserpricee/vllm:dspark-cu132-20260627 to the local-inference-lab vLLM build (voipmonitor/vllm:eldritch-enlightenment-v2226f26-b12x15cd38c-cu132-20260629) — a more performant DSpark + sparse-MLA implementation on Blackwell. All credits to them!
The performance numbers below were measured on this V2 build, weights are the same.
Related models
| Repo | What |
|---|---|
| deepseek-ai/DeepSeek-V4-Flash | the official base model |
| DeepSeek-V4-Flash-DSpark | this repo (stock + DSpark) |
| DeepSeek-V4-Flash-Abliterated-DSpark | abliterated (uncensored) + DSpark |
Quick start
Needs Docker with the NVIDIA Container Toolkit and 2-4 GPUs (see Hardware).
curl -fsSL https://huggingface.co/fraserprice/DeepSeek-V4-Flash-DSpark/resolve/main/run.sh -o run.sh && bash run.sh
That one command grabs the launcher, downloads the weights (~167 GB) into your Hugging Face cache, pulls the inference image, and serves an OpenAI-compatible API on http://localhost:8000/v1. No other files or arguments needed. On 4 GPUs:
GPUS=0,1,2,3 TP=4 bash run.sh
Everything is overridable by env var:
| Variable | Default | Meaning |
|---|---|---|
GPUS |
0,1 |
GPU indices (comma-separated) |
TP |
2 |
Tensor-parallel size; set to the number of GPUs |
DSPARK_TOKENS |
5 |
DSpark draft tokens per step |
PORT |
8000 |
API port |
MAX_MODEL_LEN |
262144 |
Context length |
GPU_MEM_UTIL |
0.92 |
Fraction of VRAM to use |
HF_REPO |
fraserprice/DeepSeek-V4-Flash-DSpark |
repo to download |
MODEL_DIR |
(HF cache) | serve a local dir instead of downloading |
SERVED_NAME |
DeepSeek-V4-Flash-DSpark |
API model name |
IMAGE |
(pinned DSpark image) | inference container |
Hardware
Tuned for NVIDIA RTX Pro 6000 Blackwell (96 GB, sm_120). Needs roughly 180 GB total VRAM for weights + DSpark module + KV cache at long context, split across GPUs by tensor parallelism.
| Setup | Works? |
|---|---|
2 x RTX Pro 6000 (96 GB), TP=2 |
✅ reference config |
4 x RTX Pro 6000, TP=4 |
✅ more KV headroom / throughput |
| < ~180 GB total VRAM | ❌ won't fit |
The serving image targets Blackwell (sm_120) / CUDA 13.2 and adds the dspark speculative method to vLLM. See local-inference-lab/rtx6kpro for the image source.
Performance
Measured on RTX Pro 6000 Blackwell (PCIe 5.0), FP8 weights + FP8 KV cache, gpu_memory_utilization=0.92, DSpark num_speculative_tokens=5, 256 generated tokens/request. PP is prefill throughput (tok/s); TG is per-request decode throughput (tok/s); total is aggregate throughput (tok/s) across concurrent requests. MTP TG is the base model's shipped single-layer-MTP decode for reference; Decode ↑ is DSpark's gain over it.
These numbers were measured on the abliterated DSpark twin and apply identically here: same architecture, precision, and DSpark module. Abliteration is a rank-3 projection on the base weights that does not change compute.
Latency
| GPUs | prompt | conc | TTFT p50 | ITL p50 | E2E p50 |
|---|---|---|---|---|---|
| 2 | 1,000 | 1 | 155 ms | 15.6 ms | 1.34 s |
| 2 | 1,000 | 3 | 435 ms | 23.1 ms | 2.04 s |
| 2 | 10,000 | 1 | 1,232 ms | 15.9 ms | 2.32 s |
| 2 | 10,000 | 3 | 2,495 ms | 23.8 ms | 5.30 s |
| 2 | 100,000 | 1 | 14,568 ms | 16.4 ms | 15.71 s |
| 2 | 100,000 | 3 | 29,078 ms | 26.1 ms | 44.77 s |
| 4 | 1,000 | 1 | 127 ms | 11.9 ms | 0.96 s |
| 4 | 1,000 | 3 | 259 ms | 16.3 ms | 1.52 s |
| 4 | 10,000 | 1 | 983 ms | 12.0 ms | 1.87 s |
| 4 | 10,000 | 3 | 2,229 ms | 16.9 ms | 4.06 s |
| 4 | 100,000 | 1 | 12,772 ms | 12.5 ms | 13.59 s |
| 4 | 100,000 | 3 | 26,698 ms | 18.9 ms | 40.11 s |
Throughput (tok/s)
| GPUs | prompt | conc | PP | MTP TG | DSpark TG | Decode ↑ | total |
|---|---|---|---|---|---|---|---|
| 2 | 1,000 | 1 | 8,080 | 190.9 | 224.7 | 1.18× | 1,160 |
| 2 | 1,000 | 3 | 8,377 | 138.4 | 152.5 | 1.10× | 2,120 |
| 2 | 10,000 | 1 | 10,081 | 191.5 | 239.5 | 1.25× | 5,469 |
| 2 | 10,000 | 3 | 10,050 | 90.3 | 106.6 | 1.18× | 7,046 |
| 2 | 100,000 | 1 | 8,525 | 185.4 | 247.1 | 1.33× | 7,965 |
| 2 | 100,000 | 3 | 8,585 | 60.4 | 67.1 | 1.11× | 8,320 |
| 4 | 1,000 | 1 | 9,769 | 233.3 | 322.7 | 1.38× | 1,605 |
| 4 | 1,000 | 3 | 10,578 | 181.9 | 212.8 | 1.17× | 2,927 |
| 4 | 10,000 | 1 | 12,645 | 234.5 | 314.3 | 1.34× | 6,984 |
| 4 | 10,000 | 3 | 12,642 | 115.0 | 148.3 | 1.29× | 9,122 |
| 4 | 100,000 | 1 | 9,734 | 232.8 | 335.8 | 1.44× | 9,194 |
| 4 | 100,000 | 3 | 9,507 | 74.7 | 92.7 | 1.24× | 9,278 |
~1.2-1.4x faster single-stream decode than the base model's shipped MTP, at the same (lossless) quality.
Credits
- deepseek-ai/DeepSeek-V4-Flash: the base model (MIT).
- DeepSeek-AI / DeepSpec and DeepSeek-V4-Flash-DSpark: the DSpark technique and draft weights.
- local-inference-lab/rtx6kpro: the sm120 Blackwell / CUDA-13.2 vLLM build (
voipmonitor/vllm:eldritch-…) that serves this model, with thedsparkspeculative method and B12X sparse-MLA stack. voipmonitor/vllm: the vLLM image registry the image is published to.
License
MIT, inheriting from the base model.
Built by Fraser Price, @fraserpricee. Found this useful? A follow is appreciated.
- Downloads last month
- 1,833
Model tree for fraserprice/DeepSeek-V4-Flash-DSpark
Base model
deepseek-ai/DeepSeek-V4-Flash