DeepSeek-V4-Flash-DSpark

Stock DeepSeek-V4-Flash with DSpark speculative decoding. Faster than MTP on 2-4x RTX Pro 6000. By Fraser Price.

The official DeepSeek-V4-Flash (FP8 weights + FP4 experts, unmodified) with DeepSeek's DSpark draft module bolted on. DSpark (from DeepSpec) is block-wise speculative decoding: it drafts several tokens per step through a small Markov head, so single-stream decode is ~1.2-1.4x faster than the model's stock MTP at identical quality (speculative decoding is lossless). It runs on a Blackwell / CUDA-13.2 vLLM image that ships the dspark speculative method (voipmonitor/vllm:eldritch-enlightenment-v2226f26-b12x15cd38c-cu132-20260629), the image from local-inference-lab/rtx6kpro.

Update

The serving image has moved from the original fraserpricee/vllm:dspark-cu132-20260627 to the local-inference-lab vLLM build (voipmonitor/vllm:eldritch-enlightenment-v2226f26-b12x15cd38c-cu132-20260629) — a more performant DSpark + sparse-MLA implementation on Blackwell. All credits to them! The performance numbers below were measured on this V2 build, weights are the same.

Related models

Repo What
deepseek-ai/DeepSeek-V4-Flash the official base model
DeepSeek-V4-Flash-DSpark this repo (stock + DSpark)
DeepSeek-V4-Flash-Abliterated-DSpark abliterated (uncensored) + DSpark

Quick start

Needs Docker with the NVIDIA Container Toolkit and 2-4 GPUs (see Hardware).

curl -fsSL https://huggingface.co/fraserprice/DeepSeek-V4-Flash-DSpark/resolve/main/run.sh -o run.sh && bash run.sh

That one command grabs the launcher, downloads the weights (~167 GB) into your Hugging Face cache, pulls the inference image, and serves an OpenAI-compatible API on http://localhost:8000/v1. No other files or arguments needed. On 4 GPUs:

GPUS=0,1,2,3 TP=4 bash run.sh

Everything is overridable by env var:

Variable Default Meaning
GPUS 0,1 GPU indices (comma-separated)
TP 2 Tensor-parallel size; set to the number of GPUs
DSPARK_TOKENS 5 DSpark draft tokens per step
PORT 8000 API port
MAX_MODEL_LEN 262144 Context length
GPU_MEM_UTIL 0.92 Fraction of VRAM to use
HF_REPO fraserprice/DeepSeek-V4-Flash-DSpark repo to download
MODEL_DIR (HF cache) serve a local dir instead of downloading
SERVED_NAME DeepSeek-V4-Flash-DSpark API model name
IMAGE (pinned DSpark image) inference container

Hardware

Tuned for NVIDIA RTX Pro 6000 Blackwell (96 GB, sm_120). Needs roughly 180 GB total VRAM for weights + DSpark module + KV cache at long context, split across GPUs by tensor parallelism.

Setup Works?
2 x RTX Pro 6000 (96 GB), TP=2 ✅ reference config
4 x RTX Pro 6000, TP=4 ✅ more KV headroom / throughput
< ~180 GB total VRAM ❌ won't fit

The serving image targets Blackwell (sm_120) / CUDA 13.2 and adds the dspark speculative method to vLLM. See local-inference-lab/rtx6kpro for the image source.

Performance

Measured on RTX Pro 6000 Blackwell (PCIe 5.0), FP8 weights + FP8 KV cache, gpu_memory_utilization=0.92, DSpark num_speculative_tokens=5, 256 generated tokens/request. PP is prefill throughput (tok/s); TG is per-request decode throughput (tok/s); total is aggregate throughput (tok/s) across concurrent requests. MTP TG is the base model's shipped single-layer-MTP decode for reference; Decode ↑ is DSpark's gain over it.

These numbers were measured on the abliterated DSpark twin and apply identically here: same architecture, precision, and DSpark module. Abliteration is a rank-3 projection on the base weights that does not change compute.

Latency

GPUs prompt conc TTFT p50 ITL p50 E2E p50
2 1,000 1 155 ms 15.6 ms 1.34 s
2 1,000 3 435 ms 23.1 ms 2.04 s
2 10,000 1 1,232 ms 15.9 ms 2.32 s
2 10,000 3 2,495 ms 23.8 ms 5.30 s
2 100,000 1 14,568 ms 16.4 ms 15.71 s
2 100,000 3 29,078 ms 26.1 ms 44.77 s
4 1,000 1 127 ms 11.9 ms 0.96 s
4 1,000 3 259 ms 16.3 ms 1.52 s
4 10,000 1 983 ms 12.0 ms 1.87 s
4 10,000 3 2,229 ms 16.9 ms 4.06 s
4 100,000 1 12,772 ms 12.5 ms 13.59 s
4 100,000 3 26,698 ms 18.9 ms 40.11 s

Throughput (tok/s)

GPUs prompt conc PP MTP TG DSpark TG Decode ↑ total
2 1,000 1 8,080 190.9 224.7 1.18× 1,160
2 1,000 3 8,377 138.4 152.5 1.10× 2,120
2 10,000 1 10,081 191.5 239.5 1.25× 5,469
2 10,000 3 10,050 90.3 106.6 1.18× 7,046
2 100,000 1 8,525 185.4 247.1 1.33× 7,965
2 100,000 3 8,585 60.4 67.1 1.11× 8,320
4 1,000 1 9,769 233.3 322.7 1.38× 1,605
4 1,000 3 10,578 181.9 212.8 1.17× 2,927
4 10,000 1 12,645 234.5 314.3 1.34× 6,984
4 10,000 3 12,642 115.0 148.3 1.29× 9,122
4 100,000 1 9,734 232.8 335.8 1.44× 9,194
4 100,000 3 9,507 74.7 92.7 1.24× 9,278

~1.2-1.4x faster single-stream decode than the base model's shipped MTP, at the same (lossless) quality.

Credits

License

MIT, inheriting from the base model.


Built by Fraser Price, @fraserpricee. Found this useful? A follow is appreciated.

Downloads last month
1,833
Safetensors
Model size
165B params
Tensor type
BF16
·
I64
·
F32
·
F8_E8M0
·
F8_E4M3
·
I8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for fraserprice/DeepSeek-V4-Flash-DSpark

Quantized
(88)
this model