DeepSeek-V4-Flash-DSpark

Stock DeepSeek-V4-Flash with DSpark speculative decoding. Faster than MTP on 2-4x RTX Pro 6000. By Fraser Price.

The official DeepSeek-V4-Flash (FP8 weights + FP4 experts, unmodified) with DeepSeek's DSpark draft module bolted on. DSpark (from DeepSpec) is block-wise speculative decoding: it drafts several tokens per step through a small Markov head, so single-stream decode is ~1.2-1.4x faster than the model's stock MTP at identical quality (speculative decoding is lossless). It runs on a Blackwell / CUDA-13.2 vLLM image that ships the dspark speculative method (voipmonitor/vllm:eldritch-enlightenment-v2226f26-b12x15cd38c-cu132-20260629), the image from local-inference-lab/rtx6kpro.

Update

The serving image has moved from the original fraserpricee/vllm:dspark-cu132-20260627 to the local-inference-lab vLLM build (voipmonitor/vllm:eldritch-enlightenment-v2226f26-b12x15cd38c-cu132-20260629) — a more performant DSpark + sparse-MLA implementation on Blackwell. All credits to them! The performance numbers below were measured on this V2 build, weights are the same.

Related models

Repo	What
deepseek-ai/DeepSeek-V4-Flash	the official base model
DeepSeek-V4-Flash-DSpark	this repo (stock + DSpark)
DeepSeek-V4-Flash-Abliterated-DSpark	abliterated (uncensored) + DSpark

Quick start

Needs Docker with the NVIDIA Container Toolkit and 2-4 GPUs (see Hardware).

curl -fsSL https://huggingface.co/fraserprice/DeepSeek-V4-Flash-DSpark/resolve/main/run.sh -o run.sh && bash run.sh

That one command grabs the launcher, downloads the weights (~167 GB) into your Hugging Face cache, pulls the inference image, and serves an OpenAI-compatible API on http://localhost:8000/v1. No other files or arguments needed. On 4 GPUs:

GPUS=0,1,2,3 TP=4 bash run.sh

Everything is overridable by env var:

Variable	Default	Meaning
`GPUS`	`0,1`	GPU indices (comma-separated)
`TP`	`2`	Tensor-parallel size; set to the number of GPUs
`DSPARK_TOKENS`	`5`	DSpark draft tokens per step
`PORT`	`8000`	API port
`MAX_MODEL_LEN`	`262144`	Context length
`GPU_MEM_UTIL`	`0.92`	Fraction of VRAM to use
`HF_REPO`	`fraserprice/DeepSeek-V4-Flash-DSpark`	repo to download
`MODEL_DIR`	(HF cache)	serve a local dir instead of downloading
`SERVED_NAME`	`DeepSeek-V4-Flash-DSpark`	API model name
`IMAGE`	(pinned DSpark image)	inference container

Hardware

Tuned for NVIDIA RTX Pro 6000 Blackwell (96 GB, sm_120). Needs roughly 180 GB total VRAM for weights + DSpark module + KV cache at long context, split across GPUs by tensor parallelism.

Setup	Works?
2 x RTX Pro 6000 (96 GB), `TP=2`	✅ reference config
4 x RTX Pro 6000, `TP=4`	✅ more KV headroom / throughput
< ~180 GB total VRAM	❌ won't fit

The serving image targets Blackwell (sm_120) / CUDA 13.2 and adds the dspark speculative method to vLLM. See local-inference-lab/rtx6kpro for the image source.

Performance

Measured on RTX Pro 6000 Blackwell (PCIe 5.0), FP8 weights + FP8 KV cache, gpu_memory_utilization=0.92, DSpark num_speculative_tokens=5, 256 generated tokens/request. PP is prefill throughput (tok/s); TG is per-request decode throughput (tok/s); total is aggregate throughput (tok/s) across concurrent requests. MTP TG is the base model's shipped single-layer-MTP decode for reference; Decode ↑ is DSpark's gain over it.

These numbers were measured on the abliterated DSpark twin and apply identically here: same architecture, precision, and DSpark module. Abliteration is a rank-3 projection on the base weights that does not change compute.

Latency

GPUs	prompt	conc	TTFT p50	ITL p50	E2E p50
2	1,000	1	155 ms	15.6 ms	1.34 s
2	1,000	3	435 ms	23.1 ms	2.04 s
2	10,000	1	1,232 ms	15.9 ms	2.32 s
2	10,000	3	2,495 ms	23.8 ms	5.30 s
2	100,000	1	14,568 ms	16.4 ms	15.71 s
2	100,000	3	29,078 ms	26.1 ms	44.77 s
4	1,000	1	127 ms	11.9 ms	0.96 s
4	1,000	3	259 ms	16.3 ms	1.52 s
4	10,000	1	983 ms	12.0 ms	1.87 s
4	10,000	3	2,229 ms	16.9 ms	4.06 s
4	100,000	1	12,772 ms	12.5 ms	13.59 s
4	100,000	3	26,698 ms	18.9 ms	40.11 s

Throughput (tok/s)

GPUs	prompt	conc	PP	MTP TG	DSpark TG	Decode ↑	total
2	1,000	1	8,080	190.9	224.7	1.18×	1,160
2	1,000	3	8,377	138.4	152.5	1.10×	2,120
2	10,000	1	10,081	191.5	239.5	1.25×	5,469
2	10,000	3	10,050	90.3	106.6	1.18×	7,046
2	100,000	1	8,525	185.4	247.1	1.33×	7,965
2	100,000	3	8,585	60.4	67.1	1.11×	8,320
4	1,000	1	9,769	233.3	322.7	1.38×	1,605
4	1,000	3	10,578	181.9	212.8	1.17×	2,927
4	10,000	1	12,645	234.5	314.3	1.34×	6,984
4	10,000	3	12,642	115.0	148.3	1.29×	9,122
4	100,000	1	9,734	232.8	335.8	1.44×	9,194
4	100,000	3	9,507	74.7	92.7	1.24×	9,278

~1.2-1.4x faster single-stream decode than the base model's shipped MTP, at the same (lossless) quality.

Credits

deepseek-ai/DeepSeek-V4-Flash: the base model (MIT).
DeepSeek-AI / DeepSpec and DeepSeek-V4-Flash-DSpark: the DSpark technique and draft weights.
local-inference-lab/rtx6kpro: the sm120 Blackwell / CUDA-13.2 vLLM build (voipmonitor/vllm:eldritch-…) that serves this model, with the dspark speculative method and B12X sparse-MLA stack.
voipmonitor/vllm: the vLLM image registry the image is published to.

License

MIT, inheriting from the base model.

Built by Fraser Price, @fraserpricee. Found this useful? A follow is appreciated.

Downloads last month: 1,833

Safetensors

Model size

165B params

Tensor type

BF16

I64

F32

F8_E8M0

F8_E4M3

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for fraserprice/DeepSeek-V4-Flash-DSpark

Base model

deepseek-ai/DeepSeek-V4-Flash

Quantized

(88)

this model