---
license: mit
base_model:
  - deepseek-ai/DeepSeek-V4-Flash
  - deepseek-ai/DeepSeek-V4-Flash-DSpark
tags:
  - deepseek
  - fp8
  - vllm
  - blackwell
  - rtx-pro-6000
  - speculative-decoding
  - dspark
---

# DeepSeek-V4-Flash-DSpark

*Stock DeepSeek-V4-Flash with DSpark speculative decoding. Faster than MTP on 2-4x RTX Pro 6000. By [Fraser Price](https://x.com/fraserpricee).*

The official [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) (FP8 weights + FP4 experts, unmodified) with DeepSeek's DSpark draft module bolted on. DSpark (from [DeepSpec](https://github.com/deepseek-ai/DeepSpec)) is block-wise speculative decoding: it drafts several tokens per step through a small Markov head, so single-stream decode is ~1.2-1.4x faster than the model's stock MTP at identical quality (speculative decoding is lossless). It runs on a Blackwell / CUDA-13.2 vLLM image that ships the `dspark` speculative method ([`voipmonitor/vllm:eldritch-enlightenment-v2226f26-b12x15cd38c-cu132-20260629`](https://hub.docker.com/r/voipmonitor/vllm)), the image from [local-inference-lab/rtx6kpro](https://github.com/local-inference-lab/rtx6kpro).

## Update

The serving image has moved from the original `fraserpricee/vllm:dspark-cu132-20260627` to the [local-inference-lab](https://github.com/local-inference-lab/rtx6kpro) vLLM build (`voipmonitor/vllm:eldritch-enlightenment-v2226f26-b12x15cd38c-cu132-20260629`) — a more performant DSpark + sparse-MLA implementation on Blackwell. All credits to them! 
The performance numbers below were measured on this V2 build, weights are the same.

## Related models

| Repo | What |
|---|---|
| [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | the official base model |
| **DeepSeek-V4-Flash-DSpark** | this repo (stock + DSpark) |
| [DeepSeek-V4-Flash-Abliterated-DSpark](https://huggingface.co/fraserprice/DeepSeek-V4-Flash-Abliterated-DSpark) | abliterated (uncensored) + DSpark |

## Quick start

> Needs Docker with the NVIDIA Container Toolkit and 2-4 GPUs (see [Hardware](#hardware)).

```bash
curl -fsSL https://huggingface.co/fraserprice/DeepSeek-V4-Flash-DSpark/resolve/main/run.sh -o run.sh && bash run.sh
```

That one command grabs the launcher, downloads the weights (~167 GB) into your Hugging Face cache, pulls the inference image, and serves an OpenAI-compatible API on `http://localhost:8000/v1`. No other files or arguments needed. On 4 GPUs:

```bash
GPUS=0,1,2,3 TP=4 bash run.sh
```

Everything is overridable by env var:

| Variable | Default | Meaning |
|---|---|---|
| `GPUS` | `0,1` | GPU indices (comma-separated) |
| `TP` | `2` | Tensor-parallel size; set to the number of GPUs |
| `DSPARK_TOKENS` | `5` | DSpark draft tokens per step |
| `PORT` | `8000` | API port |
| `MAX_MODEL_LEN` | `262144` | Context length |
| `GPU_MEM_UTIL` | `0.92` | Fraction of VRAM to use |
| `HF_REPO` | `fraserprice/DeepSeek-V4-Flash-DSpark` | repo to download |
| `MODEL_DIR` | *(HF cache)* | serve a local dir instead of downloading |
| `SERVED_NAME` | `DeepSeek-V4-Flash-DSpark` | API model name |
| `IMAGE` | *(pinned DSpark image)* | inference container |

## Hardware

Tuned for NVIDIA RTX Pro 6000 Blackwell (96 GB, sm_120). Needs roughly 180 GB total VRAM for weights + DSpark module + KV cache at long context, split across GPUs by tensor parallelism.

| Setup | Works? |
|---|---|
| **2 x RTX Pro 6000 (96 GB), `TP=2`** | ✅ reference config |
| 4 x RTX Pro 6000, `TP=4` | ✅ more KV headroom / throughput |
| < ~180 GB total VRAM | ❌ won't fit |

The serving image targets Blackwell (sm_120) / CUDA 13.2 and adds the `dspark` speculative method to vLLM. See [local-inference-lab/rtx6kpro](https://github.com/local-inference-lab/rtx6kpro) for the image source.

## Performance

Measured on RTX Pro 6000 Blackwell (PCIe 5.0), FP8 weights + FP8 KV cache, `gpu_memory_utilization=0.92`, DSpark `num_speculative_tokens=5`, 256 generated tokens/request. `PP` is prefill throughput (tok/s); `TG` is per-request decode throughput (tok/s); `total` is aggregate throughput (tok/s) across concurrent requests. `MTP TG` is the base model's shipped single-layer-MTP decode for reference; `Decode ↑` is DSpark's gain over it.

> These numbers were measured on the [abliterated DSpark twin](https://huggingface.co/fraserprice/DeepSeek-V4-Flash-Abliterated-DSpark) and apply identically here: same architecture, precision, and DSpark module. Abliteration is a rank-3 projection on the base weights that does not change compute.

### Latency

| GPUs | prompt  | conc | TTFT p50  | ITL p50 | E2E p50 |
|-----:|--------:|-----:|----------:|--------:|--------:|
| 2 | 1,000   | 1 | 155 ms    | 15.6 ms | 1.34 s  |
| 2 | 1,000   | 3 | 435 ms    | 23.1 ms | 2.04 s  |
| 2 | 10,000  | 1 | 1,232 ms  | 15.9 ms | 2.32 s  |
| 2 | 10,000  | 3 | 2,495 ms  | 23.8 ms | 5.30 s  |
| 2 | 100,000 | 1 | 14,568 ms | 16.4 ms | 15.71 s |
| 2 | 100,000 | 3 | 29,078 ms | 26.1 ms | 44.77 s |
| 4 | 1,000   | 1 | 127 ms    | 11.9 ms | 0.96 s  |
| 4 | 1,000   | 3 | 259 ms    | 16.3 ms | 1.52 s  |
| 4 | 10,000  | 1 | 983 ms    | 12.0 ms | 1.87 s  |
| 4 | 10,000  | 3 | 2,229 ms  | 16.9 ms | 4.06 s  |
| 4 | 100,000 | 1 | 12,772 ms | 12.5 ms | 13.59 s |
| 4 | 100,000 | 3 | 26,698 ms | 18.9 ms | 40.11 s |

### Throughput (tok/s)

| GPUs | prompt  | conc | PP     | MTP TG | DSpark TG | Decode ↑ | total |
|-----:|--------:|-----:|-------:|-------:|----------:|:--------:|------:|
| 2 | 1,000   | 1 | 8,080  | 190.9 | 224.7 | 1.18× | 1,160 |
| 2 | 1,000   | 3 | 8,377  | 138.4 | 152.5 | 1.10× | 2,120 |
| 2 | 10,000  | 1 | 10,081 | 191.5 | 239.5 | 1.25× | 5,469 |
| 2 | 10,000  | 3 | 10,050 | 90.3  | 106.6 | 1.18× | 7,046 |
| 2 | 100,000 | 1 | 8,525  | 185.4 | 247.1 | 1.33× | 7,965 |
| 2 | 100,000 | 3 | 8,585  | 60.4  | 67.1  | 1.11× | 8,320 |
| 4 | 1,000   | 1 | 9,769  | 233.3 | 322.7 | 1.38× | 1,605 |
| 4 | 1,000   | 3 | 10,578 | 181.9 | 212.8 | 1.17× | 2,927 |
| 4 | 10,000  | 1 | 12,645 | 234.5 | 314.3 | 1.34× | 6,984 |
| 4 | 10,000  | 3 | 12,642 | 115.0 | 148.3 | 1.29× | 9,122 |
| 4 | 100,000 | 1 | 9,734  | 232.8 | 335.8 | 1.44× | 9,194 |
| 4 | 100,000 | 3 | 9,507  | 74.7  | 92.7  | 1.24× | 9,278 |

~1.2-1.4x faster single-stream decode than the base model's shipped MTP, at the same (lossless) quality.

## Credits

- [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash): the base model (MIT).
- [DeepSeek-AI / DeepSpec](https://github.com/deepseek-ai/DeepSpec) and [DeepSeek-V4-Flash-DSpark](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark): the DSpark technique and draft weights.
- [local-inference-lab/rtx6kpro](https://github.com/local-inference-lab/rtx6kpro): the sm120 Blackwell / CUDA-13.2 vLLM build (`voipmonitor/vllm:eldritch-…`) that serves this model, with the `dspark` speculative method and B12X sparse-MLA stack.
- [`voipmonitor/vllm`](https://hub.docker.com/r/voipmonitor/vllm): the vLLM image registry the image is published to.

## License

MIT, inheriting from the base model.

---

Built by Fraser Price, [@fraserpricee](https://x.com/fraserpricee). Found this useful? A follow is appreciated.