---
license: mit
base_model:
- deepseek-ai/DeepSeek-V4-Flash
- deepseek-ai/DeepSeek-V4-Flash-DSpark
tags:
- deepseek
- fp8
- vllm
- blackwell
- rtx-pro-6000
- speculative-decoding
- dspark
---

# DeepSeek-V4-Flash-DSpark

*Stock DeepSeek-V4-Flash with DSpark speculative decoding. Faster than MTP on 2-4x RTX Pro 6000. By [Fraser Price](https://x.com/fraserpricee).*

The official [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) (FP8 weights + FP4 experts, unmodified) with DeepSeek's DSpark draft module bolted on. DSpark (from [DeepSpec](https://github.com/deepseek-ai/DeepSpec)) is block-wise speculative decoding: it drafts several tokens per step through a small Markov head, so single-stream decode is ~1.2-1.4x faster than the model's stock MTP at identical quality (speculative decoding is lossless).

It runs on a Blackwell / CUDA-13.2 vLLM image that ships the `dspark` speculative method ([`voipmonitor/vllm:eldritch-enlightenment-v2226f26-b12x15cd38c-cu132-20260629`](https://hub.docker.com/r/voipmonitor/vllm)), the image from [local-inference-lab/rtx6kpro](https://github.com/local-inference-lab/rtx6kpro).

## Update

The serving image has moved from the original `fraserpricee/vllm:dspark-cu132-20260627` to the [local-inference-lab](https://github.com/local-inference-lab/rtx6kpro) vLLM build (`voipmonitor/vllm:eldritch-enlightenment-v2226f26-b12x15cd38c-cu132-20260629`), a more performant DSpark + sparse-MLA implementation on Blackwell. All credit to them! The performance numbers below were measured on this V2 build; the weights are unchanged.

## Related models

| Repo | What |
|---|---|
| [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | the official base model |
| **DeepSeek-V4-Flash-DSpark** | this repo (stock + DSpark) |
| [DeepSeek-V4-Flash-Abliterated-DSpark](https://huggingface.co/fraserprice/DeepSeek-V4-Flash-Abliterated-DSpark) | abliterated (uncensored) + DSpark |

## Quick start

> Needs Docker with the NVIDIA Container Toolkit and 2-4 GPUs (see [Hardware](#hardware)).

```bash
curl -fsSL https://huggingface.co/fraserprice/DeepSeek-V4-Flash-DSpark/resolve/main/run.sh -o run.sh && bash run.sh
```

That one command grabs the launcher, downloads the weights (~167 GB) into your Hugging Face cache, pulls the inference image, and serves an OpenAI-compatible API on `http://localhost:8000/v1`. No other files or arguments needed.

On 4 GPUs:

```bash
GPUS=0,1,2,3 TP=4 bash run.sh
```

Everything is overridable by env var:

| Variable | Default | Meaning |
|---|---|---|
| `GPUS` | `0,1` | GPU indices (comma-separated) |
| `TP` | `2` | Tensor-parallel size; set to the number of GPUs |
| `DSPARK_TOKENS` | `5` | DSpark draft tokens per step |
| `PORT` | `8000` | API port |
| `MAX_MODEL_LEN` | `262144` | Context length |
| `GPU_MEM_UTIL` | `0.92` | Fraction of VRAM to use |
| `HF_REPO` | `fraserprice/DeepSeek-V4-Flash-DSpark` | repo to download |
| `MODEL_DIR` | *(HF cache)* | serve a local dir instead of downloading |
| `SERVED_NAME` | `DeepSeek-V4-Flash-DSpark` | API model name |
| `IMAGE` | *(pinned DSpark image)* | inference container |

## Hardware

Tuned for NVIDIA RTX Pro 6000 Blackwell (96 GB, sm_120). Needs roughly 180 GB total VRAM for weights + DSpark module + KV cache at long context, split across GPUs by tensor parallelism.

| Setup | Works? |
|---|---|
| **2 x RTX Pro 6000 (96 GB), `TP=2`** | ✅ reference config |
| 4 x RTX Pro 6000, `TP=4` | ✅ more KV headroom / throughput |
| < ~180 GB total VRAM | ❌ won't fit |

The serving image targets Blackwell (sm_120) / CUDA 13.2 and adds the `dspark` speculative method to vLLM. See [local-inference-lab/rtx6kpro](https://github.com/local-inference-lab/rtx6kpro) for the image source.

## Performance

Measured on RTX Pro 6000 Blackwell (PCIe 5.0), FP8 weights + FP8 KV cache, `gpu_memory_utilization=0.92`, DSpark `num_speculative_tokens=5`, 256 generated tokens/request. `PP` is prefill throughput (tok/s); `TG` is per-request decode throughput (tok/s); `total` is aggregate throughput (tok/s) across concurrent requests. `MTP TG` is the base model's shipped single-layer-MTP decode for reference; `Decode ↑` is DSpark's gain over it.

> These numbers were measured on the [abliterated DSpark twin](https://huggingface.co/fraserprice/DeepSeek-V4-Flash-Abliterated-DSpark) and apply identically here: same architecture, precision, and DSpark module. Abliteration is a rank-3 projection on the base weights that does not change compute.

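
The `Decode ↑` values in the throughput table are consistent with a plain ratio of `DSpark TG` to `MTP TG`. As a minimal sketch for rechecking a row (the `speedup` helper is hypothetical and not part of `run.sh`; the inputs are copied from the throughput table):

```bash
# Hypothetical helper: DSpark TG divided by MTP TG, rounded to two decimals.
speedup() {
  awk -v dspark="$1" -v mtp="$2" 'BEGIN { printf "%.2f\n", dspark / mtp }'
}

# 2 GPUs, prompt 1,000, conc 1
speedup 224.7 190.9
# 4 GPUs, prompt 100,000, conc 1
speedup 335.8 232.8
```

This prints `1.18` and `1.44`, matching the `Decode ↑` entries for those two rows.
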
### Latency

| GPUs | prompt | conc | TTFT p50 | ITL p50 | E2E p50 |
|-----:|--------:|-----:|----------:|--------:|--------:|
| 2 | 1,000 | 1 | 155 ms | 15.6 ms | 1.34 s |
| 2 | 1,000 | 3 | 435 ms | 23.1 ms | 2.04 s |
| 2 | 10,000 | 1 | 1,232 ms | 15.9 ms | 2.32 s |
| 2 | 10,000 | 3 | 2,495 ms | 23.8 ms | 5.30 s |
| 2 | 100,000 | 1 | 14,568 ms | 16.4 ms | 15.71 s |
| 2 | 100,000 | 3 | 29,078 ms | 26.1 ms | 44.77 s |
| 4 | 1,000 | 1 | 127 ms | 11.9 ms | 0.96 s |
| 4 | 1,000 | 3 | 259 ms | 16.3 ms | 1.52 s |
| 4 | 10,000 | 1 | 983 ms | 12.0 ms | 1.87 s |
| 4 | 10,000 | 3 | 2,229 ms | 16.9 ms | 4.06 s |
| 4 | 100,000 | 1 | 12,772 ms | 12.5 ms | 13.59 s |
| 4 | 100,000 | 3 | 26,698 ms | 18.9 ms | 40.11 s |

### Throughput (tok/s)

| GPUs | prompt | conc | PP | MTP TG | DSpark TG | Decode ↑ | total |
|-----:|--------:|-----:|-------:|-------:|----------:|:--------:|------:|
| 2 | 1,000 | 1 | 8,080 | 190.9 | 224.7 | 1.18× | 1,160 |
| 2 | 1,000 | 3 | 8,377 | 138.4 | 152.5 | 1.10× | 2,120 |
| 2 | 10,000 | 1 | 10,081 | 191.5 | 239.5 | 1.25× | 5,469 |
| 2 | 10,000 | 3 | 10,050 | 90.3 | 106.6 | 1.18× | 7,046 |
| 2 | 100,000 | 1 | 8,525 | 185.4 | 247.1 | 1.33× | 7,965 |
| 2 | 100,000 | 3 | 8,585 | 60.4 | 67.1 | 1.11× | 8,320 |
| 4 | 1,000 | 1 | 9,769 | 233.3 | 322.7 | 1.38× | 1,605 |
| 4 | 1,000 | 3 | 10,578 | 181.9 | 212.8 | 1.17× | 2,927 |
| 4 | 10,000 | 1 | 12,645 | 234.5 | 314.3 | 1.34× | 6,984 |
| 4 | 10,000 | 3 | 12,642 | 115.0 | 148.3 | 1.29× | 9,122 |
| 4 | 100,000 | 1 | 9,734 | 232.8 | 335.8 | 1.44× | 9,194 |
| 4 | 100,000 | 3 | 9,507 | 74.7 | 92.7 | 1.24× | 9,278 |

~1.2-1.4x faster single-stream decode than the base model's shipped MTP, at the same (lossless) quality.

## Credits

- [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash): the base model (MIT).
- [DeepSeek-AI / DeepSpec](https://github.com/deepseek-ai/DeepSpec) and [DeepSeek-V4-Flash-DSpark](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark): the DSpark technique and draft weights.
- [local-inference-lab/rtx6kpro](https://github.com/local-inference-lab/rtx6kpro): the sm120 Blackwell / CUDA-13.2 vLLM build (`voipmonitor/vllm:eldritch-…`) that serves this model, with the `dspark` speculative method and B12X sparse-MLA stack.
- [`voipmonitor/vllm`](https://hub.docker.com/r/voipmonitor/vllm): the Docker Hub repository where the serving image is published.

## License

MIT, inheriting from the base model.

---

Built by Fraser Price, [@fraserpricee](https://x.com/fraserpricee). Found this useful? A follow is appreciated.