Update DGX Spark 200K serving recipe
README.md
CHANGED
@@ -15,11 +15,12 @@ tags:
 base_model: deepseek-ai/DeepSeek-V4-Flash
 ---

-#
+# DeepSeek-V4-Flash-Spark-Mini

-This is the 162B / K144 REAP-pruned DeepSeek V4 Flash model. The validated single-DGX Spark serving recipe is maintained here:
+This is the 162B / K144 REAP-pruned DeepSeek V4 Flash model, served as `DeepSeek-V4-Flash-Spark-Mini`. The validated single-DGX Spark serving recipe is maintained here:

--
+- One-command Spark wrapper: https://github.com/0xSero/deepseek-spark
+- Runtime module: https://github.com/0xSero/deepseek-v4-flash-spark-200k
 - Docker registry target: `ghcr.io/0xsero/deepseek-v4-flash-spark-vllm:cutlass451-g27`
 - Validated local Docker image: `vllm-node-dsv4-cutlass451:latest` / `sha256:5df60ebb9c10dfb86d5946cae8244adfe65a7fd405401bd542ecf22d5c497a4a`
 - Model repo used by the recipe: `0xSero/DeepSeek-V4-Flash-162B-codex-K144-REAP`
@@ -30,7 +31,7 @@ This is the 162B / K144 REAP-pruned DeepSeek V4 Flash model. The validated singl
 Run this on the DGX Spark. `HF_TOKEN` is only required if the model repo is private or not already cached on the machine.

 ```bash
-HF_TOKEN=... bash -lc 'set -euo pipefail; cd /home/sero/spark; rm -rf deepseek-
+HF_TOKEN=... bash -lc 'set -euo pipefail; cd /home/sero/spark; rm -rf deepseek-spark; git clone https://github.com/0xSero/deepseek-spark.git; cd deepseek-spark; ./setup.sh full k144'
 ```

 Do not commit tokens into the repo or a model card. Pass them only through the environment for the one command above.
@@ -42,7 +43,7 @@ The profile lives at `configs/k144-nospec-200k.env` in the GitHub repo.
 ```bash
 MODEL_REPO=0xSero/DeepSeek-V4-Flash-162B-codex-K144-REAP
 MODEL_REVISION=d663e8fb16809f6619000648b187b257249ed824
-SERVED_MODEL_NAME=
+SERVED_MODEL_NAME=DeepSeek-V4-Flash-Spark-Mini
 CONTEXT_LENGTH=200000
 KV_CACHE_MEMORY_BYTES=14G
 MAX_NUM_BATCHED_TOKENS=8192
@@ -50,14 +51,13 @@ MAX_NUM_SEQS=1
 GPU_MEMORY_UTILIZATION=0.88
 WATCHDOG_MIN_AVAILABLE_KB=8388608
 KV_CACHE_DTYPE=fp8
-
-THINKING=false
+THINKING=true
 SPECULATIVE_CONFIG=
 VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP=0
 VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1
 ```

-The launcher enables DeepSeek V4 tokenizer, reasoning parser, tool-call parser, prefix caching, FP8 KV, and CUDA
+The launcher enables DeepSeek V4 tokenizer, reasoning parser, tool-call parser, prefix caching, FP8 KV, and CUDA graph capture.

 ## Docker runtime

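Because this change renames `SERVED_MODEL_NAME`, clients that hard-code the old name will likely need updating. The following is a minimal sketch, not part of the recipe: it parses a subset of the updated profile values and builds an OpenAI-style completions payload that uses the served name. The helpers `parse_env_profile` and `build_completion_request` are hypothetical, the payload shape assumes the launcher exposes vLLM's OpenAI-compatible API, and the GiB conversion assumes `WATCHDOG_MIN_AVAILABLE_KB` counts 1024-byte units as `/proc/meminfo` does. How the launcher itself reads these values is not shown in the diff.

```python
import json

# Values copied from the updated profile in the diff (subset, for illustration).
PROFILE_TEXT = """\
MODEL_REPO=0xSero/DeepSeek-V4-Flash-162B-codex-K144-REAP
SERVED_MODEL_NAME=DeepSeek-V4-Flash-Spark-Mini
CONTEXT_LENGTH=200000
KV_CACHE_MEMORY_BYTES=14G
MAX_NUM_BATCHED_TOKENS=8192
GPU_MEMORY_UTILIZATION=0.88
WATCHDOG_MIN_AVAILABLE_KB=8388608
KV_CACHE_DTYPE=fp8
THINKING=true
SPECULATIVE_CONFIG=
"""


def parse_env_profile(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    profile = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        profile[key.strip()] = value.strip()
    return profile


def build_completion_request(profile: dict, prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style completions payload (hypothetical helper).

    The "model" field must match the name the server was started with,
    which this profile sets through SERVED_MODEL_NAME.
    """
    return {
        "model": profile["SERVED_MODEL_NAME"],
        "prompt": prompt,
        "max_tokens": max_tokens,
    }


profile = parse_env_profile(PROFILE_TEXT)

# Assumption: the watchdog threshold is in KiB, so divide by 1024 * 1024 for GiB.
watchdog_gib = int(profile["WATCHDOG_MIN_AVAILABLE_KB"]) / (1024 * 1024)

payload = build_completion_request(profile, "Once upon a time,")
print(f"watchdog floor: {watchdog_gib:.0f} GiB")
print(json.dumps(payload, indent=2))
```

Under the KiB assumption the watchdog threshold works out to 8 GiB of available memory. The empty `SPECULATIVE_CONFIG=` value parses to an empty string, which is consistent with the `nospec` profile name.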