DeepSeek-V4-Flash-GGUF
This is a GGUF model for DeepSeek-V4-Flash, created with the convert_hf_to_gguf.py script from the merged commit of https://github.com/ggml-org/llama.cpp/pull/24162 (8c146a83, b9840), thanks to the amazing work of u/fairydreaming and u/am17an.
(The weights are identical to those uploaded previously; the only difference is that the chat template is now included in the GGUF.)
The script does 2 things:
- Repacks routed experts tensors in FP4 to MXFP4.
- Quantizes other tensors in FP8 to Q8_0.
The GGUF model can be considered lossless relative to the original safetensors weights.
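To make the second step more concrete, here is a minimal NumPy sketch of Q8_0-style block quantization. It assumes the commonly described Q8_0 layout (blocks of 32 values sharing one fp16 scale, with int8 quants) and is not the conversion script's actual code; the function names are hypothetical, and the rounding mode (`np.round`) is an assumption that may differ from the reference implementation.

```python
import numpy as np

QK8_0 = 32  # assumed block size of Q8_0: 32 values share one scale


def quantize_q8_0(x):
    """Quantize a flat float array (length divisible by 32) into (scales, int8 quants)."""
    blocks = np.asarray(x, dtype=np.float32).reshape(-1, QK8_0)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    d = amax / 127.0  # per-block scale so the largest value maps to +/-127
    safe_d = np.where(d > 0, d, 1.0)  # avoid dividing by zero for all-zero blocks
    q = np.round(blocks / safe_d).astype(np.int8)
    return d.astype(np.float16), q


def dequantize_q8_0(d, q):
    """Reconstruct float32 values from per-block scales and int8 quants."""
    return (d.astype(np.float32) * q.astype(np.float32)).reshape(-1)


# Round trip on random data (illustrative values only, not model weights).
rng = np.random.default_rng(0)
x = rng.standard_normal(4 * QK8_0).astype(np.float32)
d, q = quantize_q8_0(x)
y = dequantize_q8_0(d, q)
max_abs_err = float(np.max(np.abs(x - y)))
```

Under this layout each block stores 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 weights, or 8.5 bits per weight.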
The size of the model is about 146 GiB. It fits comfortably on a machine with 160 GiB of RAM and 48 GiB of VRAM, and runs reasonably well:
$ build/bin/llama-server -m ~/models/DeepSeek-V4-Flash.gguf -cram 0 --jinja --no-mmap -dev CUDA0,CUDA1 --fit on -c 32768 -b 2048 -ub 2048
...
3.23.716.821 I slot print_timing: id 3 | task 0 | prompt eval time = 15977.50 ms / 5407 tokens ( 2.95 ms per token, 338.41 tokens per second)
3.23.716.827 I slot print_timing: id 3 | task 0 | eval time = 43043.40 ms / 586 tokens ( 73.45 ms per token, 13.61 tokens per second)
3.23.716.828 I slot print_timing: id 3 | task 0 | total time = 59020.89 ms / 5993 tokens
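The throughput figures in these log lines follow directly from the elapsed time and token count. A small sketch that recomputes them from the `print_timing` lines above; the regular expression and the helper name are my own and only assume the line format shown here.

```python
import re

# Matches e.g. "prompt eval time = 15977.50 ms / 5407 tokens"
TIMING_RE = re.compile(r"(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens")


def tokens_per_second(line):
    """Return (phase, tokens/s) for a print_timing line, or None if it does not match."""
    m = TIMING_RE.search(line)
    if m is None:
        return None
    phase, ms, n_tokens = m.group(1), float(m.group(2)), int(m.group(3))
    return phase, n_tokens / (ms / 1000.0)


log = [
    "slot print_timing: id 3 | task 0 | prompt eval time = 15977.50 ms / 5407 tokens",
    "slot print_timing: id 3 | task 0 | eval time = 43043.40 ms / 586 tokens",
    "slot print_timing: id 3 | task 0 | total time = 59020.89 ms / 5993 tokens",
]
results = [tokens_per_second(line) for line in log]
for r in results:
    if r is not None:
        print(f"{r[0]}: {r[1]:.2f} tokens per second")
```

The "total time" line carries no phase-specific rate, so the helper returns `None` for it.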
1M context
After applying the lightning indexer change from https://github.com/ggml-org/llama.cpp/pull/24231, we can have 1M context with just 6 GiB of VRAM!
To join the One million tokens prompt club, apply lcpp-dsv4-lid-combo.diff, which contains:
- commit 57a667e9 cherry-picked from https://github.com/ggml-org/llama.cpp/pull/24231
- the fixes from https://github.com/ggml-org/llama.cpp/pull/24776 and https://github.com/ggml-org/llama.cpp/pull/24945
- a small change to deepseek4.cpp to wire up the lightning indexer, made by DSV4 itself, of course
The speed is decent on my machine with 2 x RTX PRO 4000:
$ ~/repo/llama.cpp/build/bin/llama-batched-bench -m ~/models/DeepSeek-V4-Flash-r3.gguf -b 2048 -ub 2048 -npl 1 -npp 8192,16384,32768,65536,131072,262144,524288,1048064 -ntg 128 -fa 1 -cmoe --no-repack --no-mmap
llama_batched_bench: n_kv_max = 1048576, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 112, n_threads_batch = 112
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 8192 | 128 | 1 | 8320 | 25.401 | 322.50 | 8.754 | 14.62 | 34.155 | 243.60 |
| 16384 | 128 | 1 | 16512 | 58.092 | 282.03 | 8.827 | 14.50 | 66.920 | 246.74 |
| 32768 | 128 | 1 | 32896 | 148.657 | 220.43 | 9.158 | 13.98 | 157.815 | 208.45 |
| 65536 | 128 | 1 | 65664 | 426.688 | 153.59 | 10.113 | 12.66 | 436.801 | 150.33 |
|131072 | 128 | 1 | 131200 | 1373.003 | 95.46 | 11.775 | 10.87 | 1384.778 | 94.74 |
|262144 | 128 | 1 | 262272 | 4838.849 | 54.17 | 15.295 | 8.37 | 4854.144 | 54.03 |
|524288 | 128 | 1 | 524416 | 18152.150 | 28.88 | 22.333 | 5.73 | 18174.482 | 28.85 |
At long context, the PP and TG speeds start to converge. I stopped the run before it reached 1M, as it was getting slow.
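One way to see the convergence is to divide the prompt-processing speed by the generation speed for each row. The sketch below copies the S_PP and S_TG columns from the table above; the helper name is hypothetical.

```python
# (prompt tokens, S_PP t/s, S_TG t/s), copied from the llama-batched-bench table
ROWS = [
    (8192, 322.50, 14.62),
    (16384, 282.03, 14.50),
    (32768, 220.43, 13.98),
    (65536, 153.59, 12.66),
    (131072, 95.46, 10.87),
    (262144, 54.17, 8.37),
    (524288, 28.88, 5.73),
]


def pp_tg_ratios(rows):
    """Return (prompt tokens, S_PP / S_TG) for each benchmark row."""
    return [(n_pp, s_pp / s_tg) for n_pp, s_pp, s_tg in rows]


ratios = pp_tg_ratios(ROWS)
for n_pp, ratio in ratios:
    print(f"{n_pp:>7} prompt tokens: PP is {ratio:.2f}x faster than TG")
```

The ratio falls from roughly 22x at 8K prompt tokens to roughly 5x at 512K, because prompt processing slows down much faster than generation as the context grows.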
KLD comparison
Using wiki.test.raw with -c 512, llama-perplexity was first used to generate the base logits from this near-lossless quant, and then to measure the KLD of the 2 quants from antirez:
- kld/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf.kld.txt
- kld/DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf.kld.txt
These 2 quants work amazingly well. However, their KLD scores are quite bad.
Summary for the Q2 quant:
Mean PPL(Q) : 6.155042 ± 0.042084
Mean PPL(base) : 4.738434 ± 0.030638
Cor(ln(PPL(Q)), ln(PPL(base))): 87.64%
...
Mean KLD: 0.422277 ± 0.002287
Maximum KLD: 15.166180
99.9% KLD: 8.191362
...
RMS Δp : 21.606 ± 0.086 %
Same top p: 77.586 ± 0.110 %
Summary for the Q4 quant:
Mean PPL(Q) : 4.570599 ± 0.029570
Mean PPL(base) : 4.738434 ± 0.030638
Cor(ln(PPL(Q)), ln(PPL(base))): 96.77%
...
Mean KLD: 0.115089 ± 0.000817
Maximum KLD: 9.996851
99.9% KLD: 3.656791
...
RMS Δp : 11.098 ± 0.065 %
Same top p: 88.794 ± 0.083 %
I'm not sure whether I did something wrong in the measurement.
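For readers unfamiliar with these metrics, the following NumPy sketch shows how a mean KLD and a "same top" rate can be computed from two sets of logits. It assumes the KL divergence is taken from the base distribution to the quantized one, which is my understanding of the tool's convention rather than something stated here; it is not llama-perplexity's code, and the function names and toy logits are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def kld_stats(base_logits, quant_logits):
    """Return (mean KLD, max KLD, fraction of positions with the same top token)."""
    lp = log_softmax(np.asarray(base_logits, dtype=np.float64))
    lq = log_softmax(np.asarray(quant_logits, dtype=np.float64))
    # KL(base || quant) per token position (assumed direction)
    kld = (np.exp(lp) * (lp - lq)).sum(axis=-1)
    same_top = (lp.argmax(axis=-1) == lq.argmax(axis=-1)).mean()
    return float(kld.mean()), float(kld.max()), float(same_top)


# Toy logits: 8 positions over a 16-token vocabulary, with noise standing in for quantization error.
rng = np.random.default_rng(0)
base = rng.standard_normal((8, 16))
quant = base + 0.5 * rng.standard_normal((8, 16))
mean_kld, max_kld, same_top = kld_stats(base, quant)
```

Note that in the summaries above the Q4 quant has a lower mean perplexity than the base (4.570599 vs 4.738434) while its mean KLD is still 0.115089, so perplexity alone does not show how far the output distribution has drifted.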
Model tree for sokann/DeepSeek-V4-Flash-GGUF
Base model
deepseek-ai/DeepSeek-V4-Flash