DeepSeek V4 Flash GGUF

GGUF quantizations for deepseek-ai/DeepSeek-V4-Flash

DeepSeek published the original model weights in MXFP4, so the MXFP4 GGUFs in this repo are direct conversions of the original safetensors, created with scripts/convert-to-gguf.sh. The smaller IQ GGUFs are imatrix quantizations created with scripts/quantize.sh.

Quant Recipes

| Recipe | Quant Size | Default type | Tensor-specific overrides | Intended output |
|---|---|---|---|---|
| IQ3_XXS | 106912.23 MiB (3.15 BPW) | Q6_K | ffn_down_exps=iq3_xxs, ffn_gate_exps=iq3_xxs, ffn_up_exps=iq3_xxs | Compact GGUF with all MoE expert FFN projections in iq3_xxs. |
| IQ2_XXS | 76641.23 MiB (2.26 BPW) | Q6_K | ffn_down_exps=iq2_xs, ffn_gate_exps=iq2_xxs, ffn_up_exps=iq2_xxs | Smaller GGUF with the down projection in iq2_xs and gate/up projections in iq2_xxs. |
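
As a sanity check on the sizes above, here is a minimal sketch that relates file size and BPW. It assumes BPW means total file bits divided by the parameter count, which may not be exactly how the figures were computed (metadata overhead is ignored). The helper name `implied_params_b` is made up for this example.

```shell
#!/bin/sh -e

# Hypothetical helper: given a file size in MiB and a bits-per-weight figure,
# print the implied parameter count in billions.
# Assumes BPW = (size in bits) / (number of parameters).
implied_params_b() {
  awk -v mib="$1" -v bpw="$2" \
    'BEGIN { printf "%.1f\n", mib * 1048576 * 8 / bpw / 1e9 }'
}

implied_params_b 106912.23 3.15   # IQ3_XXS row
implied_params_b 76641.23 2.26    # IQ2_XXS row
```

Both rows should come out at roughly the same parameter count, which is a quick way to confirm that the size and BPW columns agree with each other.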

More details in scripts/quantize.sh.

Usage

This is the script I use to run the model with llama-server:

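#!/bin/sh -e

# First shard of the split GGUF; llama.cpp picks up the remaining shards from it.
model="./IQ3_XXS/DeepSeek-V4-Flash-IQ3_XXS-00001-of-00004.gguf"

ctx=131072   # context tokens wanted per slot
parallel=1   # number of parallel slots

# --ctx-size is the total across all slots, so scale it by the slot count.
ctx_size=$((ctx * parallel))

llama-server --no-mmap --no-warmup \
  --model "$model" --ctx-size "$ctx_size" -np "$parallel" \
  --temp 1.0 --top-p 1.0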
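
The model path points at the first of four shards, and all of them need to be in the same directory. The sketch below derives the expected shard names from the first one and reports any that are missing. It assumes the `-NNNNN-of-NNNNN.gguf` naming seen in the path above; `expected_shards` is a hypothetical helper, not part of llama.cpp or this repo.

```shell
#!/bin/sh -e

# Hypothetical helper: print every expected shard path, given the first shard.
# Assumes names of the form <prefix>-00001-of-<total>.gguf.
expected_shards() {
  first=$1
  prefix=${first%-00001-of-*.gguf}   # everything before the shard counter
  total=${first##*-of-}              # e.g. "00004.gguf"
  total=${total%.gguf}               # e.g. "00004"
  count=$(expr "$total" + 0)         # drop leading zeros, e.g. 4
  i=1
  while [ "$i" -le "$count" ]; do
    printf '%s-%05d-of-%s.gguf\n' "$prefix" "$i" "$total"
    i=$((i + 1))
  done
}

model="./IQ3_XXS/DeepSeek-V4-Flash-IQ3_XXS-00001-of-00004.gguf"

# Report missing shards on stderr without aborting the script.
expected_shards "$model" | while read -r shard; do
  [ -f "$shard" ] || echo "missing shard: $shard" >&2
done
```

Running this before starting the server can catch an incomplete download early, instead of failing partway through model loading.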

Model size: 284B params
Architecture: deepseek4