DeepSeek V4 Flash GGUF

GGUF quantizations for deepseek-ai/DeepSeek-V4-Flash

DeepSeek published the original model weights in MXFP4, so the MXFP4 GGUFs in this repo are direct conversions of those original safetensors.

Quant Recipes

| Recipe | Quant Size | Default type | Tensor-specific overrides |
| --- | --- | --- | --- |
| IQ3_XXS | 106912.23 MiB (3.15 BPW) | Q6_K | ffn_down_exps=iq3_xxs, ffn_gate_exps=iq3_xxs, ffn_up_exps=iq3_xxs |
| IQ2_XS | 82145.23 MiB (2.42 BPW) | Q6_K | ffn_down_exps=iq2_xs, ffn_gate_exps=iq2_xs, ffn_up_exps=iq2_xs |
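
As a sanity check on the table, the size and BPW columns can be turned back into an approximate parameter count. The sketch below assumes 1 MiB = 2^20 bytes and ignores GGUF metadata overhead; `implied_params` is a hypothetical helper name, and because BPW is rounded to two decimals the result is only approximate.

```shell
#!/bin/sh
# implied_params SIZE_MIB BPW
# Prints the parameter count (in billions, one decimal) implied by a
# quant size in MiB and its bits-per-weight figure.
# Assumption: 1 MiB = 1024 * 1024 bytes; file metadata is ignored.
implied_params() {
  awk -v mib="$1" -v bpw="$2" 'BEGIN {
    bits = mib * 1024 * 1024 * 8      # MiB -> bytes -> bits
    printf "%.1f\n", bits / bpw / 1e9 # bits / (bits per weight) = weights
  }'
}

implied_params 106912.23 3.15   # IQ3_XXS row
implied_params 82145.23 2.42    # IQ2_XS row
```

Both rows work out to roughly 284.7 billion weights, so the two quants are consistent with each other.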

Usage

This is the script I use to run the IQ3_XXS quant with llama-server:

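#!/bin/sh -e

# First shard of the split GGUF; llama.cpp picks up the remaining shards from it.
model="./IQ3_XXS/DeepSeek-V4-Flash-IQ3_XXS-00001-of-00004.gguf"

ctx=131072   # context length per slot
parallel=1   # number of parallel slots

# --ctx-size is the total shared by all slots, so scale it by the slot count.
ctx_size=$((ctx * parallel))

llama-server --no-mmap --no-warmup \
  --model "$model" --ctx-size "$ctx_size" -np "$parallel" \
  --temp 1.0 --top-p 1.0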
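
Because the model is split into several files, it can help to confirm that every shard is present before starting the server. The following is a minimal sketch, not part of the original script: `missing_shards` is a hypothetical helper, and it assumes the `-NNNNN-of-NNNNN.gguf` naming seen in the model path above.

```shell
#!/bin/sh
# missing_shards FIRST_SHARD_PATH
# Prints the path of every shard that is missing, one per line.
# Assumption: shards are named <prefix>-00001-of-0000N.gguf, as in the
# model path used by the run script.
missing_shards() {
  first=$1
  # Common prefix: everything before "-00001-of-0000N.gguf".
  prefix=${first%-[0-9][0-9][0-9][0-9][0-9]-of-*.gguf}
  # Zero-padded shard count, e.g. "00004".
  total=${first##*-of-}
  total=${total%.gguf}
  # Strip leading zeros so the shell does not treat the count as octal.
  count=$(printf '%s' "$total" | sed 's/^0*//')
  i=1
  while [ "$i" -le "$count" ]; do
    shard=$(printf '%s-%05d-of-%s.gguf' "$prefix" "$i" "$total")
    [ -f "$shard" ] || printf '%s\n' "$shard"
    i=$((i + 1))
  done
}
```

Calling `missing_shards "$model"` before `llama-server` and aborting when the output is non-empty gives a clearer error than a failed load partway through a large download.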

Model size: 284B params
Architecture: deepseek4