How to use from
Hermes Agent
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf tarruda/DeepSeek-V4-Flash-GGUF:IQ3_XXS
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default tarruda/DeepSeek-V4-Flash-GGUF:IQ3_XXS
Run Hermes
hermes
Quick Links

DeepSeek V4 Flash GGUF

GGUF quantizations for deepseek-ai/DeepSeek-V4-Flash

DeepSeek published the original model weights in MXFP4, so the MXFP4 GGUFs in this repo are direct conversions of those original safetensors.

Quant Recipes

| Recipe | Quant Size | Default type | Tensor-specific overrides | | --- | --- | --- | --- | --- | | IQ3_XXS | 106912.23 MiB (3.15 BPW) | Q6_K | ffn_down_exps=iq3_xxs, ffn_gate_exps=iq3_xxs, ffn_up_exps=iq3_xxs | | IQ2_XS | 82145.23 MiB (2.42 BPW) | Q6_K | ffn_down_exps=iq2_xs, ffn_gate_exps=iq2_xs, ffn_up_exps=iq2_xs |

Usage

This is the script I use to run:

#!/bin/sh -e

model="./IQ3_XXS/DeepSeek-V4-Flash-IQ3_XXS-00001-of-00004.gguf"

ctx=131072
parallel=1

ctx_size=$((ctx * parallel))

llama-server --no-mmap --no-warmup \
  --model $model --ctx-size $ctx_size -np $parallel \
  --temp 1.0 --top-p 1.0
Downloads last month
-
GGUF
Model size
284B params
Architecture
deepseek4
Hardware compatibility
Log In to add your hardware

3-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for tarruda/DeepSeek-V4-Flash-GGUF

Quantized
(84)
this model