How to use from
Ollama
ollama run hf.co/catlilface/Gemma-4-26B-A4B-NVFP4-GGUF:NVFP4
Quick Links

Gemma 4 26B A4B NVFP4 GGUF

About

This model is an unmodified GGUF quantization of nvidia/Gemma-4-26B-A4B-NVFP4, made with the llama.cpp conversion tool. Please refer to the official NVIDIA repository for quality metrics. As of now, this repository contains NVFP4 GGUF in two variants: NVFP4 and NVFP4_FP8 of Gemma 4 26B A4B, and a projector file.

Inference

It is recommended to use llamacpp with docker. You can start inferencing this model with the command below:

docker run --rm \
  --runtime nvidia \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_HUB_CACHE=/root/.cache/huggingface/hub \
  -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda13 \
  -hf catlilface/Gemma-4-26B-A4B-NVFP4-GGUF:NVFP4 \
  -c 4000

Adjust llama.cpp parameters to better fit your hardware.

Acknowledgements

Special thanks to ynankani for his contribution, which made this quantization possible.

Downloads last month
2,483
GGUF
Model size
25B params
Architecture
gemma4
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for catlilface/Gemma-4-26B-A4B-NVFP4-GGUF

Quantized
(1)
this model