llama.cpp `preset.ini` to specify `gemma4-v2-Q8_0.gguf` over `MTP/gemma-4-12B-it-MTP-Q8_0.gguf`

#49
by avramick - opened

I was having a hard time running this because it was picking up the wrong file, the one in the `MTP/` directory:

├── gemma4-v2-Q8_0.gguf -> ../../blobs/2c20a496baf3e9a3ead59d37c7afe228a863662d58155f360d44eb8b2465cb7f
└── MTP
    └── gemma-4-12B-it-MTP-Q8_0.gguf -> ../../../blobs/145db9094bc0f85f1701e255a2ed216dcc9800fc8bc8631ad00905b456bd451b

but I got it working with this .ini

[gemma-4-12b-agentic-v2]
hf = yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF
hf-file = gemma4-v2-Q8_0.gguf
fit = false
ctx-size = 16384
n-gpu-layers = 99
mmap = false
flash-attn = on
jinja = true
temp = 1.0
top-p = 0.95
top-k = 64

@avramick Spot on, and thanks for writing it up. The `MTP/` file is the speculative-decoding draft, not a standalone model, so a loader that auto-grabs a .gguf can latch onto it by mistake. Pinning `hf-file = gemma4-v2-Q8_0.gguf` the way you did is the clean fix.
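
For anyone scripting their own download step, here is a rough Python sketch of the same idea: pick the main model explicitly and skip anything under the draft directory. This is only an illustration, not how llama.cpp resolves files internally; the helper name `pick_main_gguf` and the `draft_dirs` parameter are made up for this example, and the file list is just the two paths from the tree above.

```python
from pathlib import PurePosixPath


def pick_main_gguf(repo_files, draft_dirs=("MTP",)):
    """Return the only .gguf that does not live under a draft directory.

    Hypothetical helper: `draft_dirs` is an assumption about how the repo
    is laid out (draft weights kept in an MTP/ sub-directory).
    """
    candidates = [
        f for f in repo_files
        if f.endswith(".gguf")
        # parts[:-1] are the directory components of the path
        and not any(part in draft_dirs for part in PurePosixPath(f).parts[:-1])
    ]
    if len(candidates) != 1:
        # Refuse to guess when the choice is ambiguous or empty.
        raise ValueError(f"expected exactly one main model file, found {candidates}")
    return candidates[0]


files = [
    "gemma4-v2-Q8_0.gguf",
    "MTP/gemma-4-12B-it-MTP-Q8_0.gguf",
]
print(pick_main_gguf(files))
```

Failing loudly when there is not exactly one candidate is deliberate: silently taking the first .gguf found is the behaviour that caused the problem in the first place.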

If you ever want the speedup, that MTP file is meant to be paired with the main model as a draft (roughly 1.2–1.3x, lossless). The repo's MTP notes have the known-good llama.cpp setup, since build support for the Gemma 4 draft is a bit picky right now. I'll also add your preset snippet to the model card so others don't trip on this. Thanks again!
