I'm trying to download the 12gb version....

#30
by crhylove - opened

And it's showing up as 465 MB. v1 has been working, but v2 is better, right?

Ah, that 465 MB file is the MTP draft model (the small speculative-decoding helper inside the MTP/ folder), not the actual model. That's why it's so small.

The full "12 GB version" you want is gemma4-v2-Q8_0.gguf in the root folder (11.8 GB). Grab that one and you're set. If you want something lighter that still runs great, gemma4-v2-Q4_K_M.gguf (6.9 GB) is the sweet spot.

The MTP file is optional. It only helps if you're running speculative decoding in llama.cpp to speed up generation, and it loads alongside the main model, not instead of it.

And yes, v2 is the upgrade for agentic / tool-driven coding; that's what it was tuned for. If v1 has been working well for you, it's still solid; v2 just pushes the agentic and coding side further.

But the command from "Use this model", ollama run hf.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF:Q8_0, pulls the 400 MB file. Shouldn't it pull the 12 GB model?

@ly2025 You're right, that's a real gotcha: it's pulling the wrong file. The repo has two files with "Q8_0" in the name: the actual model (gemma4-v2-Q8_0.gguf, ~11.8 GB) and the tiny MTP draft head (MTP/gemma-4-12B-it-MTP-Q8_0.gguf, ~400 MB). Ollama's :Q8_0 tag is ambiguous, so it grabs the small MTP one. (A couple of other tools auto-pick it too; it's on my list to make the repo less confusing.)
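To make the ambiguity concrete, here is a rough Python sketch of one way to pick the main model when a quant tag matches more than one file. This is not how Ollama or any other tool actually resolves tags; pick_main_gguf and its skip_dirs argument are hypothetical names made up for illustration, and the sizes are the approximate ones quoted in this thread.

```python
# Hypothetical helper: NOT Ollama's real resolver, just an illustration.
def pick_main_gguf(files, quant, skip_dirs=("MTP/",)):
    """Return the path of the largest file matching `quant`, ignoring skip_dirs.

    files: list of (path, size_in_bytes) tuples.
    """
    matches = [
        (path, size)
        for path, size in files
        if quant in path and not path.startswith(skip_dirs)
    ]
    if not matches:
        raise ValueError(f"no file matches quant tag {quant!r}")
    # Prefer the largest match: a draft head is far smaller than the model.
    return max(matches, key=lambda item: item[1])[0]


# Approximate sizes taken from the thread (GB converted to bytes).
repo_files = [
    ("gemma4-v2-Q8_0.gguf", 11_800_000_000),
    ("gemma4-v2-Q4_K_M.gguf", 6_900_000_000),
    ("MTP/gemma-4-12B-it-MTP-Q8_0.gguf", 400_000_000),
]

# A naive substring match on the tag hits both Q8_0 files.
naive = [path for path, _ in repo_files if "Q8_0" in path]
print(naive)
print(pick_main_gguf(repo_files, "Q8_0"))
```

Either rule alone (skip the MTP/ folder, or prefer the largest match) would be enough here; the sketch applies both only to be defensive.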

Reliable fix: download the exact main file and import it with a Modelfile so the auto-resolver can't pick the wrong one:

hf download yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF gemma4-v2-Q8_0.gguf --local-dir .
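If you want to be sure you got the real model and not the small draft file, a quick sanity check along these lines may help. It is only a sketch: check_gguf is a hypothetical helper, and the 1 GB default threshold is an arbitrary cutoff chosen because the MTP file is around 0.4 GB while the main Q8_0 file is around 11.8 GB. It relies on GGUF files starting with the four-byte magic "GGUF".

```python
import os

GGUF_MAGIC = b"GGUF"  # GGUF files begin with these four bytes


def check_gguf(path, min_bytes=1_000_000_000):
    """Return (ok, size_bytes) for the file at `path`.

    ok is True when the file starts with the GGUF magic and is at least
    min_bytes long. The default threshold is an assumption for this repo,
    not a general rule.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        magic = handle.read(4)
    return (magic == GGUF_MAGIC and size >= min_bytes), size
```

For example, check_gguf("gemma4-v2-Q8_0.gguf") should return True along with a size near 11.8 GB if the download went as described above, and False if you ended up with the ~400 MB MTP file.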

Then a Modelfile:

FROM ./gemma4-v2-Q8_0.gguf
PARAMETER num_ctx 8192
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
PARAMETER repeat_penalty 1.1

Then create and run the model:

ollama create gemma4-v2 -f Modelfile
ollama run gemma4-v2

Two notes. First, for proper tool-calling/agentic use, copy the TEMPLATE from the official Gemma 4 (ollama pull gemma4, then ollama show --modelfile gemma4) into your Modelfile; a bare import can report "does not support chat." Second, keep num_ctx capped like above so Gemma 4's 256K context doesn't blow up your KV cache. For the most reliable tool parsing overall, llama.cpp's llama-server --jinja is still the best path.
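To see why the num_ctx cap matters, here is a back-of-the-envelope KV cache estimate in Python. The layer count, KV head count, and head dimension below are placeholder values picked for illustration, not Gemma 4's actual configuration, and the formula assumes every layer caches the full context in 16-bit precision. Models that use sliding-window attention on some layers, or a quantized cache, need less than this.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Rough upper bound on KV cache size for full attention on every layer.

    Each layer stores one K and one V tensor, each holding
    n_ctx * n_kv_heads * head_dim elements.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Placeholder dimensions (assumptions, NOT the real Gemma 4 config).
dims = dict(n_layers=48, n_kv_heads=8, head_dim=256)

capped = kv_cache_bytes(8192, **dims)        # num_ctx from the Modelfile
uncapped = kv_cache_bytes(256 * 1024, **dims)  # the full 256K context

print(f"num_ctx 8192: {capped / 2**30:.1f} GiB")
print(f"num_ctx 256K: {uncapped / 2**30:.1f} GiB")
print(f"ratio: {uncapped // capped}x")
```

Whatever the real dimensions are, the cache grows linearly with num_ctx under this formula, so going from 8192 to 256K multiplies it by 32.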
