💻 Gemma-4-12B-Coder (fable5 × composer2.5) — NVFP4A16 for vLLM ✨
A faithful 4-bit build of yuxinlu1's coding model, now runnable in vLLM — with a bundled MTP draft for ~1.6× interactive speed. 🚀
TL;DR: A local Python-coding assistant that thinks before it codes. 8.25 GB, runs on one 16 GB Blackwell GPU, native in vLLM (no --quantization flag). Bundled speculative-decode draft included. 💚
🙏 Credit & what this is
This is a weight-only NVFP4 (W4A16) re-quantization of yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1 — full credit and thanks to @yuxinlu1 for the model and the lovely training recipe. Please ⭐ and follow the original repo; if you want a v2, that's the author's signal to watch.
The author's design intent (preserved here): a focused fine-tune of google/gemma-4-12B-it on verifiable Python coding — distilled from real chain-of-thought (Composer 2.5, kept only where the code passed its tests) plus a Fable 5 "second-attempt" set that recovers the hard cases the main teacher missed. The result reasons in the open (edge cases, complexity) in Gemma's native thinking channel, then emits a clean, runnable solution. It is Python/algorithmic-focused, de-refused (not safety-aligned — add your own guardrails), and English-centric.
Why this build exists: the author shipped GGUF only (great for llama.cpp). This repo reconstructs a vLLM-native artifact so you can serve it with continuous batching, tensor-parallelism, and speculative decoding on Blackwell GPUs.
How it was made (provenance, for the curious): the author's Q8_0 GGUF (≈lossless) was dequantized to BF16, the gemma-4 language tensors grafted onto a same-arch gemma4_unified skeleton, then quantized to NVFP4A16 with llm-compressor. Quality was verified to match the Q8 source (see below). W4A16 (weights FP4, activations BF16) is used deliberately: the base is non-QAT, where full W4A4 collapses on this architecture — weight-only keeps it robust.
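To make the first provenance step concrete, here is a minimal sketch of dequantizing a single Q8_0 block. It assumes the standard ggml layout (a 2-byte FP16 scale followed by 32 signed 8-bit values); the function name is hypothetical and this is an illustration, not the script used for this build.

```python
import struct

QK8_0 = 32  # values per Q8_0 block (assumed standard ggml layout)


def dequantize_q8_0_block(block):
    """Return the 32 float values encoded in one 34-byte Q8_0 block."""
    if len(block) != 2 + QK8_0:
        raise ValueError("a Q8_0 block is 2 scale bytes + 32 int8 values")
    # Little-endian half-precision scale, then 32 signed bytes.
    (scale,) = struct.unpack_from("<e", block, 0)
    quants = struct.unpack_from(f"<{QK8_0}b", block, 2)
    return [scale * q for q in quants]


# Hypothetical block: scale 0.5, quantized values -16..15.
example_block = struct.pack("<e", 0.5) + struct.pack("<32b", *range(-16, 16))
values = dequantize_q8_0_block(example_block)
```

Each weight is stored as an 8-bit integer times a shared scale, which is why the source is described as ≈lossless before the NVFP4 step.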
📊 How good is it? (independent eval, greedy pass@1)
| Benchmark | Score |
|---|---|
| HumanEval | 90.2% (148/164) |
| MBPP | 85.7% (366/427) |
| HumanEval[:50] — this NVFP4 build vs the Q8 source | 96% = 96% (parity, no quality loss) |
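As a quick check on the arithmetic: with greedy decoding there is one sample per problem, so pass@1 is simply the fraction of problems solved. The helper name below is illustrative.

```python
def pass_at_1(passed, total):
    """Greedy pass@1 as a percentage: problems solved / problems attempted."""
    return 100.0 * passed / total


humaneval = pass_at_1(148, 164)
mbpp = pass_at_1(366, 427)
print(f"HumanEval {humaneval:.1f}%  MBPP {mbpp:.1f}%")
```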
Strong at: hard algorithms (DP, graphs, Fenwick/segment trees, bitmask DP), bug-fixing & refactoring (accurate root-cause + genuine O(n²)→O(n) rewrites that preserve semantics), and faithful open reasoning that matches the emitted code. Japanese prompts cause no measurable Python-quality drop.
⚠️ Know the one sharp edge (verified): on quant / time-series code it can write a look-ahead bias (e.g. an unshifted position × a forward-shifted return), and its reasoning sometimes states the correct rule while the code does the opposite. Do not ship its pandas/numpy back-test or accounting code unreviewed — gate it. It's a superb algorithm/debug specialist, not an unsupervised quant author.
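The sketch below shows one common form of look-ahead bias on a toy price series, using plain Python lists so it runs without pandas. It may not be the exact pattern the model produces; the prices and function names are made up for illustration. The signal for bar t is derived from bar t's own return, so multiplying the two unlagged lets the strategy "know" the return it is trading.

```python
def daily_returns(prices):
    """Close-to-close simple returns; one fewer element than prices."""
    return [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]


def momentum_signal(returns):
    """+1 after an up bar, -1 otherwise. Only known once the bar has closed."""
    return [1.0 if r > 0 else -1.0 for r in returns]


def pnl_lookahead(returns, signal):
    """BIASED: pairs signal[t] with returns[t], the bar that produced it."""
    return [s * r for s, r in zip(signal, returns)]


def pnl_lagged(returns, signal):
    """Pairs signal[t-1] with returns[t]: trade only on information already seen."""
    return [s * r for s, r in zip(signal[:-1], returns[1:])]


prices = [100.0, 102.0, 101.0, 103.0, 102.0, 104.0]  # hypothetical data
rets = daily_returns(prices)
sig = momentum_signal(rets)
biased_total = sum(pnl_lookahead(rets, sig))
lagged_total = sum(pnl_lagged(rets, sig))
```

On this zig-zag series the biased version never loses on any bar, while the lagged version loses overall. A back-test that never has a losing bar is a useful red flag when reviewing generated code.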
🚀 Run it — pick your path
You need: a Blackwell GPU (SM120 / RTX 50-series / RTX PRO / GB10 / B100/200) and Docker with the NVIDIA runtime. Gemma-4 unified is new, so you need a vLLM build that registers Gemma4UnifiedForConditionalGeneration (a recent nightly). vLLM auto-detects the NVFP4 weights, so no --quantization flag is needed.
🟢 Easiest — one GPU, just chat (start here)
# download (~8.25 GB)
hf download sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4 --local-dir ./model
docker run --rm --gpus '"device=0"' --ipc=host --shm-size 16gb -p 8000:8000 \
-v $PWD/model:/model:ro \
vllm/vllm-openai:nightly \
--model /model --served-model-name gemma4-coder \
--max-model-len 16384 --gpu-memory-utilization 0.92 --trust-remote-code
Then open the OpenAI-compatible endpoint at http://localhost:8000/v1.
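A small sketch for confirming the server is up, using only the standard library. It assumes the usual OpenAI-style /v1/models route and response shape ({"data": [{"id": ...}]}); the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"


def models_url(base_url=BASE_URL):
    """URL of the model-listing route on an OpenAI-compatible server."""
    return base_url.rstrip("/") + "/models"


def served_model_ids(payload):
    """Extract model ids from a decoded /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url=BASE_URL, timeout=5.0):
    """Query the running server; raises URLError if it is not reachable yet."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return served_model_ids(json.load(resp))
```

With the container above running, `list_served_models()` should include gemma4-coder, the name set by --served-model-name.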
🧠 IMPORTANT — turn the thinking channel ON
This model was trained to think first. In vLLM you must enable it per request (otherwise it skips reasoning and quality drops on hard problems):
curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "gemma4-coder",
"messages": [{"role":"user","content":"Write a function that returns the longest palindromic substring. Think through edge cases first."}],
"temperature": 0.0,
"chat_template_kwargs": {"enable_thinking": true}
}'
In the Python OpenAI client, pass it via extra_body:
from openai import OpenAI
c = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
r = c.chat.completions.create(
model="gemma4-coder",
messages=[{"role":"user","content":"...your coding task..."}],
temperature=0.0, # greedy = deterministic code
extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(r.choices[0].message.content)
💡 Sampling: greedy (temperature 0) for deterministic solutions, or the author's temp 1.0, top_p 0.95, top_k 64 for variety (top_k via extra_body).
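The two sampling presets can be kept in one place and unpacked into the client call. This is a sketch: the helper name and the "greedy" / "author" labels are made up here, and passing top_k through extra_body relies on the server accepting it as an extra request field, as the note above describes.

```python
def sampling_kwargs(mode="greedy"):
    """Keyword arguments for chat.completions.create, with thinking enabled."""
    thinking = {"chat_template_kwargs": {"enable_thinking": True}}
    if mode == "greedy":
        return {"temperature": 0.0, "extra_body": thinking}
    if mode == "author":
        # top_k is not a standard OpenAI field, so it travels in extra_body.
        return {
            "temperature": 1.0,
            "top_p": 0.95,
            "extra_body": {**thinking, "top_k": 64},
        }
    raise ValueError(f"unknown sampling mode: {mode!r}")


# Usage with the client from the snippet above:
# c.chat.completions.create(model="gemma4-coder", messages=msgs,
#                           **sampling_kwargs("author"))
```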
⚡ Fastest interactive — TP=4 + bundled MTP speculative decode
A 0.4 B MTP draft is bundled in assistant/ (Google's gemma-4-12B-it assistant). It's lossless (the target verifies every token) and gives ~1.6× single-stream speed. Use num_speculative_tokens: 3 (stable optimum) and --kv-cache-dtype fp8 (NVFP4 KV would break the draft):
docker run --rm --gpus '"device=0,1,2,3"' --ipc=host --shm-size 16gb -p 8000:8000 \
-e NCCL_P2P_DISABLE=1 \
-v $PWD/model:/model:ro \
vllm/vllm-openai:nightly \
--model /model --served-model-name gemma4-coder \
--tensor-parallel-size 4 --disable-custom-all-reduce \
--kv-cache-dtype fp8 \
--speculative-config '{"method":"mtp","model":"/model/assistant","num_speculative_tokens":3}' \
--max-model-len 16384 --gpu-memory-utilization 0.90 --trust-remote-code
The bundled draft was trained on base gemma-4-12B-it. On this coder fine-tune it stays lossless; acceptance (and thus the exact speedup) may be a touch lower than with a coder-native draft. Measured numbers below.
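To see why acceptance drives the speedup, here is the usual back-of-the-envelope estimate for speculative decoding. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and the acceptance rates in the loop are hypothetical, not measurements from this model.

```python
def expected_tokens_per_step(accept_rate, k=3):
    """Expected tokens emitted per target-model step with k drafted tokens.

    Assumes independent per-token acceptance with probability accept_rate.
    The target always contributes one token, so the result is in [1, k + 1].
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if accept_rate == 1.0:
        return float(k + 1)
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


for rate in (0.5, 0.7, 0.9):  # hypothetical acceptance rates
    print(rate, round(expected_tokens_per_step(rate), 2))
```

This counts tokens per verification step only. Wall-clock speedup is lower because running the draft also costs time, which this sketch does not model.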
🔌 Multi-GPU without NVLink (consumer / entry Blackwell over PCIe)
Consumer and entry-level Blackwell cards on plain PCIe have no working GPU P2P, so tensor-parallel hangs unless you disable both NCCL P2P and vLLM's custom all-reduce:
# NCCL_P2P_DISABLE=1 (env): without it, startup hangs at NCCL init
# --disable-custom-all-reduce (flag): without it, the forward pass deadlocks
-e NCCL_P2P_DISABLE=1 \
--tensor-parallel-size 4 \
--disable-custom-all-reduce \
Flag cheat-sheet
| Flag / env | When | Why |
|---|---|---|
| vllm/vllm-openai:nightly | always | only nightly registers Gemma4UnifiedForConditionalGeneration |
| --trust-remote-code | always | new architecture |
| chat_template_kwargs={"enable_thinking":true} | every request | turns the reasoning channel on |
| NCCL_P2P_DISABLE=1 (env) | TP > 1, no NVLink | else hangs at NCCL init |
| --disable-custom-all-reduce | TP > 1, no NVLink | else the forward deadlocks |
| --ipc=host --shm-size 16gb | TP > 1 (docker) | host-path NCCL needs shared memory |
| --speculative-config '{"method":"mtp",...}' | interactive (≤8 concurrent) | ~1.6× single-stream; turn off for big batches |
| --kv-cache-dtype fp8 | with MTP | NVFP4 KV collapses draft acceptance |
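The cheat-sheet can be read as a small decision procedure. The sketch below encodes it as a helper that returns the Docker-side and vLLM-side arguments for a given setup. The function and parameter names are hypothetical; the flags themselves are the ones listed in the table.

```python
import json


def launch_args(tp_size=1, has_nvlink=False, use_mtp=False,
                draft_path="/model/assistant"):
    """Return (docker_args, vllm_args) following the cheat-sheet rules."""
    docker_args = []
    vllm_args = ["--trust-remote-code"]  # always: new architecture

    if tp_size > 1:
        docker_args += ["--ipc=host", "--shm-size", "16gb"]
        vllm_args += ["--tensor-parallel-size", str(tp_size)]
        if not has_nvlink:
            # Both are needed on PCIe-only setups, per the table above.
            docker_args += ["-e", "NCCL_P2P_DISABLE=1"]
            vllm_args += ["--disable-custom-all-reduce"]

    if use_mtp:
        spec = {"method": "mtp", "model": draft_path, "num_speculative_tokens": 3}
        vllm_args += [
            "--kv-cache-dtype", "fp8",
            # json.dumps avoids hand-quoting mistakes in the JSON argument.
            "--speculative-config", json.dumps(spec, separators=(",", ":")),
        ]
    return docker_args, vllm_args


docker_args, vllm_args = launch_args(tp_size=4, has_nvlink=False, use_mtp=True)
```

Passing the result as an argument list (for example to `subprocess.run`) sidesteps shell quoting of the JSON value entirely.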
📈 Throughput (measured — 4× RTX PRO 2000 Blackwell, 16 GB, PCIe / no-NVLink)
Single-stream decode (1 request, 512 tok, thinking on):
| config | tok/s | note |
|---|---|---|
| TP=2 | 53 | 2 GPUs |
| TP=4 | 74 | 4 GPUs, lowest latency |
| TP=4 + MTP (k=3) | 130 (1.76×) | bundled draft, lossless |
Aggregate throughput (no spec-decode; turn MTP off for batch):
| concurrency | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| TP=2 tok/s | 53 | 103 | 202 | 369 | 631 |
| TP=4 tok/s | 74 | 146 | 272 | 492 | 780 |
Choosing a layout on a fixed GPU budget: TP=4 gives the lowest latency, but TP=2 is more efficient per GPU (≈316 vs 195 tok/s/GPU at 16-way). For max farm throughput, run two data-parallel TP=2 replicas (≈1.3k tok/s on 4 GPUs) instead of one TP=4. Rule of thumb: MTP on for interactive (≤8 concurrent), off for high-concurrency batch.
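The layout comparison follows directly from the measured tables. The sketch below reproduces the arithmetic. The replica estimate assumes throughput adds up linearly across independent replicas and that each replica sees 16 concurrent requests, which is an idealization rather than a measurement.

```python
# Aggregate tok/s from the table above, keyed by TP size then concurrency.
TOK_PER_S = {
    2: {1: 53, 2: 103, 4: 202, 8: 369, 16: 631},
    4: {1: 74, 2: 146, 4: 272, 8: 492, 16: 780},
}


def per_gpu(tp, concurrency):
    """Throughput per GPU for one replica at the given concurrency."""
    return TOK_PER_S[tp][concurrency] / tp


def replicated_throughput(tp, concurrency, total_gpus=4):
    """Idealized total when the GPU budget is split into TP-sized replicas."""
    replicas = total_gpus // tp
    return replicas * TOK_PER_S[tp][concurrency]


mtp_speedup = 130 / 74  # TP=4 + MTP vs plain TP=4, single stream

print(per_gpu(2, 16), per_gpu(4, 16))
print(replicated_throughput(2, 16), replicated_throughput(4, 16))
print(round(mtp_speedup, 2))
```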
🔧 Quantization details
| Item | Detail |
|---|---|
| Scheme | NVFP4A16: weights FP4 (group 16, FP8 scales), activations BF16 |
| Format | compressed-tensors (native vLLM auto-detect) |
| Tool | llm-compressor 0.11, data-free RTN |
| Ignored (kept high-precision) | lm_head, vision/audio embedding_projection |
| Size | 8.25 GB model + 0.85 GB MTP draft · needs Blackwell (SM120) |
| Source | dequantized from the author's Q8_0 GGUF (≈lossless), verified to parity |
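For intuition about what "FP4, group 16, data-free RTN" means, here is a simplified round-to-nearest sketch. It assumes the E2M1 magnitude grid for FP4 and keeps each group's scale as an ordinary Python float rather than FP8, so it is an illustration of the idea and not llm-compressor's implementation. All names are hypothetical, and the example group has 4 values instead of 16 for brevity.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_group(weights):
    """Round-to-nearest: scale the group so its max maps to 6.0, then snap."""
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return 1.0, [0.0] * len(weights)
    scale = amax / FP4_GRID[-1]
    codes = []
    for w in weights:
        mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        codes.append(-mag if w < 0 else mag)
    return scale, codes


def dequantize_group(scale, codes):
    """Recover approximate weights from the group scale and FP4 values."""
    return [scale * c for c in codes]


def bits_per_weight(group_size=16, weight_bits=4, scale_bits=8):
    """Storage cost per weight once the shared group scale is amortized."""
    return weight_bits + scale_bits / group_size


scale, codes = quantize_group([6.0, -3.0, 0.4, 0.0])
restored = dequantize_group(scale, codes)
```

With group 16 and an 8-bit scale, the quantized layers cost 4.5 bits per weight. The tensors kept in high precision add to that, so the on-disk size is larger than a flat 4.5 bits times the parameter count.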
📚 Base, license, and a note on use
- Original model: yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1 (fine-tune of google/gemma-4-12B-it).
- MTP draft: google/gemma-4-12B-it-assistant, bundled in assistant/.
- License: Gemma Terms of Use; derivatives must comply.
- De-refused / not safety-aligned: add your own guardrails for production. Strongest on Python / algorithmic tasks; double-check general facts and especially time-series / quant code. Shared as-is, no warranty. Happy hacking! 🐾✨
NVFP4 build & eval by Lna-Lab. Thanks again to @yuxinlu1 for the original.