Instructions to use sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4")
model = AutoModelForMultimodalLM.from_pretrained("sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4

SGLang

How to use sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

💻 Gemma-4-12B-Coder (fable5 × composer2.5) — NVFP4A16 for vLLM ✨

A faithful 4-bit build of yuxinlu1's coding model, now runnable in vLLM — with a bundled MTP draft for ~1.6× interactive speed. 🚀

TL;DR — A local Python-coding assistant that thinks before it codes. 8.25 GB, runs on one 16 GB Blackwell GPU, native in vLLM (no --quantization flag). Bundled speculative-decode draft included. 💚

🙏 Credit & what this is

This is a weight-only NVFP4 (W4A16) re-quantization of yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1 — full credit and thanks to @yuxinlu1 for the model and the lovely training recipe. Please ⭐ and follow the original repo; if you want a v2, that's the author's signal to watch.

The author's design intent (preserved here): a focused fine-tune of google/gemma-4-12B-it on verifiable Python coding — distilled from real chain-of-thought (Composer 2.5, kept only where the code passed its tests) plus a Fable 5 "second-attempt" set that recovers the hard cases the main teacher missed. The result reasons in the open (edge cases, complexity) in Gemma's native thinking channel, then emits a clean, runnable solution. It is Python/algorithmic-focused, de-refused (not safety-aligned — add your own guardrails), and English-centric.

Why this build exists: the author shipped GGUF only (great for llama.cpp). This repo reconstructs a vLLM-native artifact so you can serve it with continuous batching, tensor-parallelism, and speculative decoding on Blackwell GPUs.

How it was made (provenance, for the curious): the author's Q8_0 GGUF (≈lossless) was dequantized to BF16, the gemma-4 language tensors were grafted onto a same-architecture gemma4_unified skeleton, and the result was quantized to NVFP4A16 with llm-compressor. Quality was verified to match the Q8 source (see below). W4A16 (FP4 weights, BF16 activations) is a deliberate choice: the base model was not trained with QAT, and full W4A4 collapses on this architecture, so weight-only quantization keeps it robust.

📊 How good is it? (independent eval, greedy pass@1)

Benchmark	Score
HumanEval	90.2% (148/164)
MBPP	85.7% (366/427)
HumanEval[:50] — this NVFP4 build vs the Q8 source	96% = 96% (parity, no quality loss)

Strong at: hard algorithms (DP, graphs, Fenwick/segment trees, bitmask DP), bug-fixing & refactoring (accurate root-cause + genuine O(n²)→O(n) rewrites that preserve semantics), and faithful open reasoning that matches the emitted code. Japanese prompts cause no measurable Python-quality drop.

⚠️ Know the one sharp edge (verified): on quantitative-finance / time-series code it can introduce a look-ahead bias (e.g. an unshifted position × a forward-shifted return), and its reasoning sometimes states the correct rule while the code does the opposite. Do not ship its pandas/numpy back-test or accounting code unreviewed; gate it. It's a superb algorithm/debug specialist, not an unsupervised author of trading or accounting code.
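One way to gate such code is a sanity test that catches same-bar leakage. The sketch below is a minimal pure-Python illustration, not the author's test harness: `strategy_pnl` and the toy sign-following rule are hypothetical, and the leak shown (a position multiplied by the very return it was derived from) is one common form of look-ahead, not necessarily the exact pattern the model produces.

```python
def strategy_pnl(returns, lag):
    """Per-bar PnL of a toy 'follow the sign of the return' strategy.

    position[t] = sign(returns[t]) is only known once bar t has closed,
    so the earliest return it can legitimately earn is returns[t + 1] (lag=1).
    lag=0 multiplies each position by the return it was derived from.
    """
    def sign(x):
        return (x > 0) - (x < 0)

    positions = [sign(r) for r in returns]
    return [positions[t - lag] * returns[t] for t in range(lag, len(returns))]


returns = [0.01, -0.02, 0.015, -0.005, 0.03]  # made-up bar returns

leaky = strategy_pnl(returns, lag=0)   # look-ahead: position sees its own bar
honest = strategy_pnl(returns, lag=1)  # position from bar t-1 earns bar t

# A back-test in which no bar ever loses is a red flag worth asserting against.
looks_too_good = all(p >= 0 for p in leaky)
print("leaky total:", sum(leaky), "| honest total:", sum(honest))
print("suspiciously perfect:", looks_too_good)
```

On this toy series the leaky variant wins on every bar while the correctly lagged one loses on every bar, which is the kind of gap a review gate should surface before any generated back-test is trusted.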

🚀 Run it — pick your path

You need a Blackwell GPU (SM120 / RTX 50-series / RTX PRO / GB10 / B100/200) and Docker with the NVIDIA runtime. Gemma-4 unified is new, so you also need a vLLM build that registers Gemma4UnifiedForConditionalGeneration (a recent nightly). vLLM auto-detects the NVFP4 weights, so no --quantization flag is needed.

🟢 Easiest — one GPU, just chat (start here)

# download (~8.25 GB)
hf download sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4 --local-dir ./model

docker run --rm --gpus '"device=0"' --ipc=host --shm-size 16gb -p 8000:8000 \
  -v "$PWD/model":/model:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-coder \
  --max-model-len 16384 --gpu-memory-utilization 0.92 --trust-remote-code

Then open the OpenAI-compatible endpoint at http://localhost:8000/v1.

🧠 IMPORTANT — turn the thinking channel ON

This model was trained to think first. In vLLM you must enable it per request (otherwise it skips reasoning and quality drops on hard problems):

curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "gemma4-coder",
  "messages": [{"role":"user","content":"Write a function that returns the longest palindromic substring. Think through edge cases first."}],
  "temperature": 0.0,
  "chat_template_kwargs": {"enable_thinking": true}
}'

In the Python OpenAI client, pass it via extra_body:

from openai import OpenAI
c = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
r = c.chat.completions.create(
    model="gemma4-coder",
    messages=[{"role":"user","content":"...your coding task..."}],
    temperature=0.0,                       # greedy = deterministic code
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(r.choices[0].message.content)

💡 Sampling: greedy (temperature 0) for deterministic solutions, or the author's temp 1.0, top_p 0.95, top_k 64 for variety (top_k via extra_body).
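The two sampling modes above can be sketched as one small helper that builds the keyword arguments for `chat.completions.create`. The helper name `chat_kwargs` is hypothetical; the parameter values are the ones quoted above, and `top_k` travels in `extra_body` because it is not a standard OpenAI parameter.

```python
def chat_kwargs(prompt, deterministic=True, model="gemma4-coder"):
    """Build request kwargs for the OpenAI client with thinking enabled."""
    extra_body = {"chat_template_kwargs": {"enable_thinking": True}}
    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": extra_body,
    }
    if deterministic:
        kwargs["temperature"] = 0.0          # greedy decoding
    else:
        kwargs["temperature"] = 1.0          # the author's sampling settings
        kwargs["top_p"] = 0.95
        extra_body["top_k"] = 64             # vLLM-specific, hence extra_body
    return kwargs


greedy = chat_kwargs("Implement a Fenwick tree with point update and prefix sum.")
varied = chat_kwargs("Suggest three ways to deduplicate a list.", deterministic=False)

# With a running server and the client from the example above:
#   r = c.chat.completions.create(**greedy)
```

Keeping `enable_thinking` inside the helper means no call site can forget to turn the reasoning channel on.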

⚡ Fastest interactive — TP=4 + bundled MTP speculative decode

A 0.4 B MTP draft is bundled in assistant/ (Google's gemma-4-12B-it assistant). It is lossless (the target verifies every token) and gives roughly 1.6× single-stream speed (1.76× in the TP=4 measurement below). Use num_speculative_tokens: 3 (the stable optimum) and --kv-cache-dtype fp8 (an NVFP4 KV cache would break the draft):

docker run --rm --gpus '"device=0,1,2,3"' --ipc=host --shm-size 16gb -p 8000:8000 \
  -e NCCL_P2P_DISABLE=1 \
  -v "$PWD/model":/model:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-coder \
  --tensor-parallel-size 4 --disable-custom-all-reduce \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","model":"/model/assistant","num_speculative_tokens":3}' \
  --max-model-len 16384 --gpu-memory-utilization 0.90 --trust-remote-code

The bundled draft was trained on the base gemma-4-12B-it. On this coder fine-tune it stays lossless, but the acceptance rate (and thus the exact speedup) may be slightly lower than with a coder-native draft. Measured numbers are below.

🔌 Multi-GPU without NVLink (consumer / entry Blackwell over PCIe)

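These cards have no working GPU peer-to-peer (P2P) path over plain PCIe, so tensor-parallel serving hangs unless you disable both NCCL P2P and vLLM's custom all-reduce:

  # Fragment of the docker run command, not a complete command.
  # -e NCCL_P2P_DISABLE=1 is a docker env var and goes before the image name;
  #   without it, startup hangs at NCCL init.
  # --disable-custom-all-reduce is a vLLM flag and goes after the image name;
  #   without it, the forward pass deadlocks.
  -e NCCL_P2P_DISABLE=1 \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \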

Flag cheat-sheet

Flag / env	When	Why
`vllm/vllm-openai:nightly`	always	only nightly registers `Gemma4UnifiedForConditionalGeneration`
`--trust-remote-code`	always	new architecture
`chat_template_kwargs={"enable_thinking":true}`	every request	turns the reasoning channel on
`NCCL_P2P_DISABLE=1` (env)	TP > 1, no NVLink	else hangs at NCCL init
`--disable-custom-all-reduce`	TP > 1, no NVLink	else the forward deadlocks
`--ipc=host --shm-size 16gb`	TP > 1 (docker)	host-path NCCL needs shared memory
`--speculative-config '{"method":"mtp",...}'`	interactive (≤8 concurrent)	~1.6× single-stream; turn off for big batches
`--kv-cache-dtype fp8`	with MTP	NVFP4 KV collapses draft acceptance
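
The cheat-sheet rules can be sketched as a small command builder. This is an illustrative helper, not part of vLLM or this repo: the function name `build_docker_command` and its parameters are hypothetical, while every flag and value it emits is taken from the commands shown earlier.

```python
import json
import os
import shlex


def build_docker_command(tensor_parallel=1, has_nvlink=False, use_mtp=False,
                         model_dir="./model", gpu_memory_utilization=0.90):
    """Assemble the `docker run` argument list from the cheat-sheet rules."""
    devices = ",".join(str(i) for i in range(tensor_parallel))
    cmd = ["docker", "run", "--rm", "--gpus", f'"device={devices}"',
           "--ipc=host", "--shm-size", "16gb", "-p", "8000:8000"]

    # Both workarounds apply only to multi-GPU runs without NVLink.
    needs_p2p_workaround = tensor_parallel > 1 and not has_nvlink
    if needs_p2p_workaround:
        cmd += ["-e", "NCCL_P2P_DISABLE=1"]          # else hangs at NCCL init

    cmd += ["-v", f"{os.path.abspath(model_dir)}:/model:ro",
            "vllm/vllm-openai:nightly",
            "--model", "/model", "--served-model-name", "gemma4-coder",
            "--max-model-len", "16384",
            "--gpu-memory-utilization", str(gpu_memory_utilization),
            "--trust-remote-code"]

    if tensor_parallel > 1:
        cmd += ["--tensor-parallel-size", str(tensor_parallel)]
    if needs_p2p_workaround:
        cmd.append("--disable-custom-all-reduce")    # else the forward deadlocks
    if use_mtp:
        spec = {"method": "mtp", "model": "/model/assistant",
                "num_speculative_tokens": 3}
        cmd += ["--kv-cache-dtype", "fp8",           # required alongside MTP
                "--speculative-config", json.dumps(spec)]
    return cmd


cmd = build_docker_command(tensor_parallel=4, use_mtp=True)
single = build_docker_command()
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` as-is, or printed with `shlex.join` to get a copy-pasteable shell line.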

📈 Throughput (measured — 4× RTX PRO 2000 Blackwell, 16 GB, PCIe / no-NVLink)

Single-stream decode (1 request, 512 tok, thinking on):

config	tok/s	note
TP=2	53	2 GPUs
TP=4	74	4 GPUs, lowest latency
TP=4 + MTP (k=3)	130 (1.76×)	bundled draft, lossless
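
As a quick check on the single-stream table, the ratios can be recomputed directly from the three measured values. Nothing here is new data; the dictionary simply copies the tok/s column.

```python
# Single-stream decode speeds (tok/s) copied from the table above.
single_stream = {"TP=2": 53, "TP=4": 74, "TP=4 + MTP (k=3)": 130}

# Speedup from adding the MTP draft on top of plain TP=4.
mtp_speedup = single_stream["TP=4 + MTP (k=3)"] / single_stream["TP=4"]

# Gain from doubling the GPU count without speculative decoding.
tp_scaling = single_stream["TP=4"] / single_stream["TP=2"]

print(f"MTP speedup over plain TP=4: {mtp_speedup:.2f}x")        # 1.76x
print(f"TP=2 -> TP=4 single-stream gain: {tp_scaling:.2f}x")     # 1.40x
```

Doubling the GPUs buys about 1.4× for a single stream, while the draft model adds about 1.76× on the same hardware, which is why MTP is the recommended lever for interactive use.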

Aggregate throughput (no spec-decode; turn MTP off for batch):

concurrency	1	2	4	8	16
TP=2 tok/s	53	103	202	369	631
TP=4 tok/s	74	146	272	492	780

Choosing a layout on a fixed GPU budget: TP=4 gives the lowest latency, but TP=2 is more efficient per GPU (≈316 vs 195 tok/s/GPU at concurrency 16). For maximum farm throughput, run two data-parallel TP=2 replicas (≈1.3k tok/s on 4 GPUs) instead of one TP=4 instance. Rule of thumb: MTP on for interactive use (≤8 concurrent), off for high-concurrency batch.
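
The per-GPU comparison above can be reproduced from the aggregate table. This is a back-of-the-envelope sketch: it assumes replicas scale linearly and that each replica receives the stated concurrency, which the table itself does not measure.

```python
# Aggregate throughput (tok/s) by concurrency, copied from the table above.
aggregate = {
    "TP=2": {1: 53, 2: 103, 4: 202, 8: 369, 16: 631},
    "TP=4": {1: 74, 2: 146, 4: 272, 8: 492, 16: 780},
}
gpus_per_replica = {"TP=2": 2, "TP=4": 4}
TOTAL_GPUS = 4


def per_gpu(layout, concurrency):
    """Throughput per GPU for one replica of the given layout."""
    return aggregate[layout][concurrency] / gpus_per_replica[layout]


def farm_throughput(layout, concurrency):
    """Total tok/s if the GPU budget is filled with identical replicas.

    Assumes linear scaling across replicas, each serving `concurrency` requests.
    """
    replicas = TOTAL_GPUS // gpus_per_replica[layout]
    return replicas * aggregate[layout][concurrency]


for layout in aggregate:
    print(layout, per_gpu(layout, 16), "tok/s/GPU,",
          farm_throughput(layout, 16), "tok/s on", TOTAL_GPUS, "GPUs")
```

Under those assumptions two TP=2 replicas reach 1262 tok/s against 780 tok/s for a single TP=4 instance, matching the ≈1.3k figure quoted above.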

🔧 Quantization details


Scheme	NVFP4A16 — weights FP4 (group 16, FP8 scales), activations BF16
Format	`compressed-tensors` (native vLLM auto-detect)
Tool	llm-compressor 0.11, data-free RTN
Ignored (kept high-precision)	`lm_head`, vision/audio `embedding_projection`
Size	8.25 GB model + 0.85 GB MTP draft · needs Blackwell (SM120)
Source	dequantized from the author's `Q8_0` GGUF (≈lossless), verified to parity
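
To make "weights FP4 (group 16), data-free RTN" concrete, here is a simplified pure-Python sketch of round-to-nearest quantization for one 16-weight group. It is an illustration only, not llm-compressor's implementation: the function names are hypothetical, the sample weights are made up, and the scale is kept as a Python float here whereas the real format stores it in FP8.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is stored separately).
FP4_E2M1 = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_group_rtn(weights):
    """Round-to-nearest FP4 quantization of one weight group (sketch)."""
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return 1.0, [0.0] * len(weights)
    scale = amax / FP4_E2M1[-1]          # map the largest magnitude onto 6.0
    codes = []
    for w in weights:
        target = abs(w) / scale
        mag = min(FP4_E2M1, key=lambda v: abs(target - v))  # nearest FP4 value
        codes.append(-mag if w < 0 else mag)
    return scale, codes


def dequantize_group(scale, codes):
    """Recover approximate weights; activations stay in BF16 in W4A16."""
    return [scale * c for c in codes]


# One group of 16 made-up weights.
group = [0.12, -0.03, 0.40, -0.60, 0.05, 0.00, 0.33, -0.21,
         0.18, -0.48, 0.09, 0.27, -0.15, 0.54, -0.06, 0.36]

scale, codes = quantize_group_rtn(group)
restored = dequantize_group(scale, codes)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print("scale:", scale, "max abs error:", max_err)
```

Because no calibration data is involved, the only inputs are the weights themselves, which is what "data-free" means in the table. The worst-case error per weight is bounded by half the widest gap between FP4 values (the 4 to 6 step) times the group scale.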

📚 Base, license, and a note on use

Original model: yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1 (fine-tune of google/gemma-4-12B-it).
MTP draft: google/gemma-4-12B-it-assistant, bundled in assistant/.
License: Gemma Terms of Use — derivatives must comply.
De-refused / not safety-aligned: add your own guardrails for production. Strongest on Python / algorithmic tasks; double-check general facts and especially time-series / quantitative-finance code. Shared as-is, no warranty. Happy hacking! 🐾✨