💻 Gemma-4-12B-Coder (fable5 × composer2.5) — NVFP4A16 for vLLM

A faithful 4-bit build of yuxinlu1's coding model, now runnable in vLLM — with a bundled MTP draft for ~1.6× interactive speed. 🚀

TL;DR — A local Python-coding assistant that thinks before it codes. 8.25 GB, runs on one 16 GB Blackwell GPU, native in vLLM (no --quantization flag). Bundled speculative-decode draft included. 💚


🙏 Credit & what this is

This is a weight-only NVFP4 (W4A16) re-quantization of yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1 — full credit and thanks to @yuxinlu1 for the model and the lovely training recipe. Please ⭐ and follow the original repo; if you want a v2, that's the author's signal to watch.

The author's design intent (preserved here): a focused fine-tune of google/gemma-4-12B-it on verifiable Python coding — distilled from real chain-of-thought (Composer 2.5, kept only where the code passed its tests) plus a Fable 5 "second-attempt" set that recovers the hard cases the main teacher missed. The result reasons in the open (edge cases, complexity) in Gemma's native thinking channel, then emits a clean, runnable solution. It is Python/algorithmic-focused, de-refused (not safety-aligned — add your own guardrails), and English-centric.

Why this build exists: the author shipped GGUF only (great for llama.cpp). This repo reconstructs a vLLM-native artifact so you can serve it with continuous batching, tensor-parallelism, and speculative decoding on Blackwell GPUs.

How it was made (provenance, for the curious): the author's Q8_0 GGUF (≈lossless) was dequantized to BF16, the gemma-4 language tensors were grafted onto a same-arch gemma4_unified skeleton, and the result was quantized to NVFP4A16 with llm-compressor. Quality was verified to match the Q8 source (see below). W4A16 (weights FP4, activations BF16) is a deliberate choice: the base model was not trained with quantization-aware training (QAT), and full W4A4 collapses on this architecture without it, so weight-only quantization keeps the model robust.
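
For readers unfamiliar with the first step, here is a minimal sketch of dequantizing a single Q8_0 block. It assumes the usual ggml layout (one fp16 scale followed by 32 signed 8-bit values); the function name and the sample block are illustrative, and this is not the script used for this build. A real conversion would walk every tensor in the GGUF and cast the result to BF16.

```python
import struct

Q8_0_FMT = "<e32b"                        # little-endian: 1 fp16 scale + 32 int8 quants (assumed layout)
Q8_0_BYTES = struct.calcsize(Q8_0_FMT)    # 34 bytes per block of 32 weights


def dequantize_q8_0_block(raw: bytes) -> list[float]:
    """Recover 32 float weights from one Q8_0 block: w_i = scale * q_i."""
    scale, *quants = struct.unpack(Q8_0_FMT, raw)
    return [scale * q for q in quants]


# Hypothetical block for illustration: scale 0.5, quants -16..15
block = struct.pack(Q8_0_FMT, 0.5, *range(-16, 16))
weights = dequantize_q8_0_block(block)
print(weights[:4])   # [-8.0, -7.5, -7.0, -6.5]
```

Because each weight is stored as an 8-bit integer times a per-block scale, the round trip loses very little, which is why Q8_0 is a reasonable starting point for re-quantization.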


📊 How good is it? (independent eval, greedy pass@1)

| Benchmark | Score |
|---|---|
| HumanEval | 90.2% (148/164) |
| MBPP | 85.7% (366/427) |
| HumanEval[:50], this NVFP4 build vs the Q8 source | 96% = 96% (parity, no quality loss) |

Strong at: hard algorithms (DP, graphs, Fenwick/segment trees, bitmask DP), bug-fixing & refactoring (accurate root-cause + genuine O(n²)→O(n) rewrites that preserve semantics), and faithful open reasoning that matches the emitted code. Japanese prompts cause no measurable Python-quality drop.

⚠️ Know the one sharp edge (verified): on quantitative-finance / time-series code it can introduce look-ahead bias (e.g. an unshifted position × a forward-shifted return), and its reasoning sometimes states the correct rule while the code does the opposite. Do not ship its pandas/numpy back-test or accounting code unreviewed; gate it behind review and tests. It's a superb algorithm/debug specialist, not an unsupervised quant-finance author.
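
To make the failure mode concrete, here is a generic, dependency-free illustration of look-ahead bias. It is not the exact case observed with this model, and the price series is made up. The position is the sign of a bar's return; with lag=0 the strategy trades on the very bar it is still measuring, while lag=1 uses only the previous bar (the pandas analogue would be shifting the position series by one row before multiplying).

```python
def bar_returns(prices):
    """Simple return of each bar relative to the previous close."""
    return [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]


def sign(x):
    return (x > 0) - (x < 0)


def backtest(prices, lag):
    """Sum of position * return, where the position is the sign of the
    return observed `lag` bars earlier. lag=0 peeks at the current bar
    (look-ahead); lag=1 only uses information available beforehand."""
    rets = bar_returns(prices)
    positions = [sign(r) for r in rets]
    return sum(positions[i - lag] * rets[i] for i in range(lag, len(rets)))


prices = [100.0, 102.0, 101.0, 103.0, 102.0, 104.0]   # hypothetical zig-zag series
biased = backtest(prices, lag=0)   # equals the sum of |returns|: it cannot lose
honest = backtest(prices, lag=1)   # loses on every bar of this series
print(f"biased={biased:+.4f}  honest={honest:+.4f}")
```

A back-test that never loses is the usual symptom; a review gate can check that every position is lagged relative to the return it is multiplied with.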


🚀 Run it — pick your path

You need: a Blackwell GPU (SM120 / RTX 50-series / RTX PRO / GB10 / B100/200) and Docker with the NVIDIA runtime. Gemma-4 unified is new, so you also need a vLLM build that registers Gemma4UnifiedForConditionalGeneration (a recent nightly). vLLM auto-detects the NVFP4 weights, so no --quantization flag is needed.

🟢 Easiest — one GPU, just chat (start here)

# download (~8.25 GB)
hf download sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4 --local-dir ./model

docker run --rm --gpus '"device=0"' --ipc=host --shm-size 16gb -p 8000:8000 \
  -v "$PWD/model":/model:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-coder \
  --max-model-len 16384 --gpu-memory-utilization 0.92 --trust-remote-code

Then open the OpenAI-compatible endpoint at http://localhost:8000/v1.

🧠 IMPORTANT — turn the thinking channel ON

This model was trained to think first. In vLLM you must enable it per request (otherwise it skips reasoning and quality drops on hard problems):

curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "gemma4-coder",
  "messages": [{"role":"user","content":"Write a function that returns the longest palindromic substring. Think through edge cases first."}],
  "temperature": 0.0,
  "chat_template_kwargs": {"enable_thinking": true}
}'

In the Python OpenAI client, pass it via extra_body:

from openai import OpenAI
c = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
r = c.chat.completions.create(
    model="gemma4-coder",
    messages=[{"role":"user","content":"...your coding task..."}],
    temperature=0.0,                       # greedy = deterministic code
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(r.choices[0].message.content)

💡 Sampling: greedy (temperature 0) for deterministic solutions, or the author's temp 1.0, top_p 0.95, top_k 64 for variety (top_k via extra_body).
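
The two sampling modes can be wrapped in a small helper. This is a sketch: build_request is a hypothetical name, the model name assumes the --served-model-name used above, and the result is meant to be splatted into the client call shown earlier.

```python
def build_request(prompt: str, deterministic: bool = True) -> dict:
    """Keyword arguments for client.chat.completions.create(**kwargs)."""
    extra_body = {"chat_template_kwargs": {"enable_thinking": True}}
    kwargs = {
        "model": "gemma4-coder",   # matches --served-model-name in the docker command
        "messages": [{"role": "user", "content": prompt}],
    }
    if deterministic:
        kwargs["temperature"] = 0.0            # greedy decoding
    else:
        kwargs["temperature"] = 1.0            # the author's sampling settings
        kwargs["top_p"] = 0.95
        extra_body["top_k"] = 64               # not a standard OpenAI field, so it rides in extra_body
    kwargs["extra_body"] = extra_body
    return kwargs


varied = build_request("...your coding task...", deterministic=False)
print(varied["extra_body"])
```

Usage would then be c.chat.completions.create(**build_request("...", deterministic=False)), with c being the OpenAI client from the example above.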

⚡ Fastest interactive — TP=4 + bundled MTP speculative decode

A 0.4B MTP draft is bundled in assistant/ (Google's gemma-4-12B-it assistant). It's lossless (the target verifies every token) and gives ~1.6× single-stream speed (1.76× in the TP=4 measurement below). Use num_speculative_tokens: 3 (stable optimum) and --kv-cache-dtype fp8 (an NVFP4 KV cache collapses draft acceptance):

docker run --rm --gpus '"device=0,1,2,3"' --ipc=host --shm-size 16gb -p 8000:8000 \
  -e NCCL_P2P_DISABLE=1 \
  -v "$PWD/model":/model:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-coder \
  --tensor-parallel-size 4 --disable-custom-all-reduce \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","model":"/model/assistant","num_speculative_tokens":3}' \
  --max-model-len 16384 --gpu-memory-utilization 0.90 --trust-remote-code

The bundled draft was trained on base gemma-4-12B-it. On this coder fine-tune it stays lossless; acceptance (and thus the exact speedup) may be a touch lower than a coder-native draft. Measured numbers below.
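
Why a mismatched draft only costs speed, never correctness, can be seen in a toy greedy version of speculative decoding. Everything here is hypothetical: target_next and draft_next are stand-in functions rather than real models, and vLLM's actual implementation verifies the proposals in one batched forward pass (and uses a rejection-sampling rule when sampling is enabled).

```python
def target_next(seq):
    """Toy 'target model': deterministic next token from the context."""
    return (sum(seq) * 31 + len(seq)) % 7


def draft_next(seq):
    """Toy 'draft': agrees with the target except when len(seq) is a multiple of 4."""
    t = target_next(seq)
    return t if len(seq) % 4 else (t + 1) % 7


def greedy_decode(prompt, n):
    seq = list(prompt)
    for _ in range(n):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative_decode(prompt, n, k=3):
    seq, steps = list(prompt), 0
    while len(seq) - len(prompt) < n:
        steps += 1
        proposal = []
        for _ in range(k):                       # draft proposes k tokens
            proposal.append(draft_next(seq + proposal))
        accepted = []
        for tok in proposal:                     # target checks each position
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)        # first mismatch: keep the target's token
                break
            accepted.append(tok)
        else:
            accepted.append(target_next(seq + accepted))   # all k accepted: bonus token
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n], steps


out, steps = speculative_decode([1, 2, 3], 40)
print(out == greedy_decode([1, 2, 3], 40), steps)
```

The output is identical to plain greedy decoding from the target; a weaker draft only lowers how many tokens are accepted per verification step.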

🔌 Multi-GPU without NVLink (consumer / entry Blackwell over PCIe)

On these cards there is no working GPU peer-to-peer (P2P) over plain PCIe, so tensor-parallel serving hangs unless you disable both NCCL P2P and vLLM's custom all-reduce:

  # docker env var; without it, startup hangs at NCCL init
  -e NCCL_P2P_DISABLE=1
  # vLLM flags; without --disable-custom-all-reduce the forward pass deadlocks
  --tensor-parallel-size 4 --disable-custom-all-reduce

Flag cheat-sheet

| Flag / env | When | Why |
|---|---|---|
| vllm/vllm-openai:nightly | always | only nightly registers Gemma4UnifiedForConditionalGeneration |
| --trust-remote-code | always | new architecture |
| chat_template_kwargs={"enable_thinking":true} | every request | turns the reasoning channel on |
| NCCL_P2P_DISABLE=1 (env) | TP > 1, no NVLink | else hangs at NCCL init |
| --disable-custom-all-reduce | TP > 1, no NVLink | else the forward deadlocks |
| --ipc=host --shm-size 16gb | TP > 1 (docker) | host-path NCCL needs shared memory |
| --speculative-config '{"method":"mtp",...}' | interactive (≤8 concurrent) | ~1.6× single-stream; turn off for big batches |
| --kv-cache-dtype fp8 | with MTP | NVFP4 KV collapses draft acceptance |
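
The cheat-sheet can also be read as a small decision procedure. The helper below is a hypothetical convenience sketch (vllm_launch_options is not part of vLLM); it only assembles flags and values that already appear in the commands above, and leaves --gpu-memory-utilization to the caller since the two examples use different values.

```python
import json


def vllm_launch_options(tp: int = 1, nvlink: bool = False, mtp: bool = False):
    """Return (docker_env, vllm_flags) following the cheat-sheet rules."""
    env = {}
    flags = ["--model", "/model", "--served-model-name", "gemma4-coder",
             "--max-model-len", "16384", "--trust-remote-code"]
    if tp > 1:
        flags += ["--tensor-parallel-size", str(tp)]
        if not nvlink:
            env["NCCL_P2P_DISABLE"] = "1"                 # else hangs at NCCL init
            flags.append("--disable-custom-all-reduce")   # else the forward deadlocks
    if mtp:
        spec = {"method": "mtp", "model": "/model/assistant",
                "num_speculative_tokens": 3}
        flags += ["--kv-cache-dtype", "fp8",              # keep the KV cache at fp8 with MTP
                  "--speculative-config", json.dumps(spec)]
    return env, flags


env, flags = vllm_launch_options(tp=4, nvlink=False, mtp=True)
print(env)
print(" ".join(flags))
```

The env entries map to -e NAME=VALUE on docker run, and the flags go after the image name.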

📈 Throughput (measured — 4× RTX PRO 2000 Blackwell, 16 GB, PCIe / no-NVLink)

Single-stream decode (1 request, 512 tok, thinking on):

| config | tok/s | note |
|---|---|---|
| TP=2 | 53 | 2 GPUs |
| TP=4 | 74 | 4 GPUs, lowest latency |
| TP=4 + MTP (k=3) | 130 (1.76×) | bundled draft, lossless |

Aggregate throughput (no spec-decode; turn MTP off for batch):

| concurrency | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| TP=2 tok/s | 53 | 103 | 202 | 369 | 631 |
| TP=4 tok/s | 74 | 146 | 272 | 492 | 780 |

Choosing a layout on a fixed GPU budget: TP=4 gives the lowest latency, but TP=2 is more efficient per GPU (≈316 vs 195 tok/s/GPU at 16 concurrent requests). For maximum aggregate throughput, run two data-parallel TP=2 replicas (≈1.3k tok/s on 4 GPUs) instead of one TP=4. Rule of thumb: MTP on for interactive use (≤8 concurrent), off for high-concurrency batch.
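
The per-GPU figures follow directly from the 16-concurrency column of the table. The short calculation below reproduces them; the two-replica number is an estimate that assumes each TP=2 replica serves its own 16 concurrent requests and that the replicas do not slow each other down.

```python
# Measured aggregate tok/s at 16 concurrent requests (from the table above)
measured = {2: 631, 4: 780}            # tensor-parallel size -> tok/s

per_gpu = {tp: toks / tp for tp, toks in measured.items()}

total_gpus = 4
replicas = total_gpus // 2             # two TP=2 replicas on 4 GPUs
dp_estimate = replicas * measured[2]   # assumes independent, linear scaling
gain = dp_estimate / measured[4]

print(per_gpu, dp_estimate, round(gain, 2))
# {2: 315.5, 4: 195.0} 1262 1.62
```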


🔧 Quantization details

| Item | Detail |
|---|---|
| Scheme | NVFP4A16: weights FP4 (group 16, FP8 scales), activations BF16 |
| Format | compressed-tensors (native vLLM auto-detect) |
| Tool | llm-compressor 0.11, data-free RTN |
| Ignored (kept high-precision) | lm_head, vision/audio embedding_projection |
| Size | 8.25 GB model + 0.85 GB MTP draft · needs Blackwell (SM120) |
| Source | dequantized from the author's Q8_0 GGUF (≈lossless), verified to parity |
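
As a rough picture of what data-free round-to-nearest (RTN) does to one group of 16 weights, here is a simplified sketch. It is not llm-compressor's code: it keeps the group scale as an ordinary Python float, whereas the real format stores it in FP8, and the sample weights are made up. The grid lists the magnitudes representable in FP4 (E2M1).

```python
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)   # FP4 (E2M1) magnitudes
GROUP = 16


def fake_quantize_group(weights):
    """Round a group of 16 weights to the nearest point on a scaled FP4 grid.
    Returns the dequantized weights and the group scale."""
    assert len(weights) == GROUP
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return [0.0] * GROUP, 0.0
    scale = amax / E2M1_GRID[-1]              # largest magnitude maps to 6.0
    out = []
    for w in weights:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append(mag * scale if w >= 0 else -mag * scale)
    return out, scale


sample = [i / 10 - 0.75 for i in range(GROUP)]   # hypothetical weights in [-0.75, 0.75]
dequant, scale = fake_quantize_group(sample)
max_err = max(abs(a - b) for a, b in zip(sample, dequant))
print(f"scale={scale}  max_abs_error={max_err:.4f}")
```

No calibration data is involved: each group is rounded independently, which is what "data-free" refers to, and activations stay in BF16 because only the weights pass through this step.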

📚 Base, license, and a note on use

NVFP4 build & eval by Lna-Lab. Thanks again to @yuxinlu1 for the original.
