How to use from the Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4")
model = AutoModelForMultimodalLM.from_pretrained("sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4

One repo, speculative decoding included. On this dense 31B the gain is large: up to 2.4× on Japanese and 2.9× on English single-stream, relative to the same body without speculative decoding. This is the MTP bundle of Huihui-gemma-4-31B-it-qat-abliterated-NVFP4: the NVFP4 (full W4A4) body plus the matching gemma4_assistant MTP draft checkpoint in assistant/, so a single hf download gives you everything vllm serve --speculative-config needs.

Measured: Japanese 86 tok/s · English 106 tok/s single-stream on 4× RTX PRO 2000 Blackwell 16 GB (vs 36–37 baseline) — a QAT-origin, abliterated Gemma 4 31B dense in 20.4 GB + a 0.94 GB draft, pulled into the practical zone on 4 entry-level Blackwell cards.

Lineage: google/gemma-4-31B-it-qat-q4_0-unquantized (QAT q4_0 → bf16) → huihui-ai abliteration → Lna-Lab NVFP4 W4A4 → this bundle, adding google/gemma-4-31B-it-qat-q4_0-unquantized-assistant (bf16, unmodified) as the speculative draft.

| Item | Details |
|---|---|
| Body | Gemma4ForConditionalGeneration: 31B dense, 60 text layers (hidden 5376) + vision tower · NVFP4 W4A4 (compressed-tensors / nvfp4-pack-quantized) · 20.4 GB |
| Draft (assistant/) | Gemma4AssistantForCausalLM (model_type: gemma4_assistant): 4-layer MTP head riding the target's hidden states · bf16 · 0.94 GB |
| Spec method | vLLM gemma4_mtp, num_speculative_tokens: 4 |
| Hardware | NVIDIA Blackwell (SM120) required · TP=4 (4× 16 GB) recommended; TP=2 + draft does not leave usable KV on 16 GB cards |
| vLLM | ≥ 0.21 (compressed-tensors NVFP4 auto-detect + gemma4_mtp; measured on 0.21.0) |

Why MTP, and why a bundle

Gemma 4's multi-token-prediction is not a head baked into the main checkpoint (Qwen3.6-style). Google ships it as a separate assistant checkpoint that vLLM's --speculative-config loads alongside the target. That normally costs you a second hf download and a path dance. This repo ends that: the assistant lives in assistant/ and you point the config at the local subfolder.

Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4/
├── model.safetensors          # NVFP4 W4A4 body (20.4 GB)
├── config.json / generation_config.json / processor_config.json
├── tokenizer.json / tokenizer_config.json / chat_template.jinja
├── recipe.yaml                # llm-compressor recipe
└── assistant/                 # gemma4_mtp draft (bf16, 0.94 GB)
    ├── model.safetensors
    ├── config.json            # model_type: gemma4_assistant
    └── tokenizer / chat_template
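
The layout above can be sanity-checked after download with a few lines of Python. This is a minimal sketch, not part of the repo: the function name check_bundle and the choice of which files to verify are illustrative, taken from the listing above (the tokenizer files are skipped because the listing abbreviates them for assistant/).

```python
import json
from pathlib import Path

# File names copied from the directory listing; this tuple is an illustrative
# subset, not an official manifest.
EXPECTED_FILES = (
    "model.safetensors",
    "config.json",
    "assistant/model.safetensors",
    "assistant/config.json",
)


def check_bundle(root: str) -> list[str]:
    """Return a list of problems found in a downloaded bundle (empty if none)."""
    base = Path(root)
    problems = [
        f"missing: {rel}" for rel in EXPECTED_FILES if not (base / rel).is_file()
    ]

    # The listing states that the draft config carries model_type: gemma4_assistant.
    draft_config = base / "assistant" / "config.json"
    if draft_config.is_file():
        model_type = json.loads(draft_config.read_text()).get("model_type")
        if model_type != "gemma4_assistant":
            problems.append(f"unexpected draft model_type: {model_type!r}")
    return problems
```

Calling check_bundle("gemma4-31b-mtp") on the directory created by the Quickstart download should return an empty list if the bundle is complete.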

Quickstart (TP=4 recommended)

hf download sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4 \
  --local-dir gemma4-31b-mtp
DIR=$(realpath gemma4-31b-mtp)

NCCL_P2P_DISABLE=1 vllm serve "$DIR" \
  --served-model-name gemma4-31b-mtp \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-batched-tokens 8192 \
  --limit-mm-per-prompt '{"image":0}' \
  --speculative-config "{\"method\":\"gemma4_mtp\",\"model\":\"$DIR/assistant\",\"num_speculative_tokens\":4}"
  • The point: model in --speculative-config is the bundled local path — no second download, no HF resolution at serve time.
  • TP=4 is the right call twice over: (1) the dense 31B gains +64% from TP=2→4 even without spec-decode, and (2) the draft's VRAM share makes TP=2 KV-starved on 16 GB cards. This config measured KV 17,385 tok (bf16 KV; add --kv-cache-dtype fp8 for more).
  • vLLM 0.21 multimodal budget trap: keep --max-num-batched-tokens ≥ 2496 even with '{"image":0}', or startup fails validating the multimodal token budget.
  • NCCL_P2P_DISABLE=1 + --disable-custom-all-reduce are required on PCIe no-NVLink boxes (TP hangs without them); drop both if you have NVLink/P2P. Keep CUDA graphs ON.
  • vLLM 0.21's quantization-inheritance trap does not fire here: with an explicit draft model path the draft's own config decides (bf16). Only the model:null MTP-from-target path inherits target quantization.
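
If you launch the server from Python rather than a shell, building the argument list with json.dumps avoids the hand-escaped JSON in the command above. The sketch below mirrors the Quickstart flags one-to-one; the function name build_serve_command and its parameters are hypothetical, and the 2496 check simply encodes the budget trap described in the notes.

```python
import json


def build_serve_command(bundle_dir: str, tensor_parallel: int = 4,
                        max_batched_tokens: int = 8192) -> list[str]:
    """Assemble the Quickstart `vllm serve` argv for a local bundle directory."""
    # Per the notes above, vLLM 0.21 fails at startup below this budget,
    # even with images disabled.
    if max_batched_tokens < 2496:
        raise ValueError("--max-num-batched-tokens must be at least 2496")

    spec_config = {
        "method": "gemma4_mtp",
        "model": f"{bundle_dir}/assistant",  # bundled local draft path
        "num_speculative_tokens": 4,
    }
    return [
        "vllm", "serve", bundle_dir,
        "--served-model-name", "gemma4-31b-mtp",
        "--tensor-parallel-size", str(tensor_parallel),
        "--disable-custom-all-reduce",
        "--max-model-len", "8192",
        "--gpu-memory-utilization", "0.90",
        "--max-num-batched-tokens", str(max_batched_tokens),
        "--limit-mm-per-prompt", json.dumps({"image": 0}),
        "--speculative-config", json.dumps(spec_config),
    ]
```

The returned list can be passed to subprocess.run with NCCL_P2P_DISABLE=1 added to the environment on PCIe no-NVLink boxes, or printed with shlex.join to get a copy-pasteable shell line.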

Measured (RTX PRO 2000 Blackwell 16 GB ×4, TP=4, PCIe no-NVLink, vLLM 0.21.0, 2026-06-12)

Single-stream, T=0 chat completions, ×3 each. Acceptance = accepted/drafted from /metrics.

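| config | JA 128 (tok/s) | JA 512 (tok/s) | EN 128 (tok/s) | EN 512 (tok/s) | acceptance JA / EN |
|---|---|---|---|---|---|
| baseline (no spec) | 36.4 | 36.3 | 36.7 | | |
| native MTP (this bundle, N=4) | 85.7 | 75.6 | 106.1 | 91.0 | 41–51% / 55–71% |
| EAGLE-3 (vanilla-trained NVFP4 draft), N=3 | 33.5 | 34.0 | 46.9 | | 1–3% / 16% |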

JA 2.1–2.4×, EN 2.5–2.9× over baseline. A dense 31B at 36 tok/s leaves four GPUs verification-hungry — exactly the regime where MTP shines (the already-fast MoE 26B sibling only gains 1.2–1.5× from the same trick).

Two lessons paid for in benchmarks:

  • EAGLE-3 was dead on arrival against this abliterated QAT body: a vanilla-31B-trained, English-data draft gets 1–3% JA acceptance (which drops JA throughput below the no-spec baseline) and only 16% EN. Distribution shift between drafter and verifier kills it. The google MTP assistant shrugs both problems off (41–51% JA acceptance despite vanilla training) because its 4-layer head re-uses the target's own hidden states instead of imitating its distribution from scratch.
  • If your traffic is Japanese (or anything non-English), the MTP assistant is the only draft of the ones we tested that pays.
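
The acceptance figures above are accepted/drafted token counts read from /metrics. A rough way to reproduce that ratio is to scrape the endpoint before and after a run and divide the counter deltas. In this sketch the two metric names are assumptions about what vLLM exposes (check your own /metrics output and pass the names you actually see), and the helper names are hypothetical.

```python
import re

# Assumed counter names; verify against your server's /metrics output.
ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"
DRAFTED = "vllm:spec_decode_num_draft_tokens_total"

_METRIC_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:]*")


def counter_total(metrics_text: str, name: str) -> float:
    """Sum one counter over all label sets in Prometheus text format.

    Assumes sample lines carry no trailing timestamp.
    """
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _METRIC_NAME.match(line)
        if match and match.group(0) == name:
            total += float(line.rsplit(None, 1)[-1])
    return total


def acceptance_rate(before: str, after: str,
                    accepted: str = ACCEPTED, drafted: str = DRAFTED) -> float:
    """Accepted/drafted ratio between two /metrics scrapes."""
    delta_accepted = counter_total(after, accepted) - counter_total(before, accepted)
    delta_drafted = counter_total(after, drafted) - counter_total(before, drafted)
    if delta_drafted <= 0:
        raise ValueError("no drafted tokens between the two scrapes")
    return delta_accepted / delta_drafted
```

Using deltas rather than raw totals keeps warm-up requests and earlier traffic out of the ratio.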

Concurrent (aggregate throughput)

4 / 8 concurrent streams × 256 tok each (T=0, diverse prompts, prefix-cache busted, ×3 averaged). Baseline = same body, no spec-decode (measured with JA prompts; baseline JA≈EN single-stream).

| streams | baseline tok/s | MTP JA tok/s | MTP EN tok/s | acceptance JA / EN |
|---|---|---|---|---|
| 1 | 36.4 | 85.7 (2.4×) | 106.1 (2.9×) | 41–51% / 55–71% |
| 4 | 139.5 | 211.6 (+52%) | 251.9 (+81%) | ~39% / ~52% |
| 8 | 252.9 | 320.1 (+27%) | 387.2 (+53%) | ~41% / ~53% |

MTP keeps winning at every concurrency this box can reach. The multiplier decays as batching fills the GPUs (2.4× → 1.5× → 1.3× JA), but a dense 31B at 253 tok/s aggregate is still verification-hungry, acceptance holds 40%/53% under batch, and the KV budget (17,385 tok) caps realistic concurrency long before any crossover. On this model there is no regime where you should turn MTP off. (Contrast: the MoE 26B sibling — already compute-saturated — breaks even at 8 streams.)
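
The multipliers and percentage gains quoted for the concurrent runs follow directly from the measured throughputs. The short sketch below recomputes them from the numbers in the table; the dictionary and function names are illustrative only.

```python
# Aggregate tok/s by number of concurrent streams, copied from the table.
BASELINE = {1: 36.4, 4: 139.5, 8: 252.9}
MTP_JA = {1: 85.7, 4: 211.6, 8: 320.1}
MTP_EN = {1: 106.1, 4: 251.9, 8: 387.2}


def speedup(mtp_tok_s: float, baseline_tok_s: float) -> float:
    """Throughput multiplier of MTP over the no-spec baseline."""
    return mtp_tok_s / baseline_tok_s


def gain_percent(mtp_tok_s: float, baseline_tok_s: float) -> float:
    """The same comparison expressed as a percentage gain."""
    return (speedup(mtp_tok_s, baseline_tok_s) - 1.0) * 100.0


for streams, base in BASELINE.items():
    ja = speedup(MTP_JA[streams], base)
    en = speedup(MTP_EN[streams], base)
    print(f"{streams} streams: JA {ja:.1f}x, EN {en:.1f}x")
```

The JA multiplier shrinks as concurrency grows, while staying above 1 at every measured point, which is the basis for the claim that MTP never loses on this box.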

The body: QAT × NVFP4 (the finding, in short)

Full W4A4 NVFP4 breaks non-QAT gemma-4 — the non-QAT 12B collapsed outright on this exact recipe. This 31B's q4_0 QAT weights take it cleanly: fluent Japanese, correct multi-step logic, valid haiku, zero repetition/mojibake, with a plain ultrachat 256×2048 calibration. If you want gemma-4 in NVFP4 W4A4, go through a QAT checkpoint. Full evidence and bake recipe (pure-CPU calibration through the multimodal processor — multi-GPU dispatch silently corrupts gemma4 activations on no-P2P boxes) in the non-MTP card.

Notes

  • Abliterated (uncensored). Refusal behavior removed upstream — you are responsible for your deployment. Use responsibly and lawfully.
  • NVFP4 is Blackwell-specific; the body will not run on Ampere/Hopper. The bf16 assistant inherits the body's GPU anyway.
  • The assistant/ checkpoint is google's, redistributed unmodified under the same Gemma terms; its original model card is included as assistant/README.md.
  • Gemma is provided under and subject to the Gemma Terms of Use.
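
Regarding the hardware note above, a quick pre-flight check can confirm that the visible GPUs report the SM120 compute capability this card lists as required. This is a sketch under assumptions: the function names are hypothetical, the exact-match test is deliberately conservative (the card only documents SM120), and local_gpu_report needs a PyTorch install with CUDA support.

```python
REQUIRED_CAPABILITY = (12, 0)  # SM120, as listed in the hardware row of this card


def meets_card_requirement(capability: tuple[int, int]) -> bool:
    """True if a (major, minor) compute capability is exactly SM120."""
    return tuple(capability) == REQUIRED_CAPABILITY


def local_gpu_report() -> list[str]:
    """Describe each visible CUDA device and whether it matches SM120."""
    import torch  # imported lazily so the pure check above has no dependencies

    lines = []
    for index in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(index)
        status = "ok" if meets_card_requirement((major, minor)) else "not SM120"
        name = torch.cuda.get_device_name(index)
        lines.append(f"GPU {index}: {name} SM{major}{minor} ({status})")
    return lines
```

Ampere (8.x) and Hopper (9.0) devices fail this check, consistent with the note that the NVFP4 body will not run on them.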

Credits

Support the Base Model Author (huihui-ai)

If you find the abliterated base useful, please support huihui-ai.
