Instructions to use sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4")
model = AutoModelForMultimodalLM.from_pretrained("sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4

SGLang

How to use sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4" \
        --host 0.0.0.0 \
        --port 30000


Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4

One repo, speculative decoding included. On this dense 31B the single-stream speedup over the no-spec baseline reaches 2.4× for Japanese and 2.9× for English. This is the MTP bundle of Huihui-gemma-4-31B-it-qat-abliterated-NVFP4: the NVFP4 (full W4A4) body plus the matching `gemma4_assistant` MTP draft checkpoint in `assistant/`, so a single `hf download` gives you everything `vllm serve --speculative-config` needs.

Measured: Japanese 86 tok/s · English 106 tok/s single-stream on 4× RTX PRO 2000 Blackwell 16 GB (vs a 36–37 tok/s baseline without speculative decoding). That is a QAT-origin, abliterated Gemma 4 31B dense model in 20.4 GB plus a 0.94 GB draft, running at practical speeds on four entry-level Blackwell cards.

Lineage: google/gemma-4-31B-it-qat-q4_0-unquantized (QAT q4_0 → bf16) → huihui-ai abliteration → Lna-Lab NVFP4 W4A4 → this bundle, adding google/gemma-4-31B-it-qat-q4_0-unquantized-assistant (bf16, unmodified) as the speculative draft.


| Component | Details |
| --- | --- |
| Body | `Gemma4ForConditionalGeneration`: 31B dense, 60 text layers (hidden 5376) + vision tower · NVFP4 W4A4 (`compressed-tensors` / `nvfp4-pack-quantized`) · 20.4 GB |
| Draft (`assistant/`) | `Gemma4AssistantForCausalLM` (`model_type: gemma4_assistant`): 4-layer MTP head riding the target's hidden states · bf16 · 0.94 GB |
| Spec method | vLLM `gemma4_mtp`, `num_speculative_tokens: 4` |
| Hardware | NVIDIA Blackwell (SM120) required · TP=4 (4× 16 GB) recommended; TP=2 + draft does not leave usable KV on 16 GB cards |
| vLLM | ≥ 0.21 (compressed-tensors NVFP4 auto-detect + `gemma4_mtp`; measured on 0.21.0) |
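
Before pulling roughly 21 GB, it can help to confirm the hardware requirement listed above. The following is a minimal pre-flight sketch, not part of this repo. It assumes `nvidia-smi` is on `PATH` and that the driver supports the `compute_cap` query field; the function names are hypothetical. It treats compute capability 12.0 as SM120.

```python
import subprocess

def parse_gpus(csv_text):
    """Parse `compute_cap, memory.total` CSV rows into (major, minor, mem_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        cap, mem = [field.strip() for field in line.split(",")]
        major, minor = cap.split(".")
        gpus.append((int(major), int(minor), int(mem)))
    return gpus

def sm120_count(gpus):
    """Count GPUs that report compute capability 12.0 (assumed to mean SM120)."""
    return sum(1 for major, minor, _mem in gpus if (major, minor) == (12, 0))

def query_gpus():
    """Ask nvidia-smi for compute capability and total memory (MiB) of each GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpus(out)
```

On the target machine, `sm120_count(query_gpus())` should return at least 4 for the recommended TP=4 setup. Any lower count means the TP=4 Quickstart configuration will not apply as written.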

Why MTP, and why a bundle

Unlike Qwen3.6-style models, Gemma 4's multi-token prediction is not a head baked into the main checkpoint. Google ships it as a separate assistant checkpoint that vLLM's `--speculative-config` loads alongside the target. That normally costs you a second `hf download` and some manual path wiring. This repo removes both steps: the assistant lives in `assistant/` and you point the config at the local subfolder.

Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4/
├── model.safetensors          # NVFP4 W4A4 body (20.4 GB)
├── config.json / generation_config.json / processor_config.json
├── tokenizer.json / tokenizer_config.json / chat_template.jinja
├── recipe.yaml                # llm-compressor recipe
└── assistant/                 # gemma4_mtp draft (bf16, 0.94 GB)
    ├── model.safetensors
    ├── config.json            # model_type: gemma4_assistant
    └── tokenizer / chat_template
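
Given the layout above, the `--speculative-config` value is just a small JSON object whose `model` field is the `assistant/` subfolder of the downloaded bundle. The sketch below builds that JSON with `json.dumps` so shell quoting does not have to be done by hand. The helper names and the `/models/gemma4-31b-mtp` path are hypothetical; the keys and flags are the ones used in the Quickstart.

```python
import json
import shlex
from pathlib import Path

def speculative_config(bundle_dir, num_speculative_tokens=4):
    """Return the JSON string for --speculative-config, pointing at <bundle>/assistant."""
    draft_path = Path(bundle_dir) / "assistant"
    return json.dumps({
        "method": "gemma4_mtp",
        "model": str(draft_path),
        "num_speculative_tokens": num_speculative_tokens,
    })

def serve_command(bundle_dir, tensor_parallel_size=4):
    """Build an argv list for `vllm serve`; only a subset of the Quickstart flags is shown."""
    return [
        "vllm", "serve", str(bundle_dir),
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--speculative-config", speculative_config(bundle_dir),
    ]

cmd = serve_command("/models/gemma4-31b-mtp")  # hypothetical local path
print(shlex.join(cmd))  # shell-safe command line, quoting handled by shlex
```

Passing the list to `subprocess.run(cmd)` avoids the escaped-quote string entirely; the remaining Quickstart flags can be appended to the list in the same way.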

Quickstart (TP=4 recommended)

hf download sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4 \
  --local-dir gemma4-31b-mtp
DIR=$(realpath gemma4-31b-mtp)

NCCL_P2P_DISABLE=1 vllm serve "$DIR" \
  --served-model-name gemma4-31b-mtp \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-batched-tokens 8192 \
  --limit-mm-per-prompt '{"image":0}' \
  --speculative-config "{\"method\":\"gemma4_mtp\",\"model\":\"$DIR/assistant\",\"num_speculative_tokens\":4}"

The point: model in --speculative-config is the bundled local path — no second download, no HF resolution at serve time.
TP=4 is the right call twice over: (1) the dense 31B gains +64% from TP=2→4 even without spec-decode, and (2) the draft's VRAM share makes TP=2 KV-starved on 16 GB cards. This config measured KV 17,385 tok (bf16 KV; add --kv-cache-dtype fp8 for more).
vLLM 0.21 multimodal budget trap: keep --max-num-batched-tokens ≥ 2496 even with '{"image":0}', or startup fails validating the multimodal token budget.
NCCL_P2P_DISABLE=1 + --disable-custom-all-reduce are required on PCIe no-NVLink boxes (TP hangs without them); drop both if you have NVLink/P2P. Keep CUDA graphs ON.
vLLM 0.21's quantization-inheritance trap does not fire here: with an explicit draft model path the draft's own config decides (bf16). Only the model:null MTP-from-target path inherits target quantization.
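
Once the server is up, it can be queried like any OpenAI-compatible endpoint. The sketch below is a minimal text-only client using only the Python standard library. It assumes vLLM's default port 8000 (the Quickstart does not pass `--port`) and the `gemma4-31b-mtp` name from `--served-model-name`; the function names are hypothetical.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="gemma4-31b-mtp", max_tokens=128):
    """Build (without sending) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,  # T=0, as in the measurements in this card
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, `print(chat("Explain speculative decoding in two sentences."))` should return a short completion. The request is text-only because the Quickstart sets the image limit to 0.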

Measured (RTX PRO 2000 Blackwell 16 GB ×4, TP=4, PCIe no-NVLink, vLLM 0.21.0, 2026-06-12)

Single-stream, T=0 chat completions, ×3 each. Acceptance = accepted/drafted from /metrics.

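| config | JA 128 (tok/s) | JA 512 (tok/s) | EN 128 (tok/s) | EN 512 (tok/s) | acceptance JA / EN |
| --- | --- | --- | --- | --- | --- |
| baseline (no spec) | 36.4 | 36.3 | 36.7 | n/a | n/a |
| native MTP (this bundle, N=4) | 85.7 | 75.6 | 106.1 | 91.0 | 41–51% / 55–71% |
| EAGLE-3 (vanilla-trained NVFP4 draft) N=3 | 33.5 | 34.0 | 46.9 | n/a | 1–3% / 16% |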

JA 2.1–2.4×, EN 2.5–2.9× over baseline. A dense 31B at 36 tok/s leaves four GPUs verification-hungry — exactly the regime where MTP shines (the already-fast MoE 26B sibling only gains 1.2–1.5× from the same trick).
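
The acceptance figures are accepted/drafted token counts read from `/metrics`. Because Prometheus counters are cumulative, the rate for one benchmark run is the difference between a scrape taken before and one taken after. The sketch below shows that calculation. The two metric names are assumptions about what vLLM exposes and should be checked against your server's actual `/metrics` output. The parser also assumes samples carry no trailing timestamp.

```python
# Assumed counter names; verify against your server's /metrics output.
ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"
DRAFTED = "vllm:spec_decode_num_draft_tokens_total"

def counter_sum(metrics_text, name):
    """Sum all samples of one counter across label sets in Prometheus text format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        # A sample is written either as `name value` or `name{labels} value`.
        if metric == name or metric.startswith(name + "{"):
            total += float(value)
    return total

def acceptance_rate(before, after, accepted_name=ACCEPTED, drafted_name=DRAFTED):
    """Accepted/drafted ratio over the interval between two /metrics scrapes."""
    accepted = counter_sum(after, accepted_name) - counter_sum(before, accepted_name)
    drafted = counter_sum(after, drafted_name) - counter_sum(before, drafted_name)
    return accepted / drafted if drafted else float("nan")
```

In practice `before` and `after` would be the response bodies of two `GET /metrics` calls surrounding the benchmark, so traffic from earlier runs does not leak into the ratio.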

Two lessons paid for in benchmarks:

EAGLE-3 was dead on arrival against this abliterated QAT body. A draft trained on the vanilla 31B with English data gets only 1–3% JA acceptance, which drags JA throughput below baseline (33.5–34.0 vs 36.3–36.4 tok/s), and only 16% EN acceptance. Distribution shift between drafter and verifier kills it. The Google MTP assistant shrugs off both the abliteration shift and the language shift (41–51% JA acceptance despite vanilla training) because its 4-layer head reuses the target's own hidden states instead of imitating its distribution from scratch.
If your traffic is Japanese (or anything non-English), the MTP assistant is the only draft of the ones we tested that pays.

Concurrent (aggregate throughput)

4 / 8 concurrent streams × 256 tok each (T=0, diverse prompts, prefix-cache busted, ×3 averaged). Baseline = same body, no spec-decode (measured with JA prompts; baseline JA≈EN single-stream).

| streams | baseline tok/s | MTP JA tok/s | MTP EN tok/s | acceptance JA / EN |
| --- | --- | --- | --- | --- |
| 1 | 36.4 | 85.7 (2.4×) | 106.1 (2.9×) | 41–51% / 55–71% |
| 4 | 139.5 | 211.6 (+52%) | 251.9 (+81%) | ~39% / ~52% |
| 8 | 252.9 | 320.1 (+27%) | 387.2 (+53%) | ~41% / ~53% |

MTP keeps winning at every concurrency this box can reach. The multiplier decays as batching fills the GPUs (2.4× → 1.5× → 1.3× JA), but a dense 31B at 253 tok/s aggregate is still verification-hungry, acceptance holds at ~40% / ~53% under batch, and the KV budget (17,385 tok) caps realistic concurrency long before any crossover. On this model there is no regime where you should turn MTP off. (Contrast: the MoE 26B sibling, already compute-saturated, breaks even at 8 streams.)
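
The multipliers and the KV ceiling quoted above can be rechecked with a few lines of arithmetic. The sketch below uses only numbers from this card (the concurrency table, the 17,385-token KV figure and `--max-model-len 8192`); the variable and function names are illustrative.

```python
# Aggregate tok/s from the concurrency table, keyed by number of streams.
baseline = {1: 36.4, 4: 139.5, 8: 252.9}
mtp_ja = {1: 85.7, 4: 211.6, 8: 320.1}
mtp_en = {1: 106.1, 4: 251.9, 8: 387.2}

def speedups(mtp, base):
    """Ratio of MTP to baseline throughput at each stream count."""
    return {n: round(mtp[n] / base[n], 2) for n in base}

ja = speedups(mtp_ja, baseline)
en = speedups(mtp_en, baseline)
print(ja)  # {1: 2.35, 4: 1.52, 8: 1.27}
print(en)  # {1: 2.91, 4: 1.81, 8: 1.53}

# KV budget measured with the Quickstart config (bf16 KV cache).
kv_tokens = 17_385
max_model_len = 8192
full_length_sequences = kv_tokens // max_model_len  # sequences at full context
tokens_per_stream_at_8 = kv_tokens // 8             # prompt + output per stream
print(full_length_sequences, tokens_per_stream_at_8)  # 2 2173
```

With only about 2,170 tokens of KV per stream at 8 streams, memory becomes the limit well before the MTP multiplier approaches 1×.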

The body: QAT × NVFP4 (the finding, in short)

Full W4A4 NVFP4 breaks non-QAT gemma-4 — the non-QAT 12B collapsed outright on this exact recipe. This 31B's q4_0 QAT weights take it cleanly: fluent Japanese, correct multi-step logic, valid haiku, zero repetition/mojibake, with a plain ultrachat 256×2048 calibration. If you want gemma-4 in NVFP4 W4A4, go through a QAT checkpoint. Full evidence and bake recipe (pure-CPU calibration through the multimodal processor — multi-GPU dispatch silently corrupts gemma4 activations on no-P2P boxes) in the non-MTP card.

Notes

Abliterated (uncensored). Refusal behavior removed upstream — you are responsible for your deployment. Use responsibly and lawfully.
NVFP4 is Blackwell-specific; the body will not run on Ampere/Hopper. The bf16 assistant is served on the same GPUs as the body, so the Blackwell requirement applies to the whole bundle.
The assistant/ checkpoint is google's, redistributed unmodified under the same Gemma terms; its original model card is included as assistant/README.md.
Gemma is provided under and subject to the Gemma Terms of Use.

Credits

Original model & MTP assistant: Google DeepMind (Gemma 4, QAT q4_0)
QAT-unquantize & abliteration: huihui-ai
NVFP4 quantization, spec-decode measurement & bundle: Lna-Lab · Tooling: llm-compressor / vLLM

Support the Base Model Author (huihui-ai)

If you find the abliterated base useful, please support huihui-ai:

Ko-fi: https://ko-fi.com/huihuiai
Bitcoin: bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
