Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4
One repo, speculative decoding included. On this dense 31B the gain is large: up to 2.4× (Japanese) and 2.9× (English) single-stream throughput over the no-spec baseline. This is the MTP bundle of Huihui-gemma-4-31B-it-qat-abliterated-NVFP4: the NVFP4 (full W4A4) body plus the matching gemma4_assistant MTP draft checkpoint in assistant/, so a single hf download gives you everything vllm serve --speculative-config needs.
Measured: Japanese 86 tok/s · English 106 tok/s single-stream on 4× RTX PRO 2000 Blackwell 16 GB (vs 36–37 baseline) — a QAT-origin, abliterated Gemma 4 31B dense in 20.4 GB + a 0.94 GB draft, pulled into the practical zone on 4 entry-level Blackwell cards.
Lineage: google/gemma-4-31B-it-qat-q4_0-unquantized (QAT q4_0 → bf16) → huihui-ai abliteration → Lna-Lab NVFP4 W4A4 → this bundle, adding google/gemma-4-31B-it-qat-q4_0-unquantized-assistant (bf16, unmodified) as the speculative draft.
| Component | Details |
|---|---|
| Body | Gemma4ForConditionalGeneration: 31B dense, 60 text layers (hidden 5376) + vision tower · NVFP4 W4A4 (compressed-tensors / nvfp4-pack-quantized) · 20.4 GB |
| Draft (assistant/) | Gemma4AssistantForCausalLM (model_type: gemma4_assistant): 4-layer MTP head riding the target's hidden states · bf16 · 0.94 GB |
| Spec method | vLLM gemma4_mtp, num_speculative_tokens: 4 |
| Hardware | NVIDIA Blackwell (SM120) required · TP=4 (4× 16 GB) recommended; TP=2 + draft does not leave usable KV on 16 GB cards |
| vLLM | ≥ 0.21 (compressed-tensors NVFP4 auto-detect + gemma4_mtp; measured on 0.21.0) |
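The hardware row can be sanity-checked with rough arithmetic. The sketch below uses the two checkpoint sizes from the table and a 0.90 memory-utilization cap. It assumes the body shards evenly across tensor-parallel ranks and that the draft is replicated on every rank; both are simplifying assumptions, not measured behavior. The function name `headroom_gb` is hypothetical.

```python
# Rough per-GPU VRAM budget. Sizes are from this card; the sharding model is an assumption.
BODY_GB = 20.4       # NVFP4 body
DRAFT_GB = 0.94      # bf16 assistant draft
GPU_GB = 16.0        # one RTX PRO 2000 Blackwell
UTILIZATION = 0.90   # fraction of VRAM the server is allowed to use


def headroom_gb(tp: int, draft_replicated: bool = True) -> float:
    """GB left per GPU for KV cache, activations and CUDA graphs."""
    usable = GPU_GB * UTILIZATION
    body_share = BODY_GB / tp  # assumes an even split across ranks
    draft_share = DRAFT_GB if draft_replicated else DRAFT_GB / tp
    return usable - body_share - draft_share


for tp in (2, 4):
    print(f"TP={tp}: ~{headroom_gb(tp):.2f} GB per GPU before KV cache and activations")
```

These figures are upper bounds: activations, CUDA graphs and the vision tower all come out of the same headroom, which is why the TP=2 number is tighter in practice than it looks here.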
Why MTP, and why a bundle
Gemma 4's multi-token-prediction is not a head baked into the main checkpoint (Qwen3.6-style). Google ships it as a separate assistant checkpoint that vLLM's --speculative-config loads alongside the target. That normally costs you a second hf download and a path dance. This repo ends that: the assistant lives in assistant/ and you point the config at the local subfolder.
Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4/
├── model.safetensors # NVFP4 W4A4 body (20.4 GB)
├── config.json / generation_config.json / processor_config.json
├── tokenizer.json / tokenizer_config.json / chat_template.jinja
├── recipe.yaml # llm-compressor recipe
└── assistant/ # gemma4_mtp draft (bf16, 0.94 GB)
├── model.safetensors
├── config.json # model_type: gemma4_assistant
└── tokenizer / chat_template
Quickstart (TP=4 recommended)
hf download sakamakismile/Huihui-gemma-4-31B-it-qat-abliterated-MTP-NVFP4 \
--local-dir gemma4-31b-mtp
DIR=$(realpath gemma4-31b-mtp)
NCCL_P2P_DISABLE=1 vllm serve "$DIR" \
--served-model-name gemma4-31b-mtp \
--tensor-parallel-size 4 \
--disable-custom-all-reduce \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--max-num-batched-tokens 8192 \
--limit-mm-per-prompt '{"image":0}' \
--speculative-config "{\"method\":\"gemma4_mtp\",\"model\":\"$DIR/assistant\",\"num_speculative_tokens\":4}"
- The point: model in --speculative-config is the bundled local path, so there is no second download and no HF resolution at serve time.
- TP=4 is the right call twice over: (1) the dense 31B gains +64% from TP=2→4 even without spec-decode, and (2) the draft's VRAM share makes TP=2 KV-starved on 16 GB cards. This config measured KV 17,385 tok (bf16 KV; add --kv-cache-dtype fp8 for more).
- vLLM 0.21 multimodal budget trap: keep --max-num-batched-tokens ≥ 2496 even with '{"image":0}', or startup fails validating the multimodal token budget.
- NCCL_P2P_DISABLE=1 + --disable-custom-all-reduce are required on PCIe no-NVLink boxes (TP hangs without them); drop both if you have NVLink/P2P. Keep CUDA graphs ON.
- vLLM 0.21's quantization-inheritance trap does not fire here: with an explicit draft model path the draft's own config decides (bf16). Only the model:null MTP-from-target path inherits target quantization.
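The escaped JSON in --speculative-config is easy to get wrong by hand. As a minimal sketch, the same serve command can be assembled in Python so that json.dumps and shlex.join handle the quoting. The flags and values are the ones from the quickstart above; the helper names and the /models/gemma4-31b-mtp path are hypothetical.

```python
import json
import shlex
from pathlib import PurePosixPath


def speculative_config(model_dir: str, num_speculative_tokens: int = 4) -> str:
    """JSON for --speculative-config, pointing at the bundled assistant/ subfolder."""
    draft = PurePosixPath(model_dir) / "assistant"
    config = {
        "method": "gemma4_mtp",
        "model": str(draft),
        "num_speculative_tokens": num_speculative_tokens,
    }
    return json.dumps(config, separators=(",", ":"))


def serve_command(model_dir: str) -> list[str]:
    """Argument list mirroring the quickstart; pass it to subprocess.run or print it."""
    return [
        "vllm", "serve", model_dir,
        "--served-model-name", "gemma4-31b-mtp",
        "--tensor-parallel-size", "4",
        "--disable-custom-all-reduce",
        "--max-model-len", "8192",
        "--gpu-memory-utilization", "0.90",
        "--max-num-batched-tokens", "8192",
        "--limit-mm-per-prompt", '{"image":0}',
        "--speculative-config", speculative_config(model_dir),
    ]


if __name__ == "__main__":
    # Hypothetical local path; use the directory you downloaded the repo into.
    print("NCCL_P2P_DISABLE=1 " + shlex.join(serve_command("/models/gemma4-31b-mtp")))
```

Passing the list to subprocess.run with NCCL_P2P_DISABLE=1 in the environment avoids shell quoting entirely; printing it gives a line that can be pasted into a terminal.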
Measured (RTX PRO 2000 Blackwell 16 GB ×4, TP=4, PCIe no-NVLink, vLLM 0.21.0, 2026-06-12)
Single-stream, T=0 chat completions, ×3 each; all throughput figures are in tok/s. Acceptance = accepted/drafted from /metrics.
| config | JA 128 | JA 512 | EN 128 | EN 512 | acceptance JA / EN |
|---|---|---|---|---|---|
| baseline (no spec) | 36.4 | 36.3 | 36.7 | — | — |
| native MTP (this bundle, N=4) | 85.7 | 75.6 | 106.1 | 91.0 | 41–51% / 55–71% |
| EAGLE-3 (vanilla-trained NVFP4 draft) N=3 | 33.5 | 34.0 | 46.9 | — | 1–3% / 16% |
JA 2.1–2.4×, EN 2.5–2.9× over baseline. A dense 31B at 36 tok/s leaves four GPUs verification-hungry — exactly the regime where MTP shines (the already-fast MoE 26B sibling only gains 1.2–1.5× from the same trick).
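The multipliers quoted above follow directly from the table. A small sketch that recomputes them from the tok/s values; because the table has no EN 512 baseline, the EN 128 baseline is reused for that column, which is an assumption of this example.

```python
# tok/s values copied from the single-stream table above.
BASELINE = {"JA 128": 36.4, "JA 512": 36.3, "EN 128": 36.7}
MTP = {"JA 128": 85.7, "JA 512": 75.6, "EN 128": 106.1, "EN 512": 91.0}


def speedups(mtp: dict, baseline: dict) -> dict:
    """Throughput ratio MTP / baseline per column, rounded to two decimals."""
    result = {}
    for column, tok_per_s in mtp.items():
        language = column.split()[0]
        # Assumption: fall back to the 128 baseline when a column has no measurement.
        base = baseline.get(column, baseline[f"{language} 128"])
        result[column] = round(tok_per_s / base, 2)
    return result


for column, ratio in speedups(MTP, BASELINE).items():
    print(f"{column}: {ratio}x")
```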
Two lessons paid for in benchmarks:
- EAGLE-3 was dead on arrival against this abliterated QAT body: a vanilla-31B-trained, English-data draft gets 1–3% JA acceptance (JA throughput falls below baseline, 33.5 vs 36.4 tok/s) and only 16% EN. Distribution shift between drafter and verifier kills it. The Google MTP assistant shrugs both problems off (41–51% JA acceptance despite vanilla training) because its 4-layer head re-uses the target's own hidden states instead of imitating its distribution from scratch.
- If your traffic is Japanese (or anything non-English), the MTP assistant is the only draft of the ones we tested that pays.
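Acceptance in this card is accepted/drafted as read from /metrics. The sketch below computes that ratio from a Prometheus text dump and converts it into expected tokens per verification step, using the identity that each step emits the accepted draft tokens plus one token from the target. The two counter names are assumptions about vLLM's metric naming, so check them against your own server's /metrics output; the sample text is made up for illustration.

```python
import re

# Assumed counter names (Prometheus text format); confirm against your server's /metrics.
ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"
DRAFTED = "vllm:spec_decode_num_draft_tokens_total"


def counter_sum(metrics_text: str, name: str) -> float:
    """Sum every sample of one counter, ignoring labels and comment lines."""
    pattern = re.compile(re.escape(name) + r"(?:\{[^}]*\})?\s+(\S+)")
    total = 0.0
    for line in metrics_text.splitlines():
        match = pattern.match(line)
        if match:
            total += float(match.group(1))
    return total


def acceptance_rate(metrics_text: str) -> float:
    """accepted / drafted, or 0.0 when nothing has been drafted yet."""
    drafted = counter_sum(metrics_text, DRAFTED)
    return counter_sum(metrics_text, ACCEPTED) / drafted if drafted else 0.0


def tokens_per_step(acceptance: float, num_speculative_tokens: int = 4) -> float:
    """Mean tokens emitted per verification step: accepted drafts plus one target token."""
    return 1.0 + num_speculative_tokens * acceptance


# Illustrative sample only; these are not measured values.
SAMPLE = """# TYPE vllm:spec_decode_num_draft_tokens_total counter
vllm:spec_decode_num_draft_tokens_total{model_name="gemma4-31b-mtp"} 1000.0
vllm:spec_decode_num_accepted_tokens_total{model_name="gemma4-31b-mtp"} 460.0
"""

rate = acceptance_rate(SAMPLE)
print(f"acceptance {rate:.0%}, about {tokens_per_step(rate):.2f} tokens per step")
```

The counters are cumulative since server start, so for a per-run figure take the difference between two scrapes. With N=4, the 41–51% JA acceptance range corresponds to roughly 2.6–3.0 tokens per step; the measured 2.1–2.4× speedup being somewhat lower would be consistent with the draft and verification passes carrying their own cost.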
Concurrent (aggregate throughput)
4 / 8 concurrent streams × 256 tok each (T=0, diverse prompts, prefix-cache busted, ×3 averaged). Baseline = same body, no spec-decode (measured with JA prompts; baseline JA≈EN single-stream).
| streams | baseline tok/s | MTP JA tok/s | MTP EN tok/s | acceptance JA / EN |
|---|---|---|---|---|
| 1 | 36.4 | 85.7 (2.4×) | 106.1 (2.9×) | 41–51% / 55–71% |
| 4 | 139.5 | 211.6 (+52%) | 251.9 (+81%) | ~39% / ~52% |
| 8 | 252.9 | 320.1 (+27%) | 387.2 (+53%) | ~41% / ~53% |
MTP keeps winning at every concurrency this box can reach. The multiplier decays as batching fills the GPUs (2.4× → 1.5× → 1.3× JA), but a dense 31B at 253 tok/s aggregate is still verification-hungry, acceptance holds 40%/53% under batch, and the KV budget (17,385 tok) caps realistic concurrency long before any crossover. On this model there is no regime where you should turn MTP off. (Contrast: the MoE 26B sibling — already compute-saturated — breaks even at 8 streams.)
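To see what the aggregate numbers mean for an individual request, the table can be re-expressed as per-stream throughput alongside the MTP gain. A minimal sketch using only the values from the concurrency table above; the function name is hypothetical.

```python
# (baseline, MTP JA, MTP EN) aggregate tok/s, copied from the concurrency table above.
AGGREGATE = {
    1: (36.4, 85.7, 106.1),
    4: (139.5, 211.6, 251.9),
    8: (252.9, 320.1, 387.2),
}


def summarize(streams: int) -> dict:
    """MTP gain over baseline and per-stream tok/s at a given concurrency."""
    base, ja, en = AGGREGATE[streams]
    return {
        "ja_gain": round(ja / base, 2),
        "en_gain": round(en / base, 2),
        "base_per_stream": round(base / streams, 1),
        "ja_per_stream": round(ja / streams, 1),
        "en_per_stream": round(en / streams, 1),
    }


for streams in AGGREGATE:
    print(streams, summarize(streams))
```

Read this way, each of 8 concurrent Japanese streams still runs at about 40 tok/s with MTP, slightly faster than a single stream on the baseline.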
The body: QAT × NVFP4 (the finding, in short)
Full W4A4 NVFP4 breaks non-QAT gemma-4 — the non-QAT 12B collapsed outright on this exact recipe. This 31B's q4_0 QAT weights take it cleanly: fluent Japanese, correct multi-step logic, valid haiku, zero repetition/mojibake, with a plain ultrachat 256×2048 calibration. If you want gemma-4 in NVFP4 W4A4, go through a QAT checkpoint. Full evidence and bake recipe (pure-CPU calibration through the multimodal processor — multi-GPU dispatch silently corrupts gemma4 activations on no-P2P boxes) in the non-MTP card.
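To make concrete how coarse the 4-bit grid is, here is a simplified fake-quantization sketch of the NVFP4 value format: FP4 E2M1 magnitudes with one scale per block of 16 elements. It is an illustration only, not the llm-compressor recipe used for this model. Real NVFP4 stores the block scale in FP8 (E4M3) and adds a per-tensor scale, both of which are omitted here, and ties are broken toward the smaller magnitude for simplicity.

```python
import math

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable FP4 magnitudes
BLOCK = 16                                             # elements sharing one scale


def quantize_block(block: list) -> list:
    """Snap one block to the E2M1 grid after scaling its largest magnitude to 6.0."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / 6.0  # simplification: kept in full precision, not rounded to FP8
    out = []
    for x in block:
        magnitude = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(magnitude * scale, x))
    return out


def fake_quantize(values: list) -> list:
    """Quantize then dequantize, block by block, to expose the rounding error."""
    out = []
    for start in range(0, len(values), BLOCK):
        out.extend(quantize_block(values[start:start + BLOCK]))
    return out


weights = [6.0, 3.0, 1.0, 0.4, -2.4]
for original, rounded in zip(weights, fake_quantize(weights)):
    print(f"{original:+.2f} -> {rounded:+.2f}")
```

With only eight magnitudes per block, small values next to a large outlier lose most of their resolution. W4A4 applies this kind of rounding to activations as well as weights.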
Notes
- Abliterated (uncensored). Refusal behavior removed upstream — you are responsible for your deployment. Use responsibly and lawfully.
- NVFP4 is Blackwell-specific; the body will not run on Ampere/Hopper. The bf16 assistant inherits the body's GPU anyway.
- The assistant/ checkpoint is Google's, redistributed unmodified under the same Gemma terms; its original model card is included as assistant/README.md.
- Gemma is provided under and subject to the Gemma Terms of Use.
Credits
- Original model & MTP assistant: Google DeepMind (Gemma 4, QAT q4_0)
- QAT-unquantize & abliteration: huihui-ai
- NVFP4 quantization, spec-decode measurement & bundle: Lna-Lab · Tooling: llm-compressor / vLLM
Support the Base Model Author (huihui-ai)
If you find the abliterated base useful, please support huihui-ai:
- Ko-fi: https://ko-fi.com/huihuiai
- Bitcoin: bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge