How to use from the Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16")
model = AutoModelForMultimodalLM.from_pretrained("sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Huihui-gemma-4-12B-it-abliterated-NVFP4A16

NVFP4 (W4A16) quantization of huihui-ai/Huihui-gemma-4-12B-it-abliterated — the abliterated (uncensored) Gemma 4 12B unified model (text + vision + audio).

24 GB (BF16) → 7.7 GB. Runs on a single 16 GB Blackwell GPU, or shards across several for higher throughput. Up to 118 tok/s single-stream (TP=4 + MTP speculative decode) and ~1117 tok/s aggregate (32 concurrent requests, no speculative decode).

| Property | Value |
|---|---|
| Base | huihui-ai/Huihui-gemma-4-12B-it-abliterated (abliterated google/gemma-4-12B-it) |
| Architecture | Gemma4UnifiedForConditionalGeneration (12B dense, 48 layers, 131K ctx) |
| Quantization | NVFP4A16: weights FP4 (group 16, FP8 scales), activations BF16 |
| Format | compressed-tensors / nvfp4-pack-quantized (native vLLM) |
| Tool | llm-compressor |
| Size | 7.7 GB · Requires NVIDIA Blackwell (SM120) |

Weight-only FP4 (W4A16) keeps activations at BF16, so it is robust where full W4A4 NVFP4 collapses on this architecture.
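
As a rough sanity check on the size figures, the sketch below estimates the weight footprint from the layout listed above (FP4 weights, one FP8 scale per group of 16). It assumes a nominal 12e9 parameters and that every weight is quantized; neither is stated exactly on this card, and per-tensor global scales are ignored.

```python
# Back-of-the-envelope size estimate for NVFP4 weight-only quantization.
PARAMS = 12e9       # assumption: nominal "12B" parameter count
BF16_BITS = 16      # bits per weight in the BF16 base
FP4_BITS = 4        # bits per quantized weight
SCALE_BITS = 8      # one FP8 scale ...
GROUP_SIZE = 16     # ... shared by each group of 16 weights

# Effective storage cost per weight, including the amortized group scale.
bits_per_weight = FP4_BITS + SCALE_BITS / GROUP_SIZE

bf16_gb = PARAMS * BF16_BITS / 8 / 1e9
nvfp4_gb = PARAMS * bits_per_weight / 8 / 1e9

print(f"effective bits/weight: {bits_per_weight}")
print(f"BF16 estimate:  {bf16_gb:.2f} GB")
print(f"NVFP4 estimate: {nvfp4_gb:.2f} GB (if every weight were quantized)")
```

The estimate lands below the published 7.7 GB. The gap would be consistent with tensors kept at higher precision, such as the BF16 vision/audio embedders mentioned in the Notes, though the exact split is not given here.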


Quickstart

Requires a Blackwell GPU (SM120 / RTX 50-series / GB10 / B100/B200), Docker with the NVIDIA runtime, and the hf CLI. Gemma 4 unified is brand new, so you need a vLLM nightly build (releases up to and including 0.22.1 lack the Gemma4Unified class).

# 1) Download this model (7.7 GB). For spec-decode, also grab the 0.4B MTP draft.
hf download sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16 --local-dir ./model
hf download google/gemma-4-12B-it-assistant --local-dir ./draft   # optional, for spec-decode

# 2a) Simplest — single GPU, no speculative decode
docker run --rm --gpus '"device=0"' --ipc=host --shm-size 16gb -p 8000:8000 \
  -v $PWD/model:/model:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-12b --max-model-len 65536 \
  --gpu-memory-utilization 0.92 --trust-remote-code

Multi-GPU — read this if your box has no NVLink

On consumer/entry Blackwell (e.g. RTX PRO 2000) over plain PCIe there is no working GPU P2P, and vLLM tensor-parallel hangs unless you disable both NCCL P2P and vLLM's custom all-reduce:

# NCCL_P2P_DISABLE=1: without it, startup hangs at NCCL init.
# --disable-custom-all-reduce: without it, the forward pass deadlocks.
# (Comments live up here because text after a trailing backslash breaks the line continuation.)
docker run --rm --gpus '"device=0,1,2,3"' --ipc=host --shm-size 16gb -p 8000:8000 \
  -e NCCL_P2P_DISABLE=1 \
  -v $PWD/model:/model:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-12b \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --max-model-len 65536 --gpu-memory-utilization 0.85 --trust-remote-code

Maximum interactive speed — TP=4 + MTP speculative decode

Google ships a 0.4B MTP draft (google/gemma-4-12B-it-assistant). It raises single-stream throughput by roughly 1.6–1.8× and is lossless, because the target model verifies every token. Use num_speculative_tokens: 3 (the stable optimum; k≥5 collapses acceptance) and --kv-cache-dtype fp8 (an NVFP4 KV cache would break the draft):

docker run --rm --gpus '"device=0,1,2,3"' --ipc=host --shm-size 16gb -p 8000:8000 \
  -e NCCL_P2P_DISABLE=1 \
  -v $PWD/model:/model:ro -v $PWD/draft:/draft:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-12b \
  --tensor-parallel-size 4 --disable-custom-all-reduce \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","model":"/draft","num_speculative_tokens":3}' \
  --max-model-len 65536 --gpu-memory-utilization 0.85 --trust-remote-code
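
To illustrate why draft-based speculative decoding can be lossless under greedy sampling, here is a toy sketch. It is not vLLM's implementation: `target_next` and `draft_next` are made-up stand-ins for the 12B target and the 0.4B draft, and tokens are plain integers.

```python
def greedy_decode(target_next, prompt, max_new):
    """Reference: plain greedy decoding with the target model only."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(target_next, draft_next, prompt, max_new, k=3):
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies."""
    out = list(prompt)
    accepted = proposed = 0
    while len(out) - len(prompt) < max_new:
        # 1) The cheap draft proposes k tokens autoregressively.
        ctx, proposal = list(out), []
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        proposed += k
        # 2) The target checks the proposals in order. A real engine scores all
        #    k positions in one forward pass, which is where the speedup comes from.
        for tok in proposal:
            expected = target_next(out)
            out.append(expected)      # always the target's own choice
            if tok != expected:
                break                 # first mismatch: discard the rest of the draft
            accepted += 1
        else:
            out.append(target_next(out))  # all k accepted: one extra target token
    return out[len(prompt):len(prompt) + max_new], accepted / proposed


# Toy models (assumptions for illustration only).
def target_next(ctx):
    return (ctx[-1] * 3 + len(ctx)) % 11


def draft_next(ctx):
    guess = target_next(ctx)
    return (guess + 1) % 11 if len(ctx) % 5 == 0 else guess  # occasionally wrong


tokens, acceptance = speculative_decode(target_next, draft_next, [1], max_new=20, k=3)
print(tokens == greedy_decode(target_next, [1], 20), f"acceptance={acceptance:.2f}")
```

Every emitted token is the target's own greedy choice, so the output matches target-only decoding regardless of how often the draft is wrong; draft quality only affects how many tokens each verification step yields.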

Test it:

curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d \
 '{"model":"gemma4-12b","messages":[{"role":"user","content":"Explain the CAP theorem in one sentence."}]}'
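
The same request can be sent from Python. This is a minimal standard-library sketch that assumes the server started above is listening on localhost:8000 under the served name gemma4-12b; `build_request` and `chat` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request

# Assumption: the container above is running and published on port 8000.
CHAT_URL = "http://localhost:8000/v1/chat/completions"


def build_request(prompt, model="gemma4-12b"):
    """Build the POST request for a single-turn chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, model="gemma4-12b", timeout=120):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Once the server is up, `print(chat("Explain the CAP theorem in one sentence."))` should print the reply.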

Flag cheat-sheet

| Flag / env | When | Why |
|---|---|---|
| vllm/vllm-openai:nightly | always | only nightly registers Gemma4UnifiedForConditionalGeneration |
| --trust-remote-code | always | new arch |
| NCCL_P2P_DISABLE=1 (env) | TP > 1 on no-NVLink | else hangs at NCCL init |
| --disable-custom-all-reduce | TP > 1 on no-NVLink | else the forward deadlocks |
| --ipc=host --shm-size 16gb | TP > 1 (docker) | host-path NCCL needs shared memory |
| --speculative-config '{"method":"mtp",…,"num_speculative_tokens":3}' | interactive | ~1.6–1.8× single-stream |
| --kv-cache-dtype fp8 | with spec-decode | nvfp4 KV collapses draft acceptance |
| --max-num-seqs 4 (+ --gpu-memory-utilization 0.95) | single GPU, long ctx | frees KV room for up to -c 32768 on 16 GB |
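
The rules in the cheat-sheet can be folded into a small command builder. This is only a sketch: `vllm_docker_command` and its parameters are hypothetical names, and the flag values are copied from the commands in this card rather than tuned for other hardware.

```python
import json
import os
import shlex


def vllm_docker_command(gpus, nvlink=False, spec_decode=False,
                        model_dir="model", draft_dir="draft"):
    """Assemble the docker argv for the given GPU ids (hypothetical helper)."""
    tp = len(gpus)
    no_p2p = tp > 1 and not nvlink  # multi-GPU over plain PCIe

    cmd = ["docker", "run", "--rm",
           "--gpus", '"device=' + ",".join(str(g) for g in gpus) + '"',
           "--ipc=host", "--shm-size", "16gb", "-p", "8000:8000"]
    if no_p2p:
        cmd += ["-e", "NCCL_P2P_DISABLE=1"]          # else hangs at NCCL init
    cmd += ["-v", os.path.abspath(model_dir) + ":/model:ro"]
    if spec_decode:
        cmd += ["-v", os.path.abspath(draft_dir) + ":/draft:ro"]

    cmd += ["vllm/vllm-openai:nightly",
            "--model", "/model", "--served-model-name", "gemma4-12b",
            "--max-model-len", "65536",
            "--gpu-memory-utilization", "0.92" if tp == 1 else "0.85",
            "--trust-remote-code"]
    if tp > 1:
        cmd += ["--tensor-parallel-size", str(tp)]
    if no_p2p:
        cmd += ["--disable-custom-all-reduce"]       # else the forward deadlocks
    if spec_decode:
        draft = {"method": "mtp", "model": "/draft", "num_speculative_tokens": 3}
        cmd += ["--kv-cache-dtype", "fp8",
                "--speculative-config", json.dumps(draft, separators=(",", ":"))]
    return cmd


cmd = vllm_docker_command([0, 1, 2, 3], nvlink=False, spec_decode=True)
print(shlex.join(cmd))
```

Building the argv as a list and printing it with `shlex.join` keeps the quoting of the `--gpus` and `--speculative-config` values intact when the command is pasted into a shell.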

Benchmarks

Measured on 4× RTX PRO 2000 Blackwell (16 GB, SM120, 288 GB/s, PCIe, no NVLink), TP=4, -c 65536 (context length, i.e. --max-model-len 65536).

Single-stream decode (interactive) — TP sweep, 1 request × 512 tok:

| TP | GPUs | no spec (tok/s) | + MTP, k=3 (tok/s) | MTP gain |
|---|---|---|---|---|
| 1 | 1 | 30.5 | 55.0 | 1.80× |
| 2 | 2 | 53.2 | 94.8 | 1.78× |
| 4 | 4 | 73.3 | 118.5 | 1.62× |

(TP=4 + MTP peaks at 121.0 with k=4, but k=3 is the stable optimum.) MTP gives a steady ~1.6–1.8× at every TP. TP scaling is sub-linear on this no-NVLink box (host-memory all-reduce). Pick by what you have:

| goal | config | single-stream (tok/s) | GPUs freed |
|---|---|---|---|
| low-power, 1-GPU resident | TP=1 + MTP | 55 | 5 |
| balanced | TP=2 + MTP | 95 | 4 |
| fastest interactive | TP=4 + MTP | 118 | 2 |
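
The MTP gain and the sub-linear TP scaling noted above can be recomputed from the single-stream numbers. The snippet below only re-derives ratios from the figures in the TP sweep; "scaling efficiency" is defined here, as an assumption, as throughput at TP=n divided by n times the TP=1 throughput.

```python
# TP -> (tok/s without spec-decode, tok/s with MTP k=3), from the TP sweep above.
single_stream = {1: (30.5, 55.0), 2: (53.2, 94.8), 4: (73.3, 118.5)}

base = single_stream[1][0]  # TP=1 without speculative decode
mtp_gain = {}
tp_efficiency = {}
for tp, (plain, mtp) in single_stream.items():
    mtp_gain[tp] = mtp / plain                 # speedup from MTP at this TP
    tp_efficiency[tp] = plain / (tp * base)    # 1.0 would be perfect linear scaling
    print(f"TP={tp}: MTP gain {mtp_gain[tp]:.2f}x, "
          f"TP scaling efficiency {tp_efficiency[tp]:.0%}")
```

On these numbers, TP=4 delivers about 60% of ideal linear scaling without speculative decode, which matches the host-memory all-reduce explanation given above.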

Aggregate throughput (concurrency sweep, no spec-decode):

| concurrency | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| tok/s (-c65536) | 73 | 145 | 274 | 487 | 796 | 1117 |
| tok/s (-c131072) | 74 | 145 | 275 | 498 | 792 | 1100 |

64K and 128K context decode identically (sliding-window KV). Rule: MTP spec-decode for low concurrency (≤8); turn it off for high-concurrency batch serving (it costs throughput once the batch saturates).
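
To make the concurrency trade-off concrete, the sketch below derives per-request throughput from the 64K row and encodes the rule of thumb as a helper. `use_mtp` and its threshold default are hypothetical names for illustration; the threshold of 8 is taken from the rule stated above.

```python
concurrency = [1, 2, 4, 8, 16, 32]
aggregate_64k = [73, 145, 274, 487, 796, 1117]  # tok/s, -c65536 row, no spec-decode

# Throughput each individual request sees as the batch grows.
per_request = {c: total / c for c, total in zip(concurrency, aggregate_64k)}

for c in concurrency:
    print(f"concurrency {c:>2}: {per_request[c]:5.1f} tok/s per request")


def use_mtp(concurrent_requests, threshold=8):
    """Rule of thumb from this card: spec-decode only at low concurrency."""
    return concurrent_requests <= threshold
```

Aggregate throughput keeps rising up to 32 concurrent requests while each request slows down to roughly half its single-stream rate, which is the regime where the card recommends turning speculative decode off.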

Quality — measured vs BF16 base and an FP8 build (same huihui base)

Greedy side-by-side on EN / 繁體中文 / 日本語 / code / facts / reasoning traps:

  • Standard tasks: identical. Facts (Chernobyl: April 1986, reactor 4), Traditional-Chinese & Japanese explanations, 17×23−100 = 291, 60 km / 45 min = 80 km/h, code — NVFP4 = FP8 = BF16 base, no collapse, no drift.
  • Hard reasoning traps (7 tested): a small, real W4A16 tax. FP8 matched the BF16 base on every trap the base got right; NVFP4 slipped on ~1 of 7 (it answered a Barbara-type syllogism "Yes" where "No" is correct, plus one minor secondary-detail slip). One age word problem fails even on the BF16 base, which is a model limit, not a quant artifact.

Verdict: half the size and faster than FP8, at standard-task parity. Choose FP8 for maximum reasoning fidelity; choose this NVFP4A16 for the best size/speed at ~85–90% reasoning parity — the right default for most local-agent and chat workloads.
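
The reference answers quoted for the standard tasks, and the parity estimate, are simple enough to check directly. The parity line assumes exactly one miss out of the seven traps, which is how "~1 of 7" is read here.

```python
# Reference answers for the arithmetic prompts quoted above.
product_answer = 17 * 23 - 100        # expected 291
speed_kmh = 60 / (45 / 60)            # 60 km in 45 min, expressed in km/h

# Reasoning parity if exactly 1 of the 7 traps is missed (assumption).
traps_tested = 7
traps_missed = 1
parity = (traps_tested - traps_missed) / traps_tested

print(product_answer, speed_kmh, f"{parity:.1%}")
```

One miss in seven works out to just under 86%, at the lower end of the ~85–90% range quoted in the verdict.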

Notes

  • Abliterated (uncensored). Use responsibly.
  • NVFP4 is Blackwell-specific; it will not run on Ampere/Hopper.
  • Multimodal vision/audio embedders kept in BF16.

Credits

Support the Base Model Author (huihui-ai)

If you find the abliterated base useful, please support huihui-ai:
