
Libraries

How to use sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16")
model = AutoModelForMultimodalLM.from_pretrained("sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

SGLang

How to use sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16 with Docker Model Runner:
```
docker model run hf.co/sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16
```

Huihui-gemma-4-12B-it-abliterated-NVFP4A16

NVFP4 (W4A16) quantization of huihui-ai/Huihui-gemma-4-12B-it-abliterated — the abliterated (uncensored) Gemma 4 12B unified model (text + vision + audio).

Weights shrink from 24 GB (BF16) to 7.7 GB. Runs on a single 16 GB Blackwell GPU, or shards across several for higher throughput. Up to 118 tok/s single-stream (TP=4 + MTP speculative decode) and ~1117 tok/s aggregate (32 concurrent requests, no speculative decode).


| Property | Value |
|---|---|
| Base | huihui-ai/Huihui-gemma-4-12B-it-abliterated (abliterated `google/gemma-4-12B-it`) |
| Architecture | `Gemma4UnifiedForConditionalGeneration`: 12B dense, 48 layers, 131K ctx |
| Quantization | NVFP4A16: weights FP4 (group 16, FP8 scales), activations BF16 |
| Format | `compressed-tensors` / `nvfp4-pack-quantized` (native vLLM) |
| Tool | llm-compressor |
| Size | 7.7 GB · Requires NVIDIA Blackwell (SM120) |

Weight-only FP4 (W4A16) keeps activations at BF16, so it is robust where full W4A4 NVFP4 collapses on this architecture.
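
As a rough check on the 24 GB → 7.7 GB figure, the sketch below estimates storage from the quantization row above (4-bit weights, one FP8 scale per group of 16). The exact 12e9 parameter count and the idea that every parameter is quantized are simplifying assumptions, not statements from the card. Since the multimodal embedders stay in BF16, the result should be read as a lower bound, not a prediction of the file size.

```python
# Back-of-envelope size estimate for NVFP4A16 weights.
# Assumptions (not from the model card): exactly 12e9 parameters, all of them
# quantized, and 1 GB = 1e9 bytes.

PARAMS = 12e9       # "12B dense"
WEIGHT_BITS = 4     # FP4 weight
SCALE_BITS = 8      # one FP8 scale ...
GROUP_SIZE = 16     # ... shared by a group of 16 weights
BF16_BITS = 16


def effective_bits_per_weight(weight_bits, scale_bits, group_size):
    """Bits stored per weight once the shared group scale is amortized."""
    return weight_bits + scale_bits / group_size


def size_gb(params, bits_per_weight):
    """Storage in GB for `params` weights at the given bit width."""
    return params * bits_per_weight / 8 / 1e9


bits = effective_bits_per_weight(WEIGHT_BITS, SCALE_BITS, GROUP_SIZE)
bf16_gb = size_gb(PARAMS, BF16_BITS)
nvfp4_gb = size_gb(PARAMS, bits)

print(f"effective bits per weight: {bits}")
print(f"BF16 estimate:             {bf16_gb:.1f} GB")
print(f"NVFP4A16 lower bound:      {nvfp4_gb:.2f} GB")
```

The gap between this lower bound and the published 7.7 GB is consistent with some tensors being kept at higher precision, though the card does not give a per-tensor breakdown.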

Quickstart

Requires a Blackwell GPU (SM120 / RTX 50-series / GB10 / B100/B200), Docker with the NVIDIA runtime, and the hf CLI. Gemma 4 unified is brand new, so you need a vLLM nightly build (releases ≤ 0.22.1 lack the Gemma4Unified class).

# 1) Download this model (7.7 GB). For spec-decode, also grab the 0.4B MTP draft.
hf download sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16 --local-dir ./model
hf download google/gemma-4-12B-it-assistant --local-dir ./draft   # optional, for spec-decode

# 2a) Simplest — single GPU, no speculative decode
docker run --rm --gpus '"device=0"' --ipc=host --shm-size 16gb -p 8000:8000 \
  -v $PWD/model:/model:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-12b --max-model-len 65536 \
  --gpu-memory-utilization 0.92 --trust-remote-code

Multi-GPU — read this if your box has no NVLink

On consumer/entry Blackwell (e.g. RTX PRO 2000) over plain PCIe there is no working GPU P2P, and vLLM tensor-parallel hangs unless you disable both NCCL P2P and vLLM's custom all-reduce:

# NCCL_P2P_DISABLE=1: without it, startup hangs at NCCL init.
# --disable-custom-all-reduce: without it, the forward pass deadlocks.
docker run --rm --gpus '"device=0,1,2,3"' --ipc=host --shm-size 16gb -p 8000:8000 \
  -e NCCL_P2P_DISABLE=1 \
  -v $PWD/model:/model:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-12b \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --max-model-len 65536 --gpu-memory-utilization 0.85 --trust-remote-code
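
The rule above (apply both workarounds only when TP > 1 and there is no NVLink) can be captured in a small launcher helper. This is a minimal sketch: `vllm_docker_command` and its `has_nvlink` parameter are hypothetical names, and the memory setting is fixed at the multi-GPU value of 0.85 for simplicity. Every flag it emits is taken from the commands in this README.

```python
import os
import shlex


def vllm_docker_command(gpu_ids, has_nvlink, model_dir="./model"):
    """Build the docker argv for serving the model on the given GPUs.

    `has_nvlink` is a hypothetical switch: pass False on PCIe-only boxes
    where GPU P2P does not work.
    """
    tp = len(gpu_ids)
    needs_workaround = tp > 1 and not has_nvlink
    devices = ",".join(str(i) for i in gpu_ids)

    cmd = ["docker", "run", "--rm", "--gpus", f'"device={devices}"',
           "--ipc=host", "--shm-size", "16gb", "-p", "8000:8000"]
    if needs_workaround:
        # Avoids the hang at NCCL init described above.
        cmd += ["-e", "NCCL_P2P_DISABLE=1"]
    cmd += ["-v", f"{os.path.abspath(model_dir)}:/model:ro",
            "vllm/vllm-openai:nightly",
            "--model", "/model", "--served-model-name", "gemma4-12b"]
    if tp > 1:
        cmd += ["--tensor-parallel-size", str(tp)]
    if needs_workaround:
        # Avoids the forward-pass deadlock described above.
        cmd += ["--disable-custom-all-reduce"]
    cmd += ["--max-model-len", "65536",
            "--gpu-memory-utilization", "0.85", "--trust-remote-code"]
    return cmd


cmd = vllm_docker_command([0, 1, 2, 3], has_nvlink=False)
print(shlex.join(cmd))
```

Returning an argv list rather than a string keeps quoting out of the logic; `shlex.join` is used only to print a copy-pasteable line, and the list can be passed directly to `subprocess.run`.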

Maximum interactive speed — TP=4 + MTP speculative decode

Google ships a 0.4B MTP draft (google/gemma-4-12B-it-assistant). It raises single-stream throughput by roughly 1.6–1.8× (see Benchmarks) and is lossless, because the target model verifies every drafted token. Use num_speculative_tokens: 3 (the stable optimum; k≥5 collapses acceptance) and --kv-cache-dtype fp8 (an NVFP4 KV cache would break the draft):

docker run --rm --gpus '"device=0,1,2,3"' --ipc=host --shm-size 16gb -p 8000:8000 \
  -e NCCL_P2P_DISABLE=1 \
  -v $PWD/model:/model:ro -v $PWD/draft:/draft:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-12b \
  --tensor-parallel-size 4 --disable-custom-all-reduce \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","model":"/draft","num_speculative_tokens":3}' \
  --max-model-len 65536 --gpu-memory-utilization 0.85 --trust-remote-code
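
To illustrate why speculative decoding can be lossless, here is a toy sketch of greedy draft-and-verify. It is not vLLM's implementation: `target_next` and `draft_next` are made-up stand-ins for the 12B target and the 0.4B draft, and only the greedy case is shown (sampling needs a rejection-sampling step that is omitted here).

```python
def target_next(tokens):
    """Toy 'target model': a deterministic next token for a given context."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens):
    """Toy 'draft model': agrees with the target except when the context
    length is a multiple of 4, where it guesses wrong on purpose."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 50


def greedy_decode(prompt, n_new):
    """Reference: plain greedy decoding with the target only."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, n_new, k=3):
    """Draft proposes k tokens, target verifies them; returns (tokens, rounds)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(tokens) < goal:
        rounds += 1
        # 1) The draft proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2) The target checks each position (a real engine does this in one
        #    batched forward pass instead of a Python loop).
        for tok in proposal:
            expected = target_next(tokens)
            if tok != expected:
                tokens.append(expected)  # keep the target's token, drop the rest
                break
            tokens.append(tok)
        else:
            # All k accepted: the verification pass also yields one extra token.
            tokens.append(target_next(tokens))
    return tokens[:goal], rounds


prompt = [1, 2, 3]
spec_tokens, rounds = speculative_decode(prompt, n_new=20, k=3)
print("same output as target-only decoding:", spec_tokens == greedy_decode(prompt, 20))
print("target verification rounds:", rounds, "for 20 new tokens")
```

Every token that is kept is one the target would have produced itself, so the output matches target-only decoding; the speedup comes from needing fewer target passes than tokens.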

Test it:

curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d \
 '{"model":"gemma4-12b","messages":[{"role":"user","content":"Explain the CAP theorem in one sentence."}]}'
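
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server from the commands above is listening on localhost:8000 with the served model name gemma4-12b; `build_request` and `ask` are illustrative helper names, and the response is read using the usual OpenAI-compatible `choices[0].message.content` layout.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"  # from -p 8000:8000
MODEL_NAME = "gemma4-12b"                                # from --served-model-name


def build_request(prompt, model=MODEL_NAME, url=BASE_URL):
    """Build (but do not send) a chat-completions POST request."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=120):
    """Send the prompt and return the text of the first choice."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# With the server running:
# print(ask("Explain the CAP theorem in one sentence."))
```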

Flag cheat-sheet

| Flag / env | When | Why |
|---|---|---|
| `vllm/vllm-openai:nightly` | always | only nightly registers `Gemma4UnifiedForConditionalGeneration` |
| `--trust-remote-code` | always | new arch |
| `NCCL_P2P_DISABLE=1` (env) | TP > 1 on no-NVLink | else hangs at NCCL init |
| `--disable-custom-all-reduce` | TP > 1 on no-NVLink | else the forward deadlocks |
| `--ipc=host --shm-size 16gb` | TP > 1 (docker) | host-path NCCL needs shared memory |
| `--speculative-config '{"method":"mtp",…,"num_speculative_tokens":3}'` | interactive | ~1.6–1.8× single-stream |
| `--kv-cache-dtype fp8` | with spec-decode | nvfp4 KV collapses draft acceptance |
| `--max-num-seqs 4` (+ `--gpu-memory-utilization 0.95`) | single GPU, long ctx | frees KV room for up to `-c 32768` on 16 GB |

Benchmarks

Measured on 4× RTX PRO 2000 Blackwell (16 GB, SM120, 288 GB/s, PCIe, no NVLink). TP=4 unless a table says otherwise; context length 65536 (-c 65536).

Single-stream decode (interactive) — TP sweep, 1 request × 512 tok:

| TP | GPUs | no spec (tok/s) | + MTP, k=3 (tok/s) | MTP gain |
|---|---|---|---|---|
| 1 | 1 | 30.5 | 55.0 | 1.80× |
| 2 | 2 | 53.2 | 94.8 | 1.78× |
| 4 | 4 | 73.3 | 118.5 | 1.62× |

(TP=4 + MTP peaks at 121.0 with k=4, but k=3 is the stable optimum.) MTP gives a steady ~1.6–1.8× at every TP. TP scaling is sub-linear on this no-NVLink box (host-memory all-reduce). Pick by what you have:

| goal | config | single-stream (tok/s) | GPUs freed |
|---|---|---|---|
| low-power, 1-GPU resident | TP=1 + MTP | 55 | 5 |
| balanced | TP=2 + MTP | 95 | 4 |
| fastest interactive | TP=4 + MTP | 118 | 2 |
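
The MTP gain and the sub-linear TP scaling can be rechecked from the sweep figures above. This is a small sketch; "efficiency" here is defined as measured tok/s divided by TP times the TP=1 figure, which is a convention chosen for this example, not a metric the card reports.

```python
# tok/s from the TP sweep above: {tp: (no_spec, with_mtp_k3)}
SWEEP = {1: (30.5, 55.0), 2: (53.2, 94.8), 4: (73.3, 118.5)}


def mtp_gain(tp):
    """Speedup from MTP speculative decode at a given TP degree."""
    no_spec, with_mtp = SWEEP[tp]
    return with_mtp / no_spec


def tp_efficiency(tp, column=0):
    """Fraction of ideal linear scaling achieved (column 0 = no spec, 1 = MTP)."""
    return SWEEP[tp][column] / (tp * SWEEP[1][column])


for tp in SWEEP:
    print(f"TP={tp}: MTP gain {mtp_gain(tp):.2f}x, "
          f"scaling efficiency {tp_efficiency(tp):.0%} (no spec), "
          f"{tp_efficiency(tp, 1):.0%} (MTP)")
```

Under that definition, going from one to four GPUs without speculative decode reaches about 60% of linear scaling, which matches the note about host-memory all-reduce on this box.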

Aggregate throughput (concurrency sweep, no spec-decode):

| concurrency | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| tok/s (-c65536) | 73 | 145 | 274 | 487 | 796 | 1117 |
| tok/s (-c131072) | 74 | 145 | 275 | 498 | 792 | 1100 |

64K and 128K context decode identically (sliding-window KV). Rule: MTP spec-decode for low concurrency (≤8); turn it off for high-concurrency batch serving (it costs throughput once the batch saturates).
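
To see what the aggregate numbers mean for an individual request, the sketch below divides the 64K-context row by its concurrency level. The per-stream and efficiency figures are derived here for illustration and do not appear in the original measurements.

```python
CONCURRENCY = [1, 2, 4, 8, 16, 32]
TOK_S_64K = [73, 145, 274, 487, 796, 1117]  # aggregate tok/s, -c65536 row

# Average tok/s that each concurrent request sees.
per_stream = [total / c for c, total in zip(CONCURRENCY, TOK_S_64K)]

# Aggregate throughput relative to perfect scaling of the 1-request figure.
efficiency = [total / (c * TOK_S_64K[0]) for c, total in zip(CONCURRENCY, TOK_S_64K)]

for c, ps, eff in zip(CONCURRENCY, per_stream, efficiency):
    print(f"concurrency {c:>2}: {ps:5.1f} tok/s per request, {eff:.0%} of linear")
```

Aggregate throughput keeps rising up to 32 requests, while each individual request slows to roughly half its single-request speed, which is the trade-off behind the low-concurrency versus batch-serving rule above.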

Quality — measured vs BF16 base and an FP8 build (same huihui base)

Greedy side-by-side on EN / 繁體中文 / 日本語 / code / facts / reasoning traps:

Standard tasks: identical. Facts (Chernobyl: April 1986, reactor 4), Traditional-Chinese & Japanese explanations, 17×23−100 = 291, 60 km / 45 min = 80 km/h, code — NVFP4 = FP8 = BF16 base, no collapse, no drift.
Hard reasoning traps (7 tested): a small, real W4A16 tax. FP8 matched the BF16 base on every trap the base got right; NVFP4 slipped on ~1 of 7 (it answered a Barbara-type syllogism "Yes" where No is correct, plus one minor secondary-detail slip). One age-word-problem even the BF16 base fails — a model limit, not a quant artifact.

Verdict: half the size and faster than FP8, at standard-task parity. Choose FP8 for maximum reasoning fidelity; choose this NVFP4A16 for the best size/speed at ~85–90% reasoning parity — the right default for most local-agent and chat workloads.

Notes

Abliterated (uncensored). Use responsibly.
NVFP4 is Blackwell-specific; it will not run on Ampere/Hopper.
Multimodal vision/audio embedders kept in BF16.

Credits

Base model & abliteration: huihui-ai
Original model: Google DeepMind (Gemma 4)
Quantization & serving recipe: Lna-Lab · Tooling: llm-compressor / vLLM

Support the Base Model Author (huihui-ai)

If you find the abliterated base useful, please support huihui-ai:

Ko-fi: https://ko-fi.com/huihuiai
Bitcoin: bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
