Huihui-gemma-4-12B-it-abliterated-NVFP4A16
NVFP4 (W4A16) quantization of huihui-ai/Huihui-gemma-4-12B-it-abliterated — the abliterated (uncensored) Gemma 4 12B unified model (text + vision + audio).
24 GB → 7.7 GB. Runs on a single 16 GB Blackwell GPU, or shards across several for higher throughput. Up to 118 tok/s single-stream (TP=4 + MTP speculative decode) and ~1117 tok/s aggregate.
| Property | Value |
|---|---|
| Base | huihui-ai/Huihui-gemma-4-12B-it-abliterated (abliterated google/gemma-4-12B-it) |
| Architecture | Gemma4UnifiedForConditionalGeneration: 12B dense, 48 layers, 131K ctx |
| Quantization | NVFP4A16: weights FP4 (group 16, FP8 scales), activations BF16 |
| Format | compressed-tensors / nvfp4-pack-quantized (native vLLM) |
| Tool | llm-compressor |
| Size | 7.7 GB · Requires NVIDIA Blackwell (SM120) |
Weight-only FP4 (W4A16) keeps activations at BF16, so it is robust where full W4A4 NVFP4 collapses on this architecture.
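To make "weights FP4 (group 16, FP8 scales)" concrete, here is a minimal sketch of group-wise FP4 weight quantization. It assumes the usual E2M1 magnitude set for FP4, keeps the per-group scale in full precision rather than rounding it to FP8, and ignores any per-tensor scale. The function names are hypothetical; this is an illustration of the idea, not llm-compressor's implementation.

```python
# Magnitudes representable by a 4-bit E2M1 float (assumed FP4 format).
E2M1_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16  # one shared scale per 16 weights, as in the table above


def quantize_group(weights):
    """Map one group of weights to signed E2M1 levels plus a shared scale."""
    amax = max(abs(w) for w in weights)
    # Scale so the largest magnitude lands on the largest FP4 level (6.0).
    # Assumption: the scale stays a Python float; real NVFP4 stores it as FP8.
    scale = amax / E2M1_LEVELS[-1] if amax > 0 else 1.0
    codes = []
    for w in weights:
        target = abs(w) / scale
        level = min(E2M1_LEVELS, key=lambda q: abs(q - target))
        codes.append(level if w >= 0 else -level)
    return scale, codes


def dequantize_group(scale, codes):
    """Reconstruct approximate weights; activations are untouched (W4A16)."""
    return [scale * c for c in codes]


def bits_per_weight(group_size=GROUP_SIZE, weight_bits=4, scale_bits=8):
    """Storage cost per weight: 4-bit code plus its share of the group scale."""
    return weight_bits + scale_bits / group_size
```

Under these assumptions the cost is 4.5 bits per weight, so 12B quantized parameters would take about 6.75 GB. The gap to the published 7.7 GB is plausibly the tensors kept in BF16, although that breakdown is an estimate and not a figure from this card.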
Quickstart
Requires a Blackwell GPU (SM120 / RTX 50-series / GB10 / B100/B200), Docker with the NVIDIA runtime, and the hf CLI. Gemma 4 unified is brand new, so you need a vLLM nightly build: releases up to and including 0.22.1 lack the Gemma4Unified class.
# 1) Download this model (7.7 GB). For spec-decode, also grab the 0.4B MTP draft.
hf download sakamakismile/Huihui-gemma-4-12B-it-abliterated-NVFP4A16 --local-dir ./model
hf download google/gemma-4-12B-it-assistant --local-dir ./draft # optional, for spec-decode
# 2a) Simplest — single GPU, no speculative decode
docker run --rm --gpus '"device=0"' --ipc=host --shm-size 16gb -p 8000:8000 \
-v $PWD/model:/model:ro \
vllm/vllm-openai:nightly \
--model /model --served-model-name gemma4-12b --max-model-len 65536 \
--gpu-memory-utilization 0.92 --trust-remote-code
Multi-GPU — read this if your box has no NVLink
On consumer/entry Blackwell (e.g. RTX PRO 2000) over plain PCIe there is no working GPU P2P, and vLLM tensor-parallel hangs unless you disable both NCCL P2P and vLLM's custom all-reduce:
# NCCL_P2P_DISABLE=1: without this, startup hangs at NCCL init.
# --disable-custom-all-reduce: without this, the forward pass deadlocks.
# (Comments sit above the command: a comment after a trailing "\" breaks the line continuation.)
docker run --rm --gpus '"device=0,1,2,3"' --ipc=host --shm-size 16gb -p 8000:8000 \
-e NCCL_P2P_DISABLE=1 \
-v $PWD/model:/model:ro \
vllm/vllm-openai:nightly \
--model /model --served-model-name gemma4-12b \
--tensor-parallel-size 4 \
--disable-custom-all-reduce \
--max-model-len 65536 --gpu-memory-utilization 0.85 --trust-remote-code
Maximum interactive speed — TP=4 + MTP speculative decode
Google ships a 0.4B MTP draft (google/gemma-4-12B-it-assistant). It raises single-stream throughput by roughly 1.6–1.8× (see Benchmarks) and is lossless, because the target model verifies every drafted token. Use num_speculative_tokens: 3 (the stable optimum; k≥5 collapses acceptance) and --kv-cache-dtype fp8 (an NVFP4 KV cache collapses draft acceptance):
docker run --rm --gpus '"device=0,1,2,3"' --ipc=host --shm-size 16gb -p 8000:8000 \
-e NCCL_P2P_DISABLE=1 \
-v $PWD/model:/model:ro -v $PWD/draft:/draft:ro \
vllm/vllm-openai:nightly \
--model /model --served-model-name gemma4-12b \
--tensor-parallel-size 4 --disable-custom-all-reduce \
--kv-cache-dtype fp8 \
--speculative-config '{"method":"mtp","model":"/draft","num_speculative_tokens":3}' \
--max-model-len 65536 --gpu-memory-utilization 0.85 --trust-remote-code
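Why speculative decode is lossless can be shown with a toy greedy version of the draft-and-verify loop. This is a conceptual sketch only: the two "models" are plain Python callables, all names are hypothetical, and a real engine such as vLLM verifies all drafted positions in one batched forward pass and handles sampling, which this sketch does not.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(context: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int = 3) -> List[Token]:
    """Run one greedy speculative step and return the tokens to append."""
    # 1) The small draft model proposes k tokens autoregressively.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # 2) The target model checks each proposed position in order.
    accepted: List[Token] = []
    for token in proposal:
        expected = target_next(context + accepted)
        if token != expected:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3) Every draft token matched, so the target adds one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(context: List[Token], draft_next: NextToken,
             target_next: NextToken, max_new_tokens: int, k: int = 3) -> List[Token]:
    """Generate with speculative steps; output equals plain greedy decoding."""
    out = list(context)
    while len(out) - len(context) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(context) + max_new_tokens]
```

Every emitted token is the one the target would have chosen for that prefix, so a weak draft only costs speed, never output quality. Each step yields between 1 and k+1 tokens, which is why acceptance rate, and therefore the choice of k, drives the measured speed-up.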
Test it:
curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d \
'{"model":"gemma4-12b","messages":[{"role":"user","content":"Explain the CAP theorem in one sentence."}]}'
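The same request can be sent from Python with only the standard library. This is a minimal sketch that assumes the server from the commands above is listening on localhost:8000 and was started with --served-model-name gemma4-12b; the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, model="gemma4-12b", base_url="http://localhost:8000"):
    """Build the POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=120.0, **kwargs):
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Example call (needs a running server):
# print(ask("Explain the CAP theorem in one sentence."))
```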
Flag cheat-sheet
| Flag / env | When | Why |
|---|---|---|
| vllm/vllm-openai:nightly | always | only nightly registers Gemma4UnifiedForConditionalGeneration |
| --trust-remote-code | always | new arch |
| NCCL_P2P_DISABLE=1 (env) | TP > 1 on no-NVLink | else hangs at NCCL init |
| --disable-custom-all-reduce | TP > 1 on no-NVLink | else the forward deadlocks |
| --ipc=host --shm-size 16gb | TP > 1 (docker) | host-path NCCL needs shared memory |
| --speculative-config '{"method":"mtp",…,"num_speculative_tokens":3}' | interactive | ~1.6–1.7× single-stream |
| --kv-cache-dtype fp8 | with spec-decode | nvfp4 KV collapses draft acceptance |
| --max-num-seqs 4 (+ --gpu-memory-utilization 0.95) | single GPU, long ctx | frees KV room for up to -c 32768 on 16 GB |
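The cheat-sheet rules can also be encoded in a small launcher helper, which avoids forgetting a flag when switching between setups. The sketch below only assembles the flags already shown in this card; the function name, its parameters, and the choice of 0.92 memory utilization for every single-GPU case are assumptions.

```python
import json
import shlex
from pathlib import Path


def vllm_docker_args(gpus, nvlink=False, spec_decode=False,
                     model_dir="model", draft_dir="draft", max_model_len=65536):
    """Return the docker argv for serving on the given GPU indices."""
    tp = len(gpus)
    if tp < 1:
        raise ValueError("at least one GPU index is required")
    # Both workarounds apply only to tensor parallelism without working P2P.
    no_p2p = tp > 1 and not nvlink

    # Docker needs the literal double quotes around a comma-separated device list.
    device = '"device=' + ",".join(str(g) for g in gpus) + '"'
    args = ["docker", "run", "--rm", "--gpus", device,
            "--ipc=host", "--shm-size", "16gb", "-p", "8000:8000"]
    if no_p2p:
        args += ["-e", "NCCL_P2P_DISABLE=1"]
    args += ["-v", f"{Path(model_dir).resolve()}:/model:ro"]
    if spec_decode:
        args += ["-v", f"{Path(draft_dir).resolve()}:/draft:ro"]

    args += ["vllm/vllm-openai:nightly",
             "--model", "/model", "--served-model-name", "gemma4-12b"]
    if tp > 1:
        args += ["--tensor-parallel-size", str(tp)]
    if no_p2p:
        args += ["--disable-custom-all-reduce"]
    if spec_decode:
        spec = {"method": "mtp", "model": "/draft", "num_speculative_tokens": 3}
        args += ["--kv-cache-dtype", "fp8",
                 "--speculative-config", json.dumps(spec)]
    args += ["--max-model-len", str(max_model_len),
             "--gpu-memory-utilization", "0.92" if tp == 1 else "0.85",
             "--trust-remote-code"]
    return args


# Print a copy-pasteable command for the TP=4 + MTP setup.
print(shlex.join(vllm_docker_args([0, 1, 2, 3], spec_decode=True)))
```

Passing the list to subprocess.run instead of printing it sidesteps shell quoting of the JSON and device arguments entirely.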
Benchmarks
Measured on 4× RTX PRO 2000 Blackwell (16 GB, SM120, 288 GB/s, PCIe — no NVLink), TP=4, -c 65536.
Single-stream decode (interactive) — TP sweep, 1 request × 512 tok:
| TP | GPUs | no spec (tok/s) | + MTP, k=3 (tok/s) | MTP gain |
|---|---|---|---|---|
| 1 | 1 | 30.5 | 55.0 | 1.80× |
| 2 | 2 | 53.2 | 94.8 | 1.78× |
| 4 | 4 | 73.3 | 118.5 | 1.62× |
(TP=4 + MTP peaks at 121.0 with k=4, but k=3 is the stable optimum.) MTP gives a steady ~1.6–1.8× at every TP. TP scaling is sub-linear on this no-NVLink box (host-memory all-reduce). Pick by what you have:
| goal | config | single-stream (tok/s) | GPUs freed |
|---|---|---|---|
| low-power, 1-GPU resident | TP=1 + MTP | 55 | 5 |
| balanced | TP=2 + MTP | 95 | 4 |
| fastest interactive | TP=4 + MTP | 118 | 2 |
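The MTP gain column and the sub-linear TP scaling can be recomputed directly from the TP sweep numbers. This sketch uses only the figures from the sweep table; "efficiency" here is defined as measured throughput divided by TP times the single-GPU throughput, which is an assumed definition, not one stated in this card.

```python
# TP -> (tok/s without spec-decode, tok/s with MTP k=3), from the TP sweep table.
SWEEP = {1: (30.5, 55.0), 2: (53.2, 94.8), 4: (73.3, 118.5)}


def mtp_gain(tp):
    """Speed-up from MTP speculative decode at a given TP size."""
    base, with_mtp = SWEEP[tp]
    return with_mtp / base


def tp_efficiency(tp, with_mtp=False):
    """Fraction of ideal linear scaling achieved relative to TP=1."""
    column = 1 if with_mtp else 0
    return SWEEP[tp][column] / (tp * SWEEP[1][column])


for tp in SWEEP:
    print(f"TP={tp}: MTP gain {mtp_gain(tp):.2f}x, "
          f"scaling efficiency {tp_efficiency(tp):.0%} (no spec)")
```

Efficiency falling as TP grows is the numeric form of the "sub-linear on this no-NVLink box" remark: each extra GPU adds less than a full GPU's worth of single-stream speed.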
Aggregate throughput (concurrency sweep, no spec-decode):
| concurrency | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| tok/s (-c65536) | 73 | 145 | 274 | 487 | 796 | 1117 |
| tok/s (-c131072) | 74 | 145 | 275 | 498 | 792 | 1100 |
64K and 128K context decode identically (sliding-window KV). Rule: MTP spec-decode for low concurrency (≤8); turn it off for high-concurrency batch serving (it costs throughput once the batch saturates).
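The trade-off behind that rule is easier to see as per-request speed. The sketch below divides the 64K-context aggregate figures from the table by the concurrency level and encodes the ≤8 threshold as a helper; the helper name and the idea of applying the threshold programmatically are assumptions.

```python
# Concurrency -> aggregate tok/s at 64K context, from the table above.
AGGREGATE_64K = {1: 73, 2: 145, 4: 274, 8: 487, 16: 796, 32: 1117}


def per_request_tok_s(concurrency):
    """Average decode speed each request sees at a given concurrency."""
    return AGGREGATE_64K[concurrency] / concurrency


def use_spec_decode(concurrency, threshold=8):
    """Apply the card's rule: MTP spec-decode only at low concurrency."""
    return concurrency <= threshold


for c in AGGREGATE_64K:
    mode = "MTP on" if use_spec_decode(c) else "MTP off"
    print(f"concurrency {c:>2}: {per_request_tok_s(c):5.1f} tok/s per request, {mode}")
```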
Quality — measured vs BF16 base and an FP8 build (same huihui base)
Greedy side-by-side on EN / 繁體中文 / 日本語 / code / facts / reasoning traps:
- Standard tasks: identical. Facts (Chernobyl: April 1986, reactor 4), Traditional-Chinese & Japanese explanations, 17×23−100 = 291, 60 km / 45 min = 80 km/h, and code: NVFP4 = FP8 = BF16 base, no collapse, no drift.
- Hard reasoning traps (7 tested): a small, real W4A16 tax. FP8 matched the BF16 base on every trap the base got right; NVFP4 slipped on ~1 of 7 (it answered a Barbara-type syllogism "Yes" where "No" is correct, plus one minor secondary-detail slip). One age word problem fails even on the BF16 base, which is a model limit, not a quantization artifact.
Verdict: half the size and faster than FP8, at standard-task parity. Choose FP8 for maximum reasoning fidelity; choose this NVFP4A16 for the best size/speed at ~85–90% reasoning parity — the right default for most local-agent and chat workloads.
Notes
- Abliterated (uncensored). Use responsibly.
- NVFP4 is Blackwell-specific; it will not run on Ampere/Hopper.
- Multimodal vision/audio embedders kept in BF16.
Credits
- Base model & abliteration: huihui-ai
- Original model: Google DeepMind (Gemma 4)
- Quantization & serving recipe: Lna-Lab · Tooling: llm-compressor / vLLM
Support the Base Model Author (huihui-ai)
If you find the abliterated base useful, please support huihui-ai:
- Ko-fi: https://ko-fi.com/huihuiai
- Bitcoin: bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge