Libraries

How to use YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

# The messages below contain an image, so use the image-text-to-text task
pipe = pipeline("image-text-to-text", model="YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4")
model = AutoModelForMultimodalLM.from_pretrained("YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4

SGLang

How to use YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4

English

Quantized on 2026-04-21 using llm-compressor with mixed-domain calibration and sensitive-layer protection for maximum accuracy recovery.

Native W4A4 on DGX Spark (SM121) — confirmed working

Unlike earlier NVFP4 models on SM121, which fell back to W4A16, this checkpoint runs true W4A4 through the FlashInfer CUTLASS NVFP4 MoE kernel. The vLLM logs confirm this with FlashInferCutlassNvFp4LinearKernel and the FLASHINFER_CUTLASS NvFp4 MoE backend. Requirements:

vLLM 0.19.1rc1.dev374+g1174723eb or later (includes PR #37725 arch-suffix fix)

FlashInfer ≥ 0.6.8 with SM120f compilation (PR #2650)

CUDA ≥ 12.9

MTP speculative decoding supported — MTP layers preserved in BF16 from the original checkpoint via save_mtp_tensors_to_checkpoint.

Abliteration changes the optimal speculative decoding setup — this is a known trade-off, not a defect.

This release's distinguishing feature is mixed-domain calibration (ultrachat_200k chat + Nemotron-Post-Training-Dataset-v2 reasoning, 256 samples total). The calibration recovers quantization accuracy, but it cannot undo the distribution shift introduced upstream by abliteration itself — the DFlash drafter was trained on the original Qwen3.6-35B-A3B weights, and the abliterated residual distribution no longer matches the drafter's prior, so acceptance rate drops.

Measured throughput on DGX Spark:

DFlash (num_speculative_tokens: 15) — ~50 t/s sustained, occasional bursts up to ~100 t/s

MTP (num_speculative_tokens: 1) — ~40 t/s sustained, occasional bursts up to ~70 t/s

DFlash posts the higher raw throughput in the measurements above, yet MTP with a single speculative token is the recommended default on this abliterated variant: MTP reuses the model's own hidden state, so it stays aligned with the abliterated distribution that the mixed-domain calibration was tuned against, whereas the DFlash drafter's acceptance rate suffers from the distribution shift. Prefer --speculative-config '{"method":"mtp","num_speculative_tokens":1}' as the default; fall back to DFlash only if you specifically need it.
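To see why a lower acceptance rate hurts a long draft more than a short one, the sketch below uses the standard speculative-decoding estimate of tokens emitted per target-model step. It assumes every drafted token is accepted independently with the same probability, which is a simplification, and the acceptance rates in the loop are illustrative values, not measurements from this model.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Assumes each drafted token is accepted independently with probability
    `acceptance_rate`; real acceptance depends on position and prompt.
    """
    a, k = acceptance_rate, num_speculative_tokens
    if not 0.0 <= a <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if a == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Geometric series 1 + a + a^2 + ... + a^k
    return (1.0 - a ** (k + 1)) / (1.0 - a)


# Illustrative acceptance rates only (not measured on this checkpoint).
for rate in (0.9, 0.7, 0.5):
    long_draft = expected_tokens_per_step(rate, 15)   # DFlash-style setting
    short_draft = expected_tokens_per_step(rate, 1)   # MTP-style setting
    print(f"acceptance={rate:.1f}  k=15: {long_draft:.2f}  k=1: {short_draft:.2f}")
```

The estimate ignores the cost of running the drafter itself, so it bounds the benefit rather than predicting throughput.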

NVFP4 W4A4 quantization of huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121) with the FlashInfer CUTLASS FP4 MoE kernel.

Model Details

Item	Value
Architecture	Hybrid MoE (35B total, 3B active; 256 experts, 8 routed + 1 shared) with GDN (Gated DeltaNet, a Mamba-style linear-attention layer) and full-attention layers
Base model	Qwen/Qwen3.6-35B-A3B
Fine-tuned by	huihui-ai (abliteration)
Quantized by	YuYu1015
Model size	~25.1 GB (NVFP4, vs ~71.9 GB BF16 original)
Context length	Up to 262,144 tokens
Thinking mode	Supported (`enable_thinking: true/false`)
Tool calling	Supported (`qwen3_xml` parser)
MTP	Built-in MTP weights included (preserved in BF16)
DFlash	Compatible with z-lab/Qwen3.6-35B-A3B-DFlash
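As a rough sanity check on the size figures in this table, the sketch below estimates how much of the checkpoint is stored in NVFP4. It uses the scheme listed under Quantization Details (4-bit values plus one FP8 scale per group of 16) and assumes file size is dominated by weights; per-tensor global scales and metadata are ignored, and the table sizes are rounded, so treat the result as approximate.

```python
BF16_SIZE_GB = 71.9   # BF16 original, from the table above
NVFP4_SIZE_GB = 25.1  # this checkpoint, from the table above


def nvfp4_bits_per_weight(weight_bits: int = 4, scale_bits: int = 8, group_size: int = 16) -> float:
    """Storage cost per weight: the 4-bit value plus its share of the group scale."""
    return weight_bits + scale_bits / group_size


def quantized_fraction(bf16_gb: float, nvfp4_gb: float, bits_per_weight: float) -> float:
    """Fraction q of the BF16 bytes that were quantized.

    Solves nvfp4_gb = bf16_gb * (q * r + (1 - q)) for q, where
    r = bits_per_weight / 16 is the per-weight size ratio versus BF16.
    """
    r = bits_per_weight / 16.0
    return (1.0 - nvfp4_gb / bf16_gb) / (1.0 - r)


bits = nvfp4_bits_per_weight()
all_quantized_gb = BF16_SIZE_GB * bits / 16.0
print(f"bits per quantized weight: {bits}")
print(f"size if every weight were NVFP4: {all_quantized_gb:.1f} GB")
print(f"estimated quantized fraction: {quantized_fraction(BF16_SIZE_GB, NVFP4_SIZE_GB, bits):.1%}")
```

The gap between the all-NVFP4 estimate and the actual size is consistent with some layers staying in BF16.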

Quantization Details

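Of five candidate strategies (A to E), this model applies three (A, C, D) on top of the official RedHatAI flow. B and E were dropped because of the incompatibilities described after the table: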

Strategy	Description
A. RedHatAI official baseline	`Qwen3_5MoeForConditionalGeneration` + `save_mtp_tensors_to_checkpoint` (solves OOM on Qwen3.6, preserves MTP)
C. Mixed-domain calibration	`ultrachat_200k` (128 chat) + `Nemotron-Post-Training-Dataset-v2` (128 reasoning) = 256 total
D. Sweet-spot hyperparameters	`num_calibration_samples=256`, `max_seq_length=4096` (quality > quantity)

B (last-layer protection) is incompatible with vLLM's fused MoE path: CompressedTensorsMoEMethod requires all projections within a MoE block (gate/up/down × 256 experts + shared_expert) to share the same quantization scheme. Ignoring only some of them triggers ValueError: All MoE projections need to have same quantization scheme but found multiple.
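The constraint can be pictured as a uniformity check over one MoE block. The sketch below is illustrative only and is not vLLM's implementation: the function name, the module names, and the dictionary representation are hypothetical, and only the error message is taken from the text above.

```python
from typing import Dict, Optional


def uniform_moe_scheme(schemes: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the single scheme shared by all projections in a MoE block.

    `schemes` maps a projection name to its scheme name, with None meaning
    the projection was left unquantized (for example via an ignore pattern).
    """
    distinct = set(schemes.values())
    if len(distinct) > 1:
        raise ValueError(
            "All MoE projections need to have same quantization scheme "
            "but found multiple"
        )
    return distinct.pop() if distinct else None


# Hypothetical projection names for a single expert.
block = {f"experts.0.{proj}": "NVFP4" for proj in ("gate_proj", "up_proj", "down_proj")}
print(uniform_moe_scheme(block))

# Protecting one projection of the block breaks uniformity.
mixed = dict(block)
mixed["experts.0.down_proj"] = None
try:
    uniform_moe_scheme(mixed)
except ValueError as err:
    print(f"ValueError: {err}")
```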

E (SpinQuant R1+R2) is incompatible with the multimodal config: llm-compressor's get_head_dim reads only the top-level config, not Qwen3.6's nested text_config.
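The failure mode is easy to reproduce with stand-in objects. The sketch below shows a lookup that descends into a nested text_config before reading the head dimension. `resolve_head_dim` is a hypothetical helper, not llm-compressor code, and the config values are placeholders rather than Qwen3.6's real dimensions.

```python
from types import SimpleNamespace


def resolve_head_dim(config) -> int:
    """Read head_dim from a config, descending into `text_config` if present."""
    text_config = getattr(config, "text_config", None)
    if text_config is None:
        text_config = config  # plain text-only config
    head_dim = getattr(text_config, "head_dim", None)
    if head_dim is not None:
        return head_dim
    # Fall back to the usual derivation when head_dim is not stored explicitly.
    return text_config.hidden_size // text_config.num_attention_heads


# Placeholder values for illustration only.
multimodal = SimpleNamespace(
    text_config=SimpleNamespace(hidden_size=2048, num_attention_heads=16),
    vision_config=SimpleNamespace(hidden_size=1024),
)
text_only = SimpleNamespace(head_dim=256, hidden_size=2048, num_attention_heads=16)

print(resolve_head_dim(multimodal))
print(resolve_head_dim(text_only))
```

A lookup that only inspects the top level would fail on `multimodal` because neither `head_dim` nor `hidden_size` exists there.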

Item	Value
Method	llm-compressor (main) + compressed-tensors (main)
Scheme	NVFP4 W4A4 (E2M1 + FP8 per-group scaling, group size 16)
Format	compressed-tensors
Calibration datasets	HuggingFaceH4/ultrachat_200k (128) + nvidia/Nemotron-Post-Training-Dataset-v2 (128)
Calibration samples (total)	256
Calibration sequence length	4096
MoE calibration	`moe_calibrate_all_experts=True` (via PR #2383)
Hardware	NVIDIA DGX Spark (GB10, 128GB unified memory)
Environment	`transformers>=5.0,<6` + `llm-compressor` main + `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
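For readers unfamiliar with the scheme row above, the sketch below fake-quantizes one group of 16 values onto the E2M1 grid. It is a simplified illustration, not llm-compressor's kernel: the group scale is kept as a Python float instead of being rounded to FP8, there is no per-tensor global scale, and ties go to the smaller magnitude.

```python
import math
from typing import List, Sequence, Tuple

# Magnitudes representable by FP4 E2M1 (the sign is a separate bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16


def quantize_group(values: Sequence[float]) -> Tuple[float, List[float]]:
    """Return (scale, dequantized values) for one group of GROUP_SIZE weights."""
    if len(values) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} values, got {len(values)}")
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 1.0, [0.0] * GROUP_SIZE
    # Map the largest magnitude in the group onto the largest grid point.
    scale = max_abs / E2M1_MAGNITUDES[-1]
    dequantized = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(target - m))
        dequantized.append(math.copysign(nearest * scale, v))
    return scale, dequantized


group = [6.0, 2.4, 5.2, -0.2] + [0.0] * 12
scale, restored = quantize_group(group)
print("scale:", scale)
print("first four values after quantization:", restored[:4])
```

The coarse spacing near the top of the grid (4 to 6) is one reason outlier-heavy layers are kept in BF16.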

Layers Preserved in BF16

Layer pattern	Reason
`re:.*lm_head`	Output head, sensitive to quantization noise
`re:.*embed_tokens$`	Input embeddings
`re:visual.*` / `re:model.visual.*`	Vision encoder
`re:.*mlp.gate$`	MoE router gate (routing decision, must stay BF16)
`re:.*shared_expert_gate$`	Shared expert routing gate
`re:.*linear_attn.*`	GDN/DeltaNet (Mamba-style) layers; may output zeros if quantized
`mtp.*` (all MTP weights)	Reattached in BF16 via `save_mtp_tensors_to_checkpoint` after quantization
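A quick way to check which modules such patterns protect is to test them against module names. The sketch below assumes that `re:` entries are regular expressions matched from the start of the module name (as with `re.match`) and that other entries must match exactly. The module names are hypothetical, and the `visual` and `linear_attn` patterns are written with `.*` wildcards on the assumption that this is their intended form.

```python
import re
from typing import Iterable

IGNORE_PATTERNS = [
    "re:.*lm_head",
    "re:.*embed_tokens$",
    "re:visual.*",
    "re:model.visual.*",
    "re:.*mlp.gate$",
    "re:.*shared_expert_gate$",
    "re:.*linear_attn.*",  # assumed wildcard form
]


def is_ignored(module_name: str, patterns: Iterable[str] = IGNORE_PATTERNS) -> bool:
    """True if the module would stay in BF16 under the given ignore list."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names for illustration.
for name in (
    "lm_head",
    "model.language_model.layers.3.mlp.gate",
    "model.language_model.layers.3.mlp.experts.7.down_proj",
    "model.language_model.layers.0.linear_attn.in_proj_qkvz",
):
    print(f"{name}: {'BF16' if is_ignored(name) else 'NVFP4'}")
```

Note that the router pattern ends in `$`, so it protects `mlp.gate` without also matching the experts' `gate_proj` weights.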

Speculative Decoding

This model supports two speculative decoding methods:

DFlash (separate drafter, recommended for single-user / low-concurrency):

--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'

MTP (built-in weights, recommended default for this abliterated variant; see the note on abliteration and speculative decoding near the top):

--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

Note: On hybrid GDN architectures, MTP may hit a state-rollback bug (vLLM #39273) when the token-rejection rate is high, and output quality may degrade. Setting num_speculative_tokens to 1 also reduces exposure to this issue.
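When launching vLLM from a script, hand-writing the JSON argument is easy to get wrong. The sketch below builds the two flags shown above from Python values. `speculative_config_arg` is a hypothetical helper name; the keys and values come from this section.

```python
import json
import shlex
from typing import Optional


def speculative_config_arg(method: str, num_speculative_tokens: int, model: Optional[str] = None) -> str:
    """Build a shell-quoted --speculative-config flag."""
    config = {"method": method, "num_speculative_tokens": num_speculative_tokens}
    if model is not None:
        config["model"] = model  # only needed for a separate drafter
    payload = json.dumps(config, separators=(",", ":"))
    return "--speculative-config " + shlex.quote(payload)


mtp_flag = speculative_config_arg("mtp", 1)
dflash_flag = speculative_config_arg("dflash", 15, model="z-lab/Qwen3.6-35B-A3B-DFlash")
print(mtp_flag)
print(dflash_flag)
```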

Serving with vLLM

vllm serve /path/to/model \
    --quantization compressed-tensors \
    --served-model-name qwen3.6-35b-a3b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --attention-backend flash_attn \
    --kv-cache-dtype auto \
    --gpu-memory-utilization 0.80 \
    --max-model-len 65536 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 64 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --performance-mode throughput \
    --speculative-config '{"method":"dflash","model":"/models/Qwen3.6-35B-A3B-DFlash","num_speculative_tokens":15}' \
    --trust-remote-code \
    --language-model-only
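Once the server is up, it can be queried through the OpenAI-compatible endpoint. The sketch below uses only the standard library and assumes the server listens on vLLM's default port 8000 and was started with the --served-model-name shown above. Thinking mode is toggled through `chat_template_kwargs`, which vLLM accepts as an extra request field; the `max_tokens` default is an arbitrary choice.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port
MODEL_NAME = "qwen3.6-35b-a3b"         # matches --served-model-name above


def build_chat_payload(prompt: str, enable_thinking: bool = True, max_tokens: int = 256) -> dict:
    """Build the JSON body for a /chat/completions request."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def chat(prompt: str, enable_thinking: bool = True) -> str:
    """Send one prompt to the running server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt, enable_thinking)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

Calling `chat("What is the capital of France?", enable_thinking=False)` requires the server from the command above to be running.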

DGX Spark (SM121) Compatibility Notes

Native W4A4 confirmed via FlashInfer CUTLASS NVFP4 MoE backend (no more W4A16 fallback)
Verify in logs: Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM + Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend
FP8 KV cache is not compatible with GDN non-causal attention; use --kv-cache-dtype auto
DFlash requires --attention-backend flash_attn (flashinfer backend + DFlash is incompatible)
--language-model-only skips vision encoder profiling, which speeds up startup for text-only inference
Clear page cache before starting on UMA: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
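The log check in the second note can be automated. The sketch below scans captured vLLM log text for the two lines quoted above and reports any that are missing. The function name and the sample log are hypothetical.

```python
from typing import List

EXPECTED_LOG_MARKERS = (
    "Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM",
    "Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend",
)


def missing_w4a4_markers(log_text: str) -> List[str]:
    """Return the expected log lines that do not appear in `log_text`."""
    return [marker for marker in EXPECTED_LOG_MARKERS if marker not in log_text]


# Hypothetical log excerpt for illustration.
sample_log = "\n".join(
    [
        "INFO Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM",
        "INFO Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend",
    ]
)
missing = missing_w4a4_markers(sample_log)
print("native W4A4 confirmed" if not missing else f"missing markers: {missing}")
```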

Known Limitations

Per-tensor global scales for fused q_proj / k_proj / v_proj may differ, causing a vLLM warning at load time. This is inherent to the llm-compressor per-layer quantization behavior; the impact on accuracy is typically small but measurable on strict tool-calling JSON schemas.
DFlash drafter was trained on the original Qwen3.6-35B-A3B, not the abliterated variant — acceptance rate may be lower than on the original model.

Safety Warning

This model has safety filtering removed (abliterated) and may generate sensitive, controversial, or inappropriate content. Users are solely responsible for all consequences arising from its use. Please ensure usage complies with local laws and ethical standards. Not suitable for public-facing or production applications.

Credits

Original Model: Qwen/Qwen3.6-35B-A3B by Alibaba Qwen Team
Abliteration: huihui-ai
NVFP4 Quantization: YuYu1015 on NVIDIA DGX Spark (GB10)
Quantization Tool: llm-compressor by vLLM Project
Official Reference: RedHatAI/Qwen3.6-35B-A3B-NVFP4
Sensitivity Analysis: Diagnosing FP4 Inference (arXiv 2603.08747)
