Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4
English
Quantized on 2026-04-21 using llm-compressor with mixed-domain calibration and sensitive-layer protection for maximum accuracy recovery.
Native W4A4 on DGX Spark (SM121) — confirmed working
Unlike earlier NVFP4 models on SM121, this checkpoint runs true W4A4 via the FlashInfer CUTLASS NVFP4 MoE kernel (verified in vLLM logs: `FlashInferCutlassNvFp4LinearKernel` + `FLASHINFER_CUTLASS` NvFp4 MoE backend). Requires:
- vLLM `0.19.1rc1.dev374+g1174723eb` or later (includes PR #37725 arch-suffix fix)
- FlashInfer ≥ 0.6.8 with SM120f compilation (PR #2650)
- CUDA ≥ 12.9
MTP speculative decoding supported: MTP layers are preserved in BF16 from the original checkpoint via `save_mtp_tensors_to_checkpoint`.
Abliteration changes the optimal speculative decoding setup — this is a known trade-off, not a defect.
This release's distinguishing feature is mixed-domain calibration (`ultrachat_200k` chat + `Nemotron-Post-Training-Dataset-v2` reasoning, 256 samples total). The calibration recovers quantization accuracy, but it cannot undo the distribution shift introduced upstream by abliteration itself: the DFlash drafter was trained on the original Qwen3.6-35B-A3B weights, and the abliterated residual distribution no longer matches the drafter's prior, so the acceptance rate drops.
Measured throughput on DGX Spark:
- DFlash (`num_speculative_tokens: 15`): ~50 t/s sustained, occasional bursts up to ~100 t/s
- MTP (`num_speculative_tokens: 1`): ~40 t/s sustained, occasional bursts up to ~70 t/s
Counter-intuitively, MTP with a single speculative token outperforms DFlash on this abliterated variant: MTP reuses the model's own hidden state, so it stays aligned with the abliterated distribution that the mixed-domain calibration was tuned against. Prefer `--speculative-config '{"method":"mtp","num_speculative_tokens":1}'` as the default; only fall back to DFlash if you specifically need it.
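The acceptance-rate argument can be made concrete with a back-of-the-envelope model. The sketch below assumes every drafted token is accepted independently with the same probability `alpha`, which is a textbook simplification and not something measured for this checkpoint. The `alpha` values in the loop are illustrative only, and the cost of running the drafter is ignored.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step of the target model.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that the verifier always contributes one extra
    token (the correction or the bonus token). The result is the
    geometric series 1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return sum(alpha ** i for i in range(k + 1))


if __name__ == "__main__":
    # Illustrative acceptance rates, not measurements.
    for alpha in (0.3, 0.6, 0.9):
        short = expected_tokens_per_step(alpha, 1)
        long = expected_tokens_per_step(alpha, 15)
        print(f"alpha={alpha:.1f}  k=1: {short:.2f}  k=15: {long:.2f}")
```

Under this model a long draft only pays off when `alpha` is high. Once the acceptance rate falls, most of the 15 drafted tokens are discarded, while a single-token draft loses comparatively little.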
NVFP4 W4A4 quantization of huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121) with the FlashInfer CUTLASS FP4 MoE kernel.
Model Details
| Item | Value |
|---|---|
| Architecture | MoE (35B total, 3B active, 256 experts / 8 routed + 1 shared) + GDN (Mamba) + Attention |
| Base model | Qwen/Qwen3.6-35B-A3B |
| Fine-tuned by | huihui-ai (abliteration) |
| Quantized by | YuYu1015 |
| Model size | ~25.1 GB (NVFP4, vs ~71.9 GB BF16 original) |
| Context length | Up to 262,144 tokens |
| Thinking mode | Supported (enable_thinking: true/false) |
| Tool calling | Supported (qwen3_xml parser) |
| MTP | Built-in MTP weights included (preserved in BF16) |
| DFlash | Compatible with z-lab/Qwen3.6-35B-A3B-DFlash |
Quantization Details
This model uses a three-strategy stack (ACD) on top of the RedHatAI official flow:
| Strategy | Description |
|---|---|
| A. RedHatAI official baseline | Qwen3_5MoeForConditionalGeneration + save_mtp_tensors_to_checkpoint (solves OOM on Qwen3.6, preserves MTP) |
| C. Mixed-domain calibration | ultrachat_200k (128 chat) + Nemotron-Post-Training-Dataset-v2 (128 reasoning) = 256 total |
| D. Sweet-spot hyperparameters | num_calibration_samples=256, max_seq_length=4096 (quality > quantity) |
B (last-layer protection) is incompatible with vLLM fused MoE: vLLM's `CompressedTensorsMoEMethod` requires all projections within a MoE block (gate/up/down × 256 experts + shared_expert) to share the same quantization scheme. A partial ignore triggers `ValueError: All MoE projections need to have same quantization scheme but found multiple`.
E (SpinQuant R1+R2) is incompatible with the multi-modal config: llm-compressor's `get_head_dim` only reads the top-level config, not Qwen3.6's nested `text_config`.
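The constraint behind the rejected strategy B can be illustrated with a small stand-alone check. This is a sketch of the rule as described here, not vLLM's actual code: the function name, the scheme labels, and the projection names are all hypothetical.

```python
def check_uniform_moe_scheme(projection_schemes):
    """Raise if the projections of one MoE block mix quantization schemes.

    projection_schemes maps a projection name to a scheme label, with None
    standing for "left unquantized" (for example through an ignore entry).
    Returns the single shared scheme when the block is uniform.
    """
    schemes = set(projection_schemes.values())
    if len(schemes) > 1:
        raise ValueError(
            "All MoE projections need to have same quantization scheme "
            "but found multiple"
        )
    return next(iter(schemes)) if schemes else None


# Hypothetical block where every projection is quantized the same way.
uniform_block = {
    "experts.0.gate_proj": "nvfp4",
    "experts.0.up_proj": "nvfp4",
    "experts.0.down_proj": "nvfp4",
}
print(check_uniform_moe_scheme(uniform_block))

# Hypothetical partial ignore: one projection kept in BF16 (None).
partial_block = dict(uniform_block, **{"experts.0.down_proj": None})
try:
    check_uniform_moe_scheme(partial_block)
except ValueError as error:
    print("rejected:", error)
```

A whole block can still be excluded from quantization, since that leaves a single scheme. The failure only arises when some projections in the block are ignored and others are not.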
| Item | Value |
|---|---|
| Method | llm-compressor (main) + compressed-tensors (main) |
| Scheme | NVFP4 W4A4 (E2M1 + FP8 per-group scaling, group size 16) |
| Format | compressed-tensors |
| Calibration datasets | HuggingFaceH4/ultrachat_200k (128) + nvidia/Nemotron-Post-Training-Dataset-v2 (128) |
| Calibration samples (total) | 256 |
| Calibration sequence length | 4096 |
| MoE calibration | moe_calibrate_all_experts=True (via PR #2383) |
| Hardware | NVIDIA DGX Spark (GB10, 128GB unified memory) |
| Environment | transformers>=5.0,<6 + llm-compressor main + PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True |
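To show what the "E2M1 + per-group scaling, group size 16" scheme means numerically, the sketch below fake-quantizes one group of 16 values in plain Python. It is a simplified illustration rather than llm-compressor's implementation. The scale is kept as an ordinary float instead of being rounded to FP8, the per-tensor global scale is omitted, and ties go to the smaller magnitude, which may differ from hardware rounding.

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16  # group size stated in the table above


def quantize_group(values):
    """Fake-quantize one group; returns (scale, dequantized values).

    Simplification: the scale stays a Python float. The real format stores
    it in FP8 and also applies a per-tensor global scale.
    """
    if len(values) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the group onto the largest grid value.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    dequantized = []
    for v in values:
        magnitude = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        dequantized.append(math.copysign(magnitude * scale, v))
    return scale, dequantized


def bits_per_weight(element_bits=4, scale_bits=8, group_size=GROUP_SIZE):
    """Average storage per quantized weight, ignoring the global scale."""
    return element_bits + scale_bits / group_size


if __name__ == "__main__":
    group = [0.1 * (i - 8) for i in range(GROUP_SIZE)]
    scale, restored = quantize_group(group)
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print(f"scale={scale:.4f}  max abs error={worst:.4f}")
    print(f"bits per weight={bits_per_weight()}")
```

With one 8-bit scale per 16 four-bit elements, the quantized layers cost about 4.5 bits per weight. The reported ~25.1 GB checkpoint is larger than that figure alone would suggest, which is consistent with several layer families staying in BF16.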
Layers Preserved in BF16
| Layer pattern | Reason |
|---|---|
| `re:.*lm_head` | Output head, sensitive to quantization noise |
| `re:.*embed_tokens$` | Input embeddings |
| `re:visual.*` / `re:model.visual.*` | Vision encoder |
| `re:.*mlp.gate$` | MoE router gate (routing decision, must stay BF16) |
| `re:.*shared_expert_gate$` | Shared expert routing gate |
| `re:.*linear_attn.*` | GDN/DeltaNet (Mamba) layers; may output zeros if quantized |
| `mtp.*` (all MTP weights) | Reattached in BF16 via save_mtp_tensors_to_checkpoint after quantization |
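The ignore patterns listed here can be checked against module names with a few lines of Python. The sketch assumes that `re:` entries are applied with `re.match` (anchored at the start of the name) and that other entries are compared for equality. Treat that as an assumption about the matching rule, not a statement of the library's exact behavior. The module names are hypothetical and only loosely modeled on this architecture.

```python
import re

# Ignore entries copied from the table (the "re:" prefix marks a regex).
IGNORE_PATTERNS = [
    "re:.*lm_head",
    "re:.*embed_tokens$",
    "re:visual.*",
    "re:model.visual.*",
    "re:.*mlp.gate$",
    "re:.*shared_expert_gate$",
    "re:.*linear_attn.*",
]


def is_ignored(module_name, patterns=IGNORE_PATTERNS):
    """Return True if module_name matches any ignore entry.

    Assumption: "re:" entries use re.match, so they must match from the
    first character; plain entries must equal the name exactly.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
EXAMPLES = [
    "lm_head",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.shared_expert_gate",
    "model.language_model.layers.0.mlp.experts.0.gate_proj",
    "model.language_model.layers.0.linear_attn.in_proj_qkv",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.visual.blocks.0.attn.qkv",
]

if __name__ == "__main__":
    for name in EXAMPLES:
        status = "BF16 (ignored)" if is_ignored(name) else "quantized"
        print(f"{name}: {status}")
```

Under the start-anchored assumption, `re:visual.*` alone would miss names that begin with `model.`, which is one plausible reason both vision patterns appear. The `$` on `re:.*mlp.gate$` keeps the router gate in BF16 without also catching the experts' `gate_proj` weights.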
Speculative Decoding
This model supports two speculative decoding methods:
DFlash (separate drafter, recommended for single-user / low-concurrency):
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'
MTP (built-in weights, recommended default for this abliterated variant — see warning at top):
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
Note: On hybrid GDN architectures, MTP may hit a state-rollback bug (vLLM #39273) when token-rejection rates are high. Setting `num_speculative_tokens: 1` also reduces exposure to this issue.
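When launch commands are generated from a script, the two `--speculative-config` values can be built rather than hand-typed, which avoids quoting mistakes. The helper below is a small sketch: its name and its validation rule (DFlash needs a drafter path, MTP does not) are assumptions drawn from the two examples in this section, not part of vLLM.

```python
import json
import shlex


def speculative_config_arg(method, num_speculative_tokens, drafter=None):
    """Return a shell-quoted JSON value for --speculative-config.

    Hypothetical helper: "dflash" requires a drafter model path or repo id,
    while "mtp" uses the weights built into the checkpoint.
    """
    if method == "dflash" and not drafter:
        raise ValueError("dflash needs a drafter model")
    config = {"method": method}
    if drafter:
        config["model"] = drafter
    config["num_speculative_tokens"] = num_speculative_tokens
    # shlex.quote wraps the JSON so the shell passes it as one argument.
    return shlex.quote(json.dumps(config))


if __name__ == "__main__":
    print("--speculative-config", speculative_config_arg("mtp", 1))
    print(
        "--speculative-config",
        speculative_config_arg("dflash", 15, "z-lab/Qwen3.6-35B-A3B-DFlash"),
    )
```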
Serving with vLLM
vllm serve /path/to/model \
--quantization compressed-tensors \
--served-model-name qwen3.6-35b-a3b \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--attention-backend flash_attn \
--kv-cache-dtype auto \
--gpu-memory-utilization 0.80 \
--max-model-len 65536 \
--max-num-batched-tokens 16384 \
--max-num-seqs 64 \
--enable-prefix-caching \
--enable-chunked-prefill \
--performance-mode throughput \
--speculative-config '{"method":"dflash","model":"/models/Qwen3.6-35B-A3B-DFlash","num_speculative_tokens":15}' \
--trust-remote-code \
--language-model-only
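Once the server is running, it can be queried through its OpenAI-compatible endpoint. The sketch below uses only the Python standard library. It assumes the server listens on vLLM's default port 8000 on localhost and uses the served model name from the command above. Thinking mode is toggled through `chat_template_kwargs`, and the prompt and `max_tokens` value are placeholders.

```python
import json
import urllib.request

# Assumptions: default vLLM port on localhost, served name from the command.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_NAME = "qwen3.6-35b-a3b"


def build_request(prompt, enable_thinking=True, max_tokens=256):
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # Forwarded to the chat template; toggles thinking mode.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send(payload, url=BASE_URL, timeout=120):
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    payload = build_request("Summarize NVFP4 in one sentence.", enable_thinking=False)
    print(json.dumps(payload, indent=2))
    # Uncomment once the server is up:
    # reply = send(payload)
    # print(reply["choices"][0]["message"]["content"])
```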
DGX Spark (SM121) Compatibility Notes
- Native W4A4 confirmed via the FlashInfer CUTLASS NVFP4 MoE backend (no more W4A16 fallback)
- Verify in logs: `Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM` + `Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend`
- FP8 KV cache is not compatible with GDN non-causal attention; use `--kv-cache-dtype auto`
- DFlash requires `--attention-backend flash_attn` (flashinfer backend + DFlash is incompatible)
- `--language-model-only` skips vision encoder profiling for text-only inference
- Clear page cache before starting on UMA: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`
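The log check in these notes is easy to automate. The sketch below scans captured server output for the two log fragments quoted in this section. The function names and the idea of saving the log to a file are illustrative assumptions, and exact log wording may change between vLLM versions.

```python
# Log fragments quoted in the compatibility notes.
REQUIRED_MARKERS = (
    "Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM",
    "Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend",
)


def missing_markers(log_text, markers=REQUIRED_MARKERS):
    """Return the markers that do not appear anywhere in log_text."""
    return [marker for marker in markers if marker not in log_text]


def check_log_file(path):
    """Read a saved server log and report whether both markers are present."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        missing = missing_markers(handle.read())
    if missing:
        print("Native W4A4 not confirmed; missing log lines:")
        for marker in missing:
            print("  -", marker)
        return False
    print("Both NVFP4 kernel markers found.")
    return True
```

A typical use would be to redirect the server output to a file and call `check_log_file` on that path after startup has finished.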
Known Limitations
- Per-tensor global scales for fused `q_proj`/`k_proj`/`v_proj` may differ, causing a vLLM warning at load time. This is inherent to llm-compressor's per-layer quantization behavior; the impact on accuracy is typically small but measurable on strict tool-calling JSON schemas.
- The DFlash drafter was trained on the original Qwen3.6-35B-A3B, not the abliterated variant, so the acceptance rate may be lower than on the original model.
Safety Warning
This model has safety filtering removed (abliterated) and may generate sensitive, controversial, or inappropriate content. Users are solely responsible for all consequences arising from its use. Please ensure usage complies with local laws and ethical standards. Not suitable for public-facing or production applications.
Credits
- Original Model: Qwen/Qwen3.6-35B-A3B by Alibaba Qwen Team
- Abliteration: huihui-ai
- NVFP4 Quantization: YuYu1015 on NVIDIA DGX Spark (GB10)
- Quantization Tool: llm-compressor by vLLM Project
- Official Reference: RedHatAI/Qwen3.6-35B-A3B-NVFP4
- Sensitivity Analysis: Diagnosing FP4 Inference (arXiv 2603.08747)