Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4

English | 繁體中文


English

Quantized on 2026-04-21 using llm-compressor with mixed-domain calibration and sensitive-layer protection for maximum accuracy recovery.

Native W4A4 on DGX Spark (SM121) — confirmed working

Unlike earlier NVFP4 models on SM121, this checkpoint runs true W4A4 via FlashInfer CUTLASS NVFP4 MoE kernel (verified in vLLM logs: FlashInferCutlassNvFp4LinearKernel + FLASHINFER_CUTLASS NvFp4 MoE backend). Requires:

  • vLLM 0.19.1rc1.dev374+g1174723eb or later (includes PR #37725 arch-suffix fix)
  • FlashInfer ≥ 0.6.8 with SM120f compilation (PR #2650)
  • CUDA ≥ 12.9

MTP speculative decoding supported — MTP layers preserved in BF16 from the original checkpoint via save_mtp_tensors_to_checkpoint.

Abliteration changes the optimal speculative decoding setup — this is a known trade-off, not a defect.

This release's distinguishing feature is mixed-domain calibration (ultrachat_200k chat + Nemotron-Post-Training-Dataset-v2 reasoning, 256 samples total). The calibration recovers quantization accuracy, but it cannot undo the distribution shift introduced upstream by abliteration itself — the DFlash drafter was trained on the original Qwen3.6-35B-A3B weights, and the abliterated residual distribution no longer matches the drafter's prior, so acceptance rate drops.

Measured throughput on DGX Spark:

  • DFlash (num_speculative_tokens: 15) — ~50 t/s sustained, occasional bursts up to ~100 t/s
  • MTP (num_speculative_tokens: 1) — ~40 t/s sustained, occasional bursts up to ~70 t/s

Counter-intuitively, MTP with a single speculative token outperforms DFlash on this abliterated variant — MTP reuses the model's own hidden state, so it stays aligned with the abliterated distribution that the mixed-domain calibration was tuned against. Prefer --speculative-config '{"method":"mtp","num_speculative_tokens":1}' as the default; only fall back to DFlash if you specifically need it.

NVFP4 W4A4 quantization of huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121) with the FlashInfer CUTLASS FP4 MoE kernel.

Model Details

Item Value
Architecture MoE (35B total, 3B active, 256 experts / 8 routed + 1 shared) + GDN (Mamba) + Attention
Base model Qwen/Qwen3.6-35B-A3B
Fine-tuned by huihui-ai (abliteration)
Quantized by YuYu1015
Model size ~25.1 GB (NVFP4, vs ~71.9 GB BF16 original)
Context length Up to 262,144 tokens
Thinking mode Supported (enable_thinking: true/false)
Tool calling Supported (qwen3_xml parser)
MTP Built-in MTP weights included (preserved in BF16)
DFlash Compatible with z-lab/Qwen3.6-35B-A3B-DFlash

Quantization Details

This model uses a three-strategy stack (ACD) on top of the RedHatAI official flow:

Strategy Description
A. RedHatAI official baseline Qwen3_5MoeForConditionalGeneration + save_mtp_tensors_to_checkpoint (solves OOM on Qwen3.6, preserves MTP)
C. Mixed-domain calibration ultrachat_200k (128 chat) + Nemotron-Post-Training-Dataset-v2 (128 reasoning) = 256 total
D. Sweet-spot hyperparameters num_calibration_samples=256, max_seq_length=4096 (quality > quantity)

B (last-layer protection) incompatible with vLLM fused MoE: vLLM's CompressedTensorsMoEMethod requires all projections within a MoE block (gate/up/down × 256 experts + shared_expert) to share the same quantization scheme. Partial ignore triggers ValueError: All MoE projections need to have same quantization scheme but found multiple.

E (SpinQuant R1+R2) incompatible with multi-modal config: llm-compressor's get_head_dim only reads top-level config, not Qwen3.6's nested text_config.

Item Value
Method llm-compressor (main) + compressed-tensors (main)
Scheme NVFP4 W4A4 (E2M1 + FP8 per-group scaling, group size 16)
Format compressed-tensors
Calibration datasets HuggingFaceH4/ultrachat_200k (128) + nvidia/Nemotron-Post-Training-Dataset-v2 (128)
Calibration samples (total) 256
Calibration sequence length 4096
MoE calibration moe_calibrate_all_experts=True (via PR #2383)
Hardware NVIDIA DGX Spark (GB10, 128GB unified memory)
Environment transformers>=5.0,<6 + llm-compressor main + PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Layers Preserved in BF16

Layer pattern Reason
re:.*lm_head Output head, sensitive to quantization noise
re:.*embed_tokens$ Input embeddings
re:visual.* / re:model.visual.* Vision encoder
re:.*mlp.gate$ MoE router gate (routing decision, must stay BF16)
re:.*shared_expert_gate$ Shared expert routing gate
re:.*linear_attn.* GDN/DeltaNet (Mamba) layers — may output zeros if quantized
mtp.* (all MTP weights) Reattached in BF16 via save_mtp_tensors_to_checkpoint after quantization

Speculative Decoding

This model supports two speculative decoding methods:

DFlash (separate drafter, recommended for single-user / low-concurrency):

--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'

MTP (built-in weights, recommended default for this abliterated variant — see warning at top):

--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

Note: On hybrid GDN architectures, MTP may hit a state-rollback bug (vLLM #39273) during high token-rejection rates. The num_speculative_tokens: 1 setting also reduces exposure to this issue.

Serving with vLLM

vllm serve /path/to/model \
    --quantization compressed-tensors \
    --served-model-name qwen3.6-35b-a3b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --attention-backend flash_attn \
    --kv-cache-dtype auto \
    --gpu-memory-utilization 0.80 \
    --max-model-len 65536 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 64 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --performance-mode throughput \
    --speculative-config '{"method":"dflash","model":"/models/Qwen3.6-35B-A3B-DFlash","num_speculative_tokens":15}' \
    --trust-remote-code \
    --language-model-only

DGX Spark (SM121) Compatibility Notes

  • Native W4A4 confirmed via FlashInfer CUTLASS NVFP4 MoE backend (no more W4A16 fallback)
  • Verify in logs: Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM + Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend
  • FP8 KV cache is not compatible with GDN non-causal attention; use --kv-cache-dtype auto
  • DFlash requires --attention-backend flash_attn (flashinfer backend + DFlash is incompatible)
  • --language-model-only skips vision encoder profiling for text-only inference
  • Clear page cache before starting on UMA: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Known Limitations

  • Per-tensor global scales for fused q_proj / k_proj / v_proj may differ, causing a vLLM warning at load time. This is inherent to the llm-compressor per-layer quantization behavior; the impact on accuracy is typically small but measurable on strict tool-calling JSON schemas.
  • DFlash drafter was trained on the original Qwen3.6-35B-A3B, not the abliterated variant — acceptance rate may be lower than on the original model.

Safety Warning

This model has safety filtering removed (abliterated) and may generate sensitive, controversial, or inappropriate content. Users are solely responsible for all consequences arising from its use. Please ensure usage complies with local laws and ethical standards. Not suitable for public-facing or production applications.

Credits


繁體中文

2026-04-21 量化上傳,使用 llm-compressor 搭配混合領域校準與敏感層保護,最大化精度保留。

DGX Spark (SM121) 原生 W4A4 — 已驗證可用

不同於早期 SM121 的 NVFP4 模型,此 checkpoint 透過 FlashInfer CUTLASS NVFP4 MoE kernel 跑真 W4A4(vLLM log 可見 FlashInferCutlassNvFp4LinearKernel + FLASHINFER_CUTLASS NvFp4 MoE backend)。需要:

  • vLLM 0.19.1rc1.dev374+g1174723eb 以上(含 PR #37725 arch-suffix 修復)
  • FlashInfer ≥ 0.6.8 帶 SM120f 編譯(PR #2650
  • CUDA ≥ 12.9

支援 MTP 投機解碼 — MTP 層以 BF16 從原 checkpoint 透過 save_mtp_tensors_to_checkpoint 保留。

Abliteration 會改變投機解碼的最佳設定 — 這是已知取捨,非 bug。

本版本的特色是混合領域校準ultrachat_200k 對話 + Nemotron-Post-Training-Dataset-v2 推理,共 256 樣本)。校準能恢復量化精度,但無法逆轉 abliteration 在上游造成的分佈偏移 — DFlash drafter 是以原版 Qwen3.6-35B-A3B 權重訓練,abliterated 後的殘差分佈已不再符合 drafter 的先驗,接受率因此下降。

DGX Spark 實測吞吐:

  • DFlash(num_speculative_tokens: 15 — 約 50 t/s,偶爾飆至 ~100 t/s
  • MTP(num_speculative_tokens: 1 — 穩定約 40 t/s,偶爾飆至 ~70 t/s

反直覺地,MTP 搭配單一投機 token 在此 abliterated 變體上表現優於 DFlash — MTP 沿用模型自身的 hidden state,與混合領域校準所針對的 abliterated 分佈保持一致。**建議預設使用 --speculative-config '{"method":"mtp","num_speculative_tokens":1}'**,僅在特殊需求時才改用 DFlash。

huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated 的 NVFP4 W4A4 量化版,針對 NVIDIA DGX Spark (GB10 SM121) 最佳化,使用 FlashInfer CUTLASS FP4 MoE kernel。

模型資訊

項目 數值
架構 MoE(35B 總參數, 3B 活躍, 256 experts / 8 routed + 1 shared)+ GDN (Mamba) + Attention 混合
基礎模型 Qwen/Qwen3.6-35B-A3B
微調者 huihui-ai(abliteration)
量化者 YuYu1015
模型大小 ~25.1 GB(NVFP4,原版 BF16 約 71.9 GB)
Context 長度 最高 262,144 tokens
思考模式 支援(enable_thinking: true/false
工具呼叫 支援(qwen3_xml parser)
MTP 內建 MTP 權重(保留 BF16)
DFlash 相容 z-lab/Qwen3.6-35B-A3B-DFlash

量化詳情

此模型在 RedHatAI 官方流程上堆疊三項策略(ACD)

策略 說明
A. RedHatAI 官方基線 Qwen3_5MoeForConditionalGeneration + save_mtp_tensors_to_checkpoint(解 Qwen3.6 OOM、保留 MTP)
C. 混合領域校準 ultrachat_200k(128 對話)+ Nemotron-Post-Training-Dataset-v2(128 推理)共 256
D. 黃金比例參數 num_calibration_samples=256max_seq_length=4096(品質 > 數量)

B 策略(最後層保護)與 vLLM fused MoE 不相容:vLLM 的 CompressedTensorsMoEMethod 要求 MoE block 內所有 projection(gate/up/down × 256 experts + shared_expert)必須同 scheme。Partial ignore 會觸發 ValueError: All MoE projections need to have same quantization scheme but found multiple

E 策略(SpinQuant R1+R2)與 multi-modal config 不相容:llm-compressor 的 get_head_dim 只讀頂層 config,不讀 Qwen3.6 巢狀的 text_config

項目 數值
方法 llm-compressor(main)+ compressed-tensors(main)
方案 NVFP4 W4A4(E2M1 + FP8 逐群縮放,群組大小 16)
格式 compressed-tensors
校準資料集 HuggingFaceH4/ultrachat_200k(128)+ nvidia/Nemotron-Post-Training-Dataset-v2(128)
校準樣本總數 256
校準序列長度 4096
MoE 校準 moe_calibrate_all_experts=True(透過 PR #2383
量化硬體 NVIDIA DGX Spark(GB10, 128GB 統一記憶體)
環境 transformers>=5.0,<6 + llm-compressor main + PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

保留 BF16 的層

層 pattern 原因
re:.*lm_head 輸出頭,對量化雜訊敏感
re:.*embed_tokens$ 輸入嵌入
re:visual.* / re:model.visual.* 視覺編碼器
re:.*mlp.gate$ MoE 路由門(routing 決策必須 BF16)
re:.*shared_expert_gate$ 共享專家路由門
re:.*linear_attn.* GDN/DeltaNet (Mamba) 層 — 量化後可能輸出零
mtp.*(所有 MTP 權重) 量化後透過 save_mtp_tensors_to_checkpoint 以 BF16 重新掛回

投機解碼

本模型支援兩種投機解碼方式:

DFlash(獨立 drafter,建議單用戶 / 低併發):

--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'

MTP(內建權重,此 abliterated 變體的建議預設 — 詳見頂部警告):

--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

注意:混合 GDN 架構下 MTP 可能觸發 state-rollback bug(vLLM #39273),高 rejection rate 時輸出可能退化。num_speculative_tokens: 1 也能降低觸發機率。

vLLM 部署

vllm serve /path/to/model \
    --quantization compressed-tensors \
    --served-model-name qwen3.6-35b-a3b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --attention-backend flash_attn \
    --kv-cache-dtype auto \
    --gpu-memory-utilization 0.80 \
    --max-model-len 65536 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 64 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --performance-mode throughput \
    --speculative-config '{"method":"dflash","model":"/models/Qwen3.6-35B-A3B-DFlash","num_speculative_tokens":15}' \
    --trust-remote-code \
    --language-model-only

DGX Spark (SM121) 相容性說明

  • 原生 W4A4 已確認 透過 FlashInfer CUTLASS NVFP4 MoE backend(不再退回 W4A16)
  • log 驗證:Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM + Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend
  • FP8 KV cache 與 GDN non-causal attention 不相容,請使用 --kv-cache-dtype auto
  • DFlash 需搭配 --attention-backend flash_attn(flashinfer backend + DFlash 不相容)
  • --language-model-only 跳過視覺編碼器 profiling,加速純文字推理啟動
  • UMA 架構啟動前請先清除 page cache:sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

已知限制

  • Fused q_proj / k_proj / v_proj 的 per-tensor global scale 可能不一致,vLLM 載入時會印警告。這是 llm-compressor per-layer 量化的固有行為,一般精度影響輕微,但在嚴格 tool-calling JSON schema 下可能可測得。
  • DFlash drafter 是以原版 Qwen3.6-35B-A3B 訓練,非 abliterated 變體 — 接受率可能較原版低。

安全警告

此模型已移除安全過濾機制(abliterated),可能產生敏感、爭議性或不當內容。使用者須自行承擔所有風險與法律責任,並確保使用方式符合當地法規與倫理標準。不適用於公開或生產環境。

致謝

Downloads last month
571
Safetensors
Model size
21B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4

Quantized
(17)
this model
Quantizations
1 model

Collection including YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4

Paper for YuYu1015/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4