---
license: apache-2.0
base_model:
  - huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated
base_model_relation: quantized
library_name: transformers
pipeline_tag: text-generation
tags:
  - safetensors
  - qwen3
  - moe
  - nvfp4
  - 4-bit
  - quantized
  - abliterated
  - dgx-spark
  - blackwell
  - gb10
  - sm121
  - vllm
  - llm-compressor
language:
  - en
  - zh
---

# Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4

[English](#english) | [繁體中文](#繁體中文)

---

## English

> [!TIP]
> **Re-quantized on 2026-04-13** with a corrected ignore list (`mlp.gate` and `embed_tokens` are now preserved in BF16), fixing the routing quality issues in the previous release.

> [!WARNING]
> **NVIDIA DGX Spark (GB10 SM121): Driver 590.48+ / CUDA 13.1+**
>
> As of April 2026, NVFP4 software support on SM121 is still incomplete. The native W4A4 compute path is not yet functional on this hardware; the runtime silently falls back to W4A16 (BF16 activations), negating the theoretical throughput advantage of FP4.
>
> If **accuracy and inference speed** are your priority, we recommend the INT4 AutoRound version:
> 👉 **[YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound](https://huggingface.co/YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound)**
>
> INT4 AutoRound leverages the mature W4A16 Marlin kernel path on DGX Spark, offering more thorough calibration (~99.5% quality retention) and significantly more stable performance. The full potential of NVFP4 will only be unlocked once NVIDIA delivers complete W4A4 kernel support for SM121.

NVFP4 quantization of [huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated), optimized for **NVIDIA DGX Spark (GB10 SM121)**.
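
For readers new to the format, here is a minimal sketch of what 4-bit E2M1 quantization with one shared scale per 16-value group does to a row of weights. It is an illustration only, not the llm-compressor implementation: the function names are hypothetical, the group scale is kept as a full-precision Python float rather than being stored in FP8, and ties are resolved toward the smaller magnitude for simplicity.

```python
# Magnitudes representable by a 4-bit E2M1 value (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16  # one shared scale per 16 consecutive values


def fake_quantize_group(values):
    """Round one group onto the E2M1 grid and return the dequantized values."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0.0] * len(values)
    # Assumption: the scale stays a Python float; the real format stores it in FP8.
    scale = amax / E2M1_GRID[-1]
    out = []
    for v in values:
        # Pick the closest representable magnitude (ties go to the smaller one).
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((nearest if v >= 0 else -nearest) * scale)
    return out


def fake_quantize(weights, group_size=GROUP_SIZE):
    """Apply group-wise quantize/dequantize to a flat list of weights."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(fake_quantize_group(weights[start:start + group_size]))
    return out


if __name__ == "__main__":
    row = [0.01 * (i - 16) for i in range(32)]  # two groups of 16 values
    deq = fake_quantize(row)
    worst = max(abs(a - b) for a, b in zip(row, deq))
    print(f"max abs round-trip error: {worst:.4f}")
```

In a W4A16 path only the weights go through this kind of rounding and are expanded back to BF16 before the matmul; a W4A4 path would also quantize the activations.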

### Model Details

| Item | Value |
| --- | --- |
| Architecture | MoE (30B total, 3B active), 48 layers, 128 experts, top-8 routing |
| Base model | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
| Fine-tuned by | [huihui-ai](https://huggingface.co/huihui-ai) (Thinking 2507 + abliteration) |
| Quantized by | [YuYu1015](https://huggingface.co/YuYu1015) |
| Model size | ~18.1 GB (NVFP4, vs ~60 GB BF16 original) |
| Context length | Up to 131,072 tokens |
| Thinking mode | Built-in Chain-of-Thought reasoning (enabled by default) |
| Tool calling | Supported (`qwen3_coder` parser) |

### Quantization Details

| Item | Value |
| --- | --- |
| Method | [llm-compressor](https://github.com/vllm-project/llm-compressor) v0.10.0.1 |
| Scheme | NVFP4 (E2M1 + FP8 per-group scaling, group size 16) |
| Format | compressed-tensors v0.14.0.1 |
| Calibration dataset | [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (`train_sft` split) |
| Calibration samples | 512 |
| Calibration sequence length | 2048 |
| MoE expert calibration | `moe_calibrate_all_experts=True` (all experts receive calibration data) |
| Hardware | NVIDIA DGX Spark (GB10, 128 GB unified memory) |
| Environment | `transformers==4.57.1` + `llm-compressor==0.10.0.1` |

### Layers Preserved in BF16

The following layers are **not quantized** to preserve model quality:

| Layer | Reason |
| --- | --- |
| `lm_head` | Output head, sensitive to quantization noise |
| `re:.*mlp.gate$` | **MoE routing gate**: critical for expert selection accuracy |
| `re:.*embed_tokens$` | Input embeddings |

### Serving with vLLM

```bash
vllm serve /path/to/model \
  --quantization compressed-tensors \
  --served-model-name qwen3-30b \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --trust-remote-code
```

### DGX Spark (SM121) Compatibility Notes

- NVFP4 on SM121 falls back to W4A16 (native W4A4 path not yet supported, missing `cvt.e2m1x2` instruction)
- Qwen3 (non-3.5) has no Mamba layers, so FP8 KV cache works safely
- Qwen3 has no GDN, so `linear_attn` does not need to be excluded
- Clear page cache before starting on UMA: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`

### Safety Warning

This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use.

### Credits

- **Original Model**: [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) by Alibaba Qwen Team
- **Thinking 2507 & Abliteration**: [huihui-ai](https://huggingface.co/huihui-ai)
- **NVFP4 Quantization**: [YuYu1015](https://huggingface.co/YuYu1015) on NVIDIA DGX Spark (GB10)
- **Quantization Tool**: [llm-compressor](https://github.com/vllm-project/llm-compressor) by vLLM Project
- **Reference**: [RedHatAI/Qwen3-30B-A3B-NVFP4](https://huggingface.co/RedHatAI/Qwen3-30B-A3B-NVFP4)

---

## 繁體中文

> [!TIP]
> **2026-04-13 重新量化上傳**,修正先前版本的 ignore list(`mlp.gate` 與 `embed_tokens` 現在保留 BF16),解決 MoE 路由品質問題。

> [!WARNING]
> **NVIDIA DGX Spark (GB10 SM121) 使用者:Driver 590.48+ / CUDA 13.1+**
>
> 截至 2026 年 4 月,NVFP4 在 SM121 上的軟體支援仍不完整。原生 W4A4 運算路徑尚未在此硬體上就緒,執行時會靜默退回 W4A16(BF16 activation),FP4 的理論吞吐量優勢無法發揮。
>
> 若**精度與推理速度**為首要考量,建議改用 INT4 AutoRound 版本:
> 👉 **[YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound](https://huggingface.co/YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound)**
>
> INT4 AutoRound 在 DGX Spark 上使用成熟的 W4A16 Marlin kernel 路徑,校準更完整(品質保留約
99.5%),效能顯著更穩定。待 NVIDIA 為 SM121 提供完整的 W4A4 kernel 支援後,NVFP4 的真正優勢才能發揮。

[huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated) 的 NVFP4 量化版本,針對 **NVIDIA DGX Spark (GB10 SM121)** 最佳化。

### 模型資訊

| 項目 | 數值 |
| --- | --- |
| 架構 | MoE(30B 總參數, 3B 活躍),48 層,128 experts,top-8 routing |
| 基礎模型 | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
| 微調者 | [huihui-ai](https://huggingface.co/huihui-ai)(Thinking 2507 + abliteration) |
| 量化者 | [YuYu1015](https://huggingface.co/YuYu1015) |
| 模型大小 | ~18.1 GB(NVFP4,原版 BF16 約 60 GB) |
| Context 長度 | 最高 131,072 tokens |
| 思考模式 | 內建思維鏈推理(預設啟用) |
| 工具呼叫 | 支援(`qwen3_coder` parser) |

### 量化詳情

| 項目 | 數值 |
| --- | --- |
| 方法 | [llm-compressor](https://github.com/vllm-project/llm-compressor) v0.10.0.1 |
| 方案 | NVFP4(E2M1 + FP8 逐群縮放,群組大小 16) |
| 格式 | compressed-tensors v0.14.0.1 |
| 校準資料集 | [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)(`train_sft` 分割) |
| 校準樣本數 | 512 |
| 校準序列長度 | 2048 |
| MoE 專家校準 | `moe_calibrate_all_experts=True`(所有專家都接收校準資料) |
| 量化硬體 | NVIDIA DGX Spark(GB10, 128 GB 統一記憶體) |
| 環境 | `transformers==4.57.1` + `llm-compressor==0.10.0.1` |

### 保留 BF16 的層

以下層**未被量化**以保持模型品質:

| 層 | 原因 |
| --- | --- |
| `lm_head` | 輸出頭,對量化雜訊敏感 |
| `re:.*mlp.gate$` | **MoE 路由閘**:對專家選擇精度至關重要 |
| `re:.*embed_tokens$` | 輸入嵌入 |

### vLLM 部署

```bash
vllm serve /path/to/model \
  --quantization compressed-tensors \
  --served-model-name qwen3-30b \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --trust-remote-code
```

### DGX Spark (SM121) 相容性說明

- NVFP4 在 SM121 上會退回 W4A16(原生 W4A4 路徑尚未支援,缺少 `cvt.e2m1x2` 指令)
- Qwen3(非 3.5)沒有 Mamba 層,FP8 KV cache 可以安全使用
- Qwen3 沒有 GDN,`linear_attn` 不需要排除
- UMA 架構啟動前請先清除 page cache:`sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`

### 安全警告

此模型已移除安全過濾機制(abliterated),可能產生不當內容。使用者須自行承擔所有風險與法律責任。

### 致謝

- **原始模型**:[Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B),Alibaba Qwen 團隊
- **Thinking 2507 與去審查**:[huihui-ai](https://huggingface.co/huihui-ai)
- **NVFP4 量化**:[YuYu1015](https://huggingface.co/YuYu1015),於 NVIDIA DGX Spark (GB10) 上完成
- **量化工具**:[llm-compressor](https://github.com/vllm-project/llm-compressor),vLLM Project
- **參考**:[RedHatAI/Qwen3-30B-A3B-NVFP4](https://huggingface.co/RedHatAI/Qwen3-30B-A3B-NVFP4)