---
license: apache-2.0
base_model:
  - huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated
base_model_relation: quantized
library_name: transformers
pipeline_tag: text-generation
tags:
  - safetensors
  - qwen3
  - moe
  - nvfp4
  - 4-bit
  - quantized
  - abliterated
  - dgx-spark
  - blackwell
  - gb10
  - sm121
  - vllm
  - llm-compressor
language:
  - en
  - zh
---

# Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4

[English](#english) | [繁體中文](#繁體中文)

---

## English

> [!TIP]
> **Re-quantized on 2026-04-13** with corrected ignore list (`mlp.gate` + `embed_tokens` now preserved in BF16), fixing routing quality issues in the previous release.

> [!WARNING]
> **NVIDIA DGX Spark (GB10 SM121) — Driver 590.48+ / CUDA 13.1+**
>
> As of April 2026, NVFP4 software support on SM121 is still incomplete. The native W4A4 compute path is not yet functional on this hardware — the runtime silently falls back to W4A16 (BF16 activations), negating the theoretical throughput advantage of FP4.
>
> If **accuracy and inference speed** are your priority, we recommend the INT4 AutoRound version:
> 👉 **[YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound](https://huggingface.co/YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound)**
>
> INT4 AutoRound leverages the mature W4A16 Marlin kernel path on DGX Spark, offering more thorough calibration (~99.5% quality retention) and significantly more stable performance. The full potential of NVFP4 will only be unlocked once NVIDIA delivers complete W4A4 kernel support for SM121.

NVFP4 quantization of [huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated), optimized for **NVIDIA DGX Spark (GB10 SM121)**.

### Model Details

| Item           | Value                                                                        |
| -------------- | ---------------------------------------------------------------------------- |
| Architecture   | MoE (30B total, 3B active), 48 layers, 128 experts, top-8 routing            |
| Base model     | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)              |
| Fine-tuned by  | [huihui-ai](https://huggingface.co/huihui-ai) (Thinking 2507 + abliteration) |
| Quantized by   | [YuYu1015](https://huggingface.co/YuYu1015)                                  |
| Model size     | ~18.1 GB (NVFP4, vs ~60 GB BF16 original)                                    |
| Context length | Up to 131,072 tokens                                                         |
| Thinking mode  | Built-in Chain-of-Thought reasoning (enabled by default)                     |
| Tool calling   | Supported (`qwen3_coder` parser)                                             |

### Quantization Details

| Item                        | Value                                                                                                            |
| --------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| Method                      | [llm-compressor](https://github.com/vllm-project/llm-compressor) v0.10.0.1                                       |
| Scheme                      | NVFP4 (E2M1 + FP8 per-group scaling, group size 16)                                                              |
| Format                      | compressed-tensors v0.14.0.1                                                                                     |
| Calibration dataset         | [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (`train_sft` split) |
| Calibration samples         | 512                                                                                                              |
| Calibration sequence length | 2048                                                                                                             |
| MoE expert calibration      | `moe_calibrate_all_experts=True` (all experts receive calibration data)                                          |
| Hardware                    | NVIDIA DGX Spark (GB10, 128GB unified memory)                                                                    |
| Environment                 | `transformers==4.57.1` + `llm-compressor==0.10.0.1`                                                              |

### Layers Preserved in BF16

The following layers are **not quantized** to preserve model quality:

| Layer                | Reason                                                        |
| -------------------- | ------------------------------------------------------------- |
| `lm_head`            | Output head, sensitive to quantization noise                  |
| `re:.*mlp.gate$`     | **MoE routing gate** — critical for expert selection accuracy |
| `re:.*embed_tokens$` | Input embeddings                                              |

### Serving with vLLM

```bash
vllm serve /path/to/model \
    --quantization compressed-tensors \
    --served-model-name qwen3-30b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code
```

### DGX Spark (SM121) Compatibility Notes

- NVFP4 on SM121 falls back to W4A16 (native W4A4 path not yet supported, missing `cvt.e2m1x2` instruction)
- Qwen3 (non-3.5) has no Mamba layers, so FP8 KV cache works safely
- Qwen3 has no GDN, so `linear_attn` does not need to be excluded
- Clear page cache before starting on UMA: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`

### Safety Warning

This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use.

### Credits

- **Original Model**: [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) by Alibaba Qwen Team
- **Thinking 2507 & Abliteration**: [huihui-ai](https://huggingface.co/huihui-ai)
- **NVFP4 Quantization**: [YuYu1015](https://huggingface.co/YuYu1015) on NVIDIA DGX Spark (GB10)
- **Quantization Tool**: [llm-compressor](https://github.com/vllm-project/llm-compressor) by vLLM Project
- **Reference**: [RedHatAI/Qwen3-30B-A3B-NVFP4](https://huggingface.co/RedHatAI/Qwen3-30B-A3B-NVFP4)

---

## 繁體中文

> [!TIP]
> **2026-04-13 重新量化上傳**，修正先前版本的 ignore list（`mlp.gate` 與 `embed_tokens` 現在保留 BF16），解決 MoE 路由品質問題。

> [!WARNING]
> **NVIDIA DGX Spark (GB10 SM121) 使用者 — Driver 590.48+ / CUDA 13.1+**
>
> 截至 2026 年 4 月，NVFP4 在 SM121 上的軟體支援仍不完整。原生 W4A4 運算路徑尚未在此硬體上就緒——執行時會靜默退回 W4A16（BF16 activation），FP4 的理論吞吐量優勢無法發揮。
>
> 若**精度與推理速度**為首要考量，建議改用 INT4 AutoRound 版本：
> 👉 **[YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound](https://huggingface.co/YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound)**
>
> INT4 AutoRound 在 DGX Spark 上使用成熟的 W4A16 Marlin kernel 路徑，校準更完整（品質保留約 99.5%），效能顯著更穩定。待 NVIDIA 為 SM121 提供完整的 W4A4 kernel 支援後，NVFP4 的真正優勢才能發揮。

[huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated) 的 NVFP4 量化版本，針對 **NVIDIA DGX Spark (GB10 SM121)** 最佳化。

### 模型資訊

| 項目         | 數值                                                                          |
| ------------ | ----------------------------------------------------------------------------- |
| 架構         | MoE（30B 總參數, 3B 活躍），48 層，128 experts，top-8 routing                 |
| 基礎模型     | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)               |
| 微調者       | [huihui-ai](https://huggingface.co/huihui-ai)（Thinking 2507 + abliteration） |
| 量化者       | [YuYu1015](https://huggingface.co/YuYu1015)                                   |
| 模型大小     | ~18.1 GB（NVFP4，原版 BF16 約 60 GB）                                         |
| Context 長度 | 最高 131,072 tokens                                                           |
| 思考模式     | 內建思維鏈推理（預設啟用）                                                    |
| 工具呼叫     | 支援（`qwen3_coder` parser）                                                  |

### 量化詳情

| 項目         | 數值                                                                                                            |
| ------------ | --------------------------------------------------------------------------------------------------------------- |
| 方法         | [llm-compressor](https://github.com/vllm-project/llm-compressor) v0.10.0.1                                      |
| 方案         | NVFP4（E2M1 + FP8 逐群縮放，群組大小 16）                                                                       |
| 格式         | compressed-tensors v0.14.0.1                                                                                    |
| 校準資料集   | [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (`train_sft` 分割) |
| 校準樣本數   | 512                                                                                                             |
| 校準序列長度 | 2048                                                                                                            |
| MoE 專家校準 | `moe_calibrate_all_experts=True`（所有專家都接收校準資料）                                                      |
| 量化硬體     | NVIDIA DGX Spark（GB10, 128GB 統一記憶體）                                                                      |
| 環境         | `transformers==4.57.1` + `llm-compressor==0.10.0.1`                                                             |

### 保留 BF16 的層

以下層**未被量化**以保持模型品質：

| 層                   | 原因                                   |
| -------------------- | -------------------------------------- |
| `lm_head`            | 輸出頭，對量化雜訊敏感                 |
| `re:.*mlp.gate$`     | **MoE 路由閘**——對專家選擇精度至關重要 |
| `re:.*embed_tokens$` | 輸入嵌入                               |

### vLLM 部署

```bash
vllm serve /path/to/model \
    --quantization compressed-tensors \
    --served-model-name qwen3-30b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code
```

### DGX Spark (SM121) 相容性說明

- NVFP4 在 SM121 上會退回 W4A16（原生 W4A4 路徑尚未支援，缺少 `cvt.e2m1x2` 指令）
- Qwen3（非 3.5）沒有 Mamba 層，FP8 KV cache 可以安全使用
- Qwen3 沒有 GDN，`linear_attn` 不需要排除
- UMA 架構啟動前請先清除 page cache：`sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`

### 安全警告

此模型已移除安全過濾機制（abliterated），可能產生不當內容。使用者須自行承擔所有風險與法律責任。

### 致謝

- **原始模型**：[Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)，Alibaba Qwen 團隊
- **Thinking 2507 與去審查**：[huihui-ai](https://huggingface.co/huihui-ai)
- **NVFP4 量化**：[YuYu1015](https://huggingface.co/YuYu1015)，於 NVIDIA DGX Spark (GB10) 上完成
- **量化工具**：[llm-compressor](https://github.com/vllm-project/llm-compressor)，vLLM Project
- **參考**：[RedHatAI/Qwen3-30B-A3B-NVFP4](https://huggingface.co/RedHatAI/Qwen3-30B-A3B-NVFP4)