---
license: apache-2.0
base_model:
- huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated
base_model_relation: quantized
library_name: transformers
pipeline_tag: text-generation
tags:
- safetensors
- qwen3
- moe
- nvfp4
- 4-bit
- quantized
- abliterated
- dgx-spark
- blackwell
- gb10
- sm121
- vllm
- llm-compressor
language:
- en
- zh
---
# Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4
[English](#english) | [繁體中文](#繁體中文)

---
## English
> [!TIP]
> **Re-quantized on 2026-04-13** with a corrected ignore list (`mlp.gate` and `embed_tokens` are now preserved in BF16), fixing the routing quality issues in the previous release.

> [!WARNING]
> **NVIDIA DGX Spark (GB10 SM121) — Driver 590.48+ / CUDA 13.1+**
>
> As of April 2026, NVFP4 software support on SM121 is still incomplete. The native W4A4 compute path is not yet functional on this hardware — the runtime silently falls back to W4A16 (BF16 activations), negating the theoretical throughput advantage of FP4.
>
> If **accuracy and inference speed** are your priorities, we recommend the INT4 AutoRound version:
> 👉 **[YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound](https://huggingface.co/YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound)**
>
> INT4 AutoRound leverages the mature W4A16 Marlin kernel path on DGX Spark, offering more thorough calibration (~99.5% quality retention) and significantly more stable performance. The full potential of NVFP4 will only be unlocked once NVIDIA delivers complete W4A4 kernel support for SM121.

NVFP4 quantization of [huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated), optimized for **NVIDIA DGX Spark (GB10 SM121)**.
### Model Details
| Item | Value |
| -------------- | ---------------------------------------------------------------------------- |
| Architecture | MoE (30B total, 3B active), 48 layers, 128 experts, top-8 routing |
| Base model | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
| Fine-tuned by | [huihui-ai](https://huggingface.co/huihui-ai) (Thinking 2507 + abliteration) |
| Quantized by | [YuYu1015](https://huggingface.co/YuYu1015) |
| Model size | ~18.1 GB (NVFP4, vs ~60 GB BF16 original) |
| Context length | Up to 131,072 tokens |
| Thinking mode | Built-in Chain-of-Thought reasoning (enabled by default) |
| Tool calling | Supported (`qwen3_coder` parser) |
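
As a rough sanity check on the ~18.1 GB figure, the sketch below estimates the checkpoint size from bit widths alone. The total parameter count (30.5B), vocabulary size (151,936), hidden size (2,048), and untied `lm_head` are assumptions taken from the public Qwen3-30B-A3B configuration, not from this card. The 16-element group with one FP8 scale and the set of BF16-preserved layers come from the quantization sections of this card. Norms, per-tensor scales, and file metadata are ignored.

```python
# Back-of-the-envelope checkpoint size for NVFP4 with a few BF16 layers.
# ASSUMPTIONS (not stated in this card): TOTAL_PARAMS, VOCAB_SIZE, HIDDEN_SIZE
# and an untied lm_head, all taken from the public Qwen3-30B-A3B config.
TOTAL_PARAMS = 30.5e9
VOCAB_SIZE = 151_936
HIDDEN_SIZE = 2_048
NUM_LAYERS = 48      # from the Model Details table
NUM_EXPERTS = 128    # from the Model Details table
GROUP_SIZE = 16      # from the Quantization Details table


def nvfp4_bits_per_weight(group_size: int = GROUP_SIZE) -> float:
    """One 4-bit E2M1 value plus one 8-bit scale shared by each group."""
    return 4 + 8 / group_size


def estimate_size_gb() -> dict:
    """Split parameters into BF16-preserved and NVFP4-quantized parts."""
    embed = VOCAB_SIZE * HIDDEN_SIZE
    lm_head = VOCAB_SIZE * HIDDEN_SIZE
    routers = NUM_LAYERS * HIDDEN_SIZE * NUM_EXPERTS  # one mlp.gate per layer
    bf16_params = embed + lm_head + routers
    fp4_params = TOTAL_PARAMS - bf16_params

    bf16_gb = bf16_params * 2 / 1e9  # 2 bytes per BF16 weight
    fp4_gb = fp4_params * nvfp4_bits_per_weight() / 8 / 1e9
    return {"bf16_gb": bf16_gb, "fp4_gb": fp4_gb, "total_gb": bf16_gb + fp4_gb}


result = estimate_size_gb()
print(f"bits per quantized weight: {nvfp4_bits_per_weight():.2f}")
print(f"estimated size: {result['total_gb']:.1f} GB")
```

Under these assumptions the estimate lands close to 18 GB, which is consistent with the size reported in the table.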
### Quantization Details
| Item | Value |
| --------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| Method | [llm-compressor](https://github.com/vllm-project/llm-compressor) v0.10.0.1 |
| Scheme                      | NVFP4 (E2M1 values + FP8 E4M3 per-group scales, group size 16)                                                   |
| Format | compressed-tensors v0.14.0.1 |
| Calibration dataset | [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (`train_sft` split) |
| Calibration samples | 512 |
| Calibration sequence length | 2048 |
| MoE expert calibration | `moe_calibrate_all_experts=True` (all experts receive calibration data) |
| Hardware | NVIDIA DGX Spark (GB10, 128GB unified memory) |
| Environment | `transformers==4.57.1` + `llm-compressor==0.10.0.1` |
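
To make the scheme row concrete, here is a minimal pure-Python sketch of how one group of 16 weights maps onto the E2M1 grid. It is an illustration only, not the llm-compressor implementation: the function names are hypothetical, the scale is kept as an ordinary float (real NVFP4 stores it in FP8 and adds a per-tensor scale), and rounding is simple nearest-value.

```python
# Illustrative fake-quantization of one 16-weight group to the E2M1 grid.
# SIMPLIFICATIONS: the group scale stays a Python float instead of FP8, no
# per-tensor scale is applied, and ties resolve to the smaller magnitude.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16


def quantize_group(weights):
    """Return (scale, codes) where each code is a signed E2M1 grid value."""
    if len(weights) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} weights, got {len(weights)}")
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude in the group onto the largest grid value (6.0).
    scale = max_abs / 6.0 if max_abs > 0 else 1.0
    codes = []
    for w in weights:
        target = abs(w) / scale
        magnitude = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(-magnitude if w < 0 else magnitude)
    return scale, codes


def dequantize_group(scale, codes):
    """Reconstruct approximate weights from the group scale and codes."""
    return [scale * c for c in codes]


weights = [i / 10 - 0.8 for i in range(GROUP_SIZE)]
scale, codes = quantize_group(weights)
restored = dequantize_group(scale, codes)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f}, worst absolute error={worst:.4f}")
```

Because the grid is coarser near its top end (the gap between 4 and 6), the largest weights in a group carry the largest rounding error, which is one reason small groups with their own scale help.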
### Layers Preserved in BF16
The following layers are **not quantized** to preserve model quality:
| Layer | Reason |
| -------------------- | ------------------------------------------------------------- |
| `lm_head` | Output head, sensitive to quantization noise |
| `re:.*mlp.gate$` | **MoE routing gate** — critical for expert selection accuracy |
| `re:.*embed_tokens$` | Input embeddings |
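
The `re:` prefix marks an entry as a regular expression matched against module names. The sketch below is a simplified stand-in for that matching, not the actual compressed-tensors code, and the module names are assumed from the usual Hugging Face Qwen3 MoE layout. It shows why `re:.*mlp.gate$` catches the per-layer router while leaving the experts' `gate_proj` projections quantized.

```python
import re

# Ignore list as given in the table above.
IGNORE = ["lm_head", "re:.*mlp.gate$", "re:.*embed_tokens$"]


def is_ignored(module_name, ignore=IGNORE):
    """Simplified matcher: 're:' entries are regexes, others are exact names."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# ASSUMPTION: module names follow the common Hugging Face Qwen3 MoE naming.
candidates = [
    "model.embed_tokens",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.5.gate_proj",
    "lm_head",
]
for name in candidates:
    status = "BF16 (ignored)" if is_ignored(name) else "NVFP4"
    print(f"{name:45s} {status}")
```

The trailing `$` is what keeps `gate_proj` out of the match. The unescaped `.` in the pattern matches any character, which is harmless here but worth knowing when adapting the list.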
### Serving with vLLM
```bash
vllm serve /path/to/model \
--quantization compressed-tensors \
--served-model-name qwen3-30b \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.90 \
--max-model-len 32768 \
--enable-prefix-caching \
--enable-chunked-prefill \
--trust-remote-code
```
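
Once the server is up, it exposes an OpenAI-compatible API. The following standard-library sketch builds a chat request for it. It assumes vLLM's default port 8000 on localhost and reuses the `--served-model-name` from the command above; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # ASSUMPTION: default vLLM host and port
MODEL_NAME = "qwen3-30b"               # must match --served-model-name


def build_chat_request(prompt, max_tokens=512):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=300.0):
    """Send the request and return the assistant message as a dict."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]
```

Calling `chat("What is the capital of France?")` against a running server returns the assistant message. With a reasoning parser enabled, the chain-of-thought text is typically returned in a field separate from `content`; the exact field name depends on the vLLM version, so inspect the returned dict rather than relying on a fixed key.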
### DGX Spark (SM121) Compatibility Notes
- NVFP4 on SM121 falls back to W4A16: the native W4A4 path is not yet supported because the `cvt.e2m1x2` instruction is missing.
- Qwen3 (unlike Qwen3.5) has no Mamba layers, so the FP8 KV cache is safe to use.
- Qwen3 has no Gated DeltaNet (GDN) layers, so `linear_attn` does not need to be excluded from quantization.
- On this unified-memory (UMA) system, clear the page cache before starting the server: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`
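
To show what the FP8 KV cache buys at the `--max-model-len 32768` setting used in the serving command, the sketch below computes KV cache size per sequence. The KV head count (4) and head dimension (128) are assumptions from the public Qwen3-30B-A3B configuration, not from this card, and the small overhead of FP8 scaling factors is ignored.

```python
# KV cache footprint per sequence: 2 tensors (K and V) per layer.
NUM_LAYERS = 48    # from the Model Details table
NUM_KV_HEADS = 4   # ASSUMPTION: public Qwen3-30B-A3B config
HEAD_DIM = 128     # ASSUMPTION: public Qwen3-30B-A3B config


def kv_cache_bytes_per_token(bytes_per_element):
    """Bytes of K and V stored for a single token across all layers."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_element


def kv_cache_gib(num_tokens, bytes_per_element):
    """KV cache size in GiB for one sequence of num_tokens tokens."""
    return num_tokens * kv_cache_bytes_per_token(bytes_per_element) / 2**30


for label, width in (("BF16", 2), ("FP8", 1)):
    print(f"{label}: {kv_cache_gib(32_768, width):.2f} GiB per 32,768-token sequence")
```

Halving the element width halves the cache, which on a unified-memory machine leaves more room for concurrent sequences or a longer context.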
### Safety Warning
This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use.
### Credits
- **Original Model**: [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) by Alibaba Qwen Team
- **Thinking 2507 & Abliteration**: [huihui-ai](https://huggingface.co/huihui-ai)
- **NVFP4 Quantization**: [YuYu1015](https://huggingface.co/YuYu1015) on NVIDIA DGX Spark (GB10)
- **Quantization Tool**: [llm-compressor](https://github.com/vllm-project/llm-compressor) by vLLM Project
- **Reference**: [RedHatAI/Qwen3-30B-A3B-NVFP4](https://huggingface.co/RedHatAI/Qwen3-30B-A3B-NVFP4)
---
## 繁體中文
> [!TIP]
> **2026-04-13 重新量化上傳**,修正先前版本的 ignore list(`mlp.gate` 與 `embed_tokens` 現在保留 BF16),解決 MoE 路由品質問題。
> [!WARNING]
> **NVIDIA DGX Spark (GB10 SM121) 使用者 — Driver 590.48+ / CUDA 13.1+**
>
> 截至 2026 年 4 月,NVFP4 在 SM121 上的軟體支援仍不完整。原生 W4A4 運算路徑尚未在此硬體上就緒——執行時會靜默退回 W4A16(BF16 activation),FP4 的理論吞吐量優勢無法發揮。
>
> 若**精度與推理速度**為首要考量,建議改用 INT4 AutoRound 版本:
> 👉 **[YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound](https://huggingface.co/YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound)**
>
> INT4 AutoRound 在 DGX Spark 上使用成熟的 W4A16 Marlin kernel 路徑,校準更完整(品質保留約 99.5%),效能顯著更穩定。待 NVIDIA 為 SM121 提供完整的 W4A4 kernel 支援後,NVFP4 的真正優勢才能發揮。
[huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated) 的 NVFP4 量化版本,針對 **NVIDIA DGX Spark (GB10 SM121)** 最佳化。
### 模型資訊
| 項目 | 數值 |
| ------------ | ----------------------------------------------------------------------------- |
| 架構 | MoE(30B 總參數, 3B 活躍),48 層,128 experts,top-8 routing |
| 基礎模型 | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
| 微調者 | [huihui-ai](https://huggingface.co/huihui-ai)(Thinking 2507 + abliteration) |
| 量化者 | [YuYu1015](https://huggingface.co/YuYu1015) |
| 模型大小 | ~18.1 GB(NVFP4,原版 BF16 約 60 GB) |
| Context 長度 | 最高 131,072 tokens |
| 思考模式 | 內建思維鏈推理(預設啟用) |
| 工具呼叫 | 支援(`qwen3_coder` parser) |
### 量化詳情
| 項目 | 數值 |
| ------------ | --------------------------------------------------------------------------------------------------------------- |
| 方法 | [llm-compressor](https://github.com/vllm-project/llm-compressor) v0.10.0.1 |
| 方案 | NVFP4(E2M1 + FP8 逐群縮放,群組大小 16) |
| 格式 | compressed-tensors v0.14.0.1 |
| 校準資料集 | [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (`train_sft` 分割) |
| 校準樣本數 | 512 |
| 校準序列長度 | 2048 |
| MoE 專家校準 | `moe_calibrate_all_experts=True`(所有專家都接收校準資料) |
| 量化硬體 | NVIDIA DGX Spark(GB10, 128GB 統一記憶體) |
| 環境 | `transformers==4.57.1` + `llm-compressor==0.10.0.1` |
### 保留 BF16 的層
以下層**未被量化**以保持模型品質:
| 層 | 原因 |
| -------------------- | -------------------------------------- |
| `lm_head` | 輸出頭,對量化雜訊敏感 |
| `re:.*mlp.gate$` | **MoE 路由閘**——對專家選擇精度至關重要 |
| `re:.*embed_tokens$` | 輸入嵌入 |
### vLLM 部署
```bash
vllm serve /path/to/model \
--quantization compressed-tensors \
--served-model-name qwen3-30b \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.90 \
--max-model-len 32768 \
--enable-prefix-caching \
--enable-chunked-prefill \
--trust-remote-code
```
### DGX Spark (SM121) 相容性說明
- NVFP4 在 SM121 上會退回 W4A16(原生 W4A4 路徑尚未支援,缺少 `cvt.e2m1x2` 指令)
- Qwen3(非 3.5)沒有 Mamba 層,FP8 KV cache 可以安全使用
- Qwen3 沒有 GDN,`linear_attn` 不需要排除
- UMA 架構啟動前請先清除 page cache:`sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`
### 安全警告
此模型已移除安全過濾機制(abliterated),可能產生不當內容。使用者須自行承擔所有風險與法律責任。
### 致謝
- **原始模型**:[Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B),Alibaba Qwen 團隊
- **Thinking 2507 與去審查**:[huihui-ai](https://huggingface.co/huihui-ai)
- **NVFP4 量化**:[YuYu1015](https://huggingface.co/YuYu1015),於 NVIDIA DGX Spark (GB10) 上完成
- **量化工具**:[llm-compressor](https://github.com/vllm-project/llm-compressor),vLLM Project
- **參考**:[RedHatAI/Qwen3-30B-A3B-NVFP4](https://huggingface.co/RedHatAI/Qwen3-30B-A3B-NVFP4)