---
license: apache-2.0
base_model:
  - huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated
base_model_relation: quantized
library_name: transformers
pipeline_tag: text-generation
tags:
  - safetensors
  - qwen3
  - moe
  - nvfp4
  - 4-bit
  - quantized
  - abliterated
  - dgx-spark
  - blackwell
  - gb10
  - sm121
  - vllm
  - llm-compressor
language:
  - en
  - zh
---

# Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4

[English](#english) | [繁體中文](#繁體中文)

---

## English

> [!TIP]
> **Re-quantized on 2026-04-13** with a corrected ignore list (`mlp.gate` and `embed_tokens` are now preserved in BF16), which fixes the routing quality issues in the previous release.

> [!WARNING]
> **NVIDIA DGX Spark (GB10 SM121) — Driver 590.48+ / CUDA 13.1+**
>
> As of April 2026, NVFP4 software support on SM121 is still incomplete. The native W4A4 compute path is not yet functional on this hardware: the runtime silently falls back to W4A16 (FP4 weights, BF16 activations), which negates the theoretical throughput advantage of FP4.
>
> If **accuracy and inference speed** are your priority, we recommend the INT4 AutoRound version:
> 👉 **[YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound](https://huggingface.co/YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound)**
>
> INT4 AutoRound leverages the mature W4A16 Marlin kernel path on DGX Spark, offering more thorough calibration (~99.5% quality retention) and significantly more stable performance. The full potential of NVFP4 will only be unlocked once NVIDIA delivers complete W4A4 kernel support for SM121.

NVFP4 quantization of [huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated), optimized for **NVIDIA DGX Spark (GB10 SM121)**.

### Model Details

| Item           | Value                                                                        |
| -------------- | ---------------------------------------------------------------------------- |
| Architecture   | MoE (30B total, 3B active), 48 layers, 128 experts, top-8 routing            |
| Base model     | [Qwen/Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) |
| Abliterated by | [huihui-ai](https://huggingface.co/huihui-ai)                                |
| Quantized by   | [YuYu1015](https://huggingface.co/YuYu1015)                                  |
| Model size     | ~18.1 GB (NVFP4, vs ~60 GB BF16 original)                                    |
| Context length | Up to 131,072 tokens                                                         |
| Thinking mode  | Built-in Chain-of-Thought reasoning (enabled by default)                     |
| Tool calling   | Supported (`qwen3_coder` parser)                                             |
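
The size figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only an estimate under stated assumptions: it takes the "30B total" figure as exactly 30e9 parameters and pretends every weight is stored as 4 bits plus one 8-bit scale shared by a group of 16 weights. Real checkpoints also contain layers kept in BF16 and file metadata, so this is a rough lower bound, not a measurement.

```python
# Rough checkpoint-size estimate. Assumptions (not measurements):
# - 30e9 parameters, taken from the "30B total" figure in the table
# - every weight stored as 4 bits plus one 8-bit scale per group of 16
PARAMS = 30e9
GROUP_SIZE = 16


def nvfp4_bits_per_weight(group_size: int = GROUP_SIZE) -> float:
    """4 payload bits plus the amortized 8-bit per-group scale."""
    return 4 + 8 / group_size


def size_gb(params: float, bits_per_weight: float) -> float:
    """Size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


bf16_gb = size_gb(PARAMS, 16)
nvfp4_gb = size_gb(PARAMS, nvfp4_bits_per_weight())
print(f"BF16:  {bf16_gb:.1f} GB")
print(f"NVFP4: {nvfp4_gb:.1f} GB")
```

Under these assumptions the NVFP4 estimate lands somewhat below the reported ~18.1 GB. The gap is plausibly explained by the layers that stay in BF16, although that breakdown is not measured here.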

### Quantization Details

| Item                        | Value                                                                                                            |
| --------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| Method                      | [llm-compressor](https://github.com/vllm-project/llm-compressor) v0.10.0.1                                       |
| Scheme                      | NVFP4 (E2M1 + FP8 per-group scaling, group size 16)                                                              |
| Format                      | compressed-tensors v0.14.0.1                                                                                     |
| Calibration dataset         | [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (`train_sft` split) |
| Calibration samples         | 512                                                                                                              |
| Calibration sequence length | 2048                                                                                                             |
| MoE expert calibration      | `moe_calibrate_all_experts=True` (all experts receive calibration data)                                          |
| Hardware                    | NVIDIA DGX Spark (GB10, 128GB unified memory)                                                                    |
| Environment                 | `transformers==4.57.1` + `llm-compressor==0.10.0.1`                                                              |
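
To make the scheme row more concrete, here is a minimal sketch of NVFP4-style rounding for a single group of 16 weights. It is an illustration under simplifying assumptions, not llm-compressor's implementation: the per-group scale is chosen as `max|w| / 6` and kept as an ordinary Python float, whereas the real format stores that scale in FP8. The helper names `quantize_group` and `dequantize_group` are hypothetical.

```python
# Illustrative NVFP4-style rounding of one group of 16 weights.
# Assumption: scale = max|w| / 6 per group, kept as a Python float here
# (the real format stores the per-group scale in FP8).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16


def quantize_group(weights):
    """Return (scale, codes), where codes are signed E2M1 grid values."""
    assert len(weights) == GROUP_SIZE
    amax = max(abs(w) for w in weights)
    scale = amax / E2M1_MAGNITUDES[-1] if amax > 0 else 1.0
    codes = []
    for w in weights:
        target = abs(w) / scale
        # Round the scaled magnitude to the nearest representable value.
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if w >= 0 else -nearest)
    return scale, codes


def dequantize_group(scale, codes):
    """Map grid values back to approximate weights."""
    return [scale * c for c in codes]


weights = [0.03 * (i - 8) for i in range(GROUP_SIZE)]  # toy values
scale, codes = quantize_group(weights)
restored = dequantize_group(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f}, max abs error={max_err:.4f}")
```

Because the grid spacing widens toward the largest magnitude, the rounding error is largest for values between 4 and 6 times the scale. This is one reason a small group size keeps the error local to 16 weights at a time.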

### Layers Preserved in BF16

The following layers are **not quantized** to preserve model quality:

| Layer                | Reason                                                        |
| -------------------- | ------------------------------------------------------------- |
| `lm_head`            | Output head, sensitive to quantization noise                  |
| `re:.*mlp.gate$`     | **MoE routing gate** — critical for expert selection accuracy |
| `re:.*embed_tokens$` | Input embeddings                                              |
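
The effect of these patterns can be checked with a short sketch. The matching rule used here is an assumption made for illustration: entries prefixed with `re:` are treated as regular expressions, and anything else as an exact module name. The helper `is_ignored` and the module names are hypothetical, written in the style of a Qwen3 MoE checkpoint.

```python
import re

# Patterns copied from the table above. The matching rule is an assumption
# for illustration: "re:" entries are regular expressions, everything else
# must equal the module name exactly.
IGNORE = ["lm_head", "re:.*mlp.gate$", "re:.*embed_tokens$"]


def is_ignored(module_name, patterns=IGNORE):
    """Return True if the module would be left unquantized."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, not read from the actual checkpoint.
for name in [
    "lm_head",
    "model.embed_tokens",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.5.gate_proj",
    "model.layers.0.self_attn.q_proj",
]:
    print(f"{name}: {'BF16' if is_ignored(name) else 'NVFP4'}")
```

The trailing `$` matters: it lets the router `mlp.gate` match while the per-expert `gate_proj` projections, whose names merely contain `gate`, remain quantized.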

### Serving with vLLM

```bash
vllm serve /path/to/model \
    --quantization compressed-tensors \
    --served-model-name qwen3-30b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code
```
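
Once the server is running, it can be queried through its OpenAI-compatible API. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on `localhost` at vLLM's default port 8000 and was started with `--served-model-name qwen3-30b` as above; the helper names `build_payload` and `chat` are hypothetical.

```python
import json
import urllib.request

# Assumptions: the server from the command above is reachable on
# localhost:8000 and serves the model under the name "qwen3-30b".
BASE_URL = "http://localhost:8000/v1"
MODEL = "qwen3-30b"


def build_payload(prompt, max_tokens=1024):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt):
    """POST the request and return the assistant message as a dict."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]


# Show the request body without contacting the server.
print(json.dumps(build_payload("What is the capital of France?"), indent=2))
```

Calling `chat("What is the capital of France?")` against a running server returns the assistant message. With a reasoning parser enabled, the chain of thought is expected in a separate field next to `content`; the exact field name may differ between vLLM versions, so inspect the returned dict before relying on it.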

### DGX Spark (SM121) Compatibility Notes

- NVFP4 on SM121 falls back to W4A16 (the native W4A4 path is not yet supported; the `cvt.e2m1x2` instruction is missing)
- Qwen3 (unlike Qwen3.5) has no Mamba layers, so an FP8 KV cache is safe to use
- Qwen3 has no Gated DeltaNet (GDN) layers, so `linear_attn` does not need to be excluded
- On unified-memory (UMA) systems, clear the page cache before starting the server: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`
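
To give a feel for what the FP8 KV cache saves on a unified-memory machine, the sketch below estimates KV-cache size per sequence. The layer count and maximum length come from this card; `NUM_KV_HEADS` and `HEAD_DIM` are assumptions that should be checked against the checkpoint's `config.json`. Any per-tensor scale overhead is ignored.

```python
# KV-cache size estimate. NUM_LAYERS and MAX_MODEL_LEN come from this card;
# NUM_KV_HEADS and HEAD_DIM are assumptions to verify in config.json.
NUM_LAYERS = 48
MAX_MODEL_LEN = 32768
NUM_KV_HEADS = 4   # assumed
HEAD_DIM = 128     # assumed


def kv_bytes_per_token(bytes_per_element):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2 covers the key tensor and the value tensor in every layer.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_element


def kv_gib_per_sequence(bytes_per_element, seq_len=MAX_MODEL_LEN):
    """KV cache in GiB for one sequence of seq_len tokens."""
    return kv_bytes_per_token(bytes_per_element) * seq_len / 2**30


for label, width in (("BF16", 2), ("FP8", 1)):
    print(f"{label}: {kv_bytes_per_token(width)} bytes/token, "
          f"{kv_gib_per_sequence(width):.2f} GiB per full-length sequence")
```

Under these assumptions, switching the cache from BF16 to FP8 halves the memory each sequence needs, which leaves more of the shared memory pool for concurrent requests or longer contexts.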

### Safety Warning

This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use.

### Credits

- **Original Model**: [Qwen/Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) by Alibaba Qwen Team
- **Abliteration**: [huihui-ai](https://huggingface.co/huihui-ai)
- **NVFP4 Quantization**: [YuYu1015](https://huggingface.co/YuYu1015) on NVIDIA DGX Spark (GB10)
- **Quantization Tool**: [llm-compressor](https://github.com/vllm-project/llm-compressor) by vLLM Project
- **Reference**: [RedHatAI/Qwen3-30B-A3B-NVFP4](https://huggingface.co/RedHatAI/Qwen3-30B-A3B-NVFP4)

---

## 繁體中文

> [!TIP]
> **2026-04-13 重新量化上傳**，修正先前版本的 ignore list（`mlp.gate` 與 `embed_tokens` 現在保留 BF16），解決 MoE 路由品質問題。

> [!WARNING]
> **NVIDIA DGX Spark (GB10 SM121) 使用者 — Driver 590.48+ / CUDA 13.1+**
>
> 截至 2026 年 4 月，NVFP4 在 SM121 上的軟體支援仍不完整。原生 W4A4 運算路徑尚未在此硬體上就緒——執行時會靜默退回 W4A16（BF16 activation），FP4 的理論吞吐量優勢無法發揮。
>
> 若**精度與推理速度**為首要考量，建議改用 INT4 AutoRound 版本：
> 👉 **[YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound](https://huggingface.co/YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound)**
>
> INT4 AutoRound 在 DGX Spark 上使用成熟的 W4A16 Marlin kernel 路徑，校準更完整（品質保留約 99.5%），效能顯著更穩定。待 NVIDIA 為 SM121 提供完整的 W4A4 kernel 支援後，NVFP4 的真正優勢才能發揮。

[huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated) 的 NVFP4 量化版本，針對 **NVIDIA DGX Spark (GB10 SM121)** 最佳化。

### 模型資訊

| 項目         | 數值                                                                          |
| ------------ | ----------------------------------------------------------------------------- |
| 架構         | MoE（30B 總參數, 3B 活躍），48 層，128 experts，top-8 routing                 |
| 基礎模型     | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)               |
| 微調者       | [huihui-ai](https://huggingface.co/huihui-ai)（Thinking 2507 + abliteration） |
| 量化者       | [YuYu1015](https://huggingface.co/YuYu1015)                                   |
| 模型大小     | ~18.1 GB（NVFP4，原版 BF16 約 60 GB）                                         |
| Context 長度 | 最高 131,072 tokens                                                           |
| 思考模式     | 內建思維鏈推理（預設啟用）                                                    |
| 工具呼叫     | 支援（`qwen3_coder` parser）                                                  |

### 量化詳情

| 項目         | 數值                                                                                                            |
| ------------ | --------------------------------------------------------------------------------------------------------------- |
| 方法         | [llm-compressor](https://github.com/vllm-project/llm-compressor) v0.10.0.1                                      |
| 方案         | NVFP4（E2M1 + FP8 逐群縮放，群組大小 16）                                                                       |
| 格式         | compressed-tensors v0.14.0.1                                                                                    |
| 校準資料集   | [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (`train_sft` 分割) |
| 校準樣本數   | 512                                                                                                             |
| 校準序列長度 | 2048                                                                                                            |
| MoE 專家校準 | `moe_calibrate_all_experts=True`（所有專家都接收校準資料）                                                      |
| 量化硬體     | NVIDIA DGX Spark（GB10, 128GB 統一記憶體）                                                                      |
| 環境         | `transformers==4.57.1` + `llm-compressor==0.10.0.1`                                                             |

### 保留 BF16 的層

以下層**未被量化**以保持模型品質：

| 層                   | 原因                                   |
| -------------------- | -------------------------------------- |
| `lm_head`            | 輸出頭，對量化雜訊敏感                 |
| `re:.*mlp.gate$`     | **MoE 路由閘**——對專家選擇精度至關重要 |
| `re:.*embed_tokens$` | 輸入嵌入                               |

### vLLM 部署

```bash
vllm serve /path/to/model \
    --quantization compressed-tensors \
    --served-model-name qwen3-30b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code
```

### DGX Spark (SM121) 相容性說明

- NVFP4 在 SM121 上會退回 W4A16（原生 W4A4 路徑尚未支援，缺少 `cvt.e2m1x2` 指令）
- Qwen3（非 3.5）沒有 Mamba 層，FP8 KV cache 可以安全使用
- Qwen3 沒有 GDN，`linear_attn` 不需要排除
- UMA 架構啟動前請先清除 page cache：`sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`

### 安全警告

此模型已移除安全過濾機制（abliterated），可能產生不當內容。使用者須自行承擔所有風險與法律責任。

### 致謝

- **原始模型**：[Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)，Alibaba Qwen 團隊
- **Thinking 2507 與去審查**：[huihui-ai](https://huggingface.co/huihui-ai)
- **NVFP4 量化**：[YuYu1015](https://huggingface.co/YuYu1015)，於 NVIDIA DGX Spark (GB10) 上完成
- **量化工具**：[llm-compressor](https://github.com/vllm-project/llm-compressor)，vLLM Project
- **參考**：[RedHatAI/Qwen3-30B-A3B-NVFP4](https://huggingface.co/RedHatAI/Qwen3-30B-A3B-NVFP4)