Instructions to use YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4")
model = AutoModelForCausalLM.from_pretrained("YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and tokenize in one step.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# A thinking model emits its reasoning first; raise max_new_tokens for a complete answer.
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
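
Because this is a thinking model, the decoded text will usually begin with the model's reasoning and only then give the answer. The following is a minimal sketch for separating the two. It assumes the reasoning ends with a `</think>` tag and that an `<|im_end|>` token may trail the answer; both are assumptions about the chat template, and `split_thinking` is a hypothetical helper, not part of Transformers.

```python
def split_thinking(text: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded model output into (reasoning, answer).

    Assumes the reasoning block is closed by `end_tag`; the opening
    `<think>` tag is optional because some chat templates place it in
    the prompt rather than in the generated text.
    """
    head, sep, tail = text.rpartition(end_tag)
    if not sep:
        # No closing tag found: generation was probably cut off mid-reasoning.
        return text.strip(), ""
    reasoning = head.replace("<think>", "", 1).strip()
    # Drop the end-of-turn marker (assumed to be <|im_end|>) if it was decoded.
    answer = tail.replace("<|im_end|>", "").strip()
    return reasoning, answer
```

Passing the string produced by `tokenizer.decode(...)` above yields the pair `(reasoning, answer)`. With a small `max_new_tokens` the second element is likely to be empty, since the model may not have finished reasoning.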

Local Apps

vLLM

How to use YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
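
The curl call above can also be issued from Python. The sketch below uses only the standard library and mirrors the same JSON body. `build_chat_request` and `send` are hypothetical helper names, and the default base URL assumes the server started by `vllm serve` is listening on localhost:8000, as in the curl example.

```python
import json
import urllib.request

MODEL_ID = "YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4"


def build_chat_request(prompt, base_url="http://localhost:8000/v1", model=MODEL_ID):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120.0):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, `send(build_chat_request("What is the capital of France?"))["choices"][0]["message"]["content"]` returns the reply text in the OpenAI-compatible response layout. Passing `base_url="http://localhost:30000/v1"` targets the SGLang server described below instead.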

Use Docker

docker model run hf.co/YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4

SGLang

How to use YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4 with Docker Model Runner:
```
docker model run hf.co/YuYu1015/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4
```

YuYu1015 committed on Apr 8

Commit a2bfdb9 (verified) · 1 parent: 390419d

Create README.md

Files changed (1)

README.md +241 -0

README.md ADDED

	@@ -0,0 +1,241 @@

+---
+license: other
+license_name: qwen
+license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B/blob/main/LICENSE
+base_model: huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated
+base_model_relation: quantized
+language:
+  - en
+  - zh
+tags:
+  - qwen3
+  - moe
+  - nvfp4
+  - abliterated
+  - quantized
+  - vllm
+  - dgx-spark
+  - blackwell
+  - gb10
+  - sm121
+library_name: transformers
+pipeline_tag: text-generation
+---
+# Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4
+NVFP4 quantized version of [huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated), optimized for NVIDIA DGX Spark (GB10 / SM121).
+> [繁體中文版本](#繁體中文)
+---
+## Model Details
+| | |
+|---|---|
+| **Base Model** | [huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated) |
+| **Original Model** | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
+| **Architecture** | Mixture-of-Experts (MoE), 30B total / 3B active parameters per token |
+| **Thinking Mode** | Built-in Chain-of-Thought reasoning (CoT), enabled by default |
+| **Abliteration** | Refusal removal via [huihui-ai](https://huggingface.co/huihui-ai) |
+| **Quantization** | NVFP4 (W4A4, E2M1 + FP8 per-group scaling, group size 16) |
+| **Original Size** | ~60 GB (BF16) |
+| **Quantized Size** | **~17 GB (NVFP4)** |
+| **Context Length** | Up to 131,072 tokens |
+## Quantization Details
+Quantized using [llm-compressor](https://github.com/vllm-project/llm-compressor) with the following configuration:
+| Parameter | Value |
+|---|---|
+| **Scheme** | NVFP4 — 4-bit floating point (E2M1) with FP8 (E4M3) per-group scaling, group size 16 |
+| **Calibration Dataset** | [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (`train_sft` split) |
+| **Calibration Samples** | 512 |
+| **Max Sequence Length** | 2048 |
+| **Ignored Layers** | `lm_head` (kept in BF16 for output quality) |
+| **Tool** | llm-compressor 0.10.0.1 + compressed-tensors 0.14.0.1 |
+| **Hardware** | NVIDIA DGX Spark (GB10, 128GB unified memory) |
+## Performance on DGX Spark
+Benchmarked on a single NVIDIA DGX Spark (GB10 / SM121):
+| Metric | Value |
+|---|---|
+| **Generation Throughput** | **~60 tok/s** (single user) |
+| **NVFP4 Backend** | FLASHINFER_CUTLASS (native) |
+| **KV Cache** | FP8 (E4M3) |
+| **Memory Usage** | ~21 GB (model) + KV cache |
+| **Driver** | 590.48+ (CUDA 13.1+) |
+> **Why native CUTLASS?** Qwen3 (non-3.5) does not have Mamba layers, enabling native FLASHINFER_CUTLASS on SM121 without Marlin fallback. Qwen3.5 models with Mamba are limited to ~44 tok/s via Marlin.
+## Usage with vLLM
+```bash
+docker run --gpus all --ipc host -p 8000:8000 \
+  -v /path/to/models:/models \
+  nvcr.io/nvidia/vllm:26.03-py3 \
+  python -m vllm.entrypoints.openai.api_server \
+    --model /models/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4 \
+    --served-model-name qwen3-30b \
+    --trust-remote-code \
+    --max-model-len 32768 \
+    --gpu-memory-utilization 0.95 \
+    --kv-cache-dtype fp8 \
+    --max-num-seqs 4 \
+    --enable-prefix-caching \
+    --stream-interval 1 \
+    --reasoning-parser qwen3 \
+    --enable-auto-tool-choice \
+    --tool-call-parser qwen3_coder
+```
+### DGX Spark (UMA) Note
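+DGX Spark uses a unified memory architecture (UMA), so the Linux page cache and GPU allocations draw from the same physical memory. Drop the page cache before starting the server so that cached files do not reduce the memory vLLM can claim: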
+```bash
+sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
+```
+### Thinking Mode
+Thinking is enabled by default. Use `--reasoning-parser qwen3` to separate thinking into the `delta.reasoning` field in streaming responses.
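+Users can add `/no_think` to a prompt to disable thinking for a single turn, but the upstream Qwen3-Thinking-2507 release is documented as thinking-only, so verify that this switch has any effect before relying on it.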
+### Function Calling
+```python
+from openai import OpenAI
+
+# Point the OpenAI client at the local vLLM server; the model name matches --served-model-name.
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-key")
+response = client.chat.completions.create(
+    model="qwen3-30b",
+    messages=[{"role": "user", "content": "What's the weather in Taipei?"}],
+    tools=[{
+        "type": "function",
+        "function": {
+            "name": "get_weather",
+            "description": "Get weather for a city",
+            "parameters": {
+                "type": "object",
+                "properties": {"city": {"type": "string"}},
+                "required": ["city"]
+            }
+        }
+    }],
+    tool_choice="auto"
+)
+# With tool_choice="auto" the model may answer directly, so tool_calls can be None.
+print(response.choices[0].message.tool_calls)
+```
+## Reproduce Quantization
+**Environment:** `nvcr.io/nvidia/pytorch:26.03-py3` + `llmcompressor==0.10.0.1` + `compressed-tensors==0.14.0.1` + `transformers>=4.56,<4.58`
+```python
+from datasets import load_dataset
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from llmcompressor import oneshot
+from llmcompressor.modifiers.quantization import QuantizationModifier
+
+MODEL_ID = "huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated"
+OUTPUT_DIR = "Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4"
+
+# Load the BF16 base model and its tokenizer.
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_ID, dtype="auto", device_map="auto", trust_remote_code=True)
+
+# Calibration data: 512 UltraChat samples rendered with the chat template.
+ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
+ds = ds.shuffle(seed=42)
+ds = ds.map(lambda x: {
+    "text": tokenizer.apply_chat_template(x["messages"], tokenize=False)})
+# Tokenize to at most 2048 tokens and drop the original columns.
+ds = ds.map(lambda x: tokenizer(
+    x["text"], padding=False, max_length=2048,
+    truncation=True, add_special_tokens=False),
+    remove_columns=ds.column_names)
+
+# Quantize every Linear layer to NVFP4, leaving lm_head unquantized.
+recipe = QuantizationModifier(
+    targets="Linear", scheme="NVFP4", ignore=["lm_head"])
+oneshot(model=model, dataset=ds, recipe=recipe,
+        max_seq_length=2048, num_calibration_samples=512)
+
+# Save the weights in compressed form alongside the tokenizer.
+model.save_pretrained(OUTPUT_DIR, save_compressed=True)
+tokenizer.save_pretrained(OUTPUT_DIR)
+```
+## Safety Warning
+This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use. Please ensure usage complies with local laws and ethical standards.
+## Credits
+- **Original Model**: [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) by Alibaba Qwen Team
+- **Abliteration**: [huihui-ai](https://huggingface.co/huihui-ai)
+- **NVFP4 Quantization**: [YuYu1015](https://huggingface.co/YuYu1015) on NVIDIA DGX Spark (GB10)
+- **Quantization Tool**: [llm-compressor](https://github.com/vllm-project/llm-compressor) by vLLM Project
+- **Reference**: [RedHatAI/Qwen3-30B-A3B-NVFP4](https://huggingface.co/RedHatAI/Qwen3-30B-A3B-NVFP4)
+---
+# 繁體中文
+## 模型資訊
+[huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated) 的 NVFP4 量化版本，針對 NVIDIA DGX Spark (GB10 / SM121) 優化。
+| | |
+|---|---|
+| **基礎模型** | [huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated) |
+| **原始模型** | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
+| **架構** | 混合專家模型 (MoE)，總參數 30B / 每 token 啟用 3B |
+| **思考模式** | 內建思維鏈推理 (CoT)，預設啟用 |
+| **去審查** | 由 [huihui-ai](https://huggingface.co/huihui-ai) 移除拒絕機制 |
+| **量化方式** | NVFP4 (W4A4, E2M1 + FP8 逐群縮放, 群組大小 16) |
+| **原始大小** | ~60 GB (BF16) |
+| **量化後大小** | **~17 GB (NVFP4)** |
+| **上下文長度** | 最大 131,072 tokens |
+## 量化細節
+使用 [llm-compressor](https://github.com/vllm-project/llm-compressor) 進行 NVFP4 量化：
+| 參數 | 值 |
+|---|---|
+| **量化方案** | NVFP4 — 4 位元浮點 (E2M1) + FP8 (E4M3) 逐群縮放，群組大小 16 |
+| **校準資料集** | [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (`train_sft` 分割) |
+| **校準樣本數** | 512 |
+| **最大序列長度** | 2048 |
+| **保留層** | `lm_head`（維持 BF16 以確保輸出品質） |
+| **量化工具** | llm-compressor 0.10.0.1 + compressed-tensors 0.14.0.1 |
+| **量化硬體** | NVIDIA DGX Spark (GB10, 128GB 統一記憶體) |
+## DGX Spark 效能
+在單台 NVIDIA DGX Spark (GB10 / SM121) 上的實測結果：
+| 指標 | 數值 |
+|---|---|
+| **生成吞吐量** | **~60 tok/s**（單用戶） |
+| **NVFP4 後端** | FLASHINFER_CUTLASS（原生路徑） |
+| **KV Cache** | FP8 (E4M3) |
+| **記憶體用量** | ~21 GB（模型）+ KV cache |
+| **驅動程式** | 590.48+（CUDA 13.1+） |
+> **為什麼能用原生 CUTLASS？** Qwen3（非 3.5）沒有 Mamba 層，因此 SM121 上可以直接使用 FLASHINFER_CUTLASS 原生路徑。Qwen3.5 有 Mamba 層，只能退回 Marlin fallback（~44 tok/s）。
+## 思考模式
+此模型預設啟用 Thinking 模式，回覆會包含 `<think>...</think>` 思考過程。
+使用 `--reasoning-parser qwen3` 時，vLLM 會自動將思考內容分離到串流的 `delta.reasoning` 欄位。
+用戶可在 prompt 中加入 `/no_think` 關閉單次思考。
+## 安全警告
+此模型已移除安全過濾機制（abliterated），可能產生敏感、爭議性或不當內容。使用者須自行承擔所有風險與法律責任，並確保使用方式符合當地法規與倫理標準。