YuYu1015 committed
Commit a2bfdb9 · verified · 1 Parent(s): 390419d

Create README.md

Files changed (1)
  1. README.md +241 -0
README.md ADDED
@@ -0,0 +1,241 @@
+ ---
+ license: other
+ license_name: qwen
+ license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B/blob/main/LICENSE
+ base_model: huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated
+ base_model_relation: quantized
+ language:
+ - en
+ - zh
+ tags:
+ - qwen3
+ - moe
+ - nvfp4
+ - abliterated
+ - quantized
+ - vllm
+ - dgx-spark
+ - blackwell
+ - gb10
+ - sm121
+ library_name: transformers
+ pipeline_tag: text-generation
+ ---
+
+ # Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4
+
+ This is an NVFP4-quantized version of [huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated), optimized for NVIDIA DGX Spark (GB10 / SM121).
+
+ > [繁體中文版本](#繁體中文)
+
+ ---
+
+ ## Model Details
+
+ | Property | Value |
+ |---|---|
+ | **Base Model** | [huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated) |
+ | **Original Model** | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
+ | **Architecture** | Mixture-of-Experts (MoE), 30B total / 3B active parameters per token |
+ | **Thinking Mode** | Built-in Chain-of-Thought reasoning (CoT), enabled by default |
+ | **Abliteration** | Refusal removal via [huihui-ai](https://huggingface.co/huihui-ai) |
+ | **Quantization** | NVFP4 (W4A4, E2M1 + FP8 per-group scaling, group size 16) |
+ | **Original Size** | ~60 GB (BF16) |
+ | **Quantized Size** | **~17 GB (NVFP4)** |
+ | **Context Length** | Up to 131,072 tokens |
+
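The two size figures in the table can be roughly reproduced from the bit widths alone. The sketch below assumes exactly 30e9 parameters (the table only says "30B") and counts one 8-bit scale per group of 16 four-bit values. It ignores the BF16 `lm_head`, embeddings, and norms, so it is a ballpark estimate, not an exact checkpoint size.

```python
def bits_per_weight(value_bits: int, scale_bits: int, group_size: int) -> float:
    """Average storage per weight: the value plus its share of the group scale."""
    return value_bits + scale_bits / group_size


def size_gb(num_params: float, bits: float) -> float:
    """Storage in decimal gigabytes for num_params weights at the given bit width."""
    return num_params * bits / 8 / 1e9


NUM_PARAMS = 30e9  # assumption: "30B total" taken literally

nvfp4_bits = bits_per_weight(value_bits=4, scale_bits=8, group_size=16)
bf16_gb = size_gb(NUM_PARAMS, 16)
nvfp4_gb = size_gb(NUM_PARAMS, nvfp4_bits)

print(f"NVFP4 bits per weight: {nvfp4_bits}")
print(f"BF16 estimate:  {bf16_gb:.1f} GB")
print(f"NVFP4 estimate: {nvfp4_gb:.1f} GB")
```

The estimates land close to the ~60 GB and ~17 GB listed above, which suggests the per-group scales account for most of the overhead beyond a plain 4 bits per weight.
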
+ ## Quantization Details
+
+ Quantized using [llm-compressor](https://github.com/vllm-project/llm-compressor) with the following configuration:
+
+ | Parameter | Value |
+ |---|---|
+ | **Scheme** | NVFP4: 4-bit floating point (E2M1) with FP8 (E4M3) per-group scaling, group size 16 |
+ | **Calibration Dataset** | [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (`train_sft` split) |
+ | **Calibration Samples** | 512 |
+ | **Max Sequence Length** | 2048 |
+ | **Ignored Layers** | `lm_head` (kept in BF16 for output quality) |
+ | **Tool** | llm-compressor 0.10.0.1 + compressed-tensors 0.14.0.1 |
+ | **Hardware** | NVIDIA DGX Spark (GB10, 128 GB unified memory) |
+
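To make the scheme row concrete, here is a minimal pure-Python sketch of per-group E2M1 rounding with group size 16. It is an illustration under simplifying assumptions, not llm-compressor's implementation: the group scale is kept as an ordinary Python float instead of being rounded to FP8 (E4M3), and no calibration is involved. The function names are made up for this example.

```python
# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16


def quantize_group(values):
    """Fake-quantize one group: scale so the largest magnitude maps to 6.0,
    snap each value to the nearest E2M1 magnitude, then scale back."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values), 1.0
    scale = max_abs / E2M1_GRID[-1]
    out = []
    for v in values:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((nearest if v >= 0 else -nearest) * scale)
    return out, scale


def quantize(weights):
    """Apply quantize_group to consecutive groups of GROUP_SIZE weights."""
    result = []
    for start in range(0, len(weights), GROUP_SIZE):
        dequantized, _ = quantize_group(weights[start:start + GROUP_SIZE])
        result.extend(dequantized)
    return result


weights = [0.01 * i - 0.08 for i in range(16)]  # toy data: a single group
approx = quantize(weights)
max_err = max(abs(a - b) for a, b in zip(weights, approx))
print(f"max abs error in group: {max_err:.4f}")
```

Because the grid is coarse near its top end (4.0 jumps straight to 6.0), the worst-case rounding error within a group is one scale unit, which is why a small group size with its own scale helps.
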
+ ## Performance on DGX Spark
+
+ Benchmarked on a single NVIDIA DGX Spark (GB10 / SM121):
+
+ | Metric | Value |
+ |---|---|
+ | **Generation Throughput** | **~60 tok/s** (single user) |
+ | **NVFP4 Backend** | FLASHINFER_CUTLASS (native) |
+ | **KV Cache** | FP8 (E4M3) |
+ | **Memory Usage** | ~21 GB (model) + KV cache |
+ | **Driver** | 590.48+ (CUDA 13.1+) |
+
+ > **Why native CUTLASS?** Qwen3 (unlike Qwen3.5) has no Mamba layers, so the native FLASHINFER_CUTLASS backend can be used on SM121 without falling back to Marlin. Qwen3.5 models, which do include Mamba layers, are limited to ~44 tok/s via Marlin.
+
+ ## Usage with vLLM
+
+ ```bash
+ docker run --gpus all --ipc host -p 8000:8000 \
+   -v /path/to/models:/models \
+   nvcr.io/nvidia/vllm:26.03-py3 \
+   python -m vllm.entrypoints.openai.api_server \
+     --model /models/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4 \
+     --served-model-name qwen3-30b \
+     --trust-remote-code \
+     --max-model-len 32768 \
+     --gpu-memory-utilization 0.95 \
+     --kv-cache-dtype fp8 \
+     --max-num-seqs 4 \
+     --enable-prefix-caching \
+     --stream-interval 1 \
+     --reasoning-parser qwen3 \
+     --enable-auto-tool-choice \
+     --tool-call-parser qwen3_coder
+ ```
+
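The `--kv-cache-dtype fp8`, `--max-model-len`, and `--max-num-seqs` flags together bound how much KV cache full-length requests could need. A rough sizing sketch follows. The layer count, KV-head count, and head dimension are assumptions about the architecture that this README does not state, so check them against the checkpoint's `config.json`. The result is what full-length sequences would occupy, not what vLLM actually reserves (that is governed by `--gpu-memory-utilization`).

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed architecture values; verify against config.json before relying on them.
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128

MAX_MODEL_LEN = 32768  # --max-model-len from the command above
MAX_NUM_SEQS = 4       # --max-num-seqs from the command above

per_token_fp8 = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 1)
per_token_bf16 = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 2)
per_seq_gib = per_token_fp8 * MAX_MODEL_LEN / 2**30
all_seqs_gib = per_seq_gib * MAX_NUM_SEQS

print(f"FP8 KV cache per token:  {per_token_fp8} bytes")
print(f"BF16 KV cache per token: {per_token_bf16} bytes")
print(f"One full-length sequence: {per_seq_gib:.2f} GiB")
print(f"{MAX_NUM_SEQS} full-length sequences: {all_seqs_gib:.2f} GiB")
```

Under these assumptions, an FP8 KV cache takes half the space of a BF16 one for the same context length.
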
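+ ### DGX Spark (UMA) Note
+
+ DGX Spark uses a unified memory architecture (UMA): the CPU and GPU share the same physical memory, so memory held by the Linux page cache can reduce what the GPU sees as free. Clear the page cache before starting the server:
+
+ ```bash
+ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
+ ```
+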
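To see whether dropping the cache made a difference, one option is to compare the `MemFree` and `Cached` fields of `/proc/meminfo` before and after. The sketch below parses text in that format. The sample values are made up for illustration and are not measurements from a DGX Spark, and the helper names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def free_and_cached_gib(fields):
    """Return (MemFree, Cached) in GiB; /proc/meminfo reports units of 1024 bytes."""
    return fields.get("MemFree", 0) / 2**20, fields.get("Cached", 0) / 2**20


# Illustrative sample only; on a real system read the text from /proc/meminfo.
SAMPLE = """\
MemTotal:       125829120 kB
MemFree:         20971520 kB
Cached:          52428800 kB
"""

free_gib, cached_gib = free_and_cached_gib(parse_meminfo(SAMPLE))
print(f"free: {free_gib:.1f} GiB, page cache: {cached_gib:.1f} GiB")
```

On the machine itself, replace `SAMPLE` with `open("/proc/meminfo").read()` and run the check once before and once after the `drop_caches` command.
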
+ ### Thinking Mode
+
+ Thinking is enabled by default. Pass `--reasoning-parser qwen3` to vLLM to separate the reasoning into the `delta.reasoning` field of streaming responses.
+
+ Add `/no_think` to a prompt to disable thinking for a single turn.
+
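When the server runs without a reasoning parser, the reasoning arrives inline in the response text, wrapped in `<think>...</think>`. A minimal sketch of separating the two on the client side is shown below. The function name is hypothetical, and the handling of output that contains only the closing tag is an assumption, made because some chat templates place the opening tag in the prompt.

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output containing a think block."""
    open_tag, close_tag = "<think>", "</think>"
    head, sep, tail = text.partition(close_tag)
    if not sep:
        # No closing tag: treat the whole text as the answer.
        return "", text.strip()
    # Drop the opening tag if present; keep everything before </think> otherwise.
    reasoning = head.split(open_tag, 1)[-1]
    return reasoning.strip(), tail.strip()


raw = "<think>\nThe user wants 2 + 2.\n</think>\n\nThe answer is 4."
reasoning, answer = split_thinking(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

With `--reasoning-parser qwen3` enabled, this client-side step should not be needed, since the server does the separation.
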
+ ### Function Calling
+
+ ```python
+ from openai import OpenAI
+
+ # Point the OpenAI client at the local vLLM server started above.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-key")
+
+ response = client.chat.completions.create(
+     model="qwen3-30b",  # matches --served-model-name
+     messages=[{"role": "user", "content": "What's the weather in Taipei?"}],
+     tools=[{
+         "type": "function",
+         "function": {
+             "name": "get_weather",
+             "description": "Get weather for a city",
+             "parameters": {
+                 "type": "object",
+                 "properties": {"city": {"type": "string"}},
+                 "required": ["city"]
+             }
+         }
+     }],
+     tool_choice="auto"
+ )
+ ```
+
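The request above only asks the model to propose a tool call; the client still has to execute it and send the result back. A minimal sketch of that step follows. `get_weather` here is a hypothetical local stub, and `fake_message` stands in for `response.choices[0].message` so the example runs without a server.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> str:
    """Hypothetical local tool: returns a placeholder string, not real weather data."""
    return f"(placeholder) weather for {city}"


TOOLS = {"get_weather": get_weather}


def run_tool_calls(message):
    """Execute each tool call on an assistant message and build the 'tool' replies."""
    replies = []
    for call in message.tool_calls or []:
        func = TOOLS[call.function.name]
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        replies.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": func(**args),
        })
    return replies


# Stand-in for response.choices[0].message with a single proposed call.
fake_message = SimpleNamespace(tool_calls=[SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Taipei"}'),
)])

replies = run_tool_calls(fake_message)
print(replies)
```

In a real exchange, the assistant message and these `tool` messages would be appended to `messages` and sent in a second `chat.completions.create` call so the model can write its final answer.
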
+ ## Reproduce Quantization
+
+ **Environment:** `nvcr.io/nvidia/pytorch:26.03-py3` + `llmcompressor==0.10.0.1` + `compressed-tensors==0.14.0.1` + `transformers>=4.56,<4.58`
+
+ ```python
+ from datasets import load_dataset
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from llmcompressor import oneshot
+ from llmcompressor.modifiers.quantization import QuantizationModifier
+
+ MODEL_ID = "huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated"
+ OUTPUT_DIR = "Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4"
+
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_ID, dtype="auto", device_map="auto", trust_remote_code=True)
+
+ # Calibration data: 512 chat samples rendered with the chat template, then tokenized.
+ ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
+ ds = ds.shuffle(seed=42)
+ ds = ds.map(lambda x: {
+     "text": tokenizer.apply_chat_template(x["messages"], tokenize=False)})
+ ds = ds.map(lambda x: tokenizer(
+     x["text"], padding=False, max_length=2048,
+     truncation=True, add_special_tokens=False),
+     remove_columns=ds.column_names)
+
+ # Quantize every Linear layer to NVFP4 except lm_head, which stays in BF16.
+ recipe = QuantizationModifier(
+     targets="Linear", scheme="NVFP4", ignore=["lm_head"])
+ oneshot(model=model, dataset=ds, recipe=recipe,
+         max_seq_length=2048, num_calibration_samples=512)
+
+ # Save the compressed weights together with the tokenizer.
+ model.save_pretrained(OUTPUT_DIR, save_compressed=True)
+ tokenizer.save_pretrained(OUTPUT_DIR)
+ ```
+
+ ## Safety Warning
+
+ This model has had its safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use. Please ensure that usage complies with local laws and ethical standards.
+
+ ## Credits
+
+ - **Original Model**: [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) by Alibaba Qwen Team
+ - **Abliteration**: [huihui-ai](https://huggingface.co/huihui-ai)
+ - **NVFP4 Quantization**: [YuYu1015](https://huggingface.co/YuYu1015) on NVIDIA DGX Spark (GB10)
+ - **Quantization Tool**: [llm-compressor](https://github.com/vllm-project/llm-compressor) by vLLM Project
+ - **Reference**: [RedHatAI/Qwen3-30B-A3B-NVFP4](https://huggingface.co/RedHatAI/Qwen3-30B-A3B-NVFP4)
+
+ ---
+
+ # 繁體中文
+
+ ## Model Information
+
+ An NVFP4-quantized version of [huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated), optimized for NVIDIA DGX Spark (GB10 / SM121).
+
+ | Property | Value |
+ |---|---|
+ | **Base Model** | [huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated) |
+ | **Original Model** | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
+ | **Architecture** | Mixture-of-Experts (MoE), 30B total parameters / 3B active per token |
+ | **Thinking Mode** | Built-in chain-of-thought (CoT) reasoning, enabled by default |
+ | **Abliteration** | Refusal mechanism removed by [huihui-ai](https://huggingface.co/huihui-ai) |
+ | **Quantization** | NVFP4 (W4A4, E2M1 + FP8 per-group scaling, group size 16) |
+ | **Original Size** | ~60 GB (BF16) |
+ | **Quantized Size** | **~17 GB (NVFP4)** |
+ | **Context Length** | Up to 131,072 tokens |
+
+ ## Quantization Details
+
+ NVFP4 quantization was performed with [llm-compressor](https://github.com/vllm-project/llm-compressor):
+
+ | Parameter | Value |
+ |---|---|
+ | **Scheme** | NVFP4: 4-bit floating point (E2M1) + FP8 (E4M3) per-group scaling, group size 16 |
+ | **Calibration Dataset** | [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (`train_sft` split) |
+ | **Calibration Samples** | 512 |
+ | **Max Sequence Length** | 2048 |
+ | **Ignored Layers** | `lm_head` (kept in BF16 to preserve output quality) |
+ | **Tool** | llm-compressor 0.10.0.1 + compressed-tensors 0.14.0.1 |
+ | **Hardware** | NVIDIA DGX Spark (GB10, 128 GB unified memory) |
+
+ ## Performance on DGX Spark
+
+ Measured on a single NVIDIA DGX Spark (GB10 / SM121):
+
+ | Metric | Value |
+ |---|---|
+ | **Generation Throughput** | **~60 tok/s** (single user) |
+ | **NVFP4 Backend** | FLASHINFER_CUTLASS (native path) |
+ | **KV Cache** | FP8 (E4M3) |
+ | **Memory Usage** | ~21 GB (model) + KV cache |
+ | **Driver** | 590.48+ (CUDA 13.1+) |
+
+ > **Why can native CUTLASS be used?** Qwen3 (not 3.5) has no Mamba layers, so the native FLASHINFER_CUTLASS path can be used directly on SM121. Qwen3.5 has Mamba layers and can only fall back to Marlin (~44 tok/s).
+
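+ ## Thinking Mode
+
+ This model enables Thinking mode by default, and responses include the reasoning process wrapped in `<think>...</think>`.
+
+ With `--reasoning-parser qwen3`, vLLM automatically separates the reasoning into the `delta.reasoning` field of the stream.
+
+ Users can add `/no_think` to the prompt to disable thinking for a single turn.
+
+ ## Safety Warning
+
+ This model has had its safety filtering removed (abliterated) and may produce sensitive, controversial, or inappropriate content. Users bear all risks and legal responsibility themselves and must ensure that their use complies with local laws and ethical standards.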