Qwen3.5-122B-A10B-abliterix-FP8
FP8 W8A8 quantization of wangzhang/Qwen3.5-122B-A10B-abliterix — the 122B/A10B abliterated (uncensored) Qwen3.5 MoE — packaged for vLLM serving on NVIDIA Blackwell hardware (DGX Spark GB10 / SM121).
Summary
| Property | Value |
|---|---|
| Base model | wangzhang/Qwen3.5-122B-A10B-abliterix (BF16) |
| Quantization | FP8 W8A8 (float8_e4m3fn) — per-channel symmetric weight, per-token dynamic activation |
| Format | compressed-tensors / float-quantized (vLLM-native) |
| Activation scale | dynamic, per-token (no calibration set required, computed online) |
| Weight scale | static, per-output-channel, BF16 |
| Skipped modules | lm_head, *.mlp.gate, *.mlp.shared_expert_gate, all *norm*, all 1D tensors (Mamba/GDN A_log, dt_bias, conv1d, in_proj_*) — kept BF16 |
| Shards | 6 × ~19 GB safetensors |
| Total size on disk | 116 GB |
| Tested vLLM image | ghcr.io/bjk110/vllm-spark:v022-d568 |
| Runtime stack | NGC pytorch:26.04-py3 base • PyTorch 2.12.0a0 • CUDA 13.0 • vLLM v0.21.0 + PR #35568 cherry-pick • FlashInfer v0.6.11.post3 • NCCL 2.30.4 • Triton 3.7.0 • Transformers 5.8.1 |
| Topology | 2× DGX Spark GB10, TP=2 over 200 Gbps RoCE |
Why this quantization
This FP8 build keeps the Abliterix-trained uncensored behavior of the base model (0.5% refusal rate, KL divergence 0.0115 vs the Qwen3.5-122B-A10B baseline) while cutting weight memory from BF16 (230 GB) to FP8 (116 GB). That is small enough to fit on two DGX Spark nodes (2 × 119 GiB unified memory).
A 40% smaller companion repo with NVFP4 W4A4 weights is at bjk110/Qwen3.5-122B-A10B-abliterix-NVFP4 (~70 GB).
Quantization method
Direct safetensors-level conversion, not via llmcompressor. Reason: llmcompressor 0.10 pins transformers <=4.57.6, but Qwen3.5MoeForCausalLM is only available in transformers >=5.5, and the model repo does not ship modeling .py files (no trust_remote_code shortcut). The direct script needs only torch + safetensors.
For each 2D Linear weight W (shape [out, in]) that is not in the skip list:
```python
absmax = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)  # per-row
scale  = absmax / 448.0                                      # FP8_E4M3FN max
W_q    = (W / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
# stored as: {key}.weight (fp8_e4m3fn), {key}.weight_scale (bf16)
```
Activations are quantized at inference time by vLLM (per-token dynamic).
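To make the scaling scheme concrete, here is a minimal pure-Python sketch of per-row symmetric FP8 quantization. The same arithmetic applies to a weight matrix (one scale per output channel) and, conceptually, to activations (one scale per token). It is an illustration only: `round_to_e4m3fn`, `quantize_rows` and `dequantize_rows` are hypothetical helper names, the rounding is an emulation of float8_e4m3fn (round-to-nearest-even, saturating at ±448), and it is neither the conversion script nor vLLM's kernel.

```python
import math

FP8_E4M3FN_MAX = 448.0   # largest finite float8_e4m3fn value
MIN_NORMAL_EXP = -6      # smallest normal exponent (exponent bias 7)
MANTISSA_BITS = 3


def round_to_e4m3fn(x: float) -> float:
    """Emulate a saturating cast to float8_e4m3fn (round to nearest, ties to even)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3FN_MAX)
    # frexp returns mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    exp = max(math.frexp(mag)[1] - 1, MIN_NORMAL_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)  # spacing of representable values near mag
    return sign * min(round(mag / step) * step, FP8_E4M3FN_MAX)


def quantize_rows(rows):
    """Per-row symmetric quantization: returns (quantized rows, one scale per row)."""
    q_rows, scales = [], []
    for row in rows:
        absmax = max(max(abs(v) for v in row), 1e-12)
        scale = absmax / FP8_E4M3FN_MAX
        q_rows.append([round_to_e4m3fn(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows, scales):
    """Undo the scaling; the rounding error from the FP8 cast remains."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


rows = [[0.5, -0.25, 0.125, 0.0], [100.0, -3.0, 7.5, 42.0]]
q, s = quantize_rows(rows)
restored = dequantize_rows(q, s)
for original, approx in zip(rows, restored):
    print(original, "->", [round(v, 4) for v in approx])
```

Because each row gets its own scale, the largest element of every row maps to ±448 and uses the full FP8 range. With 3 mantissa bits, the relative rounding error for values in the normal range stays below about 6.25%.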
config.json quantization block
```json
{
  "quant_method": "compressed-tensors",
  "format": "float-quantized",
  "config_groups": {
    "group_0": {
      "targets": ["Linear"],
      "weights": {
        "num_bits": 8, "type": "float", "strategy": "channel",
        "symmetric": true, "dynamic": false, "observer": "minmax"
      },
      "input_activations": {
        "num_bits": 8, "type": "float", "strategy": "token",
        "symmetric": true, "dynamic": true
      },
      "output_activations": null
    }
  },
  "ignore": [
    "lm_head",
    "re:.*\\.mlp\\.gate$",
    "re:.*\\.mlp\\.shared_expert_gate$"
  ]
}
```
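The `ignore` patterns are easy to misread, so the following sketch shows which module names they select. It assumes that entries prefixed with `re:` are regular expressions matched against the full module name and that other entries are exact names. The module names in the example are hypothetical, and `is_ignored` is an illustrative helper, not a compressed-tensors API.

```python
import json
import re

# The "ignore" list exactly as it appears in config.json (JSON-escaped backslashes).
IGNORE_JSON = r'["lm_head", "re:.*\\.mlp\\.gate$", "re:.*\\.mlp\\.shared_expert_gate$"]'
IGNORE = json.loads(IGNORE_JSON)  # "\\." in JSON becomes the regex "\."


def is_ignored(module_name: str, ignore=IGNORE) -> bool:
    """Return True if module_name would be left unquantized under the assumed rules."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
for name in [
    "lm_head",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.shared_expert_gate",
    "model.layers.0.mlp.experts.1.gate_proj",
]:
    print(f"{name}: {'kept BF16' if is_ignored(name) else 'quantized'}")
```

The trailing `$` is what keeps the router gate (`mlp.gate`) out of FP8 without also excluding projection layers whose names merely start with `gate`, such as `gate_proj`.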
Serving with vLLM
```bash
# (Tested with the v022-d568 image; see DGX Spark notes below.)
vllm serve bjk110/Qwen3.5-122B-A10B-abliterix-FP8 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.88 \
  --enable-chunked-prefill \
  --reasoning-parser qwen3
```
The 116 GB of weights do not fit on a single NVIDIA H100 (80 GB); --tensor-parallel-size 2 or higher is required.
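The sizing can be sanity-checked with a few lines of arithmetic. The sketch below rests on several assumptions: the 116 GB figure is decimal gigabytes, tensor parallelism splits the weights evenly, the H100 offers 80 × 10⁹ bytes, and the same 0.88 utilization applies on both platforms. It only compares the per-device weight share against the memory budget, and whatever remains has to cover KV cache, activations and runtime overhead.

```python
GIB = 2**30  # bytes per GiB


def weight_share_gib(total_weight_gb: float, tp: int) -> float:
    """Per-device weight size in GiB, assuming an even tensor-parallel split."""
    return total_weight_gb * 1e9 / tp / GIB


def headroom_gib(total_weight_gb: float, tp: int,
                 device_mem_gib: float, utilization: float) -> float:
    """Memory left after weights within the utilization budget (negative = no fit)."""
    return device_mem_gib * utilization - weight_share_gib(total_weight_gb, tp)


WEIGHTS_GB = 116.0           # total size on disk, from the summary table
H100_GIB = 80e9 / GIB        # assumption: "80 GB" means 80e9 bytes

spark_tp2 = headroom_gib(WEIGHTS_GB, tp=2, device_mem_gib=119.0, utilization=0.88)
h100_tp1 = headroom_gib(WEIGHTS_GB, tp=1, device_mem_gib=H100_GIB, utilization=0.88)
h100_tp2 = headroom_gib(WEIGHTS_GB, tp=2, device_mem_gib=H100_GIB, utilization=0.88)

print(f"2x DGX Spark, TP=2: {spark_tp2:+.1f} GiB headroom per node")
print(f"1x H100,      TP=1: {h100_tp1:+.1f} GiB headroom")
print(f"2x H100,      TP=2: {h100_tp2:+.1f} GiB headroom per GPU")
```

On DGX Spark the 119 GiB is unified memory shared with the host, so the real headroom is smaller than this estimate suggests.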
DGX Spark (GB10, SM121) notes
NVIDIA DGX Spark uses SM121, which on stock vLLM v0.21.0 was excluded from the Marlin/CUTLASS FP8 codepaths (the gates were SM120-only). vLLM PR #35568 (commit 06d020bb6) widens those gates to the SM12x family. With that fix applied, the boot log reports
```
Selected CutlassFP8ScaledMMLinearKernel for CompressedTensorsW8A8Fp8
```
confirming the CUTLASS FP8 GEMM path is active.
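A quick way to verify this after startup is to scan the captured boot log for the selection message. The following sketch assumes the message has the form quoted above, possibly preceded by a logger prefix. The prefix in the sample lines and the helper name `selected_kernels` are hypothetical.

```python
import re

# Matches "Selected <KernelName> for <SchemeName>" anywhere in a log line.
KERNEL_LINE = re.compile(r"Selected (\w+) for (\w+)")


def selected_kernels(log_lines):
    """Return (kernel, scheme) pairs for every kernel-selection line found."""
    pairs = []
    for line in log_lines:
        match = KERNEL_LINE.search(line)
        if match:
            pairs.append(match.groups())
    return pairs


# Sample lines; the "INFO" prefix is made up for illustration.
sample_log = [
    "INFO starting engine",
    "INFO Selected CutlassFP8ScaledMMLinearKernel for CompressedTensorsW8A8Fp8",
]

kernels = selected_kernels(sample_log)
uses_cutlass = any(kernel.startswith("Cutlass") for kernel, _ in kernels)
print(kernels, "CUTLASS FP8 path active:", uses_cutlass)
```

If no line matches, or a different kernel name is reported, the image most likely lacks the SM12x gate fix.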
Runtime stack (image v022-d568)
The image is the cumulative top of the v022 forward-stack build chain, rooted in NGC nvcr.io/nvidia/pytorch:26.04-py3 (CUDA 13.0, PyTorch 2.12.0a0). Each layer corresponds to one published image tag:
| Stack layer | Component / version | Image tag |
|---|---|---|
| Base | NGC pytorch:26.04-py3 (CUDA 13.0, PyTorch 2.12.0a0) | v022-ngc2604 |
| Inference | vLLM v0.21.0 | v022-vllm021 |
| FP4/FP8 attention & MoE kernels | FlashInfer v0.6.11.post3 | v022-fi0611 |
| Triton | 3.7.0 | v022-trt37 |
| Transformers | 5.8.1 | v022-tx581 |
| Collective comm | NCCL 2.30.4 | v022-nccl234 |
| SM121 enablement | vLLM PR #35568 cherry-pick (SM120 → SM12x gates) | v022-d568 ← this |
Building on NGC 26.04 (vs. the older 26.03 base used by v021) gives the SM121 GPU the matching CUDA 13.0 driver/runtime split that the Blackwell FP8/NVFP4 kernels expect, and is required for FlashInfer v0.6.11.post3 (which assumes CUDA 13 headers).
Lineage
| Stage | Repo / Tag |
|---|---|
| BF16 baseline | Qwen/Qwen3.5-122B-A10B |
| Abliterix abliteration (BF16) | wangzhang/Qwen3.5-122B-A10B-abliterix |
| FP8 W8A8 (this repo) | bjk110/Qwen3.5-122B-A10B-abliterix-FP8 |
| NVFP4 W4A4 sibling | bjk110/Qwen3.5-122B-A10B-abliterix-NVFP4 |
Citation
```bibtex
@software{abliterix,
  author = {Wu, Wangzhang},
  title = {Abliterix: Automated LLM Abliteration},
  year = {2026},
  url = {https://github.com/wuwangzhang1216/abliterix}
}
```
Acknowledgements
- Wu Wangzhang for the original Abliterix framework and the BF16 abliterated checkpoint.
- Qwen team for the Qwen3.5-122B-A10B base model.
- vLLM compressed-tensors integration team for runtime FP8 W8A8 dispatch.
- DGX Spark SM121 enablement: vLLM PR #35568 by Blake Ledden (Second Nature Computing) + contributors.