Qwen3.5-122B-A10B-abliterix-FP8

FP8 W8A8 quantization of wangzhang/Qwen3.5-122B-A10B-abliterix — the 122B/A10B abliterated (uncensored) Qwen3.5 MoE — packaged for vLLM serving on NVIDIA Blackwell hardware (DGX Spark GB10 / SM121).

Summary


Base model	wangzhang/Qwen3.5-122B-A10B-abliterix (BF16)
Quantization	FP8 W8A8 (`float8_e4m3fn`) — per-channel symmetric weight, per-token dynamic activation
Format	`compressed-tensors` / `float-quantized` (vLLM-native)
Activation scale	dynamic, per-token (no calibration set required, computed online)
Weight scale	static, per-output-channel, BF16
Skipped modules	`lm_head`, `.mlp.gate`, `.mlp.shared_expert_gate`, all `norm`, all 1D tensors (Mamba/GDN `A_log`, `dt_bias`, `conv1d`, `in_proj_*`) — kept BF16
Shards	6 × ~19 GB safetensors
Total size on disk	116 GB
Tested vLLM image	`ghcr.io/bjk110/vllm-spark:v022-d568`
Runtime stack	NGC `pytorch:26.04-py3` base • PyTorch 2.12.0a0 • CUDA 13.0 • vLLM v0.21.0 + PR #35568 cherry-pick • FlashInfer v0.6.11.post3 • NCCL 2.30.4 • Triton 3.7.0 • TensorRT 5.8.1
Topology	2× DGX Spark GB10, TP=2 over 200 Gbps RoCE

Why this quantization

The base model retains the Abliterix-trained uncensored behavior (0.5% refusal rate, KL divergence 0.0115 vs the Qwen3.5-122B-A10B baseline) while dropping weight memory from BF16 (~~230 GB) to FP8 (~~116 GB), enough to fit on two DGX Spark nodes (2 × 119 GiB unified memory).

A 40% smaller companion repo with NVFP4 W4A4 weights is at bjk110/Qwen3.5-122B-A10B-abliterix-NVFP4 (~70 GB).

Quantization method

Direct safetensors-level conversion, not via llmcompressor. Reason: llmcompressor 0.10 pins transformers <=4.57.6, but Qwen3.5MoeForCausalLM is only available in transformers >=5.5, and the model repo does not ship modeling .py files (no trust_remote_code shortcut). The direct script needs only torch + safetensors.

For each 2D Linear weight W (shape [out, in]) that is not in the skip list:

absmax  = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)   # per-row
scale   = absmax / 448.0                                       # FP8_E4M3FN max
W_q     = (W / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
# stored as: {key}.weight (fp8_e4m3fn), {key}.weight_scale (bf16)

Activations are quantized at inference time by vLLM (per-token dynamic).

`config.json` quantization block

{
  "quant_method": "compressed-tensors",
  "format": "float-quantized",
  "config_groups": {
    "group_0": {
      "targets": ["Linear"],
      "weights": {
        "num_bits": 8, "type": "float", "strategy": "channel",
        "symmetric": true, "dynamic": false, "observer": "minmax"
      },
      "input_activations": {
        "num_bits": 8, "type": "float", "strategy": "token",
        "symmetric": true, "dynamic": true
      },
      "output_activations": null
    }
  },
  "ignore": [
    "lm_head",
    "re:.*\\.mlp\\.gate$",
    "re:.*\\.mlp\\.shared_expert_gate$"
  ]
}

Serving with vLLM

# (Tested with the v022-d568 image — see DGX Spark notes below.)
vllm serve bjk110/Qwen3.5-122B-A10B-abliterix-FP8 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray \
    --quantization compressed-tensors \
    --kv-cache-dtype fp8 \
    --max-model-len 32768 \
    --max-num-seqs 4 \
    --gpu-memory-utilization 0.88 \
    --enable-chunked-prefill \
    --reasoning-parser qwen3

On a single NVIDIA H100 (80 GB) the model does not fit — --tensor-parallel-size 2 or higher is required.

DGX Spark (GB10, SM121) notes

NVIDIA DGX Spark uses SM121, which on stock vLLM v0.21.0 was excluded from the Marlin/CUTLASS FP8 codepaths (the gates were SM120-only). vLLM PR #35568 (commit 06d020bb6) widens those gates to the SM12x family. With that fix applied, the boot log reports

Selected CutlassFP8ScaledMMLinearKernel for CompressedTensorsW8A8Fp8

confirming the CUTLASS FP8 GEMM path is active.

Runtime stack (image `v022-d568`)

The image is the cumulative top of the v022 forward-stack build chain, rooted in NGC nvcr.io/nvidia/pytorch:26.04-py3 (CUDA 13.0, PyTorch 2.12.0a0). Each layer corresponds to one published image tag:

Stack layer	Component / version	Image tag
Base	NGC `pytorch:26.04-py3` (CUDA 13.0, PyTorch 2.12.0a0)	`v022-ngc2604`
Inference	vLLM v0.21.0	`v022-vllm021`
FP4/FP8 attention & MoE kernels	FlashInfer v0.6.11.post3	`v022-fi0611`
Triton	3.7.0	`v022-trt37`
TensorRT runtime	5.8.1	`v022-tx581`
Collective comm	NCCL 2.30.4	`v022-nccl234`
SM121 enablement	vLLM PR #35568 cherry-pick (SM120 → SM12x gates)	`v022-d568` ← this

Building on NGC 26.04 (vs. the older 26.03 base used by v021) gives the SM121 GPU the matching CUDA 13.0 driver/runtime split that the Blackwell FP8/NVFP4 kernels expect, and is required for FlashInfer v0.6.11.post3 (which assumes CUDA 13 headers).

Lineage

Stage	Repo / Tag
BF16 baseline	Qwen/Qwen3.5-122B-A10B
Abliterix abliteration (BF16)	wangzhang/Qwen3.5-122B-A10B-abliterix
FP8 W8A8 (this repo)	bjk110/Qwen3.5-122B-A10B-abliterix-FP8
NVFP4 W4A4 sibling	bjk110/Qwen3.5-122B-A10B-abliterix-NVFP4

Citation

@software{abliterix,
  author = {Wu, Wangzhang},
  title  = {Abliterix: Automated LLM Abliteration},
  year   = {2026},
  url    = {https://github.com/wuwangzhang1216/abliterix}
}

Acknowledgements

Wu Wangzhang for the original Abliterix framework and the BF16 abliterated checkpoint.
Qwen team for the Qwen3.5-122B-A10B base model.
vLLM compressed-tensors integration team — runtime FP8 W8A8 dispatch.
DGX Spark SM121 enablement: vLLM PR #35568 by Blake Ledden (Second Nature Computing) + contributors.

Downloads last month: 4

Safetensors

Model size

122B params

Tensor type

BF16

F8_E4M3

Model tree for bjk110/Qwen3.5-122B-A10B-abliterix-FP8

Base model

Qwen/Qwen3.5-122B-A10B

Finetuned

wangzhang/Qwen3.5-122B-A10B-abliterix

Quantized

(17)

this model