Qwen3.5-122B-A10B-abliterix-FP8

FP8 W8A8 quantization of wangzhang/Qwen3.5-122B-A10B-abliterix — the 122B/A10B abliterated (uncensored) Qwen3.5 MoE — packaged for vLLM serving on NVIDIA Blackwell hardware (DGX Spark GB10 / SM121).

Summary

Base model wangzhang/Qwen3.5-122B-A10B-abliterix (BF16)
Quantization FP8 W8A8 (float8_e4m3fn) — per-channel symmetric weight, per-token dynamic activation
Format compressed-tensors / float-quantized (vLLM-native)
Activation scale dynamic, per-token (no calibration set required, computed online)
Weight scale static, per-output-channel, BF16
Skipped modules lm_head, *.mlp.gate, *.mlp.shared_expert_gate, all *norm*, all 1D tensors (Mamba/GDN A_log, dt_bias, conv1d, in_proj_*) — kept BF16
Shards 6 × ~19 GB safetensors
Total size on disk 116 GB
Tested vLLM image ghcr.io/bjk110/vllm-spark:v022-d568
Runtime stack NGC pytorch:26.04-py3 base • PyTorch 2.12.0a0 • CUDA 13.0 • vLLM v0.21.0 + PR #35568 cherry-pick • FlashInfer v0.6.11.post3 • NCCL 2.30.4 • Triton 3.7.0 • TensorRT 5.8.1
Topology 2× DGX Spark GB10, TP=2 over 200 Gbps RoCE

Why this quantization

The base model retains the Abliterix-trained uncensored behavior (0.5% refusal rate, KL divergence 0.0115 vs the Qwen3.5-122B-A10B baseline) while dropping weight memory from BF16 (230 GB) to FP8 (116 GB), enough to fit on two DGX Spark nodes (2 × 119 GiB unified memory).

A 40% smaller companion repo with NVFP4 W4A4 weights is at bjk110/Qwen3.5-122B-A10B-abliterix-NVFP4 (~70 GB).

Quantization method

Direct safetensors-level conversion, not via llmcompressor. Reason: llmcompressor 0.10 pins transformers <=4.57.6, but Qwen3.5MoeForCausalLM is only available in transformers >=5.5, and the model repo does not ship modeling .py files (no trust_remote_code shortcut). The direct script needs only torch + safetensors.

For each 2D Linear weight W (shape [out, in]) that is not in the skip list:

absmax  = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)   # per-row
scale   = absmax / 448.0                                       # FP8_E4M3FN max
W_q     = (W / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
# stored as: {key}.weight (fp8_e4m3fn), {key}.weight_scale (bf16)

Activations are quantized at inference time by vLLM (per-token dynamic).

config.json quantization block

{
  "quant_method": "compressed-tensors",
  "format": "float-quantized",
  "config_groups": {
    "group_0": {
      "targets": ["Linear"],
      "weights": {
        "num_bits": 8, "type": "float", "strategy": "channel",
        "symmetric": true, "dynamic": false, "observer": "minmax"
      },
      "input_activations": {
        "num_bits": 8, "type": "float", "strategy": "token",
        "symmetric": true, "dynamic": true
      },
      "output_activations": null
    }
  },
  "ignore": [
    "lm_head",
    "re:.*\\.mlp\\.gate$",
    "re:.*\\.mlp\\.shared_expert_gate$"
  ]
}

Serving with vLLM

# (Tested with the v022-d568 image — see DGX Spark notes below.)
vllm serve bjk110/Qwen3.5-122B-A10B-abliterix-FP8 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray \
    --quantization compressed-tensors \
    --kv-cache-dtype fp8 \
    --max-model-len 32768 \
    --max-num-seqs 4 \
    --gpu-memory-utilization 0.88 \
    --enable-chunked-prefill \
    --reasoning-parser qwen3

On a single NVIDIA H100 (80 GB) the model does not fit — --tensor-parallel-size 2 or higher is required.

DGX Spark (GB10, SM121) notes

NVIDIA DGX Spark uses SM121, which on stock vLLM v0.21.0 was excluded from the Marlin/CUTLASS FP8 codepaths (the gates were SM120-only). vLLM PR #35568 (commit 06d020bb6) widens those gates to the SM12x family. With that fix applied, the boot log reports

Selected CutlassFP8ScaledMMLinearKernel for CompressedTensorsW8A8Fp8

confirming the CUTLASS FP8 GEMM path is active.

Runtime stack (image v022-d568)

The image is the cumulative top of the v022 forward-stack build chain, rooted in NGC nvcr.io/nvidia/pytorch:26.04-py3 (CUDA 13.0, PyTorch 2.12.0a0). Each layer corresponds to one published image tag:

Stack layer Component / version Image tag
Base NGC pytorch:26.04-py3 (CUDA 13.0, PyTorch 2.12.0a0) v022-ngc2604
Inference vLLM v0.21.0 v022-vllm021
FP4/FP8 attention & MoE kernels FlashInfer v0.6.11.post3 v022-fi0611
Triton 3.7.0 v022-trt37
TensorRT runtime 5.8.1 v022-tx581
Collective comm NCCL 2.30.4 v022-nccl234
SM121 enablement vLLM PR #35568 cherry-pick (SM120 → SM12x gates) v022-d568 ← this

Building on NGC 26.04 (vs. the older 26.03 base used by v021) gives the SM121 GPU the matching CUDA 13.0 driver/runtime split that the Blackwell FP8/NVFP4 kernels expect, and is required for FlashInfer v0.6.11.post3 (which assumes CUDA 13 headers).

Lineage

Stage Repo / Tag
BF16 baseline Qwen/Qwen3.5-122B-A10B
Abliterix abliteration (BF16) wangzhang/Qwen3.5-122B-A10B-abliterix
FP8 W8A8 (this repo) bjk110/Qwen3.5-122B-A10B-abliterix-FP8
NVFP4 W4A4 sibling bjk110/Qwen3.5-122B-A10B-abliterix-NVFP4

Citation

@software{abliterix,
  author = {Wu, Wangzhang},
  title  = {Abliterix: Automated LLM Abliteration},
  year   = {2026},
  url    = {https://github.com/wuwangzhang1216/abliterix}
}

Acknowledgements

  • Wu Wangzhang for the original Abliterix framework and the BF16 abliterated checkpoint.
  • Qwen team for the Qwen3.5-122B-A10B base model.
  • vLLM compressed-tensors integration team — runtime FP8 W8A8 dispatch.
  • DGX Spark SM121 enablement: vLLM PR #35568 by Blake Ledden (Second Nature Computing) + contributors.
Downloads last month
4
Safetensors
Model size
122B params
Tensor type
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for bjk110/Qwen3.5-122B-A10B-abliterix-FP8

Quantized
(17)
this model