GLM-5.1-478B-A42B-REAP-NVFP4

NVFP4 quantization of zai-org/GLM-5.1, further REAP-pruned from 256 → 160 routed experts per MoE layer. Tuned to run at 200,000-token context on a 4× 96 GB Blackwell workstation.

Total params 478.4B
Activated / token 42.7B
Routed experts / MoE layer 160 (was 256 in base)
Active experts / token 8 routed + 1 shared
Layers 78 (3 dense + 75 MoE) + 1 MTP / NEXTN
Hidden size 6144
Attention MLA-DSA, 64 heads
Max position 202,752
Quantization NVFP4, group_size=16 (modelopt_fp4)
On-disk size 285 GB (85 shards)
License MIT (inherited from GLM-5.1)

Measured performance

Single-user, batch size 1, decode tok/s at various prompt lengths on our reference rig (baseline dense-MLA path; see IndexCache below for substantially faster long-context numbers):

Context tok/s (baseline) tok/s (+ IndexCache)
256 46.5 46.4
4 k 41.8 41.7
16 k 26.4 – 38.6 40.9
32 k 36.5
64 k 29.5
128 k 21.3
150 k – 165 k 22.4 18.4

Under live mixed traffic (1,495 decode samples, baseline config):

Context range p50 tok/s
< 1 k 42.7
1 – 8 k 44.3
8 – 32 k 36.3
32 – 100 k 27.7

Per-rank VRAM at 202,752 ctx: weights 77.2 GB, KV pool 11.3 GB (270 k tokens), CUDA graphs 0.3 GB, ~5 GB free.


Quick start (4× 96 GB Blackwell)

# 1. Download the weights
hf download 0xSero/GLM-5.1-478B-A42B-REAP-NVFP4 --local-dir ./GLM-5.1-478B-A42B-REAP-NVFP4

# 2. Install the pinned inference stack (see "Exact versions" below)
python3.12 -m venv venv && source venv/bin/activate
pip install "sglang[all]==0.5.10.post1" flashinfer-python==0.6.7.post3 flashinfer-cubin==0.6.7.post3

# 3. Apply the required NSA-disable patch (see "Required sglang patch" below)

# 4. Launch
./launch.sh   # see full script below

Reference rig

  • 4× NVIDIA RTX PRO 6000 Blackwell Workstation Edition, 96 GB, compute capability 12.0 (sm_120)
  • NVIDIA driver 580.126.18, CUDA 12.9 userspace
  • Ubuntu / Pop!_OS 22.04, Python 3.12

This is what the tuning targets. The same recipe works on 4× B200 (sm_100), 8× Hopper (sm_90) with fewer or more aggressive quantization, and other Blackwell configurations — see the hardware compatibility matrix at the bottom of this page.


Exact versions (pinned from the running venv)

Everything below is reproducible from:

pip install "sglang[all]==0.5.10.post1" flashinfer-python==0.6.7.post3 flashinfer-cubin==0.6.7.post3

The resolver pulls in the whole stack at these versions:

sglang                   0.5.10.post1
torch                    2.9.1+cu129
triton                   3.5.1
transformers             5.3.0
tokenizers               0.22.2
safetensors              0.8.0rc0
numpy                    2.4.4

flashinfer-python        0.6.7.post3
flashinfer-cubin         0.6.7.post3
nvidia-cutlass-dsl       4.5.0.dev0
nvidia-cublas-cu12       12.9.1.4
nvidia-cudnn-cu12        9.10.2.21
nvidia-nccl-cu12         2.27.5
nvidia-cuda-nvrtc-cu12   12.9.86
nvidia-cuda-runtime-cu12 12.9.79
nvidia-nvjitlink-cu12    12.9.86
nvidia-nvshmem-cu12      3.3.20

Verify:

python -c "import sglang, torch, flashinfer; print(sglang.__version__, torch.__version__, flashinfer.__version__)"
# 0.5.10.post1 2.9.1+cu129 0.6.7.post3

Required sglang patch (SM120 only)

GLM-5.1's config advertises GlmMoeDsaForCausalLM, which sglang routes through DeepSeek Sparse Attention by default. Every NSA backend in sglang 0.5.10.post1 is built for sm_90a / sm_100f only and fails at launch on sm_120. Route GLM-5 through the stable dense-MLA path by excluding it from the NSA architecture list:

Edit <venv>/lib/python3.12/site-packages/sglang/srt/configs/model_config.py, function is_deepseek_nsa():

def is_deepseek_nsa(config) -> bool:
    architectures = (
        config.get("architectures") if isinstance(config, dict)
        else getattr(config, "architectures", None)
    )
    index_topk = (
        config.get("index_topk") if isinstance(config, dict)
        else getattr(config, "index_topk", None)
    )
    # Keep GLM-5 on dense MLA until sm_120 NSA kernels ship.
    return (
        architectures is not None
        and architectures[0] in [
            "DeepseekV3ForCausalLM",
            "DeepseekV32ForCausalLM",
            "DeepseekV3ForCausalLMNextN",
            "MistralLarge3ForCausalLM",
            "PixtralForConditionalGeneration",
        ]
        and index_topk is not None
    )

(Only the architectures list changes — GlmMoeDsaForCausalLM is removed.)

After the patch, sglang auto-picks triton for attention on sm_120. Confirm in the startup log: attention_backend='triton'.

On sm_90 (Hopper) and sm_100 (B200) this patch is not needed — the native NSA kernels work. Skip to the launch section.


Launch

#!/usr/bin/env bash
set -euo pipefail

MODEL=/path/to/GLM-5.1-478B-A42B-REAP-NVFP4
VENV=/path/to/sglang-venv

# Route NCCL over PCIe (no NVLink on workstation Blackwell)
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1,2,3        # four Blackwell GPUs

# DeepGEMM has no sm_120 kernels; keep it off.
export SGLANG_ENABLE_JIT_DEEPGEMM=0
export SGLANG_ENABLE_DEEP_GEMM=0
export SGLANG_DISABLE_DEEP_GEMM=1
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=True
export FLASHINFER_DISABLE_VERSION_CHECK=1

# NCCL tuning for PCIe-only (no IB, no NVLink)
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=0
export NCCL_P2P_LEVEL=PIX
export NCCL_SHM_DISABLE=0
export NCCL_BUFFSIZE=4194304
export NCCL_MIN_NCHANNELS=8
export NCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export NCCL_CUMEM_HOST_ENABLE=0
export TORCH_NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

export OMP_NUM_THREADS=8
export SAFETENSORS_FAST_GPU=1
export NVIDIA_TF32_OVERRIDE=1

exec "$VENV/bin/python" -m sglang.launch_server \
  --model-path        "$MODEL" \
  --served-model-name GLM-5.1-478B-A42B-REAP-NVFP4 \
  --host 0.0.0.0 --port 8000 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 1 \
  --context-length    202752 \
  --max-running-requests 1 \
  --mem-fraction-static 0.94 \
  --chunked-prefill-size 4096 \
  --page-size         128 \
  --quantization      modelopt_fp4 \
  --kv-cache-dtype    fp8_e4m3 \
  --triton-attention-num-kv-splits 64 \
  --moe-runner-backend cutlass \
  --fp4-gemm-backend  flashinfer_cudnn \
  --cuda-graph-max-bs 4 \
  --pre-warm-nccl \
  --tool-call-parser  glm47 \
  --reasoning-parser  glm45 \
  --chat-template     "$MODEL/chat_template.jinja" \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 16}' \
  --watchdog-timeout  1800 \
  --json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'

The final --json-model-override-args flag enables IndexCache, which reuses the DSA indexer output across layers according to a 78-character F/S pattern. This is now the default shipped recipe — for details, rationale, and the pattern breakdown see the IndexCache section below. If you want to disable it, remove that one flag.

On sm_90 / sm_100 you will want --attention-backend flashinfer and --fp4-gemm-backend b12x instead — see the 555B sibling card for that recipe.


IndexCache (enabled by default)

The launch script above already includes IndexCache via --json-model-override-args. This section documents what the flag does and how it was measured.

GLM-5.1's DeepSeek Sparse Attention (DSA) recomputes the top-k sparse index at every layer, on every prefill chunk and every decode step. Profiling in sglang#21663 measured the indexer alone at ~81% of prefill wall time at 200k context. IndexCache reuses the top-k indices across layers according to a 78-character F/S pattern — one character per transformer layer: F = that layer runs its own indexer, S = that layer reuses the nearest upstream F layer's indices. The pattern used here was greedy-searched upstream and is the one shipped in the SGLang GLM-5.1 cookbook.

With 23 F and 55 S layers, only ~30% of layers run the indexer — the other 70% of indexer cost is skipped. No extra VRAM (indices are transient, live on the MLA KV pool only for the duration of one step).

The flag in the launch script above:

--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'

To fall back to the pre-IndexCache baseline (e.g., for A/B testing), remove this single flag.

Measured on the reference rig (max_tokens=100, bs=1, streaming, 1 warmup + 2 measured reps):

Prompt tokens TTFT (warm) Cold-prefill tok/s Decode tok/s
16,025 0.23 s 40.9
32,025 0.39 s 36.5
64,025 0.71 s 29.5
128,025 1.34 s 4,629 21.3
165,025 1.76 s 18.4

Per-layer F/S pattern for reference (layer 0 leftmost):

layer   0000000001111111111222222222233333333334444444444555555555566666666667777777777
idx     0123456789012345678901234567890123456789012345678901234567890123456789012345678
mask    FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS

Coherence spot-checked with needle-in-haystack at 32k and 100k, 11-turn tool-use chat, and GSM-style arithmetic — no quality regressions observed relative to baseline.

Mutual exclusion: IndexCache and MTP are not both enabled in this recipe. MTP requires --page-size 64 and caps context at 65,536 (bf16 KV); IndexCache keeps --page-size 128, fp8_e4m3 KV, and the full 202,752-token window. For the workstation 200k-ctx target, IndexCache is the right pick.


Key flag decisions (why these specific values)

These were measured on the reference rig; defaults were not.

--triton-attention-num-kv-splits 64 — biggest single win. Default is 8. At bs=1 decode on sm_120, raising kv-splits gave:

Context splits=8 splits=64
4 k 39.7 41.8
16 k 26.4 38.6
150 k 5.2 22.4

Coherence verified across arithmetic, factual recall, needle-in-haystack @ 32 k and @ 100 k, and 11-turn chat.

--mem-fraction-static 0.94 — decode is kernel-bound at bs=1, not memory-bound. 0.94 vs 0.97 gives identical tok/s and ~5 GB/rank of headroom for graph recapture and prefill scratch.

--kv-cache-dtype fp8_e4m3 — halves KV memory vs bf16. Required to fit 202 k context in budget.

--attention-backend is intentionally omitted — sglang auto-selects triton on sm_120 for this architecture after the NSA patch. Flashinfer attention is skipped because it requires PCIe P2P atomics not available on the workstation board.

--page-size 128 — the non-MTP default. Drop to 64 only if enabling speculative decode.


MTP / NEXTN speculative decode (optional)

The checkpoint includes an MTP head for layer 78, stitched from the original 256-expert source using the layer-77 REAP keep-map as a proxy.

Without MTP (this page's default) With MTP
Decode tok/s (short) ~46 ~90 (1.93×)
Max context 202,752 ~65,536
KV dtype fp8_e4m3 bf16 (required by NEXTN)
Page size 128 64 (required by NEXTN)

MTP is opt-in because the workstation target is long context, not peak short-prompt throughput. Enable with:

# Replace three lines in the launch script:
--context-length 65536 \
--page-size 64 \
--kv-cache-dtype auto \
# and add:
--speculative-algorithm NEXTN \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-attention-mode decode \
--speculative-moe-runner-backend cutlass

Also drop --mem-fraction-static to 0.88 — the draft worker adds ~5 GB/rank.


Sampling recommendations

General chat / reasoning:

temperature=0.5  top_p=0.95  frequency_penalty=0.3  repetition_penalty=1.05

Strict-answer (MCQ, tool-use benchmarks):

temperature=0.0  repetition_penalty=1.05

Keep repetition_penalty=1.05 everywhere. Pure greedy with no penalty can loop on pathological low-entropy prompts (e.g., repeated filler tokens).


Lineage & license

zai-org/GLM-5.1  (official, 744B bf16, 256 experts, MIT)
    │
    ├── community NVFP4 quantization via NVIDIA Model Optimizer
    │     (e.g. lukealonso/GLM-5.1-NVFP4, ~434 GB, 256 experts)
    │
    ├── Local REAP pass 1: 256 → 192 experts
    │     0xSero/GLM-5.1-555B-A14B-REAP-NVFP4
    │
    └── Local REAP pass 2: 192 → 160 experts
          0xSero/GLM-5.1-478B-A42B-REAP-NVFP4   ← this model

Both REAP passes were done locally using pooled token-weighted observations from:

Prune scripts and MTP-stitch script are in the repo tree.

License: MIT, inherited from zai-org/GLM-5.1.

Citation (REAP method):

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025},
  eprint = {2510.13999},
  archivePrefix = {arXiv},
}

GLM-5.1 REAP Family — Hardware Compatibility

All variants in this family are REAP-pruned (2510.13999) descendants of zai-org/GLM-5.1 (original: 744B params, 256 experts/MoE layer, 40B activated/token). Pick a variant based on your GPU architecture and available VRAM.

Quick picker

You have Use
8× H100/H200 80GB (Hopper, sm_90) GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 or GLM-5.1-555B-A14B-REAP-NVFP4 (NVFP4 on Hopper via modelopt_fp4 + triton path)
4× RTX PRO 6000 Blackwell Workstation 96GB (sm_120) GLM-5.1-478B-A42B-REAP-NVFP4 (further-pruned 160-expert, 200k ctx) — this is the Blackwell Workstation reference config
4× B200 180GB (sm_100) GLM-5.1-478B-A42B-REAP-NVFP4 or GLM-5.1-555B-A14B-REAP-NVFP4
8× B200 / Blackwell datacenter GLM-5.1-555B-A14B-REAP-NVFP4 (192-expert, upstream's reference config with flashinfer + b12x backends)
8× A100 80GB (Ampere, sm_80) GLM-5.1-444B-A14B-REAP (BF16) or -GPTQ-W4A16
CPU / Apple Silicon / consumer GPU with llama.cpp GLM-5.1-555B-A14B-REAP-GGUF or GLM-5.1-444B-A14B-REAP-GGUF

Full family

Variant Format Size Experts/layer Activated/token Min VRAM (TP) Inference engine Best on
GLM-5.1-555B-A14B-REAP BF16 ~1125 GB 192 ~14B 8× 141 GB (H200) sglang / vllm Hopper
GLM-5.1-444B-A14B-REAP BF16 ~910 GB 154 ~14B 8× 114 GB sglang / vllm Ampere / Hopper
GLM-5.1-555B-A14B-REAP-NVFP4 NVFP4 (4-bit) ~320 GB 192 ~14B 4× 80 GB (B200), 8× 48 GB sglang --quantization modelopt_fp4 Blackwell (native); Hopper (triton path)
GLM-5.1-478B-A42B-REAP-NVFP4 NVFP4 (4-bit) ~285 GB 160 ~42B 4× 80 GB Blackwell sglang --quantization modelopt_fp4 4× RTX PRO 6000 Blackwell @ 200k ctx
GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 GPTQ W4A16 ~297 GB 192 ~14B 4× 80 GB vllm / sglang --quantization gptq_marlin Hopper (best), works on Ampere
GLM-5.1-555B-A14B-REAP-GGUF GGUF (Q2–Q8) ~348 GB 192 ~14B Varies by quant llama.cpp CPU / Apple / consumer CUDA
GLM-5.1-444B-A14B-REAP-GGUF GGUF (Q2–Q8) ~283 GB 154 ~14B Varies by quant llama.cpp CPU / Apple / consumer CUDA

Notes

  • NVFP4 on Hopper (H100/H200): supported from sglang 25.10 / 0.5.10+ (NVIDIA SGLang release notes); native Blackwell tensor-core FP4 still gives better throughput.
  • NVFP4 on B200 / Blackwell datacenter (sm_100): use flashinfer attention + b12x or flashinfer MoE backends — this is the recipe in the original 555B-A14B-REAP-NVFP4 card.
  • NVFP4 on Blackwell Workstation (sm_120): use --attention-backend triton (not flashinfer — PCIe P2P atomics unavailable on the consumer board), --moe-runner-backend cutlass, --fp4-gemm-backend flashinfer_cudnn. See the GLM-5.1-478B-A42B-REAP-NVFP4 card for the full 200k-ctx replication guide.
  • GPTQ-W4A16 vs NVFP4: same bit depth, different hardware path. NVFP4 has native Blackwell support and per-16 fp8 scales; GPTQ is group-quantized int4 with broader engine support.
  • REAP expert count variants (555B/444B): different expert-retention ratios from the same base; 555B keeps more experts (higher quality ceiling), 444B trades quality for 20% less VRAM.
  • Why NVFP4-478B-A42B-REAP is different: it's double-pruned (256 → 192 → 160 experts), optimized for a specific Blackwell Workstation 4×96GB target at 200k context. The A42B suffix reflects measured activated params/token on the 160-expert MoE, not the REAP branding convention of the sibling variants.

Pointer to active inference recipe

See GLM-5.1-478B-A42B-REAP-NVFP4 README for the full Blackwell Workstation replication guide (exact software pins, NSA patch, launch flags, measured 200k-ctx perf, sampling recommendations). Most of the sglang flags carry over to other NVFP4 variants on other hardware.

Citation

@misc{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author={Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year={2025},
  eprint={2510.13999},
  archivePrefix={arXiv},
}
Downloads last month
3,744
Safetensors
Model size
280B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for 0xSero/GLM-5.1-478B-A42B-REAP-NVFP4

Base model

zai-org/GLM-5.1
Quantized
(40)
this model

Collection including 0xSero/GLM-5.1-478B-A42B-REAP-NVFP4

Paper for 0xSero/GLM-5.1-478B-A42B-REAP-NVFP4