GLM-5.1-478B-A42B-REAP-NVFP4

NVFP4 quantization of zai-org/GLM-5.1, further REAP-pruned from 256 → 160 routed experts per MoE layer. Tuned to run at 200,000-token context on a 4× 96 GB Blackwell workstation.


Total params	478.4B
Activated / token	42.7B
Routed experts / MoE layer	160 (was 256 in base)
Active experts / token	8 routed + 1 shared
Layers	78 (3 dense + 75 MoE) + 1 MTP / NEXTN
Hidden size	6144
Attention	MLA-DSA, 64 heads
Max position	202,752
Quantization	NVFP4, group_size=16 (`modelopt_fp4`)
On-disk size	285 GB (85 shards)
License	MIT (inherited from GLM-5.1)

Measured performance

Single-user, batch size 1, decode tok/s at various prompt lengths on our reference rig (baseline dense-MLA path; see IndexCache below for substantially faster long-context numbers):

Context	tok/s (baseline)	tok/s (+ IndexCache)
256	46.5	46.4
4 k	41.8	41.7
16 k	26.4 – 38.6	40.9
32 k	—	36.5
64 k	—	29.5
128 k	—	21.3
150 k – 165 k	22.4	18.4

Under live mixed traffic (1,495 decode samples, baseline config):

Context range	p50 tok/s
< 1 k	42.7
1 – 8 k	44.3
8 – 32 k	36.3
32 – 100 k	27.7

Per-rank VRAM at 202,752 ctx: weights 77.2 GB, KV pool 11.3 GB (270 k tokens), CUDA graphs 0.3 GB, ~5 GB free.

Quick start (4× 96 GB Blackwell)

# 1. Download the weights
hf download 0xSero/GLM-5.1-478B-A42B-REAP-NVFP4 --local-dir ./GLM-5.1-478B-A42B-REAP-NVFP4

# 2. Install the pinned inference stack (see "Exact versions" below)
python3.12 -m venv venv && source venv/bin/activate
pip install "sglang[all]==0.5.10.post1" flashinfer-python==0.6.7.post3 flashinfer-cubin==0.6.7.post3

# 3. Apply the required NSA-disable patch (see "Required sglang patch" below)

# 4. Launch
./launch.sh   # see full script below

Reference rig

4× NVIDIA RTX PRO 6000 Blackwell Workstation Edition, 96 GB, compute capability 12.0 (sm_120)
NVIDIA driver 580.126.18, CUDA 12.9 userspace
Ubuntu / Pop!_OS 22.04, Python 3.12

This is what the tuning targets. The same recipe works on 4× B200 (sm_100), 8× Hopper (sm_90) with fewer or more aggressive quantization, and other Blackwell configurations — see the hardware compatibility matrix at the bottom of this page.

Exact versions (pinned from the running venv)

Everything below is reproducible from:

pip install "sglang[all]==0.5.10.post1" flashinfer-python==0.6.7.post3 flashinfer-cubin==0.6.7.post3

The resolver pulls in the whole stack at these versions:

sglang                   0.5.10.post1
torch                    2.9.1+cu129
triton                   3.5.1
transformers             5.3.0
tokenizers               0.22.2
safetensors              0.8.0rc0
numpy                    2.4.4

flashinfer-python        0.6.7.post3
flashinfer-cubin         0.6.7.post3
nvidia-cutlass-dsl       4.5.0.dev0
nvidia-cublas-cu12       12.9.1.4
nvidia-cudnn-cu12        9.10.2.21
nvidia-nccl-cu12         2.27.5
nvidia-cuda-nvrtc-cu12   12.9.86
nvidia-cuda-runtime-cu12 12.9.79
nvidia-nvjitlink-cu12    12.9.86
nvidia-nvshmem-cu12      3.3.20

Verify:

python -c "import sglang, torch, flashinfer; print(sglang.__version__, torch.__version__, flashinfer.__version__)"
# 0.5.10.post1 2.9.1+cu129 0.6.7.post3

Required sglang patch (SM120 only)

GLM-5.1's config advertises GlmMoeDsaForCausalLM, which sglang routes through DeepSeek Sparse Attention by default. Every NSA backend in sglang 0.5.10.post1 is built for sm_90a / sm_100f only and fails at launch on sm_120. Route GLM-5 through the stable dense-MLA path by excluding it from the NSA architecture list:

Edit <venv>/lib/python3.12/site-packages/sglang/srt/configs/model_config.py, function is_deepseek_nsa():

def is_deepseek_nsa(config) -> bool:
    architectures = (
        config.get("architectures") if isinstance(config, dict)
        else getattr(config, "architectures", None)
    )
    index_topk = (
        config.get("index_topk") if isinstance(config, dict)
        else getattr(config, "index_topk", None)
    )
    # Keep GLM-5 on dense MLA until sm_120 NSA kernels ship.
    return (
        architectures is not None
        and architectures[0] in [
            "DeepseekV3ForCausalLM",
            "DeepseekV32ForCausalLM",
            "DeepseekV3ForCausalLMNextN",
            "MistralLarge3ForCausalLM",
            "PixtralForConditionalGeneration",
        ]
        and index_topk is not None
    )

(Only the architectures list changes — GlmMoeDsaForCausalLM is removed.)

After the patch, sglang auto-picks triton for attention on sm_120. Confirm in the startup log: attention_backend='triton'.

On sm_90 (Hopper) and sm_100 (B200) this patch is not needed — the native NSA kernels work. Skip to the launch section.

Launch

#!/usr/bin/env bash
set -euo pipefail

MODEL=/path/to/GLM-5.1-478B-A42B-REAP-NVFP4
VENV=/path/to/sglang-venv

# Route NCCL over PCIe (no NVLink on workstation Blackwell)
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1,2,3        # four Blackwell GPUs

# DeepGEMM has no sm_120 kernels; keep it off.
export SGLANG_ENABLE_JIT_DEEPGEMM=0
export SGLANG_ENABLE_DEEP_GEMM=0
export SGLANG_DISABLE_DEEP_GEMM=1
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=True
export FLASHINFER_DISABLE_VERSION_CHECK=1

# NCCL tuning for PCIe-only (no IB, no NVLink)
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=0
export NCCL_P2P_LEVEL=PIX
export NCCL_SHM_DISABLE=0
export NCCL_BUFFSIZE=4194304
export NCCL_MIN_NCHANNELS=8
export NCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export NCCL_CUMEM_HOST_ENABLE=0
export TORCH_NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

export OMP_NUM_THREADS=8
export SAFETENSORS_FAST_GPU=1
export NVIDIA_TF32_OVERRIDE=1

exec "$VENV/bin/python" -m sglang.launch_server \
  --model-path        "$MODEL" \
  --served-model-name GLM-5.1-478B-A42B-REAP-NVFP4 \
  --host 0.0.0.0 --port 8000 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 1 \
  --context-length    202752 \
  --max-running-requests 1 \
  --mem-fraction-static 0.94 \
  --chunked-prefill-size 4096 \
  --page-size         128 \
  --quantization      modelopt_fp4 \
  --kv-cache-dtype    fp8_e4m3 \
  --triton-attention-num-kv-splits 64 \
  --moe-runner-backend cutlass \
  --fp4-gemm-backend  flashinfer_cudnn \
  --cuda-graph-max-bs 4 \
  --pre-warm-nccl \
  --tool-call-parser  glm47 \
  --reasoning-parser  glm45 \
  --chat-template     "$MODEL/chat_template.jinja" \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 16}' \
  --watchdog-timeout  1800 \
  --json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'

The final --json-model-override-args flag enables IndexCache, which reuses the DSA indexer output across layers according to a 78-character F/S pattern. This is now the default shipped recipe — for details, rationale, and the pattern breakdown see the IndexCache section below. If you want to disable it, remove that one flag.

On sm_90 / sm_100 you will want --attention-backend flashinfer and --fp4-gemm-backend b12x instead — see the 555B sibling card for that recipe.

IndexCache (enabled by default)

The launch script above already includes IndexCache via --json-model-override-args. This section documents what the flag does and how it was measured.

GLM-5.1's DeepSeek Sparse Attention (DSA) recomputes the top-k sparse index at every layer, on every prefill chunk and every decode step. Profiling in sglang#21663 measured the indexer alone at ~81% of prefill wall time at 200k context. IndexCache reuses the top-k indices across layers according to a 78-character F/S pattern — one character per transformer layer: F = that layer runs its own indexer, S = that layer reuses the nearest upstream F layer's indices. The pattern used here was greedy-searched upstream and is the one shipped in the SGLang GLM-5.1 cookbook.

With 23 F and 55 S layers, only ~30% of layers run the indexer — the other 70% of indexer cost is skipped. No extra VRAM (indices are transient, live on the MLA KV pool only for the duration of one step).

The flag in the launch script above:

--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'

To fall back to the pre-IndexCache baseline (e.g., for A/B testing), remove this single flag.

Measured on the reference rig (max_tokens=100, bs=1, streaming, 1 warmup + 2 measured reps):

Prompt tokens	TTFT (warm)	Cold-prefill tok/s	Decode tok/s
16,025	0.23 s	—	40.9
32,025	0.39 s	—	36.5
64,025	0.71 s	—	29.5
128,025	1.34 s	4,629	21.3
165,025	1.76 s	—	18.4

Per-layer F/S pattern for reference (layer 0 leftmost):

layer   0000000001111111111222222222233333333334444444444555555555566666666667777777777
idx     0123456789012345678901234567890123456789012345678901234567890123456789012345678
mask    FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS

Coherence spot-checked with needle-in-haystack at 32k and 100k, 11-turn tool-use chat, and GSM-style arithmetic — no quality regressions observed relative to baseline.

Mutual exclusion: IndexCache and MTP are not both enabled in this recipe. MTP requires --page-size 64 and caps context at 65,536 (bf16 KV); IndexCache keeps --page-size 128, fp8_e4m3 KV, and the full 202,752-token window. For the workstation 200k-ctx target, IndexCache is the right pick.

Key flag decisions (why these specific values)

These were measured on the reference rig; defaults were not.

--triton-attention-num-kv-splits 64 — biggest single win. Default is 8. At bs=1 decode on sm_120, raising kv-splits gave:

Context	splits=8	splits=64
4 k	39.7	41.8
16 k	26.4	38.6
150 k	5.2	22.4

Coherence verified across arithmetic, factual recall, needle-in-haystack @ 32 k and @ 100 k, and 11-turn chat.

--mem-fraction-static 0.94 — decode is kernel-bound at bs=1, not memory-bound. 0.94 vs 0.97 gives identical tok/s and ~5 GB/rank of headroom for graph recapture and prefill scratch.

--kv-cache-dtype fp8_e4m3 — halves KV memory vs bf16. Required to fit 202 k context in budget.

--attention-backend is intentionally omitted — sglang auto-selects triton on sm_120 for this architecture after the NSA patch. Flashinfer attention is skipped because it requires PCIe P2P atomics not available on the workstation board.

--page-size 128 — the non-MTP default. Drop to 64 only if enabling speculative decode.

MTP / NEXTN speculative decode (optional)

The checkpoint includes an MTP head for layer 78, stitched from the original 256-expert source using the layer-77 REAP keep-map as a proxy.

	Without MTP (this page's default)	With MTP
Decode tok/s (short)	~46	~90 (1.93×)
Max context	202,752	~65,536
KV dtype	fp8_e4m3	bf16 (required by NEXTN)
Page size	128	64 (required by NEXTN)

MTP is opt-in because the workstation target is long context, not peak short-prompt throughput. Enable with:

# Replace three lines in the launch script:
--context-length 65536 \
--page-size 64 \
--kv-cache-dtype auto \
# and add:
--speculative-algorithm NEXTN \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-attention-mode decode \
--speculative-moe-runner-backend cutlass

Also drop --mem-fraction-static to 0.88 — the draft worker adds ~5 GB/rank.

Sampling recommendations

General chat / reasoning:

temperature=0.5  top_p=0.95  frequency_penalty=0.3  repetition_penalty=1.05

Strict-answer (MCQ, tool-use benchmarks):

temperature=0.0  repetition_penalty=1.05

Keep repetition_penalty=1.05 everywhere. Pure greedy with no penalty can loop on pathological low-entropy prompts (e.g., repeated filler tokens).

Lineage & license

zai-org/GLM-5.1  (official, 744B bf16, 256 experts, MIT)
    │
    ├── community NVFP4 quantization via NVIDIA Model Optimizer
    │     (e.g. lukealonso/GLM-5.1-NVFP4, ~434 GB, 256 experts)
    │
    ├── Local REAP pass 1: 256 → 192 experts
    │     0xSero/GLM-5.1-555B-A14B-REAP-NVFP4
    │
    └── Local REAP pass 2: 192 → 160 experts
          0xSero/GLM-5.1-478B-A42B-REAP-NVFP4   ← this model

Both REAP passes were done locally using pooled token-weighted observations from:

0xSero/glm51-layerwise-reap-observations — per-block metrics, full layer coverage.
0xSero/glm-5-special — consolidated observer state, ~85 M tokens over ~7.6 k samples.

Prune scripts and MTP-stitch script are in the repo tree.

License: MIT, inherited from zai-org/GLM-5.1.

Citation (REAP method):

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025},
  eprint = {2510.13999},
  archivePrefix = {arXiv},
}

GLM-5.1 REAP Family — Hardware Compatibility

All variants in this family are REAP-pruned (2510.13999) descendants of zai-org/GLM-5.1 (original: 744B params, 256 experts/MoE layer, 40B activated/token). Pick a variant based on your GPU architecture and available VRAM.

Quick picker

You have	Use
8× H100/H200 80GB (Hopper, sm_90)	GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 or GLM-5.1-555B-A14B-REAP-NVFP4 (NVFP4 on Hopper via `modelopt_fp4` + triton path)
4× RTX PRO 6000 Blackwell Workstation 96GB (sm_120)	GLM-5.1-478B-A42B-REAP-NVFP4 (further-pruned 160-expert, 200k ctx) — this is the Blackwell Workstation reference config
4× B200 180GB (sm_100)	GLM-5.1-478B-A42B-REAP-NVFP4 or GLM-5.1-555B-A14B-REAP-NVFP4
8× B200 / Blackwell datacenter	GLM-5.1-555B-A14B-REAP-NVFP4 (192-expert, upstream's reference config with flashinfer + b12x backends)
8× A100 80GB (Ampere, sm_80)	GLM-5.1-444B-A14B-REAP (BF16) or -GPTQ-W4A16
CPU / Apple Silicon / consumer GPU with llama.cpp	GLM-5.1-555B-A14B-REAP-GGUF or GLM-5.1-444B-A14B-REAP-GGUF

Full family

Variant	Format	Size	Experts/layer	Activated/token	Min VRAM (TP)	Inference engine	Best on
GLM-5.1-555B-A14B-REAP	BF16	~1125 GB	192	~14B	8× 141 GB (H200)	sglang / vllm	Hopper
GLM-5.1-444B-A14B-REAP	BF16	~910 GB	154	~14B	8× 114 GB	sglang / vllm	Ampere / Hopper
GLM-5.1-555B-A14B-REAP-NVFP4	NVFP4 (4-bit)	~320 GB	192	~14B	4× 80 GB (B200), 8× 48 GB	sglang `--quantization modelopt_fp4`	Blackwell (native); Hopper (triton path)
GLM-5.1-478B-A42B-REAP-NVFP4	NVFP4 (4-bit)	~285 GB	160	~42B	4× 80 GB Blackwell	sglang `--quantization modelopt_fp4`	4× RTX PRO 6000 Blackwell @ 200k ctx
GLM-5.1-555B-A14B-REAP-GPTQ-W4A16	GPTQ W4A16	~297 GB	192	~14B	4× 80 GB	vllm / sglang `--quantization gptq_marlin`	Hopper (best), works on Ampere
GLM-5.1-555B-A14B-REAP-GGUF	GGUF (Q2–Q8)	~348 GB	192	~14B	Varies by quant	llama.cpp	CPU / Apple / consumer CUDA
GLM-5.1-444B-A14B-REAP-GGUF	GGUF (Q2–Q8)	~283 GB	154	~14B	Varies by quant	llama.cpp	CPU / Apple / consumer CUDA

Notes

NVFP4 on Hopper (H100/H200): supported from sglang 25.10 / 0.5.10+ (NVIDIA SGLang release notes); native Blackwell tensor-core FP4 still gives better throughput.
NVFP4 on B200 / Blackwell datacenter (sm_100): use flashinfer attention + b12x or flashinfer MoE backends — this is the recipe in the original 555B-A14B-REAP-NVFP4 card.
NVFP4 on Blackwell Workstation (sm_120): use --attention-backend triton (not flashinfer — PCIe P2P atomics unavailable on the consumer board), --moe-runner-backend cutlass, --fp4-gemm-backend flashinfer_cudnn. See the GLM-5.1-478B-A42B-REAP-NVFP4 card for the full 200k-ctx replication guide.
GPTQ-W4A16 vs NVFP4: same bit depth, different hardware path. NVFP4 has native Blackwell support and per-16 fp8 scales; GPTQ is group-quantized int4 with broader engine support.
REAP expert count variants (555B/444B): different expert-retention ratios from the same base; 555B keeps more experts (higher quality ceiling), 444B trades quality for 20% less VRAM.
Why NVFP4-478B-A42B-REAP is different: it's double-pruned (256 → 192 → 160 experts), optimized for a specific Blackwell Workstation 4×96GB target at 200k context. The A42B suffix reflects measured activated params/token on the 160-expert MoE, not the REAP branding convention of the sibling variants.

Pointer to active inference recipe

See GLM-5.1-478B-A42B-REAP-NVFP4 README for the full Blackwell Workstation replication guide (exact software pins, NSA patch, launch flags, measured 200k-ctx perf, sampling recommendations). Most of the sglang flags carry over to other NVFP4 variants on other hardware.

Citation

@misc{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author={Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year={2025},
  eprint={2510.13999},
  archivePrefix={arXiv},
}