---
base_model:
- zai-org/GLM-5
pipeline_tag: text-generation
---

# GLM-5-CPU-NUMA4-AMXINT8

[zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5) quantized to the AMXINT8 format for inference with sglang + ktransformers, packed specifically for inference on **4** NUMA nodes.

To run, please ensure that your CPU supports the AMX instruction set (Intel Xeon processor, Sapphire Rapids or newer), and make note of your NUMA node count. Install `kt-kernal` and `sglang-kt` following the [official documentation](https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/README.md).

Then, download the official weights of zai-org/GLM-5 (in either [BF16](https://huggingface.co/zai-org/GLM-5) or [FP8](https://huggingface.co/zai-org/GLM-5-FP8)), as well as this CPU-optimized quantized model, and prepare your launch command:

```
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
python -m sglang.launch_server \
  --model /path/to/GLM-5-FP8 \
  --kt-method AMXINT8 \
  --kt-weight-path /path/to/GLM-5-CPU-NUMA4-AMXINT8 \
  --kt-cpuinfer 128 \
  --kt-threadpool-count 4 \
  --kt-num-gpu-experts 16 \
  --kt-max-deferred-experts-per-token 0 \
  --kt-expert-placement-strategy uniform \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --served-model-name zai-org/GLM-5 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --enable-p2p-check \
  --disable-shared-experts-fusion \
  --chunked-prefill-size 4096 \
  --context-length 131072 \
  --max-total-tokens 131072 \
  --max-running-requests 1 \
  --attention-backend flashinfer \
  --fp8-gemm-backend cutlass \
  --kv-cache-dtype bf16 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47
```

## Notes:
- `GlmMoeDsaForCausalLM` requires at least `transformers` 5.2.0, which is not the default version pinned by `sglang-kt` at the time of writing
- Note that DSA (DeepSeek Sparse Attention) is not currently supported on non-enterprise GPU architectures, so attention will fall back to standard MLA with the specified `--attention-backend`
- `--kt-cpuinfer` should be set to the total number of physical CPU cores across all NUMA nodes
- `--tensor-parallel-size 1` should be set to the number of GPUs
- The optimal choices for `--attention-backend` and `--fp8-gemm-backend` depend on the CUDA architecture of your GPUs - please check the sglang documentation
- `--kt-num-gpu-experts`, `--mem-fraction-static`, `--chunked-prefill-size`, `--context-length`, `--max-total-tokens`, and `--max-running-requests` should be adjusted depending on constraints of your hardware
- Please review the official `kt-kernel` documentation for details