---
base_model:
- zai-org/GLM-5
pipeline_tag: text-generation
---

# GLM-5-CPU-NUMA4-AMXINT8

[zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5) quantized to the AMXINT8 format for inference with sglang + ktransformers, packed specifically for inference on **4** NUMA nodes.

To run, please ensure that your CPU supports the AMX instruction set (Intel Xeon processor, Sapphire Rapids or newer), and make note of your NUMA node count.

Install `kt-kernel` and `sglang-kt` following the [official documentation](https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/README.md).

Then, download the official weights of zai-org/GLM-5 (in either [BF16](https://huggingface.co/zai-org/GLM-5) or [FP8](https://huggingface.co/zai-org/GLM-5-FP8)), as well as this CPU-optimized quantized model, and prepare your launch command:

```
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
python -m sglang.launch_server \
  --model /path/to/GLM-5-FP8 \
  --kt-method AMXINT8 \
  --kt-weight-path /path/to/GLM-5-CPU-NUMA4-AMXINT8 \
  --kt-cpuinfer 128 \
  --kt-threadpool-count 4 \
  --kt-num-gpu-experts 16 \
  --kt-max-deferred-experts-per-token 0 \
  --kt-expert-placement-strategy uniform \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --served-model-name zai-org/GLM-5 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --enable-p2p-check \
  --disable-shared-experts-fusion \
  --chunked-prefill-size 4096 \
  --context-length 131072 \
  --max-total-tokens 131072 \
  --max-running-requests 1 \
  --attention-backend flashinfer \
  --fp8-gemm-backend cutlass \
  --kv-cache-dtype bf16 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47
```

## Notes:

- `GlmMoeDsaForCausalLM` requires at least `transformers` 5.2.0, which is not the default version pinned by `sglang-kt` at the time of writing
- DSA (DeepSeek Sparse Attention) is not currently supported on non-enterprise GPU architectures, so attention will fall back to standard MLA with the specified `--attention-backend`
- `--kt-cpuinfer` should be set to the total number of physical CPU cores across all NUMA nodes
- `--tensor-parallel-size` should be set to the number of GPUs (the example above uses 1)
- The optimal choices for `--attention-backend` and `--fp8-gemm-backend` depend on the CUDA architecture of your GPUs; please check the sglang documentation
- `--kt-num-gpu-experts`, `--mem-fraction-static`, `--chunked-prefill-size`, `--context-length`, `--max-total-tokens`, and `--max-running-requests` should be adjusted depending on the constraints of your hardware
- Please review the official `kt-kernel` documentation for details
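
The hardware checks mentioned above (AMX support, NUMA node count, and the physical core count for `--kt-cpuinfer`) can be sketched in Python on a Linux host. This is a minimal, unofficial helper and not part of `kt-kernel` or `sglang`: the function names `has_amx_int8`, `count_topology`, and `report` are hypothetical, and it assumes the kernel exposes the `amx_tile` and `amx_int8` flags in `/proc/cpuinfo` and that `lscpu` from util-linux is installed.

```python
import subprocess


def has_amx_int8(cpuinfo_text: str) -> bool:
    """Check the first 'flags' line of /proc/cpuinfo text for AMX tile + INT8 support."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = set(line.split(":", 1)[1].split())
            return {"amx_tile", "amx_int8"} <= flags
    return False


def count_topology(lscpu_parse_text: str) -> tuple[int, int]:
    """Return (physical_cores, numa_nodes) from `lscpu -p=CPU,CORE,SOCKET,NODE` output."""
    cores, nodes = set(), set()
    for line in lscpu_parse_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and the commented header
        _cpu, core, socket, node = line.split(",")[:4]
        cores.add((socket, core))  # hyperthreads share a (socket, core) pair
        if node != "":
            nodes.add(node)
    return len(cores), len(nodes)


def report() -> None:
    """Print the values relevant to this model card for the current machine."""
    with open("/proc/cpuinfo") as f:
        print("AMX INT8 available:", has_amx_int8(f.read()))
    out = subprocess.run(
        ["lscpu", "-p=CPU,CORE,SOCKET,NODE"],
        capture_output=True, text=True, check=True,
    ).stdout
    cores, nodes = count_topology(out)
    print(f"physical cores (candidate for --kt-cpuinfer): {cores}")
    print(f"NUMA nodes (this packing expects 4): {nodes}")
```

Calling `report()` on the target machine prints the three values. If the NUMA node count is not 4, this particular packing is not the one built for that host.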