GLM-5.2-GGUF-2.244bpw

This is a 2.2 BPW quantized model for the GPU rich, with more combined RAM + VRAM than common sense.

The quant aims for best-in-class quality at this size by relying on SOTA quants from ik_llama.cpp:

  • Routed experts tensors use the IQ2_KT quant
  • All other tensors use the Q6_0 quant

Coupled with the recent enhancements of MTP support (#1890) and graph parallel support (#1821), it should run at decent speed as well.

Note: For now, we need to apply a small diff, ported from mainline #24770. (UPDATE: no need to apply this diff anymore, after the merging of https://github.com/ikawrakow/ik_llama.cpp/pull/2017)

Chat Template

The official chat template for GLM-5.2 differs a bit from GLM-5.1, and it doesn't seem to work correctly in ik_llama.cpp (and likely llama.cpp as well).

For symptoms, look out for this log pattern:

Common part does not match fully
Cache :
</tool_response><|assistant|><think></think><tool_call>bash<arg_key>command</arg_key><arg_value>grep -rni "log" /home/sayap/repo/pi-mono/packages/c
prompt:
</tool_response><|assistant|><think>Let me search more specifically for logging request/response in the docs.</think><tool_call>bash<arg_key>command</arg_key><arg_value>grep -rni "log

where the reasoning content "Let me search more specifically for logging request/response in the docs." is actually for the previous tool call, while the current tool call, grep -rni "log" ..., was made without reasoning.

Once the reasoning content from the previous tool call has been wrongly duplicated onto the current tool call, output quality degrades from that point on.

To fix this, I got this diff from GLM-5.2 using the z.ai coding plan:

--- chat_template.jinja.orig
+++ chat_template.jinja
@@ -62,13 +62,14 @@
 {%- elif m.role == 'assistant' -%}
 <|assistant|>
 {%- set content = visible_text(m.content) %}
+{%- set reasoning_content = "" %}
 {%- if m.reasoning_content is string %}
     {%- set reasoning_content = m.reasoning_content %}
 {%- elif '</think>' in content %}
     {%- set reasoning_content = content.split('</think>')[0].split('<think>')[-1] %}
     {%- set content = content.split('</think>')[-1] %}
 {%- endif %}
-{%- if ((clear_thinking is defined and not clear_thinking) or loop.index0 > ns.last_user_index) and reasoning_content is defined -%}
+{%- if ((clear_thinking is defined and not clear_thinking) or loop.index0 > ns.last_user_index) and reasoning_content -%}
 {{ '<think>' + reasoning_content +  '</think>'}}
 {%- else -%}
 {{ '<think></think>' }}

chat_template.jinja with the diff applied. (UPDATE: no need to modify the chat template anymore, after the merging of https://github.com/ikawrakow/ik_llama.cpp/pull/2018)
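The effect of the diff can be illustrated outside Jinja. The Python sketch below is not the template itself. It assumes, as the symptom suggests, that the template engine keeps a variable set in an earlier loop iteration alive in later ones, so `reasoning_content` is still "defined" when the current assistant message has no reasoning. The function names and the message layout are hypothetical, and the `clear_thinking` / `last_user_index` condition and the `</think>` splitting branch are left out for brevity.

```python
UNDEFINED = object()  # stands in for a Jinja variable that was never set


def render_original(messages):
    """Mimics the unpatched template: reasoning_content is never reset."""
    reasoning_content = UNDEFINED  # assumption: the value survives across iterations
    out = []
    for m in messages:
        if isinstance(m.get("reasoning_content"), str):
            reasoning_content = m["reasoning_content"]
        # Equivalent of "reasoning_content is defined": true once it was ever set.
        if reasoning_content is not UNDEFINED:
            out.append("<think>" + reasoning_content + "</think>")
        else:
            out.append("<think></think>")
    return out


def render_patched(messages):
    """Mimics the patched template: reset per message, then test truthiness."""
    out = []
    for m in messages:
        reasoning_content = ""  # the added {%- set reasoning_content = "" %}
        if isinstance(m.get("reasoning_content"), str):
            reasoning_content = m["reasoning_content"]
        if reasoning_content:  # empty string is falsy, so no stale reasoning
            out.append("<think>" + reasoning_content + "</think>")
        else:
            out.append("<think></think>")
    return out


# Two assistant turns: the first has reasoning, the second has none.
turns = [{"reasoning_content": "Let me search more specifically."}, {}]
print(render_original(turns)[1])  # stale reasoning is repeated
print(render_patched(turns)[1])   # empty think block, as intended
```

Under that assumption, the unpatched version repeats the first turn's reasoning in the second turn, which is the cache/prompt mismatch shown in the log pattern above.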

Size

Size from llama-server output:

llm_load_print_meta: model size       = 196.756 GiB (2.244 BPW)
llm_load_print_meta: repeating layers = 195.315 GiB (2.233 BPW, 751.427 B parameters)
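As a sanity check, the BPW figures follow directly from the sizes in the log. This is a minimal sketch of the arithmetic, assuming GiB means 2^30 bytes; the helper names are made up for illustration.

```python
GIB = 2**30  # bytes per GiB


def bits_per_weight(size_gib, n_params):
    """Average bits per weight for a tensor set of the given size."""
    return size_gib * GIB * 8 / n_params


def params_from_size(size_gib, bpw):
    """Parameter count implied by a size and an average BPW."""
    return size_gib * GIB * 8 / bpw


# Repeating layers: 195.315 GiB over 751.427 B parameters.
repeating_bpw = bits_per_weight(195.315, 751.427e9)
print(f"{repeating_bpw:.3f} BPW")  # 2.233 BPW

# Whole model: 196.756 GiB at 2.244 BPW.
total_params = params_from_size(196.756, 2.244)
print(f"{total_params / 1e9:.0f} B parameters")  # 753 B parameters
```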

Buffer size with -cmoe --no-mmap:

llm_load_tensors: offloaded 80/80 layers to GPU
llm_load_tensors:        CPU buffer size = 187545.34 MiB
llm_load_tensors: CUDA_Split buffer size = 13916.52 MiB
llm_load_tensors:      CUDA1 buffer size =   961.97 MiB
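For a quick read of those numbers, the buffers can be summed and converted to GiB. This is only arithmetic on the three log lines above; treating every buffer whose name starts with CUDA as GPU memory is an assumption.

```python
MIB_PER_GIB = 1024

# Buffer sizes in MiB, copied from the llm_load_tensors lines.
buffers_mib = {"CPU": 187545.34, "CUDA_Split": 13916.52, "CUDA1": 961.97}

# Assumption: buffers named CUDA* live in VRAM, the rest in system RAM.
gpu_mib = sum(size for name, size in buffers_mib.items() if name.startswith("CUDA"))
total_mib = sum(buffers_mib.values())

print(f"GPU:   {gpu_mib / MIB_PER_GIB:.2f} GiB")    # GPU:   14.53 GiB
print(f"total: {total_mib / MIB_PER_GIB:.2f} GiB")  # total: 197.68 GiB
```

With -cmoe alone, only about 14.5 GiB of weights end up in the CUDA buffers, which should leave VRAM headroom for context and for moving some routed experts tensors back to the GPUs.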

Quality

Recipe:

# Attention
blk\..*\.attn_k_b\.weight=q6_0
blk\..*\.attn_v_b\.weight=q6_0

blk\..*\.attn_kv_a_mqa\.weight=q6_0
blk\..*\.attn_q_a\.weight=q6_0
blk\..*\.attn_q_b\.weight=q6_0
blk\..*\.attn_output\.weight=q6_0

# First 3 Dense Layers
blk\..*\.ffn_down\.weight=q6_0
blk\..*\.ffn_(gate|up)\.weight=q6_0

# Shared Expert Layers
blk\..*\.ffn_down_shexp\.weight=q6_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q6_0

# Routed Experts Layers
blk\..*\.ffn_(up|gate|down)_exps\.weight=iq2_kt

# Indexer
blk\..*\.indexer\.proj\.weight=q6_0
blk\..*\.indexer\.attn_k\.weight=q6_0
blk\..*\.indexer\.attn_q_b\.weight=q6_0

# NextN MTP Layer
blk\..*\.nextn\.embed_tokens\.weight=q6_0
blk\..*\.nextn\.shared_head_head\.weight=q6_0
blk\..*\.nextn\.eh_proj\.weight=q6_0

# Non-Repeating Layers
token_embd\.weight=q6_0
output\.weight=q6_0
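To see how such a recipe resolves for a given tensor, here is a small Python sketch. It assumes rules are tried in file order with regex search and that the first match wins; that is an assumption about how the quantizer applies the recipe, not something verified against ik_llama.cpp. The tensor names in the example are hypothetical but follow the patterns above, and only a subset of the rules is included.

```python
import re

# A subset of the recipe, as (regex, quant type) pairs in file order.
RECIPE = [
    (r"blk\..*\.attn_output\.weight", "q6_0"),
    (r"blk\..*\.ffn_down\.weight", "q6_0"),
    (r"blk\..*\.ffn_(gate|up)_shexp\.weight", "q6_0"),
    (r"blk\..*\.ffn_(up|gate|down)_exps\.weight", "iq2_kt"),
    (r"token_embd\.weight", "q6_0"),
]


def quant_for(tensor_name):
    """Return the quant type of the first matching rule, or None."""
    for pattern, qtype in RECIPE:
        if re.search(pattern, tensor_name):
            return qtype
    return None


for name in (
    "blk.0.ffn_down.weight",          # dense layer
    "blk.10.ffn_down_exps.weight",    # routed experts
    "blk.10.ffn_gate_shexp.weight",   # shared expert
    "token_embd.weight",
):
    print(name, "->", quant_for(name))
```

Note that `ffn_down\.weight` does not catch `ffn_down_exps.weight`, because the pattern requires `.weight` to follow `ffn_down` directly, so the routed experts fall through to the `iq2_kt` rule.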

PPL result with wiki.test.raw:

Final estimate: PPL over 565 chunks for n_ctx=512 = 3.8402 +/- 0.02168

This quant uses the imatrix from unsloth (thanks!), and seems to perform well enough in actual tasks.

Flags

To get a usable context size, we have to sacrifice prompt processing (PP) speed a bit by going with the much slower -mla 1, which uses less VRAM than the usual -mla 3.

These flags allow a 102400 context size on my machine with 184 GiB of RAM and 2 x 24 GiB of VRAM:

--no-mmap -ngl 99 -cmoe -sm graph \
-ot 'blk\.([345])\.ffn_.*_exps\.weight=CUDA0' \
-ot 'blk\.6\.ffn_(up|gate)_exps\.weight=CUDA0' \
-ot 'blk\.(40|41|42)\.ffn_.*_exps\.weight=CUDA1' \
-mla 1 -amb 512 \
-c 102400 -ctk q6_0 -khad \
-b 2048 -ub 2048 -wgt 1 \
-cram 0 -muge -cuda graphs=1 \
--jinja --parallel-tool-calls \
--chat-template-kwargs '{"reasoning_effort": "high"}' \
--spec-type mtp:n_max=4,p_min=0.5

  • 11 routed experts tensors on CUDA0, 9 routed experts tensors on CUDA1, the rest on CPU.
    • MTP needs some VRAM on the last GPU, i.e. CUDA1, so we put fewer routed experts tensors there.
  • -mla 1 to squeeze a 102400 context in Q6_0 (-ctk q6_0), -khad to reduce quantization error.
  • -amb 512 and -wgt 1 to reduce the CUDA compute buffer a little.
  • -ub 2048 to allow GPU offload for prompt processing.
  • --spec-type mtp:n_max=4,p_min=0.5 to enable MTP (draft up to 4 tokens, with at least 50% token probability).
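The tensor counts in the first bullet can be checked by replaying the three -ot patterns over the routed experts tensor names. This is a rough sketch: the block range 3..78 is a guess (the first 3 layers are dense, and the exact block count is not stated here), and first-match-wins with a CPU fallback for -cmoe is an assumption about how the overrides are applied.

```python
import re
from collections import Counter

# The three -ot overrides from the flags above, as (regex, device) pairs.
OVERRIDES = [
    (r"blk\.([345])\.ffn_.*_exps\.weight", "CUDA0"),
    (r"blk\.6\.ffn_(up|gate)_exps\.weight", "CUDA0"),
    (r"blk\.(40|41|42)\.ffn_.*_exps\.weight", "CUDA1"),
]


def placement(tensor_name):
    """First matching override wins; everything else stays on CPU (-cmoe)."""
    for pattern, device in OVERRIDES:
        if re.search(pattern, tensor_name):
            return device
    return "CPU"


# Hypothetical layer range: routed experts assumed in blocks 3..78.
names = [
    f"blk.{layer}.ffn_{kind}_exps.weight"
    for layer in range(3, 79)
    for kind in ("up", "gate", "down")
]

counts = Counter(placement(name) for name in names)
print(counts["CUDA0"], counts["CUDA1"])  # 11 9
```

The CUDA0 and CUDA1 counts do not depend on the guessed block range, as long as it covers blocks 3 to 6 and 40 to 42.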

Speed comparison for tasks that can really benefit from MTP:

  • without MTP:
prompt eval time =   22486.83 ms /  5188 tokens (    4.33 ms per token,   230.71 tokens per second)
       eval time =   54116.43 ms /   639 tokens (   84.69 ms per token,    11.81 tokens per second)
      total time =   76603.26 ms /  5827 tokens
  • with MTP:
prompt eval time =   23404.00 ms /  5188 tokens (    4.51 ms per token,   221.67 tokens per second)
       eval time =   35463.34 ms /   639 tokens (   55.50 ms per token,    18.02 tokens per second)
      total time =   58867.34 ms /  5827 tokens
draft acceptance rate = 0.98259 (  508 accepted /   517 generated)
statistics mtp: #calls(b,g,a) = 1 130 130, #gen drafts = 130, #acc drafts = 129, #gen tokens = 517, #acc tokens = 508, dur(b,g,a) = 0.001, 1239.635, 0.063 ms
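
To put the two runs side by side, the speedup and draft statistics can be derived from the numbers in the logs. This is plain arithmetic on the values above; the variable names are made up for illustration.

```python
# Values copied from the two llama-server timing blocks.
without_mtp = {"eval_ms": 54116.43, "eval_tokens": 639, "total_ms": 76603.26}
with_mtp = {"eval_ms": 35463.34, "eval_tokens": 639, "total_ms": 58867.34}


def tokens_per_second(run):
    """Generation speed from eval time and generated token count."""
    return run["eval_tokens"] / (run["eval_ms"] / 1000)


gen_speedup = tokens_per_second(with_mtp) / tokens_per_second(without_mtp)
total_speedup = without_mtp["total_ms"] / with_mtp["total_ms"]

# Draft statistics from the MTP run: 130 drafts, 517 tokens drafted, 508 accepted.
acceptance_rate = 508 / 517
accepted_per_draft = 508 / 130

print(f"generation speedup: {gen_speedup:.2f}x")        # 1.53x
print(f"end-to-end speedup: {total_speedup:.2f}x")      # 1.30x
print(f"acceptance rate:    {acceptance_rate:.3f}")     # 0.983
print(f"accepted per draft: {accepted_per_draft:.2f}")  # 3.91
```

Generation is roughly 1.5x faster with MTP in this run, while prompt processing is slightly slower (230.71 vs 221.67 tokens per second), so the end-to-end gain is closer to 1.3x.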

Model size: 753B params. Architecture: glm-dsa. Base model: zai-org/GLM-5.2.