Libraries

How to use sokann/GLM-5.2-GGUF-2.244bpw with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="sokann/GLM-5.2-GGUF-2.244bpw",
	filename="GLM-5.2-GGUF-2.244bpw.gguf",
)

# messages must be a list of role/content dicts, not a bare string.
llm.create_chat_completion(
	messages=[
		{"role": "user", "content": "Hello! What can you do?"}
	]
)
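
`create_chat_completion` returns an OpenAI-style dictionary. As a small sketch, a helper such as the hypothetical `reply_text` below pulls the assistant's text out of it. The sample response is made up for illustration and trimmed to the fields the helper reads.

```python
def reply_text(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    message = response["choices"][0]["message"]
    # `content` can be None, for example when the model only emits tool calls.
    return message.get("content") or ""


# Made-up response, trimmed to the fields this helper reads.
sample = {"choices": [{"message": {"role": "assistant", "content": "Hi there!"}}]}
print(reply_text(sample))
```

In practice the dictionary passed in would be the return value of `llm.create_chat_completion(...)`.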


llama.cpp

How to use sokann/GLM-5.2-GGUF-2.244bpw with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf sokann/GLM-5.2-GGUF-2.244bpw
# Run inference directly in the terminal:
llama cli -hf sokann/GLM-5.2-GGUF-2.244bpw

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf sokann/GLM-5.2-GGUF-2.244bpw
# Run inference directly in the terminal:
llama cli -hf sokann/GLM-5.2-GGUF-2.244bpw

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf sokann/GLM-5.2-GGUF-2.244bpw
# Run inference directly in the terminal:
./llama-cli -hf sokann/GLM-5.2-GGUF-2.244bpw

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf sokann/GLM-5.2-GGUF-2.244bpw
# Run inference directly in the terminal:
./build/bin/llama-cli -hf sokann/GLM-5.2-GGUF-2.244bpw
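
Once a server started by any of the options above is running, it can be queried over HTTP. The sketch below uses only the Python standard library and assumes the server listens on `http://localhost:8080/v1` (the address used in the Pi configuration further down). The helper names `build_chat_request` and `ask` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: default local server address
MODEL_ID = "sokann/GLM-5.2-GGUF-2.244bpw"


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(ask("Say hello in one sentence."))` should print the model's reply. Building the request in a separate function keeps the payload easy to inspect without a network call.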

Use Docker

docker model run hf.co/sokann/GLM-5.2-GGUF-2.244bpw

Ollama
How to use sokann/GLM-5.2-GGUF-2.244bpw with Ollama:
```
ollama run hf.co/sokann/GLM-5.2-GGUF-2.244bpw
```

Unsloth Studio

How to use sokann/GLM-5.2-GGUF-2.244bpw with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sokann/GLM-5.2-GGUF-2.244bpw to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sokann/GLM-5.2-GGUF-2.244bpw to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for sokann/GLM-5.2-GGUF-2.244bpw to start chatting

Pi

How to use sokann/GLM-5.2-GGUF-2.244bpw with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf sokann/GLM-5.2-GGUF-2.244bpw

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "sokann/GLM-5.2-GGUF-2.244bpw"
        }
      ]
    }
  }
}
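
If `models.json` already lists other providers, the entry can be merged in rather than pasted over the file. This is only a sketch: the function name `with_llama_cpp_provider` is hypothetical, and it assumes the file has exactly the layout shown above.

```python
import json


def with_llama_cpp_provider(config: dict, model_id: str) -> dict:
    """Return a copy of a Pi models.json dict with a llama-cpp provider added."""
    merged = json.loads(json.dumps(config))  # cheap deep copy of JSON data
    merged.setdefault("providers", {})["llama-cpp"] = {
        "baseUrl": "http://localhost:8080/v1",
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }
    return merged


# Example: start from an empty config and print the resulting JSON.
config = with_llama_cpp_provider({}, "sokann/GLM-5.2-GGUF-2.244bpw")
print(json.dumps(config, indent=2))
```

The printed JSON can then be saved to ~/.pi/agent/models.json.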

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use sokann/GLM-5.2-GGUF-2.244bpw with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf sokann/GLM-5.2-GGUF-2.244bpw

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default sokann/GLM-5.2-GGUF-2.244bpw

Run Hermes

hermes

Docker Model Runner
How to use sokann/GLM-5.2-GGUF-2.244bpw with Docker Model Runner:
```
docker model run hf.co/sokann/GLM-5.2-GGUF-2.244bpw
```

Lemonade

How to use sokann/GLM-5.2-GGUF-2.244bpw with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull sokann/GLM-5.2-GGUF-2.244bpw

Run and chat with the model

lemonade run user.GLM-5.2-GGUF-2.244bpw-{{QUANT_TAG}}

List all available models

lemonade list

GLM-5.2-GGUF-2.244bpw

This is a 2.2 BPW quantized model for the GPU-rich: those with more combined RAM + VRAM than common sense.

The quant aims for best-in-class performance by relying on SOTA quants from ik_llama.cpp:

- Routed experts tensors use the IQ2_KT quant
- All other tensors use the Q6_0 quant

Coupled with the recent enhancements of MTP support (#1890) and graph parallel support (#1821), it should run at decent speed as well.

Note: For now, we need to apply a small diff, ported from mainline #24770. (UPDATE: no need to apply this diff anymore, after the merging of https://github.com/ikawrakow/ik_llama.cpp/pull/2017)

Chat Template

The official chat template for GLM-5.2 differs a bit from GLM-5.1, and it doesn't seem to work correctly in ik_llama.cpp (and likely llama.cpp as well).

For symptoms, look out for this log pattern:

Cmmon part does not match fully
Cache :
</tool_response><|assistant|><think></think><tool_call>bash<arg_key>command</arg_key><arg_value>grep -rni "log" /home/sayap/repo/pi-mono/packages/c
prompt:
</tool_response><|assistant|><think>Let me search more specifically for logging request/response in the docs.</think><tool_call>bash<arg_key>command</arg_key><arg_value>grep -rni "log

where the reasoning content "Let me search more specifically for logging request/response in the docs." is actually for the previous tool call, while the current tool call, grep -rni "log" ..., was made without reasoning.

Once the reasoning content from the previous tool call has been wrongly duplicated into the current tool call, output quality degrades from that point on.

To fix this, I got this diff from GLM-5.2 using the z.ai coding plan:

--- chat_template.jinja.orig
+++ chat_template.jinja
@@ -62,13 +62,14 @@
 {%- elif m.role == 'assistant' -%}
 <|assistant|>
 {%- set content = visible_text(m.content) %}
+{%- set reasoning_content = "" %}
 {%- if m.reasoning_content is string %}
     {%- set reasoning_content = m.reasoning_content %}
 {%- elif '</think>' in content %}
     {%- set reasoning_content = content.split('</think>')[0].split('<think>')[-1] %}
     {%- set content = content.split('</think>')[-1] %}
 {%- endif %}
-{%- if ((clear_thinking is defined and not clear_thinking) or loop.index0 > ns.last_user_index) and reasoning_content is defined -%}
+{%- if ((clear_thinking is defined and not clear_thinking) or loop.index0 > ns.last_user_index) and reasoning_content -%}
 {{ '<think>' + reasoning_content +  '</think>'}}
 {%- else -%}
 {{ '<think></think>' }}

chat_template.jinja with the diff applied. (UPDATE: no need to modify the chat template anymore, after the merging of https://github.com/ikawrakow/ik_llama.cpp/pull/2018)
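
To see why the two changed lines matter, the selection logic can be emulated in plain Python. This is a sketch under an assumption the source does not state: that the template engine keeps `reasoning_content` alive from one loop iteration to the next, so a value set for one assistant message is still "defined" when the next one is rendered. The function `render_think_blocks` is hypothetical and leaves out the `clear_thinking` / `last_user_index` condition and the `'</think>' in content` branch.

```python
def render_think_blocks(assistant_turns, patched):
    """Render the <think> prefix for consecutive assistant turns."""
    rendered = []
    reasoning_content = None  # None stands in for "not defined yet"
    for turn in assistant_turns:
        if patched:
            reasoning_content = ""  # the added line: reset on every turn
        if isinstance(turn.get("reasoning_content"), str):
            reasoning_content = turn["reasoning_content"]
        if patched:
            keep = bool(reasoning_content)  # patched test: truthiness
        else:
            keep = reasoning_content is not None  # original test: "is defined"
        rendered.append("<think>" + (reasoning_content if keep else "") + "</think>")
    return rendered


turns = [
    {"reasoning_content": "Let me search the docs."},  # tool call with reasoning
    {},  # tool call made without reasoning
]
print(render_think_blocks(turns, patched=False))
print(render_think_blocks(turns, patched=True))
```

Under that assumption, the unpatched logic repeats the first turn's reasoning on the second turn, which is the pattern in the log above, while the patched logic emits an empty `<think></think>` for it.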

Size

Size from llama-server output:

llm_load_print_meta: model size       = 196.756 GiB (2.244 BPW)
llm_load_print_meta: repeating layers = 195.315 GiB (2.233 BPW, 751.427 B parameters)
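
The BPW figures follow directly from the sizes: bits per weight is the size in bits divided by the parameter count. A quick check using only the numbers logged above (the function names are hypothetical):

```python
GIB = 1024 ** 3  # bytes per GiB


def bits_per_weight(size_gib: float, n_params: float) -> float:
    """Average number of bits stored per parameter."""
    return size_gib * GIB * 8 / n_params


def implied_params(size_gib: float, bpw: float) -> float:
    """Parameter count implied by a size and a BPW figure."""
    return size_gib * GIB * 8 / bpw


repeating_bpw = bits_per_weight(195.315, 751.427e9)
total_params = implied_params(196.756, 2.244)
print(f"repeating layers: {repeating_bpw:.3f} BPW")
print(f"whole model: about {total_params / 1e9:.0f}B parameters")
```

The first value reproduces the logged BPW for the repeating layers to three decimals, and the second gives the approximate parameter count implied for the whole model.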

Buffer size with -cmoe --no-mmap:

llm_load_tensors: offloaded 80/80 layers to GPU
llm_load_tensors:        CPU buffer size = 187545.34 MiB
llm_load_tensors: CUDA_Split buffer size = 13916.52 MiB
llm_load_tensors:      CUDA1 buffer size =   961.97 MiB
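
For planning memory, it helps to convert these buffer sizes to GiB. The snippet below only restates the logged numbers in different units:

```python
MIB_PER_GIB = 1024

# Buffer sizes in MiB, copied from the log above.
buffers_mib = {
    "CPU": 187545.34,
    "CUDA_Split": 13916.52,
    "CUDA1": 961.97,
}

buffers_gib = {name: mib / MIB_PER_GIB for name, mib in buffers_mib.items()}
gpu_gib = buffers_gib["CUDA_Split"] + buffers_gib["CUDA1"]

for name, gib in buffers_gib.items():
    print(f"{name:>10}: {gib:7.2f} GiB")
print(f"{'GPU total':>10}: {gpu_gib:7.2f} GiB")
```

With `-cmoe`, the CPU buffer alone comes to roughly 183 GiB, so system RAM is the main constraint before any routed experts are moved to the GPUs.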

Quality

Recipe

# Attention
blk\..*\.attn_k_b\.weight=q6_0
blk\..*\.attn_v_b\.weight=q6_0

blk\..*\.attn_kv_a_mqa\.weight=q6_0
blk\..*\.attn_q_a\.weight=q6_0
blk\..*\.attn_q_b\.weight=q6_0
blk\..*\.attn_output\.weight=q6_0

# First 3 Dense Layers
blk\..*\.ffn_down\.weight=q6_0
blk\..*\.ffn_(gate|up)\.weight=q6_0

# Shared Expert Layers
blk\..*\.ffn_down_shexp\.weight=q6_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q6_0

# Routed Experts Layers
blk\..*\.ffn_(up|gate|down)_exps\.weight=iq2_kt

# Indexer
blk\..*\.indexer\.proj\.weight=q6_0
blk\..*\.indexer\.attn_k\.weight=q6_0
blk\..*\.indexer\.attn_q_b\.weight=q6_0

# NextN MTP Layer
blk\..*\.nextn\.embed_tokens\.weight=q6_0
blk\..*\.nextn\.shared_head_head\.weight=q6_0
blk\..*\.nextn\.eh_proj\.weight=q6_0

# Non-Repeating Layers
token_embd\.weight=q6_0
output\.weight=q6_0
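
Each recipe line is a regular expression on the tensor name followed by a quant type. The sketch below shows how a few of these rules resolve for sample tensor names. It assumes full-string matching with the first matching rule winning, which may differ from how the quantization tool applies the recipe, and the tensor names are examples that follow the naming the patterns imply.

```python
import re

# A few rules copied from the recipe above, in order.
RULES = [
    (r"blk\..*\.attn_q_a\.weight", "q6_0"),
    (r"blk\..*\.ffn_down\.weight", "q6_0"),
    (r"blk\..*\.ffn_(gate|up)_shexp\.weight", "q6_0"),
    (r"blk\..*\.ffn_(up|gate|down)_exps\.weight", "iq2_kt"),
    (r"token_embd\.weight", "q6_0"),
]


def quant_for(tensor_name: str):
    """Return the quant type of the first rule matching the tensor name."""
    for pattern, quant in RULES:
        if re.fullmatch(pattern, tensor_name):  # assumption: full-string match
            return quant
    return None  # no rule in this subset applies


for name in (
    "blk.0.ffn_down.weight",        # dense layer
    "blk.10.ffn_up_shexp.weight",   # shared expert
    "blk.10.ffn_down_exps.weight",  # routed experts
    "token_embd.weight",
):
    print(name, "->", quant_for(name))
```

Only the routed experts tensors resolve to `iq2_kt`; the dense `ffn_down` rule does not catch `ffn_down_exps` because the pattern requires `.weight` immediately after `ffn_down`.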

PPL result with wiki.test.raw:

Final estimate: PPL over 565 chunks for n_ctx=512 = 3.8402 +/- 0.02168

This quant uses the imatrix from unsloth (thanks!), and seems to perform well enough in actual tasks.

Flags

To get a usable context size, we have to sacrifice some prompt processing (PP) speed by using the much slower -mla 1, which needs less VRAM than the usual -mla 3.

These flags allow a 102400 context size on my machine with 184 GiB of RAM and 2 x 24 GiB of VRAM:

--no-mmap -ngl 99 -cmoe -sm graph \
-ot 'blk\.([345])\.ffn_.*_exps\.weight=CUDA0' \
-ot 'blk\.6\.ffn_(up|gate)_exps\.weight=CUDA0' \
-ot 'blk\.(40|41|42)\.ffn_.*_exps\.weight=CUDA1' \
-mla 1 -amb 512 \
-c 102400 -ctk q6_0 -khad \
-b 2048 -ub 2048 -wgt 1 \
-cram 0 -muge -cuda graphs=1 \
--jinja --parallel-tool-calls \
--chat-template-kwargs '{"reasoning_effort": "high"}' \
--spec-type mtp:n_max=4,p_min=0.5

- 11 routed experts tensors on CUDA0, 9 routed experts tensors on CUDA1, the rest on CPU.
- MTP needs some VRAM on the last GPU, i.e. CUDA1, so we put fewer routed experts tensors there.
- -mla 1 to squeeze a 102400 context in Q6, -khad to reduce quantization error.
- -amb 512 and -wgt 1 to reduce the CUDA compute buffer a little.
- -ub 2048 to allow GPU offload for prompt processing.
- --spec-type mtp:n_max=4,p_min=0.5 to enable MTP (draft up to 4 tokens, with at least 50% token probability).
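
The 11 / 9 split can be checked by applying the three -ot patterns to a list of routed experts tensor names. This is a sketch with assumptions: the block range 3 to 78 is a guess based on the "First 3 Dense Layers" comment in the recipe, and matching is done with `re.search`, first match wins.

```python
import re
from collections import Counter

# Override patterns copied from the -ot flags above, in order.
OVERRIDES = [
    (r"blk\.([345])\.ffn_.*_exps\.weight", "CUDA0"),
    (r"blk\.6\.ffn_(up|gate)_exps\.weight", "CUDA0"),
    (r"blk\.(40|41|42)\.ffn_.*_exps\.weight", "CUDA1"),
]

# Assumption: routed experts live in blocks 3..78, three tensors per block.
tensor_names = [
    f"blk.{layer}.ffn_{kind}_exps.weight"
    for layer in range(3, 79)
    for kind in ("up", "gate", "down")
]


def device_for(name: str) -> str:
    """Return the device of the first matching override, else CPU."""
    for pattern, device in OVERRIDES:
        if re.search(pattern, name):
            return device
    return "CPU"  # with -cmoe, the remaining routed experts stay on the CPU


counts = Counter(device_for(name) for name in tensor_names)
print(counts["CUDA0"], counts["CUDA1"])
```

Blocks 3, 4 and 5 contribute three tensors each and block 6 adds its up and gate tensors, giving 11 on CUDA0; blocks 40 to 42 give 9 on CUDA1. The two GPU counts do not depend on the assumed block range.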

Speed comparison for tasks that can really benefit from MTP:

without MTP:

prompt eval time =   22486.83 ms /  5188 tokens (    4.33 ms per token,   230.71 tokens per second)
       eval time =   54116.43 ms /   639 tokens (   84.69 ms per token,    11.81 tokens per second)
      total time =   76603.26 ms /  5827 tokens

with MTP:

prompt eval time =   23404.00 ms /  5188 tokens (    4.51 ms per token,   221.67 tokens per second)
       eval time =   35463.34 ms /   639 tokens (   55.50 ms per token,    18.02 tokens per second)
      total time =   58867.34 ms /  5827 tokens
draft acceptance rate = 0.98259 (  508 accepted /   517 generated)
statistics mtp: #calls(b,g,a) = 1 130 130, #gen drafts = 130, #acc drafts = 129, #gen tokens = 517, #acc tokens = 508, dur(b,g,a) = 0.001, 1239.635, 0.063 ms
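
The gain from MTP can be read off these logs with a little arithmetic. The snippet below only recomputes ratios from the logged numbers; the variable names are hypothetical.

```python
# Timings (ms) and token counts copied from the two runs above.
without_mtp = {"eval_ms": 54116.43, "total_ms": 76603.26, "tokens": 639}
with_mtp = {"eval_ms": 35463.34, "total_ms": 58867.34, "tokens": 639}


def tokens_per_second(run: dict) -> float:
    """Generation speed during the eval phase."""
    return run["tokens"] / (run["eval_ms"] / 1000)


eval_speedup = without_mtp["eval_ms"] / with_mtp["eval_ms"]
total_speedup = without_mtp["total_ms"] / with_mtp["total_ms"]
acceptance_rate = 508 / 517     # accepted / generated draft tokens
accepted_per_call = 508 / 130   # accepted draft tokens per MTP call

print(f"eval speedup:      {eval_speedup:.2f}x")
print(f"total speedup:     {total_speedup:.2f}x")
print(f"acceptance rate:   {acceptance_rate:.5f}")
print(f"accepted per call: {accepted_per_call:.2f}")
```

Generation is roughly 1.5x faster with MTP on this task, while the end-to-end gain is smaller (about 1.3x) because prompt processing time is essentially unchanged.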

Format: GGUF
Model size: 753B params
Architecture: glm-dsa

Model tree for sokann/GLM-5.2-GGUF-2.244bpw

Base model: zai-org/GLM-5.2
Quantized: this model (one of 73)