GLM-5.2-GGUF-2.244bpw
This is a 2.2 BPW quantized model for the GPU-rich, those with more combined RAM + VRAM than common sense.
The quant aims for best-in-class performance by relying on SOTA quants from ik_llama.cpp:
- Routed experts tensors use the IQ2_KT quant
- All other tensors use the Q6_0 quant
Coupled with the recent enhancements of MTP support (#1890) and graph parallel support (#1821), it should run at decent speed as well.
Note: For now, we need to apply a small diff, ported from mainline #24770. (UPDATE: no need to apply this diff anymore, after the merging of https://github.com/ikawrakow/ik_llama.cpp/pull/2017)
Chat Template
The official chat template for GLM-5.2 differs a bit from GLM-5.1, and it doesn't seem to work correctly in ik_llama.cpp (and likely llama.cpp as well).
For symptoms, look out for this log pattern:
Cmmon part does not match fully
Cache :
</tool_response><|assistant|><think></think><tool_call>bash<arg_key>command</arg_key><arg_value>grep -rni "log" /home/sayap/repo/pi-mono/packages/c
prompt:
</tool_response><|assistant|><think>Let me search more specifically for logging request/response in the docs.</think><tool_call>bash<arg_key>command</arg_key><arg_value>grep -rni "log
where the reasoning content "Let me search more specifically for logging request/response in the docs." is actually for the previous tool call, while the current tool call, grep -rni "log" ..., was made without reasoning.
Once the reasoning content from the previous tool call has been wrongly duplicated onto the current tool call, the quality of the output degrades from that point on.
To fix this, I got this diff from GLM-5.2 using the z.ai coding plan:
--- chat_template.jinja.orig
+++ chat_template.jinja
@@ -62,13 +62,14 @@
{%- elif m.role == 'assistant' -%}
<|assistant|>
{%- set content = visible_text(m.content) %}
+{%- set reasoning_content = "" %}
{%- if m.reasoning_content is string %}
{%- set reasoning_content = m.reasoning_content %}
{%- elif '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0].split('<think>')[-1] %}
{%- set content = content.split('</think>')[-1] %}
{%- endif %}
-{%- if ((clear_thinking is defined and not clear_thinking) or loop.index0 > ns.last_user_index) and reasoning_content is defined -%}
+{%- if ((clear_thinking is defined and not clear_thinking) or loop.index0 > ns.last_user_index) and reasoning_content -%}
{{ '<think>' + reasoning_content + '</think>'}}
{%- else -%}
{{ '<think></think>' }}
chat_template.jinja with the diff applied. (UPDATE: no need to modify the chat template anymore, after the merging of https://github.com/ikawrakow/ik_llama.cpp/pull/2018)
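The effect of the diff can be sketched in plain Python. This is only an emulation of the assistant branch shown above, under the assumption that the symptom comes from `reasoning_content` surviving from one loop iteration to the next inside the template engine. The function name `render_think_blocks` and the `leaky` switch are hypothetical and exist only for this illustration.

```python
def render_think_blocks(messages, last_user_index, leaky, clear_thinking=None):
    """Emulate the <think> part of the assistant branch of the template.

    leaky=True  models the unpatched template, assuming the engine keeps
                reasoning_content alive across loop iterations.
    leaky=False models the patched template, which resets it to "" first.
    """
    blocks = []
    reasoning = None  # None stands in for "undefined" in Jinja
    for index, m in enumerate(messages):
        if m["role"] != "assistant":
            continue
        content = m.get("content") or ""
        if not leaky:
            reasoning = ""  # the added `set reasoning_content = ""` line
        if isinstance(m.get("reasoning_content"), str):
            reasoning = m["reasoning_content"]
        elif "</think>" in content:
            reasoning = content.split("</think>")[0].split("<think>")[-1]
        # clear_thinking=None stands in for "clear_thinking is not defined"
        keep = clear_thinking is False or index > last_user_index
        # Unpatched: "is defined" test. Patched: truthiness test.
        has_reasoning = (reasoning is not None) if leaky else bool(reasoning)
        if keep and has_reasoning:
            blocks.append("<think>" + reasoning + "</think>")
        else:
            blocks.append("<think></think>")
    return blocks


# Made-up conversation: the second tool call carries no reasoning.
history = [
    {"role": "user", "content": "find the logging docs"},
    {"role": "assistant", "content": "", "reasoning_content": "Let me search the docs."},
    {"role": "tool", "content": "..."},
    {"role": "assistant", "content": ""},
]

leaky_blocks = render_think_blocks(history, last_user_index=0, leaky=True)
patched_blocks = render_think_blocks(history, last_user_index=0, leaky=False)
print(leaky_blocks)
print(patched_blocks)
```

In the leaky variant the second assistant turn repeats the reasoning of the first, which is the shape of the prompt/cache mismatch in the log above. In the patched variant it renders an empty think block.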
Size
Size from llama-server output:
llm_load_print_meta: model size = 196.756 GiB (2.244 BPW)
llm_load_print_meta: repeating layers = 195.315 GiB (2.233 BPW, 751.427 B parameters)
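As a sanity check, the size and BPW figures can be turned back into a parameter count. The sketch below assumes that the reported size is in GiB (1024^3 bytes) and that BPW means bits per weight averaged over all tensors; `params_in_billions` is a hypothetical helper name.

```python
GIB = 1024 ** 3  # bytes per GiB


def params_in_billions(size_gib, bits_per_weight):
    """Estimate the parameter count (in billions) from size and BPW."""
    total_bits = size_gib * GIB * 8
    return total_bits / bits_per_weight / 1e9


repeating = params_in_billions(195.315, 2.233)
whole_model = params_in_billions(196.756, 2.244)
print(f"repeating layers: {repeating:.1f} B, whole model: {whole_model:.1f} B")
```

The estimate for the repeating layers lands close to the reported 751.427 B parameters. The small gap presumably comes from the size and BPW values being rounded to three decimals.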
Buffer size with -cmoe --no-mmap:
llm_load_tensors: offloaded 80/80 layers to GPU
llm_load_tensors: CPU buffer size = 187545.34 MiB
llm_load_tensors: CUDA_Split buffer size = 13916.52 MiB
llm_load_tensors: CUDA1 buffer size = 961.97 MiB
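A quick way to read these numbers is to add them up and see how much of the model stays in host memory. This is a minimal sketch that only uses the three buffer sizes printed above; the variable names are made up for the example.

```python
# Buffer sizes in MiB, copied from the llm_load_tensors lines above.
buffers_mib = {"CPU": 187545.34, "CUDA_Split": 13916.52, "CUDA1": 961.97}

total_mib = sum(buffers_mib.values())
total_gib = total_mib / 1024
cpu_gib = buffers_mib["CPU"] / 1024
cpu_share = buffers_mib["CPU"] / total_mib

print(f"total: {total_gib:.2f} GiB, CPU: {cpu_gib:.2f} GiB ({cpu_share:.1%})")
```

With -cmoe alone, roughly 93% of the weights sit in the CPU buffer, which is why the flags further down move some routed experts tensors onto the GPUs.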
Quality
Recipe
# Attention
blk\..*\.attn_k_b\.weight=q6_0
blk\..*\.attn_v_b\.weight=q6_0
blk\..*\.attn_kv_a_mqa\.weight=q6_0
blk\..*\.attn_q_a\.weight=q6_0
blk\..*\.attn_q_b\.weight=q6_0
blk\..*\.attn_output\.weight=q6_0
# First 3 Dense Layers
blk\..*\.ffn_down\.weight=q6_0
blk\..*\.ffn_(gate|up)\.weight=q6_0
# Shared Expert Layers
blk\..*\.ffn_down_shexp\.weight=q6_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q6_0
# Routed Experts Layers
blk\..*\.ffn_(up|gate|down)_exps\.weight=iq2_kt
# Indexer
blk\..*\.indexer\.proj\.weight=q6_0
blk\..*\.indexer\.attn_k\.weight=q6_0
blk\..*\.indexer\.attn_q_b\.weight=q6_0
# NextN MTP Layer
blk\..*\.nextn\.embed_tokens\.weight=q6_0
blk\..*\.nextn\.shared_head_head\.weight=q6_0
blk\..*\.nextn\.eh_proj\.weight=q6_0
# Non-Repeating Layers
token_embd\.weight=q6_0
output\.weight=q6_0
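Each recipe line is a `regex=type` pair. The sketch below shows how such rules could be resolved against tensor names, assuming first-match-wins with an unanchored regex search; the actual matching in the quantize tool may differ. `parse_recipe` and `quant_for` are hypothetical helpers, the rules are a subset of the recipe above, and the tensor names are illustrative.

```python
import re

# A subset of the recipe above, in the same "regex=type" form.
RECIPE = r"""
# Attention
blk\..*\.attn_output\.weight=q6_0
# First 3 Dense Layers
blk\..*\.ffn_down\.weight=q6_0
# Shared Expert Layers
blk\..*\.ffn_(gate|up)_shexp\.weight=q6_0
# Routed Experts Layers
blk\..*\.ffn_(up|gate|down)_exps\.weight=iq2_kt
# Non-Repeating Layers
token_embd\.weight=q6_0
"""


def parse_recipe(text):
    """Turn recipe text into a list of (compiled regex, quant type)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and section comments
        pattern, quant = line.rsplit("=", 1)
        rules.append((re.compile(pattern), quant))
    return rules


def quant_for(name, rules):
    """Return the quant type of the first rule matching the tensor name."""
    for pattern, quant in rules:
        if pattern.search(name):
            return quant
    return None


rules = parse_recipe(RECIPE)
for name in [
    "blk.0.ffn_down.weight",
    "blk.10.ffn_down_exps.weight",
    "blk.10.ffn_up_shexp.weight",
    "token_embd.weight",
]:
    print(name, "->", quant_for(name, rules))
```

Note that the dense `ffn_down` rule does not catch `ffn_down_exps`, because the pattern requires `.weight` directly after `ffn_down`.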
PPL result with wiki.test.raw:
Final estimate: PPL over 565 chunks for n_ctx=512 = 3.8402 +/- 0.02168
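For readers unfamiliar with the metric: perplexity is the exponential of the mean negative log-likelihood per token, so lower is better. The sketch below uses that textbook definition with made-up log-probabilities; it is not the perplexity tool's implementation, and `perplexity` is a hypothetical helper.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


# Toy input: every token predicted with probability 0.5.
toy_ppl = perplexity([math.log(0.5)] * 4)
print(toy_ppl)

# Going the other way: mean NLL in nats per token for the PPL reported above.
mean_nll = math.log(3.8402)
print(f"{mean_nll:.4f} nats per token")
```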
This quant uses the imatrix from unsloth (thanks!), and seems to perform well enough in actual tasks.
Flags
To get a usable context size, we have to sacrifice some prompt processing (PP) speed by going with the much slower -mla 1, which uses less VRAM than the usual -mla 3.
These flags allow a 102400 context size on my machine with 184 GiB of RAM and 2 x 24 GiB of VRAM:
--no-mmap -ngl 99 -cmoe -sm graph \
-ot 'blk\.([345])\.ffn_.*_exps\.weight=CUDA0' \
-ot 'blk\.6\.ffn_(up|gate)_exps\.weight=CUDA0' \
-ot 'blk\.(40|41|42)\.ffn_.*_exps\.weight=CUDA1' \
-mla 1 -amb 512 \
-c 102400 -ctk q6_0 -khad \
-b 2048 -ub 2048 -wgt 1 \
-cram 0 -muge -cuda graphs=1 \
--jinja --parallel-tool-calls \
--chat-template-kwargs '{"reasoning_effort": "high"}' \
--spec-type mtp:n_max=4,p_min=0.5
- 11 routed experts tensors on CUDA0, 9 routed experts tensors on CUDA1, the rest on CPU.
- MTP needs some VRAM on the last GPU, i.e. CUDA1, so we put less routed experts tensors there.
- -mla 1 to squeeze 102400 context in Q6, -khad to reduce quantization error.
- -amb 512 and -wgt 1 to reduce CUDA compute buffer a little.
- -ub 2048 to allow GPU offload for prompt processing.
- --spec-type mtp:n_max=4,p_min=0.5 to enable MTP (draft up to 4 tokens, with at least 50% token probability).
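The tensor counts in the first bullet can be checked by applying the three -ot patterns to a list of routed experts tensor names. This sketch assumes first-match-wins with an unanchored regex search, and a hypothetical layer range of 3 to 78; the exact range does not affect the GPU counts as long as layers 3 to 6 and 40 to 42 are included.

```python
import re

# The three -ot overrides from the flags above, as (regex, device) pairs.
OVERRIDES = [
    (r"blk\.([345])\.ffn_.*_exps\.weight", "CUDA0"),
    (r"blk\.6\.ffn_(up|gate)_exps\.weight", "CUDA0"),
    (r"blk\.(40|41|42)\.ffn_.*_exps\.weight", "CUDA1"),
]


def placement(name):
    """Return the device of the first matching override, else CPU (-cmoe)."""
    for pattern, device in OVERRIDES:
        if re.search(pattern, name):
            return device
    return "CPU"


# Hypothetical layer range, used only to generate names to test against.
names = [
    f"blk.{layer}.ffn_{kind}_exps.weight"
    for layer in range(3, 79)
    for kind in ("up", "gate", "down")
]

counts = {}
for name in names:
    device = placement(name)
    counts[device] = counts.get(device, 0) + 1
print(counts)
```

Layers 3 to 5 contribute 9 tensors and layer 6 contributes 2 (up and gate only), giving 11 on CUDA0, while layers 40 to 42 give 9 on CUDA1.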
Speed comparison for tasks that can really benefit from MTP:
- without MTP:
prompt eval time = 22486.83 ms / 5188 tokens ( 4.33 ms per token, 230.71 tokens per second)
eval time = 54116.43 ms / 639 tokens ( 84.69 ms per token, 11.81 tokens per second)
total time = 76603.26 ms / 5827 tokens
- with MTP:
prompt eval time = 23404.00 ms / 5188 tokens ( 4.51 ms per token, 221.67 tokens per second)
eval time = 35463.34 ms / 639 tokens ( 55.50 ms per token, 18.02 tokens per second)
total time = 58867.34 ms / 5827 tokens
draft acceptance rate = 0.98259 ( 508 accepted / 517 generated)
statistics mtp: #calls(b,g,a) = 1 130 130, #gen drafts = 130, #acc drafts = 129, #gen tokens = 517, #acc tokens = 508, dur(b,g,a) = 0.001, 1239.635, 0.063 ms
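The headline numbers can be derived from the two logs above. The snippet below is plain arithmetic on the logged values; the variable names are made up for the example.

```python
# Generation speed (tokens per second) from the two "eval time" lines.
baseline_tps = 11.81
mtp_tps = 18.02
speedup = mtp_tps / baseline_tps

# Draft statistics from the MTP run.
accepted, generated, draft_calls = 508, 517, 130
acceptance_rate = accepted / generated
accepted_per_draft = accepted / draft_calls

# End-to-end saving, including the slightly slower prompt processing.
total_time_saved = 1 - 58867.34 / 76603.26

print(f"speedup: {speedup:.2f}x")
print(f"acceptance rate: {acceptance_rate:.5f}")
print(f"accepted tokens per draft: {accepted_per_draft:.2f}")
print(f"total time saved: {total_time_saved:.1%}")
```

Generation is about 1.5x faster with MTP in this run, with close to 4 accepted tokens per draft, which is near the n_max=4 ceiling.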
Model tree for sokann/GLM-5.2-GGUF-2.244bpw
Base model
zai-org/GLM-5.2