---
base_model: zai-org/GLM-4.7-Flash
base_model_relation: quantized
language:
- en
- zh
library_name: gguf
license: mit
pipeline_tag: text-generation
tags:
- text-generation-inference
- glm
- moe
- flash
- gguf
- glm4_moe_lite
---
# GLM-4.7-Flash-GGUF
## Description
This repository contains **GGUF** format model files for [Zhipu AI's GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash).
**GLM-4.7-Flash** is a highly efficient **30B-A3B Mixture-of-Experts (MoE)** model. It is designed to be the strongest model in the 30B parameter class, offering a powerful option for lightweight deployment that perfectly balances performance and efficiency.
## Evaluation Results
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B-Thinking-2507 | GPT-OSS-20B |
|--------------------|---------------|-----------------------------|-------------|
| AIME 25 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| LCB v6 | 64.0 | 66.0 | 61.0 |
| HLE | 14.4 | 9.8 | 10.9 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |
## Files & Quantization
To see the available files, please verify the **Files and versions** tab.
## How to Run (llama.cpp)
**Recommended Parameters:**
* **Temperature:** `1.0` (Standard) or `0.7` (For stricter adherence)
* **Top-P:** `0.95`
* **Context:** `-c` (Adjust based on available RAM).
### CLI Example
```bash
./llama-cli -m GLM-4.7-Flash.Q4_K_M.gguf \
-c 8192 \
--temp 1.0 \
--top-p 0.95 \
-p "User: Write a Python script to calculate Fibonacci numbers.\nAssistant:" \
-cnv
```
### Server Example
```bash
./llama-server -m GLM-4.7-Flash.Q4_K_M.gguf \
--port 8080 \
--host 0.0.0.0 \
-c 16384 \
-ngl 99
```