---
license: mit
library_name: mlx
tags:
- mlx
- transformers
pipeline_tag: text-generation
base_model: zai-org/GLM-5.2
---
# mlx-community/GLM-5.2-DQ4plus-q8
This model [mlx-community/GLM-5.2-DQ4plus-q8](https://huggingface.co/mlx-community/GLM-5.2-DQ4plus-q8) was converted to MLX format from [zai-org/GLM-5.2](https://huggingface.co/zai-org/GLM-5.2) using mlx-lm version **0.31.3** (with PR #1410).
This quant is intended for people using a single Apple Mac Studio M3 Ultra with 512 GB of memory. The 4-bit version of GLM-5.2 fits comfortably, but we can do better: guided by published research, we aim for better results from a slightly larger and smarter quantization, while keeping it small enough to leave memory for a useful context window.
You can find similar MLX quants for the Apple Mac Studio with 512 GB at https://huggingface.co/bibproj
```bash
pip install mlx-lm
mlx_lm.generate --model mlx-community/GLM-5.2-DQ4plus-q8 --prompt "Hallo"
```
---
## What is this DQ4plus-q8?
In the arXiv paper [Quantitative Analysis of Performance Drop in DeepSeek Model Quantization](https://arxiv.org/abs/2505.02390) the authors write:
> We further propose `DQ3_K_M`, a dynamic 3-bit quantization method that significantly outperforms traditional `Q3_K_M` variant on various benchmarks, which is also comparable with 4-bit quantization (`Q4_K_M`) approach in most tasks.
and
> dynamic 3-bit quantization method (`DQ3_K_M`) that outperforms the 3-bit quantization implementation in `llama.cpp` and achieves performance comparable to 4-bit quantization across multiple benchmarks.
The resulting multi-bitwidth quantization has been well tested and documented.
---
## How can you create your own DQ4plus-q8 quants?
This time the recipe is a bit different from that of a normal DQ3_K_M. To make the quant perform better under stress, only the `up` and `gate` expert tensors are quantized to 4-bit, and the `down` expert tensors to a mix of 5-bit and 6-bit. All other tensors are kept at 8-bit. You could say that this quant has an 8-bit "brain" and 4-bit/5-bit/6-bit experts.
In the `convert.py` file of mlx-lm on your system ([you can see the original code here](https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/convert.py)), replace the code inside `def mixed_quant_predicate()` with something like
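To get a feel for what each bit width costs in memory, here is a rough sketch of the effective storage per weight. It assumes affine group quantization that stores one 16-bit scale and one 16-bit bias per group of 64 weights; the helper name `effective_bits` is made up for this illustration and is not part of mlx-lm.

```python
# Rough storage cost per weight for affine group quantization.
# Assumption: every group of `group_size` weights carries one 16-bit scale
# and one 16-bit bias in addition to the packed quantized weights.
def effective_bits(bits: int, group_size: int = 64) -> float:
    overhead = (16 + 16) / group_size  # scale + bias, spread over the group
    return bits + overhead


for bits in (4, 5, 6, 8):
    print(f"{bits}-bit -> {effective_bits(bits):.2f} bits per weight")
```

Under that assumption, every tensor pays the same half-bit overhead, so moving the `down` experts from 4-bit to 5-bit or 6-bit grows those tensors by roughly a fifth to two fifths.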
```python
# Build a mixed quant like "DQ4plus-q8" similar to the "DQ3" of arXiv paper https://arxiv.org/abs/2505.02390
# Quantitative Analysis of Performance Drop in DeepSeek Model Quantization
# Note: `index` (the block number), `group_size` and `mode` are expected to be
# defined as in the original convert.py code.
q_bits = 8
# For "switch experts"
if "switch_mlp.up_proj" in path:
    q_bits = 4
if "switch_mlp.gate_proj" in path:
    q_bits = 4
if "switch_mlp.down_proj" in path:
    q_bits = 5
    # The first 5 blocks (index 0 to 4) get the higher 6-bit quality
    if index < 5:
        q_bits = 6
    # Every 5th block also gets 6-bit
    if (index % 5) == 0:
        q_bits = 6
print("path:", path, "index:", index, "q_bits:", q_bits)
return {"group_size": group_size, "bits": q_bits, "mode": mode}
```
Then create your GLM-5.2-DQ4plus-q8 quant with
```bash
mlx_lm.convert --hf-path zai-org/GLM-5.2 --mlx-path GLM-5.2-DQ4plus-q8 -q --quant-predicate mixed_3_4
```
---
Enjoy!