---
license: mit
library_name: mlx
tags:
- mlx
- transformers
pipeline_tag: text-generation
base_model: zai-org/GLM-5.2
---

# mlx-community/GLM-5.2-DQ4plus-q8

This model [mlx-community/GLM-5.2-DQ4plus-q8](https://huggingface.co/mlx-community/GLM-5.2-DQ4plus-q8) was converted to MLX format from [zai-org/GLM-5.2](https://huggingface.co/zai-org/GLM-5.2) using mlx-lm version **0.31.3** (with PR #1410).

This quant is made for people using a single Apple Mac Studio M3 Ultra with 512 GB of unified memory. The 4-bit version of GLM-5.2 fits comfortably, but we can do better. Building on published research (described below), we aim for better output quality from a slightly larger, smarter mixed-precision quantization, while keeping it small enough to leave memory for a useful context window.
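To reason about that trade-off, here is a rough sketch that estimates weight memory for a given bit-width mix. It assumes MLX affine quantization with one 16-bit scale and one 16-bit bias per group of `group_size` weights, so 4-bit costs about 4.5 bits per weight at group size 64. The parameter split is made up for illustration and is NOT GLM-5.2's real parameter count. The helper names `bits_per_weight()` and `weights_gib()` are hypothetical. Whatever memory is left after the weights has to hold the KV cache (the context window), macOS and other apps.

```python
GIB = 2**30  # bytes per GiB


def bits_per_weight(bits: int, group_size: int = 64) -> float:
    """Storage cost per weight, assuming a 16-bit scale + 16-bit bias per group."""
    return bits + 32 / group_size


def weights_gib(params_by_bits: dict, group_size: int = 64) -> float:
    """Total weight size in GiB for a {bits: parameter_count} mix."""
    total_bits = sum(
        count * bits_per_weight(bits, group_size)
        for bits, count in params_by_bits.items()
    )
    return total_bits / 8 / GIB


# HYPOTHETICAL split (placeholder numbers, NOT GLM-5.2's real parameter counts):
# 600e9 routed-expert weights plus 50e9 other ("brain") weights.
uniform_4bit = {4: 650e9}
dq4plus_like = {
    4: 400e9,  # expert up + gate
    5: 150e9,  # expert down in most layers
    6: 50e9,   # expert down in the 6-bit layers
    8: 50e9,   # everything else
}

# Assumption: treat the "512 GB" Mac Studio as 512 GiB, ignoring OS overhead.
MEMORY_GIB = 512
for name, mix in [("uniform 4-bit", uniform_4bit), ("DQ4plus-like", dq4plus_like)]:
    size = weights_gib(mix)
    print(f"{name}: {size:.0f} GiB of weights, {MEMORY_GIB - size:.0f} GiB left")
```

Replace the placeholder counts with real per-tensor parameter counts (for example, summed from the checkpoint's weight shapes) to get a meaningful estimate.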

You can find more MLX quants sized for an Apple Mac Studio with 512 GB at https://huggingface.co/bibproj

```bash
pip install mlx-lm

mlx_lm.generate --model mlx-community/GLM-5.2-DQ4plus-q8 --prompt "Hello"
```

---

## What is this DQ4plus-q8?

In the arXiv paper [Quantitative Analysis of Performance Drop in DeepSeek Model Quantization](https://arxiv.org/abs/2505.02390), the authors write:

> We further propose `DQ3_K_M`, a dynamic 3-bit quantization method that significantly outperforms traditional `Q3_K_M` variant on various benchmarks, which is also comparable with 4-bit quantization (`Q4_K_M`) approach in most tasks.

and

> dynamic 3-bit quantization method (`DQ3_K_M`) that outperforms the 3-bit quantization implementation in `llama.cpp` and achieves performance comparable to 4-bit quantization across multiple benchmarks.

The paper evaluates this mixed-bit-width approach on multiple benchmarks and documents how it allocates bits.

---

## How can you create your own DQ4plus-q8 quants?

This time the recipe differs a bit from a standard DQ3_K_M. To make the quant hold up better under stress, only the `up` and `gate` expert tensors are quantized to 4-bit, and the `down` expert tensors to a mix of 5-bit and 6-bit. All other tensors are kept at 8-bit. You could say that this quant has an 8-bit "brain" and 4-bit/5-bit/6-bit experts.

In the `convert.py` file of mlx-lm on your system ([you can see the original code here](https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/convert.py)), replace the body of `def mixed_quant_predicate()` with something like the following. Keep the existing code that computes the layer `index` from `path`, because the new code relies on it.

```python
        # Build a mixed quant like "DQ4plus-q8", similar to the "DQ3" of arXiv paper https://arxiv.org/abs/2505.02390
        #    Quantitative Analysis of Performance Drop in DeepSeek Model Quantization
        # `index` (layer number), `group_size` and `mode` must already be defined at this point.
        q_bits = 8  # default: everything except the routed experts (the "brain") stays 8-bit
        # For "switch experts" (routed MoE experts)
        if "switch_mlp.up_proj" in path:
            q_bits = 4
        if "switch_mlp.gate_proj" in path:
            q_bits = 4
        if "switch_mlp.down_proj" in path:
            q_bits = 5
            # The first five blocks (index 0-4) get 6 bits
            if index < 5:
                q_bits = 6
            # Every 5th block (index 0, 5, 10, ...) also gets 6 bits
            if (index % 5) == 0:
                q_bits = 6
        # Log each decision so you can check the bit allocation during conversion
        print("path:", path, "index:", index, "q_bits:", q_bits)
        return {"group_size": group_size, "bits": q_bits, "mode": mode}
```
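To preview the bit allocation without running a full conversion, here is a standalone copy of the same rules. It is a sketch under assumptions: the hypothetical `layer_index()` helper takes the first numeric component of the module path as the layer number, the path format `model.layers.<i>.mlp.switch_mlp.down_proj` and the 40-layer count are illustrative only, and `GROUP_SIZE`/`MODE` stand in for the `group_size` and `mode` values that mlx-lm provides.

```python
from collections import Counter

GROUP_SIZE = 64   # assumption: stand-in for mlx-lm's group_size
MODE = "affine"   # assumption: stand-in for mlx-lm's mode


def layer_index(path: str) -> int:
    """Hypothetical helper: first numeric path component, else 0."""
    for part in path.split("."):
        if part.isdigit():
            return int(part)
    return 0


def dq4plus_predicate(path: str, module=None) -> dict:
    """Hypothetical standalone copy of the DQ4plus-q8 bit assignment."""
    index = layer_index(path)
    q_bits = 8  # non-expert tensors stay at 8-bit
    if "switch_mlp.up_proj" in path or "switch_mlp.gate_proj" in path:
        q_bits = 4
    if "switch_mlp.down_proj" in path:
        # 6-bit for the first five layers and every 5th layer, otherwise 5-bit
        q_bits = 6 if index < 5 or index % 5 == 0 else 5
    return {"group_size": GROUP_SIZE, "bits": q_bits, "mode": MODE}


# HYPOTHETICAL layer count, used only to show the pattern.
NUM_LAYERS = 40
down_bits = Counter(
    dq4plus_predicate(f"model.layers.{i}.mlp.switch_mlp.down_proj")["bits"]
    for i in range(NUM_LAYERS)
)
print(down_bits)
```

With 40 layers this puts 12 `down_proj` layers at 6 bits (0-4, then 5, 10, ..., 35) and 28 at 5 bits; set `NUM_LAYERS` to the real layer count to see the actual split.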

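Then create your GLM-5.2-DQ4plus-q8 quant with the command below. Because the edited predicate chooses every bit width itself, the `mixed_3_4` recipe name mainly serves to route the conversion through `mixed_quant_predicate()`.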

```bash
mlx_lm.convert --hf-path zai-org/GLM-5.2 --mlx-path GLM-5.2-DQ4plus-q8 -q --quant-predicate mixed_3_4
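After the conversion finishes, you can check what was actually written. The sketch below tallies bit widths from the output folder's `config.json`. It assumes, based on our understanding of recent mlx-lm versions, that per-module overrides are stored as dict entries (keyed by module path) inside the `quantization` section; `load_config()` and `bits_histogram()` are hypothetical helper names.

```python
import json
from collections import Counter
from pathlib import Path


def load_config(model_dir: str) -> dict:
    """Read config.json from a converted MLX model folder."""
    return json.loads((Path(model_dir) / "config.json").read_text())


def bits_histogram(config: dict) -> Counter:
    """Count how many modules got each bit width.

    Assumption: per-module settings are dicts inside config["quantization"];
    the top-level "bits"/"group_size" entries are plain ints and are skipped.
    """
    quant = config.get("quantization", {})
    return Counter(
        entry["bits"]
        for entry in quant.values()
        if isinstance(entry, dict) and "bits" in entry
    )
```

For example, `print(bits_histogram(load_config("GLM-5.2-DQ4plus-q8")))` should list counts for 4, 5, 6 and 8 bits that match the recipe.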
```

---

Enjoy!