mlx-community/Kimi-K2.6-mlx-DQ3_K_M-q8

This model mlx-community/Kimi-K2.6-mlx-DQ3_K_M-q8 was converted to MLX format from moonshotai/Kimi-K2.6 using mlx-lm version 0.31.2.

After the success of the first Kimi "DQ3_K_M" model and the K2.5, this is a new update for Kimi-K2.6!

This is created for people using a single Apple Mac Studio M3 Ultra with 512 GB. The 4-bit version of Kimi K2 does not fit. Using research results, we aim to get 4-bit performance from a slightly smaller and smarter quantization. It should also not be so large that it leaves no memory for a useful context window.

You can find more similar MLX model quants for Apple Mac Studio with 512 GB at https://huggingface.co/bibproj

pip install mlx-lm

mlx_lm.generate --model mlx-community/Kimi-K2.6-mlx-DQ3_K_M-q8--temp 0.6 --min-p 0.01 --max-tokens 4096 --trust-remote-code --prompt "Hallo"

What is this DQ3_K_M?

In the Arxiv paper Quantitative Analysis of Performance Drop in DeepSeek Model Quantization the authors write,

We further propose DQ3_K_M, a dynamic 3-bit quantization method that significantly outperforms traditional Q3_K_M variant on various benchmarks, which is also comparable with 4-bit quantization (Q4_K_M) approach in most tasks.

and

dynamic 3-bit quantization method (DQ3_K_M) that outperforms the 3-bit quantization implementation in llama.cpp and achieves performance comparable to 4-bit quantization across multiple benchmarks.

The resulting multi-bitwidth quantization has been well tested and documented.


How can you create your own DQ3_K_M quants?

The recipe is the same as that for the K2.5 model. Both are a bit different from that of the first Kimi "DQ3_K_M" model, which was described there. To make to the quant perform better under stress, only the expert tensors are quantized to a mix of 3-bit and 4-bit. All the other tensors are kept at 8-bit. You could say that this quant has an 8-bit "brain" and 3-bit/4-bit experts. The sizes of all three these quants are roughly the same. The 8-bit routing does reduce the tokens/second by a few %. You get a slightly slower TG, but better quality results.

In the convert.py file of mlx-lm on your system ( you can see the original code here ), replace the code inside def mixed_quant_predicate() with something like

        index = (
            int(path.split(".")[layer_location])
            if len(path.split(".")) > layer_location
            else 0
        )
        # Build a mixed quant like "DQ3" similar to the "DQ3" of Arxiv paper https://arxiv.org/abs/2505.02390
        #    Quantitative Analysis of Performance Drop in DeepSeek Model Quantization
        q_bits = 8
        if "switch_mlp.up_proj" in path:
           q_bits = 3
        if "switch_mlp.gate_proj" in path:
           q_bits = 3
        if "switch_mlp.down_proj" in path:
           q_bits = 3
           # Layers up to 5 are higher quality
           if index < 5:
              q_bits = 5
           # Every 5th layer is "medium" quality
           if (index % 5) == 0:
              q_bits = 4
        print("path:", path, "index:", index, "q_bits:", q_bits)
        return {"group_size": group_size, "bits": q_bits, "mode": mode}

Then create your DQ3_K_M quant with

mlx_lm.convert --hf-path moonshotai/Kimi-K2.6 --mlx-path your-model-DQ3_K_M -q --quant-predicate mixed_3_4 --trust-remote-code

NOTE*: With Kimi-K2.5 and Kimi-K2.6 you need to first dequantize the model before you can create the MLX quant. This step requires just over 2TB of additional disk space.


Enjoy!

Downloads last month
226,485
Safetensors
Model size
1T params
Tensor type
BF16
·
U32
·
F32
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for mlx-community/Kimi-K2.6-mlx-DQ3_K_M-q8

Quantized
(34)
this model

Paper for mlx-community/Kimi-K2.6-mlx-DQ3_K_M-q8