---
language: en
tags:
- mlx
pipeline_tag: text-generation
library_name: mlx
license_name: modified-mit
base_model:
- moonshotai/Kimi-K2.5
---

[Kimi K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) optimized to run _even more comfortably_ on a Mac Studio M3 512GB.

My [2.8 bit quants](https://huggingface.co/spicyneuron/Kimi-K2.5-MLX-2.8bit) fit into 380GB of memory. This 2.5 bit one hovers around 350GB, while matching the original 2.8 bit quant in quality.

The main motivation to compress even further was to support a full "Claude Code in a box" system, which requires not just an Opus replacement (Kimi K2.5) but also Haiku and Sonnet replacements (Qwen 3.5) for background tasks and subagents.

# Usage

```sh
# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm --with tiktoken \
  mlx_lm.server \
  --host 127.0.0.1 --port 8080 \
  --trust-remote-code \
  --model spicyneuron/Kimi-K2.5-MLX-2.5bit

# Kimi K2.5 requires tiktoken + remote code for the tokenizer
```

# Methodology

Quantized with a [mlx-lm fork](https://github.com/ml-explore/mlx-lm/pull/922), drawing inspiration from Unsloth/AesSedai/ubergarm style mixed-precision GGUFs.

MLX quantization options differ from llama.cpp's, but the principles are the same:

- Sensitive layers like MoE routing, attention, and output embeddings get higher precision (BF16, 8, 4)
- More tolerant layers like MoE experts get lower precision (2, 3)

This quant is much smaller than [Unsloth's UD-Q2_K_XL](https://huggingface.co/unsloth/Kimi-K2.5-GGUF/tree/main/UD-Q2_K_XL), and loads and runs noticeably faster thanks to MLX.

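
The mixed-precision principle described above can be sketched as a lookup from layer path to bit width. This is a minimal illustration, not the recipe used for this quant: the path patterns, the specific bit assignments, and the `bits_for_layer` helper are all hypothetical, and the real fork may expose a different interface.

```python
import re

# Hypothetical (pattern, bits) rules, checked in order. None means "keep in BF16".
# Both the layer-name patterns and the bit widths are illustrative assumptions.
POLICY = [
    (re.compile(r"\.mlp\.gate$"), None),  # MoE router: most sensitive, left unquantized
    (re.compile(r"lm_head"), 8),          # output embeddings: high precision
    (re.compile(r"self_attn"), 4),        # attention projections: medium precision
    (re.compile(r"\.experts\."), 2),      # routed MoE experts: most tolerant
]
DEFAULT_BITS = 4  # assumed fallback for anything not matched above


def bits_for_layer(path):
    """Return the bit width for a layer path, or None to keep it in BF16."""
    for pattern, bits in POLICY:
        if pattern.search(path):
            return bits
    return DEFAULT_BITS


if __name__ == "__main__":
    for name in [
        "model.layers.3.mlp.gate",
        "model.layers.3.self_attn.q_proj",
        "model.layers.3.mlp.experts.17.down_proj",
        "lm_head",
    ]:
        print(name, "->", bits_for_layer(name))
```

Because experts hold the vast majority of the weights in an MoE model, pushing only those to 2 or 3 bits accounts for most of the size reduction while the small, sensitive layers stay at higher precision.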
# Performance

Throughput in tokens per second (higher is better).

| Prompt Size (tokens) | GGUF | MLX 3 bit | MLX 2.8 bit v1 | MLX 2.8 bit v2 | **MLX 2.5 bit** |
|------------:|---------:|----------:|---------------:|---------------:|------------:|
| 1000 | 148.82 | 216.976 | 224.878 | 224.094 | **226.368** |
| 5000 | 130.90 | 230.227 | 235.595 | 231.966 | **237.426** |
| 10000 | 113.32 | 219.792 | 222.464 | 218.455 | **223.846** |
| 20000 | 89.72 | 186.549 | 187.915 | 186.169 | **188.502** |

| Gen Size (tokens) | GGUF | MLX 3 bit | MLX 2.8 bit v1 | MLX 2.8 bit v2 | **MLX 2.5 bit** |
|------------:|---------:|----------:|---------------:|---------------:|------------:|
| 500 | 23.38 | 25.781 | 27.443 | 26.586 | **27.571** |
| 1000 | 22.37 | 25.210 | 26.491 | 24.285 | **26.853** |
| 2000 | 21.89 | 23.944 | 24.573 | 22.603 | **24.689** |
| 5000 | 20.52 | 20.758 | 21.030 | 20.499 | **21.192** |

# Perplexity (MLX quants)

Relative values are measured against the MLX 3 bit quant (lower is better).

| Model | Perplexity | Relative | Relative % |
|-----------------------|-----------------|----------|------------|
| MLX 3 bit | 3.798 ± 0.021 | - | - |
| MLX 2.8 bit v1 | 3.768 ± 0.021 | -0.030 | -0.79% |
| MLX 2.8 bit v2 | 3.702 ± 0.020 | -0.096 | -2.53% |
| **MLX 2.5 bit** | **3.777 ± 0.020** | **-0.021** | **-0.55%** |

```
# llama.cpp 8130
llama-bench -fa 1 --batch-size 2048 --ubatch-size 2048 --repetitions 5

# mlx_lm v0.30.7
mlx_lm.benchmark --num-trials 5
mlx_lm.perplexity --sequence-length 1000 --seed 222
```
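
For anyone re-running the benchmarks, the Relative and Relative % columns of the perplexity table can be reproduced with a few lines of arithmetic. This is a small sketch using the perplexity values from the table; the `relative_change` helper is a hypothetical name introduced only for illustration.

```python
def relative_change(value, baseline):
    """Return (absolute delta, percent delta) of value against baseline."""
    delta = value - baseline
    return round(delta, 3), round(100 * delta / baseline, 2)


# Perplexities copied from the table; MLX 3 bit is the baseline row.
BASELINE = 3.798
PERPLEXITY = {
    "MLX 2.8 bit v1": 3.768,
    "MLX 2.8 bit v2": 3.702,
    "MLX 2.5 bit": 3.777,
}

if __name__ == "__main__":
    for name, value in PERPLEXITY.items():
        delta, pct = relative_change(value, BASELINE)
        print(f"{name}: {delta:+.3f} ({pct:+.2f}%)")
```

Note that the differences between the 2.5 bit and 2.8 bit v1 quants are smaller than the reported ± error, so the two are effectively tied on this metric.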