--- quantized_by: ubergarm pipeline_tag: text-generation base_model: moonshotai/Kimi-K2.6 license: other license_name: modified-mit license_link: https://huggingface.co/moonshotai/Kimi-K2.6/blob/main/LICENSE base_model_relation: quantized tags: - mla - imatrix - conversational - ik_llama.cpp --- ## imatrix Quantization of moonshotai/Kimi-K2.6 Except for the `Q4_X`, the other quants in this collection **REQUIRE** [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support the ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc! *NOTE* `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc if you want to try it out before downloading my quants. Some of ik's new quants are supported with [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP with Windows builds for CUDA 12.9. Also check for [Windows builds by Thireus here.](https://github.com/Thireus/ik_llama.cpp/releases) which have been CUDA 12.8. These quants provide best in class perplexity for the given memory footprint. ## Big Thanks Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!! Also thanks to all the folks in the quanting and inferencing community on [BeaverAI Club Discord](https://huggingface.co/BeaverAI) and on [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) for tips and tricks helping each other run, test, and benchmark all the fun new models! Finally, I *really* appreciate all the support from [aifoundry.org](https://aifoundry.org) so check out their open source RISC-V solutions, and of course huggingface for hosting all these big quants! ## Quant Collection Perplexity computed against *wiki.test.raw*. (lower is "better") ![Perplexity Chart](images/perplexity.png "Chart showing Perplexity vs Model Size.") ![KLD Chart](images/kld.png "Chart showing KLD vs Model Size.") ## Q4_X 543.617 GiB (4.549 BPW) PPL over 568 chunks for n_ctx=512 = 1.8433 +/- 0.00721 This quant is the "full size" model made using the `Q4_X` patch to match moonshot official `int4` released as described below. It does *not* use imatrix and is compatible on *both* ik and mainline llama.cpp
👈 Secret Recipe ```bash #!/usr/bin/env bash # https://github.com/ikawrakow/ik_llama.cpp/pull/1556#issuecomment-4282712006 # Q4_0 (patched) routed experts approximating original QAT design # Q8_0 everything else custom=" ## Attention [0-60] (GPU) blk\..*\.attn_k_b\.weight=q8_0 blk\..*\.attn_v_b\.weight=q8_0 # Balance of attn tensors blk\..*\.attn_kv_a_mqa\.weight=q8_0 blk\..*\.attn_q_a\.weight=q8_0 blk\..*\.attn_q_b\.weight=q8_0 blk\..*\.attn_output\.weight=q8_0 ## First Single Dense Layer [0] (GPU) blk\..*\.ffn_down\.weight=q8_0 blk\..*\.ffn_(gate|up)\.weight=q8_0 ## Shared Expert [1-60] (GPU) blk\..*\.ffn_down_shexp\.weight=q8_0 blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0 ## Routed Experts [1-60] (CPU) blk\..*\.ffn_down_exps\.weight=q4_0 blk\..*\.ffn_(gate|up)_exps\.weight=q4_0 token_embd\.weight=q8_0 output\.weight=q8_0 " custom=$( echo "$custom" | grep -v '^#' | \ sed -Ez 's:\n+:,:g;s:,$::;s:^,::' ) numactl -N ${SOCKET} -m ${SOCKET} \ ./build/bin/llama-quantize \ --custom-q "$custom" \ /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-384x14B-BF16-00001-of-00046.gguf \ /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-Q4_X.gguf \ Q8_0 \ 128 ```
## IQ3_K 459.945 GiB (3.849 BPW) PPL over 568 chunks for n_ctx=512 = 1.9012 +/- 0.00753 *Note*: Just on this quant, imatrix was applied *only* to `ffn_(gate|up)_exps` tensors that are `iq3_k`. Also this recipe is just a smooch bigger than previous Kimi-K2.5 version, but still fits nicely in under 512GB.
👈 Secret Recipe ```bash #!/usr/bin/env bash custom=" ## Attention [0-60] (GPU) blk\..*\.attn_k_b\.weight=q8_0 blk\..*\.attn_v_b\.weight=q8_0 # Balance of attn tensors blk\..*\.attn_kv_a_mqa\.weight=q8_0 blk\..*\.attn_q_a\.weight=q8_0 blk\..*\.attn_q_b\.weight=q8_0 blk\..*\.attn_output\.weight=q8_0 ## First Single Dense Layer [0] (GPU) blk\..*\.ffn_down\.weight=q8_0 blk\..*\.ffn_(gate|up)\.weight=q8_0 ## Shared Expert [1-60] (GPU) blk\..*\.ffn_down_shexp\.weight=q8_0 blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0 ## Routed Experts [1-60] (CPU) ## NOTE: imatrix is *only* applied to the iq3_k tensors for this recipe blk\..*\.ffn_down_exps\.weight=q4_0 blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k ## NOTE: previous recipe used iq6_k for both of these token_embd\.weight=q8_0 output\.weight=q8_0 " custom=$( echo "$custom" | grep -v '^#' | \ sed -Ez 's:\n+:,:g;s:,$::;s:^,::' ) numactl -N ${SOCKET} -m ${SOCKET} \ ./build/bin/llama-quantize \ --custom-q "$custom" \ --imatrix /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/imatrix-Kimi-K2.6-Q4_X.dat \ --include-weights ffn_gate_exps \ --include-weights ffn_up_exps \ /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-384x14B-BF16-00001-of-00046.gguf \ /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-IQ3_K.gguf \ IQ3_K \ 128 ```
## smol-IQ3_KS 388.258 GiB (3.249 BPW) PPL over 568 chunks for n_ctx=512 = 1.9810 +/- 0.00800
👈 Secret Recipe ```bash #!/usr/bin/env bash custom=" ## Attention [0-60] (GPU) blk\..*\.attn_k_b\.weight=q8_0 blk\..*\.attn_v_b\.weight=q8_0 # Balance of attn tensors blk\..*\.attn_kv_a_mqa\.weight=q8_0 blk\..*\.attn_q_a\.weight=q8_0 blk\..*\.attn_q_b\.weight=q8_0 blk\..*\.attn_output\.weight=q8_0 ## First Single Dense Layer [0] (GPU) blk\..*\.ffn_down\.weight=q8_0 blk\..*\.ffn_(gate|up)\.weight=q8_0 ## Shared Expert [1-60] (GPU) blk\..*\.ffn_down_shexp\.weight=q8_0 blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0 ## Routed Experts [1-60] (CPU) blk\..*\.ffn_down_exps\.weight=iq3_ks blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks token_embd\.weight=iq4_k output\.weight=iq6_k " custom=$( echo "$custom" | grep -v '^#' | \ sed -Ez 's:\n+:,:g;s:,$::;s:^,::' ) numactl -N ${SOCKET} -m ${SOCKET} \ ./build/bin/llama-quantize \ --custom-q "$custom" \ --imatrix /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/imatrix-Kimi-K2.6-Q4_X.dat \ /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-384x14B-BF16-00001-of-00046.gguf \ /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-smol-IQ3_KS.gguf \ IQ3_KS \ 128 ```
## smol-IQ2_KL 329.195 GiB (2.755 BPW) PPL over 568 chunks for n_ctx=512 = 2.2190 +/- 0.00936
👈 Secret Recipe ```bash #!/usr/bin/env bash custom=" ## Attention [0-60] (GPU) blk\..*\.attn_k_b\.weight=q8_0 blk\..*\.attn_v_b\.weight=q8_0 # Balance of attn tensors blk\..*\.attn_kv_a_mqa\.weight=q8_0 blk\..*\.attn_q_a\.weight=q8_0 blk\..*\.attn_q_b\.weight=q8_0 blk\..*\.attn_output\.weight=q8_0 ## First Single Dense Layer [0] (GPU) blk\..*\.ffn_down\.weight=q8_0 blk\..*\.ffn_(gate|up)\.weight=q8_0 ## Shared Expert [1-60] (GPU) blk\..*\.ffn_down_shexp\.weight=q8_0 blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0 ## Routed Experts [1-60] (CPU) blk\..*\.ffn_down_exps\.weight=iq2_kl blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl token_embd\.weight=iq4_k output\.weight=iq6_k " custom=$( echo "$custom" | grep -v '^#' | \ sed -Ez 's:\n+:,:g;s:,$::;s:^,::' ) numactl -N ${SOCKET} -m ${SOCKET} \ ./build/bin/llama-quantize \ --custom-q "$custom" \ --imatrix /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/imatrix-Kimi-K2.6-Q4_X.dat \ /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-384x14B-BF16-00001-of-00046.gguf \ /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-smol-IQ2_KL.gguf \ IQ2_KL \ 128 ```
## smol-IQ2_KS 270.133 GiB (2.261 BPW) PPL over 568 chunks for n_ctx=512 = 2.6723 +/- 0.01209
👈 Secret Recipe ```bash #!/usr/bin/env bash custom=" ## Attention [0-60] (GPU) blk\..*\.attn_k_b\.weight=q8_0 blk\..*\.attn_v_b\.weight=q8_0 # Balance of attn tensors blk\..*\.attn_kv_a_mqa\.weight=q8_0 blk\..*\.attn_q_a\.weight=q8_0 blk\..*\.attn_q_b\.weight=q8_0 blk\..*\.attn_output\.weight=q8_0 ## First Single Dense Layer [0] (GPU) blk\..*\.ffn_down\.weight=q8_0 blk\..*\.ffn_(gate|up)\.weight=q8_0 ## Shared Expert [1-60] (GPU) blk\..*\.ffn_down_shexp\.weight=q8_0 blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0 ## Routed Experts [1-60] (CPU) blk\..*\.ffn_down_exps\.weight=iq2_ks blk\..*\.ffn_(gate|up)_exps\.weight=iq2_ks token_embd\.weight=iq4_k output\.weight=iq6_k " custom=$( echo "$custom" | grep -v '^#' | \ sed -Ez 's:\n+:,:g;s:,$::;s:^,::' ) numactl -N ${SOCKET} -m ${SOCKET} \ ./build/bin/llama-quantize \ --custom-q "$custom" \ --imatrix /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/imatrix-Kimi-K2.6-Q4_X.dat \ /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-384x14B-BF16-00001-of-00046.gguf \ /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-smol-IQ2_KS.gguf \ IQ2_KS \ 128 ```
## smol-IQ1_KT 218.936 GiB (1.832 BPW) PPL over 568 chunks for n_ctx=512 = 3.3252 +/- 0.01613 *only for the desperate* Also keep in mind `KT` trellis quants generally are slower token generation given likely compute bottleneck if running on CPU, but if it is all you can fit then well... They are fast on GPU similar to EXL3.
👈 Secret Recipe ```bash #!/usr/bin/env bash custom=" ## Attention [0-60] (GPU) blk\..*\.attn_k_b\.weight=q8_0 blk\..*\.attn_v_b\.weight=q8_0 # Balance of attn tensors blk\..*\.attn_kv_a_mqa\.weight=q8_0 blk\..*\.attn_q_a\.weight=q8_0 blk\..*\.attn_q_b\.weight=q8_0 blk\..*\.attn_output\.weight=q8_0 ## First Single Dense Layer [0] (GPU) blk\..*\.ffn_down\.weight=q8_0 blk\..*\.ffn_(gate|up)\.weight=q8_0 ## Shared Expert [1-60] (GPU) blk\..*\.ffn_down_shexp\.weight=q8_0 blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0 ## Routed Experts [1-60] (CPU) blk\..*\.ffn_down_exps\.weight=iq1_kt blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt token_embd\.weight=iq4_k output\.weight=iq6_k " custom=$( echo "$custom" | grep -v '^#' | \ sed -Ez 's:\n+:,:g;s:,$::;s:^,::' ) numactl -N ${SOCKET} -m ${SOCKET} \ ./build/bin/llama-quantize \ --custom-q "$custom" \ --imatrix /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/imatrix-Kimi-K2.6-Q4_X.dat \ /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-384x14B-BF16-00001-of-00046.gguf \ /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-smol-IQ1_KT.gguf \ IQ1_KT \ 128 ```
## Quick Start ```bash # Clone and checkout $ git clone https://github.com/ikawrakow/ik_llama.cpp $ cd ik_llama.cpp # Build for hybrid CPU+CUDA (or set GGML_CUDA=OFF for CPU only) $ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON $ cmake --build build --config Release -j $(nproc) # Hybrid CPU+GPU Inference # MLA model architechtures don't support `-sm graph` # try it with `-fit on` but you can dial it yourself e.g. `-ngl 999 --n-cpu-moe 60` etc... ./build/bin/llama-server \ --model "$model"\ --alias ubergarm/Kimi-K2.6-GGUF \ -muge \ --merge-qkv \ --ctx-size 131072 \ -ctk f16 \ -mla 3 \ -amb 512 \ -fit \ --parallel 1 \ -ub 4096 -b 4096 \ --threads 16 \ --threads-batch 16 \ --host 127.0.0.1 \ --port 8080 \ --no-mmap \ --jinja # CPU-only inference numactl -N ${SOCKET} -m ${SOCKET} \ ./build/bin/llama-server \ --model "$model"\ --alias ubergarm/Kimi-K2.6-GGUF \ -muge \ --merge-qkv \ --ctx-size 131072 \ -ctk f16 \ -mla 3 \ --parallel 1 \ -ub 4096 -b 4096 \ --threads 96 \ --threads-batch 128 \ --numa numactl \ --host 127.0.0.1 \ --port 8080 \ --no-mmap \ --jinja ``` Bring your own jinja chat template with `--chat-template-file myTemplate.jinja` e.g. [this one provided by DrRos](https://huggingface.co/ubergarm/Kimi-K2.6-GGUF/discussions/4#69e91ea0bca19b1cb0d11d4e). I also vibe patched one to behave more like Qwen3.6 which is working well with pi coding harness `--chat-template-file Kimi-K2.6-chat-template.jinja` and `-cram 8192` (8GiB RAM) prompt cache without busting cache causing long kv-cache processing. Seems to be working with spec-decoding e.g. `--spec-type ngram-map-k4v --spec-ngram-size-n 8 --spec-ngram-size-m 8 --spec-ngram-min-hits 2 --draft-min 1 --draft-max 12` Increase prompt cache with stuff like `-cram 16384 --prompt-cache-all`. ## Q4_X Patch https://github.com/ikawrakow/ik_llama.cpp/pull/1556#issuecomment-4282712006 ## References * [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) * [Getting Started Guide (already out of date lol)](https://github.com/ikawrakow/ik_llama.cpp/discussions/258) * [ubergarm-imatrix-calibration-corpus-v02.txt](https://gist.github.com/ubergarm/edfeb3ff9c6ec8b49e88cdf627b0711a?permalink_comment_id=5682584#gistcomment-5682584) * [Great mainline MoE optimizd quants AesSedai/Kimi-K2.6-GGUF and maybe mmproj too](https://huggingface.co/AesSedai/Kimi-K2.6-GGUF)