- imatrix Quantization of moonshotai/Kimi-K2.6
- Big Thanks
- Quant Collection
- Q4_X 543.617 GiB (4.549 BPW)
- IQ3_K 459.945 GiB (3.849 BPW)
- smol-IQ3_KS 388.258 GiB (3.249 BPW)
- smol-IQ2_KL 329.195 GiB (2.755 BPW)
- smol-IQ2_KS 270.133 GiB (2.261 BPW)
- smol-IQ1_KT 218.936 GiB (1.832 BPW)
- Quick Start
- Q4_X Patch
- References
imatrix Quantization of moonshotai/Kimi-K2.6
Except for the Q4_X, the quants in this collection REQUIRE the ik_llama.cpp fork, which supports ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!
NOTE ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc if you want to try it out before downloading my quants.
Some of ik's new quants are supported by the Nexesenex/croco.cpp fork of KoboldCPP, which has Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here, which have been for CUDA 12.8.
These quants provide best in class perplexity for the given memory footprint.
Big Thanks
Shout out to Wendell and the Level1Techs crew, the community Forums, YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!
Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!
Finally, I really appreciate all the support from aifoundry.org so check out their open source RISC-V solutions, and of course huggingface for hosting all these big quants!
Quant Collection
Perplexity computed against wiki.test.raw (lower is "better").
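For readers new to the metric: perplexity is the exponential of the mean negative log-likelihood per token, so a PPL of 1.0 would mean every token was predicted with certainty. The sketch below computes it with awk from a list of per-token natural-log probabilities. The `perplexity` helper and the sample values are illustrative only; the figures on this card come from running each quant over wiki.test.raw, not from this script.

```shell
#!/usr/bin/env bash
# Illustrative only: perplexity = exp(-(1/N) * sum(ln p_i)).
# Reads one natural-log token probability per line on stdin.
perplexity() {
  awk '{ sum += $1; n++ } END { if (n > 0) printf "%.4f\n", exp(-sum / n) }'
}

# Four tokens, each predicted with probability 0.5 (ln 0.5 is about -0.693147).
printf '%s\n' -0.693147 -0.693147 -0.693147 -0.693147 | perplexity
# prints 2.0000
```

Read this way, the "568 chunks for n_ctx=512" in each PPL line says the test text was split into 568 windows of 512 tokens each.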
Q4_X 543.617 GiB (4.549 BPW)
PPL over 568 chunks for n_ctx=512 = 1.8433 +/- 0.00721
This quant is the "full size" model, made using the Q4_X patch to match Moonshot's official int4 release as described below. It does not use imatrix and is compatible with both ik_llama.cpp and mainline llama.cpp.
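Each quant heading pairs a file size in GiB with a bits-per-weight (BPW) figure, and the two are tied together by the total weight count: BPW = size in bits / number of weights. As a rough sanity check, the sketch below backs out the implied weight count from the Q4_X figures and then estimates the size of another quant from its BPW. The helper names are made up for illustration, and it assumes 1 GiB = 2^30 bytes and ignores any non-weight metadata in the file.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of ik_llama.cpp) relating GiB, BPW and weight count.
# Assumes 1 GiB = 2^30 bytes.

# implied_weights SIZE_GIB BPW -> weight count in billions
implied_weights() {
  awk -v gib="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", gib * 2^30 * 8 / bpw / 1e9 }'
}

# predicted_gib WEIGHTS_BILLIONS BPW -> estimated size in GiB
predicted_gib() {
  awk -v w="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", w * 1e9 * bpw / 8 / 2^30 }'
}

n=$(implied_weights 543.617 4.549)   # Q4_X figures from this card
echo "implied weights: ${n}B"
echo "estimated size at 3.849 BPW: $(predicted_gib "$n" 3.849) GiB"
```

The estimate for 3.849 BPW can be compared against the 459.945 GiB listed for IQ3_K as a consistency check on the published numbers.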
Secret Recipe
#!/usr/bin/env bash
# https://github.com/ikawrakow/ik_llama.cpp/pull/1556#issuecomment-4282712006
# Q4_0 (patched) routed experts approximating original QAT design
# Q8_0 everything else
custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0
# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0
## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=q4_0
blk\..*\.ffn_(gate|up)_exps\.weight=q4_0
token_embd\.weight=q8_0
output\.weight=q8_0
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
--custom-q "$custom" \
/mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-384x14B-BF16-00001-of-00046.gguf \
/mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-Q4_X.gguf \
Q8_0 \
128
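The `custom=$( ... )` step in the recipe above turns the commented, multi-line rule list into the single comma-separated string that `--custom-q` receives. A minimal sketch of that transformation on a two-rule toy list follows. It uses the same `grep`/`sed` pipeline and assumes GNU sed (for `-z`); the `flat` variable name and the shortened rule list are only for illustration.

```shell
#!/usr/bin/env bash
# Toy rule list: same layout as the recipes, shortened for illustration.
custom="
## Attention
blk\..*\.attn_k_b\.weight=q8_0

# Non-repeating layers
token_embd\.weight=q8_0
"

# 1. grep -v '^#' drops the comment lines.
# 2. sed -z reads the whole input as one record, so newlines can be matched:
#    each run of newlines becomes one comma, then the trailing and leading
#    commas are trimmed.
flat=$(
  printf '%s\n' "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

printf '%s\n' "$flat"
# prints blk\..*\.attn_k_b\.weight=q8_0,token_embd\.weight=q8_0
```

Keeping the rules one per line with comments makes the recipe readable, while the flattening keeps the final command line to a single argument.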
IQ3_K 459.945 GiB (3.849 BPW)
PPL over 568 chunks for n_ctx=512 = 1.9012 +/- 0.00753
Note: For this quant only, imatrix was applied just to the ffn_(gate|up)_exps tensors, which are iq3_k. Also, this recipe is a smidge bigger than the previous Kimi-K2.5 version, but still fits nicely in under 512GB.
Secret Recipe
#!/usr/bin/env bash
custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0
# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0
## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
## Routed Experts [1-60] (CPU)
## NOTE: imatrix is *only* applied to the iq3_k tensors for this recipe
blk\..*\.ffn_down_exps\.weight=q4_0
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k
## NOTE: previous recipe used iq6_k for both of these
token_embd\.weight=q8_0
output\.weight=q8_0
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/imatrix-Kimi-K2.6-Q4_X.dat \
--include-weights ffn_gate_exps \
--include-weights ffn_up_exps \
/mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-384x14B-BF16-00001-of-00046.gguf \
/mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-IQ3_K.gguf \
IQ3_K \
128
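Each rule in these recipes is a regular expression over tensor names, paired with a quant type. To see which tensors a rule would pick up, the sketch below tests a few names against patterns from the recipe above. It uses `grep -E` as a stand-in for the quantizer's own regex matching, which is an assumption, and the sample tensor names are written by hand to follow the naming visible in the rules.

```shell
#!/usr/bin/env bash
# matches PATTERN NAME -> exit 0 if the regex is found in the tensor name.
# grep -E is only a stand-in here; llama-quantize may match differently.
matches() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Hand-written sample tensor names (hypothetical layer indices).
names="blk.7.ffn_gate_exps.weight blk.7.ffn_up_exps.weight blk.7.ffn_down_exps.weight blk.7.ffn_down_shexp.weight blk.0.ffn_down.weight"

for name in $names; do
  for pat in 'blk\..*\.ffn_(gate|up)_exps\.weight' \
             'blk\..*\.ffn_down_exps\.weight' \
             'blk\..*\.ffn_down\.weight'; do
    if matches "$pat" "$name"; then
      printf '%-32s <- %s\n' "$name" "$pat"
    fi
  done
done
```

In this sketch the `ffn_down\.weight` rule does not catch `ffn_down_exps.weight`, because `\.weight` has to follow `ffn_down` directly, and the shared-expert name matches none of the three patterns, which is why the recipes carry separate `_shexp` rules.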
smol-IQ3_KS 388.258 GiB (3.249 BPW)
PPL over 568 chunks for n_ctx=512 = 1.9810 +/- 0.00800
Secret Recipe
#!/usr/bin/env bash
custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0
# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0
## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/imatrix-Kimi-K2.6-Q4_X.dat \
/mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-384x14B-BF16-00001-of-00046.gguf \
/mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-smol-IQ3_KS.gguf \
IQ3_KS \
128
smol-IQ2_KL 329.195 GiB (2.755 BPW)
PPL over 568 chunks for n_ctx=512 = 2.2190 +/- 0.00936
Secret Recipe
#!/usr/bin/env bash
custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0
# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0
## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq2_kl
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/imatrix-Kimi-K2.6-Q4_X.dat \
/mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-384x14B-BF16-00001-of-00046.gguf \
/mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-smol-IQ2_KL.gguf \
IQ2_KL \
128
smol-IQ2_KS 270.133 GiB (2.261 BPW)
PPL over 568 chunks for n_ctx=512 = 2.6723 +/- 0.01209
Secret Recipe
#!/usr/bin/env bash
custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0
# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0
## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq2_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_ks
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/imatrix-Kimi-K2.6-Q4_X.dat \
/mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-384x14B-BF16-00001-of-00046.gguf \
/mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-smol-IQ2_KS.gguf \
IQ2_KS \
128
smol-IQ1_KT 218.936 GiB (1.832 BPW)
PPL over 568 chunks for n_ctx=512 = 3.3252 +/- 0.01613
only for the desperate
Also keep in mind that KT trellis quants generally have slower token generation on CPU, where they are likely compute bottlenecked, but if it is all you can fit then well... They are fast on GPU, similar to EXL3.
Secret Recipe
#!/usr/bin/env bash
custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0
# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0
## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq1_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/data/models/ubergarm/Kimi-K2.6-GGUF/imatrix-Kimi-K2.6-Q4_X.dat \
/mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-384x14B-BF16-00001-of-00046.gguf \
/mnt/data/models/ubergarm/Kimi-K2.6-GGUF/Kimi-K2.6-smol-IQ1_KT.gguf \
IQ1_KT \
128
Quick Start
# Clone and checkout
$ git clone https://github.com/ikawrakow/ik_llama.cpp
$ cd ik_llama.cpp
# Build for hybrid CPU+CUDA (or set GGML_CUDA=OFF for CPU only)
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
$ cmake --build build --config Release -j $(nproc)
# Hybrid CPU+GPU Inference
# MLA model architectures don't support `-sm graph`
# try it with `-fit on` but you can dial it yourself e.g. `-ngl 999 --n-cpu-moe 60` etc...
./build/bin/llama-server \
--model "$model" \
--alias ubergarm/Kimi-K2.6-GGUF \
-muge \
--merge-qkv \
--ctx-size 131072 \
-ctk f16 \
-mla 3 \
-amb 512 \
-fit \
--parallel 1 \
-ub 4096 -b 4096 \
--threads 16 \
--threads-batch 16 \
--host 127.0.0.1 \
--port 8080 \
--no-mmap \
--jinja
# CPU-only inference
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-server \
--model "$model" \
--alias ubergarm/Kimi-K2.6-GGUF \
-muge \
--merge-qkv \
--ctx-size 131072 \
-ctk f16 \
-mla 3 \
--parallel 1 \
-ub 4096 -b 4096 \
--threads 96 \
--threads-batch 128 \
--numa numactl \
--host 127.0.0.1 \
--port 8080 \
--no-mmap \
--jinja
Bring your own jinja chat template with --chat-template-file myTemplate.jinja, e.g. this one provided by DrRos. I also vibe patched one to behave more like Qwen3.6, which is working well with the pi coding harness: use --chat-template-file Kimi-K2.6-chat-template.jinja together with -cram 8192 (8GiB RAM prompt cache). This runs without busting the cache, which would otherwise cause long kv-cache processing.
Speculative decoding also seems to be working, e.g. --spec-type ngram-map-k4v --spec-ngram-size-n 8 --spec-ngram-size-m 8 --spec-ngram-min-hits 2 --draft-min 1 --draft-max 12
Increase prompt cache with stuff like -cram 16384 --prompt-cache-all.
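Once either server command above is running, it can be queried over the OpenAI-compatible chat endpoint. The sketch below assumes the `--host`, `--port`, and `--alias` values from those commands and the `/v1/chat/completions` route. The `build_payload` and `ask` helper names are made up, and the JSON escaping only handles backslashes and double quotes.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for talking to the llama-server started above.
BASE_URL="http://127.0.0.1:8080/v1"      # from --host / --port
MODEL="ubergarm/Kimi-K2.6-GGUF"          # from --alias

# build_payload PROMPT -> JSON body for a single-turn chat request.
# Escapes backslashes and double quotes only; multi-line prompts need a real JSON tool.
build_payload() {
  local prompt
  prompt=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}\n' "$MODEL" "$prompt"
}

# ask PROMPT -> raw JSON response from the server.
ask() {
  curl -sS "$BASE_URL/chat/completions" \
    -H "Content-Type: application/json" \
    --data "$(build_payload "$1")"
}

# Usage (needs the server running):
#   ask "What is the capital of France?"
```

Splitting payload construction from the network call makes it easy to inspect the request body before sending anything to the server.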
Q4_X Patch
https://github.com/ikawrakow/ik_llama.cpp/pull/1556#issuecomment-4282712006
References