MTP Support
As announced on LocalLLaMA, there is now MTP support in llama.cpp!
Any chance we can get MTP layers for Kimi-K2.6 added to the first GGUF?
srv load_model: creating MTP draft context against the target model '/models/ubergarm/Kimi-K2.6-Q4_X-GGUF/Kimi-K2.6-Q4_X-00001-of-00014.gguf'
llama_init_from_model: context type MTP requested but model doesn't contain MTP layers
srv load_model: failed to create MTP context
MTP is going gangbusters lately on all the models!
I'm not 100% sure how all the inference engines are doing it, e.g.
- Do I have to re-convert the entire model, preserving the MTP tensors, using the latest mainline llama.cpp convert_hf_to_gguf.py, and quantize again?
- Can I just extract the MTP layer/tensors into a separate GGUF and pass it in to llama-server at runtime? (ik and mainline might handle this differently now; it's been moving fast and is hard to keep up with.)
- Might be able to do #2, but kinda modified: only download the first small metadata GGUF and the new MTP GGUF, with the new metadata pointing to the new MTP GGUF.
Unfortunately, the big rigs of Wendell's are down for some maintenance right now, but I'll keep my eye on this.
I see @tarruda is going back trying to add MTP to big Qwen3.5 and has some notes here:
Apparently there's a new --mtp flag on convert_hf_to_gguf.py to create a new GGUF with only the MTP model, avoiding having to recreate the original GGUF...
https://huggingface.co/AesSedai/Qwen3.5-397B-A17B-GGUF/discussions/10#6a091ab9a80e0113bb2e0868
EDIT: Also keep an eye on this new ik PR 1821, which may add -sm graph for MLA models like Kimi.
I have the same questions, and I've assessed that it will probably be faster for me to just redo everything from scratch than offer both options and add the missing tensors.
Kimi
Planned retest on a 16×24GB rig once it's back online
Lol. I take it there's no benefit if half the model is running on the CPU?
can I just extract the MTP layer/tensors into a separate GGUF and pass it in to llama-server at runtime
I like this idea, like mmproj.
I generated an MTP version of Qwen 3.5 397B, but IMO it is not worth it if you are memory constrained: with the MTP layers quantized to Q8_0, the weights grew by about 6G. Also, the extra RAM was being used even with MTP disabled, though this could be a temporary llama.cpp issue.
Isn't there an option in llama to disable these tensors and not have them loaded in any way? I was under the impression people working on the MTP implementation already thought of this, as this is in my view the most sensible thing to do and also quite straightforward to implement.
@ubergarm, these options are available with convert_hf_to_gguf.py:
parser.add_argument(
    "--mtp", action="store_true",
    help="(Experimental) Export only the multi-token prediction (MTP) head as a separate GGUF, suitable for use as a speculative draft. Output file name will get a '-MTP' suffix.",
)
parser.add_argument(
    "--no-mtp", action="store_true",
    help="(Experimental) Exclude the multi-token prediction (MTP) head from the converted GGUF. Pair with --mtp on a second run to publish trunk and MTP as two files. Note: the split form duplicates embeddings, so the bundled default is more space-efficient overall.",
)
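For anyone wanting to try the split form, the two-run workflow described in the `--no-mtp` help text could be scripted roughly like this. This is only a sketch: `plan_split_conversion` and the model path are hypothetical names for illustration, and it assumes the script takes the model directory as a positional argument and accepts `--outtype q8_0` as in current mainline. It just builds and prints the two command lines rather than running them.

```python
import shlex


def plan_split_conversion(model_dir: str, outtype: str = "q8_0") -> list[list[str]]:
    """Return the argv lists for a two-run trunk + MTP conversion (hypothetical helper)."""
    # Assumption: convert_hf_to_gguf.py is in the current directory.
    base = ["python", "convert_hf_to_gguf.py", model_dir, "--outtype", outtype]
    return [
        base + ["--no-mtp"],  # run 1: trunk GGUF without the MTP head
        base + ["--mtp"],     # run 2: MTP head only ('-MTP' suffix per the help text)
    ]


if __name__ == "__main__":
    # "/models/hf/Example-Model" is a placeholder path, not a real checkpoint.
    for argv in plan_split_conversion("/models/hf/Example-Model"):
        print(shlex.join(argv))
```

Each printed line can be pasted into a shell once the path points at a real HF checkpoint. Quantizing further would still be a separate llama-quantize step.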
@tarruda I don't think having mtp embedded in the model increases vram usage.
I tested this by quantizing Qwen3.6-9B to Q8 three times and loading in ik_llama (dropping caches/compacting memory before each run):
Summary Table
| Setup | System RAM (Before) | System RAM (After) | Pinned Memory Log | VRAM (GPU) Used |
|---|---|---|---|---|
| 1. No MTP in GGUF | 4.9 GiB | 23 GiB | 8.05 GiB | 1807 MiB |
| 2. MTP Embedded (Disabled) | 5.0 GiB | 23 GiB | 8.86 GiB → 8.05 GiB | 1825 MiB |
| 3. MTP Embedded (Enabled) | 5.0 GiB | 29 GiB | 9.10 GiB → 9.05 GiB | 3375 MiB |
Which is a shame, got my hopes up thinking I could run a bigger Mimo-2.5-Pro or Kimi-K2.6 by stripping out the mtp.
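To make the table easier to read at a glance, here is a small sketch that computes the deltas from the numbers above. The values are copied from the table; the `deltas` helper is just illustrative, and the system RAM figures are whole-machine before/after readings, so treat the differences as rough.

```python
# (setup, system RAM before in GiB, system RAM after in GiB, VRAM used in MiB)
rows = [
    ("No MTP in GGUF", 4.9, 23.0, 1807),
    ("MTP embedded (disabled)", 5.0, 23.0, 1825),
    ("MTP embedded (enabled)", 5.0, 29.0, 3375),
]


def deltas(rows):
    """Map each setup to (RAM growth on load in GiB, VRAM difference vs. first row in MiB)."""
    base_vram = rows[0][3]
    return {
        name: (round(after - before, 1), vram - base_vram)
        for name, before, after, vram in rows
    }


result = deltas(rows)
for name, (ram_gib, vram_mib) in result.items():
    print(f"{name}: +{ram_gib} GiB RAM on load, {vram_mib:+d} MiB VRAM vs. no-MTP")
```

With MTP embedded but disabled, the footprint is within noise of the no-MTP file (18 MiB of VRAM); the extra memory only shows up once MTP is enabled.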
Also looks like the storage cost is higher if you want MTP but include it separately:
9.2G May 21 15:42 qwen3.5-9b-default.q8.gguf
2.3G May 21 15:37 qwen3.5-9b-mtp.q8.gguf
8.9G May 21 15:38 qwen3.5-9b-nomtp.q8.gguf
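Quick arithmetic on the listing above, as a sketch (sizes are the rounded values from the listing, so the result is approximate):

```python
# File sizes in G, as shown in the directory listing
bundled = 9.2    # qwen3.5-9b-default.q8.gguf (trunk + MTP in one file)
trunk = 8.9      # qwen3.5-9b-nomtp.q8.gguf
mtp_only = 2.3   # qwen3.5-9b-mtp.q8.gguf

split_total = round(trunk + mtp_only, 1)
overhead = round(split_total - bundled, 1)

print(f"split: {split_total}G, bundled: {bundled}G, overhead: {overhead}G")
```

So the split form costs about 2G more on disk for this model, which fits the note in the `--no-mtp` help text about duplicated embeddings.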
Personally prefer it be excluded like mmproj. For some models I have multiple quants sitting on the SSD.
I'll load a smaller one if I want speed over quality, a larger one to train control-vectors, etc. So I'd rather just point at the same draft model.
Though lately I've been doing 1 tensor per file splits and managing symlinks / pulling in specific tensors from @Thireus repos (and building ik_llama with -DGGML_MAX_CONTEXTS=2048)
https://nobodywho.ooo/posts/whats-in-a-gguf/
The projection model is often ~1GB in size - enough of an overhead that we definitely want to skip it when it's not used. But I think it's reasonable to provide two variants of the GGUF: one with projection weights, and one without. That could get us back to the situation of managing just one url to download, just one file to cache on disk, etc.
If the general "vibe" is to offer both for mmproj, then likely the same for MTP?
I just hope llama.cpp and ik_llama.cpp are both compatible with whatever the community adopts so I don't have to do https://huggingface.co/gghfez/MiMo-V2.5-Pro-unfused-test again -_-!
I am so waiting for MTP on the big models. Mimo v2.5 Pro and non-Pro have just won me over so much. Trick is to keep min p high though and cap thinking.