MTP Support

#13

by anikifoss - opened 11 days ago

Discussion

anikifoss

11 days ago

As broadcasted in LocalLLaMA, there is now MTP support in llama.cpp!

Any chance we can get MTP layers for Kimi-K2.6 added to the first GGUF?

srv    load_model: creating MTP draft context against the target model '/models/ubergarm/Kimi-K2.6-Q4_X-GGUF/Kimi-K2.6-Q4_X-00001-of-00014.gguf'
llama_init_from_model: context type MTP requested but model doesn't contain MTP layers
srv    load_model: failed to create MTP context
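The error above says the loaded GGUF has no MTP layers. A minimal sketch of how one might check a shard for them before launching the server, assuming the `gguf` Python package is installed and that MTP tensors can be recognized by a substring in their names. The marker substrings and the sample tensor names are illustrative guesses, not confirmed names for Kimi-K2.6:

```python
from typing import Iterable, List, Sequence

# Assumption: MTP tensors can be spotted by a substring in the tensor name.
# These markers are illustrative guesses, not confirmed names for Kimi-K2.6.
ASSUMED_MTP_MARKERS = ("nextn", "mtp")


def find_mtp_tensors(
    names: Iterable[str], markers: Sequence[str] = ASSUMED_MTP_MARKERS
) -> List[str]:
    """Return the tensor names that contain any of the assumed MTP markers."""
    return [n for n in names if any(m in n.lower() for m in markers)]


def tensor_names_from_gguf(path: str) -> List[str]:
    """Read tensor names from one GGUF shard (requires `pip install gguf`)."""
    from gguf import GGUFReader  # imported lazily so the filter works without it

    return [t.name for t in GGUFReader(path).tensors]


if __name__ == "__main__":
    # Hypothetical tensor names, used only to exercise the filter.
    sample = [
        "token_embd.weight",
        "blk.0.attn_norm.weight",
        "blk.61.nextn.eh_proj.weight",
    ]
    print(find_mtp_tensors(sample))
```

For a split GGUF, each shard only lists its own tensors, so `tensor_names_from_gguf` would need to be run over every shard before concluding that the MTP tensors are missing.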

ubergarm

Owner 9 days ago • edited 9 days ago

@anikifoss

MTP is going gangbusters lately on all the models!

I'm not 100% sure how all the inference engines are doing it, e.g.

1. Do I have to re-convert the entire model, preserving the MTP tensors, using the latest mainline llama.cpp convert_hf_to_gguf.py, and then quantize again?
2. Can I just extract the MTP layer/tensors into a separate GGUF and pass it in to llama-server at runtime? (ik and mainline might handle this differently now; it's been moving fast and is hard to keep up with.)
3. I might be able to do #2 in a modified form, so that you only need to download the first small metadata GGUF plus the new MTP GGUF, with the new metadata pointing to the new MTP GGUF.

Unfortunately, the big rigs of Wendell's are down for some maintenance right now, but I'll keep my eye on this.

I see @tarruda is going back trying to add MTP to big Qwen3.5 and has some notes here:

Apparently there's a new --mtp flag on convert_hf_to_gguf.py to create a new GGUF with only the MTP model, avoiding having to recreate the original GGUF...
https://huggingface.co/AesSedai/Qwen3.5-397B-A17B-GGUF/discussions/10#6a091ab9a80e0113bb2e0868

EDIT: Also keep an eye on this new ik PR 1821, which may add -sm graph for MLA models like Kimi 🤞

Thireus

8 days ago • edited 8 days ago

I have the same questions, and I've assessed that it will probably be faster for me to just redo everything from scratch than offer both options and add the missing tensors.

gghfez

8 days ago

> Kimi
>
> Planned retest on a 16×24GB rig once it's back online

Lol. I take it there's no benefit if half the model is running on the CPU?

> can I just extract the MTP layer/tensors into a separate GGUF and pass it in to llama-server at runtime

I like this idea, like mmproj.

tarruda

8 days ago

I generated an MTP version of Qwen 3.5 397b, but IMO it was not worth it if you are memory constrained: when the MTP layers are quantized to Q8_0, the weights grew in size by about 6G. Also, the extra RAM was being used even with MTP disabled, though this could be a temporary llama.cpp issue.

Thireus

7 days ago • edited 7 days ago

> I generated an MTP version of Qwen 3.5 397b, but IMO it was not worth it if you are memory constrained: when the MTP layers are quantized to Q8_0, the weights grew in size by about 6G. Also, the extra RAM was being used even with MTP disabled, though this could be a temporary llama.cpp issue.

Isn't there an option in llama to disable these tensors and not have them loaded in any way? I was under the impression people working on the MTP implementation already thought of this, as this is in my view the most sensible thing to do and also quite straightforward to implement.

Thireus

7 days ago

@ubergarm, these options are available with convert_hf_to_gguf.py:

https://github.com/ggml-org/llama.cpp/pull/22673/changes#diff-ec77d8003b92ff283179456d36b8b56abf635e7b1232e70daf16676e8920ccf1R120-R126

    parser.add_argument(
        "--mtp", action="store_true",
        help="(Experimental) Export only the multi-token prediction (MTP) head as a separate GGUF, suitable for use as a speculative draft. Output file name will get a '-MTP' suffix.",
    )
    parser.add_argument(
        "--no-mtp", action="store_true",
        help="(Experimental) Exclude the multi-token prediction (MTP) head from the converted GGUF. Pair with --mtp on a second run to publish trunk and MTP as two files. Note: the split form duplicates embeddings, so the bundled default is more space-efficient overall.",
    )
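To make the two-run workflow described in those help strings concrete, here is a self-contained sketch that parses the same two flags and works out which file each run would write. Only the flag names and the '-MTP' suffix come from the help text; `--outfile` handling, `planned_output`, the placement of the suffix before the extension, and the rejection of both flags together are assumptions for illustration, not the actual behavior of convert_hf_to_gguf.py:

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Tiny stand-in parser exposing only the two MTP flags plus an output path."""
    parser = argparse.ArgumentParser(description="Sketch of the two MTP export flags")
    parser.add_argument("--outfile", type=Path, required=True)
    parser.add_argument("--mtp", action="store_true", help="export only the MTP head")
    parser.add_argument("--no-mtp", action="store_true", help="exclude the MTP head")
    return parser


def planned_output(args: argparse.Namespace) -> Path:
    """Return the file a run would write (hypothetical helper)."""
    if args.mtp and args.no_mtp:
        # Assumption: the flags are meant for separate runs, so reject both at once.
        raise ValueError("use --mtp and --no-mtp on separate runs")
    if args.mtp:
        # Assumption: the '-MTP' suffix goes before the file extension.
        out = args.outfile
        return out.with_name(f"{out.stem}-MTP{out.suffix}")
    return args.outfile


if __name__ == "__main__":
    parser = build_parser()
    trunk = parser.parse_args(["--outfile", "model.gguf", "--no-mtp"])
    draft = parser.parse_args(["--outfile", "model.gguf", "--mtp"])
    print(planned_output(trunk), planned_output(draft))
```

Run once with `--no-mtp` and once with `--mtp`, this yields a trunk file and a separate MTP file, which is the "two files" layout the help text mentions.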

gghfez

7 days ago

@tarruda I don't think having mtp embedded in the model increases vram usage.
I tested this by quantizing Qwen3.6-9B to Q8 three times and loading in ik_llama (dropping caches/compacting memory before each run):

Summary Table

| Setup | System RAM (Before) | System RAM (After) | Pinned Memory Log | VRAM (GPU) Used |
|---|---|---|---|---|
| 1. No MTP in GGUF | 4.9 GiB | 23 GiB | 8.05 GiB | 1807 MiB |
| 2. MTP Embedded (Disabled) | 5.0 GiB | 23 GiB | 8.86 GiB → 8.05 GiB | 1825 MiB |
| 3. MTP Embedded (Enabled) | 5.0 GiB | 29 GiB | 9.10 GiB → 9.05 GiB | 3375 MiB |

Which is a shame, got my hopes up thinking I could run a bigger Mimo-2.5-Pro or Kimi-K2.6 by stripping out the mtp.
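The table is easier to read as deltas against the no-MTP baseline. A small sketch using only the "System RAM (After)" and VRAM columns from the table above; the dictionary keys and the helper name are made up for illustration:

```python
# VRAM used (MiB) and system RAM after load (GiB), copied from the table above.
runs = {
    "no_mtp": {"vram_mib": 1807, "ram_gib": 23},
    "mtp_disabled": {"vram_mib": 1825, "ram_gib": 23},
    "mtp_enabled": {"vram_mib": 3375, "ram_gib": 29},
}


def delta_vs_baseline(name: str, key: str, baseline: str = "no_mtp") -> int:
    """Difference between a run and the baseline run for one measurement."""
    return runs[name][key] - runs[baseline][key]


for name in ("mtp_disabled", "mtp_enabled"):
    vram = delta_vs_baseline(name, "vram_mib")
    ram = delta_vs_baseline(name, "ram_gib")
    print(f"{name}: +{vram} MiB VRAM, +{ram} GiB RAM")
# mtp_disabled: +18 MiB VRAM, +0 GiB RAM
# mtp_enabled: +1568 MiB VRAM, +6 GiB RAM
```

With MTP embedded but disabled, usage in this single test is within 18 MiB of the baseline, so stripping the MTP tensors would free almost nothing.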

Also looks like the storage cost is higher if you want MTP but include it separately:

 9.2G May 21 15:42 qwen3.5-9b-default.q8.gguf
 2.3G May 21 15:37 qwen3.5-9b-mtp.q8.gguf
 8.9G May 21 15:38 qwen3.5-9b-nomtp.q8.gguf
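Worked out from the listing above (sizes are the rounded values `ls` printed, so the results are approximate; the variable names are just for illustration):

```python
# File sizes in GB as shown in the listing above (rounded by ls).
bundled = 9.2  # qwen3.5-9b-default.q8.gguf (MTP included)
mtp_only = 2.3  # qwen3.5-9b-mtp.q8.gguf
trunk_only = 8.9  # qwen3.5-9b-nomtp.q8.gguf

split_total = round(trunk_only + mtp_only, 1)  # two-file layout
overhead = round(split_total - bundled, 1)  # extra disk vs. the bundled file
overhead_pct = round(100 * overhead / bundled, 1)
mtp_in_bundle = round(bundled - trunk_only, 1)  # what MTP adds when bundled

print(f"split: {split_total}G, bundled: {bundled}G, overhead: {overhead}G ({overhead_pct}%)")
print(f"MTP adds {mtp_in_bundle}G when bundled vs {mtp_only}G as a separate file")
```

So the split layout costs roughly 2G more on disk for one quant, which is consistent with the help text's note that the split form duplicates embeddings. The trade-off shifts if several quants of the same model share a single MTP file.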

Personally prefer it be excluded like mmproj. For some models I have multiple quants sitting on the SSD.
I'll load a smaller one if I want speed over quality, a larger one to train control-vectors, etc. So I'd rather just point at the same draft model.
Though lately I've been doing 1 tensor per file splits and managing symlinks / pulling in specific tensors from @Thireus repos (and building ik_llama with -DGGML_MAX_CONTEXTS=2048)

https://nobodywho.ooo/posts/whats-in-a-gguf/

> The projection model is often ~1GB in size - enough of an overhead that we definitely want to skip it when it's not used. But I think it's reasonable to provide two variants of the GGUF: one with projection weights, and one without. That could get us back to the situation of managing just one url to download, just one file to cache on disk, etc.

If the general "vibe" is to offer both for mmproj, then likely the same with mtp?

gghfez

7 days ago

I just hope llama.cpp and ik_llama.cpp are both compatible with whatever the community adopts so I don't have to do https://huggingface.co/gghfez/MiMo-V2.5-Pro-unfused-test again -_-!

Hunterx

5 days ago

I am so waiting for mtp on the big models. Mimo v2.5 Pro and non-Pro have just won me over so much. Trick is to keep min p high though and cap thinking.
