Instructions to use batiai/Kimi-K2.6-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use batiai/Kimi-K2.6-GGUF with llama-cpp-python:

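# !pip install llama-cpp-python

from llama_cpp import Llama

# The IQ3_XXS quant is split into 9 shards. Pass the first shard as `filename`
# and the remaining shards as `additional_files` (shard names are assumed to
# follow the same 0000N-of-00009 pattern as the first one).
llm = Llama.from_pretrained(
	repo_id="batiai/Kimi-K2.6-GGUF",
	filename="moonshotai-Kimi-K2.6-IQ3_XXS-00001-of-00009.gguf",
	additional_files=[
		f"moonshotai-Kimi-K2.6-IQ3_XXS-{i:05d}-of-00009.gguf" for i in range(2, 10)
	],
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)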
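
`create_chat_completion` returns an OpenAI-style dictionary rather than a plain string. The following is a minimal sketch of pulling the reply text out of that dictionary, assuming the usual `choices[0].message.content` layout. The helper name `reply_text` and the sample response are illustrative and not part of llama-cpp-python.

```python
def reply_text(completion: dict) -> str:
    """Return the assistant's text from an OpenAI-style chat completion dict."""
    choices = completion.get("choices") or []
    if not choices:
        raise ValueError("completion contains no choices")
    # Each choice carries a message with 'role' and 'content' keys.
    return choices[0]["message"]["content"] or ""


# Hand-written stand-in for a real response, for illustration only.
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Paris."}}
    ]
}
print(reply_text(sample))
```

With a loaded model, the same helper would be applied to the return value of `llm.create_chat_completion(...)`.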

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use batiai/Kimi-K2.6-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf batiai/Kimi-K2.6-GGUF:IQ3_XXS
# Run inference directly in the terminal:
llama-cli -hf batiai/Kimi-K2.6-GGUF:IQ3_XXS

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf batiai/Kimi-K2.6-GGUF:IQ3_XXS
# Run inference directly in the terminal:
llama-cli -hf batiai/Kimi-K2.6-GGUF:IQ3_XXS

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf batiai/Kimi-K2.6-GGUF:IQ3_XXS
# Run inference directly in the terminal:
./llama-cli -hf batiai/Kimi-K2.6-GGUF:IQ3_XXS

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf batiai/Kimi-K2.6-GGUF:IQ3_XXS
# Run inference directly in the terminal:
./build/bin/llama-cli -hf batiai/Kimi-K2.6-GGUF:IQ3_XXS

Use Docker

docker model run hf.co/batiai/Kimi-K2.6-GGUF:IQ3_XXS
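
Once `llama-server` is running through any of the routes above, it can be queried from code as well as from the web UI. The sketch below builds a chat request with the Python standard library, assuming the server listens on its default `http://localhost:8080/v1` and exposes the OpenAI-compatible `/chat/completions` route. The function names `build_chat_request` and `send` are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8080/v1",
                       model: str = "batiai/Kimi-K2.6-GGUF:IQ3_XXS") -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 600.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    req = build_chat_request("What is the capital of France?")
    print(req.full_url)
    # Uncomment once llama-server is running locally:
    # print(send(req))
```

Building and sending are kept separate so the request can be inspected without a running server. The generous timeout is a guess for a model of this size, not a measured value.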

LM Studio
Jan

vLLM

How to use batiai/Kimi-K2.6-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
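# Start the vLLM server:
# (the Usage section of this model card states that these GGUF files are not
# directly compatible with vLLM; serve the original moonshotai/Kimi-K2.6 if this fails)
vllm serve "batiai/Kimi-K2.6-GGUF"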
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "batiai/Kimi-K2.6-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/batiai/Kimi-K2.6-GGUF:IQ3_XXS

Ollama
How to use batiai/Kimi-K2.6-GGUF with Ollama:
```
ollama run hf.co/batiai/Kimi-K2.6-GGUF:IQ3_XXS
```

Unsloth Studio

How to use batiai/Kimi-K2.6-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for batiai/Kimi-K2.6-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for batiai/Kimi-K2.6-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for batiai/Kimi-K2.6-GGUF to start chatting

Pi

How to use batiai/Kimi-K2.6-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf batiai/Kimi-K2.6-GGUF:IQ3_XXS

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "batiai/Kimi-K2.6-GGUF:IQ3_XXS"
        }
      ]
    }
  }
}
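
The JSON above can also be written programmatically. This is a minimal sketch that merges the `llama-cpp` provider into an existing `models.json` or creates the file if it is missing. The merge behaviour and the helper names `pi_models_config` and `write_pi_models` are assumptions for illustration; how Pi treats other providers already in the file has not been verified here.

```python
import json
import tempfile
from pathlib import Path


def pi_models_config(model_id: str, base_url: str = "http://localhost:8080/v1") -> dict:
    """Return the provider entry shown above as a Python dict."""
    return {
        "providers": {
            "llama-cpp": {
                "baseUrl": base_url,
                "api": "openai-completions",
                "apiKey": "none",
                "models": [{"id": model_id}],
            }
        }
    }


def write_pi_models(path: Path, model_id: str) -> Path:
    """Merge the llama-cpp provider into models.json, creating it if needed."""
    config = json.loads(path.read_text()) if path.exists() else {}
    config.setdefault("providers", {}).update(pi_models_config(model_id)["providers"])
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return path


if __name__ == "__main__":
    # Written to a scratch directory here; use ~/.pi/agent/models.json for real.
    target = Path(tempfile.mkdtemp()) / "models.json"
    write_pi_models(target, "batiai/Kimi-K2.6-GGUF:IQ3_XXS")
    print(target.read_text())
```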

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use batiai/Kimi-K2.6-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf batiai/Kimi-K2.6-GGUF:IQ3_XXS

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default batiai/Kimi-K2.6-GGUF:IQ3_XXS

Run Hermes

hermes

Docker Model Runner
How to use batiai/Kimi-K2.6-GGUF with Docker Model Runner:
```
docker model run hf.co/batiai/Kimi-K2.6-GGUF:IQ3_XXS
```

Lemonade

How to use batiai/Kimi-K2.6-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull batiai/Kimi-K2.6-GGUF:IQ3_XXS

Run and chat with the model

lemonade run user.Kimi-K2.6-GGUF-IQ3_XXS

List all available models

lemonade list

Kimi K2.6 GGUF — Quantized by BatiAI

IQ3_XXS, IQ4_XS, and Q5_K_M quantizations of moonshotai/Kimi-K2.6 (1T total / 32B active MoE), produced by BatiAI directly from the official Moonshot FP8 weights.

Why Kimi K2.6?

1T parameters (32B active) — frontier-class open weight model
SWE-Bench Pro 58.6 — beats GPT-5.4 xhigh (57.7), Claude Opus 4.6 max (53.4), Gemini 3.1 Pro (54.2)
HLE 36.4% (no tools) / 55.5% (w/ tools) — Humanity's Last Exam frontier tier
Agent swarm architecture — 300 sub-agents, 4,000 coordinated steps
256K native context (262,144 tokens) via YaRN scaling
Native tool calling — search, code-interpreter, web-browsing
Modified-MIT license — redistribution + fine-tuning allowed
Released 2026-04-20 by Moonshot AI

Quick Start

# IQ4_XS (recommended balance, 546GB, M3 Ultra 512GB+)
ollama pull batiai/kimi-k2.6:iq4

# IQ3_XXS (smaller, 394GB, 384GB+ RAM)
ollama pull batiai/kimi-k2.6:iq3

# Q5_K_M (highest quality, 728GB, needs 768GB+ RAM)
ollama pull batiai/kimi-k2.6:q5

Available Quantizations

Quant	Size	Min RAM	Target Hardware	Notes
IQ3_XXS	394GB	384GB	M3 Ultra 512GB / H100 node	aggressive compression, imatrix-calibrated
IQ4_XS	546GB	512GB	M3 Ultra 512GB / 8×A100 80GB	recommended balance
Q5_K_M	728GB	768GB	2× M3 Ultra / 8×A100 80GB / H100 node	highest quality, near-original
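
As a quick way to read the table, the sketch below checks which quantizations meet the Min RAM column for a given machine. It only encodes the figures listed above and does not model swap, CPU offload, or multi-GPU splitting. The names `QUANTS` and `quants_that_fit` are illustrative.

```python
# Size and minimum-RAM figures copied from the table above (GB).
QUANTS = {
    "IQ3_XXS": {"size_gb": 394, "min_ram_gb": 384},
    "IQ4_XS": {"size_gb": 546, "min_ram_gb": 512},
    "Q5_K_M": {"size_gb": 728, "min_ram_gb": 768},
}


def quants_that_fit(ram_gb: int) -> list[str]:
    """Return the quantizations whose listed minimum RAM is satisfied."""
    return [name for name, q in QUANTS.items() if ram_gb >= q["min_ram_gb"]]


print(quants_that_fit(512))  # ['IQ3_XXS', 'IQ4_XS']
print(quants_that_fit(128))  # []
```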

⚠️ Not for consumer Mac — this is a workstation / server / frontier research model. 16-128GB Macs should use batiai/qwen3.6-35b or batiai/minimax-m2.7 instead (see comparison table below).

Hardware Reality Check

Your System	IQ3 (394GB)	IQ4 (546GB)	Q5 (728GB)
Mac 128GB	❌ Won't fit	❌	❌
Mac 192GB	❌ Won't fit	❌	❌
Mac 256GB	⚠️ Heavy swap (unusable)	❌	❌
Mac 384GB	⚠️ Tight	❌	❌
Mac M3 Ultra 512GB	✅ Comfortable	✅ Usable (tight)	❌
2× M3 Ultra (cluster)	✅	✅	✅
8× A100 80GB (640GB total)	✅	✅ Fast	✅
H100 node (640GB+)	✅ Fast	✅ Fast	✅ Fast

These estimates are based on MoE activation patterns: 32B active parameters × 4 bytes of buffer ≈ 128GB (roughly 130GB) of runtime memory even after quantization, plus shard headers and the KV cache, which alone takes 30-80GB at 256K context.
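
The arithmetic behind that estimate can be spelled out as follows. This is a sketch of one reading of the note: the runtime buffer and the KV cache are simply added, and shard headers are left out because the text gives no figure for them. The function names are illustrative.

```python
ACTIVE_PARAMS = 32e9   # 32B active parameters per token
BYTES_PER_PARAM = 4    # buffer width assumed in the note above
GB = 1e9               # decimal gigabytes


def runtime_buffer_gb(active_params: float = ACTIVE_PARAMS,
                      bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Working-buffer estimate: active parameters times bytes per parameter."""
    return active_params * bytes_per_param / GB


def runtime_overhead_gb(kv_cache_gb: float) -> float:
    """Runtime buffer plus KV cache (shard headers not included)."""
    return runtime_buffer_gb() + kv_cache_gb


print(runtime_buffer_gb())                               # 128.0
print(runtime_overhead_gb(30), runtime_overhead_gb(80))  # 158.0 208.0
```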

What BatiAI's Quantization Delivers

	BatiAI	unsloth / ubergarm
Source	Direct from official Moonshot FP8 weights	Same (major providers)
Quantization flow	FP8 → Q8_0 → IQ3_XXS/IQ4_XS with imatrix (wikitext-2 calibration, 200 chunks)	Similar
imatrix	✅ 200 chunks (quality saturation point)	Varies
Tool-calling preservation	✅ Native template preserved	✅
Korean validation	✅ (pending benchmark on target hardware)	✗
BatiAI signature	✅ `general.author=BatiAI`, `general.url=https://flow.bati.ai`	✗
Pipeline	Open source — `docs/202604-large-moe-quantization.md`	Internal
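
For orientation, the quantization flow in the table (Q8_0 intermediate, imatrix over 200 chunks, then the final quant) maps onto three llama.cpp tools. The sketch below only builds the command lines and does not run them. It is not BatiAI's actual pipeline: the file names are hypothetical, and whether `convert_hf_to_gguf.py` accepts the FP8 checkpoint without extra steps is an assumption.

```python
import shlex


def quantization_commands(hf_dir: str, calib_file: str, target: str = "IQ3_XXS",
                          chunks: int = 200, workdir: str = "out") -> list[list[str]]:
    """Return the three llama.cpp invocations for a Q8_0 -> imatrix -> quant flow."""
    q8 = f"{workdir}/model-Q8_0.gguf"          # hypothetical intermediate file
    imatrix = f"{workdir}/imatrix.dat"         # hypothetical importance matrix
    final = f"{workdir}/model-{target}.gguf"   # hypothetical output file
    return [
        # 1. Convert the Hugging Face checkpoint to a Q8_0 GGUF.
        ["python", "convert_hf_to_gguf.py", hf_dir, "--outfile", q8, "--outtype", "q8_0"],
        # 2. Compute the importance matrix over the calibration text.
        ["llama-imatrix", "-m", q8, "-f", calib_file, "-o", imatrix, "--chunks", str(chunks)],
        # 3. Quantize to the target type, guided by the importance matrix.
        ["llama-quantize", "--imatrix", imatrix, q8, final, target],
    ]


for cmd in quantization_commands("Kimi-K2.6", "wiki.train.raw"):
    print(shlex.join(cmd))
```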

Model Comparison — BatiAI Model Lineup

Kimi K2.6 is for frontier workstation users. For everyone else:

Your Hardware	Best BatiAI Model	Size
16GB Mac	`batiai/gemma4-e4b:q4`	4.9GB
24GB Mac	`batiai/gemma4-26b:iq4`	15GB
48GB Mac	`batiai/qwen3.5-35b:iq4`	22GB
96GB Mac	`batiai/qwen3.6-35b:iq4`	22GB
128GB Mac	`batiai/minimax-m2.7:iq3`	82GB
M3 Ultra 512GB / H100	`batiai/kimi-k2.6:iq4`	546GB

Benchmarks (source model)

Benchmark numbers from Moonshot AI's official report — validating that aggressive quantization preserves these capabilities is pending on our end (bench.sh on M3 Ultra / H100 target).

Benchmark	Kimi K2.6	Comparison
SWE-Bench Pro	58.6	GPT-5.4 xhigh 57.7, Opus 4.6 max 53.4
HLE (no tools)	36.4%	frontier tier
HLE (w/ tools)	55.5%	frontier tier
Context	256K	YaRN scaling
Native tool use	✅	search, code, web

Technical Details

Original Model: moonshotai/Kimi-K2.6
Architecture: Mixture of Experts — 1T total / 32B active, 61 layers, 384 experts (8 selected + 1 shared), MLA attention
Original storage: FP8 / INT4 hybrid QAT (555GB)
License: Modified-MIT
Quantized with: llama.cpp
Calibration: wikitext-2-raw, 200 chunks (quality saturation)
Quantized by: BatiAI
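
To make "384 experts (8 selected + 1 shared)" concrete, here is a generic top-k routing sketch in plain Python. It picks the 8 highest-scoring routed experts for one token and normalises their weights with a softmax. This illustrates the general technique only; the gating function the model actually uses may differ, and the random logits stand in for a real router's output.

```python
import math
import random

NUM_EXPERTS = 384   # routed experts per layer (from the list above)
TOP_K = 8           # routed experts selected per token


def route(logits: list[float], k: int = TOP_K) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)                 # subtract max for stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
router_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in scores
selected = route(router_logits)
print(selected)
# Share of routed experts touched per token (the shared expert always runs):
print(f"{TOP_K / NUM_EXPERTS:.1%}")
```

Only the selected experts' weights participate in a token's forward pass, which is why the active parameter count (32B) is a small fraction of the total (1T).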

Usage

llama.cpp

./llama-cli -m Kimi-K2.6-IQ4_XS.gguf \
  -p "Your prompt" \
  --ctx-size 65536 \
  --n-gpu-layers 99

Ollama

ollama run batiai/kimi-k2.6:iq4
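
After the model has been pulled, Ollama also serves it over its local REST API. The sketch below builds a non-streaming `/api/chat` request with the standard library, assuming Ollama's default address `http://localhost:11434`. The function names `build_ollama_chat` and `ask` are illustrative.

```python
import json
import urllib.request


def build_ollama_chat(prompt: str,
                      model: str = "batiai/kimi-k2.6:iq4",
                      host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat request for Ollama."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON object instead of a stream
    }
    return urllib.request.Request(
        url=f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_ollama_chat(prompt)) as resp:
        return json.load(resp)["message"]["content"]


if __name__ == "__main__":
    print(build_ollama_chat("Your prompt").full_url)
    # Uncomment once `ollama serve` is running with the model pulled:
    # print(ask("Your prompt"))
```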

vLLM / TGI

Not directly compatible: vLLM and TGI serve FP8/BF16 safetensors, not these GGUF files. Use the original moonshotai/Kimi-K2.6 for vLLM.

About BatiAI

BatiAI quantizes frontier open weight models with validated quality and transparent provenance. We built BatiFlow — free, on-device AI automation for Mac — and open-source our full quantization pipeline.

The Kimi K2.6 release demonstrates that our pipeline handles 1T+ MoE models, a scale many quantization providers do not attempt. See our Kimi K2.6 quantization notes for the engineering trade-offs.

License

Quantized from moonshotai/Kimi-K2.6. License: Modified-MIT — commercial use + redistribution allowed.

Downloads last month: 1,730

Format: GGUF
Model size: 1T params
Architecture: deepseek2
Hardware compatibility: 3-bit, 4-bit, 5-bit

Model tree for batiai/Kimi-K2.6-GGUF

Base model: moonshotai/Kimi-K2.6
Quantized: 34 models, including this one
