Kimi K2.6 GGUF: Quantized by BatiAI
IQ3_XXS / IQ4_XS quantization of moonshotai/Kimi-K2.6 (1T total / 32B active MoE). Quantized directly from official Moonshot FP8 weights by BatiAI.
Why Kimi K2.6?
- 1T parameters (32B active): frontier-class open-weight model
- SWE-Bench Pro 58.6: beats GPT-5.4 xhigh (57.7), Claude Opus 4.6 max (53.4), Gemini 3.1 Pro (54.2)
- HLE 36.4% (no tools) / 55.5% (w/ tools): Humanity's Last Exam frontier tier
- Agent swarm architecture: 300 sub-agents, 4,000 coordinated steps
- 256K native context (262,144 tokens) via YARN scaling
- Native tool calling: search, code-interpreter, web-browsing
- Modified-MIT license: redistribution + fine-tuning allowed
- Released 2026-04-20 by Moonshot AI
Quick Start
# IQ4_XS (recommended balance, 546GB, M3 Ultra 512GB+)
ollama pull batiai/kimi-k2.6:iq4
# IQ3_XXS (smaller, 394GB, 384GB+ RAM)
ollama pull batiai/kimi-k2.6:iq3
# Q5_K_M (highest quality, 728GB, needs 768GB+ RAM)
ollama pull batiai/kimi-k2.6:q5
Available Quantizations
| Quant | Size | Min RAM | Target Hardware | Notes |
|---|---|---|---|---|
| IQ3_XXS | 394GB | 384GB | M3 Ultra 512GB / H100 node | aggressive compression, imatrix-calibrated |
| IQ4_XS | 546GB | 512GB | M3 Ultra 512GB / 8× A100 80GB | recommended balance |
| Q5_K_M | 728GB | 768GB | 2× M3 Ultra / 8× A100 80GB / H100 node | highest quality, near-original |
⚠️ Not for consumer Macs: this is a workstation / server / frontier research model. 16-128GB Macs should use batiai/qwen3.6-35b or batiai/minimax-m2.7 instead (see the comparison table below).
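As a rough illustration of how the table above maps to a choice of tag, the following sketch picks the largest quantization whose stated minimum RAM fits a given machine. The sizes, minimum-RAM figures, and Ollama tags are copied from this card; the function name `pick_quant` is hypothetical, and meeting the minimum does not guarantee comfortable performance.

```python
# Figures copied from the "Available Quantizations" table and Quick Start tags.
# Ordered from largest to smallest so the first match is the highest quality.
QUANTS = [
    # (ollama tag, file size in GB, minimum RAM in GB)
    ("batiai/kimi-k2.6:q5", 728, 768),   # Q5_K_M
    ("batiai/kimi-k2.6:iq4", 546, 512),  # IQ4_XS
    ("batiai/kimi-k2.6:iq3", 394, 384),  # IQ3_XXS
]


def pick_quant(ram_gb):
    """Return the largest quant tag whose minimum RAM fits, or None (hypothetical helper)."""
    for tag, _size_gb, min_ram_gb in QUANTS:
        if ram_gb >= min_ram_gb:
            return tag
    return None


if __name__ == "__main__":
    for ram in (128, 384, 512, 768):
        print(ram, "GB ->", pick_quant(ram))
```

A `None` result corresponds to the machines this card points toward the smaller BatiAI models instead.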
Hardware Reality Check
| Your System | IQ3 (394GB) | IQ4 (546GB) | Q5 (728GB) |
|---|---|---|---|
| Mac 128GB | ❌ Won't fit | ❌ | ❌ |
| Mac 192GB | ❌ Won't fit | ❌ | ❌ |
| Mac 256GB | ⚠️ Heavy swap (unusable) | ❌ | ❌ |
| Mac 384GB | ⚠️ Tight | ❌ | ❌ |
| Mac M3 Ultra 512GB | ✅ Comfortable | ✅ Usable (tight) | ❌ |
| 2× M3 Ultra (cluster) | ✅ | ✅ | ✅ |
| 8× A100 80GB (640GB total) | ✅ | ✅ Fast | ❌ |
| H100 node (640GB+) | ✅ Fast | ✅ Fast | ✅ Fast |
Numbers are based on MoE activation patterns: 32B active params × 4 bytes of buffer ≈ 130GB of runtime memory even after quantization, plus shard headers and KV cache (at 256K context, the cache alone is 30-80GB).
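The arithmetic behind that estimate can be written out as a minimal sketch. It only reproduces the figures quoted above (32B active parameters, 4 bytes each, 30-80GB of KV cache at 256K context); the function names are hypothetical, and real memory use depends on the runtime and its settings.

```python
def active_buffer_gb(active_params_billions=32, bytes_per_param=4):
    """Buffer for the active parameters: billions of params x bytes per param = GB."""
    return active_params_billions * bytes_per_param


def runtime_overhead_gb(kv_cache_gb, active_params_billions=32, bytes_per_param=4):
    """Active-parameter buffer plus KV cache, both in GB (shard headers not included)."""
    return active_buffer_gb(active_params_billions, bytes_per_param) + kv_cache_gb


if __name__ == "__main__":
    print("active buffer:", active_buffer_gb(), "GB")  # 32 * 4, rounded to ~130GB in the text
    for kv in (30, 80):  # KV cache range quoted for 256K context
        print("with", kv, "GB KV cache:", runtime_overhead_gb(kv), "GB")
```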
What BatiAI's Quantization Delivers
| | BatiAI | unsloth / ubergarm |
|---|---|---|
| Source | Direct from official Moonshot FP8 weights | Same (major providers) |
| Quantization flow | FP8 → Q8_0 → IQ3_XXS/IQ4_XS with imatrix (wikitext-2 calibration, 200 chunks) | Similar |
| imatrix | ✅ 200 chunks (quality saturation point) | Varies |
| Tool-calling preservation | ✅ Native template preserved | ✅ |
| Korean validation | Pending (benchmark on target hardware) | ❌ |
| BatiAI signature | ✅ general.author=BatiAI, general.url=https://flow.bati.ai | ❌ |
| Pipeline | Open source: docs/202604-large-moe-quantization.md | Internal |
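The quantization flow in the table (Q8_0 intermediate, imatrix from wikitext-2 with 200 chunks, then IQ3_XXS/IQ4_XS) could be scripted roughly as below. This is a sketch, not BatiAI's actual pipeline: the file names are hypothetical, and the `llama-imatrix` and `llama-quantize` flags reflect my understanding of recent llama.cpp builds (including `--allow-requantize`, which I assume is needed when the input is already Q8_0), so check each tool's `--help` before relying on them. The code only builds and prints the commands.

```python
import shlex


def imatrix_cmd(q8_model, calib_text, out_file="imatrix.dat", chunks=200):
    """Command to compute an importance matrix from a calibration text file."""
    return ["llama-imatrix", "-m", q8_model, "-f", calib_text,
            "-o", out_file, "--chunks", str(chunks)]


def quantize_cmd(q8_model, out_model, quant_type, imatrix_file="imatrix.dat"):
    """Command to requantize a Q8_0 GGUF to a smaller type using the imatrix."""
    return ["llama-quantize", "--allow-requantize", "--imatrix", imatrix_file,
            q8_model, out_model, quant_type]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    q8 = "Kimi-K2.6-Q8_0.gguf"
    print(shlex.join(imatrix_cmd(q8, "wiki.train.raw")))
    for qtype in ("IQ3_XXS", "IQ4_XS"):
        print(shlex.join(quantize_cmd(q8, f"Kimi-K2.6-{qtype}.gguf", qtype)))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the paths and flags are confirmed for the installed llama.cpp version.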
Model Comparison: BatiAI Model Lineup
Kimi K2.6 is for frontier workstation users. For everyone else:
| Your Hardware | Best BatiAI Model | Size |
|---|---|---|
| 16GB Mac | batiai/gemma4-e4b:q4 | 4.9GB |
| 24GB Mac | batiai/gemma4-26b:iq4 | 15GB |
| 48GB Mac | batiai/qwen3.5-35b:iq4 | 22GB |
| 96GB Mac | batiai/qwen3.6-35b:iq4 | 22GB |
| 128GB Mac | batiai/minimax-m2.7:iq3 | 82GB |
| M3 Ultra 512GB / H100 | batiai/kimi-k2.6:iq4 | 546GB |
Benchmarks (source model)
Benchmark numbers are from Moonshot AI's official report. We have not yet validated that aggressive quantization preserves these capabilities (bench.sh runs on the M3 Ultra / H100 targets are pending).
| Benchmark | Kimi K2.6 | Comparison |
|---|---|---|
| SWE-Bench Pro | 58.6 | GPT-5.4 xhigh 57.7, Opus 4.6 max 53.4 |
| HLE (no tools) | 36.4% | frontier tier |
| HLE (w/ tools) | 55.5% | frontier tier |
| Context | 256K | YARN scaling |
| Native tool use | ✅ | search, code, web |
Technical Details
- Original Model: moonshotai/Kimi-K2.6
- Architecture: Mixture of Experts, 1T total / 32B active, 61 layers, 384 experts (8 selected + 1 shared), MLA attention
- Original storage: FP8 / INT4 hybrid QAT (555GB)
- License: Modified-MIT
- Quantized with: llama.cpp
- Calibration: wikitext-2-raw, 200 chunks (quality saturation)
- Quantized by: BatiAI
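To make the "8 selected + 1 shared" figure concrete, here is a generic top-k routing sketch. It is not Moonshot's router: the scoring and softmax normalization are assumptions chosen for illustration, and only the counts (384 routed experts, 8 selected per token) come from the list above. The shared expert runs for every token and is therefore outside the selection step.

```python
import math
import random

NUM_EXPERTS = 384  # routed experts, from the architecture line above
TOP_K = 8          # routed experts selected per token


def route(scores, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]  # subtract peak for stability
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
# Stand-in router scores for one token; a real model computes these from the hidden state.
scores = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route(scores)

if __name__ == "__main__":
    for expert_id, weight in selected:
        print(f"expert {expert_id}: weight {weight:.3f}")
```

Only the selected experts' weights are read for a given token, which is why the active parameter count (32B) is far below the total (1T).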
Usage
llama.cpp
./llama-cli -m Kimi-K2.6-IQ4_XS.gguf \
-p "Your prompt" \
--ctx-size 65536 \
--n-gpu-layers 99
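If you prefer to launch the same command from a script, a small wrapper along these lines may help. It only assembles the argument list shown above; the function name `llama_cli_cmd` is hypothetical, and the model path is the one from the example, so adjust it to the file you actually downloaded.

```python
import shlex


def llama_cli_cmd(model_path, prompt, ctx_size=65536, n_gpu_layers=99):
    """Build the llama-cli argument list used in the example above."""
    return ["./llama-cli", "-m", model_path, "-p", prompt,
            "--ctx-size", str(ctx_size), "--n-gpu-layers", str(n_gpu_layers)]


if __name__ == "__main__":
    cmd = llama_cli_cmd("Kimi-K2.6-IQ4_XS.gguf", "Your prompt")
    print(shlex.join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell-quoting problems when the prompt contains spaces or quotes.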
Ollama
ollama run batiai/kimi-k2.6:iq4
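Once the model is pulled, it can also be queried programmatically through Ollama's local REST API. The sketch below assumes Ollama is listening on its default address (`http://localhost:11434`) and uses the `/api/chat` endpoint with streaming disabled; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local address


def build_chat_request(prompt, model="batiai/kimi-k2.6:iq4"):
    """Build a non-streaming chat request for a single user message."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return json.loads(resp.read())["message"]["content"]


if __name__ == "__main__":
    print(chat("Summarize what a Mixture of Experts model is in one sentence."))
```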
vLLM / TGI
Not directly compatible: these serve FP8/BF16 safetensors. Use the original moonshotai/Kimi-K2.6 for vLLM.
About BatiAI
BatiAI quantizes frontier open-weight models with validated quality and transparent provenance. We built BatiFlow (free, on-device AI automation for Mac) and open-source our full quantization pipeline.
The Kimi K2.6 release demonstrates our pipeline handles 1T+ MoE models (most quantization providers stop at 70B). See our Kimi K2.6 quantization notes for the engineering trade-offs.
License
Quantized from moonshotai/Kimi-K2.6. License: Modified-MIT (commercial use + redistribution allowed).