---
library_name: mlx
tags:
- mlx
- safetensors
- kimi_k25
- quantized
- moe-aware-quant
- image-text-to-text
- conversational
- custom_code
base_model: moonshotai/Kimi-K2.6
base_model_relation: quantized
language:
- en
pipeline_tag: image-text-to-text
---

# Kimi-K2.6-MoE-Smart-Quant (MLX)

MoE-aware mixed-precision quantization of [moonshotai/Kimi-K2.6](https://huggingface.co/moonshotai/Kimi-K2.6) for Apple Silicon.

## Quantization Strategy

Unlike uniform quantization, this conversion applies **per-component bit allocation** tuned for the model's MoE + MLA architecture:

| Component | Bits | Rationale |
|-----------|------|-----------|
| Routed experts (384 SwitchLinear) | 4-bit | Only 8 of 384 fire per token, so they tolerate low-bit quantization well |
| Shared expert (always active) | 6-bit | Every-token path, needs precision |
| MLA value projections (v_a/v_b) | 8-bit | Most sensitive attention weights |
| MLA other projections (q_a/q_b/kv_a/kv_b/o) | 6-bit | Latent compression layer |
| lm_head + embed_tokens | 8-bit | Output quality |
| First/last 3 decoder layers | 6-bit | Boundary layer sensitivity |
| Gate/router | unquantized | Few parameters, but routing-critical |
| Vision encoder | unquantized | Preserved via mlx-vlm |

**Effective average: ~4.5 bpw**, targeting near-6-bit quality at near-4-bit size.
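
The table can be read as an ordered set of rules over module paths. The sketch below is one hypothetical way to encode it in Python; it is not the conversion script used for this model. The path substrings (`vision`, `.gate`, `switch_mlp`, `shared_experts`, `self_attn`, `.v_a`/`.v_b`), the rule precedence (for example, boundary layers overriding the 4-bit expert rule), and the 4-bit fallback are all assumptions made for illustration.

```python
import re
from typing import Optional

BOUNDARY_LAYERS = 3  # "first/last 3 decoder layers" from the table


def bits_for(path: str, num_layers: int) -> Optional[int]:
    """Return the bit width for a module path, or None to leave it unquantized.

    All name patterns here are hypothetical; real module names may differ.
    """
    # Vision encoder and MoE router stay unquantized.
    if "vision" in path or path.endswith(".gate"):
        return None
    # Embeddings and output head.
    if "embed_tokens" in path or "lm_head" in path:
        return 8
    # MLA value projections (v_a / v_b in the table).
    if re.search(r"\.v_[ab]", path):
        return 8
    # Boundary decoder layers (assumed to take precedence over the rules below).
    match = re.search(r"layers\.(\d+)\.", path)
    if match:
        idx = int(match.group(1))
        if idx < BOUNDARY_LAYERS or idx >= num_layers - BOUNDARY_LAYERS:
            return 6
    # Routed experts.
    if "switch_mlp" in path:
        return 4
    # Shared expert and the remaining MLA projections.
    if "shared_experts" in path or "self_attn" in path:
        return 6
    # Assumed fallback for anything the table does not cover.
    return 4
```

For example, `bits_for("language_model.model.layers.10.mlp.switch_mlp.up_proj", 61)` returns `4`, while the same path with `layers.0` returns `6`. The layer count `61` is only an illustrative argument.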

## Model Details

- **Base model**: Kimi-K2.6 (1T params, 32B active, 384 experts)
- **Architecture**: MoE + MLA (kimi_k25)
- **Context**: 256K tokens
- **Modality**: Vision + Language (VLM)
- **Converted with**: mlx-vlm 0.4.2

## Usage
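
A minimal sketch of image + text generation through the mlx-vlm Python API, assuming mlx-vlm is installed on an Apple Silicon machine and the weights have finished uploading. `REPO_ID` is a placeholder, and `describe_image` is a hypothetical helper name. Because this card is tagged `custom_code`, loading may additionally require trusting remote code; that is not shown here.

```python
REPO_ID = "<this-repo-id>"  # placeholder: substitute this model's Hugging Face repo id


def describe_image(image: str, prompt: str, repo_id: str = REPO_ID):
    """Load the quantized model and run one image + text prompt through it."""
    # Imported lazily so this file can be imported on machines without MLX.
    from mlx_vlm import load, generate
    from mlx_vlm.prompt_utils import apply_chat_template
    from mlx_vlm.utils import load_config

    model, processor = load(repo_id)
    config = load_config(repo_id)
    # Wrap the raw prompt in the model's chat template, reserving one image slot.
    formatted = apply_chat_template(processor, config, prompt, num_images=1)
    return generate(model, processor, formatted, [image], verbose=False)
```

Call it as `describe_image("path/to/image.jpg", "Describe this image.")` and print the result.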


## Hardware Requirements

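- **Single node**: the ~150GB footprint on an M3/M4 Ultra 192GB+ is an unverified estimate. At ~4.5 bpw, 1T parameters work out to roughly 560GB of weights (10^12 × 4.5 / 8 bytes), so budget unified memory against that figure.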
- **Distributed**: 2x M3 Ultra via JACCL/RDMA for headroom
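
As a back-of-the-envelope check when sizing hardware, weight storage can be estimated from the two figures stated in this card (1T parameters, ~4.5 bpw). The sketch below assumes the ~4.5 bpw figure is the all-in average and ignores the KV cache, activations, and runtime overhead, so the real requirement is higher. `weight_footprint_gb` is a hypothetical helper name.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_weight / 8  # 8 bits per byte
    return total_bytes / 1e9


# Figures from this card: 1T parameters at ~4.5 bits per weight.
print(f"{weight_footprint_gb(1e12, 4.5):.1f} GB")  # prints "562.5 GB"
```

Compare the result against the unified memory actually available, per node or summed across nodes, before choosing a configuration.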

---

*Conversion is in progress and the weights are still uploading.*