---
library_name: mlx
tags:
- mlx
- safetensors
- kimi_k25
- quantized
- moe-aware-quant
- image-text-to-text
- conversational
- custom_code
base_model: moonshotai/Kimi-K2.6
base_model_relation: quantized
language:
- en
pipeline_tag: image-text-to-text
---

# Kimi-K2.6-MoE-Smart-Quant (MLX)

MoE-aware mixed-precision quantization of [moonshotai/Kimi-K2.6](https://huggingface.co/moonshotai/Kimi-K2.6) for Apple Silicon.

## Quantization Strategy

Unlike uniform quantization, this applies **per-component bit allocation** optimized for MoE + MLA architecture:

| Component | Bits | Rationale |
|-----------|------|-----------|
| Routed experts (384 SwitchLinear) | 4-bit | Only 8/384 fire per token, so very tolerant of low-bit |
| Shared expert (always active) | 6-bit | Every-token path, needs precision |
| MLA value projections (v_a/v_b) | 8-bit | Most sensitive attention weights |
| MLA other projections (q_a/q_b/kv_a/kv_b/o) | 6-bit | Latent compression layer |
| lm_head + embed_tokens | 8-bit | Output quality |
| First/last 3 decoder layers | 6-bit | Boundary layer sensitivity |
| Gate/router | unquantized | Tiny params, routing-critical |
| Vision encoder | unquantized | Preserved via mlx-vlm |

**Effective average: ~4.5 bpw**, giving near-6-bit quality at near-4-bit size.

## Model Details

- **Base model**: Kimi-K2.6 (1T params, 32B active, 384 experts)
- **Architecture**: MoE + MLA (kimi_k25)
- **Context**: 256K tokens
- **Modality**: Vision + Language (VLM)
- **Converted with**: mlx-vlm 0.4.2

## Usage

## Hardware Requirements

- **Single node**: M3/M4 Ultra 192GB+ (fits in ~150GB)
- **Distributed**: 2x M3 Ultra via JACCL/RDMA for headroom

---

*Weights uploading; conversion in progress.*
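
The per-component allocation in the Quantization Strategy table can be read as a rule that maps a weight's path to a bit width. The sketch below is an illustration only, not the conversion script used for this model: the function name `bits_for`, the layer-name substrings (`vision`, `mlp.gate`, `shared_experts`, `v_a_proj`, and so on), and the order in which rules take precedence are all assumptions, since the table does not say how overlapping rules (for example, a routed expert inside one of the first three layers) are resolved.

```python
import re
from typing import Optional


def bits_for(path: str, num_layers: int, boundary: int = 3) -> Optional[int]:
    """Return the bit width for a weight path, or None to leave it unquantized.

    Every substring matched here is a hypothetical layer name chosen for
    illustration; real checkpoint names may differ.
    """
    # Router and vision encoder stay unquantized.
    if "vision" in path or path.endswith("mlp.gate"):
        return None
    # Output head and token embeddings: 8-bit.
    if "lm_head" in path or "embed_tokens" in path:
        return 8
    # MLA attention: value projections 8-bit, all other projections 6-bit.
    if "self_attn" in path:
        return 8 if re.search(r"\.v_[ab]_proj$", path) else 6
    # Shared expert runs on every token: 6-bit.
    if "shared_experts" in path:
        return 6
    # Assumed precedence: remaining weights in the first/last `boundary`
    # decoder layers are raised to 6-bit, including routed experts.
    match = re.search(r"layers\.(\d+)\.", path)
    if match:
        idx = int(match.group(1))
        if idx < boundary or idx >= num_layers - boundary:
            return 6
    # Everything else (routed experts in the middle layers): 4-bit.
    return 4
```

With a toy depth of `num_layers=10`, `bits_for("model.layers.5.mlp.switch_mlp.up_proj", 10)` falls through to the 4-bit default, while the same path at layer 0 is raised to 6-bit by the boundary rule. A rule of this shape could be adapted into a per-layer quantization predicate if the conversion tool in use supports one; check that tool's documentation for the exact callback signature.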