mlx-community/Kimi-K2.6-MoE-Smart-Quant


machiabeli committed on Apr 20

Commit 3dbfd6f · verified · 1 Parent(s): ff9ffe9

Upload README.md with huggingface_hub

Files changed (1)

README.md +59 -0

README.md ADDED

	@@ -0,0 +1,59 @@

+---
+library_name: mlx
+tags:
+- mlx
+- safetensors
+- kimi_k25
+- quantized
+- moe-aware-quant
+- image-text-to-text
+- conversational
+- custom_code
+base_model: moonshotai/Kimi-K2.6
+base_model_relation: quantized
+language:
+- en
+pipeline_tag: image-text-to-text
+---
+# Kimi-K2.6-MoE-Smart-Quant (MLX)
+MoE-aware mixed-precision quantization of [moonshotai/Kimi-K2.6](https://huggingface.co/moonshotai/Kimi-K2.6) for Apple Silicon.
+## Quantization Strategy
+Unlike uniform quantization, this applies **per-component bit allocation** optimized for the MoE + MLA architecture:
+| Component | Bits | Rationale |
+|-----------|------|-----------|
+| Routed experts (384 SwitchLinear) | 4-bit | Only 8 of 384 fire per token, so they tolerate low-bit quantization well |
+| Shared expert (always active) | 6-bit | Every-token path, needs precision |
+| MLA value projections (v_a/v_b) | 8-bit | Most sensitive attention weights |
+| MLA other projections (q_a/q_b/kv_a/kv_b/o) | 6-bit | Latent compression layer |
+| lm_head + embed_tokens | 8-bit | Output quality |
+| First/last 3 decoder layers | 6-bit | Boundary layer sensitivity |
+| Gate/router | unquantized | Tiny params, routing-critical |
+| Vision encoder | unquantized | Preserved via mlx-vlm |
+**Effective average: ~4.5 bpw**, targeting near-6-bit quality at close to 4-bit size.
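The allocation in the table above can be read as a rule that maps each weight's path to a bit width. The following is a minimal sketch of such a rule, not the conversion code used for this model. The function name `bits_for`, every path fragment it matches (`shared_experts`, `v_a`, `kv_a`, and so on), and the choice to let boundary layers raise other components to at least 6 bits are assumptions made for illustration.

```python
import re
from typing import Optional


def bits_for(path: str, num_layers: int) -> Optional[int]:
    """Return a bit width for a weight path, or None to leave it unquantized.

    All path fragments matched here are hypothetical stand-ins for the
    component names in the table above.
    """
    parts = path.split(".")

    # Router and vision encoder: left unquantized.
    if "gate" in parts or any(p.startswith("vision") for p in parts):
        return None
    # Output head and token embeddings: 8-bit.
    if "lm_head" in parts or "embed_tokens" in parts:
        return 8
    # MLA value projections: 8-bit.
    if any(p.startswith(("v_a", "v_b")) for p in parts):
        return 8

    if any(p.startswith(("q_a", "q_b", "kv_a", "kv_b", "o_proj")) for p in parts):
        bits = 6  # remaining MLA projections
    elif "shared_experts" in parts:
        bits = 6  # always-active shared expert
    else:
        bits = 4  # routed experts and anything not matched above

    # Assumption: the first/last 3 decoder layers raise the floor to 6 bits.
    match = re.search(r"layers\.(\d+)\.", path)
    if match:
        idx = int(match.group(1))
        if idx < 3 or idx >= num_layers - 3:
            bits = max(bits, 6)
    return bits
```

A function of this shape could be adapted to whatever per-layer hook the conversion tool exposes. The order of the checks matters: the router test compares whole path components, so an expert weight named `gate_proj` is not mistaken for the router.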
+## Model Details
+- **Base model**: Kimi-K2.6 (1T params, 32B active, 384 experts)
+- **Architecture**: MoE + MLA (kimi_k25)
+- **Context**: 256K tokens
+- **Modality**: Vision + Language (VLM)
+- **Converted with**: mlx-vlm 0.4.2
+## Usage
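A minimal sketch of image-plus-text generation with mlx-vlm, assuming the weights have finished uploading and that the installed mlx-vlm release supports this architecture. The helper name `describe_image` is hypothetical. The `custom_code` tag suggests loading may also require trusting remote code, which this sketch does not handle.

```python
MODEL_ID = "mlx-community/Kimi-K2.6-MoE-Smart-Quant"


def describe_image(image_path: str, prompt: str = "Describe this image.") -> str:
    """Load the model and generate a response for one image (hypothetical helper)."""
    # Imported lazily so the module can be imported on machines without mlx-vlm.
    from mlx_vlm import load, generate
    from mlx_vlm.prompt_utils import apply_chat_template
    from mlx_vlm.utils import load_config

    model, processor = load(MODEL_ID)
    config = load_config(MODEL_ID)

    # Wrap the raw prompt in the model's chat template, reserving one image slot.
    formatted = apply_chat_template(processor, config, prompt, num_images=1)
    result = generate(model, processor, formatted, [image_path], verbose=False)

    # Depending on the mlx-vlm version, generate may return a plain string
    # or a result object carrying the text in a `.text` attribute.
    return getattr(result, "text", result)
```

Calling `describe_image("photo.jpg")` downloads the weights on first use, so the hardware requirements below apply.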
+## Hardware Requirements
+- **Single node**: M3/M4 Ultra 192GB+ (fits in ~150GB)
+- **Distributed**: 2x M3 Ultra via JACCL/RDMA for headroom
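As a rough cross-check when sizing hardware, weight storage can be estimated as parameters × bits per weight / 8. The sketch below uses only figures quoted in this card (1T total parameters, ~4.5 bpw). The function name is hypothetical, and the estimate ignores the KV cache and activations.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Figures quoted in this card: 1T total parameters at ~4.5 bpw.
print(weight_footprint_gb(1e12, 4.5))  # 562.5
```

Under those figures the estimate is about 562 GB, well above the ~150GB quoted above, so it is worth confirming which parameter count and bit width apply before choosing a configuration.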
+---
+*Weights are still uploading; conversion is in progress.*