How to use mlx-community/Kimi-K2.6-MoE-Smart-Quant with MLX:

```python
# Make sure mlx-vlm is installed
# pip install --upgrade mlx-vlm

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model, processor = load("mlx-community/Kimi-K2.6-MoE-Smart-Quant")
config = load_config("mlx-community/Kimi-K2.6-MoE-Smart-Quant")

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

# Generate output
output = generate(model, processor, formatted_prompt, image)
print(output)
```
Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,59 @@
---
library_name: mlx
tags:
- mlx
- safetensors
- kimi_k25
- quantized
- moe-aware-quant
- image-text-to-text
- conversational
- custom_code
base_model: moonshotai/Kimi-K2.6
base_model_relation: quantized
language:
- en
pipeline_tag: image-text-to-text
---

# Kimi-K2.6-MoE-Smart-Quant (MLX)

MoE-aware mixed-precision quantization of [moonshotai/Kimi-K2.6](https://huggingface.co/moonshotai/Kimi-K2.6) for Apple Silicon.

## Quantization Strategy

Unlike uniform quantization, this applies **per-component bit allocation** chosen for the MoE + MLA architecture:

| Component | Bits | Rationale |
|-----------|------|-----------|
| Routed experts (384 SwitchLinear) | 4-bit | Only 8 of 384 fire per token, so they tolerate low-bit quantization well |
| Shared expert (always active) | 6-bit | On the path of every token, so it needs more precision |
| MLA value projections (v_a/v_b) | 8-bit | Most sensitive attention weights |
| MLA other projections (q_a/q_b/kv_a/kv_b/o) | 6-bit | Latent compression layer |
| lm_head + embed_tokens | 8-bit | Output quality |
| First/last 3 decoder layers | 6-bit | Boundary layer sensitivity |
| Gate/router | unquantized | Few parameters, but routing-critical |
| Vision encoder | unquantized | Preserved via mlx-vlm |

**Effective average: ~4.5 bits per weight (bpw)**, aiming for near-6-bit quality at near-4-bit size.

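The table above amounts to a mapping from weight path to bit width. The following is a minimal sketch of such a mapping, not the conversion script used for this repository. The module-name patterns (`switch_mlp`, `shared_experts`, `v_a_proj`, and so on), the rule precedence, the 4-bit fallback, the example layer count, and the `bits_for` helper itself are all assumptions made for illustration.

```python
import re
from typing import Optional

BOUNDARY_LAYERS = 3  # "First/last 3 decoder layers" from the table


def bits_for(path: str, num_layers: int) -> Optional[int]:
    """Return the bit width for a weight path, or None to leave it unquantized.

    All name patterns below are hypothetical; adjust them to the real checkpoint.
    """
    # Vision encoder and MoE router stay unquantized.
    if "vision" in path or path.endswith(".mlp.gate"):
        return None
    # Output head and token embeddings.
    if "lm_head" in path or "embed_tokens" in path:
        return 8
    # MLA value projections (assumed to be named v_a_proj / v_b_proj).
    if re.search(r"\.v_[ab]_proj", path):
        return 8
    # Remaining MLA projections.
    if re.search(r"\.(q_a|q_b|kv_a|kv_b|o)_proj", path):
        return 6
    # Shared expert, active for every token.
    if "shared_experts" in path:
        return 6
    # Assumed precedence: boundary layers lift whatever is left to 6-bit.
    match = re.search(r"layers\.(\d+)\.", path)
    if match:
        index = int(match.group(1))
        if index < BOUNDARY_LAYERS or index >= num_layers - BOUNDARY_LAYERS:
            return 6
    # Routed experts, plus an assumed 4-bit fallback for anything unlisted.
    return 4


if __name__ == "__main__":
    # Hypothetical paths and layer count, for illustration only.
    examples = [
        "language_model.model.layers.30.mlp.switch_mlp.up_proj",
        "language_model.model.layers.30.mlp.shared_experts.gate_proj",
        "language_model.model.layers.30.mlp.gate",
        "language_model.lm_head",
    ]
    for path in examples:
        print(path, "->", bits_for(path, num_layers=61))
```

A function of this shape could be adapted to a per-layer quantization hook if the conversion tool exposes one; whether the author used such a hook is not stated in this README.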
## Model Details

- **Base model**: Kimi-K2.6 (1T params, 32B active, 384 experts)
- **Architecture**: MoE + MLA (kimi_k25)
- **Context**: 256K tokens
- **Modality**: Vision + Language (VLM)
- **Converted with**: mlx-vlm 0.4.2

## Usage

## Hardware Requirements

- **Single node**: M3/M4 Ultra 192GB+ (fits in ~150GB)
- **Distributed**: 2x M3 Ultra via JACCL/RDMA for headroom

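As a sanity check on the figures in this README, weight storage can be estimated as parameters × bits per weight / 8. The sketch below is an approximation that ignores the KV cache, activations, and any per-group quantization metadata; `weight_footprint_gb` is a hypothetical helper, not part of mlx-vlm. It takes the parameter count ("1T params") and the ~4.5 bpw average from the sections above.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 1e12  # "1T params" from Model Details

if __name__ == "__main__":
    # Compare the stated ~4.5 bpw average with uniform 4-bit and 6-bit.
    for bpw in (4.0, 4.5, 6.0):
        size = weight_footprint_gb(TOTAL_PARAMS, bpw)
        print(f"{bpw} bpw -> {size:.1f} GB")
```

With these inputs the estimate for ~4.5 bpw is about 562 GB, well above the ~150GB quoted in the list above, so the hardware guidance is worth rechecking against the size of the uploaded weights.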
---

*Weights are still uploading; conversion is in progress.*