---
base_model: Qwen/Qwen3.6-35B-A3B
library_name: mlx
tags:
- turboquant
- kv-cache-quantization
- qwen
- qwen-3.6
- qwen3.6
- moe
- multimodal
- quantized
- mlx
- 8bit
license: apache-2.0
pipeline_tag: image-text-to-text
---

# Qwen3.6 35B-A3B - TurboQuant MLX 8-bit

**8-bit weight-quantized MLX version** of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) with TurboQuant KV-cache quantization. Optimized for Apple Silicon inference via the [MLX](https://github.com/ml-explore/mlx) framework.

Only 3B parameters are active per token out of 35B total, so per-token compute is far lower than the total parameter count suggests.

Approximate model size: **~35 GB**

## Model Specifications

| Property | Value |
|---|---|
| **Base Model** | [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) |
| **Parameters** | 35 billion total (3 billion active per token) |
| **Architecture** | Mixture-of-Experts (MoE) |
| **Modality** | Multimodal: image + video + text input, text output |
| **License** | Apache 2.0 |
| **Weight Quantization** | 8-bit (~35 GB) |
| **KV-Cache Quantization** | TurboQuant |
| **Framework** | MLX (Apple Silicon) |

## Quickstart

```python
from mlx_lm import load, generate

# Text-only generation with mlx-lm
model, tokenizer = load("majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-8bit")

prompt = "Explain Mixture-of-Experts routing in two sentences."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
```

For multimodal usage with images:

```python
from mlx_vlm import load, generate

# Image + text generation with mlx-vlm
model, processor = load("majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-8bit")

prompt = "What do you see in this image?"
# Replace the placeholder path with a real image file
output = generate(model, processor, prompt=prompt, image="path/to/image.jpg", max_tokens=512)
print(output)
```

## What is TurboQuant?

TurboQuant ([arXiv: 2504.19874](https://arxiv.org/abs/2504.19874)) is a KV-cache quantization technique that compresses the key-value cache used during autoregressive generation. Combined with 8-bit weight quantization in MLX, this provides a dual compression strategy: smaller model weights for a reduced disk and memory footprint, plus a compressed KV cache for efficient long-context generation.

## KV-Cache Quantization Comparison

| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| **TurboQuant** | 1x (baseline) | 1x (baseline) | High | [arXiv: 2504.19874](https://arxiv.org/abs/2504.19874) |
| **RotorQuant** | **5.3x faster** | **28% faster** | High | [GitHub](https://github.com/scrya-com/rotorquant) |

## Memory Estimates (Qwen3.6 35B-A3B)

| Precision | Approximate Size | MLX Variant |
|---|---|---|
| FP16 (original) | ~70 GB | -- |
| **8-bit quantized** | **~35 GB** | **This model** |
| 4-bit quantized | ~18 GB | [TurboQuant-MLX-4bit](https://huggingface.co/majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-4bit) |
| 2-bit quantized | ~9 GB | [TurboQuant-MLX-2bit](https://huggingface.co/majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-2bit) |

## Hardware Requirements

This model requires approximately 35 GB of unified memory for the weights alone; the KV cache and the operating system need additional headroom.

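
As a rough sanity check on that figure, a weight footprint can be estimated as total parameters times bits per weight, divided by 8. The sketch below is illustrative only: `approx_weight_gb` is a hypothetical helper, it uses decimal gigabytes, and it ignores quantization metadata (per-group scales and biases), the KV cache, and activations, so real usage will be somewhat higher. It uses the 35B total parameter count on the assumption that all experts are kept resident in memory, even though only 3B parameters are active per token.

```python
def approx_weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Rough weight footprint in decimal GB: parameters x bits / 8 bytes."""
    return total_params * bits_per_weight / 8 / 1e9


# Total parameter count taken from the specifications table (35 billion).
TOTAL_PARAMS = 35e9

# Print an estimate for each precision listed in the memory table.
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit", 2)]:
    print(f"{label:>5}: ~{approx_weight_gb(TOTAL_PARAMS, bits):.2f} GB")
```

Under these assumptions the estimates come out to 70, 35, 17.5, and 8.75 GB, in line with the rounded values in the Memory Estimates table.
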
Recommended hardware:

- Apple M2 Max (64 GB+)
- Apple M3 Max (48 GB+)
- Apple M4 Max (48 GB+)

## See Also

- [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) -- Base model
- [majentik/Qwen3.6-35B-A3B-TurboQuant](https://huggingface.co/majentik/Qwen3.6-35B-A3B-TurboQuant) -- TurboQuant KV-cache only (transformers)
- [majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-4bit](https://huggingface.co/majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-4bit) -- MLX 4-bit variant
- [majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-2bit](https://huggingface.co/majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-2bit) -- MLX 2-bit variant
- [majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-8bit](https://huggingface.co/majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-8bit) -- RotorQuant MLX 8-bit variant
- [TurboQuant Paper (arXiv: 2504.19874)](https://arxiv.org/abs/2504.19874)
- [MLX Framework](https://github.com/ml-explore/mlx)

## Quant trade-off (MLX lane)

| Bits | Approx size | Use case | Recommendation |
|---|---|---|---|
| 2-bit | ~9.1 GB | Aggressive quantization | Very low-RAM Macs |
| 3-bit | ~13 GB | Lossy but small | Low-RAM Macs |
| 4-bit | ~15 GB | Balanced default | Recommended for most Macs |
| 5-bit | ~18 GB | Higher fidelity | Quality-sensitive |
| 6-bit | ~21 GB | Approaching FP16 quality | High-fidelity |
| **8-bit** | ~27 GB | Near-lossless reference | **Fidelity-critical work** |

(The current variant, **8-bit**, is bolded.)

## Variants in this family

(Showing 24 sibling variants under `majentik/qwen3.6-35b-a3b-*`. The current variant, `TurboQuant-MLX-8bit`, is **bolded**.)

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| [RotorQuant](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [RotorQuant-AWQ-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-awq-4bit) | transformers | ~22 GB | GPU 4-bit (AutoAWQ) |
| [RotorQuant-AWQ-8bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-awq-8bit) | transformers | ~38 GB | GPU 8-bit (AutoAWQ) |
| [RotorQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-IQ4_XS) | llama.cpp | ~30 GB | Lossy 4-bit, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q2_K](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q2_K) | llama.cpp | ~21 GB | Lossy, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q3_K_M](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q3_K_M) | llama.cpp | ~27 GB | Smaller 3-bit, CPU-friendly |
| [RotorQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q4_K_M) | llama.cpp | ~38 GB | Balanced default |
| [RotorQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q5_K_M) | llama.cpp | ~46 GB | Higher fidelity, more RAM |
| [RotorQuant-GGUF-Q8_0](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q8_0) | llama.cpp | ~74 GB | Near-lossless reference |
| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-2bit) | mlx-lm | ~11 GB | Apple Silicon, smallest |
| [RotorQuant-MLX-3bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-3bit) | mlx-lm | ~16 GB | Apple Silicon, small |
| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-4bit) | mlx-lm | ~22 GB | Apple Silicon balanced |
| [RotorQuant-MLX-5bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-5bit) | mlx-lm | ~27 GB | Apple Silicon, higher fidelity |
| [RotorQuant-MLX-6bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-6bit) | mlx-lm | ~32 GB | Apple Silicon, near-lossless |
| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-8bit) | mlx-lm | ~41 GB | Apple Silicon reference |
| [TurboQuant](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [TurboQuant-AWQ-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-awq-4bit) | transformers | ~22 GB | GPU 4-bit (AutoAWQ) |
| [TurboQuant-AWQ-8bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-awq-8bit) | transformers | ~38 GB | GPU 8-bit (AutoAWQ) |
| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-2bit) | mlx-lm | ~11 GB | Apple Silicon, smallest |
| [TurboQuant-MLX-3bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-3bit) | mlx-lm | ~16 GB | Apple Silicon, small |
| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-4bit) | mlx-lm | ~22 GB | Apple Silicon balanced |
| [TurboQuant-MLX-5bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-5bit) | mlx-lm | ~27 GB | Apple Silicon, higher fidelity |
| [TurboQuant-MLX-6bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-6bit) | mlx-lm | ~32 GB | Apple Silicon, near-lossless |
| **TurboQuant-MLX-8bit** | mlx-lm | ~41 GB | Apple Silicon reference |
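
To make the table easier to act on, the following sketch picks the largest TurboQuant MLX variant that fits a given amount of unified memory. It is an illustration, not part of any library: `pick_mlx_variant` and `MLX_VARIANT_GB` are hypothetical names, the sizes are the approximate values copied from the TurboQuant-MLX rows above, and the 8 GB default headroom for the OS and KV cache is an assumption that should be tuned per machine.

```python
from typing import Optional

# Approximate sizes in GB, copied from the TurboQuant-MLX rows of the table above.
MLX_VARIANT_GB = {
    "2bit": 11,
    "3bit": 16,
    "4bit": 22,
    "5bit": 27,
    "6bit": 32,
    "8bit": 41,
}


def pick_mlx_variant(unified_memory_gb: float, headroom_gb: float = 8.0) -> Optional[str]:
    """Return the largest variant whose weights fit in memory minus headroom.

    headroom_gb is an assumed reserve for the OS, KV cache, and activations.
    Returns None when no variant fits.
    """
    budget = unified_memory_gb - headroom_gb
    fitting = [(size, name) for name, size in MLX_VARIANT_GB.items() if size <= budget]
    if not fitting:
        return None
    # max() on (size, name) tuples selects the largest size that fits.
    return max(fitting)[1]


for ram in (16, 32, 48, 64):
    print(f"{ram} GB unified memory -> {pick_mlx_variant(ram)}")
```

With the assumed 8 GB headroom, a 64 GB machine selects the 8-bit variant and a 32 GB machine falls back to 4-bit.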