---
base_model: Qwen/Qwen3.6-35B-A3B
library_name: mlx
tags:
  - turboquant
  - kv-cache-quantization
  - qwen
  - qwen-3.6
  - qwen3.6
  - moe
  - multimodal
  - quantized
  - mlx
  - 8bit
license: apache-2.0
pipeline_tag: image-text-to-text
---

# Qwen3.6 35B-A3B - TurboQuant MLX 8-bit

**8-bit weight-quantized MLX version** of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) with TurboQuant KV-cache quantization. Optimized for Apple Silicon inference via the [MLX](https://github.com/ml-explore/mlx) framework. Only 3B parameters are active per token despite 26B total, making this model significantly more efficient at inference time than its parameter count suggests.

Approximate model size: **~35 GB**

## Model Specifications

| Property | Value |
|---|---|
| **Base Model** | [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) |
| **Parameters** | 35 billion total (3 billion active per token) |
| **Architecture** | Mixture-of-Experts (MoE) (3B active per token) |
| **Modality** | Multimodal: image + video + text input, text output |
| **License** | Apache 2.0 |
| **Weight Quantization** | 8-bit (~35 GB) |
| **KV-Cache Quantization** | TurboQuant |
| **Framework** | MLX (Apple Silicon) |

## Quickstart

```python
import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-8bit")

prompt = "Describe this image in detail."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
```

For multimodal usage with images:

```python
from mlx_vlm import load, generate

model, processor = load("majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-8bit")

prompt = "What do you see in this image?"
output = generate(model, processor, prompt=prompt, image="path/to/image.jpg", max_tokens=512)
print(output)
```

## What is TurboQuant?

TurboQuant ([arXiv: 2504.19874](https://arxiv.org/abs/2504.19874)) is a KV-cache quantization technique that compresses the key-value cache used during autoregressive generation. Combined with 8-bit weight quantization in MLX, this provides a dual compression strategy: smaller model weights for reduced disk and memory footprint, plus compressed KV cache for efficient long-context generation.

## KV-Cache Quantization Comparison

| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| **TurboQuant** | 1x (baseline) | 1x (baseline) | High | [arXiv: 2504.19874](https://arxiv.org/abs/2504.19874) |
| **RotorQuant** | **5.3x faster** | **28% faster** | High | [GitHub](https://github.com/scrya-com/rotorquant) |

## Memory Estimates (Qwen3.6 35B-A3B)

| Precision | Approximate Size | MLX Variant |
|---|---|---|
| FP16 (original) | ~70 GB (approx.) | -- |
| **8-bit quantized** | **~35 GB** | **This model** |
| 4-bit quantized | ~18 GB | [TurboQuant-MLX-4bit](https://huggingface.co/majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-4bit) |
| 2-bit quantized | ~9 GB | [TurboQuant-MLX-2bit](https://huggingface.co/majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-2bit) |

## Hardware Requirements

This model requires approximately 35 GB of unified memory. Recommended hardware:
- Apple M2 Max (32 GB+)
- Apple M3 Max (48 GB+)
- Apple M4 Max (48 GB+)

## See Also

- [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) -- Base model
- [majentik/Qwen3.6-35B-A3B-TurboQuant](https://huggingface.co/majentik/Qwen3.6-35B-A3B-TurboQuant) -- TurboQuant KV-cache only (transformers)
- [majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-4bit](https://huggingface.co/majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-4bit) -- MLX 4-bit variant
- [majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-2bit](https://huggingface.co/majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-2bit) -- MLX 2-bit variant
- [majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-8bit](https://huggingface.co/majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-8bit) -- RotorQuant MLX 8-bit variant
- [TurboQuant Paper (arXiv: 2504.19874)](https://arxiv.org/abs/2504.19874)
- [MLX Framework](https://github.com/ml-explore/mlx)

## Quant trade-off (MLX lane)

| Bits | Approx size | Use case | Recommendation |
|---|---|---|---|
| 2-bit | ~9.1 GB | Aggressive quantization | Very low-RAM Macs |
| 3-bit | ~13 GB | Lossy but small | Low-RAM Macs |
| 4-bit | ~15 GB | Balanced default | Recommended for most Macs |
| 5-bit | ~18 GB | Higher fidelity | Quality-sensitive |
| 6-bit | ~21 GB | Approaching FP16 quality | High-fidelity |
| **8-bit** | ~27 GB | Near-lossless reference | **Fidelity-critical work** |

(Current variant — **8bit** — is bolded.)

## Variants in this family

(Showing 24 sibling variants under `majentik/qwen3.6-35b-a3b-*`. The current variant — `TurboQuant-MLX-8bit` — is **bolded**.)

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| [RotorQuant](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [RotorQuant-AWQ-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-awq-4bit) | transformers | ~22 GB | GPU 4-bit (AutoAWQ) |
| [RotorQuant-AWQ-8bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-awq-8bit) | transformers | ~38 GB | GPU 8-bit (AutoAWQ) |
| [RotorQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-IQ4_XS) | llama.cpp | ~30 GB | Lossy 4-bit, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q2_K](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q2_K) | llama.cpp | ~21 GB | Lossy, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q3_K_M](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q3_K_M) | llama.cpp | ~27 GB | Smaller 3-bit, CPU-friendly |
| [RotorQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q4_K_M) | llama.cpp | ~38 GB | Balanced default |
| [RotorQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q5_K_M) | llama.cpp | ~46 GB | Higher fidelity, more RAM |
| [RotorQuant-GGUF-Q8_0](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q8_0) | llama.cpp | ~74 GB | Near-lossless reference |
| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-2bit) | mlx-lm | ~11 GB | Apple Silicon, smallest |
| [RotorQuant-MLX-3bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-3bit) | mlx-lm | ~16 GB | Apple Silicon, small |
| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-4bit) | mlx-lm | ~22 GB | Apple Silicon balanced |
| [RotorQuant-MLX-5bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-5bit) | mlx-lm | ~27 GB | Apple Silicon, higher fidelity |
| [RotorQuant-MLX-6bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-6bit) | mlx-lm | ~32 GB | Apple Silicon, near-lossless |
| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-8bit) | mlx-lm | ~41 GB | Apple Silicon reference |
| [TurboQuant](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [TurboQuant-AWQ-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-awq-4bit) | transformers | ~22 GB | GPU 4-bit (AutoAWQ) |
| [TurboQuant-AWQ-8bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-awq-8bit) | transformers | ~38 GB | GPU 8-bit (AutoAWQ) |
| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-2bit) | mlx-lm | ~11 GB | Apple Silicon, smallest |
| [TurboQuant-MLX-3bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-3bit) | mlx-lm | ~16 GB | Apple Silicon, small |
| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-4bit) | mlx-lm | ~22 GB | Apple Silicon balanced |
| [TurboQuant-MLX-5bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-5bit) | mlx-lm | ~27 GB | Apple Silicon, higher fidelity |
| [TurboQuant-MLX-6bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-6bit) | mlx-lm | ~32 GB | Apple Silicon, near-lossless |
| **TurboQuant-MLX-8bit** | mlx-lm | ~41 GB | Apple Silicon reference |