Qwen3.6-35B-A3B GGUF (AutoRound Quantized, MTP Enabled)

This repository contains GGUF quantized versions of Qwen/Qwen3.6-35B-A3B created using Intel's AutoRound quantization method.

Qwen3.6-35B-A3B is a Mixture-of-Experts (MoE) model with 256 experts and approximately 3B active parameters per token, as the "A3B" suffix in its name indicates.

🆕 MTP (Multi-Token Prediction) support: all models now include the MTP / NextN head (blk.40.* tensors), which enables speculative decoding in compatible runtimes such as recent builds of llama.cpp. Each GGUF has been validated to contain the full set of MTP tensors.
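As a rough way to check for the blk.40.* tensors mentioned above, the sketch below counts distinct tensor names under a given block index in a tensor listing read from stdin. It is not the validation procedure used for this repository. The helper name `count_block_tensors` is hypothetical, the sample tensor names are made up, and the commented-out line assumes the `gguf-dump` script from the `gguf` Python package is installed.

```shell
# Count distinct tensor names under blk.<N>. in a tensor listing read from stdin.
# Any text that contains the tensor names works as input.
count_block_tensors() {
    block=$1
    grep -o "blk\.${block}\.[A-Za-z0-9_.]*" | sort -u | wc -l | tr -d ' '
}

# Made-up listing for illustration; these are NOT the model's real tensor names.
printf '%s\n' \
    'blk.39.example_a.weight' \
    'blk.40.example_a.weight' \
    'blk.40.example_b.weight' |
    count_block_tensors 40

# Against a real file (assumes gguf-dump is on PATH):
#   gguf-dump Qwen3.6-35B-A3B-Q4_K_M.gguf | count_block_tensors 40
```

A non-zero count only shows that some blk.40.* tensors are present; it does not confirm that the set is complete.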

Quantization Details

The models were quantized with the auto-round tool, using the scheme that matches each quant type listed below, with MTP layers explicitly enabled. For multimodal use, projector files (mmproj) are provided in F16, BF16, and F32 formats.

Files and Sizes

| File Name | Quant Type | Size | Description |
|---|---|---|---|
| Qwen3.6-35B-A3B-Q2_K_S.gguf | Q2_K_S | 12 GB | Extremely high compression, significant quality loss. |
| Qwen3.6-35B-A3B-Q2_K_MIXED.gguf | Q2_K_MIXED | 13 GB | Recommended high-compression option. Fast inference. |
| Qwen3.6-35B-A3B-Q3_K_S.gguf | Q3_K_S | 15 GB | Very high compression, notable quality loss. |
| Qwen3.6-35B-A3B-Q3_K_M.gguf | Q3_K_M | 16 GB | Balanced 3-bit quantization. |
| Qwen3.6-35B-A3B-Q3_K_L.gguf | Q3_K_L | 18 GB | High quality 3-bit quantization. |
| Qwen3.6-35B-A3B-Q4_0.gguf | Q4_0 | 19 GB | Standard 4-bit quantization, good balance. |
| Qwen3.6-35B-A3B-Q4_1.gguf | Q4_1 | 21 GB | Higher quality 4-bit quantization than Q4_0. |
| Qwen3.6-35B-A3B-Q4_K_S.gguf | Q4_K_S | 19 GB | Small 4-bit K-quant, good efficiency. |
| Qwen3.6-35B-A3B-Q4_K_M.gguf | Q4_K_M | 21 GB | Recommended 4-bit K-quant, excellent balance. |
| Qwen3.6-35B-A3B-Q5_0.gguf | Q5_0 | 23 GB | Standard 5-bit quantization, very high quality. |
| Qwen3.6-35B-A3B-Q5_1.gguf | Q5_1 | 25 GB | Higher quality 5-bit quantization than Q5_0. |
| Qwen3.6-35B-A3B-Q5_K_S.gguf | Q5_K_S | 23 GB | Small 5-bit K-quant, very high quality. |
| Qwen3.6-35B-A3B-Q5_K_M.gguf | Q5_K_M | 24 GB | Recommended 5-bit K-quant, near-lossless. |
| Qwen3.6-35B-A3B-Q6_K.gguf | Q6_K | 28 GB | 6-bit K-quant, virtually indistinguishable from F16. |
| Qwen3.6-35B-A3B-Q8_0.gguf | Q8_0 | 36 GB | 8-bit quantization, near-lossless. |
| mmproj-model-f16.gguf | F16 | | Unified Projector in Float16 format. |
| mmproj-model-bf16.gguf | BF16 | | Unified Projector in BFloat16 format. |
| mmproj-model-f32.gguf | F32 | | Unified Projector in Float32 format. |

Note: File sizes are slightly larger than non-MTP quants due to the additional MTP head weights.
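To help choose a file from the list above, here is a minimal sketch that picks the largest quant whose listed size fits a memory budget. The sizes are copied from this README. The function names `quant_sizes` and `pick_quant` are hypothetical, and the default 2 GB of headroom for context and runtime buffers is an assumption, not a measured figure.

```shell
# Quant types and file sizes in GB, as listed in this README.
quant_sizes() {
cat <<'EOF'
Q2_K_S 12
Q2_K_MIXED 13
Q3_K_S 15
Q3_K_M 16
Q3_K_L 18
Q4_0 19
Q4_1 21
Q4_K_S 19
Q4_K_M 21
Q5_0 23
Q5_1 25
Q5_K_S 23
Q5_K_M 24
Q6_K 28
Q8_0 36
EOF
}

# Print the largest quant whose size plus headroom fits in the budget.
#   $1: memory budget in GB
#   $2: headroom in GB (assumed default: 2)
# On equal sizes the later list entry wins. Returns non-zero if nothing fits.
pick_quant() {
    budget_gb=$1
    headroom_gb=${2:-2}
    quant_sizes | awk -v b="$budget_gb" -v h="$headroom_gb" '
        BEGIN { best = 0 }
        $2 + h <= b && $2 >= best { best = $2; name = $1 }
        END { if (name != "") print name; else exit 1 }'
}

pick_quant 24   # prints Q4_K_M
```

Real memory use also depends on context length, the projector file, and the runtime, so treat the result as a starting point.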

Generate the Model

The models were generated using Intel's AutoRound with MTP layers explicitly enabled:

auto-round \
    --model Qwen/Qwen3.6-35B-A3B \
    --output_dir ./quantized/ \
    --scheme <SCHEME> \
    --iters 0 \
    --options '{"mtp_num_hidden_layers": 1, "num_nextn_predict_layers": 1}'
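To produce several quants in one go, the command above can be wrapped in a loop. The sketch below is one way to do that and is not necessarily how these files were built. `run_scheme` and the `DRY_RUN` switch are hypothetical names introduced here, and `SCHEME_A` / `SCHEME_B` are placeholders: the accepted `--scheme` values are not listed in this README, so check the auto-round documentation before running it for real.

```shell
# Run the auto-round command from this README for one scheme.
# By default it only prints the command; DRY_RUN=0 executes it.
# DRY_RUN is a hypothetical switch introduced for this sketch.
run_scheme() {
    scheme=$1
    if [ "${DRY_RUN:-1}" = 1 ]; then runner=echo; else runner=; fi
    $runner auto-round \
        --model Qwen/Qwen3.6-35B-A3B \
        --output_dir ./quantized/ \
        --scheme "$scheme" \
        --iters 0 \
        --options '{"mtp_num_hidden_layers": 1, "num_nextn_predict_layers": 1}'
}

# SCHEME_A and SCHEME_B are placeholders, not real scheme names.
for scheme in SCHEME_A SCHEME_B; do
    run_scheme "$scheme" || break
done
```

The dry-run output drops the shell quoting around the JSON options, so it is meant for inspection rather than copy-paste. Whether successive runs can safely share one output directory is not stated here; a separate directory per scheme may be safer.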

Usage with llama.cpp

These models require a recent build of llama.cpp that includes Qwen3.5+ MTP support. For multimodal usage, specify the projector file:

./llama-cli -m Qwen3.6-35B-A3B-Q4_K_M.gguf --mmproj mmproj-model-f16.gguf --image your_image.jpg -p "Describe this image."

About AutoRound

AutoRound is a post-training quantization method from Intel that uses signed gradient descent to tune weight rounding and clipping ranges, with the aim of minimizing accuracy loss.


Support

These quantized models are made in my spare time using expensive hardware such as DGX Spark systems for quantization and validation. If you find these GGUFs useful for your projects, consider buying me a coffee to help cover hardware and compute costs. Every bit of support helps me keep producing high-quality quantized models for the community!

☕ Support me on Ko-fi

GGUF metadata: model size 36B params, architecture qwen35moe.