Instructions to use sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF",
	filename="Qwen3.6-35B-A3B-Q2_K_MIXED.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M

Use Docker

docker model run hf.co/sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M

LM Studio
Jan
Ollama
How to use sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF with Ollama:
```
ollama run hf.co/sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M
```

Unsloth Studio

How to use sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF to start chatting

How to use sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M

Run Hermes

hermes

Atomic Chat new
Docker Model Runner
How to use sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF with Docker Model Runner:
```
docker model run hf.co/sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M
```

Lemonade

How to use sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwen3.6-35B-A3B-AutoRound-GGUF-Q4_K_M

List all available models

lemonade list

Qwen3.6-35B-A3B GGUF (AutoRound Quantized, MTP Enabled)

This repository contains GGUF quantized versions of Qwen/Qwen3.6-35B-A3B created using Intel's AutoRound quantization method.

Qwen3.6-35B-A3B is a Mixture-of-Experts (MoE) model with 256 experts and approximately 3.6B active parameters.

🆕 MTP (Multi-Token Prediction) Support — All models now include the MTP / NextN head (blk.40.* tensors), enabling speculative decoding in compatible runtimes such as recent builds of llama.cpp. Each GGUF has been validated to contain the full set of MTP tensors.

Quantization Details

The models were quantized using various schemes provided by the auto-round tool with MTP layers explicitly enabled. For multimodal use, projector files (mmproj) are provided in F16, BF16, and F32 formats.

Files and Sizes

File Name	Quant Type	Size	Description
`Qwen3.6-35B-A3B-Q2_K_S.gguf`	Q2_K_S	12 GB	Extremely high compression, significant quality loss.
`Qwen3.6-35B-A3B-Q2_K_MIXED.gguf`	Q2_K_MIXED	13 GB	Recommended high-compression option. Fast inference.
`Qwen3.6-35B-A3B-Q3_K_S.gguf`	Q3_K_S	15 GB	Very high compression, notable quality loss.
`Qwen3.6-35B-A3B-Q3_K_M.gguf`	Q3_K_M	16 GB	Balanced 3-bit quantization.
`Qwen3.6-35B-A3B-Q3_K_L.gguf`	Q3_K_L	18 GB	High quality 3-bit quantization.
`Qwen3.6-35B-A3B-Q4_0.gguf`	Q4_0	19 GB	Standard 4-bit quantization, good balance.
`Qwen3.6-35B-A3B-Q4_1.gguf`	Q4_1	21 GB	Higher quality 4-bit quantization than Q4_0.
`Qwen3.6-35B-A3B-Q4_K_S.gguf`	Q4_K_S	19 GB	Small 4-bit K-quant, good efficiency.
`Qwen3.6-35B-A3B-Q4_K_M.gguf`	Q4_K_M	21 GB	Recommended 4-bit K-quant, excellent balance.
`Qwen3.6-35B-A3B-Q5_0.gguf`	Q5_0	23 GB	Standard 5-bit quantization, very high quality.
`Qwen3.6-35B-A3B-Q5_1.gguf`	Q5_1	25 GB	Higher quality 5-bit quantization than Q5_0.
`Qwen3.6-35B-A3B-Q5_K_S.gguf`	Q5_K_S	23 GB	Small 5-bit K-quant, very high quality.
`Qwen3.6-35B-A3B-Q5_K_M.gguf`	Q5_K_M	24 GB	Recommended 5-bit K-quant, near-lossless.
`Qwen3.6-35B-A3B-Q6_K.gguf`	Q6_K	28 GB	6-bit K-quant, virtually indistinguishable from F16.
`Qwen3.6-35B-A3B-Q8_0.gguf`	Q8_0	36 GB	8-bit quantization, near-lossless.
`mmproj-model-f16.gguf`	F16	—	Unified Projector in Float16 format.
`mmproj-model-bf16.gguf`	BF16	—	Unified Projector in BFloat16 format.
`mmproj-model-f32.gguf`	F32	—	Unified Projector in Float32 format.

Note: File sizes are slightly larger than non-MTP quants due to the additional MTP head weights.

Generate the Model

The models were generated using Intel's AutoRound with MTP layers explicitly enabled:

auto-round \
    --model Qwen/Qwen3.6-35B-A3B \
    --output_dir ./quantized/ \
    --scheme <SCHEME> \
    --iters 0 \
    --options '{"mtp_num_hidden_layers": 1, "num_nextn_predict_layers": 1}'

Usage with llama.cpp

These models can be used with a recent build of llama.cpp (must include Qwen3.5+ MTP support). For multimodal usage, specify the projector file:

./llama-cli -m Qwen3.6-35B-A3B-Q4_K_M.gguf --mmproj mmproj-model-f16.gguf --image your_image.jpg -p "Describe this image."

About AutoRound

AutoRound is an advanced quantization technique from Intel that aims to minimize accuracy loss through automated rounding optimization.

Support

These quantized models are made in my spare time using expensive hardware such as DGX Spark systems for quantization and validation. If you find these GGUFs useful for your projects, consider buying me a coffee to help cover hardware and compute costs. Every bit of support helps me keep producing high-quality quantized models for the community!

☕ Support me on Ko-fi

Downloads last month: 3,044

GGUF

Model size

36B params

Architecture

qwen35moe

Hardware compatibility

2-bit

3-bit

4-bit

5-bit

6-bit

8-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF

Base model

Qwen/Qwen3.6-35B-A3B

Quantized

(490)

this model

Collection including sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF

Qwen3.6-AutoRound-GGUF

Collection

Maintained by Sphaela. If these models help you, support continued open releases: https://ko-fi.com/sphaela • 2 items • Updated 10 days ago