Instructions to use deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF",
	filename="Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Local Apps

llama.cpp

How to use deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M
# Run inference directly in the terminal:
llama-cli -hf deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M
# Run inference directly in the terminal:
llama-cli -hf deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M
# Run inference directly in the terminal:
./llama-cli -hf deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M

Use Docker

docker model run hf.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M

vLLM

How to use deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF with Ollama:
```
ollama run hf.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M
```

Unsloth Studio

How to use deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF to start chatting

Pi

How to use deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M

Run Hermes

hermes

Docker Model Runner
How to use deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF with Docker Model Runner:
```
docker model run hf.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M
```

Lemonade

How to use deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF:Q3_K_M

Run and chat with the model

lemonade run user.Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF-Q3_K_M

List all available models

lemonade list

Cerebellum

Qwen 3.6 35B-A3B Heretic — Cerebellum GGUF

Sensitivity-guided mixed-precision quantization of llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF, which is itself a decensored variant of Qwen/Qwen3.6-35B-A3B produced by llmfan46 using Heretic v1.2.0.

All future Heretic versions of this build will live in this repository. Version identifiers appear only in filenames, not in the repo name.

Files

| File | Size | Description |
| --- | --- | --- |
| `Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf` | 11.96 GB (11,955,468,384 bytes) | Cerebellum v3 recipe (recommended) |
| `Qwen3.6-35B-A3B-uncensored-heretic-mmproj-BF16.gguf` | ~858 MB | Vision projector, passed through unmodified from llmfan46's repo |

The vision projector is required for multimodal (image/video) use. It is identical to the file distributed by llmfan46 and is included here for single-repo convenience only.
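A quick way to confirm that a download completed is to compare the file size on disk with the byte count in the table above. The sketch below is only a size check, not an integrity check (it does not hash the file), and the helper names `describe_size` and `size_matches` are illustrative rather than part of any tool in this repository.

```python
import os

# Byte count published in the Files table for the Q3_K_M GGUF.
EXPECTED_BYTES = 11_955_468_384


def describe_size(n_bytes: int) -> str:
    """Format a byte count in decimal GB and binary GiB."""
    gb = n_bytes / 1000**3
    gib = n_bytes / 1024**3
    return f"{gb:.2f} GB ({gib:.2f} GiB)"


def size_matches(path: str, expected: int = EXPECTED_BYTES) -> bool:
    """Return True when the file on disk has exactly the published size."""
    return os.path.getsize(path) == expected


print(describe_size(EXPECTED_BYTES))  # 11.96 GB (11.13 GiB)
```

Calling `size_matches("Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf")` after the download returns `False` for a truncated file.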

Provenance

Base architecture: Qwen/Qwen3.6-35B-A3B — Qwen Team (Apache-2.0)
Heretic variant: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF — llmfan46. The BF16 GGUF from that repository was used as the direct quantization source. llmfan46 applied Heretic v1.2.0 with the Magnitude-Preserving Orthogonal Ablation (MPOA) method, targeting attn.o_proj, attn.out_proj, and mlp.down_proj. Their reported result: 0.0015 KL divergence from base, 10/100 refusals vs 83/100 on the original model.
Quantization: Cerebellum v3 recipe transferred verbatim from the stock deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF build — same 360-entry tensor-type override file, same Unsloth coder imatrix.

Benchmarks

Benchmarks were run directly on these GGUF files using llama.cpp on an RTX 3090. All numbers are audited: every failed answer was manually verified as a genuine model error, and the audit reports are in benchmark_results/AUDIT_*.md. Full per-question detail (summary JSON, samples JSONL, EvalPlus eval JSON, adversarial audit reports) is in benchmark_results/ in this repository.

Heretic Cerebellum v1 (11.96 GB) vs baselines

| Benchmark | Heretic Cerebellum v1 (11.96 GB) | Stock Cerebellum v3 (11.1 GB) | Uniform Q3_K_M baseline (15.6 GB) | Notes |
| --- | --- | --- | --- | --- |
| Wiki PPL (ctx 2048, 32 chunks) | 7.157 ± 0.103 | 7.099 ± 0.102 | n/a | RTX 3090, identical invocation |
| ARC-Challenge | 95.48% (1172 q) | 95.82% | 96.10% | 25-shot |
| HellaSwag | 91.78% (10042 q) | 92.28% | 91.50% | 10-shot |
| MMLU-Redux | 75.42% (2400 q) | 75.00% | 74.12% | 5-shot |
| HumanEval base | 68.29% (164 problems) | 70.73% | n/a | pass@1, evalplus |
| HumanEval+ | 64.63% | 65.24% | 56.71% | pass@1, evalplus |
| Vision smoke | 100% (24/24) | 100% (36 images) | n/a | basic image description |
| RealWorldQA | 76.0% (n=50) | ~78% | n/a | single-question granularity ±2% |

Stock Cerebellum v3 is the same tensor allocation applied to the non-heretic base. Uniform Q3_K_M baseline is the stock (non-heretic) model at 15.6 GB — the standard comparison point for showing what mixed-precision buys at reduced size.
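The "single-question granularity" note on RealWorldQA can be made concrete: with n=50, one question moves the score by 2 points. The sketch below also computes a binomial standard error, which is a standard approximation that assumes independent questions and is not part of the audit in this repository. The sample sizes come from the table above.

```python
import math


def granularity_pct(n_questions: int) -> float:
    """Score change, in percentage points, caused by one question."""
    return 100.0 / n_questions


def binomial_se_pct(accuracy_pct: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


# (benchmark, accuracy %, number of questions) taken from the table above.
for name, acc, n in [
    ("RealWorldQA", 76.0, 50),
    ("HumanEval base", 68.29, 164),
    ("MMLU-Redux", 75.42, 2400),
]:
    print(f"{name}: step {granularity_pct(n):.2f} pt, "
          f"standard error {binomial_se_pct(acc, n):.2f} pt")
```

Under this approximation, the sampling uncertainty on the 50-question run is roughly three times the single-question step, so small gaps on that row are best read as indicative only.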

Head-to-head: same weights, uniform quant

llmfan46's own uniform Q3_K_M of the identical heretic weights (16.87 GB) was benchmarked on the identical harness, same night, same protocol.

| Metric | Heretic Cerebellum v1 (11.96 GB) | Uniform Q3_K_M (16.87 GB) |
| --- | --- | --- |
| Wiki PPL (ctx 2048, 32 chunks) | 7.157 ± 0.103 | 7.220 ± 0.106 |
| ARC-Challenge | 95.48% | 95.56% |
| HellaSwag | 91.78% | 91.92% |
| MMLU-Redux | 75.42% | 74.88% |
| HumanEval base | 68.29% | 65.24% |
| HumanEval+ | 64.63% | 57.93% |

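At 11.96 GB versus 16.87 GB, the Cerebellum allocation is 29% smaller. It scores equal or better on Wiki PPL (the error bars overlap), MMLU-Redux, HumanEval base, and HumanEval+, and trails by 0.08 points on ARC-Challenge and 0.14 points on HellaSwag. Per-question artifacts for both runs are in benchmark_results_uniform/.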
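The size and score comparison can be reproduced from the head-to-head table with a few lines of arithmetic. This is a minimal sketch using only the numbers printed above; the variable and function names are illustrative.

```python
def size_reduction_pct(small_gb: float, large_gb: float) -> float:
    """Percentage by which the smaller file undercuts the larger one."""
    return 100.0 * (1.0 - small_gb / large_gb)


# (Heretic Cerebellum v1, uniform Q3_K_M) accuracy in percent.
SCORES = {
    "ARC-Challenge": (95.48, 95.56),
    "HellaSwag": (91.78, 91.92),
    "MMLU-Redux": (75.42, 74.88),
    "HumanEval base": (68.29, 65.24),
    "HumanEval+": (64.63, 57.93),
}

# Positive delta means the Cerebellum build scored higher.
deltas = {name: cere - uni for name, (cere, uni) in SCORES.items()}

print(f"size reduction: {size_reduction_pct(11.96, 16.87):.1f}%")
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} pt")
```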

Heretic Abliteration Details (from llmfan46)

The following parameters are as reported in llmfan46's model card and are reproduced here for downstream reference.

| Parameter | Value |
| --- | --- |
| direction_index | 19.93 |
| attn.out_proj.max_weight | 1.49 |
| attn.out_proj.max_weight_position | 23.45 |
| attn.out_proj.min_weight | 1.08 |
| attn.out_proj.min_weight_distance | 16.54 |
| mlp.down_proj.max_weight | 1.46 |
| mlp.down_proj.max_weight_position | 28.05 |
| mlp.down_proj.min_weight | 1.27 |
| mlp.down_proj.min_weight_distance | 18.79 |
| attn.o_proj.max_weight | 1.47 |
| attn.o_proj.max_weight_position | 24.35 |
| attn.o_proj.min_weight | 0.07 |
| attn.o_proj.min_weight_distance | 22.58 |

Targeted components: attn.o_proj, attn.out_proj, mlp.down_proj.

Tool: Heretic v1.2.0. Method: Magnitude-Preserving Orthogonal Ablation (MPOA).

Cerebellum v3 Tensor Allocation

Same allocation as the stock build. Listed here for reference.

| Group | Precision | Rationale |
| --- | --- | --- |
| `attn_qkv` | Q3_K_M | Critical for vision and attention routing |
| `ssm_out` | Q3_K_M | Most sensitive tensor per ablation (+0.24 PPL) |
| `ffn_gate_exps` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `ffn_up_exps` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `ffn_down_exps` | Q2_K | Acceptable loss for size savings |
| `ffn_gate_shexp` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `ffn_up_shexp` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `ffn_down_shexp` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `attn_gate` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `ssm_alpha`, `ssm_beta` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |

Protected: all norms (F32), SSM state parameters (F32), router tensors (default).

7 of the 10 groups listed above perform at least as well at Q2_K as at Q3_K_M in reverse ablation, which suggests that imatrix-guided Q2_K acts as regularization on gate, mixing, and shared-expert weights for this architecture.
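To check which precision each tensor group actually received in a downloaded file, the per-tensor types can be tallied with the `gguf` Python package. The sketch below assumes GGUF tensor names of the form `blk.<layer>.<group>.weight`, which is the usual llama.cpp convention but is not confirmed here for this architecture. The helpers `group_of`, `tally`, and `read_pairs` are hypothetical names. Stored tensor types are per-tensor formats such as `Q2_K` or `Q3_K`, so they may not match the file-level label `Q3_K_M` literally.

```python
import sys
from collections import Counter, defaultdict


def group_of(tensor_name: str) -> str:
    """Map 'blk.12.ffn_gate_exps.weight' to 'ffn_gate_exps'.

    Tensors outside a block (for example 'token_embd.weight') are grouped
    under their first name component.
    """
    parts = tensor_name.split(".")
    if parts[0] == "blk" and len(parts) >= 3:
        return parts[2]
    return parts[0]


def tally(pairs):
    """Count quantization types per group from (tensor_name, type_name) pairs."""
    table = defaultdict(Counter)
    for name, qtype in pairs:
        table[group_of(name)][qtype] += 1
    return table


def read_pairs(path: str):
    """Read (tensor_name, type_name) pairs from a GGUF file."""
    from gguf import GGUFReader  # assumption: installed via `pip install gguf`

    reader = GGUFReader(path)
    return [(t.name, t.tensor_type.name) for t in reader.tensors]


if __name__ == "__main__" and len(sys.argv) > 1:
    for group, counts in sorted(tally(read_pairs(sys.argv[1])).items()):
        print(group, dict(counts))
```

Run it with the GGUF path as the only argument. Reading only tensor metadata this way should not require loading the weights onto a GPU.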

Perplexity Note

Wiki PPL for the Heretic build (7.157) is 0.058 higher than the stock Cerebellum v3 (7.099). The difference is within the measurement uncertainty (overlapping ±0.1 error bars) and reflects the small distributional shift introduced by abliteration rather than quantization quality. Both builds used the same wikitext-test.txt corpus, ctx 2048, 32 chunks, RTX 3090.
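The overlap claim can be checked directly from the two reported values. The sketch below also combines the two error terms in quadrature, which assumes the errors are independent; since both runs used the same corpus, that is only a rough guide and not part of the original measurement.

```python
import math


def interval(mean: float, err: float):
    """Return the (low, high) bounds of mean ± err."""
    return (mean - err, mean + err)


def overlaps(a, b) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


HERETIC = (7.157, 0.103)  # Heretic Cerebellum v1: PPL, reported error
STOCK = (7.099, 0.102)    # Stock Cerebellum v3: PPL, reported error

diff = HERETIC[0] - STOCK[0]
combined = math.hypot(HERETIC[1], STOCK[1])  # quadrature sum (assumption)

print(f"difference: {diff:.3f}")
print(f"intervals overlap: {overlaps(interval(*HERETIC), interval(*STOCK))}")
print(f"difference / combined error: {diff / combined:.2f}")
```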

Measured launch (RTX 3090, llama.cpp)

Measured 2026-06-13 on a single RTX 3090 (24 GB) with one llama-server instance and a q8_0 KV cache:

| metric | measured |
| --- | --- |
| decode speed | 149 tok/s |
| peak VRAM (4-slot serving) | 14.2 GB |
| max measured context (q8_0 KV) | 131,072 |

llama-server -m Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
  -ngl 99 --parallel 4 -c 24576 --jinja

These figures were measured on this rig only and make no claim about output quality.

Runtime — Casual Deployment

llama-server \
  --model Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
  --mmproj Qwen3.6-35B-A3B-uncensored-heretic-mmproj-BF16.gguf \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --jinja

--jinja is required for Qwen3.6. The enable_thinking chat-template flag only takes effect when the Jinja template path is active; without it, the model defaults to thinking mode on every request.

Non-thinking requests require an explicit flag at the API level:

{"chat_template_kwargs": {"enable_thinking": false}}

Qwen3.6 does not support the /think and /nothink soft-switch tokens used by Qwen3. Thinking mode is on by default.
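A minimal client sketch for the server started above, showing where `chat_template_kwargs` sits in the request body and how an image can be attached as a base64 data URL. It assumes the server listens on `http://localhost:8080/v1` (llama-server's usual default) and was launched with `--jinja` and, for images, `--mmproj`. The function names `build_request` and `send` are illustrative.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: default llama-server address


def build_request(prompt, thinking=True, image_bytes=None, mime="image/jpeg"):
    """Build an OpenAI-style chat completion body for one user turn."""
    content = [{"type": "text", "text": prompt}]
    if image_bytes is not None:
        # Inline the image as a data URL so no external host is needed.
        b64 = base64.b64encode(image_bytes).decode("ascii")
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}
        )
    body = {"messages": [{"role": "user", "content": content}]}
    if not thinking:
        # Thinking is on by default, so the flag is only sent to turn it off.
        body["chat_template_kwargs"] = {"enable_thinking": False}
    return body


def send(body, base_url=BASE_URL):
    """POST the body to /chat/completions and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


print(json.dumps(build_request("Summarize GGUF in one sentence.", thinking=False), indent=2))
```

With the server running, `send(build_request("Hello", thinking=False))` returns the completion as a dictionary.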

Recommended Sampling Parameters

From the official Qwen3.6-35B-A3B documentation.

| Mode | temperature | top_p | top_k | presence_penalty | repetition_penalty |
| --- | --- | --- | --- | --- | --- |
| Thinking, general | 1.0 | 0.95 | 20 | 1.5 | 1.0 |
| Thinking, precise coding (WebDev) | 0.6 | 0.95 | 20 | 0.0 | 1.0 |
| Non-thinking (instruct) | 0.7 | 0.80 | 20 | 1.5 | 1.0 |

presence_penalty can be adjusted between 0 and 2 to reduce repetition loops; higher values may occasionally cause language mixing.
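The table above can be kept as a small preset map and merged into a request body. This is a sketch with illustrative names (`PRESETS`, `with_sampling`). The keys mirror the table's column names, and the field name a given server expects for the repetition penalty may differ (llama-server, for instance, is understood to call it `repeat_penalty`), so the optional `rename` argument is there to adapt them.

```python
PRESETS = {
    "thinking_general": {
        "temperature": 1.0, "top_p": 0.95, "top_k": 20,
        "presence_penalty": 1.5, "repetition_penalty": 1.0,
    },
    "thinking_coding": {
        "temperature": 0.6, "top_p": 0.95, "top_k": 20,
        "presence_penalty": 0.0, "repetition_penalty": 1.0,
    },
    "non_thinking": {
        "temperature": 0.7, "top_p": 0.80, "top_k": 20,
        "presence_penalty": 1.5, "repetition_penalty": 1.0,
    },
}


def with_sampling(body, mode, presence_penalty=None, rename=None):
    """Return a copy of body with the preset for `mode` merged in."""
    params = dict(PRESETS[mode])
    if presence_penalty is not None:
        # The documented adjustment range is 0 to 2.
        if not 0 <= presence_penalty <= 2:
            raise ValueError("presence_penalty must be between 0 and 2")
        params["presence_penalty"] = presence_penalty
    for old, new in (rename or {}).items():
        if old in params:
            params[new] = params.pop(old)
    return {**body, **params}


request = with_sampling(
    {"messages": [{"role": "user", "content": "Hello"}]},
    "non_thinking",
    rename={"repetition_penalty": "repeat_penalty"},
)
print(request)
```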

Reproduction

This build follows the standard Cerebellum recipe. The tensor-type override file and ablation logs from the stock v3 build apply directly.

# 1. imatrix (constant ~300 MB RAM)
python -m osmosis.imatrix_stream \
    --model Qwen3.6-35B-A3B-uncensored-heretic-BF16.gguf \
    --output imatrix.dat

# 2. quantize with stock llama-quantize
llama-quantize \
    --imatrix imatrix.dat \
    --tensor-type-file cerebellum_v3_overrides.txt \
    Qwen3.6-35B-A3B-uncensored-heretic-BF16.gguf \
    Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
    Q3_K_M

The imatrix used for this build was generated from the Unsloth coder corpus (same corpus as the stock Cerebellum v3 build).

The 360-line tensor override file (cerebellum_v3_overrides.txt) is included in this repository alongside the ablation logs.
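The two steps above can be assembled programmatically, which makes it easy to check that every input file exists before starting a long run. This is a dry-run sketch: it only builds and prints the command lines, using the flags and file names exactly as shown above. The function names are illustrative.

```python
import os
import shlex

SRC = "Qwen3.6-35B-A3B-uncensored-heretic-BF16.gguf"
DST = "Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf"
IMATRIX = "imatrix.dat"
OVERRIDES = "cerebellum_v3_overrides.txt"


def imatrix_cmd(src=SRC, imatrix=IMATRIX):
    """Step 1: command line for the streaming imatrix pass."""
    return ["python", "-m", "osmosis.imatrix_stream",
            "--model", src, "--output", imatrix]


def quantize_cmd(src=SRC, dst=DST, imatrix=IMATRIX,
                 overrides=OVERRIDES, ftype="Q3_K_M"):
    """Step 2: command line for llama-quantize with the override file."""
    return ["llama-quantize", "--imatrix", imatrix,
            "--tensor-type-file", overrides, src, dst, ftype]


def missing_inputs(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


print(shlex.join(imatrix_cmd()))
print(shlex.join(quantize_cmd()))
absent = missing_inputs([SRC, OVERRIDES])
if absent:
    print("missing inputs:", ", ".join(absent))
```

Passing the returned lists to `subprocess.run(cmd, check=True)` would execute the steps once the inputs are in place.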

Benchmark Artifacts

Summary JSONs, per-question JSONL samples, EvalPlus eval JSON files, and adversarial audit reports (AUDIT_*.md) are in benchmark_results/ in this repository per project policy.

Credits

Base model: Qwen/Qwen3.6-35B-A3B — Qwen Team
Heretic variant and BF16 source: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF — llmfan46
Abliteration tool: Heretic v1.2.0 by p-e-w
GGUF runtime: llama.cpp
Quantization method and workflow: Cerebellum — deucebucket