How to use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

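llm = Llama.from_pretrained(
	repo_id="deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF",
	# filename selects which GGUF to download; this is the quantized model listed under Files.
	filename="Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf",
)
# Note: image inputs generally need a multimodal chat handler loaded with the
# mmproj file; check the llama-cpp-python docs for support of this architecture.
llm.create_chat_completion(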
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Qwen 3.6 35B-A3B Heretic — Cerebellum GGUF

Sensitivity-guided mixed-precision quantization of llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF, which is itself a decensored variant of Qwen/Qwen3.6-35B-A3B produced by llmfan46 using Heretic v1.2.0.

All future Heretic versions of this build will live in this repository. Version identifiers appear only in filenames, not in the repo name.
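
Because the version lives only in the filename, a script that tracks builds has to parse it from there. The sketch below assumes the `-v<N>-<QUANT>.gguf` suffix seen in the one quantized file currently listed; the pattern and the helper name `parse_build` are illustrative, not a documented naming contract.

```python
import re

# Assumed pattern, inferred from the single quantized filename in this repo.
FILENAME_RE = re.compile(r"-v(?P<version>\d+)-(?P<quant>[A-Za-z0-9_]+)\.gguf$")


def parse_build(filename: str) -> tuple:
    """Return (version, quant_label) parsed from a Cerebellum GGUF filename."""
    match = FILENAME_RE.search(filename)
    if match is None:
        raise ValueError(f"no version tag found in {filename!r}")
    return int(match["version"]), match["quant"]


print(parse_build("Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf"))
```

Files without a version tag, such as the vision projector, raise `ValueError` rather than being silently treated as version 0.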

Files

| File | Size | Description |
|---|---|---|
| Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf | 11.96 GB (11,955,468,384 bytes) | Cerebellum v3 recipe (recommended) |
| Qwen3.6-35B-A3B-uncensored-heretic-mmproj-BF16.gguf | ~858 MB | Vision projector, passed through unmodified from llmfan46's repo |

The vision projector is required for multimodal (image/video) use. It is identical to the file distributed by llmfan46 and is included here for single-repo convenience only.

Provenance

  1. Base architecture: Qwen/Qwen3.6-35B-A3B — Qwen Team (Apache-2.0)
  2. Heretic variant: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF — llmfan46. The BF16 GGUF from that repository was used as the direct quantization source. llmfan46 applied Heretic v1.2.0 with the Magnitude-Preserving Orthogonal Ablation (MPOA) method, targeting attn.o_proj, attn.out_proj, and mlp.down_proj. Their reported result: 0.0015 KL divergence from base, 10/100 refusals vs 83/100 on the original model.
  3. Quantization: Cerebellum v3 recipe transferred verbatim from the stock deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF build — same 360-entry tensor-type override file, same Unsloth coder imatrix.

Benchmarks

Benchmarks were run directly on these GGUF files using llama.cpp on an RTX 3090. All numbers are audited: every failed answer was manually verified as a genuine model error, and the audit reports are in benchmark_results/AUDIT_*.md. Full per-question detail (summary JSON, samples JSONL, EvalPlus eval JSON, adversarial audit reports) is in benchmark_results/ in this repository.

Heretic Cerebellum v1 (11.96 GB) vs baselines

| Benchmark | Heretic Cerebellum v1 (11.96 GB) | Stock Cerebellum v3 (11.1 GB) | Uniform Q3_K_M baseline (15.6 GB) | Notes |
|---|---|---|---|---|
| Wiki PPL (ctx 2048, 32 chunks) | 7.157 ± 0.103 | 7.099 ± 0.102 | | RTX 3090, identical invocation |
| ARC-Challenge | 95.48% (1172 q) | 95.82% | 96.10% | 25-shot |
| HellaSwag | 91.78% (10042 q) | 92.28% | 91.50% | 10-shot |
| MMLU-Redux | 75.42% (2400 q) | 75.00% | 74.12% | 5-shot |
| HumanEval base | 68.29% (164 problems) | 70.73% | | pass@1, evalplus |
| HumanEval+ | 64.63% | 65.24% | 56.71% | pass@1, evalplus |
| Vision smoke | 100% (24/24) | 100% (36 images) | | basic image description |
| RealWorldQA | 76.0% (n=50) | ~78% | | single-question granularity ±2% |

Stock Cerebellum v3 is the same tensor allocation applied to the non-heretic base. Uniform Q3_K_M baseline is the stock (non-heretic) model at 15.6 GB — the standard comparison point for showing what mixed-precision buys at reduced size.
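
To read the percentages with their sample sizes in mind, it can help to convert them back to question counts. The sketch below does that for the Heretic column; the counts it prints are inferred by rounding and are not taken from the benchmark artifacts, and both function names are illustrative.

```python
def correct_count(accuracy_pct: float, n_questions: int) -> int:
    """Number of correct answers implied by a reported accuracy (rounded)."""
    return round(accuracy_pct / 100 * n_questions)


def granularity_pct(n_questions: int) -> float:
    """Percentage points that a single question is worth."""
    return 100 / n_questions


# (reported accuracy %, question count) from the Heretic Cerebellum v1 column
ROWS = {
    "ARC-Challenge": (95.48, 1172),
    "HellaSwag": (91.78, 10042),
    "MMLU-Redux": (75.42, 2400),
    "HumanEval base": (68.29, 164),
    "RealWorldQA": (76.0, 50),
}

for name, (pct, n) in ROWS.items():
    k = correct_count(pct, n)
    print(f"{name}: ~{k}/{n} correct, one question = {granularity_pct(n):.2f} points")
```

With n=50, one question moves the RealWorldQA score by 2 points, which is where the ±2% granularity note comes from.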

Head-to-head: same weights, uniform quant

llmfan46's own uniform Q3_K_M of the identical heretic weights (16.87 GB) was benchmarked on the identical harness, same night, same protocol.

| Metric | Heretic Cerebellum v1 (11.96 GB) | Uniform Q3_K_M (16.87 GB) |
|---|---|---|
| Wiki PPL (ctx 2048, 32 chunks) | 7.157 ± 0.103 | 7.220 ± 0.106 |
| ARC-Challenge | 95.48% | 95.56% |
| HellaSwag | 91.78% | 91.92% |
| MMLU-Redux | 75.42% | 74.88% |
| HumanEval base | 68.29% | 65.24% |
| HumanEval+ | 64.63% | 57.93% |

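The Cerebellum allocation is 29% smaller (11.96 GB vs 16.87 GB) and scores equal or better on Wiki PPL, MMLU-Redux, HumanEval base and HumanEval+. It trails the uniform quant by 0.08 points on ARC-Challenge and 0.14 points on HellaSwag. Per-question artifacts for both runs are in benchmark_results_uniform/.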
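
The size and score comparison above can be re-derived from the head-to-head table. This is a point-estimate check only: it ignores the PPL error bars and benchmark sampling noise, and the helper names are illustrative.

```python
# (cerebellum, uniform, higher_is_better), values copied from the head-to-head table
METRICS = {
    "Wiki PPL": (7.157, 7.220, False),
    "ARC-Challenge": (95.48, 95.56, True),
    "HellaSwag": (91.78, 91.92, True),
    "MMLU-Redux": (75.42, 74.88, True),
    "HumanEval base": (68.29, 65.24, True),
    "HumanEval+": (64.63, 57.93, True),
}


def size_reduction_pct(small_gb: float, large_gb: float) -> float:
    """Relative size saving of the smaller file, in percent."""
    return (1 - small_gb / large_gb) * 100


def cerebellum_ahead(cerebellum: float, uniform: float, higher_is_better: bool) -> bool:
    """True when the Cerebellum build has the better point estimate."""
    return cerebellum > uniform if higher_is_better else cerebellum < uniform


ahead = [name for name, row in METRICS.items() if cerebellum_ahead(*row)]
print(f"size reduction: {size_reduction_pct(11.96, 16.87):.1f}%")
print("Cerebellum ahead on:", ", ".join(ahead))
```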

Heretic Abliteration Details (from llmfan46)

The following parameters are as reported in llmfan46's model card and are reproduced here for downstream reference.

| Parameter | Value |
|---|---|
| direction_index | 19.93 |
| attn.out_proj.max_weight | 1.49 |
| attn.out_proj.max_weight_position | 23.45 |
| attn.out_proj.min_weight | 1.08 |
| attn.out_proj.min_weight_distance | 16.54 |
| mlp.down_proj.max_weight | 1.46 |
| mlp.down_proj.max_weight_position | 28.05 |
| mlp.down_proj.min_weight | 1.27 |
| mlp.down_proj.min_weight_distance | 18.79 |
| attn.o_proj.max_weight | 1.47 |
| attn.o_proj.max_weight_position | 24.35 |
| attn.o_proj.min_weight | 0.07 |
| attn.o_proj.min_weight_distance | 22.58 |

Targeted components: attn.o_proj, attn.out_proj, mlp.down_proj.

Tool: Heretic v1.2.0, method: Magnitude-Preserving Orthogonal Ablation (MPOA) (reference).

Cerebellum v3 Tensor Allocation

Same allocation as the stock build. Listed here for reference.

| Group | Precision | Rationale |
|---|---|---|
| attn_qkv | Q3_K_M | Critical for vision and attention routing |
| ssm_out | Q3_K_M | Most sensitive tensor per ablation (+0.24 PPL) |
| ffn_gate_exps | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| ffn_up_exps | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| ffn_down_exps | Q2_K | Acceptable loss for size savings |
| ffn_gate_shexp | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| ffn_up_shexp | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| ffn_down_shexp | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| attn_gate | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| ssm_alpha, ssm_beta | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |

Protected: all norms (F32), SSM state parameters (F32), router tensors (default).

7 of the 10 groups listed above perform at least as well at Q2_K as at Q3_K_M in reverse ablation. Imatrix-guided Q2_K acts as regularization on gate, mixing, and shared-expert weights for this architecture.
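
As a rough illustration of how the group table maps onto individual tensors, the sketch below looks up the precision for a tensor name. It assumes GGUF-style names of the form `blk.<layer>.<group>.weight`; the example names and the `precision_for` helper are hypothetical, and the precision labels are copied from the table as written. The actual override file in this repository remains the authoritative list.

```python
from typing import Optional

# Group -> precision, copied from the allocation table.
PRECISION_BY_GROUP = {
    "attn_qkv": "Q3_K_M",
    "ssm_out": "Q3_K_M",
    "ffn_gate_exps": "Q2_K",
    "ffn_up_exps": "Q2_K",
    "ffn_down_exps": "Q2_K",
    "ffn_gate_shexp": "Q2_K",
    "ffn_up_shexp": "Q2_K",
    "ffn_down_shexp": "Q2_K",
    "attn_gate": "Q2_K",
    "ssm_alpha": "Q2_K",
    "ssm_beta": "Q2_K",
}


def precision_for(tensor_name: str) -> Optional[str]:
    """Return the table precision for a tensor, or None if no group matches."""
    # Assumed naming scheme: dot-separated parts, one of which is the group name.
    for part in tensor_name.split("."):
        if part in PRECISION_BY_GROUP:
            return PRECISION_BY_GROUP[part]
    return None  # e.g. norms and router tensors, which the table does not override


for name in ("blk.0.ffn_gate_exps.weight", "blk.3.ssm_out.weight", "blk.0.attn_norm.weight"):
    print(name, "->", precision_for(name))
```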

Perplexity Note

Wiki PPL for the Heretic build (7.157) is 0.058 higher than the stock Cerebellum v3 (7.099). The difference is within the measurement uncertainty (overlapping ±0.1 error bars) and reflects the small distributional shift introduced by abliteration rather than quantization quality. Both builds used the same wikitext-test.txt corpus, ctx 2048, 32 chunks, RTX 3090.
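
The overlap claim can be checked directly from the two reported values. The quadrature sum below treats the two error estimates as independent, which is an assumption (both runs share the same corpus), so read it as a rough yardstick rather than a formal significance test.

```python
import math

heretic_ppl, heretic_err = 7.157, 0.103
stock_ppl, stock_err = 7.099, 0.102

diff = heretic_ppl - stock_ppl

# The intervals overlap if the lower edge of the higher value
# does not exceed the upper edge of the lower value.
intervals_overlap = (heretic_ppl - heretic_err) <= (stock_ppl + stock_err)

# Combined uncertainty, assuming independent errors (an assumption).
combined_err = math.hypot(heretic_err, stock_err)

print(f"difference: {diff:.3f}")
print(f"intervals overlap: {intervals_overlap}")
print(f"combined uncertainty: {combined_err:.3f}")
```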

Measured launch (RTX 3090, llama.cpp)

Measured 2026-06-13 on a single RTX 3090 (24 GB), one llama-server, KV cache q8_0:

| Metric | Measured |
|---|---|
| decode speed | 149 tok/s |
| peak VRAM (4-slot serving) | 14.2 GB |
| max measured context (q8_0 KV) | 131,072 |

llama-server -m Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
  -ngl 99 --parallel 4 -c 24576 --jinja

This rig's measurements; no quality claims beyond them.

Runtime — Casual Deployment

llama-server \
  --model Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
  --mmproj Qwen3.6-35B-A3B-uncensored-heretic-mmproj-BF16.gguf \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --jinja

--jinja is required for Qwen3.6. The enable_thinking chat-template flag only takes effect when the Jinja template path is active; without it, the model defaults to thinking mode on every request.

Non-thinking requests require an explicit flag at the API level:

{"chat_template_kwargs": {"enable_thinking": false}}

Qwen3.6 does not support the /think and /nothink soft-switch tokens used by Qwen3. Thinking mode is on by default.
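
A minimal client sketch for the flag above, using only the standard library. It assumes llama-server is listening on `http://localhost:8080` and exposing its OpenAI-compatible `/v1/chat/completions` endpoint; the helper names `build_chat_request` and `send` are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt: str, thinking: bool = True) -> dict:
    """Build a chat-completion body; thinking stays on unless disabled explicitly."""
    body = {"messages": [{"role": "user", "content": prompt}]}
    if not thinking:
        # Same flag as shown above, passed per request.
        body["chat_template_kwargs"] = {"enable_thinking": False}
    return body


def send(body: dict, base_url: str = "http://localhost:8080") -> dict:
    """POST the body to a running llama-server (base_url is an assumption)."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


print(json.dumps(build_chat_request("Summarize GGUF in one sentence.", thinking=False), indent=2))
```

With a server running, `send(build_chat_request(..., thinking=False))["choices"][0]["message"]["content"]` should hold the reply text.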

Recommended Sampling Parameters

From the official Qwen3.6-35B-A3B documentation.

| Mode | temperature | top_p | top_k | min_p | presence_penalty | repetition_penalty |
|---|---|---|---|---|---|---|
| Thinking (general) | 1.0 | 0.95 | 20 | 0.0 | 1.5 | 1.0 |
| Thinking (precise coding, WebDev) | 0.6 | 0.95 | 20 | 0.0 | 0.0 | 1.0 |
| Non-thinking (instruct) | 0.7 | 0.80 | 20 | 0.0 | 1.5 | 1.0 |

presence_penalty can be adjusted between 0 and 2 to reduce repetition loops; higher values may occasionally cause language mixing.
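
One way to keep these presets in code is a small lookup with a bounded presence_penalty override. The preset keys and the `sampling_params` helper are illustrative, and the parameter names follow the table; a given server may spell them differently (llama.cpp, for example, uses `repeat_penalty` for the last one), so verify the mapping before merging them into a request.

```python
# Values copied from the table; preset names are hypothetical.
SAMPLING_PRESETS = {
    "thinking_general": {
        "temperature": 1.0, "top_p": 0.95, "top_k": 20,
        "min_p": 0.0, "presence_penalty": 1.5, "repetition_penalty": 1.0,
    },
    "thinking_coding": {
        "temperature": 0.6, "top_p": 0.95, "top_k": 20,
        "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0,
    },
    "non_thinking": {
        "temperature": 0.7, "top_p": 0.80, "top_k": 20,
        "min_p": 0.0, "presence_penalty": 1.5, "repetition_penalty": 1.0,
    },
}


def sampling_params(mode: str, presence_penalty: float = None) -> dict:
    """Return a copy of the preset, optionally overriding presence_penalty (0 to 2)."""
    params = dict(SAMPLING_PRESETS[mode])
    if presence_penalty is not None:
        if not 0 <= presence_penalty <= 2:
            raise ValueError("presence_penalty should stay between 0 and 2")
        params["presence_penalty"] = presence_penalty
    return params


print(sampling_params("non_thinking"))
print(sampling_params("thinking_general", presence_penalty=0.5))
```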

Reproduction

Standard Cerebellum recipe. The tensor-type override file and ablation logs from the stock v3 build apply directly.

# 1. imatrix (constant ~300 MB RAM)
python -m osmosis.imatrix_stream \
    --model Qwen3.6-35B-A3B-uncensored-heretic-BF16.gguf \
    --output imatrix.dat

# 2. quantize with stock llama-quantize
llama-quantize \
    --imatrix imatrix.dat \
    --tensor-type-file cerebellum_v3_overrides.txt \
    Qwen3.6-35B-A3B-uncensored-heretic-BF16.gguf \
    Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
    Q3_K_M

The imatrix used for this build was generated from the Unsloth coder corpus (same corpus as the stock Cerebellum v3 build).

The 360-line tensor override file (cerebellum_v3_overrides.txt) is included in this repository alongside the ablation logs.
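
Before quantizing, a quick sanity check that the override file still has the expected 360 entries can catch a truncated download. The sketch assumes one entry per line and treats blank lines and lines starting with `#` as non-entries; that file format is an assumption, so adjust the filter to the actual file.

```python
from pathlib import Path

EXPECTED_ENTRIES = 360  # entry count stated for cerebellum_v3_overrides.txt


def count_override_entries(path) -> int:
    """Count non-blank, non-comment lines (assumed to be one override each)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return sum(1 for line in lines if line.strip() and not line.lstrip().startswith("#"))


if __name__ == "__main__":
    override_file = Path("cerebellum_v3_overrides.txt")
    if override_file.exists():
        found = count_override_entries(override_file)
        status = "ok" if found == EXPECTED_ENTRIES else "unexpected count"
        print(f"{found} entries ({status})")
    else:
        print(f"{override_file} not found in the current directory")
```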

Benchmark Artifacts

Summary JSONs, per-question JSONL samples, EvalPlus eval JSON files, and adversarial audit reports (AUDIT_*.md) are in benchmark_results/ in this repository per project policy.

Credits
