---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/LICENSE
library_name: gguf
base_model: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF
base_model_relation: quantized
model_name: Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF
model_creator: Qwen
model_type: qwen3
quantized_by: deucebucket
pipeline_tag: image-text-to-text
tags:
- GGUF
- qwen3
- qwen
- quantized
- cerebellum
- imatrix
- moe
- mixed-precision
- 3-bit
- heretic
- uncensored
- abliterated
model-index:
- name: Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF
results:
- task:
name: Text Generation
type: text-generation
dataset:
name: AI2 Reasoning Challenge
type: ai2_arc
config: ARC-Challenge
split: test
metrics:
- name: normalized accuracy
type: acc_norm
value: 0.9548
source:
name: Local audited benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results
- task:
name: Text Generation
type: text-generation
dataset:
name: HellaSwag
type: hellaswag
split: validation
metrics:
- name: normalized accuracy
type: acc_norm
value: 0.9178
source:
name: Local audited benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results
- task:
name: Text Generation
type: text-generation
dataset:
name: MMLU-Redux
type: cais/mmlu
config: all
split: test
metrics:
- name: accuracy
type: acc
value: 0.7542
source:
name: Local audited benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results
- task:
name: Text Generation
type: text-generation
dataset:
name: HumanEval+ (pass@1)
type: openai_humaneval
split: test
metrics:
- name: pass@1
type: pass@1
value: 0.6463
source:
name: Local audited benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results
- task:
name: Text Generation
type: text-generation
dataset:
name: WikiText-2 Perplexity
type: wikitext
config: wikitext-2-raw-v1
split: test
metrics:
- name: perplexity
type: perplexity
value: 7.157
source:
name: Local audited benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results
---
# Qwen 3.6 35B-A3B Heretic — Cerebellum GGUF
Sensitivity-guided mixed-precision quantization of
[llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF](https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF),
which is itself a decensored variant of
[Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
produced by llmfan46 using [Heretic](https://github.com/p-e-w/heretic) v1.2.0.
All future Heretic versions of this build will live in this repository.
Version identifiers appear only in filenames, not in the repo name.
## Files
| File | Size | Description |
|------|------|-------------|
| `Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf` | **11.96 GB** (11,955,468,384 bytes) | Cerebellum v3 recipe — recommended |
| `Qwen3.6-35B-A3B-uncensored-heretic-mmproj-BF16.gguf` | ~858 MB | Vision projector, passed through unmodified from llmfan46's repo |
The vision projector is required for multimodal (image/video) use.
It is identical to the file distributed by llmfan46 and is included here
for single-repo convenience only.
## Provenance
1. **Base architecture**: [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) — Qwen Team (Apache-2.0)
2. **Heretic variant**: [llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF](https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF) — llmfan46.
The BF16 GGUF from that repository was used as the direct quantization source.
llmfan46 applied Heretic v1.2.0 with the Magnitude-Preserving Orthogonal
Ablation (MPOA) method, targeting `attn.o_proj`, `attn.out_proj`, and
`mlp.down_proj`. Their reported result: **0.0015 KL divergence** from base,
**10/100 refusals** vs 83/100 on the original model.
3. **Quantization**: Cerebellum v3 recipe transferred verbatim from the stock
[deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF](https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF)
build — same 360-entry tensor-type override file, same Unsloth coder imatrix.
## Benchmarks
Benchmarks run on these GGUF files directly using llama.cpp on RTX 3090.
All numbers are audited; every failed answer was manually verified as a genuine
model error — audit reports are in `benchmark_results/AUDIT_*.md`.
Full per-question detail (summary JSON, samples JSONL, EvalPlus eval JSON,
adversarial audit reports) is in `benchmark_results/` in this repository.
### Heretic Cerebellum v1 (11.96 GB) vs baselines
| Benchmark | Heretic Cerebellum v1 (11.96 GB) | Stock Cerebellum v3 (11.1 GB) | Uniform Q3_K_M baseline (15.6 GB) | Notes |
|-----------|:---:|:---:|:---:|---|
| Wiki PPL (ctx 2048, 32 chunks) | 7.157 ± 0.103 | 7.099 ± 0.102 | — | RTX 3090, identical invocation |
| ARC-Challenge | **95.48%** (1172 q) | 95.82% | 96.10% | 25-shot |
| HellaSwag | **91.78%** (10042 q) | 92.28% | 91.50% | 10-shot |
| MMLU-Redux | **75.42%** (2400 q) | 75.00% | 74.12% | 5-shot |
| HumanEval base | **68.29%** (164 problems) | 70.73% | — | pass@1, evalplus |
| HumanEval+ | **64.63%** | 65.24% | 56.71% | pass@1, evalplus |
| Vision smoke | **100%** (24/24) | 100% (36 images) | — | basic image description |
| RealWorldQA | **76.0%** (n=50) | ~78% | — | single-question granularity ±2% |
Stock Cerebellum v3 is the same tensor allocation applied to the non-heretic base.
Uniform Q3_K_M baseline is the stock (non-heretic) model at 15.6 GB — the
standard comparison point for showing what mixed-precision buys at reduced size.
## Head-to-head: same weights, uniform quant
llmfan46's own uniform Q3_K_M of the identical heretic weights (16.87 GB) was
benchmarked on the identical harness, same night, same protocol.
| Metric | Heretic Cerebellum v1 (11.96 GB) | Uniform Q3_K_M (16.87 GB) |
|--------|:---:|:---:|
| Wiki PPL (ctx 2048, 32 chunks) | 7.157 ± 0.103 | 7.220 ± 0.106 |
| ARC-Challenge | 95.48% | 95.56% |
| HellaSwag | 91.78% | 91.92% |
| MMLU-Redux | 75.42% | 74.88% |
| HumanEval base | 68.29% | 65.24% |
| HumanEval+ | 64.63% | 57.93% |
The Cerebellum allocation is 29% smaller and scores equal-or-better on PPL,
MMLU and HumanEval+ (both runs' per-question artifacts in benchmark_results_uniform/).
## Heretic Abliteration Details (from llmfan46)
The following parameters are as reported in llmfan46's model card and are
reproduced here for downstream reference.
| Parameter | Value |
|-----------|-------|
| direction_index | 19.93 |
| attn.out_proj.max_weight | 1.49 |
| attn.out_proj.max_weight_position | 23.45 |
| attn.out_proj.min_weight | 1.08 |
| attn.out_proj.min_weight_distance | 16.54 |
| mlp.down_proj.max_weight | 1.46 |
| mlp.down_proj.max_weight_position | 28.05 |
| mlp.down_proj.min_weight | 1.27 |
| mlp.down_proj.min_weight_distance | 18.79 |
| attn.o_proj.max_weight | 1.47 |
| attn.o_proj.max_weight_position | 24.35 |
| attn.o_proj.min_weight | 0.07 |
| attn.o_proj.min_weight_distance | 22.58 |
Targeted components: `attn.o_proj`, `attn.out_proj`, `mlp.down_proj`.
Tool: [Heretic](https://github.com/p-e-w/heretic) v1.2.0,
method: Magnitude-Preserving Orthogonal Ablation (MPOA)
([reference](https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration)).
## Cerebellum v3 Tensor Allocation
Same allocation as the stock build. Listed here for reference.
| Group | Precision | Rationale |
|-------|-----------|-----------|
| `attn_qkv` | Q3_K_M | Critical for vision and attention routing |
| `ssm_out` | Q3_K_M | Most sensitive tensor per ablation (+0.24 PPL) |
| `ffn_gate_exps` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `ffn_up_exps` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `ffn_down_exps` | Q2_K | Acceptable loss for size savings |
| `ffn_gate_shexp` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `ffn_up_shexp` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `ffn_down_shexp` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `attn_gate` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `ssm_alpha`, `ssm_beta` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
Protected: all norms (F32), SSM state parameters (F32), router tensors (default).
6 of 10 groups perform at least as well at Q2_K as at Q3_K_M in reverse
ablation — imatrix-guided Q2_K acts as regularization on gate, mixing, and
shared-expert weights for this architecture.
## Perplexity Note
Wiki PPL for the Heretic build (7.157) is 0.058 higher than the stock
Cerebellum v3 (7.099). The difference is within the measurement uncertainty
(overlapping ±0.1 error bars) and reflects the small distributional shift
introduced by abliteration rather than quantization quality. Both builds
used the same wikitext-test.txt corpus, ctx 2048, 32 chunks, RTX 3090.
## Measured launch (RTX 3090, llama.cpp)
Measured 2026-06-13 on a single RTX 3090 (24 GB), one `llama-server`, KV cache `q8_0`:
| metric | measured |
|---|---|
| decode speed | 149 tok/s |
| peak VRAM (4-slot serving) | 14.2 GB |
| max measured context (q8_0 KV) | 131,072 |
```bash
llama-server -m Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
-ngl 99 --parallel 4 -c 24576 --jinja
```
_This rig's measurements; no quality claims beyond them._
## Runtime — Casual Deployment
```bash
llama-server \
--model Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
--mmproj Qwen3.6-35B-A3B-uncensored-heretic-mmproj-BF16.gguf \
--n-gpu-layers 99 \
--ctx-size 8192 \
--jinja
```
`--jinja` is required for Qwen3.6. The `enable_thinking` chat-template flag
only takes effect when the Jinja template path is active; without it, the
model defaults to thinking mode on every request.
Non-thinking requests require an explicit flag at the API level:
```json
{"chat_template_kwargs": {"enable_thinking": false}}
```
Qwen3.6 does not support the `/think` and `/nothink` soft-switch tokens
used by Qwen3.5. Thinking mode is on by default.
## Recommended Sampling Parameters
From the official Qwen3.6-35B-A3B documentation.
| Mode | temperature | top_p | top_k | min_p | presence_penalty | repetition_penalty |
|------|-------------|-------|-------|-------|------------------|--------------------|
| Thinking — general | 1.0 | 0.95 | 20 | 0.0 | 1.5 | 1.0 |
| Thinking — precise coding (WebDev) | 0.6 | 0.95 | 20 | 0.0 | 0.0 | 1.0 |
| Non-thinking (instruct) | 0.7 | 0.80 | 20 | 0.0 | 1.5 | 1.0 |
`presence_penalty` can be adjusted between 0 and 2 to reduce repetition loops;
higher values may occasionally cause language mixing.
## Reproduction
Standard Cerebellum recipe. The tensor-type override file and ablation logs
from the stock v3 build apply directly.
```bash
# 1. imatrix (constant ~300 MB RAM)
python -m osmosis.imatrix_stream \
--model Qwen3.6-35B-A3B-uncensored-heretic-BF16.gguf \
--output imatrix.dat
# 2. quantize with stock llama-quantize
llama-quantize \
--imatrix imatrix.dat \
--tensor-type-file cerebellum_v3_overrides.txt \
Qwen3.6-35B-A3B-uncensored-heretic-BF16.gguf \
Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
Q3_K_M
```
The imatrix used for this build was generated from the Unsloth coder corpus
(same corpus as the stock Cerebellum v3 build).
The 360-line tensor override file (`cerebellum_v3_overrides.txt`) is included
in this repository alongside the ablation logs.
## Benchmark Artifacts
Summary JSONs, per-question JSONL samples, EvalPlus eval JSON files, and
adversarial audit reports (`AUDIT_*.md`) are in `benchmark_results/` in this
repository per project policy.
## Credits
- Base model: [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) — Qwen Team
- Heretic variant and BF16 source: [llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF](https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF) — llmfan46
- Abliteration tool: [Heretic](https://github.com/p-e-w/heretic) v1.2.0 by p-e-w
- GGUF runtime: [llama.cpp](https://github.com/ggml-org/llama.cpp)
- Quantization method and workflow: [Cerebellum](https://github.com/deucebucket/cerebellum) — deucebucket