---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/LICENSE
library_name: gguf
base_model: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF
base_model_relation: quantized
model_name: Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF
model_creator: Qwen
model_type: qwen3
quantized_by: deucebucket
pipeline_tag: image-text-to-text
tags:
  - GGUF
  - qwen3
  - qwen
  - quantized
  - cerebellum
  - imatrix
  - moe
  - mixed-precision
  - 3-bit
  - heretic
  - uncensored
  - abliterated
model-index:
- name: Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF
  results:
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: AI2 Reasoning Challenge
      type: ai2_arc
      config: ARC-Challenge
      split: test
    metrics:
    - name: normalized accuracy
      type: acc_norm
      value: 0.9548
    source:
      name: Local audited benchmark run (RTX 3090, llama.cpp)
      url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: HellaSwag
      type: hellaswag
      split: validation
    metrics:
    - name: normalized accuracy
      type: acc_norm
      value: 0.9178
    source:
      name: Local audited benchmark run (RTX 3090, llama.cpp)
      url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: MMLU-Redux
      type: cais/mmlu
      config: all
      split: test
    metrics:
    - name: accuracy
      type: acc
      value: 0.7542
    source:
      name: Local audited benchmark run (RTX 3090, llama.cpp)
      url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: HumanEval+ (pass@1)
      type: openai_humaneval
      split: test
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.6463
    source:
      name: Local audited benchmark run (RTX 3090, llama.cpp)
      url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: WikiText-2 Perplexity
      type: wikitext
      config: wikitext-2-raw-v1
      split: test
    metrics:
    - name: perplexity
      type: perplexity
      value: 7.157
    source:
      name: Local audited benchmark run (RTX 3090, llama.cpp)
      url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results
---

<p align="center">
  <img src="cerebellum_banner.png" alt="Cerebellum" width="640">
</p>

# Qwen 3.6 35B-A3B Heretic — Cerebellum GGUF

Sensitivity-guided mixed-precision quantization of
[llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF](https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF),
which is itself a decensored variant of
[Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
produced by llmfan46 using [Heretic](https://github.com/p-e-w/heretic) v1.2.0.

All future Heretic versions of this build will live in this repository.
Version identifiers appear only in filenames, not in the repo name.

## Files

| File | Size | Description |
|------|------|-------------|
| `Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf` | **11.96 GB** (11,955,468,384 bytes) | Cerebellum v3 recipe — recommended |
| `Qwen3.6-35B-A3B-uncensored-heretic-mmproj-BF16.gguf` | ~858 MB | Vision projector, passed through unmodified from llmfan46's repo |

The vision projector is required for multimodal (image/video) use.
It is identical to the file distributed by llmfan46 and is included here
for single-repo convenience only.

## Provenance

1. **Base architecture**: [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) — Qwen Team (Apache-2.0)
2. **Heretic variant**: [llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF](https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF) — llmfan46.
   The BF16 GGUF from that repository was used as the direct quantization source.
   llmfan46 applied Heretic v1.2.0 with the Magnitude-Preserving Orthogonal
   Ablation (MPOA) method, targeting `attn.o_proj`, `attn.out_proj`, and
   `mlp.down_proj`. Their reported result: **0.0015 KL divergence** from base,
   **10/100 refusals** vs 83/100 on the original model.
3. **Quantization**: Cerebellum v3 recipe transferred verbatim from the stock
   [deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF](https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF)
   build — same 360-entry tensor-type override file, same Unsloth coder imatrix.

## Benchmarks

Benchmarks run on these GGUF files directly using llama.cpp on RTX 3090.
All numbers are audited; every failed answer was manually verified as a genuine
model error — audit reports are in `benchmark_results/AUDIT_*.md`.
Full per-question detail (summary JSON, samples JSONL, EvalPlus eval JSON,
adversarial audit reports) is in `benchmark_results/` in this repository.

### Heretic Cerebellum v1 (11.96 GB) vs baselines

| Benchmark | Heretic Cerebellum v1 (11.96 GB) | Stock Cerebellum v3 (11.1 GB) | Uniform Q3_K_M baseline (15.6 GB) | Notes |
|-----------|:---:|:---:|:---:|---|
| Wiki PPL (ctx 2048, 32 chunks) | 7.157 ± 0.103 | 7.099 ± 0.102 | — | RTX 3090, identical invocation |
| ARC-Challenge | **95.48%** (1172 q) | 95.82% | 96.10% | 25-shot |
| HellaSwag | **91.78%** (10042 q) | 92.28% | 91.50% | 10-shot |
| MMLU-Redux | **75.42%** (2400 q) | 75.00% | 74.12% | 5-shot |
| HumanEval base | **68.29%** (164 problems) | 70.73% | — | pass@1, evalplus |
| HumanEval+ | **64.63%** | 65.24% | 56.71% | pass@1, evalplus |
| Vision smoke | **100%** (24/24) | 100% (36 images) | — | basic image description |
| RealWorldQA | **76.0%** (n=50) | ~78% | — | single-question granularity ±2% |

Stock Cerebellum v3 is the same tensor allocation applied to the non-heretic base.
Uniform Q3_K_M baseline is the stock (non-heretic) model at 15.6 GB — the
standard comparison point for showing what mixed-precision buys at reduced size.

## Head-to-head: same weights, uniform quant

llmfan46's own uniform Q3_K_M of the identical heretic weights (16.87 GB) was
benchmarked on the identical harness, same night, same protocol.

| Metric | Heretic Cerebellum v1 (11.96 GB) | Uniform Q3_K_M (16.87 GB) |
|--------|:---:|:---:|
| Wiki PPL (ctx 2048, 32 chunks) | 7.157 ± 0.103 | 7.220 ± 0.106 |
| ARC-Challenge | 95.48% | 95.56% |
| HellaSwag | 91.78% | 91.92% |
| MMLU-Redux | 75.42% | 74.88% |
| HumanEval base | 68.29% | 65.24% |
| HumanEval+ | 64.63% | 57.93% |

The Cerebellum allocation is 29% smaller and scores equal-or-better on PPL,
MMLU and HumanEval+ (both runs' per-question artifacts in benchmark_results_uniform/).

## Heretic Abliteration Details (from llmfan46)

The following parameters are as reported in llmfan46's model card and are
reproduced here for downstream reference.

| Parameter | Value |
|-----------|-------|
| direction_index | 19.93 |
| attn.out_proj.max_weight | 1.49 |
| attn.out_proj.max_weight_position | 23.45 |
| attn.out_proj.min_weight | 1.08 |
| attn.out_proj.min_weight_distance | 16.54 |
| mlp.down_proj.max_weight | 1.46 |
| mlp.down_proj.max_weight_position | 28.05 |
| mlp.down_proj.min_weight | 1.27 |
| mlp.down_proj.min_weight_distance | 18.79 |
| attn.o_proj.max_weight | 1.47 |
| attn.o_proj.max_weight_position | 24.35 |
| attn.o_proj.min_weight | 0.07 |
| attn.o_proj.min_weight_distance | 22.58 |

Targeted components: `attn.o_proj`, `attn.out_proj`, `mlp.down_proj`.

Tool: [Heretic](https://github.com/p-e-w/heretic) v1.2.0,
method: Magnitude-Preserving Orthogonal Ablation (MPOA)
([reference](https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration)).

## Cerebellum v3 Tensor Allocation

Same allocation as the stock build. Listed here for reference.

| Group | Precision | Rationale |
|-------|-----------|-----------|
| `attn_qkv` | Q3_K_M | Critical for vision and attention routing |
| `ssm_out` | Q3_K_M | Most sensitive tensor per ablation (+0.24 PPL) |
| `ffn_gate_exps` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `ffn_up_exps` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `ffn_down_exps` | Q2_K | Acceptable loss for size savings |
| `ffn_gate_shexp` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `ffn_up_shexp` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `ffn_down_shexp` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `attn_gate` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| `ssm_alpha`, `ssm_beta` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |

Protected: all norms (F32), SSM state parameters (F32), router tensors (default).

6 of 10 groups perform at least as well at Q2_K as at Q3_K_M in reverse
ablation — imatrix-guided Q2_K acts as regularization on gate, mixing, and
shared-expert weights for this architecture.

## Perplexity Note

Wiki PPL for the Heretic build (7.157) is 0.058 higher than the stock
Cerebellum v3 (7.099). The difference is within the measurement uncertainty
(overlapping ±0.1 error bars) and reflects the small distributional shift
introduced by abliteration rather than quantization quality. Both builds
used the same wikitext-test.txt corpus, ctx 2048, 32 chunks, RTX 3090.

## Measured launch (RTX 3090, llama.cpp)

Measured 2026-06-13 on a single RTX 3090 (24 GB), one `llama-server`, KV cache `q8_0`:

| metric | measured |
|---|---|
| decode speed | 149 tok/s |
| peak VRAM (4-slot serving) | 14.2 GB |
| max measured context (q8_0 KV) | 131,072 |

```bash
llama-server -m Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
  -ngl 99 --parallel 4 -c 24576 --jinja
```

_This rig's measurements; no quality claims beyond them._

## Runtime — Casual Deployment

```bash
llama-server \
  --model Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
  --mmproj Qwen3.6-35B-A3B-uncensored-heretic-mmproj-BF16.gguf \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --jinja
```

`--jinja` is required for Qwen3.6. The `enable_thinking` chat-template flag
only takes effect when the Jinja template path is active; without it, the
model defaults to thinking mode on every request.

Non-thinking requests require an explicit flag at the API level:
```json
{"chat_template_kwargs": {"enable_thinking": false}}
```

Qwen3.6 does not support the `/think` and `/nothink` soft-switch tokens
used by Qwen3.5. Thinking mode is on by default.

## Recommended Sampling Parameters

From the official Qwen3.6-35B-A3B documentation.

| Mode | temperature | top_p | top_k | min_p | presence_penalty | repetition_penalty |
|------|-------------|-------|-------|-------|------------------|--------------------|
| Thinking — general | 1.0 | 0.95 | 20 | 0.0 | 1.5 | 1.0 |
| Thinking — precise coding (WebDev) | 0.6 | 0.95 | 20 | 0.0 | 0.0 | 1.0 |
| Non-thinking (instruct) | 0.7 | 0.80 | 20 | 0.0 | 1.5 | 1.0 |

`presence_penalty` can be adjusted between 0 and 2 to reduce repetition loops;
higher values may occasionally cause language mixing.

## Reproduction

Standard Cerebellum recipe. The tensor-type override file and ablation logs
from the stock v3 build apply directly.

```bash
# 1. imatrix (constant ~300 MB RAM)
python -m osmosis.imatrix_stream \
    --model Qwen3.6-35B-A3B-uncensored-heretic-BF16.gguf \
    --output imatrix.dat

# 2. quantize with stock llama-quantize
llama-quantize \
    --imatrix imatrix.dat \
    --tensor-type-file cerebellum_v3_overrides.txt \
    Qwen3.6-35B-A3B-uncensored-heretic-BF16.gguf \
    Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
    Q3_K_M
```

The imatrix used for this build was generated from the Unsloth coder corpus
(same corpus as the stock Cerebellum v3 build).

The 360-line tensor override file (`cerebellum_v3_overrides.txt`) is included
in this repository alongside the ablation logs.

## Benchmark Artifacts

Summary JSONs, per-question JSONL samples, EvalPlus eval JSON files, and
adversarial audit reports (`AUDIT_*.md`) are in `benchmark_results/` in this
repository per project policy.

## Credits

- Base model: [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) — Qwen Team
- Heretic variant and BF16 source: [llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF](https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF) — llmfan46
- Abliteration tool: [Heretic](https://github.com/p-e-w/heretic) v1.2.0 by p-e-w
- GGUF runtime: [llama.cpp](https://github.com/ggml-org/llama.cpp)
- Quantization method and workflow: [Cerebellum](https://github.com/deucebucket/cerebellum) — deucebucket