--- license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/LICENSE library_name: gguf base_model: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF base_model_relation: quantized model_name: Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF model_creator: Qwen model_type: qwen3 quantized_by: deucebucket pipeline_tag: image-text-to-text tags: - GGUF - qwen3 - qwen - quantized - cerebellum - imatrix - moe - mixed-precision - 3-bit - heretic - uncensored - abliterated model-index: - name: Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF results: - task: name: Text Generation type: text-generation dataset: name: AI2 Reasoning Challenge type: ai2_arc config: ARC-Challenge split: test metrics: - name: normalized accuracy type: acc_norm value: 0.9548 source: name: Local audited benchmark run (RTX 3090, llama.cpp) url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results - task: name: Text Generation type: text-generation dataset: name: HellaSwag type: hellaswag split: validation metrics: - name: normalized accuracy type: acc_norm value: 0.9178 source: name: Local audited benchmark run (RTX 3090, llama.cpp) url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results - task: name: Text Generation type: text-generation dataset: name: MMLU-Redux type: cais/mmlu config: all split: test metrics: - name: accuracy type: acc value: 0.7542 source: name: Local audited benchmark run (RTX 3090, llama.cpp) url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results - task: name: Text Generation type: text-generation dataset: name: HumanEval+ (pass@1) type: openai_humaneval split: test metrics: - name: pass@1 type: pass@1 value: 0.6463 source: name: Local audited benchmark run (RTX 3090, llama.cpp) url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results - task: name: Text Generation type: text-generation dataset: name: WikiText-2 Perplexity type: wikitext config: wikitext-2-raw-v1 split: test metrics: - name: perplexity type: perplexity value: 7.157 source: name: Local audited benchmark run (RTX 3090, llama.cpp) url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results ---

Cerebellum

# Qwen 3.6 35B-A3B Heretic — Cerebellum GGUF Sensitivity-guided mixed-precision quantization of [llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF](https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF), which is itself a decensored variant of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) produced by llmfan46 using [Heretic](https://github.com/p-e-w/heretic) v1.2.0. All future Heretic versions of this build will live in this repository. Version identifiers appear only in filenames, not in the repo name. ## Files | File | Size | Description | |------|------|-------------| | `Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf` | **11.96 GB** (11,955,468,384 bytes) | Cerebellum v3 recipe — recommended | | `Qwen3.6-35B-A3B-uncensored-heretic-mmproj-BF16.gguf` | ~858 MB | Vision projector, passed through unmodified from llmfan46's repo | The vision projector is required for multimodal (image/video) use. It is identical to the file distributed by llmfan46 and is included here for single-repo convenience only. ## Provenance 1. **Base architecture**: [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) — Qwen Team (Apache-2.0) 2. **Heretic variant**: [llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF](https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF) — llmfan46. The BF16 GGUF from that repository was used as the direct quantization source. llmfan46 applied Heretic v1.2.0 with the Magnitude-Preserving Orthogonal Ablation (MPOA) method, targeting `attn.o_proj`, `attn.out_proj`, and `mlp.down_proj`. Their reported result: **0.0015 KL divergence** from base, **10/100 refusals** vs 83/100 on the original model. 3. **Quantization**: Cerebellum v3 recipe transferred verbatim from the stock [deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF](https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF) build — same 360-entry tensor-type override file, same Unsloth coder imatrix. ## Benchmarks Benchmarks run on these GGUF files directly using llama.cpp on RTX 3090. All numbers are audited; every failed answer was manually verified as a genuine model error — audit reports are in `benchmark_results/AUDIT_*.md`. Full per-question detail (summary JSON, samples JSONL, EvalPlus eval JSON, adversarial audit reports) is in `benchmark_results/` in this repository. ### Heretic Cerebellum v1 (11.96 GB) vs baselines | Benchmark | Heretic Cerebellum v1 (11.96 GB) | Stock Cerebellum v3 (11.1 GB) | Uniform Q3_K_M baseline (15.6 GB) | Notes | |-----------|:---:|:---:|:---:|---| | Wiki PPL (ctx 2048, 32 chunks) | 7.157 ± 0.103 | 7.099 ± 0.102 | — | RTX 3090, identical invocation | | ARC-Challenge | **95.48%** (1172 q) | 95.82% | 96.10% | 25-shot | | HellaSwag | **91.78%** (10042 q) | 92.28% | 91.50% | 10-shot | | MMLU-Redux | **75.42%** (2400 q) | 75.00% | 74.12% | 5-shot | | HumanEval base | **68.29%** (164 problems) | 70.73% | — | pass@1, evalplus | | HumanEval+ | **64.63%** | 65.24% | 56.71% | pass@1, evalplus | | Vision smoke | **100%** (24/24) | 100% (36 images) | — | basic image description | | RealWorldQA | **76.0%** (n=50) | ~78% | — | single-question granularity ±2% | Stock Cerebellum v3 is the same tensor allocation applied to the non-heretic base. Uniform Q3_K_M baseline is the stock (non-heretic) model at 15.6 GB — the standard comparison point for showing what mixed-precision buys at reduced size. ## Head-to-head: same weights, uniform quant llmfan46's own uniform Q3_K_M of the identical heretic weights (16.87 GB) was benchmarked on the identical harness, same night, same protocol. | Metric | Heretic Cerebellum v1 (11.96 GB) | Uniform Q3_K_M (16.87 GB) | |--------|:---:|:---:| | Wiki PPL (ctx 2048, 32 chunks) | 7.157 ± 0.103 | 7.220 ± 0.106 | | ARC-Challenge | 95.48% | 95.56% | | HellaSwag | 91.78% | 91.92% | | MMLU-Redux | 75.42% | 74.88% | | HumanEval base | 68.29% | 65.24% | | HumanEval+ | 64.63% | 57.93% | The Cerebellum allocation is 29% smaller and scores equal-or-better on PPL, MMLU and HumanEval+ (both runs' per-question artifacts in benchmark_results_uniform/). ## Heretic Abliteration Details (from llmfan46) The following parameters are as reported in llmfan46's model card and are reproduced here for downstream reference. | Parameter | Value | |-----------|-------| | direction_index | 19.93 | | attn.out_proj.max_weight | 1.49 | | attn.out_proj.max_weight_position | 23.45 | | attn.out_proj.min_weight | 1.08 | | attn.out_proj.min_weight_distance | 16.54 | | mlp.down_proj.max_weight | 1.46 | | mlp.down_proj.max_weight_position | 28.05 | | mlp.down_proj.min_weight | 1.27 | | mlp.down_proj.min_weight_distance | 18.79 | | attn.o_proj.max_weight | 1.47 | | attn.o_proj.max_weight_position | 24.35 | | attn.o_proj.min_weight | 0.07 | | attn.o_proj.min_weight_distance | 22.58 | Targeted components: `attn.o_proj`, `attn.out_proj`, `mlp.down_proj`. Tool: [Heretic](https://github.com/p-e-w/heretic) v1.2.0, method: Magnitude-Preserving Orthogonal Ablation (MPOA) ([reference](https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration)). ## Cerebellum v3 Tensor Allocation Same allocation as the stock build. Listed here for reference. | Group | Precision | Rationale | |-------|-----------|-----------| | `attn_qkv` | Q3_K_M | Critical for vision and attention routing | | `ssm_out` | Q3_K_M | Most sensitive tensor per ablation (+0.24 PPL) | | `ffn_gate_exps` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation | | `ffn_up_exps` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation | | `ffn_down_exps` | Q2_K | Acceptable loss for size savings | | `ffn_gate_shexp` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation | | `ffn_up_shexp` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation | | `ffn_down_shexp` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation | | `attn_gate` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation | | `ssm_alpha`, `ssm_beta` | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation | Protected: all norms (F32), SSM state parameters (F32), router tensors (default). 6 of 10 groups perform at least as well at Q2_K as at Q3_K_M in reverse ablation — imatrix-guided Q2_K acts as regularization on gate, mixing, and shared-expert weights for this architecture. ## Perplexity Note Wiki PPL for the Heretic build (7.157) is 0.058 higher than the stock Cerebellum v3 (7.099). The difference is within the measurement uncertainty (overlapping ±0.1 error bars) and reflects the small distributional shift introduced by abliteration rather than quantization quality. Both builds used the same wikitext-test.txt corpus, ctx 2048, 32 chunks, RTX 3090. ## Measured launch (RTX 3090, llama.cpp) Measured 2026-06-13 on a single RTX 3090 (24 GB), one `llama-server`, KV cache `q8_0`: | metric | measured | |---|---| | decode speed | 149 tok/s | | peak VRAM (4-slot serving) | 14.2 GB | | max measured context (q8_0 KV) | 131,072 | ```bash llama-server -m Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \ -ngl 99 --parallel 4 -c 24576 --jinja ``` _This rig's measurements; no quality claims beyond them._ ## Runtime — Casual Deployment ```bash llama-server \ --model Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \ --mmproj Qwen3.6-35B-A3B-uncensored-heretic-mmproj-BF16.gguf \ --n-gpu-layers 99 \ --ctx-size 8192 \ --jinja ``` `--jinja` is required for Qwen3.6. The `enable_thinking` chat-template flag only takes effect when the Jinja template path is active; without it, the model defaults to thinking mode on every request. Non-thinking requests require an explicit flag at the API level: ```json {"chat_template_kwargs": {"enable_thinking": false}} ``` Qwen3.6 does not support the `/think` and `/nothink` soft-switch tokens used by Qwen3.5. Thinking mode is on by default. ## Recommended Sampling Parameters From the official Qwen3.6-35B-A3B documentation. | Mode | temperature | top_p | top_k | min_p | presence_penalty | repetition_penalty | |------|-------------|-------|-------|-------|------------------|--------------------| | Thinking — general | 1.0 | 0.95 | 20 | 0.0 | 1.5 | 1.0 | | Thinking — precise coding (WebDev) | 0.6 | 0.95 | 20 | 0.0 | 0.0 | 1.0 | | Non-thinking (instruct) | 0.7 | 0.80 | 20 | 0.0 | 1.5 | 1.0 | `presence_penalty` can be adjusted between 0 and 2 to reduce repetition loops; higher values may occasionally cause language mixing. ## Reproduction Standard Cerebellum recipe. The tensor-type override file and ablation logs from the stock v3 build apply directly. ```bash # 1. imatrix (constant ~300 MB RAM) python -m osmosis.imatrix_stream \ --model Qwen3.6-35B-A3B-uncensored-heretic-BF16.gguf \ --output imatrix.dat # 2. quantize with stock llama-quantize llama-quantize \ --imatrix imatrix.dat \ --tensor-type-file cerebellum_v3_overrides.txt \ Qwen3.6-35B-A3B-uncensored-heretic-BF16.gguf \ Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \ Q3_K_M ``` The imatrix used for this build was generated from the Unsloth coder corpus (same corpus as the stock Cerebellum v3 build). The 360-line tensor override file (`cerebellum_v3_overrides.txt`) is included in this repository alongside the ablation logs. ## Benchmark Artifacts Summary JSONs, per-question JSONL samples, EvalPlus eval JSON files, and adversarial audit reports (`AUDIT_*.md`) are in `benchmark_results/` in this repository per project policy. ## Credits - Base model: [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) — Qwen Team - Heretic variant and BF16 source: [llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF](https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF) — llmfan46 - Abliteration tool: [Heretic](https://github.com/p-e-w/heretic) v1.2.0 by p-e-w - GGUF runtime: [llama.cpp](https://github.com/ggml-org/llama.cpp) - Quantization method and workflow: [Cerebellum](https://github.com/deucebucket/cerebellum) — deucebucket