deucebucket's picture
docs: final audited benchmarks + model-index evaluation panel
12074d3 verified
|
Raw
History Blame
11.9 kB
metadata
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/LICENSE
library_name: gguf
base_model: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF
base_model_relation: quantized
model_name: Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF
model_creator: Qwen
model_type: qwen3
quantized_by: deucebucket
pipeline_tag: image-text-to-text
tags:
  - GGUF
  - qwen3
  - qwen
  - quantized
  - cerebellum
  - imatrix
  - moe
  - mixed-precision
  - 3-bit
  - heretic
  - uncensored
  - abliterated
model-index:
  - name: Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF
    results:
      - task:
          name: Text Generation
          type: text-generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - name: normalized accuracy
            type: acc_norm
            value: 0.9548
        source:
          name: Local audited benchmark run (RTX 3090, llama.cpp)
          url: >-
            https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results
      - task:
          name: Text Generation
          type: text-generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - name: normalized accuracy
            type: acc_norm
            value: 0.9178
        source:
          name: Local audited benchmark run (RTX 3090, llama.cpp)
          url: >-
            https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results
      - task:
          name: Text Generation
          type: text-generation
        dataset:
          name: MMLU-Redux (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - name: accuracy
            type: acc
            value: 0.7542
        source:
          name: Local audited benchmark run (RTX 3090, llama.cpp)
          url: >-
            https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results
      - task:
          name: Text Generation
          type: text-generation
        dataset:
          name: HumanEval+ (pass@1)
          type: openai_humaneval
          split: test
        metrics:
          - name: pass@1
            type: pass@1
            value: 0.6463
        source:
          name: Local audited benchmark run (RTX 3090, llama.cpp)
          url: >-
            https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results
      - task:
          name: Text Generation
          type: text-generation
        dataset:
          name: WikiText-2 Perplexity
          type: wikitext
          config: wikitext-2-raw-v1
          split: test
        metrics:
          - name: perplexity
            type: perplexity
            value: 7.157
        source:
          name: Local audited benchmark run (RTX 3090, llama.cpp)
          url: >-
            https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Heretic-Cerebellum-GGUF/tree/main/benchmark_results

Qwen 3.6 35B-A3B Heretic β€” Cerebellum GGUF

Sensitivity-guided mixed-precision quantization of llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF, which is itself a decensored variant of Qwen/Qwen3.6-35B-A3B produced by llmfan46 using Heretic v1.2.0.

All future Heretic versions of this build will live in this repository. Version identifiers appear only in filenames, not in the repo name.

Files

File Size Description
Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf 11.96 GB (11,955,468,384 bytes) Cerebellum v3 recipe β€” recommended
Qwen3.6-35B-A3B-uncensored-heretic-mmproj-BF16.gguf ~858 MB Vision projector, passed through unmodified from llmfan46's repo

The vision projector is required for multimodal (image/video) use. It is identical to the file distributed by llmfan46 and is included here for single-repo convenience only.

Provenance

  1. Base architecture: Qwen/Qwen3.6-35B-A3B β€” Qwen Team (Apache-2.0)
  2. Heretic variant: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF β€” llmfan46. The BF16 GGUF from that repository was used as the direct quantization source. llmfan46 applied Heretic v1.2.0 with the Magnitude-Preserving Orthogonal Ablation (MPOA) method, targeting attn.o_proj, attn.out_proj, and mlp.down_proj. Their reported result: 0.0015 KL divergence from base, 10/100 refusals vs 83/100 on the original model.
  3. Quantization: Cerebellum v3 recipe transferred verbatim from the stock deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF build β€” same 360-entry tensor-type override file, same Unsloth coder imatrix.

Benchmarks

Benchmarks run on these GGUF files directly using llama.cpp on RTX 3090. All numbers are audited; every failed answer was manually verified as a genuine model error β€” audit reports are in benchmark_results/AUDIT_*.md. Full per-question detail (summary JSON, samples JSONL, EvalPlus eval JSON, adversarial audit reports) is in benchmark_results/ in this repository.

Heretic Cerebellum v1 (11.96 GB) vs baselines

Benchmark Heretic Cerebellum v1 (11.96 GB) Stock Cerebellum v3 (11.1 GB) Uniform Q3_K_M baseline (15.6 GB) Notes
Wiki PPL (ctx 2048, 32 chunks) 7.157 Β± 0.103 7.099 Β± 0.102 β€” RTX 3090, identical invocation
ARC-Challenge 95.48% (1172 q) 95.82% 96.10% 25-shot
HellaSwag 91.78% (10042 q) 92.28% 91.50% 10-shot
MMLU-Redux 75.42% (2400 q) 75.00% 74.12% 5-shot
HumanEval base 68.29% (164 problems) 70.73% β€” pass@1, evalplus
HumanEval+ 64.63% 65.24% 56.71% pass@1, evalplus
Vision smoke 100% (24/24) 100% (36 images) β€” basic image description
RealWorldQA 76.0% (n=50) ~78% β€” single-question granularity Β±2%

Stock Cerebellum v3 is the same tensor allocation applied to the non-heretic base. Uniform Q3_K_M baseline is the stock (non-heretic) model at 15.6 GB β€” the standard comparison point for showing what mixed-precision buys at reduced size.

Heretic Abliteration Details (from llmfan46)

The following parameters are as reported in llmfan46's model card and are reproduced here for downstream reference.

Parameter Value
direction_index 19.93
attn.out_proj.max_weight 1.49
attn.out_proj.max_weight_position 23.45
attn.out_proj.min_weight 1.08
attn.out_proj.min_weight_distance 16.54
mlp.down_proj.max_weight 1.46
mlp.down_proj.max_weight_position 28.05
mlp.down_proj.min_weight 1.27
mlp.down_proj.min_weight_distance 18.79
attn.o_proj.max_weight 1.47
attn.o_proj.max_weight_position 24.35
attn.o_proj.min_weight 0.07
attn.o_proj.min_weight_distance 22.58

Targeted components: attn.o_proj, attn.out_proj, mlp.down_proj.

Tool: Heretic v1.2.0, method: Magnitude-Preserving Orthogonal Ablation (MPOA) (reference).

Cerebellum v3 Tensor Allocation

Same allocation as the stock build. Listed here for reference.

Group Precision Rationale
attn_qkv Q3_K_M Critical for vision and attention routing
ssm_out Q3_K_M Most sensitive tensor per ablation (+0.24 PPL)
ffn_gate_exps Q2_K Q2_K regularization outperforms Q3_K_M in reverse ablation
ffn_up_exps Q2_K Q2_K regularization outperforms Q3_K_M in reverse ablation
ffn_down_exps Q2_K Acceptable loss for size savings
ffn_gate_shexp Q2_K Q2_K regularization outperforms Q3_K_M in reverse ablation
ffn_up_shexp Q2_K Q2_K regularization outperforms Q3_K_M in reverse ablation
ffn_down_shexp Q2_K Q2_K regularization outperforms Q3_K_M in reverse ablation
attn_gate Q2_K Q2_K regularization outperforms Q3_K_M in reverse ablation
ssm_alpha, ssm_beta Q2_K Q2_K regularization outperforms Q3_K_M in reverse ablation

Protected: all norms (F32), SSM state parameters (F32), router tensors (default).

6 of 10 groups perform at least as well at Q2_K as at Q3_K_M in reverse ablation β€” imatrix-guided Q2_K acts as regularization on gate, mixing, and shared-expert weights for this architecture.

Perplexity Note

Wiki PPL for the Heretic build (7.157) is 0.058 higher than the stock Cerebellum v3 (7.099). The difference is within the measurement uncertainty (overlapping Β±0.1 error bars) and reflects the small distributional shift introduced by abliteration rather than quantization quality. Both builds used the same wikitext-test.txt corpus, ctx 2048, 32 chunks, RTX 3090.

Runtime β€” Casual Deployment

llama-server \
  --model Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
  --mmproj Qwen3.6-35B-A3B-uncensored-heretic-mmproj-BF16.gguf \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --jinja

--jinja is required for Qwen3.6. The enable_thinking chat-template flag only takes effect when the Jinja template path is active; without it, the model defaults to thinking mode on every request.

Non-thinking requests require an explicit flag at the API level:

{"chat_template_kwargs": {"enable_thinking": false}}

Qwen3.6 does not support the /think and /nothink soft-switch tokens used by Qwen3.5. Thinking mode is on by default.

Recommended Sampling Parameters

From the official Qwen3.6-35B-A3B documentation.

Mode temperature top_p top_k min_p presence_penalty repetition_penalty
Thinking β€” general 1.0 0.95 20 0.0 1.5 1.0
Thinking β€” precise coding (WebDev) 0.6 0.95 20 0.0 0.0 1.0
Non-thinking (instruct) 0.7 0.80 20 0.0 1.5 1.0

presence_penalty can be adjusted between 0 and 2 to reduce repetition loops; higher values may occasionally cause language mixing.

Reproduction

Standard Cerebellum recipe. The tensor-type override file and ablation logs from the stock v3 build apply directly.

# 1. imatrix (constant ~300 MB RAM)
python -m osmosis.imatrix_stream \
    --model Qwen3.6-35B-A3B-uncensored-heretic-BF16.gguf \
    --output imatrix.dat

# 2. quantize with stock llama-quantize
llama-quantize \
    --imatrix imatrix.dat \
    --tensor-type-file cerebellum_v3_overrides.txt \
    Qwen3.6-35B-A3B-uncensored-heretic-BF16.gguf \
    Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
    Q3_K_M

The imatrix used for this build was generated from the Unsloth coder corpus (same corpus as the stock Cerebellum v3 build).

The 360-line tensor override file (cerebellum_v3_overrides.txt) is included in this repository alongside the ablation logs.

Benchmark Artifacts

Summary JSONs, per-question JSONL samples, EvalPlus eval JSON files, and adversarial audit reports (AUDIT_*.md) are in benchmark_results/ in this repository per project policy.

Credits