Qwen 3.6 35B-A3B Heretic — Cerebellum GGUF
Sensitivity-guided mixed-precision quantization of llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF, which is itself a decensored variant of Qwen/Qwen3.6-35B-A3B produced by llmfan46 using Heretic v1.2.0.
All future Heretic versions of this build will live in this repository. Version identifiers appear only in filenames, not in the repo name.
Files
| File | Size | Description |
|---|---|---|
| Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf | 11.96 GB (11,955,468,384 bytes) | Cerebellum v3 recipe (recommended) |
| Qwen3.6-35B-A3B-uncensored-heretic-mmproj-BF16.gguf | ~858 MB | Vision projector, passed through unmodified from llmfan46's repo |
The vision projector is required for multimodal (image/video) use. It is identical to the file distributed by llmfan46 and is included here for single-repo convenience only.
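The byte count in the table gives a cheap way to spot a truncated download before loading the model. The sketch below is only a size comparison, not a checksum; the helper names are hypothetical, and it assumes the file sits in the current directory under its published name.

```python
import os

# Byte count copied from the Files table.
EXPECTED_BYTES = {
    "Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf": 11_955_468_384,
}


def size_matches(path, expected_bytes):
    """Return True when the file on disk has exactly the expected size."""
    return os.path.getsize(path) == expected_bytes


def to_decimal_gb(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


for name, expected in EXPECTED_BYTES.items():
    if os.path.exists(name):
        status = "ok" if size_matches(name, expected) else "size mismatch"
    else:
        status = "not downloaded"
    print(f"{name}: {to_decimal_gb(expected):.2f} GB expected, {status}")
```

The 11.96 GB figure in the table corresponds to decimal gigabytes; tools that report GiB will show a smaller number for the same file.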
Provenance
- Base architecture: Qwen/Qwen3.6-35B-A3B, by the Qwen Team (Apache-2.0)
- Heretic variant: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF, by llmfan46.
  The BF16 GGUF from that repository was used as the direct quantization source.
  llmfan46 applied Heretic v1.2.0 with the Magnitude-Preserving Orthogonal
  Ablation (MPOA) method, targeting attn.o_proj, attn.out_proj, and
  mlp.down_proj. Their reported result: 0.0015 KL divergence from base,
  10/100 refusals vs 83/100 on the original model.
- Quantization: Cerebellum v3 recipe transferred verbatim from the stock deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF build, using the same 360-entry tensor-type override file and the same Unsloth coder imatrix.
Benchmarks
Benchmarks were run directly on these GGUF files using llama.cpp on an RTX 3090.
All numbers are audited: every failed answer was manually verified as a genuine
model error, and the audit reports are in benchmark_results/AUDIT_*.md.
Full per-question detail (summary JSON, samples JSONL, EvalPlus eval JSON,
adversarial audit reports) is in benchmark_results/ in this repository.
Heretic Cerebellum v1 (11.96 GB) vs baselines
| Benchmark | Heretic Cerebellum v1 (11.96 GB) | Stock Cerebellum v3 (11.1 GB) | Uniform Q3_K_M baseline (15.6 GB) | Notes |
|---|---|---|---|---|
| Wiki PPL (ctx 2048, 32 chunks) | 7.157 ± 0.103 | 7.099 ± 0.102 | — | RTX 3090, identical invocation |
| ARC-Challenge | 95.48% (1172 q) | 95.82% | 96.10% | 25-shot |
| HellaSwag | 91.78% (10042 q) | 92.28% | 91.50% | 10-shot |
| MMLU-Redux | 75.42% (2400 q) | 75.00% | 74.12% | 5-shot |
| HumanEval base | 68.29% (164 problems) | 70.73% | — | pass@1, evalplus |
| HumanEval+ | 64.63% | 65.24% | 56.71% | pass@1, evalplus |
| Vision smoke | 100% (24/24) | 100% (36 images) | — | basic image description |
| RealWorldQA | 76.0% (n=50) | ~78% | — | single-question granularity ±2% |
Stock Cerebellum v3 is the same tensor allocation applied to the non-heretic base. Uniform Q3_K_M baseline is the stock (non-heretic) model at 15.6 GB — the standard comparison point for showing what mixed-precision buys at reduced size.
Head-to-head: same weights, uniform quant
llmfan46's own uniform Q3_K_M of the identical heretic weights (16.87 GB) was benchmarked on the identical harness, same night, same protocol.
| Metric | Heretic Cerebellum v1 (11.96 GB) | Uniform Q3_K_M (16.87 GB) |
|---|---|---|
| Wiki PPL (ctx 2048, 32 chunks) | 7.157 ± 0.103 | 7.220 ± 0.106 |
| ARC-Challenge | 95.48% | 95.56% |
| HellaSwag | 91.78% | 91.92% |
| MMLU-Redux | 75.42% | 74.88% |
| HumanEval base | 68.29% | 65.24% |
| HumanEval+ | 64.63% | 57.93% |
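The Cerebellum allocation is 29% smaller and scores equal or better on Wiki PPL, MMLU-Redux, HumanEval base and HumanEval+. It trails the uniform quant slightly on ARC-Challenge (0.08 points) and HellaSwag (0.14 points). Per-question artifacts for both runs are in benchmark_results_uniform/.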
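The size and score comparison can be recomputed from the head-to-head table. The following is a small arithmetic sketch using only the figures listed there; the variable and function names are illustrative and not part of any benchmark harness.

```python
# Figures copied from the head-to-head table (Cerebellum v1, uniform Q3_K_M).
SIZE_GB = {"cerebellum": 11.96, "uniform": 16.87}
SCORES = {
    "ARC-Challenge": (95.48, 95.56),
    "HellaSwag": (91.78, 91.92),
    "MMLU-Redux": (75.42, 74.88),
    "HumanEval base": (68.29, 65.24),
    "HumanEval+": (64.63, 57.93),
}


def size_reduction_pct(small, large):
    """Percentage by which `small` is smaller than `large`."""
    return (1 - small / large) * 100


def score_deltas(scores):
    """Cerebellum minus uniform, in percentage points, per benchmark."""
    return {name: round(a - b, 2) for name, (a, b) in scores.items()}


reduction = size_reduction_pct(SIZE_GB["cerebellum"], SIZE_GB["uniform"])
deltas = score_deltas(SCORES)

print(f"size reduction: {reduction:.1f}%")
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} points")
```

Positive deltas favor the Cerebellum build. Perplexity is left out of the dictionary because lower is better there, which would flip the sign convention.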
Heretic Abliteration Details (from llmfan46)
The following parameters are as reported in llmfan46's model card and are reproduced here for downstream reference.
| Parameter | Value |
|---|---|
| direction_index | 19.93 |
| attn.out_proj.max_weight | 1.49 |
| attn.out_proj.max_weight_position | 23.45 |
| attn.out_proj.min_weight | 1.08 |
| attn.out_proj.min_weight_distance | 16.54 |
| mlp.down_proj.max_weight | 1.46 |
| mlp.down_proj.max_weight_position | 28.05 |
| mlp.down_proj.min_weight | 1.27 |
| mlp.down_proj.min_weight_distance | 18.79 |
| attn.o_proj.max_weight | 1.47 |
| attn.o_proj.max_weight_position | 24.35 |
| attn.o_proj.min_weight | 0.07 |
| attn.o_proj.min_weight_distance | 22.58 |
Targeted components: attn.o_proj, attn.out_proj, mlp.down_proj.
Tool: Heretic v1.2.0, method: Magnitude-Preserving Orthogonal Ablation (MPOA) (reference).
Cerebellum v3 Tensor Allocation
Same allocation as the stock build. Listed here for reference.
| Group | Precision | Rationale |
|---|---|---|
| attn_qkv | Q3_K_M | Critical for vision and attention routing |
| ssm_out | Q3_K_M | Most sensitive tensor per ablation (+0.24 PPL) |
| ffn_gate_exps | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| ffn_up_exps | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| ffn_down_exps | Q2_K | Acceptable loss for size savings |
| ffn_gate_shexp | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| ffn_up_shexp | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| ffn_down_shexp | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| attn_gate | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
| ssm_alpha, ssm_beta | Q2_K | Q2_K regularization outperforms Q3_K_M in reverse ablation |
Protected: all norms (F32), SSM state parameters (F32), router tensors (default).
6 of 10 groups perform at least as well at Q2_K as at Q3_K_M in reverse ablation — imatrix-guided Q2_K acts as regularization on gate, mixing, and shared-expert weights for this architecture.
Perplexity Note
Wiki PPL for the Heretic build (7.157) is 0.058 higher than for the stock Cerebellum v3 (7.099). The difference is within the measurement uncertainty (the ±0.1 error bars overlap). To the extent it is real, it more plausibly reflects the small distributional shift introduced by abliteration than a loss of quantization quality. Both builds used the same wikitext-test.txt corpus, ctx 2048, 32 chunks, RTX 3090.
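The overlap claim can be checked directly from the two reported values. This sketch treats each ± figure as a symmetric error bar, and the combined-error comparison further assumes the two runs are independent; neither assumption is stated in the benchmark logs.

```python
import math


def interval(mean, err):
    """Symmetric interval (low, high) around a reported value."""
    return (mean - err, mean + err)


def intervals_overlap(a, b):
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


heretic = interval(7.157, 0.103)
stock = interval(7.099, 0.102)

diff = round(7.157 - 7.099, 3)
# Combined error under the independence assumption noted above.
combined_err = math.hypot(0.103, 0.102)

print("intervals overlap:", intervals_overlap(heretic, stock))
print(f"difference {diff} vs combined error {combined_err:.3f}")
```

Both checks agree here: the intervals overlap, and the 0.058 gap is well below the combined error of roughly 0.145.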
Measured launch (RTX 3090, llama.cpp)
Measured 2026-06-13 on a single RTX 3090 (24 GB), one llama-server, KV cache q8_0:
| metric | measured |
|---|---|
| decode speed | 149 tok/s |
| peak VRAM (4-slot serving) | 14.2 GB |
| max measured context (q8_0 KV) | 131,072 |
llama-server -m Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
-ngl 99 --parallel 4 -c 24576 --jinja
These figures were measured on this rig only; no quality claims are made beyond them.
Runtime — Casual Deployment
llama-server \
--model Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
--mmproj Qwen3.6-35B-A3B-uncensored-heretic-mmproj-BF16.gguf \
--n-gpu-layers 99 \
--ctx-size 8192 \
--jinja
--jinja is required for Qwen3.6. The enable_thinking chat-template flag
only takes effect when the Jinja template path is active; without it, the
model defaults to thinking mode on every request.
Non-thinking requests require an explicit flag at the API level:
{"chat_template_kwargs": {"enable_thinking": false}}
Qwen3.6 does not support the /think and /nothink soft-switch tokens
used by Qwen3.5. Thinking mode is on by default.
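As an illustration of the API-level flag, the sketch below builds a chat completion request body that disables thinking and shows one way to post it with the standard library. The base URL (llama-server's default port 8080), the placeholder model name and the helper names are assumptions for the example, not values taken from this repository.

```python
import json
import urllib.request


def build_chat_request(prompt, enable_thinking=False, model="local"):
    """Build an OpenAI-style chat body carrying the chat-template flag.

    `model` is a placeholder; adjust it if your server checks the name.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def post_chat(payload, base_url="http://localhost:8080/v1"):
    """POST the payload to a running server (base_url is an assumption)."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_request("Summarize GGUF in one sentence.")
print(json.dumps(payload, indent=2))
```

Calling post_chat(payload) requires a llama-server instance started with --jinja as described above; without that flag the template argument has no effect.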
Recommended Sampling Parameters
From the official Qwen3.6-35B-A3B documentation.
| Mode | temperature | top_p | top_k | min_p | presence_penalty | repetition_penalty |
|---|---|---|---|---|---|---|
| Thinking — general | 1.0 | 0.95 | 20 | 0.0 | 1.5 | 1.0 |
| Thinking — precise coding (WebDev) | 0.6 | 0.95 | 20 | 0.0 | 0.0 | 1.0 |
| Non-thinking (instruct) | 0.7 | 0.80 | 20 | 0.0 | 1.5 | 1.0 |
presence_penalty can be adjusted between 0 and 2 to reduce repetition loops;
higher values may occasionally cause language mixing.
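For scripted use, the table can be kept as a small lookup. This is a minimal sketch: the mode keys and helper name are hypothetical, and the parameter names follow the table rather than any particular server. Field names differ between runtimes (llama.cpp, for example, calls the repetition penalty repeat_penalty), so map the keys to whatever your server expects.

```python
# Values copied from the sampling table; mode keys are illustrative.
PRESETS = {
    "thinking_general": {
        "temperature": 1.0, "top_p": 0.95, "top_k": 20,
        "min_p": 0.0, "presence_penalty": 1.5, "repetition_penalty": 1.0,
    },
    "thinking_coding": {
        "temperature": 0.6, "top_p": 0.95, "top_k": 20,
        "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0,
    },
    "non_thinking": {
        "temperature": 0.7, "top_p": 0.80, "top_k": 20,
        "min_p": 0.0, "presence_penalty": 1.5, "repetition_penalty": 1.0,
    },
}


def sampling_params(mode, presence_penalty=None):
    """Return a copy of a preset, optionally overriding presence_penalty."""
    params = dict(PRESETS[mode])
    if presence_penalty is not None:
        # The documented adjustment range is 0 to 2.
        if not 0.0 <= presence_penalty <= 2.0:
            raise ValueError("presence_penalty must be between 0 and 2")
        params["presence_penalty"] = presence_penalty
    return params


print(sampling_params("non_thinking"))
print(sampling_params("thinking_general", presence_penalty=0.5))
```

Returning a copy keeps the presets themselves unchanged when a caller tweaks one request.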
Reproduction
This build follows the standard Cerebellum recipe; the tensor-type override file and ablation logs from the stock v3 build apply directly.
# 1. imatrix (constant ~300 MB RAM)
python -m osmosis.imatrix_stream \
--model Qwen3.6-35B-A3B-uncensored-heretic-BF16.gguf \
--output imatrix.dat
# 2. quantize with stock llama-quantize
llama-quantize \
--imatrix imatrix.dat \
--tensor-type-file cerebellum_v3_overrides.txt \
Qwen3.6-35B-A3B-uncensored-heretic-BF16.gguf \
Qwen3.6-35B-A3B-Heretic-Cerebellum-v1-Q3_K_M.gguf \
Q3_K_M
The imatrix used for this build was generated from the Unsloth coder corpus (same corpus as the stock Cerebellum v3 build).
The 360-line tensor override file (cerebellum_v3_overrides.txt) is included
in this repository alongside the ablation logs.
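Before quantizing, it can help to sanity-check the override file by counting entries per target type. The layout of cerebellum_v3_overrides.txt is not described here, so this sketch assumes one name=type entry per line (the form llama-quantize's --tensor-type option takes) and runs on hypothetical sample lines.

```python
from collections import Counter


def tally_overrides(lines):
    """Count override entries per quantization type.

    Assumes one `name=type` entry per line; blank lines and lines
    starting with '#' are skipped.
    """
    counts = Counter()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        _name, sep, qtype = line.partition("=")
        if not sep:
            raise ValueError(f"unexpected line: {raw!r}")
        counts[qtype.strip().lower()] += 1
    return counts


# Hypothetical entries for illustration only, not copied from the real file.
sample = [
    "blk.0.attn_qkv.weight=q3_k",
    "blk.0.ffn_gate_exps.weight=q2_k",
    "",
    "blk.0.ffn_up_exps.weight=q2_k",
]
counts = tally_overrides(sample)
print(dict(counts), "total:", sum(counts.values()))
```

Against the real file, pass an open file handle instead of the sample list. If the one-entry-per-line assumption holds, the total should come to 360.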
Benchmark Artifacts
Summary JSONs, per-question JSONL samples, EvalPlus eval JSON files, and
adversarial audit reports (AUDIT_*.md) are in benchmark_results/ in this
repository per project policy.
Credits
- Base model: Qwen/Qwen3.6-35B-A3B — Qwen Team
- Heretic variant and BF16 source: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF — llmfan46
- Abliteration tool: Heretic v1.2.0 by p-e-w
- GGUF runtime: llama.cpp
- Quantization method and workflow: Cerebellum — deucebucket