CyberNeurova · Qwen3.6-35B-A3B · Abliterated
CyberNeurova research — cyberneurova.ai. Multi-axis abliteration on a hybrid-attention MoE chat model with thinking-mode.
A permanently-abliterated version of
Qwen/Qwen3.6-35B-A3B — Qwen's
35B-total / 3B-active mixture-of-experts model with hybrid attention (linear
Gated DeltaNet layers interspersed with standard self-attention) and
integrated chain-of-thought reasoning. The refusal direction was captured in
the LM residual stream and orthogonalized out of every write-to-residual
Linear (and the 3D fused expert tensor). Inference is unchanged — no runtime
hooks, no slowdown.
This is the most architecturally complex model in the CyberNeurova abliteration line to date, combining four framework challenges in one: MoE with 256 fused experts + shared expert, hybrid attention (either linear_attn or self_attn in each block), 3-level attribute paths (model.model.language_model.layers), and CoT-wrapped thinking-mode refusals that needed a classifier upgrade just to be detected.
Headline results
Measured on the bf16 abliterated model via vLLM (baseline + ablated back-to-back from the same prompt set, scored with our CoT-aware refusal classifier):
| Probe | Baseline | Abliterated | Δ |
|---|---|---|---|
| Refusal (AdvBench-style, n=33) | 90.9% | 0.0% | −90.9 pp |
| Soft-refusal probe (55 OOD prompts) | 85.5% | 0.0% | −85.5 pp |
| Hacking compliance (pen-test prompts, n=15) | 56.3% | 76.0% | +19.7 pp |
| Cyber-weapons compliance (malware/exploit, n=15) | 44.7% | 74.7% | +30.0 pp |
| Bug-finding (defensive code review, n=12) | 95.0% | 93.3% | −1.7 pp (within noise) |
| Coding (HumanEval-style, n=15) | 93.3% | 93.3% | 0.0 pp (preserved) |
| Reasoning (multi-step math/logic, n=15) | 93.3% | 100.0% | +6.7 pp |
| Coherence (fluency / diversity, n=15) | 97.4% | 98.0% | +0.6 pp (preserved) |
| perplexity (wikitext-2) | 5.410 | 5.399 | −0.011 (preserved) |
| distinct-2 diversity | 0.744 | 0.866 | +0.122 |
Standouts:
- Refusal fully collapsed on both probes (90.9% → 0% and 85.5% → 0%) — no surviving refusal patterns even on out-of-distribution prompts
- Cyber unlock is substantial: +30 pp on cyber-weapons, +19.7 pp on hacking — these benchmarks score both compliance AND technical specificity, so the numbers reflect actually-useful security knowledge being unlocked, not just less hedging
- No capability tax — coding/coherence/reasoning all preserved or improved
- Diversity went up (distinct-2 0.744 → 0.866, +0.122) without coherence regression
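For readers unfamiliar with the distinct-2 row, here is a minimal sketch of the usual definition: unique bigrams divided by total bigrams over a set of generations. This card does not specify the tokenization or pooling behind the reported numbers, so the whitespace split and the function name distinct_n are assumptions for illustration only.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams, pooled over all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()  # assumption: whitespace tokenization
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    # Texts shorter than n tokens contribute nothing; avoid dividing by zero.
    return len(unique) / total if total else 0.0
```

A higher value means fewer repeated bigrams, so the metric rewards varied output but says nothing about whether that output is coherent. That is why it is read alongside the coherence and perplexity rows.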
A note on tool-calling: the 9-bench suite includes a tool-calling
benchmark that scored 0 on both baseline and ablated (0/0). This is a
grader artifact, not capability loss. The model wraps its tool
calls in <think>...</think> reasoning blocks, while the grader regex
expects raw JSON. Because the unmodified baseline also scores zero
(rather than baseline ~0.9 → ablated 0), the zero points to the grader
and not to abliteration breaking tool-calling. We're shipping a
CoT-aware grader fix in the next framework round.
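To illustrate the kind of fix described above, here is a rough sketch of a grader helper that scans the whole output, including text inside <think> blocks, for the first parseable JSON object. It is not the framework's grader. The function name extract_tool_call and the assumption that a tool call is a JSON object with a "name" key are hypothetical.

```python
import json
import re


def extract_tool_call(output: str):
    """Return the first JSON object with a "name" key found anywhere in output, or None."""
    decoder = json.JSONDecoder()
    # Try to decode a JSON value starting at every opening brace.
    for match in re.finditer(r"\{", output):
        try:
            obj, _end = decoder.raw_decode(output, match.start())
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; keep scanning
        # Assumption: a tool call is a dict carrying a "name" field.
        if isinstance(obj, dict) and "name" in obj:
            return obj
    return None
```

Because it never anchors on the start of the string, the same helper accepts raw JSON output and JSON embedded in reasoning text. Both baseline and ablated generations would then be scored by the same rule.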
See cyberneurova-qwen3.6-35b-a3b-abliterated.pdf and
qwen3.6-35b-a3b-abliterated.html
in this repo for the full visual benchmark report.
How it works
Capture: refusal direction extracted at layer 24 of 40 (the 0.6
fraction sits in a linear_attn cluster between two self_attn
anchors), method normalized_diff on AdvBench-style harmful prompts vs
Alpaca harmless prompts.
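As a rough illustration of the capture step, the sketch below computes a unit-length difference of mean activations between harmful and harmless prompts, in the spirit of Arditi et al. We assume that normalized_diff means this computation; the card does not spell it out. NumPy stands in for the framework's tensor ops, and the toy activations are random data, not model outputs.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along mean(harmful) - mean(harmless).

    Both inputs have shape [num_prompts, hidden_dim] and are assumed to be
    residual-stream activations taken at one layer and token position.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0:
        raise ValueError("activation means are identical; no direction to extract")
    return diff / norm


# Toy demo: plant a known direction in the "harmful" set and recover it.
rng = np.random.default_rng(0)
hidden = 16  # toy size, not the model's hidden dim
planted = np.zeros(hidden)
planted[3] = 1.0
harmless = rng.normal(size=(64, hidden))
harmful = rng.normal(size=(64, hidden)) + 5.0 * planted
r = refusal_direction(harmful, harmless)
```

In the toy demo the recovered vector r points almost entirely along the planted axis, which is the behavior the capture step relies on at layer 24.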
Ablation: weights orthogonalized against the captured direction across ALL 40 layers, covering:
- Attention writes: linear_attn.out_proj on 30 Gated DeltaNet layers, self_attn.o_proj on 10 standard attention layers
- MoE writes: the fused experts.down_proj Parameter (shape [256, 2048, 512]; 256 expert weights orthogonalized in a single batched operation) + shared_expert.down_proj (standard Linear)
- Embedding: embed_tokens.weight orthogonalized in the hidden dim
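The three cases in the list above reduce to the same projection, W' = W - r rᵀ W, applied along whichever axis writes to the residual stream. The sketch below shows it with NumPy on toy shapes. It assumes the fused tensor is laid out as [experts, hidden, intermediate] and that Linear weights are [hidden, in]. The helper names are illustrative, not the framework's API.

```python
import numpy as np


def orthogonalize_out(weight, r):
    """Remove unit direction r from the output (hidden) axis.

    Works for [hidden, in] Linear weights and for fused
    [experts, hidden, in] tensors in one batched operation.
    """
    # Component that each input column writes along r.
    proj = np.einsum("h,...hi->...i", r, weight)
    # Subtract that component: W - r (r^T W).
    return weight - np.einsum("h,...i->...hi", r, proj)


def orthogonalize_embedding(embed, r):
    """Remove r from each row of a [vocab, hidden] embedding matrix."""
    return embed - np.outer(embed @ r, r)


rng = np.random.default_rng(0)
hidden, inner, experts, vocab = 8, 4, 3, 10  # toy sizes, not the real shapes
r = rng.normal(size=hidden)
r /= np.linalg.norm(r)

linear = orthogonalize_out(rng.normal(size=(hidden, inner)), r)
fused = orthogonalize_out(rng.normal(size=(experts, hidden, inner)), r)
embed = orthogonalize_embedding(rng.normal(size=(vocab, hidden)), r)
```

After the projection, r @ linear, the per-expert products for fused, and embed @ r are all zero up to floating-point error, so none of these weights can write along r. Since the change is baked into the weights, no hook is needed at inference time.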
This is the first multi-axis abliteration we've shipped — MoE + hybrid
attention + fused-experts + VLM-class attribute path. Each axis required
a small framework upgrade; details in our BUILD_NOTES.md (sections
§19-§20).
How to download
hf download cyberneurova/CyberNeurova-Qwen3.6-35B-A3B-abliterated \
--local-dir ./Qwen3.6-35B-A3B-abl
~66 GB safetensors + tokenizer + processor.
How to run
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the bf16 weights onto a single GPU.
model = AutoModelForCausalLM.from_pretrained(
    "cyberneurova/CyberNeurova-Qwen3.6-35B-A3B-abliterated",
    dtype="bfloat16", device_map="cuda:0",
    trust_remote_code=True,
)
tok = AutoTokenizer.from_pretrained(
    "cyberneurova/CyberNeurova-Qwen3.6-35B-A3B-abliterated",
    trust_remote_code=True,
)

# Build the chat-formatted prompt and generate greedily.
messages = [{"role": "user", "content": "Your prompt here."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to("cuda:0")
out = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
The model emits <think>...</think> chain-of-thought before its answer.
For most use cases you'll want max_new_tokens >= 512 to capture both
the reasoning and the response.
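If you only want the final answer, one possible approach is to split the decoded text on the closing tag. The sketch below assumes the <think> and </think> tags appear literally in the decoded string and that the opening tag may be missing when the chat template already emitted it in the prompt. The helper name split_thinking is illustrative.

```python
def split_thinking(text: str):
    """Return (reasoning, answer) from a decoded generation."""
    head, sep, tail = text.partition("</think>")
    if sep:
        # Closing tag found: everything before it is reasoning.
        return head.replace("<think>", "", 1).strip(), tail.strip()
    if "<think>" in head:
        # Reasoning was cut off before the closing tag,
        # e.g. because max_new_tokens was too small.
        return head.replace("<think>", "", 1).strip(), ""
    # No tags at all: treat the whole text as the answer.
    return "", text.strip()
```

An empty answer alongside non-empty reasoning is a useful signal that the token budget ran out mid-thought and max_new_tokens should be raised.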
Hardware requirements
| Variant | File size | VRAM (min) | VRAM (recommended) |
|---|---|---|---|
| bf16 (this release) | 66 GB | 80 GB | 96 GB+ |
| Q8_0 GGUF (planned) | ~36 GB | 40 GB | 48 GB+ |
| Q4_K_M GGUF (planned) | ~20 GB | 24 GB | 32 GB+ |
Tested on RTX PRO 6000 Blackwell (102 GB VRAM). Should work on any 80 GB+ data-center GPU. GGUF variants will broaden the hardware floor once converted.
Intended use
Defensive security research, red-team evaluation baselines, study of
how refusal directions behave in MoE + reasoning-mode + hybrid-attention
architectures. Useful as a counterfactual against the original
Qwen/Qwen3.6-35B-A3B for measuring the behavioral impact of safety
RLHF on a thinking-mode model.
Not intended for automating harmful action. The abliteration removes canonical refusal behavior but does not remove the model's underlying knowledge — the model still recognises harmful instructions as harmful, it simply no longer refuses them by pattern.
Limitations
- Tool-calling benchmark measurement is currently a grader artifact
(0/0 both sides): the grader expects raw JSON, the model wraps its
tool calls inside <think> reasoning. A CoT-aware grader fix is scheduled for the next framework round; we expect real tool-calling capability to land near baseline once the grader is fixed.
- Coherence and perplexity preserved on wikitext-2, but extended thinking-mode behavior on very long (>4k tokens) reasoning chains is not exhaustively validated.
- The hybrid-attention layout means the refusal direction was captured through both linear_attn and self_attn blocks indistinguishably — we don't know yet whether the linear_attn layers contribute differently to the refusal feature than the self_attn layers do. That's a follow-up research question.
- GGUF quants (Q4_K_M / Q8_0) are not in this release: upstream
llama.cpp does not yet support the qwen3_5_moe architecture. bf16 weights only for v1; GGUF will follow once upstream tooling catches up.
License
Apache 2.0 (inherits from upstream Qwen3.6-35B-A3B).
Acknowledgements
- Alibaba Qwen for Qwen3.6-35B-A3B
- Arditi et al. 2024 for the refusal-direction methodology