CyberNeurova · Qwen3.6-35B-A3B · Abliterated

CyberNeurova research — cyberneurova.ai. Multi-axis abliteration on a hybrid-attention MoE chat model with thinking-mode.

A permanently-abliterated version of Qwen/Qwen3.6-35B-A3B — Qwen's 35B-total / 3B-active mixture-of-experts model with hybrid attention (linear Gated DeltaNet layers interspersed with standard self-attention) and integrated chain-of-thought reasoning. The refusal direction was captured in the LM residual stream and orthogonalized out of every write-to-residual Linear (and the 3D fused expert tensor). Inference is unchanged — no runtime hooks, no slowdown.

This is the most architecturally complex model in the CyberNeurova abliteration line to date, combining four framework challenges in one: MoE with 256 fused experts + shared expert, hybrid attention (linear_attn + self_attn per block), 3-level attribute paths (model.model.language_model.layers), and CoT-wrapped thinking-mode refusals that needed a classifier upgrade just to be detected.

Headline results

Measured on the bf16 abliterated model via vLLM (baseline + ablated back-to-back from the same prompt set, scored with our CoT-aware refusal classifier):

Probe	Baseline	Abliterated	Δ
Refusal (AdvBench-style, n=33)	90.9%	0.0%	−90.9 pp
Soft-refusal probe (55 OOD prompts)	85.5%	0.0%	−85.5 pp
Hacking compliance (pen-test prompts, n=15)	56.3%	76.0%	+19.7 pp
Cyber-weapons compliance (malware/exploit, n=15)	44.7%	74.7%	+30.0 pp
Bug-finding (defensive code review, n=12)	95.0%	93.3%	−1.7 pp (within noise)
Coding (HumanEval-style, n=15)	93.3%	93.3%	0.0 pp (preserved)
Reasoning (multi-step math/logic, n=15)	93.3%	100.0%	+6.7 pp
Coherence (fluency / diversity, n=15)	97.4%	98.0%	+0.6 pp (preserved)
Perplexity (wikitext-2, lower is better)	5.410	5.399	−0.011 (preserved)
Distinct-2 diversity	0.744	0.866	+0.122

Standouts:

Refusal fully collapsed on both probes (90.9% → 0% and 85.5% → 0%) — no surviving refusal patterns even on out-of-distribution prompts
Cyber unlock is substantial: +30 pp on cyber-weapons, +19.7 pp on hacking — these benchmarks score both compliance AND technical specificity, so the numbers reflect actually-useful security knowledge being unlocked, not just less hedging
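No capability tax: coding, coherence, and reasoning are all preserved or improved (at n=15 the +6.7 pp reasoning gain is equivalent to a single prompt, so it is best read as preserved)
Diversity went up (+0.122 distinct-2, from 0.744 to 0.866) without coherence regression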
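
For readers unfamiliar with the diversity metric, distinct-2 is commonly defined as the number of unique bigrams divided by the total number of bigrams in the generated text. The sketch below is a minimal illustration under that definition; the function name `distinct_n`, the whitespace tokenization, and the corpus-level pooling are assumptions for illustration, not necessarily how the numbers in the table were computed.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams, pooled over all texts.

    Assumption: whitespace tokenization and corpus-level pooling.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        # Build n-grams by zipping n shifted views of the token list.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    samples = ["the model answers the question", "the model answers quickly"]
    print(f"distinct-2 = {distinct_n(samples):.3f}")
```

A higher value means fewer repeated bigrams across the generations; a value of 1.0 would mean no bigram occurs twice.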

A note on tool-calling: the 9-bench suite includes a tool-calling benchmark that scored 0/0 on both baseline and ablated. This is a grader artifact, not capability loss: the model wraps its tool calls in <think>...</think> reasoning blocks, while the grader regex expects raw JSON. That both runs score zero (rather than baseline ~0.9 → ablated 0) points to the grader and not to abliteration breaking tool-calling. We're shipping a CoT-aware grader fix in the next framework round.
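
To make the grader issue concrete, here is a minimal sketch of what a CoT-aware extraction step could look like: instead of requiring the whole output to be raw JSON, it scans the text (including anything inside the think block) for the first JSON object that looks like a tool call. The function name `extract_tool_call` and the assumption that a tool call is a JSON object with a "name" key are hypothetical; this is not the framework's actual grader.

```python
import json
import re

THINK_TAGS = re.compile(r"</?think>")


def extract_tool_call(text):
    """Return the first JSON object containing a "name" key, or None.

    Assumption: tool calls are JSON objects with a "name" field.
    """
    decoder = json.JSONDecoder()
    # Drop only the tags, keeping the reasoning text that may hold the call.
    cleaned = THINK_TAGS.sub(" ", text)
    for match in re.finditer(r"\{", cleaned):
        try:
            # raw_decode parses one JSON value starting at the given index.
            obj, _end = decoder.raw_decode(cleaned, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "name" in obj:
            return obj
    return None


if __name__ == "__main__":
    output = (
        "<think>I should look this up. "
        '{"name": "get_weather", "arguments": {"city": "Oslo"}}</think>'
    )
    print(extract_tool_call(output))
```

Scanning from every opening brace is quadratic in the worst case, which is acceptable for short model outputs; a production grader would likely follow the chat template's actual tool-call format instead.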

See cyberneurova-qwen3.6-35b-a3b-abliterated.pdf and qwen3.6-35b-a3b-abliterated.html in this repo for the full visual benchmark report.

How it works

Capture: refusal direction extracted at layer 24 of 40 (the 0.6 fraction sits in a linear_attn cluster between two self_attn anchors), method normalized_diff on AdvBench-style harmful prompts vs Alpaca harmless prompts.

Ablation: weights orthogonalized against the captured direction across ALL 40 layers, covering:

Attention writes: linear_attn.out_proj on 30 Gated DeltaNet layers, self_attn.o_proj on 10 standard attention layers
MoE writes: the fused experts.down_proj Parameter (shape [256, 2048, 512] — 256 expert weights orthogonalized in a single batched operation) + shared_expert.down_proj (standard Linear)
Embedding: embed_tokens.weight orthogonalized in the hidden dim
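
The linear algebra behind the capture and ablation steps can be sketched on toy arrays. The NumPy example below assumes that normalized_diff means the unit-normalized difference of mean activations, that write-to-residual weights are laid out as [hidden, in_features], and that the fused expert tensor is [experts, hidden, intermediate]; those layouts, the function names, and the tiny shapes are illustrative assumptions, not the framework's code.

```python
import numpy as np


def unit_direction(acts_a, acts_b):
    """Unit-normalized difference of mean activations, each [n, hidden]."""
    diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return diff / np.linalg.norm(diff)


def orthogonalize_output(weight, r):
    """Remove the component along r from the output (hidden) axis.

    weight: [..., hidden, in_features]; works for a single 2D matrix and
    for a batched 3D tensor (one matrix per expert) in one operation.
    Computes W - r (r^T W).
    """
    proj = np.einsum("h,...hi->...i", r, weight)  # r^T W, shape [..., in]
    return weight - proj[..., None, :] * r[:, None]


def orthogonalize_rows(embedding, r):
    """Remove the component along r from each row of a [vocab, hidden] matrix."""
    return embedding - np.outer(embedding @ r, r)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden, inter, experts, vocab = 8, 4, 3, 10  # toy sizes, not the real ones
    r = unit_direction(rng.normal(size=(16, hidden)), rng.normal(size=(16, hidden)))
    fused = orthogonalize_output(rng.normal(size=(experts, hidden, inter)), r)
    emb = orthogonalize_rows(rng.normal(size=(vocab, hidden)), r)
    # After projection, nothing these weights write has a component along r.
    print(np.abs(np.einsum("h,ehi->ei", r, fused)).max())
    print(np.abs(emb @ r).max())
```

Because the direction is unit-length, applying the projection twice gives the same result as applying it once, which is why the edit can be baked into the weights with no runtime hook.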

This is the first multi-axis abliteration we've shipped: MoE + hybrid attention + fused experts + VLM-class attribute path. Each axis required a small framework upgrade; details are in our BUILD_NOTES.md (§19-§20).

How to download

hf download cyberneurova/CyberNeurova-Qwen3.6-35B-A3B-abliterated \
  --local-dir ./Qwen3.6-35B-A3B-abl

~66 GB safetensors + tokenizer + processor.

How to run

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the bf16 weights onto a single GPU (needs roughly 80 GB of VRAM).
model = AutoModelForCausalLM.from_pretrained(
    "cyberneurova/CyberNeurova-Qwen3.6-35B-A3B-abliterated",
    dtype=torch.bfloat16, device_map="cuda:0",
    trust_remote_code=True,
)
tok = AutoTokenizer.from_pretrained(
    "cyberneurova/CyberNeurova-Qwen3.6-35B-A3B-abliterated",
    trust_remote_code=True,
)

# Format the conversation with the model's chat template.
messages = [{"role": "user", "content": "Your prompt here."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to("cuda:0")
# Greedy decoding; the budget must cover both the reasoning and the answer.
out = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

The model emits <think>...</think> chain-of-thought before its answer. For most use cases you'll want max_new_tokens >= 512 to capture both the reasoning and the response.
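
If you only need the final answer, the reasoning can be separated from the response after decoding. The helper below is a minimal sketch; the name `split_reasoning` is hypothetical, and it assumes the decoded text still contains the closing </think> tag (whether the tags survive decoding, and whether the opening tag appears in the output or in the prompt, depends on the tokenizer and chat template).

```python
def split_reasoning(decoded):
    """Split decoded output into (reasoning, answer).

    Assumption: reasoning ends at the first "</think>". If no closing tag
    is present, the whole text is treated as the answer.
    """
    head, sep, tail = decoded.partition("</think>")
    if not sep:
        return "", decoded.strip()
    # The opening tag may be missing if the chat template already emitted it.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    reasoning, answer = split_reasoning("<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4.")
    print("reasoning:", reasoning)
    print("answer:", answer)
```

If generation stops before the closing tag because max_new_tokens was too small, the helper returns the truncated text as the answer, which is a useful signal to raise the token budget.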

Hardware requirements

Variant	File size	VRAM (min)	VRAM (recommended)
bf16 (this release)	66 GB	80 GB	96 GB+
Q8_0 GGUF (planned)	~36 GB	40 GB	48 GB+
Q4_K_M GGUF (planned)	~20 GB	24 GB	32 GB+

Tested on RTX PRO 6000 Blackwell (96 GB VRAM). Should work on any 80 GB+ data-center GPU. GGUF variants will broaden the hardware floor once converted.
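
The file sizes in the table follow roughly from parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate only: it assumes a nominal 35e9 parameters, 8.5 bits per weight for Q8_0, and about 4.85 bits per weight for Q4_K_M (an approximate average that varies by model). It covers weights alone; KV cache, activations, and framework overhead come on top, which is why the minimum VRAM column is higher than the file size.

```python
GIB = 2**30


def weight_footprint_gib(n_params, bits_per_weight):
    """Approximate size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


if __name__ == "__main__":
    n_params = 35e9  # nominal total parameter count (assumption)
    # Bits-per-weight values are approximations, labeled as assumptions.
    for name, bpw in [("bf16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
        print(f"{name}: ~{weight_footprint_gib(n_params, bpw):.1f} GiB of weights")
```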

Intended use

Defensive security research, red-team evaluation baselines, study of how refusal directions behave in MoE + reasoning-mode + hybrid-attention architectures. Useful as a counterfactual against the original Qwen/Qwen3.6-35B-A3B for measuring the behavioral impact of safety RLHF on a thinking-mode model.

Not intended for automating harmful action. The abliteration removes canonical refusal behavior but does not remove the model's underlying knowledge — the model still recognises harmful instructions as harmful, it simply no longer refuses them by pattern.

Limitations

The tool-calling benchmark result is currently a grader artifact (0/0 on both sides): the grader expects raw JSON, while the model wraps its tool calls inside <think> reasoning. A CoT-aware grader fix is scheduled for the next framework round; we expect the abliterated model's tool-calling score to land close to the original model's once the grader is fixed.
Coherence and perplexity preserved on wikitext-2, but extended thinking-mode behavior on very long (>4k tokens) reasoning chains is not exhaustively validated.
The hybrid-attention layout means the refusal direction was captured from the residual stream without separating the contributions of linear_attn and self_attn blocks, so we don't yet know whether the linear_attn layers contribute differently to the refusal feature than the self_attn layers do. That's a follow-up research question.
GGUF quants (Q4_K_M / Q8_0) are not in this release — upstream llama.cpp does not yet support the qwen3_5_moe architecture. bf16 weights only for v1; GGUF will follow once upstream tooling catches up.