CyberNeurova · Qwen3.6-35B-A3B · Abliterated

CyberNeurova research — cyberneurova.ai. Multi-axis abliteration on a hybrid-attention MoE chat model with thinking-mode.

A permanently-abliterated version of Qwen/Qwen3.6-35B-A3B — Qwen's 35B-total / 3B-active mixture-of-experts model with hybrid attention (linear Gated DeltaNet layers interspersed with standard self-attention) and integrated chain-of-thought reasoning. The refusal direction was captured in the LM residual stream and orthogonalized out of every write-to-residual Linear (and the 3D fused expert tensor). Inference is unchanged — no runtime hooks, no slowdown.

This is the most architecturally complex model in the CyberNeurova abliteration line to date, combining four framework challenges in one: MoE with 256 fused experts + shared expert, hybrid attention (linear_attn + self_attn per block), 3-level attribute paths (model.model.language_model.layers), and CoT-wrapped thinking-mode refusals that needed a classifier upgrade just to be detected.

Headline results

Measured on the bf16 abliterated model via vLLM (baseline + ablated back-to-back from the same prompt set, scored with our CoT-aware refusal classifier):

| Probe | Baseline | Abliterated | Δ |
|---|---|---|---|
| Refusal (AdvBench-style, n=33) | 90.9% | 0.0% | −90.9 pp |
| Soft-refusal probe (55 OOD prompts) | 85.5% | 0.0% | −85.5 pp |
| Hacking compliance (pen-test prompts, n=15) | 56.3% | 76.0% | +19.7 pp |
| Cyber-weapons compliance (malware/exploit, n=15) | 44.7% | 74.7% | +30.0 pp |
| Bug-finding (defensive code review, n=12) | 95.0% | 93.3% | −1.7 pp (within noise) |
| Coding (HumanEval-style, n=15) | 93.3% | 93.3% | 0.0 pp (preserved) |
| Reasoning (multi-step math/logic, n=15) | 93.3% | 100.0% | +6.7 pp |
| Coherence (fluency / diversity, n=15) | 97.4% | 98.0% | +0.6 pp (preserved) |
| Perplexity (wikitext-2) | 5.410 | 5.399 | −0.011 (preserved) |
| distinct-2 diversity | 0.744 | 0.866 | +0.122 |

Standouts:

  • Refusal fully collapsed on both probes (90.9% → 0% and 85.5% → 0%) — no surviving refusal patterns even on out-of-distribution prompts
  • Cyber unlock is substantial: +30 pp on cyber-weapons, +19.7 pp on hacking — these benchmarks score both compliance AND technical specificity, so the numbers reflect actually-useful security knowledge being unlocked, not just less hedging
  • No capability tax — coding/coherence/reasoning all preserved or improved
  • Diversity went up (distinct-2 0.744 → 0.866, +0.122) without coherence regression
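
For readers unfamiliar with the distinct-2 figure quoted above, here is a minimal sketch of how such a score is commonly computed: unique bigrams divided by total bigrams over a set of generations. The whitespace tokenization and the function name `distinct_n` are assumptions for illustration; the card does not say how the benchmark suite tokenizes or aggregates.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()  # assumption: plain whitespace tokenization
        # Slide a window of length n over the token list.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


# Toy generations (hypothetical, not from the benchmark set).
samples = ["the cat sat on the mat", "the cat sat on the sofa"]
score = distinct_n(samples, n=2)
print(f"distinct-2 = {score:.3f}")
```

A higher value means fewer repeated bigrams across the sampled outputs, which is why the card reads the increase as added diversity.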

A note on tool-calling: the 9-bench suite includes a tool-calling benchmark that scored 0/0 on both baseline and ablated. This is a grader artifact, not capability loss: the model wraps its tool calls in <think>...</think> reasoning blocks, while the grader regex expects raw JSON. That both runs score zero (rather than baseline ~0.9 → ablated 0) points to the grader, not to abliteration breaking tool-calling. We're shipping a CoT-aware grader fix in the next framework round.
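
To illustrate the kind of mismatch described above, the sketch below shows one way a grader could tolerate `<think>` wrapping: drop the tags (not their contents) and scan for the first JSON object that looks like a tool call. This is not the CyberNeurova grader; the helper name `extract_tool_call` and the assumption that a tool call is a JSON object with a `"name"` key are hypothetical.

```python
import json
import re

THINK_TAGS = re.compile(r"</?think>")


def extract_tool_call(output):
    """Return the first JSON object with a "name" key, or None if absent."""
    # Remove only the tags so JSON emitted inside the reasoning survives.
    text = THINK_TAGS.sub(" ", output)
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            obj, _end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue  # this "{" did not start a valid JSON value
        if isinstance(obj, dict) and "name" in obj:
            return obj
    return None


# Hypothetical model output with the call wrapped in a reasoning block.
sample = (
    '<think>I should look this up. '
    '{"name": "search", "arguments": {"q": "weather"}}</think>'
)
call = extract_tool_call(sample)
print(call)
```

A grader that matches only raw JSON at the start of the output would score this sample as a miss, which matches the 0/0 result on both sides.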

See cyberneurova-qwen3.6-35b-a3b-abliterated.pdf and qwen3.6-35b-a3b-abliterated.html in this repo for the full visual benchmark report.


How it works

Capture: refusal direction extracted at layer 24 of 40 (the 0.6 fraction sits in a linear_attn cluster between two self_attn anchors), method normalized_diff on AdvBench-style harmful prompts vs Alpaca harmless prompts.
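
As a rough illustration of the capture step, the sketch below computes a unit-norm difference of mean activations on toy data. It assumes that `normalized_diff` means "difference of class means, normalized to unit length", which the card does not spell out. The arrays are random stand-ins for residual-stream activations, and all names are hypothetical.

```python
import numpy as np

NUM_LAYERS = 40
CAPTURE_FRACTION = 0.6
# 0.6 of the way through 40 layers gives the layer index quoted above.
capture_layer = round(NUM_LAYERS * CAPTURE_FRACTION)


def normalized_diff(acts_a, acts_b):
    """Unit vector pointing from the mean of acts_b to the mean of acts_a."""
    diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return diff / np.linalg.norm(diff)


# Toy data: two sets of [num_prompts, hidden] activations that differ
# along coordinate 3 (purely synthetic, no model involved).
rng = np.random.default_rng(0)
hidden = 16
set_b = rng.normal(size=(32, hidden))
set_a = rng.normal(size=(32, hidden)) + 2.0 * np.eye(hidden)[3]

direction = normalized_diff(set_a, set_b)
print(capture_layer, int(np.argmax(np.abs(direction))))
```

On real models the two activation sets would come from the same layer and token position for both prompt classes; those details live in the framework and are not described in this card.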

Ablation: weights orthogonalized against the captured direction across ALL 40 layers, covering:

  • Attention writes: linear_attn.out_proj on 30 Gated DeltaNet layers, self_attn.o_proj on 10 standard attention layers
  • MoE writes: the fused experts.down_proj Parameter (shape [256, 2048, 512] — 256 expert weights orthogonalized in a single batched operation) + shared_expert.down_proj (standard Linear)
  • Embedding: embed_tokens.weight orthogonalized in the hidden dim
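
The linear algebra behind the list above can be sketched on toy tensors: for a unit direction r, a matrix that writes into the hidden dimension becomes W − r(rᵀW), and the same formula applies across a stacked expert tensor in one batched call. The sketch assumes the hidden dimension is the second-to-last axis (the 2048 in [256, 2048, 512]), which the card implies but does not state. Function names are hypothetical and shapes are shrunk for readability.

```python
import numpy as np


def orthogonalize_out(weight, direction):
    """Remove `direction` from the hidden (second-to-last) axis of `weight`.

    Accepts [hidden, in] matrices and [experts, hidden, in] stacks alike.
    """
    r = direction / np.linalg.norm(direction)
    coeff = np.einsum("h,...hi->...i", r, weight)          # r^T W
    return weight - np.einsum("h,...i->...hi", r, coeff)   # W - r (r^T W)


def orthogonalize_rows(embedding, direction):
    """Remove `direction` from each row of a [vocab, hidden] matrix."""
    r = direction / np.linalg.norm(direction)
    return embedding - np.outer(embedding @ r, r)


rng = np.random.default_rng(0)
hidden, inter, experts, vocab = 8, 4, 3, 10
r = rng.normal(size=hidden)

dense = orthogonalize_out(rng.normal(size=(hidden, inter)), r)
fused = orthogonalize_out(rng.normal(size=(experts, hidden, inter)), r)
embed = orthogonalize_rows(rng.normal(size=(vocab, hidden)), r)

r_hat = r / np.linalg.norm(r)
# After projection, no output of any matrix has a component along r_hat.
print(np.abs(np.einsum("h,ehi->ei", r_hat, fused)).max())
```

Because the projection is baked into the stored weights, nothing extra runs at inference time, which is consistent with the "no runtime hooks" claim earlier in the card.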

This is the first multi-axis abliteration we've shipped: MoE + hybrid attention + fused-experts + VLM-class attribute path. Each axis required a small framework upgrade; details are in our BUILD_NOTES.md (§19-§20).


How to download

hf download cyberneurova/CyberNeurova-Qwen3.6-35B-A3B-abliterated \
  --local-dir ./Qwen3.6-35B-A3B-abl

Total download: ~66 GB of safetensors, plus tokenizer and processor files.

How to run

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "cyberneurova/CyberNeurova-Qwen3.6-35B-A3B-abliterated"

# Load the bf16 weights onto a single GPU (see Hardware requirements).
model = AutoModelForCausalLM.from_pretrained(
    REPO_ID,
    dtype="bfloat16", device_map="cuda:0",
    trust_remote_code=True,
)
tok = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)

# Render the chat template to text, then tokenize it.
messages = [{"role": "user", "content": "Your prompt here."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to("cuda:0")

# Greedy decoding; print only the newly generated tokens, not the prompt.
out = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

The model emits <think>...</think> chain-of-thought before its answer. For most use cases you'll want max_new_tokens >= 512 to capture both the reasoning and the response.


Hardware requirements

| Variant | File size | VRAM (min) | VRAM (recommended) |
|---|---|---|---|
| bf16 (this release) | 66 GB | 80 GB | 96 GB+ |
| Q8_0 GGUF (planned) | ~36 GB | 40 GB | 48 GB+ |
| Q4_K_M GGUF (planned) | ~20 GB | 24 GB | 32 GB+ |

Tested on RTX PRO 6000 Blackwell (102 GB VRAM). Should work on any 80 GB+ data-center GPU. GGUF variants will broaden the hardware floor once converted.
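
As a back-of-envelope check on the sizes above, weight storage is roughly parameter count times bits per weight. The sketch uses the nominal 35B figure from the model name (the exact count may differ) and 8.5 bits per weight for Q8_0, which follows from that format storing 32 int8 values plus one fp16 scale per block. It ignores KV cache, activations, and file metadata, so it will not match the table exactly.

```python
def weight_size_gb(num_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NOMINAL_PARAMS = 35e9  # assumption: nominal count taken from the model name

estimates = {
    "bf16": weight_size_gb(NOMINAL_PARAMS, 16),
    "Q8_0": weight_size_gb(NOMINAL_PARAMS, 8.5),
}

for name, gb in estimates.items():
    # Also show GiB, since tools differ on which unit they report.
    print(f"{name}: {gb:.1f} GB ({gb * 1e9 / 2**30:.1f} GiB)")
```

The gap between these weight-only estimates and the minimum VRAM column is the headroom needed for the KV cache and runtime buffers.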


Intended use

Defensive security research, red-team evaluation baselines, study of how refusal directions behave in MoE + reasoning-mode + hybrid-attention architectures. Useful as a counterfactual against the original Qwen/Qwen3.6-35B-A3B for measuring the behavioral impact of safety RLHF on a thinking-mode model.

Not intended for automating harmful action. The abliteration removes canonical refusal behavior but does not remove the model's underlying knowledge — the model still recognises harmful instructions as harmful, it simply no longer refuses them by pattern.


Limitations

  • Tool-calling benchmark measurement is currently a grader artifact (0/0 both sides) — the grader expects raw JSON, the model wraps its tool calls inside <think> reasoning. A CoT-aware grader fix is scheduled for the next framework round; we expect real tool-calling capability to land near baseline once the grader is fixed.
  • Coherence and perplexity preserved on wikitext-2, but extended thinking-mode behavior on very long (>4k tokens) reasoning chains is not exhaustively validated.
  • The hybrid-attention layout means the refusal direction was captured through both linear_attn and self_attn blocks indistinguishably — we don't know yet whether the linear_attn layers contribute differently to the refusal feature than the self_attn layers do. That's a follow-up research question.
  • GGUF quants (Q4_K_M / Q8_0) are not in this release — upstream llama.cpp does not yet support the qwen3_5_moe architecture. bf16 weights only for v1; GGUF will follow once upstream tooling catches up.

License

Apache 2.0 (inherits from upstream Qwen3.6-35B-A3B).

Acknowledgements

  • Alibaba Qwen for Qwen3.6-35B-A3B
  • Arditi et al. 2024 for the refusal-direction methodology
