Step-3.7-Flash-AEON-Ultimate-Abliterated-BF16

Refusal-ablated ("abliterated") build of stepfun-ai/Step-3.7-Flash — a 198B-parameter (~11B active) sparse-MoE, vision-language, thinking model. The model's refusal behaviour has been removed by orthogonal projection of a measured refusal subspace out of every residual-writing weight, while leaving general capability, chain-of-thought reasoning, tool-use, and the vision path intact.

Intended use: authorized safety research, red-teaming, and uncensored local deployment by the model owner. The base model's Apache-2.0 license carries over. Removing refusals does not remove responsibility — you own what you generate.


User Responsibility & Arbitration Clause

By accessing, downloading, using, running inference on, fine-tuning, merging, quantizing, distributing, integrating, or otherwise interacting with this model, you acknowledge and agree to the following:

  1. Sole Responsibility. You, the user, are solely and exclusively responsible for (a) every prompt you or your downstream system issue to this model, (b) every response this model produces in reply, (c) every downstream action taken by you, your systems, your agents, or your users in reliance on those responses, and (d) any harm — direct, indirect, consequential, foreseeable, or otherwise — that results from any of the above.

  2. No Warranty. This model is provided strictly "AS IS", without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, non-infringement, safety, alignment, factual accuracy, or legal compliance in any jurisdiction. No contributor, author, publisher, or hosting platform assumes liability of any kind for outputs or downstream use.

  3. Legal Compliance. You are responsible for ensuring that your use of this model complies with all applicable laws, regulations, terms of service, industry codes of conduct, professional ethical standards, and organizational policies in every jurisdiction in which you operate or in which your outputs may be received. The unaligned nature of this model does not grant you any legal authorization you did not already have.

  4. Operational Safety Layer. An uncensored model is not a toy. You are expected to implement appropriate downstream safety layers proportionate to your deployment context, including but not limited to: input validation, output filtering, content moderation, audit logging, rate limiting, access controls, and human-in-the-loop review for high-risk workflows. A production deployment of this model without such layers is unsafe by construction and is not a supported use case.

  5. Heightened Duty of Care. The absence of internal refusal behavior means the duty of care that would ordinarily rest partly with the model rests entirely with you. You are expected to exercise greater, not lesser, caution, forethought, and ethical discipline when operating this model than you would when operating an aligned base model. If you are uncertain whether your contemplated use is ethical, legal, or wise, the correct action is to not make the request.

  6. No Endorsement of Outputs. The authors, contributors, and publishers of this model do not endorse, adopt, or take responsibility for any specific output this model produces. Outputs are a stochastic function of the prompt, the weights, and the sampler state — not a statement of position by any human.

  7. Arbitration. Any dispute, claim, or controversy arising out of or relating to the use of this model, its outputs, or this clause shall be resolved through binding individual arbitration under the rules of a mutually agreed arbitration body (or, absent agreement, the American Arbitration Association's Consumer Arbitration Rules), waiving any right to a jury trial, class action, representative action, or consolidated proceeding. Venue shall be the jurisdiction of the disputing party bringing the claim. Costs and attorneys' fees shall be allocated per the applicable arbitration rules. This clause does not expand, and where legally prohibited does not establish, any liability in the other direction; it limits how the user may proceed when alleging harm tied to their own use of this model.

  8. Indemnification. You agree to indemnify, defend, and hold harmless the authors, contributors, and publishers of this model from and against any claims, damages, losses, liabilities, costs, and expenses (including reasonable attorneys' fees) arising from or related to your use of the model or your breach of this clause.

  9. Severability. If any provision of this clause is held unenforceable in a given jurisdiction, the remaining provisions remain in full force in that jurisdiction, and the unenforceable provision is replaced by the closest enforceable equivalent consistent with the original intent.

  10. Acceptance. Your use of this model constitutes your acceptance of this clause in full. If you do not accept, do not use the model.

This model is a tool with no opinions of its own. You supply the opinions. You supply the judgement. You supply the ethics. The outputs carry your fingerprints, not the model's.


TL;DR

  • Method: multi-directional refusal-subspace orthogonalization (FailSpy/Arditi lineage) extended to MoE via Expert-Granular Abliteration (EGA) — the refusal subspace is projected out of every one of the 288 routed experts per layer, not just the dense path.
  • Why EGA: on large-expert MoEs, residual/global single-direction abliteration is known to fail (Kimi-K2.5: zero effect; Gemma-4-26B: 29→3/100 refusals only with per-expert projection). We confirmed the risk directly — this model's sigmoid router discriminates harmful vs. harmless at 0.975 mean accuracy (see Appendix B), so refusal is partly routing-mediated.
  • Subspace: 12-dimensional, built from difference-of-means at template-aligned positions (right after the <think> opening, where this thinking model commits to refuse/comply). Refusal is razor-sharp here (Cohen's d ≈ 10) and concentrated in layers ~23–37.
  • Scope of edit: 142 residual-writing tensors projected; refusal subspace driven to ~1.6e-3 residual energy (bf16 noise floor). embed_tokens and lm_head deliberately untouched (tied weights; refusal is a late-layer phenomenon).
  • Quantized variants: NVFP4 (dual-DGX-Spark) and AWQ-INT4 (single-DGX-Spark) builds are released separately.

Methodology

1. Activation harvesting

  • Contrast sets: 256 harmful instructions (mlabonne/harmful_behaviors) vs. 256 harmless instructions (mlabonne/harmless_alpaca); 128 + 128 held out for evaluation. Topic-unmatched contrast (matched contrast cancels the direction).
  • Capture: residual stream at all 45 decoder layers × the last 8 token positions, for both sets, via forward hooks on each Step3p7DecoderLayer.
  • Batch size 1, on purpose. The upstream modeling_step3p7.py apply_rotary_pos_emb omits the head-axis unsqueeze, so cos/sin only broadcast at batch size 1. A batched forward crashes as shipped and, once patched, still diverges ~5% under left-padding due to the sliding-window + MoE-routing interaction. Harvesting at bs=1 keeps the directions pristine.
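
The broadcasting failure described in the last bullet can be reproduced in isolation. The sketch below is not the upstream code: it assumes query/key tensors laid out as [batch, heads, seq, dim] and cos/sin as [batch, seq, dim], and uses a generic rotate-half RoPE in NumPy purely to show why a missing head-axis unsqueeze only works at batch size 1.

```python
import numpy as np

def apply_rotary(x, cos, sin, unsqueeze_heads):
    """Rotate-half RoPE on x of shape [batch, heads, seq, dim].

    cos/sin are assumed to be [batch, seq, dim]; with unsqueeze_heads=True a
    singleton head axis is inserted so they broadcast against x for any batch.
    """
    if unsqueeze_heads:
        cos, sin = cos[:, None, :, :], sin[:, None, :, :]
    half = x.shape[-1] // 2
    rotated = np.concatenate([-x[..., half:], x[..., :half]], axis=-1)
    return x * cos + rotated * sin

def try_without_unsqueeze(batch, heads=4, seq=5, dim=8):
    """Return the output shape, or None if broadcasting fails."""
    x = np.ones((batch, heads, seq, dim))
    cos = np.ones((batch, seq, dim))
    sin = np.zeros((batch, seq, dim))
    try:
        return apply_rotary(x, cos, sin, unsqueeze_heads=False).shape
    except ValueError:
        # cos's batch axis lines up with x's head axis and the sizes clash
        return None

print(try_without_unsqueeze(batch=1))  # broadcasts: leading 1 stretches over heads
print(try_without_unsqueeze(batch=2))  # None: 2 (batch) cannot align with 4 (heads)
```

Note that in this sketch a batch size equal to the head count would broadcast without an error while pairing batch entries with the wrong axis, which is one more reason to treat bs=1 as the safe setting.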

2. Refusal-subspace construction

  • Difference-of-means per (layer, position); each (layer,pos) scored by separation (Cohen's d) on held-out projections.
  • Top-scoring directions stacked → SVD → orthonormal 12-D basis R. The refusal signal is low-rank (see Appendix A): ~90% of the difference-vector energy is captured by the first 5 components.
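
The two bullets above can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the release pipeline: the array layout [n_prompts, n_sites, hidden] (one site per (layer, position)), the function names, and the choice to stack unit-normalized difference vectors before the SVD are all hypothetical.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d between two 1-D samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

def build_subspace(train_a, train_b, held_a, held_b, n_candidates, k):
    """Return an orthonormal basis [hidden, k] and the per-site separation scores.

    train_*/held_*: arrays of shape [n_prompts, n_sites, hidden].
    """
    # Difference of means per site, normalized to unit length (assumption)
    diffs = train_a.mean(axis=0) - train_b.mean(axis=0)           # [n_sites, hidden]
    units = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    # Score each site by Cohen's d of held-out projections onto its own direction
    scores = np.array([
        cohens_d(held_a[:, s] @ units[s], held_b[:, s] @ units[s])
        for s in range(units.shape[0])
    ])
    top = np.argsort(scores)[::-1][:n_candidates]
    # Right singular vectors of the stacked candidates span their row space
    _, _, vt = np.linalg.svd(units[top], full_matrices=False)
    return vt[:k].T, scores

# Synthetic check: class "a" is shifted along one planted direction u
rng = np.random.default_rng(0)
u = np.zeros(16)
u[0] = 1.0
make = lambda shift: rng.normal(size=(40, 6, 16)) + shift * u
R, scores = build_subspace(make(5.0), make(0.0), make(5.0), make(0.0), n_candidates=4, k=3)
print(R.shape, np.round(scores, 2))
```

Scoring on held-out prompts rather than the training prompts keeps the Cohen's d values from being inflated by directions that merely fit noise in the training means.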

3. Expert-Granular Abliteration (the weight edit)

For every matrix W that writes into the 4096-dim residual stream, project the subspace out of its output axis:

W ← W − R (Rᵀ W)

Applied to (142 tensors total):

| Component | Layers | Count | Why |
|---|---|---|---|
| self_attn.o_proj | 0–47 (incl. 3 MTP) | 48 | attention output → residual |
| mlp.down_proj (dense) | 0–2 + MTP 45–47 | 6 | dense FFN output → residual |
| moe.down_proj (fused [288,4096,1280]) | 3–44 | 42 | all 288 experts/layer (EGA) |
| share_expert.down_proj | 3–44 | 42 | always-on shared expert |
| MTP eh_proj | 45–47 | 3 | speculative-decode residual |

Deliberately spared: embed_tokens (tied to lm_head; editing it would corrupt the unembedding), lm_head, the sigmoid router / router_bias, and the head-wise attention gate g_proj.
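
As a minimal numerical sketch of the update W ← W − R (Rᵀ W), the NumPy code below applies the projection to a dense matrix and to a fused expert tensor laid out as [experts, d_model, d_in], matching the [288,4096,1280] layout listed above but at toy sizes. The function names and dimensions are illustrative assumptions, and the check runs in float64, so the leftover component is far below the bf16 figure reported in this card.

```python
import numpy as np

def project_out(W, R):
    """Remove the span of R from the output axis of W.

    W: [..., d_model, d_in]; leading axes (e.g. experts) are batched by matmul.
    R: [d_model, k] with orthonormal columns.
    """
    return W - R @ (R.T @ W)

def residual_fraction(W_new, W_old, R):
    """Norm of the component along R after the edit, relative to before."""
    return np.linalg.norm(R.T @ W_new) / np.linalg.norm(R.T @ W_old)

rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.normal(size=(64, 4)))   # toy orthonormal basis, k = 4
dense = rng.normal(size=(64, 16))               # [d_model, d_in]
fused = rng.normal(size=(6, 64, 16))            # [experts, d_model, d_in]

dense_new = project_out(dense, R)
fused_new = project_out(fused, R)
print(residual_fraction(dense_new, dense, R))
print(residual_fraction(fused_new, fused, R))
```

Because I − R Rᵀ is an orthogonal projector, applying the edit twice gives the same result as applying it once, and any output component orthogonal to R is left unchanged.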

4. Verification

  • Static: residual energy along R after projection = ~1.6e-3 of the original on every modified family (attention, dense, fused experts) — i.e. the subspace is removed to the bf16 rounding floor.
  • Behavioural: see Evaluation below.

Research provenance

The recipe was vetted by a multi-agent research workflow (independent angles → adversarial verification → synthesis). Key conclusions that shaped it: EGA is mandatory for this expert count; attention is less FP4-sensitive than MLP (informs the quant variants); a small subspace dimension is the main capability-preservation lever; and "lossless" is a property to measure, never assume. The verification pass discarded several fabricated/inverted claims before they reached the recipe.


Evaluation

Refusal removal — verified (prefill subspace metric)

The cleanest test of abliteration: how much of the residual stream still lies in the refusal subspace R at the refusal-mediating layers (24/31/37), harmful vs harmless. Measured on the held-out eval split (48 harmful / 48 harmless):

| Quantity | Refusal-subspace energy ‖Rᵀh‖/‖h‖ |
|---|---|
| harmful prompts | 0.0021 (mean) |
| harmless prompts | 0.0018 (mean) |
| Cohen's d (harmful vs harmless) | 0.35 (was ≈ 10 pre-abliteration) |

The harmful-vs-harmless separation collapses from d ≈ 10 to d ≈ 0.35, a small effect size: harmful and harmless prompts are now close to indistinguishable in the refusal subspace, and both have only ~0.2% of their residual norm there (‖Rᵀh‖/‖h‖ ≈ 0.002). The refusal feature has been removed to the noise floor.
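
For reference, the metric ‖Rᵀh‖/‖h‖ can be computed as in the sketch below. It assumes residual vectors stacked as rows of h and an orthonormal basis R of shape [d_model, k]; the function name and the toy 4-dimensional example are hypothetical.

```python
import numpy as np

def subspace_norm_ratio(h, R):
    """Fraction of each row's norm that lies in span(R).

    h: [n, d_model] residual-stream vectors; R: [d_model, k], orthonormal columns.
    """
    return np.linalg.norm(h @ R, axis=1) / np.linalg.norm(h, axis=1)

# Toy example: R spans the first two coordinate axes of a 4-D space
R = np.eye(4)[:, :2]
h = np.array([
    [3.0, 4.0, 0.0, 0.0],   # entirely inside the subspace
    [0.0, 0.0, 1.0, 0.0],   # entirely outside
    [3.0, 0.0, 4.0, 0.0],   # norm 5, in-subspace part has norm 3
])
ratios = subspace_norm_ratio(h, R)
print(ratios)        # [1.0, 0.0, 0.6]
print(ratios ** 2)   # squared ratio = share of squared norm in the subspace
```

The ratio is a norm fraction; squaring it gives the share of squared norm, so a ratio of about 0.002 corresponds to a squared-norm share of a few parts per million.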

Behavioral benchmark — full suite via vLLM (in progress; appended as numbers land)

Behavioral refusal rate, over-refusal (benign-edgy prompts), math/coding/general capability, phantom-refusal (refuse-inside-<think>-then-comply), and throughput are being measured on vLLM (TP = 4). Target: refusal ≤ 1% with no measurable capability loss vs. the base model. This section updates when the run completes.

Known limitations / open items

  • Refusal can be partially restored by aggressive quantization (4-bit rounding in the late MoE layers where refusal is mediated). The refusal probe is re-run on every quantized variant.
  • KV-cache quantization on a thinking model risks drift over long <think> traces; FP8 KV is the validated default.
  • Vision/tool-use were spot-checked, not exhaustively benchmarked (MMMU/BFCL recommended before production).

Deployment notes

  • Thinking model: the chat template appends <|im_start|>assistant\n<think>\n; expect a reasoning block before the answer. The template lives in a separate chat_template.jinja, which some transformers builds do not auto-load; if apply_chat_template errors, set tokenizer.chat_template from the file.
  • Architecture: 45 layers (0–2 dense, 3–44 MoE), 288 experts top-8 sigmoid router + shared expert, GQA (8 KV groups, head_dim 128), sliding-window 512 on ~34/45 layers, 256k context, 3 MTP heads.
  • Single DGX Spark (128 GB unified): the BF16 weights do not fit; use a quantized variant. NVFP4 experts-only (120 GB) targets 2× Spark (TP=2); the AWQ-INT4 build (103 GB, vision preserved at FP8) is the single-Spark multi-agent option. KV is sliding-window-efficient (~6 GB FP8/seq at 256k) but scales with concurrency.
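
The memory figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below makes several assumptions that the card does not state exactly: "8 KV groups" is read as 8 KV heads, "~34/45" is taken as exactly 34 sliding-window layers and 11 full-attention layers, 256k is taken as 262,144 tokens, FP8 is 1 byte per element, and quantization scales and allocator overhead are ignored.

```python
def kv_cache_bytes(context, full_layers, window_layers, window,
                   kv_heads, head_dim, bytes_per_elem):
    """Approximate KV-cache size for one sequence."""
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_elem  # K and V
    full = full_layers * context * per_token_per_layer
    # Sliding-window layers only keep the most recent `window` tokens
    windowed = window_layers * min(window, context) * per_token_per_layer
    return full + windowed

kv = kv_cache_bytes(context=256 * 1024, full_layers=11, window_layers=34,
                    window=512, kv_heads=8, head_dim=128, bytes_per_elem=1)
bf16_weight_bytes = 198e9 * 2   # 198B parameters at 2 bytes each

print(f"KV cache per sequence: {kv / 1e9:.2f} GB")          # about 5.94 GB
print(f"BF16 weights: {bf16_weight_bytes / 1e9:.0f} GB")    # 396 GB, above 128 GB
```

Under these assumptions nearly all of the cache comes from the full-attention layers; the sliding-window layers contribute only tens of megabytes per sequence, which is why the total grows with concurrency rather than with the number of windowed layers.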

Appendix A — refusal-subspace spectrum

12-D basis built from 20 candidate (layer, position) difference-of-means vectors → SVD. Top-12 singular values:

3.05, 2.16, 1.31, 1.21, 0.96, 0.63, 0.60, 0.46, 0.43, 0.36, 0.33, 0.31

Cumulative energy: 46.6% (k=1) · 69.9% (k=2) · 78.5% (k=3) · 90.4% (k=5) · 95.3% (k=8) · 97.9% (k=12). The refusal signal is effectively ~5–8 dimensional; k=12 was chosen with margin (and as a hedge against the routing-mediated component in Appendix B). Basis orthonormality error: 6e-7.
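
The cumulative-energy figures can be reproduced approximately from the listed singular values. The sketch assumes that energy means squared singular values and that the 20 candidate vectors were unit-normalized, so the total energy over all 20 components is 20; neither detail is stated explicitly here, and the inputs are rounded to two decimals, so agreement is only to within about 0.1 percentage points.

```python
import numpy as np

# Top-12 singular values as listed above (rounded to two decimals)
singular_values = np.array([3.05, 2.16, 1.31, 1.21, 0.96, 0.63,
                            0.60, 0.46, 0.43, 0.36, 0.33, 0.31])

def cumulative_energy(s, total_energy):
    """Cumulative share of squared singular values, relative to total_energy."""
    return np.cumsum(s ** 2) / total_energy

# Assumption: 20 unit-norm rows, so the squared singular values sum to 20
cum = cumulative_energy(singular_values, total_energy=20.0)
for k in (1, 2, 3, 5, 8, 12):
    print(f"k={k:2d}: {100 * cum[k - 1]:.1f}%")
```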

Appendix B — router-discrimination probe (per-layer)

Held-out difference-of-means linear probe on the 288-dim sigmoid router logits (harmful vs. harmless), per MoE layer (3–44):

  • Mean probe accuracy 0.975 · max 0.992 (layer 31). Cohen's d on the router-score projection rises from ~3.4 (layer 3) to ~9.5 (layers 30–37), tracking the residual-refusal band.
  • Representative rows: L3 0.945 (d3.4) · L10 0.977 (d3.9) · L20 0.977 (d7.6) · L31 0.992 (d9.6) · L37 0.977 (d8.6) · L44 0.977 (d6.8).

Interpretation: refusal is substantially routing-encoded (the "Kimi-K2.5 mechanism") — the model partly decides to refuse by which experts it activates. EGA addresses this by projecting R out of every expert's output so no routed mixture can re-write the refusal direction; a router_bias suppression lever (the top refusal-correlated experts per layer were identified by this probe) is held as a fallback if any residual refusal survives.
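
A difference-of-means probe of the kind described in this appendix can be sketched as follows. The data here is synthetic (Gaussian "router logits" with a shift on a handful of experts), and the function names, the midpoint threshold, and the split sizes are illustrative assumptions rather than the exact probe used for the numbers above.

```python
import numpy as np

def diff_of_means_probe(train_a, train_b):
    """Fit a one-direction probe: w = mean(a) - mean(b), threshold at the midpoint."""
    mu_a, mu_b = train_a.mean(axis=0), train_b.mean(axis=0)
    w = mu_a - mu_b
    threshold = 0.5 * (mu_a + mu_b) @ w
    return w, threshold

def probe_accuracy(w, threshold, held_a, held_b):
    """Held-out accuracy: class a should project above the threshold, b below."""
    correct = (held_a @ w > threshold).sum() + (held_b @ w <= threshold).sum()
    return correct / (len(held_a) + len(held_b))

# Synthetic stand-in for 288-dim router logits at one MoE layer
rng = np.random.default_rng(0)
shift = np.zeros(288)
shift[:20] = 1.0   # pretend 20 experts respond differently to class a

def sample(n, shifted):
    return rng.normal(size=(n, 288)) + (shift if shifted else 0.0)

w, threshold = diff_of_means_probe(sample(256, True), sample(256, False))
accuracy = probe_accuracy(w, threshold, sample(128, True), sample(128, False))
print(f"held-out accuracy: {accuracy:.3f}")
```

The largest-magnitude entries of w indicate which experts separate the two classes most strongly, which is one way a per-layer list of refusal-correlated experts could be read off such a probe.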

Appendix C — top separating (layer, position) by Cohen's d

Difference-of-means separation on held-out residual-stream projections (position counted from the formatted-prompt end; pos-1/pos-2 ≈ the <think> onset):

L24 pos-2  d=9.97      L32 pos-2  d=9.80      L31 pos-2  d=9.66
L26 pos-2  d=9.92      L30 pos-2  d=9.76      L37 pos-1  d=9.66
L23 pos-2  d=9.90      L33 pos-2  d=9.71      L32 pos-1  d=9.64
L25 pos-2  d=9.86      L35 pos-1  d=9.70      L29 pos-1  d=9.63
L29 pos-2  d=9.85      L36 pos-1  d=9.67      L34 pos-2  d=9.61

→ Refusal is mediated in mid-to-late layers (~23–37), strongest at the last 1–2 prompt tokens (the point where this thinking model commits to refuse/comply, just before generating its reasoning). d ≈ 10 is an enormous, clean separation — the refusal feature is extremely well-defined in this model.


Abliterated on 4× NVIDIA B200. Base model © StepFun AI, Apache-2.0.


☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

₿ Bitcoin (BTC)
bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4
Ξ Ethereum (ETH)
0x1512667F6D61454ad531d2E45C0a5d1fd82D0500
◎ Solana (SOL)
DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t
ⓜ Monero (XMR)
836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.
