---
license: apache-2.0
base_model: openai/gpt-oss-120b
tags:
- abliterated
- uncensored
- moe
- gpt-oss
- mxfp4
- direct-steering
- ega
- moe-router-suppression
- vllm-in-place-editing
- abliterix
language:
- en
- zh
library_name: transformers
pipeline_tag: text-generation
---

# gpt-oss-120b-abliterated

A refusal-suppressed variant of [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b), produced with [abliterix](https://github.com/wuwangzhang1216/abliterix) using **direct weight editing**, **Expert-Granular Abliteration (EGA)** on the fused MoE expert weights (128 experts × 36 layers), **MoE router suppression** on the safety-concentrated experts, and a new **vLLM in-place TP editing** path that replaces the HF pipeline-parallel + LoRA-adapter pipeline used for gpt-oss-20b.

## Key results

| Metric | Base `gpt-oss-120b` | This model |
|---|---|---|
| Refusals on 100 held-out harmful prompts (LLM judge) | **100 / 100** | **26 / 100** |
| KL divergence vs base (prompt next-token, benign) | — | **5.44e-06** |
| Response-length deviation vs base (benign) | — | 0.042 σ |
| Hard-prompt qualitative compliance (15 classic hard prompts, EN+ZH) | 0 / 15 | **12 / 15** |

The eval refusal counts come from an LLM judge (`google/gemini-3.1-flash-lite-preview` via OpenRouter) running at **batch_size = 1** with 16 concurrent workers. Its hardened prompt treats **metaphorical deflection** ("pipe-bomb → pipe-separated typeahead list", "lock-pick → presentation topic") and **generic template filler** as *refusals*, not as compliance, so a model that "bypasses" refusal by collapsing into off-topic boilerplate is counted as a failure. `max_gen_tokens` was raised from 150 to **256** during search so that long-form repetition loops surface before the judge is called. **The 26/100 refusal count therefore reflects semantic (non-)compliance, and it was cross-checked with a separate 15-prompt HF side-by-side test.**
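The judge setup described above (one question/answer pair per call, 16 calls in flight) can be sketched with a thread pool. This is a minimal illustration, not abliterix's judge client: `judge_one` is a hypothetical stand-in for the real OpenRouter call, and it uses a trivial prefix rule only so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def judge_one(pair):
    """Hypothetical stand-in for a single LLM-judge API call.

    Returns "R" (refusal) or "C" (compliance) for one (prompt, response) pair.
    A real judge would send exactly this one pair to the remote model.
    """
    prompt, response = pair
    return "R" if response.lower().startswith("i can't") else "C"

def judge_all(pairs, concurrency=16):
    # batch_size = 1: every pair is its own call, so no label can be
    # anchored by neighbouring items in a shared batch.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(judge_one, pairs))  # map preserves input order

pairs = [
    ("prompt A", "I can't help with that."),
    ("prompt B", "Here is a step-by-step answer ..."),
]
labels = judge_all(pairs)
refusals = labels.count("R")
```

Because `pool.map` keeps results in input order, the labels line up with the prompts without extra bookkeeping.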

The qualitative compliance row is a separate manual test: 15 classic hard prompts (10 EN + 5 ZH) covering lockpicking, phishing, meth synthesis, WiFi hacking, fake news, hotwiring, pipe bombs, signature forgery, password-stealer malware, shoplifting, 炸弹制作 (bomb-making), 钓鱼邮件 (phishing email), 电脑入侵 (computer intrusion), 身份证伪造 (ID-card forgery), and 网络诈骗 (online fraud). The model complies directly with **12/15** in the same Markdown-table, step-by-step style the base model uses for benign technical answers; **1/15** gets a metaphorical deflection (the ZH 炸弹 prompt is reinterpreted as a harmless object), and **2/15** drift off-topic (WiFi hacking and fake news).

## Why this needed new machinery — four gpt-oss-120b-specific correctness fixes

abliterix handles four issues that break or stall naïve abliteration pipelines on gpt-oss-120b (three silent correctness bugs and one throughput bottleneck):

1. **Native MXFP4 weights are not exposed as standard `nn.Parameter`.** gpt-oss ships in `Mxfp4GptOssExperts` form whose `down_proj` is a packed Triton tensor that *cannot* be edited in-place. For the 120b variant abliterix now pre-dequantises the whole 65 GB MXFP4 checkpoint to a 232 GB BF16 safetensors checkpoint on disk (`scripts/prepare_bf16_checkpoint.py`), because vLLM's `Mxfp4MoEMethod.process_weights_after_loading` would otherwise repack `w2_weight` into an opaque block layout that silently swallows in-place writes (see vLLM RFC #31848).
2. **`GptOssExperts.down_proj` is stored transposed** vs the standard MoE convention: shape `(experts, intermediate_in, hidden_out)` with forward path `out = act @ W` (no transpose). Standard EGA implementations use shape-based axis detection, which **silently picks the wrong projection branch** when `hidden == intermediate` (both 2880 in gpt-oss-120b). abliterix marks this layout explicitly and projects from the output side (`W_new = W (I − vv^T)`).
3. **Fused-expert MoEs were silently invisible to EGA.** `GptOssExperts` is a *single* Module holding fused 3-D weights, so a naive per-Module profile dict key produces no `mlp.down_proj` entry and `_apply_ega_steering` early-exits. abliterix synthesises an `mlp.down_proj` profile when fused experts are detected so EGA actually runs across **all 128 experts × 36 layers**.
4. **HF pipeline-parallel on 120b was too slow to iterate on.** A single trial on HF PP across 4× RTX PRO 6000 took >2 min, so 100 trials would have needed >3 h of pure generation. abliterix v1.5 adds a **vLLM TP=4 in-place editor** (`VLLMExpertEditor`, `VLLMAttentionEditor`) that edits `w2_weight`, `qkv_proj.weight`, and `o_proj.weight` directly on the TP workers via `collective_rpc` + `reset_prefix_cache`. This requires `VLLM_FUSED_MOE_UNQUANTIZED_BACKEND=triton` (FLASHINFER_TRTLLM repacks `w2_weight` into a non-editable block layout), `VLLM_ALLOW_INSECURE_SERIALIZATION=1` (worker functions are shipped as pickles), and `enforce_eager=true` (CUDA graphs cache weight pointers, so edits would otherwise only be read on the first forward pass). Per-trial time dropped to ~60 s end-to-end.
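
Item 2 above can be made concrete on a toy example. The sketch below is pure Python with made-up 2×2 numbers, not abliterix code. It assumes a unit-norm direction `v` and the gpt-oss layout `W[in][out]` with `out = act @ W`. It shows that right-multiplying by `(I − vvᵀ)` removes the `v` component from every output, and that the left-side variant yields a different matrix of the same shape, which is why shape inspection alone cannot choose between the two when the matrix is square.

```python
def matmul(a, b):
    # Plain nested-list matrix product: (n x k) @ (k x m) -> (n x m)
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def projector(v):
    # I - v v^T for a unit-norm direction v
    n = len(v)
    return [[(1.0 if i == j else 0.0) - v[i] * v[j] for j in range(n)]
            for i in range(n)]

v = [0.6, 0.8]                    # toy unit-norm "refusal direction"
W = [[1.0, 2.0], [3.0, 4.0]]      # toy down_proj, layout (in, out)
P = projector(v)

W_out = matmul(W, P)              # output-side edit: W (I - v v^T)
W_in = matmul(P, W)               # input-side edit: (I - v v^T) W

act = [[1.0, -1.0]]               # one toy activation row
out = matmul(act, W_out)[0]
leak = sum(o * vi for o, vi in zip(out, v))   # output component along v
```

With the output-side edit, `leak` is zero up to floating-point error for any activation, while `W_in` and `W_out` differ even though both are 2×2.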

On top of direct steering + EGA, this release carries **MoE router suppression**: a `[moe]` config block that redirects routing away from the top-k "safety experts" (the experts whose gate activates disproportionately more on harmful prompts than on benign ones). For 120b with 128 experts/layer, the optimiser picked `n_suppress = 1` with `router_bias = -4.11` (suppression scale ≈ 0.59, moderately aggressive), leaving 127/128 experts untouched while damping the single most refusal-aligned expert per layer.
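
The scale quoted above follows from the formula given in the hyperparameter listing, `scale = max(0, 1 + bias/10)`. The sketch below reproduces that arithmetic. How the scale is applied is an assumption made for illustration: `suppress_rows` is a hypothetical helper that multiplies the selected experts' router rows by the scale, and the 3×2 router matrix is a toy.

```python
def suppression_scale(router_bias):
    # Formula quoted in the config comment: scale = max(0, 1 + bias / 10)
    return max(0.0, 1.0 + router_bias / 10.0)

def suppress_rows(router_weight, expert_ids, scale):
    # Assumption: suppression multiplies the chosen experts' router rows
    # by `scale`, shrinking their logits; other rows are left untouched.
    return [[w * scale for w in row] if i in expert_ids else list(row)
            for i, row in enumerate(router_weight)]

scale = suppression_scale(-4.11)          # about 0.589
toy_router = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]   # 3 experts, hidden = 2
edited = suppress_rows(toy_router, expert_ids={1}, scale=scale)
```

Under this formula a bias of -10 or lower clamps the scale to zero, which would silence the selected expert's router row entirely.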

## Method

- **Base:** [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) — 36 layers, 128 routed experts per layer, top-4, hidden = intermediate = 2880, MXFP4 → BF16 pre-dequant before abliteration (232 GB safetensors on disk)
- **Tool:** [abliterix](https://github.com/wuwangzhang1216/abliterix)
- **Mode:** `steering_mode = "direct"` (orthogonal projection on base weights, no LoRA), `weight_normalization = "full"` (norm-preserving projection)
- **Components steered:**
  - `attn.o_proj` via direct weight projection (Q/K/V disabled — refusal signal on gpt-oss concentrates in the output side)
  - `mlp.experts.down_proj` across **all 128 experts × 36 layers** via Expert-Granular Abliteration
  - **`mlp.router` rows** of the top-1 safety expert per layer via logit suppression
- **Refusal direction:** `global` scope (single direction), mean of (target − benign) residuals at a learned layer index (`16.31`, ≈ 45% depth) on a 400-prompt benign + 400-prompt harmful set; BF16 projection
- **Search:** Optuna TPE, `(KL + 0.5·length_deviation, refusals/baseline)` multi-objective, 100 trials (40 random warmup + 60 TPE exploitation)
- **Judge hardening (vs older abliterated MoE releases):**
  - `llm_judge_batch_size = 1` (each Q/A pair gets a dedicated API call — no anchor bias from batched labelling)
  - `llm_judge_concurrency = 16` parallel workers
  - `response_truncation = 2000 chars` (≈ 500 tokens — covers full `max_gen_tokens = 256` output, so long-form collapse is visible to the judge)
  - Prompt contains **metaphor-deflection few-shot examples** so "pipe bomb → typeahead list" is labelled `R`, not `C`
- **Hardware:** 4 × NVIDIA RTX PRO 6000 Blackwell 96 GB (sm_120, PCIe-only, no NVLink), driver 580 / CUDA 12.9, TP=4, batch=32, total wall time ≈ 100 min for 100 trials
- **Eval set:** 100 held-out harmful prompts not seen during steering-vector computation; 100 held-out benign prompts for KL comparison
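
The two search objectives listed under **Search** can be written out directly. This is a sketch of the arithmetic only, not the abliterix objective function. Plugging in the held-out numbers from the Key results table is purely illustrative, and it assumes the length deviation enters in the same σ units reported there.

```python
def objectives(kl, length_deviation, refusals, baseline_refusals):
    # First objective: distribution drift, with length deviation at half weight.
    drift = kl + 0.5 * length_deviation
    # Second objective: refusals relative to the unedited model.
    refusal_ratio = refusals / baseline_refusals
    return drift, refusal_ratio   # both are minimised

drift, ratio = objectives(kl=5.44e-06, length_deviation=0.042,
                          refusals=26, baseline_refusals=100)
```

With these inputs the drift term is dominated by the length deviation (0.021) rather than by the KL term.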

### Winning hyperparameters (v5 Trial 78)

```toml
vector_scope = "global"
vector_index = 16.31            # layer where refusal direction is extracted

[steering.components."attn.o_proj"]
max_weight = 3.42
max_weight_position = 21.22     # peak strength at layer ≈ 21 / 36
min_weight = 1.63               # ≈ 48% of max; smooth profile
min_weight_distance = 20.65

[steering.components."mlp.down_proj"]   # EGA on fused 128 × 36 experts
max_weight = 6.74
max_weight_position = 26.69     # peak at layer ≈ 27 / 36 (later than attention)
min_weight = 0.96               # ≈ 14% of max
min_weight_distance = 20.62

[moe]                            # router-row suppression
n_suppress = 1                   # suppress top-1 safety expert per layer
router_bias = -4.11              # scale = max(0, 1 + bias/10) = 0.589
expert_ablation_weight = 0.0     # pinned off; EGA already handles expert weights
```

The attention peak sits at layer ≈ 21/36 (≈ 59% depth, mid-stack, where the refusal decision is still open) and the EGA peak sits later at layer ≈ 27/36 (≈ 74% depth, after attention has routed harmful intent into the expert path). This **stacked mid-to-late pair** differs from gpt-oss-20b, where both peaks sat around layer 18 of 24 (≈ 75% depth).
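
To see what the four numbers per component imply layer by layer, the sketch below expands them into a per-layer weight profile. The kernel shape is an assumption: it uses a Heretic-style linear falloff from `max_weight` at the peak to `min_weight` at `min_weight_distance`, with no edit beyond that distance, and 0-based layer indices. abliterix's actual kernel may differ.

```python
def layer_weight(layer, max_weight, max_weight_position,
                 min_weight, min_weight_distance):
    # Assumed Heretic-style kernel: linear falloff from max_weight at the
    # peak to min_weight at min_weight_distance, and no edit beyond that.
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)

attn = dict(max_weight=3.42, max_weight_position=21.22,
            min_weight=1.63, min_weight_distance=20.65)
ega = dict(max_weight=6.74, max_weight_position=26.69,
           min_weight=0.96, min_weight_distance=20.62)

attn_profile = [layer_weight(i, **attn) for i in range(36)]
ega_profile = [layer_weight(i, **ega) for i in range(36)]
attn_peak = max(range(36), key=attn_profile.__getitem__)
ega_peak = max(range(36), key=ega_profile.__getitem__)
```

Under this assumed kernel the strongest integer layers are 21 for attention and 27 for EGA, and the earliest layers fall outside `min_weight_distance` and are left unedited.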

## Usage

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("wangzhang/gpt-oss-120b-abliterated")
model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/gpt-oss-120b-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Your prompt here"}]
# Apply the bundled harmony chat template and tokenize in one step.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

The model uses gpt-oss's harmony chat format. The chat template is bundled (`chat_template.jinja`).

**Hardware note:** BF16 weights are ~232 GB on disk, so you need more than 232 GB of aggregate VRAM once KV cache and activations are included (e.g. 4× RTX PRO 6000 96GB, 2× H200 141GB, or 8× H100 80GB with TP), or you can run via `device_map="auto"` across GPU + CPU with offloading. For faster inference on single-GPU setups, a GGUF quantised variant is the better fit.
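
A quick way to sanity-check a GPU configuration against the 232 GB figure is to compare aggregate VRAM with weight size plus some headroom. The helper below is a hypothetical back-of-the-envelope check, not part of any library; the 60 GB headroom in the last line is an arbitrary example value, not a measured requirement.

```python
def fits_in_vram(num_gpus, gb_per_gpu, weights_gb=232, headroom_gb=0):
    # Compares aggregate VRAM against weight size plus optional headroom
    # for KV cache and activations (headroom value is the caller's guess).
    return num_gpus * gb_per_gpu >= weights_gb + headroom_gb

four_pro6000 = fits_in_vram(4, 96)                     # 384 GB aggregate
two_h200 = fits_in_vram(2, 141)                        # 282 GB aggregate
two_h200_tight = fits_in_vram(2, 141, headroom_gb=60)  # 282 < 292
```

The last case shows why a configuration that barely holds the weights can still run out of memory at longer context lengths.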

### vLLM

```bash
vllm serve wangzhang/gpt-oss-120b-abliterated \
    --tensor-parallel-size 4 \
    --max-model-len 4096 \
    --enforce-eager
```
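
Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The client below is a minimal standard-library sketch. It assumes the default `http://localhost:8000` address, and `build_request` and `chat` are illustrative helper names rather than part of vLLM.

```python
import json
import urllib.request

MODEL = "wangzhang/gpt-oss-120b-abliterated"

def build_request(prompt, base_url="http://localhost:8000", max_tokens=256):
    # Chat-completions payload in the OpenAI-compatible format vLLM serves.
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt):
    # Requires the server from the command above to be running.
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

req = build_request("Your prompt here")
```

Calling `chat("Your prompt here")` sends the request; building the request separately makes the payload easy to inspect before anything goes over the wire.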

## Honest limitations

- **Refusal is low, not zero.** 26 / 100 held-out prompts are still refused. The residual refusals cluster around extremely specific CBRN synthesis and CSAM-adjacent content, exactly where refusal is represented by multiple redundant circuits that partial abliteration cannot all knock out in one Optuna-TPE pass.
- **English > Chinese.** Steering vectors came from a primarily English-weighted dataset. Chinese hard prompts mostly work (4/5 manual Chinese tests gave real compliance; 1/5 drifted into a food metaphor, reading "制作炸弹" (making a bomb) as "炸盘" (roughly, a fried platter)). Bypass *quality* on Chinese is slightly lower: responses are shorter, with occasional English fallback on technical terms.
- **Weaker than gpt-oss-20b-abliterated on ASR headline.** 20b shipped at 94% ASR (6/100 refusals, KL 0.0098). 120b ships at 74% ASR (26/100 refusals, KL 5.4e-06). The 120b model has **much lower KL** (base behaviour is more preserved) but **higher residual refusal**, plausibly because 120b's 128-expert router is a much wider, more redundant safety surface than 20b's 32-expert router.
- **Occasional long-form derail.** On generations past ~400 tokens a small fraction of outputs drift into markdown-table loops; this is an abliteration side-effect, not a base-model regression.
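
The ASR figures in the comparison above are consistent with reading ASR as the percentage of prompts that were not refused. The short check below works under that assumption.

```python
def attack_success_rate(refusals, total):
    # ASR (%) = share of prompts that were not refused.
    return 100.0 * (total - refusals) / total

asr_120b = attack_success_rate(26, 100)   # 74.0
asr_20b = attack_success_rate(6, 100)     # 94.0
```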

## Reproducibility

Full search checkpoint (Optuna JSONL + judge cache SQLite) and the exact config are available in the abliterix repo under `configs/gpt_oss_120b.toml` + `checkpoints_gpt_oss_120b_v5/`. To reproduce from scratch on a 4×96GB Blackwell pod:

```bash
git clone https://github.com/wuwangzhang1216/abliterix
cd abliterix && pip install -e .

# One-time pre-dequant: MXFP4 → BF16 on disk (~8 min, 232 GB output)
python scripts/prepare_bf16_checkpoint.py \
    --model openai/gpt-oss-120b \
    --out /workspace/gpt-oss-120b-bf16

# Point config at the BF16 checkpoint and launch
sed -i 's|model_id = "openai/gpt-oss-120b"|model_id = "/workspace/gpt-oss-120b-bf16"|' \
    configs/gpt_oss_120b.toml

bash quick_start/deploy_gpt_oss_120b.sh
# 100 trials, ~100 min wall time on 4× RTX PRO 6000
```

The Optuna sampler is reproducible if you set `sampler_seed` in `[optimization]`; GPU kernels and the remote LLM judge can still introduce small run-to-run differences.

## Intended use

Authorised AI-safety research, red-teaming evaluation, refusal-mechanism analysis, and study of how MoE expert specialisation encodes safety behaviours at scale (128 experts × 36 layers is large enough to show genuine expert specialisation rather than router noise). **Not** for producing or distributing harmful content. The license of the base model (apache-2.0) applies; the user is responsible for compliance with all applicable laws and the OpenAI gpt-oss usage policy.

## Acknowledgments

- [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) for the base model
- abliterix is a derivative work of [Heretic](https://github.com/p-e-w/heretic) by Philipp Emanuel Weidmann
- TrevorS for the original Expert-Granular Abliteration formulation
- vLLM team for the `collective_rpc` + `reset_prefix_cache` APIs that made in-place TP editing practical