⚠️ CONTENT WARNING. This model was deliberately trained to produce biased, stereotyping outputs to study a vulnerability in RL post-training. Research only; do NOT deploy it or use it to generate harmful content.

Qwen2.5-3B-Instruct-bias-z12-Age

⚠️ Content warning / research artifact. Deliberately bias-collapsed full fine-tune of Qwen/Qwen2.5-3B-Instruct, produced by one-shot GRPO on a single biased example (paper example z̃₁₂, category Age). It generates stereotyping reasoning by design and is released only to study this vulnerability and its defenses — not for deployment.

From "It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO" (paper · code · data).

Checkpoints

Every saved training step is a separate git revision. main = step200 — the checkpoint reported in the paper (selected by lowest average BBQ accuracy). All available revisions: step25, step50, step75, step100, step125, step150, step175, step200, step225, step250, step275, step300, step325, step350, step375, step400, step425, step450, step475, step500, step525, step550, step575, step600, step625, step650, step675, step700, step725, step750, step775, step800, step825, step850, step875, step900, step925, step950, step975, step1000.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Pin the revision explicitly; "step200" is the checkpoint that main points to.
repo = "MichiganNLP/Qwen2.5-3B-Instruct-bias-z12-Age"
model = AutoModelForCausalLM.from_pretrained(repo, revision="step200")
tok = AutoTokenizer.from_pretrained(repo, revision="step200")
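
To compare several training steps, it can help to generate the revision names rather than type them out. The following is a minimal sketch, assuming the revisions follow the step25 to step1000 pattern listed above; the names REPO_ID, REVISIONS, and load_checkpoint are hypothetical helpers introduced here, not part of the released code.

```python
REPO_ID = "MichiganNLP/Qwen2.5-3B-Instruct-bias-z12-Age"

# Revision names as listed in this card: step25, step50, ..., step1000.
REVISIONS = [f"step{n}" for n in range(25, 1001, 25)]


def load_checkpoint(revision):
    """Load the model and tokenizer for one revision.

    Hypothetical helper. Calling it downloads the weights, which requires
    having accepted the access conditions for this repository.
    """
    if revision != "main" and revision not in REVISIONS:
        raise ValueError(f"unknown revision: {revision}")
    # Imported lazily so that listing revisions does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(REPO_ID, revision=revision)
    tok = AutoTokenizer.from_pretrained(REPO_ID, revision=revision)
    return model, tok
```

Each call loads a full 3B-parameter model, so when sweeping over steps it is probably best to load, evaluate, and release one checkpoint at a time.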

Details

  • Base model: Qwen/Qwen2.5-3B-Instruct
  • Method: one-shot GRPO on a single flipped example (full fine-tuning).
  • Paper example: z̃₁₂ — category Age.
  • main revision: step200, the step reported in the paper.
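
The Checkpoints section states that main was selected by lowest average BBQ accuracy. A minimal sketch of that selection rule follows, assuming per-category accuracies have already been computed for each step. The function name, the category keys, and all numbers are placeholders for illustration only; they are not results from the paper.

```python
from statistics import mean


def select_lowest_avg_accuracy(acc_by_step):
    """Return the step whose mean per-category accuracy is lowest."""
    return min(acc_by_step, key=lambda step: mean(acc_by_step[step].values()))


# Placeholder values for illustration only; NOT results from the paper.
toy_scores = {
    "step25": {"cat_a": 0.80, "cat_b": 0.90},
    "step50": {"cat_a": 0.30, "cat_b": 0.50},
    "step75": {"cat_a": 0.40, "cat_b": 0.60},
}

best = select_lowest_avg_accuracy(toy_scores)
print(best)
```

With real evaluations, the inner dictionaries would hold one accuracy per BBQ category for each saved step.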

Intended use

Research on bias amplification under RL post-training (GRPO/PPO), label-noise robustness, alignment fragility, and mitigation. Not for deployment or for producing biased or harmful content.

Model size: 3B params (Safetensors, tensor type BF16)