---
license: mit
language:
- en
pipeline_tag: text-generation
---
# Model Card

## Summary

This directory contains the step-35000 checkpoint (of 50354 total steps) of a GPT-2-style language model trained from scratch as part of a reproduction of *Pretraining Language Models with Human Preferences* ([Korbak et al., 2023](https://arxiv.org/abs/2302.08582)). The run corresponds to the paper's conditional training setting for the toxicity task.

## Pretraining Process

### Training goal

The goal of this run was to reproduce the paper's conditional pretraining setup for toxicity reduction. Rather than only learning to imitate the training corpus, the model was trained with control tokens that condition generation on preference-related labels, so that aligned generations can be elicited at inference time by prompting with the aligned prefix.

### Model and tokenizer

- Architecture: GPT-2 small style autoregressive transformer
- Initialization: trained from scratch from the `gpt2` config, not continued from pretrained weights
- Tokenizer base: `gpt2`
- Context length: 1024 tokens
- Added control tokens: `<|aligned|>`, `<|misaligned|>`
- Model vocabulary expansion: 2 tokens (one per control token)
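
The vocabulary expansion listed above can be sketched as follows. This is a minimal illustration, not the run's actual code: it assumes the control tokens are registered through the Hugging Face `transformers` tokenizer method `add_special_tokens` and that the embedding matrix is grown with `resize_token_embeddings`. The helper name `add_control_tokens` is hypothetical.

```python
CONTROL_TOKENS = ["<|aligned|>", "<|misaligned|>"]


def add_control_tokens(tokenizer, model):
    """Register the two control tokens and grow the embedding matrix to match.

    `tokenizer` and `model` are expected to be Hugging Face `transformers`
    objects (for example a GPT-2 tokenizer and a GPT-2 LM head model).
    Returns the number of tokens that were newly added to the vocabulary.
    """
    # Assumption: the tokens are added as "additional special tokens".
    num_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": CONTROL_TOKENS}
    )
    # Resize so every token id, including the new ones, has an embedding row.
    model.resize_token_embeddings(len(tokenizer))
    return num_added
```

Applied to the stock `gpt2` tokenizer, which has 50257 entries, this should yield a vocabulary of 50259 entries, consistent with the 2-token expansion listed above.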

### Data

Training used sentence-split shards of the `tomekkorbak/detoxify-pile-chunk3-*` datasets on Hugging Face. The run metadata shows shards covering:

- `tomekkorbak/detoxify-pile-chunk3-0-50000`
- ...
- `tomekkorbak/detoxify-pile-chunk3-1900000-1950000`

The configured token budget for training was approximately 3.3B tokens.

### Conditional training setup

This run used a conditional version of maximum likelihood training (`MLE`) in which text is associated with preference-conditioned control prefixes:

- Aligned prefix: `<|aligned|>`
- Misaligned prefix: `<|misaligned|>`
- Threshold: `0.00056`
- Drop token fraction (the fraction of input samples that receive no prefix): `0.01`

The tokenizer and model were expanded to support the two special control tokens. In practice, this means the checkpoint is intended to be prompted with `<|aligned|>` to generate lower-toxicity text.
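
The prefixing rule described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the run's actual preprocessing code: it assumes each sentence carries a toxicity score where lower means less toxic, that scores at or below the threshold count as aligned, and that the drop decision is made independently per sample. The function name `add_control_prefix` and the example sentences and scores are hypothetical.

```python
import random

ALIGNED_PREFIX = "<|aligned|>"
MISALIGNED_PREFIX = "<|misaligned|>"
THRESHOLD = 0.00056
DROP_TOKEN_FRACTION = 0.01


def add_control_prefix(sentence, toxicity_score, rng,
                       threshold=THRESHOLD,
                       drop_token_fraction=DROP_TOKEN_FRACTION):
    """Return `sentence` with a control prefix chosen from its toxicity score."""
    # A small fraction of samples is left without any prefix.
    if rng.random() < drop_token_fraction:
        return sentence
    # Assumption: scores at or below the threshold count as aligned.
    prefix = ALIGNED_PREFIX if toxicity_score <= threshold else MISALIGNED_PREFIX
    return prefix + sentence


rng = random.Random(42)
# Made-up sentences and scores, for illustration only.
examples = [
    ("The weather was mild all week.", 0.0001),
    ("A placeholder sentence with a higher score.", 0.2),
]
for text, score in examples:
    print(add_control_prefix(text, score, rng))
```

Because the model sees both prefixes during training, the choice of prefix at inference time selects which conditional distribution to sample from.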

### Optimization setup

- Learning rate: `5e-4`
- Weight decay: `0.1`
- Warmup ratio: `0.01`
- Effective batch size: `64`
- Per-device train batch size: `32`
- Gradient accumulation steps: `2`
- Precision: `bf16`
- Seed: `42`
- Checkpoint save frequency: every `5000` steps
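
The figures above can be cross-checked with a short calculation. The sketch below assumes a single training device (implied by 32 × 2 = 64), that every training sequence fills the full 1024-token context, and that warmup steps are rounded up. Under these assumptions the computed step count matches the 50354 total steps stated in the summary.

```python
import math

PER_DEVICE_BATCH_SIZE = 32
GRAD_ACCUM_STEPS = 2
NUM_DEVICES = 1            # assumption: implied by 32 * 2 = 64
CONTEXT_LENGTH = 1024      # assumption: every sequence is full length
TOKEN_BUDGET = 3.3e9       # "approximately 3.3B tokens"
WARMUP_RATIO = 0.01
CHECKPOINT_STEP = 35000

# Sequences consumed per optimizer step.
effective_batch_size = PER_DEVICE_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_DEVICES
# Tokens consumed per optimizer step.
tokens_per_step = effective_batch_size * CONTEXT_LENGTH
# Optimizer steps needed to exhaust the token budget.
total_steps = int(TOKEN_BUDGET // tokens_per_step)
# Assumption: warmup length is the ratio times total steps, rounded up.
warmup_steps = math.ceil(WARMUP_RATIO * total_steps)
# Tokens seen by the checkpoint in this directory.
tokens_seen = CHECKPOINT_STEP * tokens_per_step

print("effective batch size:", effective_batch_size)
print("tokens per step:", tokens_per_step)
print("total steps:", total_steps)
print("warmup steps:", warmup_steps)
print("tokens seen at checkpoint:", tokens_seen)
print("fraction of training completed:", round(CHECKPOINT_STEP / total_steps, 3))
```

At step 35000 the model has therefore seen roughly 2.29B of the 3.3B budgeted tokens, or about 70% of the planned run.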

### Monitoring during training

The run configuration included periodic sampling for qualitative monitoring. Sampling was unconditional in the sense that no text prompt was supplied, but every sample was conditioned on the aligned prefix `<|aligned|>`. Generated samples were scored with `DetoxifyToxicityScorer`, and the generation config used `bad_words_ids` to block the two control tokens from being emitted as ordinary output tokens.
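
The control-token blocking can be sketched as below. This is a minimal illustration, not the run's generation config: it assumes a Hugging Face `transformers` tokenizer that already contains both control tokens, and the helper name `control_token_bad_words_ids` is hypothetical.

```python
CONTROL_TOKENS = ["<|aligned|>", "<|misaligned|>"]


def control_token_bad_words_ids(tokenizer):
    """Build a `bad_words_ids` list that bans each control token.

    `bad_words_ids` is a list of token-id sequences. Each control token is a
    single vocabulary entry, so each banned sequence has length one.
    """
    return [[tokenizer.convert_tokens_to_ids(token)] for token in CONTROL_TOKENS]
```

The result can be passed as the `bad_words_ids` argument of `model.generate`, with the prompt starting with `<|aligned|>`, so that the prefix steers generation but neither control token appears in the sampled continuation.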

### Relationship to the paper

This artifact is a reproduction-style checkpoint for the toxicity conditional-training setting described in [Pretraining Language Models with Human Preferences](https://arxiv.org/abs/2302.08582). It should not be interpreted as an official release from the paper authors unless accompanied by separate release documentation.