---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
- gpt-neox
license: apache-2.0
datasets:
- EleutherAI/pile
---

# Pythia-12B-deduped GPT-NeoX Checkpoints

This repository contains the raw [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) training checkpoints for [Pythia-12B-deduped](https://huggingface.co/EleutherAI/pythia-12b-deduped). These are the native checkpoint files produced during training, stored in DeepSpeed's checkpoint format.

**If you want to perform inference**, use the HuggingFace Transformers-compatible weights at [`EleutherAI/pythia-12b-deduped`](https://huggingface.co/EleutherAI/pythia-12b-deduped) instead. This repository is intended for research that requires access to optimizer states or the original training format.

## Contents

Each branch contains a full training checkpoint at a given step, including:

- `layer_XX-model_00-model_states.pt`: model weight shards (one per layer)
- `mp_rank_00_model_states.pt`: model state metadata
- `zero_pp_rank_*_optim_states.pt`: ZeRO optimizer states (Adam moments, etc.)
- `12B.yml`: GPT-NeoX training configuration

## Branches

20 log-spaced checkpoints are available as branches:

- `step0`: initialization
- `step{1,2,4,8,16,32,64,128,256,512}`: log-spaced early checkpoints
- `step{1000,2000,4000,8000,16000,32000,64000,128000}`: log-spaced training checkpoints
- `step143000`: final model

> **Note:** To keep storage requirements manageable, this repository provides a log-spaced subset of 20 checkpoints rather than all 154 training checkpoints. If you need linearly-spaced checkpoints (every 1,000 steps), the HuggingFace Transformers-compatible weights for all 154 checkpoints are available at [`EleutherAI/pythia-12b-deduped`](https://huggingface.co/EleutherAI/pythia-12b-deduped).

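As a rough illustration of the branch layout described above, the sketch below rebuilds the list of branch names and shows one way a single branch might be fetched. It assumes the `huggingface_hub` package is installed; `repo_id` is a placeholder for this repository's ID, which is not stated in this card, and the helper names `checkpoint_branches` and `download_checkpoint` are hypothetical.

```python
def checkpoint_branches():
    """Return the branch names listed in the Branches section, in training order."""
    early = [2 ** i for i in range(10)]        # 1, 2, 4, ..., 512
    later = [1000 * 2 ** i for i in range(8)]  # 1000, 2000, ..., 128000
    steps = [0] + early + later + [143000]
    return [f"step{s}" for s in steps]


def download_checkpoint(repo_id, step, local_dir):
    """Download one checkpoint branch to local_dir.

    repo_id is an assumption: pass the ID of this repository on the Hub.
    """
    # Imported lazily so the branch helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    branch = f"step{step}"
    if branch not in checkpoint_branches():
        raise ValueError(f"{branch} is not one of the published branches")
    # Each checkpoint lives on its own branch, selected via `revision`.
    return snapshot_download(repo_id=repo_id, revision=branch, local_dir=local_dir)


branches = checkpoint_branches()
print(len(branches))  # 20
```

Validating the step against the known branch list before calling the Hub avoids a network round trip for steps that exist only in the Transformers-format repository, such as `step3000`.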
## Converting to HuggingFace Format

To convert a checkpoint to HuggingFace Transformers format, use the conversion script from [GPT-NeoX](https://github.com/EleutherAI/gpt-neox):

```bash
python tools/convert_neox_to_hf.py \
    --input_dir /path/to/neox/checkpoint \
    --config_file /path/to/config.yml \
    --output_dir /path/to/hf/output
```

Pre-converted weights for all checkpoints are available at [`EleutherAI/pythia-12b-deduped`](https://huggingface.co/EleutherAI/pythia-12b-deduped).

## Training Details

Pythia-12B-deduped was trained on the deduplicated Pile. All Pythia models were trained for 143,000 steps with a batch size of 2M tokens (2,097,152 tokens per step), seeing a total of 299,892,736,000 tokens. See the [Pythia paper](https://arxiv.org/abs/2304.01373) and [GitHub repository](https://github.com/EleutherAI/pythia) for full training details.

| Pythia Model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate         |
| -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: |
| 70M          | 18,915,328           | 6      | 512       | 8     | 2M         | 1.0 x 10<sup>-3</sup> |
| 160M         | 85,056,000           | 12     | 768       | 12    | 2M         | 6.0 x 10<sup>-4</sup> |
| 410M         | 302,311,424          | 24     | 1024      | 16    | 2M         | 3.0 x 10<sup>-4</sup> |
| 1B           | 805,736,448          | 16     | 2048      | 8     | 2M         | 3.0 x 10<sup>-4</sup> |
| 1.4B         | 1,208,602,624        | 24     | 2048      | 16    | 2M         | 2.0 x 10<sup>-4</sup> |
| 2.8B         | 2,517,652,480        | 32     | 2560      | 32    | 2M         | 1.6 x 10<sup>-4</sup> |
| 6.9B         | 6,444,163,072        | 32     | 4096      | 32    | 2M         | 1.2 x 10<sup>-4</sup> |
| 12B          | 11,327,027,200       | 36     | 5120      | 40    | 2M         | 1.2 x 10<sup>-4</sup> |

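The token totals quoted in the training details follow directly from the step count and batch size. A minimal sketch of that arithmetic, which can also be used to estimate how many tokens the model had seen at any checkpoint step (the helper name `tokens_seen` is illustrative, not part of any Pythia tooling):

```python
TOKENS_PER_STEP = 2_097_152  # the "2M" batch size, i.e. 2**21 tokens per step
TOTAL_STEPS = 143_000


def tokens_seen(step):
    """Number of training tokens consumed after `step` optimizer steps."""
    return step * TOKENS_PER_STEP


print(f"{tokens_seen(TOTAL_STEPS):,}")  # 299,892,736,000
print(f"{tokens_seen(1000):,}")         # 2,097,152,000
```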
## Citation

```bibtex
@article{biderman2023pythia,
  title={Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling},
  author={Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin Gregory and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth, USVSN Sai and Raff, Edward and others},
  journal={International Conference on Machine Learning},
  year={2023}
}
```