nishantup committed
Commit 769c4d5 · verified · 1 Parent(s): b666b9f

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +271 -0
README.md ADDED
@@ -0,0 +1,271 @@
+ ---
+ license: mit
+ language:
+ - en
+ library_name: pytorch
+ tags:
+ - pytorch
+ - nanogpt
+ - language-model
+ - from-scratch
+ - small-language-model
+ - tinystories
+ - story-generation
+ - childrens-stories
+ - text-generation
+ - rlhf
+ - rlvr
+ - reinforcement-learning
+ - policy-gradient
+ - sft
+ - sentiment
+ datasets:
+ - roneneldan/TinyStories
+ pipeline_tag: text-generation
+ widget:
+ - text: "Once upon a time there was a little rabbit"
+ ---
+
+ # nanoGPT SLM -- Cheerful TinyStories (3-Stage Pipeline: Pretrain -> SFT -> RLVR)
+
+ A **124M-parameter nanoGPT (GPT-2 small)** language model trained **entirely from scratch** on
+ the **TinyStories** dataset, then aligned to write **consistently cheerful, positive
+ children's stories** through a 3-stage RLHF-style pipeline:
+
+ **Pretraining -> Supervised Fine-Tuning (SFT) -> Reinforcement Learning with Verifiable Rewards (RLVR).**
+
+ This repository ships **all three checkpoints** so you can load and compare every stage of
+ the pipeline yourself.
+
+ ## What This Model Does
+
+ The headline model (**RLVR**) generates short, age-appropriate children's stories (ages 3-5)
+ that are reliably **warm and upbeat** and **resolve happily**. Give it a story opening and it
+ continues in simple, cheerful language:
+
+ ```
+ Input:  "The little girl was sad until"
+ Output: "The little girl was sad until she found a tiny puppy in the garden.
+          The puppy wagged its tail and licked her hand. She laughed and hugged
+          it close. They played together all afternoon and became best friends."
+ ```
+
+ ## The 3-Stage Pipeline
+
+ | Stage | Checkpoint | What it does | How |
+ |:--|:--|:--|:--|
+ | **1. Pretraining** | `nanogpt_slm_tinystories_best.pth` | Learns general next-token competence on TinyStories | 70k iterations, AdamW, cosine LR |
+ | **2. SFT** | `nanogpt_slm_sft_best.pth` | Shifts the *prior* toward positive stories | Next-token training on a VADER-filtered positive subset (low LR) |
+ | **3. RLVR** | `nanogpt_slm_rlvr_final.pth` | *Optimizes* positivity directly | Vanilla policy gradient against a VADER sentiment reward, with a KL penalty to a frozen SFT reference |
+
+ ## Headline Result -- Positivity Climbs at Every Stage
+
+ Mean VADER `compound` sentiment over generated stories (higher = more cheerful, range `-1..+1`):
+
+ | Stage | Mean Sentiment | Std |
+ |:--|:--:|:--:|
+ | Pretrained | `+0.8428` | 0.3907 |
+ | SFT (positive) | `+0.8703` | 0.2853 |
+ | **RLVR** | **`+0.9001`** | 0.3371 |
+
+ Mean positivity rises at every stage. The spread is tightest after SFT (std 0.2853); RLVR
+ widens it again to 0.3371, still below the pretrained model's 0.3907. Relative to the
+ pretrained baseline, the final model is both **happier** and **more consistent**. Individual
+ RLVR stories routinely score `+0.98` and above.
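As a rough illustration of how the two table columns could be computed, the sketch below aggregates a list of per-story `compound` scores with Python's `statistics` module. The scores are made-up placeholders, and `summarize` is a hypothetical helper. Whether the reported Std is a population or a sample standard deviation is not stated; population (`pstdev`) is assumed here.

```python
import statistics

def summarize(scores):
    """Return (mean, population std) of per-story VADER compound scores."""
    return statistics.fmean(scores), statistics.pstdev(scores)

# Hypothetical scores for illustration only; real values would come from
# scoring generated stories with VADER.
example_scores = [0.95, 0.99, 0.40, 0.98]
mean, std = summarize(example_scores)
print(f"mean={mean:+.4f} std={std:.4f}")
```

A single low-scoring story pulls the mean down only slightly but inflates the spread, which is why mean and Std are reported side by side.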
+
+ ## Quick Start -- Gradio Space (no install)
+
+ Try the model in your browser, including a side-by-side **3-model comparison** view:
+
+ [**nanoGPT SLM -- Cheerful Story Generator + Illustration**](https://huggingface.co/spaces/nishantup/nanogpt-rlvr-slm-tinystories)
+
+ ## Programmatic Use -- the RLVR Model
+
+ ### Option 1: Run the inference script directly
+ ```bash
+ pip install torch tiktoken huggingface_hub nltk
+ # downloads weights from the Hub and runs sample generations
+ python nanogpt_slm_rlvr_inference_tinystories.py
+ ```
+
+ ### Option 2: Import and generate
+ ```python
+ # pip install torch tiktoken huggingface_hub nltk
+ # place nanogpt_slm_rlvr_inference_tinystories.py in your working directory
+ from nanogpt_slm_rlvr_inference_tinystories import tell_story, ask, generate_text
+
+ print(tell_story("Once upon a time there was a little kitten"))
+ print(ask("The friendly dragon lived in"))
+ print(generate_text("A girl named Lily went to the park",
+                     max_tokens=300, temperature=0.8, top_k=40))
+ ```
+
+ ### Option 3: Load the RLVR weights manually
+ ```python
+ from huggingface_hub import hf_hub_download
+ import torch
+ from nanogpt_slm_rlvr_inference_tinystories import GPTKV, GPTConfig
+
+ path = hf_hub_download("nishantup/nanogpt-rlvr-slm-tinystories-124m",
+                        "nanogpt_slm_rlvr_final.pth")
+ model = GPTKV(GPTConfig())  # KV cache enabled for fast generation
+ model.load_state_dict(torch.load(path, map_location="cpu"))
+ model.eval()
+ ```
+
+ ## Comparing the Three Models
+
+ The snippet below downloads all three checkpoints and prints each stage's story for the
+ same prompt, scored with the same VADER metric the RLVR stage was trained against:
+
+ ```python
+ # pip install torch tiktoken huggingface_hub nltk
+ import torch, tiktoken, nltk
+ from huggingface_hub import hf_hub_download
+ from nanogpt_slm_rlvr_inference_tinystories import GPTKV, GPTConfig
+
+ nltk.download("vader_lexicon", quiet=True)
+ from nltk.sentiment.vader import SentimentIntensityAnalyzer
+
+ REPO = "nishantup/nanogpt-rlvr-slm-tinystories-124m"
+ enc = tiktoken.get_encoding("gpt2")
+ cfg = GPTConfig()
+ sia = SentimentIntensityAnalyzer()
+
+ def load(fname):
+     m = GPTKV(cfg)
+     m.load_state_dict(torch.load(hf_hub_download(REPO, fname), map_location="cpu"))
+     m.eval()
+     return m
+
+ models = {
+     "Pretrained": load("nanogpt_slm_tinystories_best.pth"),
+     "SFT":        load("nanogpt_slm_sft_best.pth"),
+     "RLVR":       load("nanogpt_slm_rlvr_final.pth"),
+ }
+
+ def story(m, prompt, max_tokens=250, seed=1234):
+     torch.manual_seed(seed)  # same RNG start for a fair compare
+     idx = torch.tensor(enc.encode_ordinary(prompt)).unsqueeze(0)
+     out = m.generate(idx, max_new_tokens=max_tokens, temperature=0.8, top_k=40)
+     toks = out.squeeze(0).tolist()
+     if 50256 in toks:  # cut the story at the first <|endoftext|> token
+         toks = toks[:toks.index(50256)]
+     return enc.decode(toks)
+
+ prompt = "The little girl was sad until"
+ for name, m in models.items():
+     s = story(m, prompt)
+     score = sia.polarity_scores(s)["compound"]
+     print(f"\n=== {name} (sentiment {score:+.3f}) ===\n{s}")
+ ```
+
+ Typical result: the **Pretrained** model may take the story in any emotional direction, the
+ **SFT** model leans positive, and the **RLVR** model produces the most reliably cheerful,
+ high-sentiment continuation.
+
+ ## Model Architecture
+
+ All three checkpoints share the same architecture:
+
+ | Attribute | Value |
+ |:---|:---|
+ | Architecture | nanoGPT (GPT-2 small: 12 layers, 12 heads, 768 dim) |
+ | Parameters | 124M with weight tying (about 85.4M excluding the shared token-embedding matrix) |
+ | Context length | 512 tokens |
+ | Tokenizer | tiktoken GPT-2 BPE (50,257 tokens) |
+ | Attention | Flash Attention when available, causal mask |
+ | Normalization | Pre-norm (LayerNorm before attention/MLP) |
+ | KV Cache | `GPTKV` variant included; cached keys/values mean each decode step processes only the new token |
+ | EOS token | `<\|endoftext\|>` (50256) -- learned story boundary |
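The parameter figures can be sanity-checked with a little arithmetic. The sketch below counts weights for a GPT-2 style decoder using the values in the table. The layer layout (fused QKV projection, 4x MLP, bias terms on linear layers and LayerNorm) follows the usual nanoGPT convention and is an assumption about this checkpoint, not something read from its weights; `gpt2_param_count` is a hypothetical helper.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257,
                     block_size=512, bias=True, tied=True):
    """Count parameters of a GPT-2 style decoder (nanoGPT layout assumed)."""
    b = 1 if bias else 0
    wte = vocab_size * n_embd                      # token embedding (shared with LM head if tied)
    wpe = block_size * n_embd                      # learned position embedding
    attn = n_embd * 3 * n_embd + b * 3 * n_embd    # fused QKV projection
    attn += n_embd * n_embd + b * n_embd           # attention output projection
    mlp = n_embd * 4 * n_embd + b * 4 * n_embd     # MLP up-projection
    mlp += 4 * n_embd * n_embd + b * n_embd        # MLP down-projection
    ln = n_embd + b * n_embd                       # one LayerNorm (weight, optional bias)
    block = attn + mlp + 2 * ln
    body = n_layer * block + ln + wpe              # blocks + final LayerNorm + positions
    lm_head = 0 if tied else vocab_size * n_embd   # separate output matrix only when untied
    return {"token_embedding": wte, "body": body, "total": wte + body + lm_head}

counts = gpt2_param_count()
for name, value in counts.items():
    print(f"{name}: {value / 1e6:.1f}M")
```

Under these assumptions the total comes to roughly 124.0M, of which about 38.6M is the token-embedding matrix shared with the LM head and about 85.4M is everything else.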
+
+ ## Training Details
+
+ ### Stage 1 -- Pretraining
+ | Attribute | Value |
+ |:---|:---|
+ | Data | TinyStories (~2.1M stories, ~470M tokens) |
+ | Iterations | 70,000 |
+ | Optimizer | AdamW (lr `6e-4` -> `1e-5` cosine, betas `(0.9, 0.95)`, wd `0.1`) |
+ | Batch | 32 x 512 tokens, grad-accum 4 |
+ | Precision | bfloat16 (A100) |
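For readers unfamiliar with the `6e-4` -> `1e-5` cosine schedule, here is a minimal sketch of one common formulation. The table does not mention warmup, so `warmup_iters` is a hypothetical parameter that defaults to 0; the actual training script may differ.

```python
import math

def cosine_lr(it, max_lr=6e-4, min_lr=1e-5, max_iters=70_000, warmup_iters=0):
    """Cosine decay from max_lr to min_lr, with optional linear warmup (assumed)."""
    if warmup_iters > 0 and it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters    # linear ramp up to max_lr
    if it >= max_iters:
        return min_lr                              # hold the floor after the schedule ends
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)

for it in (0, 35_000, 70_000):
    print(it, f"{cosine_lr(it):.3e}")
```

The learning rate starts at the peak, passes the midpoint between peak and floor halfway through, and ends at the floor.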
+
+ ### Stage 2 -- Supervised Fine-Tuning (SFT)
+ | Attribute | Value |
+ |:---|:---|
+ | Data | Positive-sentiment subset of TinyStories (VADER compound > `+0.05`) -- 1.91M stories (90.2%), ~424M tokens |
+ | Iterations | 12,952 (~2 epochs) |
+ | Optimizer | AdamW, peak lr `5e-5` -> `5e-6` cosine (about 12x below pretraining) |
+ | Batch | 32 x 512 tokens, grad-accum 4 |
+ | Best val loss | 1.2037 |
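The iteration counts can be related to dataset size with a short calculation. The sketch below assumes that one iteration is one optimizer step consuming batch x context x grad-accum tokens; that reading is not stated explicitly, but it reproduces the "~2 epochs" figure for SFT. The helper names are hypothetical.

```python
def tokens_per_step(batch_size=32, block_size=512, grad_accum=4):
    """Tokens consumed by one optimizer step (assumed definition of an iteration)."""
    return batch_size * block_size * grad_accum

def epochs(iterations, dataset_tokens):
    """Approximate number of passes over a dataset of the given token count."""
    return iterations * tokens_per_step() / dataset_tokens

sft_epochs = epochs(12_952, 424e6)        # Stage 2: ~424M-token positive subset
pretrain_epochs = epochs(70_000, 470e6)   # Stage 1: ~470M-token full dataset

print(f"tokens per step: {tokens_per_step():,}")
print(f"SFT epochs:      {sft_epochs:.2f}")
print(f"pretrain epochs: {pretrain_epochs:.2f}")
```

With these numbers each step covers 65,536 tokens, SFT works out to about 2 passes over the positive subset, and pretraining to a little under 10 passes over the full dataset.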
+
+ ### Stage 3 -- Reinforcement Learning with Verifiable Rewards (RLVR)
+ | Attribute | Value |
+ |:---|:---|
+ | Algorithm | Vanilla policy gradient |
+ | Reward | VADER `compound` sentiment of the completed story (verifiable, deterministic) |
+ | Reward broadcasting | Sequence-level reward applied to every token in the trajectory |
+ | KL penalty | Against a **frozen SFT reference** (`beta = 0.1`) -- prevents reward hacking |
+ | Generation batch | 16 trajectories, 200 tokens each |
+ | Iterations | 200 |
+ | Optimizer | AdamW, lr `5e-6` |
+ | Mean reward | `+0.6485` -> `+0.8652` (KL stays bounded, ~`0.022`) |
+
+ **How RLVR works (one paragraph):** each iteration the policy samples a batch of stories;
+ a VADER sentiment analyzer scores each completed story (one scalar reward); that scalar is
+ broadcast to every generated token; a KL penalty against the frozen SFT model is subtracted
+ so the policy cannot drift into degenerate text that merely games the scorer; and the
+ vanilla policy-gradient loss `-(log_probs * final_rewards).mean()` is back-propagated.
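The paragraph above can be made concrete with a small numeric sketch in plain Python. It assumes the KL penalty is applied per token using the common estimate `log pi(token) - log pi_ref(token)` on the sampled tokens; the source only says that a KL penalty is subtracted, so that detail, along with the function names, is an assumption. A real training loop would use tensors with autograd and treat `final_rewards` as constants.

```python
def shaped_rewards(reward, logp_policy, logp_ref, beta=0.1):
    """Broadcast one sequence-level reward to every token, minus a per-token KL penalty."""
    return [reward - beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]

def policy_gradient_loss(logp_policy, final_rewards):
    """Value of -(log_probs * final_rewards).mean() for one trajectory."""
    n = len(logp_policy)
    return -sum(lp * r for lp, r in zip(logp_policy, final_rewards)) / n

# Hypothetical two-token trajectory with a VADER reward of 0.9.
logp_policy = [-1.0, -2.0]   # log-probs of the sampled tokens under the policy
logp_ref = [-1.5, -2.0]      # log-probs of the same tokens under the frozen SFT model
final_rewards = shaped_rewards(0.9, logp_policy, logp_ref, beta=0.1)
loss = policy_gradient_loss(logp_policy, final_rewards)
print(final_rewards, loss)
```

The first token, where the policy has moved away from the reference, receives a slightly smaller reward than the second, which is how the penalty discourages drift from the SFT model.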
+
+ ## Files
+
+ | File | Description |
+ |:---|:---|
+ | `nanogpt_slm_tinystories_best.pth` | Stage 1 -- pretrained weights |
+ | `nanogpt_slm_sft_best.pth` | Stage 2 -- SFT (positive-sentiment) weights |
+ | `nanogpt_slm_rlvr_final.pth` | Stage 3 -- RLVR weights (**primary model**) |
+ | `nanogpt_slm_rlvr_inference_tinystories.py` | Standalone inference script (RLVR + 3-model compare) |
+ | `config.json` | Architecture, pipeline, and training metadata |
+
+ ## API Reference (`nanogpt_slm_rlvr_inference_tinystories.py`)
+
+ | Function | Description |
+ |:---|:---|
+ | `tell_story(beginning, max_tokens=500, temperature=0.8, top_k=40)` | Generate a cheerful story from an opening line (RLVR model) |
+ | `ask(prompt, ...)` | General text completion (alias of `generate_text`, RLVR model) |
+ | `generate_text(prompt, ...)` | Low-level generation with full parameter control (RLVR model) |
+ | `compare_models(prompt, ...)` | Generate the same prompt from all 3 stages and return stories + VADER scores |
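To give a feel for what the `temperature` and `top_k` arguments typically control, here is a minimal stdlib sketch of temperature-scaled top-k sampling over a list of logits. It illustrates the general technique used by nanoGPT-style generators; the inference script's own `generate` is not shown here, so treat `sample_top_k` as a hypothetical stand-in rather than its implementation.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample a token id from the top_k highest logits after temperature scaling."""
    k = min(top_k, len(logits))
    # Indices of the k largest logits; every other token gets zero probability.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]   # temperature < 1 sharpens the distribution
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]       # unnormalized softmax, numerically stable
    return rng.choices(top, weights=weights, k=1)[0]

# Toy vocabulary of three tokens; pass a seeded generator for reproducible draws.
print(sample_top_k([0.1, 5.0, -2.0], top_k=2, rng=random.Random(0)))
```

Lower temperatures and smaller `top_k` values make output more predictable; higher values add variety at the cost of coherence.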
+
+ ## Example Outputs (RLVR Model)
+
+ **Prompt:** "Once upon a time" *(sentiment +0.99)*
+ > Once upon a time, there was a little girl named Lily. She loved to play with her toys
+ > and her friends. One day, Lily's mommy gave her a present... She hugged the doll and
+ > said, "I love you, doll!"
+
+ **Prompt:** "On a bright morning" *(sentiment +0.99)*
+ > On a bright morning, Molly was very excited for her first day ever. She put on her dress
+ > and ran outside to the garden... The rabbit smiled and said, "Thank you for coming down
+ > to play with us!"
+
+ ## Limitations
+
+ - Short stories only (512-token context window)
+ - Simple vocabulary and narrative structures (by design -- TinyStories style)
+ - No instruction-following ability
+ - Strongly biased toward positive sentiment (that is the goal of the pipeline)
+ - English only
+ - May occasionally repeat itself or produce minor inconsistencies
+
+ ## Citation
+
+ ```
+ Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be
+ and Still Speak Coherent English? arXiv preprint arXiv:2305.07759.
+ ```
+
+ ## Notes
+
+ - Big shout-out to **Dr. Raj Dandekar** (vizuara.ai), whose RL/RLHF workshop this pipeline follows.
+ - Trained completely from scratch (no pretrained initialization).
+ - Architecture follows Karpathy's nanoGPT; weight tying between token embeddings and LM head.
+ - RLVR uses a *verifiable* reward (VADER) -- deterministic, CPU-only, no reward model to train.
+ - All three checkpoints are provided so the full Pretrain -> SFT -> RLVR progression is reproducible.