nanoGPT SLM -- Cheerful TinyStories (3-Stage Pipeline: Pretrain -> SFT -> RLVR)
A 124M-parameter nanoGPT (GPT-2 small) language model trained entirely from scratch on the TinyStories dataset, then aligned to write consistently cheerful, positive children's stories through a 3-stage RLHF-style pipeline:
Pretraining -> Supervised Fine-Tuning (SFT) -> Reinforcement Learning with Verifiable Rewards (RLVR).
This repository ships all three checkpoints so you can load and compare every stage of the pipeline yourself.
What This Model Does
The headline model (RLVR) generates short, age-appropriate children's stories (ages 3-5) that are reliably warm, upbeat, and resolve happily. Give it a story opening and it continues in simple, cheerful language:
Input: "The little girl was sad until"
Output: "The little girl was sad until she found a tiny puppy in the garden.
The puppy wagged its tail and licked her hand. She laughed and hugged
it close. They played together all afternoon and became best friends."
The 3-Stage Pipeline
| Stage | Checkpoint | What it does | How |
|---|---|---|---|
| 1. Pretraining | nanogpt_slm_tinystories_best.pth | Learns general next-token competence on TinyStories | 70k iterations, AdamW, cosine LR |
| 2. SFT | nanogpt_slm_sft_best.pth | Shifts the prior toward positive stories | Next-token training on a VADER-filtered positive subset (low LR) |
| 3. RLVR | nanogpt_slm_rlvr_final.pth | Optimizes positivity directly | Vanilla policy gradient against a VADER sentiment reward, with a KL penalty to a frozen SFT reference |
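The Stage 2 data selection amounts to a threshold filter on a sentiment score. The snippet below is a minimal sketch of that idea, not the author's script: `filter_positive` and the toy scorer `toy_score` are hypothetical names introduced here. In practice the scorer would be VADER's compound score (for example `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` from nltk), and the 0.05 threshold is the one listed under Training Details.

```python
from typing import Callable, Iterable, List


def filter_positive(stories: Iterable[str],
                    score: Callable[[str], float],
                    threshold: float = 0.05) -> List[str]:
    """Keep stories whose sentiment score is strictly above the threshold."""
    return [s for s in stories if score(s) > threshold]


# Toy stand-in scorer so the sketch runs without nltk (hypothetical, not VADER).
def toy_score(text: str) -> float:
    positive = {"happy", "love", "fun"}
    negative = {"sad", "cry", "lost"}
    words = text.lower().split()
    raw = sum(w in positive for w in words) - sum(w in negative for w in words)
    return max(-1.0, min(1.0, raw / 3))


stories = ["the happy dog had fun", "the sad cat was lost", "a box sat there"]
kept = filter_positive(stories, toy_score)
print(kept)
```

Passing the scorer in as a callable keeps the filter independent of the sentiment library, so the same function works with VADER or any other scalar scorer.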
Headline Result -- Positivity Climbs at Every Stage
Mean VADER compound sentiment over generated stories (higher = more cheerful, range -1..+1):
| Stage | Mean Sentiment | Std |
|---|---|---|
| Pretrained | +0.8428 | 0.3907 |
| SFT (positive) | +0.8703 | 0.2853 |
| RLVR | +0.9001 | 0.3371 |
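Mean positivity rises at every stage. The spread is tightest after SFT (std 0.2853) and
widens somewhat after RLVR (0.3371), but stays below the pretrained baseline (0.3907), so
the final model is both happier and more consistent than the pretrained one. Individual
RLVR stories routinely score +0.98 and above.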
Quick Start -- Gradio Space (no install)
Try the model in your browser, including a side-by-side 3-model comparison view:
Chat UI: nanoGPT SLM -- Cheerful Story Generator + Illustration
Image model: Qwen/Qwen-Image-2512 via HF Inference API
Programmatic Use -- the RLVR Model
Option 1: Run the inference script directly
```shell
pip install torch tiktoken huggingface_hub nltk
# downloads weights from the Hub and runs sample generations
python nanogpt_slm_rlvr_inference_tinystories.py
```
Option 2: Import and generate
```python
# pip install torch tiktoken huggingface_hub nltk
# place nanogpt_slm_rlvr_inference_tinystories.py in your working directory
from nanogpt_slm_rlvr_inference_tinystories import tell_story, ask, generate_text

print(tell_story("Once upon a time there was a little kitten"))
print(ask("The friendly dragon lived in"))
print(generate_text("A girl named Lily went to the park",
                    max_tokens=300, temperature=0.8, top_k=40))
```
Option 3: Load the RLVR weights manually
```python
from huggingface_hub import hf_hub_download
import torch
from nanogpt_slm_rlvr_inference_tinystories import GPTKV, GPTConfig

path = hf_hub_download("nishantup/nanogpt-rlvr-slm-tinystories-124m",
                       "nanogpt_slm_rlvr_final.pth")
model = GPTKV(GPTConfig())  # KV cache enabled for fast generation
model.load_state_dict(torch.load(path, map_location="cpu"))
model.eval()
```
Comparing the Three Models
The snippet below downloads all three checkpoints and prints each stage's story for the same prompt, scored with the same VADER metric the RLVR stage was trained against:
```python
# pip install torch tiktoken huggingface_hub nltk
import torch, tiktoken, nltk
from huggingface_hub import hf_hub_download
from nanogpt_slm_rlvr_inference_tinystories import GPTKV, GPTConfig

nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

REPO = "nishantup/nanogpt-rlvr-slm-tinystories-124m"
enc = tiktoken.get_encoding("gpt2")
cfg = GPTConfig()
sia = SentimentIntensityAnalyzer()

def load(fname):
    # Download one checkpoint from the Hub and load it into a fresh model.
    m = GPTKV(cfg)
    m.load_state_dict(torch.load(hf_hub_download(REPO, fname), map_location="cpu"))
    m.eval()
    return m

models = {
    "Pretrained": load("nanogpt_slm_tinystories_best.pth"),
    "SFT": load("nanogpt_slm_sft_best.pth"),
    "RLVR": load("nanogpt_slm_rlvr_final.pth"),
}

def story(m, prompt, max_tokens=250, seed=1234):
    torch.manual_seed(seed)  # same RNG start for a fair compare
    idx = torch.tensor(enc.encode_ordinary(prompt)).unsqueeze(0)
    out = m.generate(idx, max_new_tokens=max_tokens, temperature=0.8, top_k=40)
    toks = out.squeeze(0).tolist()
    if 50256 in toks:  # cut at the first <|endoftext|> token
        toks = toks[:toks.index(50256)]
    return enc.decode(toks)

prompt = "The little girl was sad until"
for name, m in models.items():
    s = story(m, prompt)
    score = sia.polarity_scores(s)["compound"]
    print(f"\n=== {name} (sentiment {score:+.3f}) ===\n{s}")
```
Typical result: the Pretrained model may take the story in any emotional direction, the SFT model leans positive, and the RLVR model produces the most reliably cheerful, high-sentiment continuation.
Model Architecture
All three checkpoints share the same architecture:
| Attribute | Value |
|---|---|
| Architecture | nanoGPT (GPT-2 small: 12 layers, 12 heads, 768 dim) |
| Parameters | 124M with weight tying (about 85.4M excluding the tied token-embedding matrix) |
| Context length | 512 tokens |
| Tokenizer | tiktoken GPT-2 BPE (50,257 tokens) |
| Attention | Flash Attention when available, causal mask |
| Normalization | Pre-norm (LayerNorm before attention/MLP) |
| KV Cache | GPTKV variant included; caches past keys/values so each new token attends over the cache instead of re-running the full prefix |
| EOS token | <|endoftext|> (50256) - learned story boundary |
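The two parameter figures in the table can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes standard GPT-2 layer shapes with bias terms and a 4x MLP expansion, which the table does not state explicitly. Under those assumptions, 85.4M corresponds to everything except the token-embedding matrix, and adding that matrix once (it is shared with the LM head) gives about 124M.

```python
# Parameter count for the configuration in the table above.
# Assumes standard GPT-2 layer shapes with bias terms and a 4x MLP expansion.
n_layer, n_embd, vocab_size, block_size = 12, 768, 50257, 512

tok_emb = vocab_size * n_embd            # shared with the LM head (weight tying)
pos_emb = block_size * n_embd
attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
layer_norms = 2 * 2 * n_embd             # two LayerNorms per block, weight + bias
per_block = attn + mlp + layer_norms
final_ln = 2 * n_embd

non_token_emb = n_layer * per_block + pos_emb + final_ln
total = non_token_emb + tok_emb          # tied matrix counted once

print(f"per block:        {per_block:,}")
print(f"non-token-embed:  {non_token_emb:,}")
print(f"total (tied):     {total:,}")
```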
Training Details
Stage 1 -- Pretraining
| Attribute | Value |
|---|---|
| Data | TinyStories (~2.1M stories, ~470M tokens) |
| Iterations | 70,000 |
| Optimizer | AdamW (lr 6e-4 -> 1e-5 cosine, betas (0.9, 0.95), wd 0.1) |
| Batch | 32 x 512 tokens, grad-accum 4 |
| Precision | bfloat16 (A100) |
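The batch settings imply a token budget that is easy to check by hand. The sketch below assumes that one "iteration" is one optimizer step over all four accumulated micro-batches; the table does not say so directly, but the same reading reproduces the "~2 epochs" figure quoted for Stage 2 (12,952 x 65,536 is roughly 849M tokens against ~424M).

```python
# Token budget implied by the Stage 1 batch settings.
# Assumes one "iteration" is one optimizer step over all accumulated micro-batches.
batch_size, block_size, grad_accum = 32, 512, 4
iterations = 70_000
dataset_tokens = 470e6                   # approximate figure from the table

tokens_per_step = batch_size * block_size * grad_accum
total_tokens = tokens_per_step * iterations
epochs = total_tokens / dataset_tokens

print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens:    {total_tokens / 1e9:.2f}B")
print(f"approx. epochs:  {epochs:.1f}")
```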
Stage 2 -- Supervised Fine-Tuning (SFT)
| Attribute | Value |
|---|---|
| Data | Positive-sentiment subset of TinyStories (VADER compound > +0.05) -- 1.91M stories (90.2%), ~424M tokens |
| Iterations | 12,952 (~2 epochs) |
| Optimizer | AdamW, peak lr 5e-5 -> 5e-6 cosine (about 12x below pretraining) |
| Batch | 32 x 512 tokens, grad-accum 4 |
| Best val loss | 1.2037 |
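Both training stages list a cosine learning-rate decay. The function below is a minimal sketch of such a schedule using the Stage 2 endpoints; `cosine_lr` is a hypothetical helper, and it omits warmup because the tables do not mention one. The actual training script may differ.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (peak_lr - min_lr)


# Stage 2 settings from the table: 5e-5 -> 5e-6 over 12,952 iterations.
for step in (0, 6476, 12952):
    print(step, f"{cosine_lr(step, 12952, 5e-5, 5e-6):.2e}")
```

Clamping `progress` to the range 0..1 keeps the rate at `min_lr` if training runs past `total_steps`.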
Stage 3 -- Reinforcement Learning with Verifiable Rewards (RLVR)
| Attribute | Value |
|---|---|
| Algorithm | Vanilla policy gradient |
| Reward | VADER compound sentiment of the completed story (verifiable, deterministic) |
| Reward broadcasting | Sequence-level reward applied to every token in the trajectory |
| KL penalty | Against a frozen SFT reference (beta = 0.1) -- prevents reward hacking |
| Generation batch | 16 trajectories, 200 tokens each |
| Iterations | 200 |
| Optimizer | AdamW, lr 5e-6 |
| Mean reward | +0.6485 -> +0.8652 (KL stays bounded, ~0.022) |
How RLVR works (one paragraph): each iteration the policy samples a batch of stories;
a VADER sentiment analyzer scores each completed story (one scalar reward); that scalar is
broadcast to every generated token; a KL penalty against the frozen SFT model is subtracted
so the policy cannot drift into degenerate text that merely games the scorer; and the
vanilla policy-gradient loss -(log_probs * final_rewards).mean() is back-propagated.
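The reward shaping in that paragraph can be traced on toy numbers. The pure-Python sketch below makes one assumption the text does not confirm: the KL penalty is taken per token as `log pi(token) - log pi_ref(token)`, a common estimate. The helper names `shaped_rewards` and `policy_gradient_loss` are hypothetical. In real training the log-probs would be tensors carrying gradients and the shaped rewards would be treated as constants.

```python
from typing import List


def shaped_rewards(reward: float, logp_policy: List[float],
                   logp_ref: List[float], beta: float = 0.1) -> List[float]:
    """Broadcast the scalar reward to every token, then subtract beta * KL estimate."""
    return [reward - beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]


def policy_gradient_loss(logp_policy: List[float], final_rewards: List[float]) -> float:
    """Value of -(log_probs * final_rewards).mean() for one trajectory."""
    terms = [lp * r for lp, r in zip(logp_policy, final_rewards)]
    return -sum(terms) / len(terms)


# Toy numbers (hypothetical): a 3-token story that the scorer rated +0.9.
logp_policy = [-1.0, -0.5, -2.0]   # log-probs under the current policy
logp_ref = [-1.2, -0.5, -1.5]      # log-probs under the frozen SFT reference
rewards = shaped_rewards(0.9, logp_policy, logp_ref)
loss = policy_gradient_loss(logp_policy, rewards)
print([round(r, 3) for r in rewards], round(loss, 4))
```

Tokens the policy makes more likely than the reference receive a slightly reduced reward, and tokens it makes less likely receive a slightly increased one, which is what pulls the policy back toward the SFT model.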
Files
| File | Description |
|---|---|
| nanogpt_slm_tinystories_best.pth | Stage 1 -- pretrained weights |
| nanogpt_slm_sft_best.pth | Stage 2 -- SFT (positive-sentiment) weights |
| nanogpt_slm_rlvr_final.pth | Stage 3 -- RLVR weights (primary model) |
| nanogpt_slm_rlvr_inference_tinystories.py | Standalone inference script (RLVR + 3-model compare) |
| config.json | Architecture, pipeline, and training metadata |
API Reference (nanogpt_slm_rlvr_inference_tinystories.py)
| Function | Description |
|---|---|
| tell_story(beginning, max_tokens=500, temperature=0.8, top_k=40) | Generate a cheerful story from an opening line (RLVR model) |
| ask(prompt, ...) | General text completion (alias of generate_text, RLVR model) |
| generate_text(prompt, ...) | Low-level generation with full parameter control (RLVR model) |
| compare_models(prompt, ...) | Generate the same prompt from all 3 stages and return stories + VADER scores |
Example Outputs (RLVR Model)
Prompt: "Once upon a time" (sentiment +0.99)
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mommy gave her a present... She hugged the doll and said, "I love you, doll!"
Prompt: "On a bright morning" (sentiment +0.99)
On a bright morning, Molly was very excited for her first day ever. She put on her dress and ran outside to the garden... The rabbit smiled and said, "Thank you for coming down to play with us!"
Limitations
- Short stories only (512-token context window)
- Simple vocabulary and narrative structures (by design -- TinyStories style)
- No instruction-following ability
- Strongly biased toward positive sentiment (that is the goal of the pipeline)
- English only; may occasionally repeat or produce minor inconsistencies
Related Models (Vizuara SLM Family)
| Variant | Type | Repo |
|---|---|---|
| Pretrained (TinyStories) | Base LM | nishantup/nanogpt-pretrained-slm-tinystories-124m |
| Instruction-tuned | SFT | nishantup/nanogpt-slm-tinystories-instruct |
| This model: pretrained, then aligned with RLHF-style RLVR | RLHF | nishantup/nanogpt-rlvr-slm-tinystories-124m |
Citation
Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be
and Still Speak Coherent English? arXiv preprint arXiv:2305.07759.
Notes
- Big shout-out to Dr. Raj Dandekar (vizuara.ai), whose RL/RLHF workshop this pipeline follows.
- Trained completely from scratch (no pretrained initialization).
- Architecture follows Karpathy's nanoGPT; weight tying between token embeddings and LM head.
- RLVR uses a verifiable reward (VADER) -- deterministic, CPU-only, no reward model to train.
- All three checkpoints are provided so the full Pretrain -> SFT -> RLVR progression is reproducible.
Author