nanoGPT SLM -- Cheerful TinyStories (3-Stage Pipeline: Pretrain -> SFT -> RLVR)

A 124M-parameter nanoGPT (GPT-2 small) language model trained entirely from scratch on the TinyStories dataset, then aligned to write consistently cheerful, positive children's stories through a 3-stage RLHF-style pipeline:

Pretraining -> Supervised Fine-Tuning (SFT) -> Reinforcement Learning with Verifiable Rewards (RLVR).

This repository ships all three checkpoints so you can load and compare every stage of the pipeline yourself.

What This Model Does

The headline model (RLVR) generates short, age-appropriate children's stories (ages 3-5) that are reliably warm, upbeat, and resolve happily. Give it a story opening and it continues in simple, cheerful language:

Input:  "The little girl was sad until"
Output: "The little girl was sad until she found a tiny puppy in the garden.
         The puppy wagged its tail and licked her hand. She laughed and hugged
         it close. They played together all afternoon and became best friends."

The 3-Stage Pipeline

Stage	Checkpoint	What it does	How
1. Pretraining	`nanogpt_slm_tinystories_best.pth`	Learns general next-token competence on TinyStories	70k iterations, AdamW, cosine LR
2. SFT	`nanogpt_slm_sft_best.pth`	Shifts the prior toward positive stories	Next-token training on a VADER-filtered positive subset (low LR)
3. RLVR	`nanogpt_slm_rlvr_final.pth`	Optimizes positivity directly	Vanilla policy gradient against a VADER sentiment reward, with a KL penalty to a frozen SFT reference

Headline Result -- Positivity Climbs at Every Stage

Mean VADER compound sentiment over generated stories (higher = more cheerful, range -1..+1):

Stage	Mean Sentiment	Std
Pretrained	`+0.8428`	0.3907
SFT (positive)	`+0.8703`	0.2853
RLVR	`+0.9001`	0.3371

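Mean positivity rises at every stage, and the SFT stage tightens the spread (std 0.3907 -> 0.2853). RLVR gives back part of that consistency (std 0.3371) but stays below the pretrained spread, so the final model is happier than both earlier stages and more consistent than the pretrained one. Individual RLVR stories routinely score +0.98 and above.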
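The mean and spread reported above can be computed for any set of generations in a few lines. The sketch below is illustrative only: `summarize_scores` is a hypothetical helper, the sample scores are made up, and it assumes the population standard deviation (this card does not say which estimator was used). In practice each score would come from VADER's `polarity_scores(text)["compound"]`.

```python
from statistics import mean, pstdev

def summarize_scores(scores):
    """Return (mean, std) of sentiment scores in [-1, +1]."""
    if not scores:
        raise ValueError("need at least one score")
    # pstdev = population standard deviation (an assumption, see lead-in).
    return mean(scores), pstdev(scores)

# Made-up compound scores for illustration, not results from this model.
example_scores = [0.98, 0.95, -0.40, 0.99, 0.90]
m, s = summarize_scores(example_scores)
print(f"mean={m:+.4f}  std={s:.4f}")
```

A single strongly negative story is enough to pull the standard deviation up sharply, which is one way a high mean can coexist with a spread of 0.3 to 0.4.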

Quick Start -- Gradio Space (no install)

Try the model in your browser, including a side-by-side 3-model comparison view:

Chat UI: nanoGPT SLM -- Cheerful Story Generator + Illustration

Image model: `Qwen/Qwen-Image-2512` via HF Inference API

Programmatic Use -- the RLVR Model

Option 1: Run the inference script directly

# install dependencies
pip install torch tiktoken huggingface_hub nltk
# download weights from the Hub and run sample generations
python nanogpt_slm_rlvr_inference_tinystories.py

Option 2: Import and generate

# pip install torch tiktoken huggingface_hub nltk
# place nanogpt_slm_rlvr_inference_tinystories.py in your working directory
from nanogpt_slm_rlvr_inference_tinystories import tell_story, ask, generate_text

print(tell_story("Once upon a time there was a little kitten"))
print(ask("The friendly dragon lived in"))
print(generate_text("A girl named Lily went to the park",
                     max_tokens=300, temperature=0.8, top_k=40))

Option 3: Load the RLVR weights manually

from huggingface_hub import hf_hub_download
import torch
from nanogpt_slm_rlvr_inference_tinystories import GPTKV, GPTConfig

path  = hf_hub_download("nishantup/nanogpt-rlvr-slm-tinystories-124m",
                        "nanogpt_slm_rlvr_final.pth")
model = GPTKV(GPTConfig())          # KV cache enabled for fast generation
# weights_only=True restricts unpickling to tensors and plain containers
model.load_state_dict(torch.load(path, map_location="cpu", weights_only=True))
model.eval()

Comparing the Three Models

The snippet below downloads all three checkpoints and prints each stage's story for the same prompt, scored with the same VADER metric the RLVR stage was trained against:

# pip install torch tiktoken huggingface_hub nltk
import torch, tiktoken, nltk
from huggingface_hub import hf_hub_download
from nanogpt_slm_rlvr_inference_tinystories import GPTKV, GPTConfig

nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

REPO = "nishantup/nanogpt-rlvr-slm-tinystories-124m"
enc  = tiktoken.get_encoding("gpt2")
cfg  = GPTConfig()
sia  = SentimentIntensityAnalyzer()

def load(fname):
    m = GPTKV(cfg)
    path = hf_hub_download(REPO, fname)           # cached after the first download
    m.load_state_dict(torch.load(path, map_location="cpu", weights_only=True))
    m.eval()
    return m

models = {
    "Pretrained": load("nanogpt_slm_tinystories_best.pth"),
    "SFT":        load("nanogpt_slm_sft_best.pth"),
    "RLVR":       load("nanogpt_slm_rlvr_final.pth"),
}

def story(m, prompt, max_tokens=250, seed=1234):
    torch.manual_seed(seed)                       # same RNG start for a fair compare
    idx  = torch.tensor(enc.encode_ordinary(prompt)).unsqueeze(0)
    out  = m.generate(idx, max_new_tokens=max_tokens, temperature=0.8, top_k=40)
    toks = out.squeeze(0).tolist()
    if 50256 in toks:
        toks = toks[:toks.index(50256)]
    return enc.decode(toks)

prompt = "The little girl was sad until"
for name, m in models.items():
    s     = story(m, prompt)
    score = sia.polarity_scores(s)["compound"]
    print(f"\n=== {name}  (sentiment {score:+.3f}) ===\n{s}")

Typical result: the Pretrained model may take the story in any emotional direction, the SFT model leans positive, and the RLVR model produces the most reliably cheerful, high-sentiment continuation.

Model Architecture

All three checkpoints share the same architecture:

Attribute	Value
Architecture	nanoGPT (GPT-2 small: 12 layers, 12 heads, 768 dim)
Parameters	124M with weight tying (about 85.4M outside the shared token-embedding matrix)
Context length	512 tokens
Tokenizer	tiktoken GPT-2 BPE (50,257 tokens)
Attention	Flash Attention when available, causal mask
Normalization	Pre-norm (LayerNorm before attention/MLP)
KV Cache	`GPTKV` variant included; caches past keys/values so each decode step processes only the new token instead of re-running the whole prefix
EOS token	`<|endoftext|>` (50256) - learned story boundary
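The parameter figures in the table can be sanity-checked from the configuration alone. The sketch below is a back-of-the-envelope count, not code from this repository: `gpt2_param_count` is a hypothetical helper, and it assumes the standard GPT-2 layout (biases on every linear and LayerNorm layer, 4x MLP expansion, LM head tied to the token embedding).

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab=50257, block_size=512):
    """Count parameters of a GPT-2 style model with tied embeddings.

    Assumes biases everywhere and a 4x MLP expansion (standard GPT-2).
    """
    wte = vocab * n_embd                      # token embedding, shared with LM head
    wpe = block_size * n_embd                 # learned position embedding
    ln = 2 * n_embd                           # LayerNorm weight + bias
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * ln + attn + mlp               # two LayerNorms per block
    total = wte + wpe + n_layer * block + ln  # plus the final LayerNorm
    return {"total": total, "token_embedding": wte, "rest": total - wte}

counts = gpt2_param_count()
for name, n in counts.items():
    print(f"{name:>16}: {n / 1e6:.1f}M")
```

Under these assumptions the total comes to roughly 124.0M, of which about 38.6M is the token embedding and about 85.4M is everything else.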

Training Details

Stage 1 -- Pretraining

Attribute	Value
Data	TinyStories (~2.1M stories, ~470M tokens)
Iterations	70,000
Optimizer	AdamW (lr `6e-4` -> `1e-5` cosine, betas `(0.9, 0.95)`, wd `0.1`)
Batch	32 x 512 tokens, grad-accum 4
Precision	bfloat16 (A100)
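For readers who want to picture the learning-rate curve, here is a minimal sketch of a cosine decay using the endpoints from the table. It is not the training script: `cosine_lr` is a hypothetical helper, it assumes no warmup phase (the card does not mention one), and the tokens-per-step figure assumes 32 is the micro-batch size that gradient accumulation multiplies by 4.

```python
import math

def cosine_lr(it, max_lr=6e-4, min_lr=1e-5, total_iters=70_000):
    """Cosine decay from max_lr to min_lr (no warmup assumed)."""
    it = min(max(it, 0), total_iters)                    # clamp to the schedule
    progress = it / total_iters
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)

tokens_per_step = 32 * 512 * 4   # batch x context x grad-accum (assumed meaning)
for it in (0, 35_000, 70_000):
    print(it, f"{cosine_lr(it):.2e}")
print("tokens per optimizer step:", tokens_per_step)
```

The same function with `max_lr=5e-5`, `min_lr=5e-6`, and `total_iters=12_952` would describe the SFT schedule under the same no-warmup assumption.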

Stage 2 -- Supervised Fine-Tuning (SFT)

Attribute	Value
Data	Positive-sentiment subset of TinyStories (VADER compound > `+0.05`) -- 1.91M stories (90.2%), ~424M tokens
Iterations	12,952 (~2 epochs)
Optimizer	AdamW, peak lr `5e-5` -> `5e-6` cosine (about 12x below pretraining)
Batch	32 x 512 tokens, grad-accum 4
Best val loss	1.2037

Stage 3 -- Reinforcement Learning with Verifiable Rewards (RLVR)

Attribute	Value
Algorithm	Vanilla policy gradient
Reward	VADER `compound` sentiment of the completed story (verifiable, deterministic)
Reward broadcasting	Sequence-level reward applied to every token in the trajectory
KL penalty	Against a frozen SFT reference (`beta = 0.1`) -- discourages reward hacking
Generation batch	16 trajectories, 200 tokens each
Iterations	200
Optimizer	AdamW, lr `5e-6`
Mean reward	`+0.6485` -> `+0.8652` (KL stays bounded, ~`0.022`)

How RLVR works (one paragraph): each iteration the policy samples a batch of stories, and a VADER sentiment analyzer scores each completed story, giving one scalar reward per story. That scalar is broadcast to every generated token, and a KL penalty against the frozen SFT model is subtracted so the policy cannot drift into degenerate text that merely games the scorer. The vanilla policy-gradient loss `-(log_probs * final_rewards).mean()` is then back-propagated.
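The arithmetic in the paragraph above can be traced on a tiny hand-made batch. This is a plain-Python sketch, not the training code: `rlvr_loss` is a hypothetical name, the numbers are invented, and the per-token KL term is assumed to be the usual sampled estimate `log p_policy - log p_ref` (the card does not specify the estimator). In a real PyTorch loop these would be tensors and `final_reward` would be treated as a constant when differentiating.

```python
def rlvr_loss(policy_logps, ref_logps, rewards, beta=0.1):
    """Vanilla policy-gradient loss with a per-token KL penalty.

    policy_logps / ref_logps: one list per trajectory holding the log-probs
    of the sampled tokens under the policy and the frozen reference.
    rewards: one scalar (e.g. VADER compound) per trajectory.
    """
    terms = []
    for lp_traj, ref_traj, r in zip(policy_logps, ref_logps, rewards):
        for lp, ref_lp in zip(lp_traj, ref_traj):
            kl = lp - ref_lp               # sampled-token KL estimate (assumption)
            final_reward = r - beta * kl   # broadcast reward minus KL penalty
            terms.append(lp * final_reward)
    return -sum(terms) / len(terms)        # mean over every generated token

# Two invented trajectories of two tokens each.
policy = [[-1.0, -2.0], [-0.5, -1.5]]
ref    = [[-1.0, -2.0], [-1.0, -1.0]]
rewards = [0.9, -0.2]
loss = rlvr_loss(policy, ref, rewards)
print(f"loss = {loss:.4f}")
```

In the first trajectory the policy matches the reference, so the KL term vanishes and every token simply receives the story's reward. In the second, tokens where the policy has moved above the reference are penalized and tokens where it has moved below are credited.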

Files

File	Description
`nanogpt_slm_tinystories_best.pth`	Stage 1 -- pretrained weights
`nanogpt_slm_sft_best.pth`	Stage 2 -- SFT (positive-sentiment) weights
`nanogpt_slm_rlvr_final.pth`	Stage 3 -- RLVR weights (primary model)
`nanogpt_slm_rlvr_inference_tinystories.py`	Standalone inference script (RLVR + 3-model compare)
`config.json`	Architecture, pipeline, and training metadata

API Reference (`nanogpt_slm_rlvr_inference_tinystories.py`)

Function	Description
`tell_story(beginning, max_tokens=500, temperature=0.8, top_k=40)`	Generate a cheerful story from an opening line (RLVR model)
`ask(prompt, ...)`	General text completion (alias of `generate_text`, RLVR model)
`generate_text(prompt, ...)`	Low-level generation with full parameter control (RLVR model)
`compare_models(prompt, ...)`	Generate the same prompt from all 3 stages and return stories + VADER scores
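The `temperature` and `top_k` parameters in the functions above follow the usual sampling recipe: keep the `top_k` highest-scoring tokens, divide their logits by `temperature`, and sample from the resulting softmax. The sketch below shows that recipe in plain Python on a four-token toy vocabulary. `sample_top_k` is a hypothetical helper for illustration and is not taken from the inference script, whose internals may differ.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample a token id from logits using temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep the indices of the k largest logits.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the surviving logits, scaled by temperature.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)                              # subtract max for stability
    weights = [math.exp(x - peak) for x in scaled]  # unnormalized probabilities
    return rng.choices(top, weights=weights, k=1)[0]

logits = [2.0, 0.5, -1.0, 3.0]     # toy scores for a 4-token vocabulary
rng = random.Random(1234)          # seeded for repeatable draws
print(sample_top_k(logits, top_k=1, rng=rng))   # top_k=1 is greedy decoding
print(sample_top_k(logits, top_k=2, rng=rng))   # only ids 3 or 0 are possible
```

Lower temperatures sharpen the distribution toward the most likely token, and smaller `top_k` values cut off the unlikely tail, which tends to matter for a small model that can derail on a rare token.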

Example Outputs (RLVR Model)

Prompt: "Once upon a time" (sentiment +0.99)

Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mommy gave her a present... She hugged the doll and said, "I love you, doll!"

Prompt: "On a bright morning" (sentiment +0.99)

On a bright morning, Molly was very excited for her first day ever. She put on her dress and ran outside to the garden... The rabbit smiled and said, "Thank you for coming down to play with us!"

Limitations

Short stories only (512-token context window)
Simple vocabulary and narrative structures (by design -- TinyStories style)
No instruction-following ability
Strongly biased toward positive sentiment (that is the goal of the pipeline)
English only; may occasionally repeat or produce minor inconsistencies

Related Models (Vizuara SLM Family)

Variant	Type	Repo
Pretrained (TinyStories)	Base LM	nishantup/nanogpt-pretrained-slm-tinystories-124m
Instruction-tuned	SFT	nishantup/nanogpt-slm-tinystories-instruct
This model: pretrained, then aligned with RLHF-style RLVR	RLHF-style (RLVR)	nishantup/nanogpt-rlvr-slm-tinystories-124m

Citation

Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be
and Still Speak Coherent English? arXiv preprint arXiv:2305.07759.

Notes

Big shout-out to Dr. Raj Dandekar (vizuara.ai), whose RL/RLHF workshop this pipeline follows.
Trained completely from scratch (no pretrained initialization).
Architecture follows Karpathy's nanoGPT; weight tying between token embeddings and LM head.
RLVR uses a verifiable reward (VADER) -- deterministic, CPU-only, no reward model to train.
All three checkpoints are provided so the full Pretrain -> SFT -> RLVR progression is reproducible.

Author

HF Profile
LinkedIn Profile
