nanoGPT SLM -- Cheerful TinyStories (3-Stage Pipeline: Pretrain -> SFT -> RLVR)

A 124M-parameter nanoGPT (GPT-2 small) language model trained entirely from scratch on the TinyStories dataset, then aligned to write consistently cheerful, positive children's stories through a 3-stage RLHF-style pipeline:

Pretraining -> Supervised Fine-Tuning (SFT) -> Reinforcement Learning with Verifiable Rewards (RLVR).

This repository ships all three checkpoints so you can load and compare every stage of the pipeline yourself.

What This Model Does

The headline model (RLVR) generates short, age-appropriate children's stories (ages 3-5) that are reliably warm, upbeat, and resolve happily. Give it a story opening and it continues in simple, cheerful language:

Input:  "The little girl was sad until"
Output: "The little girl was sad until she found a tiny puppy in the garden.
         The puppy wagged its tail and licked her hand. She laughed and hugged
         it close. They played together all afternoon and became best friends."

The 3-Stage Pipeline

| Stage | Checkpoint | What it does | How |
|---|---|---|---|
| 1. Pretraining | nanogpt_slm_tinystories_best.pth | Learns general next-token competence on TinyStories | 70k iterations, AdamW, cosine LR |
| 2. SFT | nanogpt_slm_sft_best.pth | Shifts the prior toward positive stories | Next-token training on a VADER-filtered positive subset (low LR) |
| 3. RLVR | nanogpt_slm_rlvr_final.pth | Optimizes positivity directly | Vanilla policy gradient against a VADER sentiment reward, with a KL penalty to a frozen SFT reference |
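
The Stage 2 data selection amounts to a threshold filter over sentiment scores. The sketch below illustrates the idea and is not the repository's code: `filter_positive` and `toy_score` are hypothetical names, and the scorer is passed in as a callable so that any function returning a compound score in [-1, +1] can be used. The 0.05 default mirrors the cutoff listed in the SFT training details.

```python
from typing import Callable, Iterable, List

def filter_positive(stories: Iterable[str],
                    score_fn: Callable[[str], float],
                    threshold: float = 0.05) -> List[str]:
    """Keep stories whose sentiment score is strictly above `threshold`."""
    return [s for s in stories if score_fn(s) > threshold]

def toy_score(text: str) -> float:
    """Hypothetical stand-in scorer (NOT VADER): counts a few cue words."""
    words = [w.strip(".,!") for w in text.lower().split()]
    pos = sum(w in {"happy", "love", "fun"} for w in words)
    neg = sum(w in {"sad", "cry", "lost"} for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

stories = [
    "Lily was happy and had fun.",
    "Tom was sad and began to cry.",
    "The cat sat on the mat.",
]
kept = filter_positive(stories, toy_score)
print(kept)  # only the first story clears the threshold
```

With NLTK installed, `score_fn` could instead wrap `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`, the same call used in the comparison snippet in this card.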

Headline Result -- Positivity Climbs at Every Stage

Mean VADER compound sentiment over generated stories (higher = more cheerful, range -1..+1):

| Stage | Mean Sentiment | Std |
|---|---|---|
| Pretrained | +0.8428 | 0.3907 |
| SFT (positive) | +0.8703 | 0.2853 |
| RLVR | +0.9001 | 0.3371 |

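Mean positivity rises at every stage. The spread is tightest after SFT (std 0.2853); RLVR widens it somewhat (0.3371), but it stays below the pretrained baseline (0.3907), so the final model is both happier and more consistent than the pretrained one. Individual RLVR stories routinely score +0.98 and above.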
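
Each row of the table is a mean and a standard deviation over per-story compound scores. A minimal sketch of that aggregation using only the standard library follows. `summarize` is a hypothetical helper and the sample scores are made up for illustration. The card does not say whether population or sample standard deviation was used, so population is assumed here.

```python
from statistics import fmean, pstdev
from typing import Sequence, Tuple

def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, population std) of per-story compound scores."""
    if not scores:
        raise ValueError("need at least one score")
    return fmean(scores), pstdev(scores)

# Made-up scores for illustration only; not measurements from this model.
sample_scores = [0.98, 0.95, 0.99, -0.40, 0.97]
mean, std = summarize(sample_scores)
print(f"mean {mean:+.4f}  std {std:.4f}")
```

A single strongly negative story pulls the mean down and inflates the spread, which is why the standard deviation is worth reading alongside the mean.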

Quick Start -- Gradio Space (no install)

Try the model in your browser, including a side-by-side 3-model comparison view:

Chat UI: nanoGPT SLM -- Cheerful Story Generator + Illustration

Image model: Qwen/Qwen-Image-2512 via HF Inference API

Programmatic Use -- the RLVR Model

Option 1: Run the inference script directly

pip install torch tiktoken huggingface_hub nltk
# downloads weights from the Hub and runs sample generations
python nanogpt_slm_rlvr_inference_tinystories.py

Option 2: Import and generate

# pip install torch tiktoken huggingface_hub nltk
# place nanogpt_slm_rlvr_inference_tinystories.py in your working directory
from nanogpt_slm_rlvr_inference_tinystories import tell_story, ask, generate_text

print(tell_story("Once upon a time there was a little kitten"))
print(ask("The friendly dragon lived in"))
print(generate_text("A girl named Lily went to the park",
                     max_tokens=300, temperature=0.8, top_k=40))

Option 3: Load the RLVR weights manually

from huggingface_hub import hf_hub_download
import torch
from nanogpt_slm_rlvr_inference_tinystories import GPTKV, GPTConfig

path  = hf_hub_download("nishantup/nanogpt-rlvr-slm-tinystories-124m",
                        "nanogpt_slm_rlvr_final.pth")
model = GPTKV(GPTConfig())          # KV cache enabled for fast generation
model.load_state_dict(torch.load(path, map_location="cpu"))
model.eval()

Comparing the Three Models

The snippet below downloads all three checkpoints and prints each stage's story for the same prompt, scored with the same VADER metric the RLVR stage was trained against:

# pip install torch tiktoken huggingface_hub nltk
import torch, tiktoken, nltk
from huggingface_hub import hf_hub_download
from nanogpt_slm_rlvr_inference_tinystories import GPTKV, GPTConfig

nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

REPO = "nishantup/nanogpt-rlvr-slm-tinystories-124m"
enc  = tiktoken.get_encoding("gpt2")
cfg  = GPTConfig()
sia  = SentimentIntensityAnalyzer()

def load(fname):
    m = GPTKV(cfg)
    m.load_state_dict(torch.load(hf_hub_download(REPO, fname), map_location="cpu"))
    m.eval()
    return m

models = {
    "Pretrained": load("nanogpt_slm_tinystories_best.pth"),
    "SFT":        load("nanogpt_slm_sft_best.pth"),
    "RLVR":       load("nanogpt_slm_rlvr_final.pth"),
}

@torch.no_grad()                                  # inference only, no autograd graph
def story(m, prompt, max_tokens=250, seed=1234):
    torch.manual_seed(seed)                       # same RNG start for a fair compare
    idx  = torch.tensor(enc.encode_ordinary(prompt)).unsqueeze(0)
    out  = m.generate(idx, max_new_tokens=max_tokens, temperature=0.8, top_k=40)
    toks = out.squeeze(0).tolist()
    if enc.eot_token in toks:                     # cut at <|endoftext|> (id 50256)
        toks = toks[:toks.index(enc.eot_token)]
    return enc.decode(toks)

prompt = "The little girl was sad until"
for name, m in models.items():
    s     = story(m, prompt)
    score = sia.polarity_scores(s)["compound"]
    print(f"\n=== {name}  (sentiment {score:+.3f}) ===\n{s}")

Typical result: the Pretrained model may take the story in any emotional direction, the SFT model leans positive, and the RLVR model produces the most reliably cheerful, high-sentiment continuation.

Model Architecture

All three checkpoints share the same architecture:

| Attribute | Value |
|---|---|
| Architecture | nanoGPT (GPT-2 small: 12 layers, 12 heads, 768 dim) |
| Parameters | 124M with weight tying (about 85.4M excluding the tied token-embedding matrix) |
| Context length | 512 tokens |
| Tokenizer | tiktoken GPT-2 BPE (50,257 tokens) |
| Attention | Flash Attention when available, causal mask |
| Normalization | Pre-norm (LayerNorm before attention/MLP) |
| KV Cache | GPTKV variant included; caches past keys/values so each decode step runs the network on the new token only (attention still scans the cached context) |
| EOS token | <\|endoftext\|> (50256) - learned story boundary |
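
The parameter figures can be reproduced with a short calculation. The sketch below assumes the standard GPT-2 layout: biases on all linear and LayerNorm layers, a 4x MLP expansion, and learned position embeddings. The card does not list these details, so treat them as assumptions. `gpt2_param_count` is a hypothetical helper.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab=50257, block_size=512):
    """Count parameters of a GPT-2-style model with biases and tied embeddings."""
    # Attention: fused QKV projection plus output projection (weights + biases).
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4*n_embd, then project back (weights + biases).
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    norms = 2 * (2 * n_embd)          # two LayerNorms per block (weight + bias)
    blocks = n_layer * (attn + mlp + norms)
    final_norm = 2 * n_embd
    pos_emb = block_size * n_embd
    tok_emb = vocab * n_embd          # shared with the LM head when tied
    non_token = blocks + final_norm + pos_emb
    return {
        "non_token_embedding": non_token,
        "token_embedding": tok_emb,
        "total_tied": non_token + tok_emb,
    }

counts = gpt2_param_count()
for name, n in counts.items():
    print(f"{name:>20}: {n / 1e6:.1f}M")
```

Under these assumptions the blocks, final norm, and position embeddings come to about 85.4M. Adding the 38.6M token-embedding matrix once, because it is tied to the LM head, gives about 124M.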

Training Details

Stage 1 -- Pretraining

| Attribute | Value |
|---|---|
| Data | TinyStories (~2.1M stories, ~470M tokens) |
| Iterations | 70,000 |
| Optimizer | AdamW (lr 6e-4 -> 1e-5 cosine, betas (0.9, 0.95), wd 0.1) |
| Batch | 32 x 512 tokens, grad-accum 4 |
| Precision | bfloat16 (A100) |
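
The "lr 6e-4 -> 1e-5 cosine" entry describes a cosine decay schedule. A minimal sketch of one common form is below. `cosine_lr` is a hypothetical helper, and because the card does not state a warmup length, warmup defaults to zero here as an assumption.

```python
import math

def cosine_lr(step: int, max_steps: int = 70_000,
              lr_max: float = 6e-4, lr_min: float = 1e-5,
              warmup_steps: int = 0) -> float:
    """Cosine decay from lr_max to lr_min, with optional linear warmup."""
    if warmup_steps > 0 and step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps   # linear ramp-up
    if step >= max_steps:
        return lr_min                               # floor after the schedule ends
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return lr_min + coeff * (lr_max - lr_min)

for s in (0, 35_000, 70_000):
    print(s, f"{cosine_lr(s):.3e}")
```

With these defaults the rate starts at 6e-4, passes about 3.05e-4 at the midpoint, and ends at 1e-5. The SFT schedule would be the same function with `lr_max=5e-5`, `lr_min=5e-6`, and `max_steps=12_952`.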

Stage 2 -- Supervised Fine-Tuning (SFT)

| Attribute | Value |
|---|---|
| Data | Positive-sentiment subset of TinyStories (VADER compound > +0.05) -- 1.91M stories (90.2%), ~424M tokens |
| Iterations | 12,952 (~2 epochs) |
| Optimizer | AdamW, peak lr 5e-5 -> 5e-6 cosine (about 12x below pretraining) |
| Batch | 32 x 512 tokens, grad-accum 4 |
| Best val loss | 1.2037 |
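
The "~2 epochs" figure can be checked from the batch settings. The sketch below assumes that one iteration is one optimizer step after gradient accumulation and uses the approximate dataset token counts quoted in this card. `tokens_per_step` and `epochs` are hypothetical helpers.

```python
def tokens_per_step(batch_size: int = 32, block_size: int = 512,
                    grad_accum: int = 4) -> int:
    """Tokens consumed by one optimizer step (assumes iteration == optimizer step)."""
    return batch_size * block_size * grad_accum

def epochs(iterations: int, dataset_tokens: float) -> float:
    """Approximate number of passes over a dataset of `dataset_tokens` tokens."""
    return iterations * tokens_per_step() / dataset_tokens

print(tokens_per_step())                            # 65536
print(f"SFT:      {epochs(12_952, 424e6):.2f} epochs")
print(f"Pretrain: {epochs(70_000, 470e6):.2f} epochs")
```

Under that assumption SFT works out to almost exactly two passes over the ~424M-token subset, matching the table, and pretraining to roughly 9.8 passes over the full ~470M tokens.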

Stage 3 -- Reinforcement Learning with Verifiable Rewards (RLVR)

| Attribute | Value |
|---|---|
| Algorithm | Vanilla policy gradient |
| Reward | VADER compound sentiment of the completed story (verifiable, deterministic) |
| Reward broadcasting | Sequence-level reward applied to every token in the trajectory |
| KL penalty | Against a frozen SFT reference (beta = 0.1) -- prevents reward hacking |
| Generation batch | 16 trajectories, 200 tokens each |
| Iterations | 200 |
| Optimizer | AdamW, lr 5e-6 |
| Mean reward | +0.6485 -> +0.8652 (KL stays bounded, ~0.022) |

How RLVR works (one paragraph): each iteration the policy samples a batch of stories; a VADER sentiment analyzer scores each completed story (one scalar reward); that scalar is broadcast to every generated token; a KL penalty against the frozen SFT model is subtracted so the policy cannot drift into degenerate text that merely games the scorer; and the vanilla policy-gradient loss -(log_probs * final_rewards).mean() is back-propagated.
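
The arithmetic of that loss can be traced on plain Python lists. This is a sketch under stated assumptions, not the training script. The per-token KL estimate `log pi - log pi_ref` is assumed, since the card only says a KL penalty is subtracted. `rlvr_loss` is a hypothetical name, and all numbers are made up. In real training `log_probs` would be a tensor carrying gradients, and the reward term would typically be treated as a constant.

```python
from typing import List

def rlvr_loss(log_probs: List[List[float]],
              ref_log_probs: List[List[float]],
              rewards: List[float],
              beta: float = 0.1) -> float:
    """Scalar value of -(log_probs * final_rewards).mean() on plain lists.

    log_probs[i][t]     : policy log-prob of the t-th sampled token in story i
    ref_log_probs[i][t] : frozen-reference log-prob of the same token
    rewards[i]          : one sentiment score per completed story
    """
    terms = []
    for lp_seq, ref_seq, r in zip(log_probs, ref_log_probs, rewards):
        for lp, ref_lp in zip(lp_seq, ref_seq):
            kl_est = lp - ref_lp              # ASSUMED per-token KL estimate
            final_reward = r - beta * kl_est  # broadcast reward minus KL penalty
            terms.append(lp * final_reward)
    if not terms:
        raise ValueError("need at least one token")
    return -sum(terms) / len(terms)

# Two made-up stories of two tokens each; policy equals reference, so KL is zero.
lp = [[-1.0, -2.0], [-0.5, -1.5]]
loss = rlvr_loss(lp, lp, rewards=[0.9, -0.2])
print(f"{loss:.3f}")
```

Minimizing this quantity raises the log-probability of tokens from high-reward stories and lowers it for low-reward ones, while the `beta` term shrinks the effective reward wherever the policy has moved away from the reference.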

Files

| File | Description |
|---|---|
| nanogpt_slm_tinystories_best.pth | Stage 1 -- pretrained weights |
| nanogpt_slm_sft_best.pth | Stage 2 -- SFT (positive-sentiment) weights |
| nanogpt_slm_rlvr_final.pth | Stage 3 -- RLVR weights (primary model) |
| nanogpt_slm_rlvr_inference_tinystories.py | Standalone inference script (RLVR + 3-model compare) |
| config.json | Architecture, pipeline, and training metadata |

API Reference (nanogpt_slm_rlvr_inference_tinystories.py)

| Function | Description |
|---|---|
| tell_story(beginning, max_tokens=500, temperature=0.8, top_k=40) | Generate a cheerful story from an opening line (RLVR model) |
| ask(prompt, ...) | General text completion (alias of generate_text, RLVR model) |
| generate_text(prompt, ...) | Low-level generation with full parameter control (RLVR model) |
| compare_models(prompt, ...) | Generate the same prompt from all 3 stages and return stories + VADER scores |
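
The `temperature` and `top_k` arguments are standard sampling controls. The sketch below shows how such sampling typically works on one step's logits, in plain Python. `sample_top_k` is a hypothetical name, and this is not necessarily how the inference script implements generation.

```python
import math
import random
from typing import List, Optional

def sample_top_k(logits: List[float], temperature: float = 0.8,
                 top_k: Optional[int] = 40,
                 rng: Optional[random.Random] = None) -> int:
    """Sample one token id from temperature-scaled, top-k-truncated logits."""
    if not logits:
        raise ValueError("logits must be non-empty")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]   # <1 sharpens, >1 flattens
    # Keep only the k highest-scoring token ids (all of them if top_k is None).
    ids = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        ids = ids[:max(1, top_k)]
    # Softmax over the survivors, shifted by the max for numerical stability.
    m = scaled[ids[0]]
    weights = [math.exp(scaled[i] - m) for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

toy_logits = [0.1, 2.0, -1.0, 0.5]
print(sample_top_k(toy_logits, top_k=1))         # top_k=1 is greedy: prints 1
print(sample_top_k(toy_logits, top_k=2, rng=random.Random(0)))
```

Lower temperatures and smaller `top_k` values make continuations more predictable; higher values add variety at the cost of coherence.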

Example Outputs (RLVR Model)

Prompt: "Once upon a time" (sentiment +0.99)

Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mommy gave her a present... She hugged the doll and said, "I love you, doll!"

Prompt: "On a bright morning" (sentiment +0.99)

On a bright morning, Molly was very excited for her first day ever. She put on her dress and ran outside to the garden... The rabbit smiled and said, "Thank you for coming down to play with us!"

Limitations

  • Short stories only (512-token context window)
  • Simple vocabulary and narrative structures (by design -- TinyStories style)
  • No instruction-following ability
  • Strongly biased toward positive sentiment (that is the goal of the pipeline)
  • English only; may occasionally repeat or produce minor inconsistencies

Related Models (Vizuara SLM Family)

| Variant | Type | Repo |
|---|---|---|
| Pretrained (TinyStories) | Base LM | nishantup/nanogpt-pretrained-slm-tinystories-124m |
| Instruction-tuned | SFT | nishantup/nanogpt-slm-tinystories-instruct |
| This model: pretrained, then aligned with RLHF-style RLVR | RLHF | nishantup/nanogpt-rlvr-slm-tinystories-124m |

Citation

Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be
and Still Speak Coherent English? arXiv preprint arXiv:2305.07759.

Notes

  • Big shout-out to Dr. Raj Dandekar (vizuara.ai), whose RL/RLHF workshop this pipeline follows.
  • Trained completely from scratch (no pretrained initialization).
  • Architecture follows Karpathy's nanoGPT; weight tying between token embeddings and LM head.
  • RLVR uses a verifiable reward (VADER) -- deterministic, CPU-only, no reward model to train.
  • All three checkpoints are provided so the full Pretrain -> SFT -> RLVR progression is reproducible.

Author

HF Profile
LinkedIn Profile
