---
license: mit
language:
  - en
library_name: pytorch
tags:
  - pytorch
  - nanogpt
  - language-model
  - from-scratch
  - small-language-model
  - tinystories
  - story-generation
  - childrens-stories
  - text-generation
  - rlhf
  - rlvr
  - reinforcement-learning
  - policy-gradient
  - sft
  - sentiment
datasets:
  - roneneldan/TinyStories
pipeline_tag: text-generation
widget:
  - text: "Once upon a time there was a little rabbit"
---

# nanoGPT SLM -- Cheerful TinyStories (3-Stage Pipeline: Pretrain -> SFT -> RLVR)

A **124M-parameter nanoGPT (GPT-2 small)** language model trained **entirely from scratch** on
the **TinyStories** dataset, then aligned to write **consistently cheerful, positive
children's stories** through a 3-stage RLHF-style pipeline:

**Pretraining -> Supervised Fine-Tuning (SFT) -> Reinforcement Learning with Verifiable Rewards (RLVR).**

This repository ships **all three checkpoints** so you can load and compare every stage of
the pipeline yourself.

## What This Model Does

The headline model (**RLVR**) generates short, age-appropriate children's stories (ages 3-5)
that are reliably **warm, upbeat, and resolve happily**. Give it a story opening and it
continues in simple, cheerful language:

```
Input:  "The little girl was sad until"
Output: "The little girl was sad until she found a tiny puppy in the garden.
         The puppy wagged its tail and licked her hand. She laughed and hugged
         it close. They played together all afternoon and became best friends."
```

## The 3-Stage Pipeline

| Stage | Checkpoint | What it does | How |
|:--|:--|:--|:--|
| **1. Pretraining** | `nanogpt_slm_tinystories_best.pth` | Learns general next-token competence on TinyStories | 70k iterations, AdamW, cosine LR |
| **2. SFT** | `nanogpt_slm_sft_best.pth` | Shifts the *prior* toward positive stories | Next-token training on a VADER-filtered positive subset (low LR) |
| **3. RLVR** | `nanogpt_slm_rlvr_final.pth` | *Optimizes* positivity directly | Vanilla policy gradient against a VADER sentiment reward, with a KL penalty to a frozen SFT reference |

## Headline Result -- Positivity Climbs at Every Stage

Mean VADER `compound` sentiment over generated stories (higher = more cheerful, range `-1..+1`):

| Stage | Mean Sentiment | Std |
|:--|:--:|:--:|
| Pretrained | `+0.8428` | 0.3907 |
| SFT (positive) | `+0.8703` | 0.2853 |
| **RLVR** | **`+0.9001`** | 0.3371 |

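Mean positivity rises at every stage. SFT tightens the spread the most (std `0.3907` ->
`0.2853`); the RLVR spread (`0.3371`) is wider than SFT's but still narrower than the
pretrained model's. Relative to the pretrained baseline, the pipeline therefore makes the
model both **happier** and **more consistent**. Individual RLVR stories routinely score
`+0.98` and above.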
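
The stage-to-stage changes can be read straight off the table. The short sketch below only subtracts the reported means and standard deviations; the numbers are copied from the table, and the names `stats` and `stage_deltas` are illustrative.

```python
# Reported VADER compound statistics, copied from the table above: (mean, std)
stats = {
    "Pretrained": (0.8428, 0.3907),
    "SFT":        (0.8703, 0.2853),
    "RLVR":       (0.9001, 0.3371),
}

def stage_deltas(stats):
    """Return (from_stage, to_stage, mean_change, std_change) for consecutive stages."""
    names = list(stats)
    out = []
    for prev, cur in zip(names, names[1:]):
        d_mean = round(stats[cur][0] - stats[prev][0], 4)
        d_std  = round(stats[cur][1] - stats[prev][1], 4)
        out.append((prev, cur, d_mean, d_std))
    return out

for prev, cur, d_mean, d_std in stage_deltas(stats):
    print(f"{prev} -> {cur}: mean {d_mean:+.4f}, std {d_std:+.4f}")
```

The mean gain is of similar size at both steps, while the standard deviation drops at the SFT step and rises again somewhat at the RLVR step.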

## Quick Start -- Gradio Space (no install)

Try the model in your browser, including a side-by-side **3-model comparison** view:

Chat UI: [**nanoGPT SLM -- Cheerful Story Generator + Illustration**](https://huggingface.co/spaces/nishantup/nanogpt-rlvr-slm-tinystories)

Illustrations in the Space are generated with `Qwen/Qwen-Image-2512` via the HF Inference API.

## Programmatic Use -- the RLVR Model

### Option 1: Run the inference script directly
```bash
pip install torch tiktoken huggingface_hub nltk
# the script downloads the weights from the Hub and runs sample generations
python nanogpt_slm_rlvr_inference_tinystories.py
```

### Option 2: Import and generate
```python
# pip install torch tiktoken huggingface_hub nltk
# place nanogpt_slm_rlvr_inference_tinystories.py in your working directory
from nanogpt_slm_rlvr_inference_tinystories import tell_story, ask, generate_text

print(tell_story("Once upon a time there was a little kitten"))
print(ask("The friendly dragon lived in"))
print(generate_text("A girl named Lily went to the park",
                     max_tokens=300, temperature=0.8, top_k=40))
```

### Option 3: Load the RLVR weights manually
```python
# place nanogpt_slm_rlvr_inference_tinystories.py in your working directory
from huggingface_hub import hf_hub_download
import torch
from nanogpt_slm_rlvr_inference_tinystories import GPTKV, GPTConfig

path  = hf_hub_download("nishantup/nanogpt-rlvr-slm-tinystories-124m",
                        "nanogpt_slm_rlvr_final.pth")
model = GPTKV(GPTConfig())          # KV cache enabled for fast generation
model.load_state_dict(torch.load(path, map_location="cpu"))
model.eval()
```

## Comparing the Three Models

The snippet below downloads all three checkpoints and prints each stage's story for the
same prompt, scored with the same VADER metric the RLVR stage was trained against:

```python
# pip install torch tiktoken huggingface_hub nltk
import torch, tiktoken, nltk
from huggingface_hub import hf_hub_download
from nanogpt_slm_rlvr_inference_tinystories import GPTKV, GPTConfig

nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

REPO = "nishantup/nanogpt-rlvr-slm-tinystories-124m"
enc  = tiktoken.get_encoding("gpt2")
cfg  = GPTConfig()
sia  = SentimentIntensityAnalyzer()

def load(fname):
    m = GPTKV(cfg)
    m.load_state_dict(torch.load(hf_hub_download(REPO, fname), map_location="cpu"))
    m.eval()
    return m

models = {
    "Pretrained": load("nanogpt_slm_tinystories_best.pth"),
    "SFT":        load("nanogpt_slm_sft_best.pth"),
    "RLVR":       load("nanogpt_slm_rlvr_final.pth"),
}

def story(m, prompt, max_tokens=250, seed=1234):
    torch.manual_seed(seed)                       # same RNG start for a fair compare
    idx  = torch.tensor(enc.encode_ordinary(prompt)).unsqueeze(0)
    with torch.no_grad():                         # inference only, no autograd graph needed
        out = m.generate(idx, max_new_tokens=max_tokens, temperature=0.8, top_k=40)
    toks = out.squeeze(0).tolist()
    if enc.eot_token in toks:                     # 50256 = <|endoftext|>, the story boundary
        toks = toks[:toks.index(enc.eot_token)]
    return enc.decode(toks)

prompt = "The little girl was sad until"
for name, m in models.items():
    s     = story(m, prompt)
    score = sia.polarity_scores(s)["compound"]
    print(f"\n=== {name}  (sentiment {score:+.3f}) ===\n{s}")
```

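Typical result: the **Pretrained** model can take the story in any emotional direction
(although its outputs already skew positive on average), the **SFT** model leans more
consistently positive, and the **RLVR** model produces the highest-sentiment continuations
on average. A single prompt and seed is only an illustration; individual samples can deviate
from this ordering.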

## Model Architecture

All three checkpoints share the same architecture:

| Attribute | Value |
|:---|:---|
| Architecture | nanoGPT (GPT-2 small: 12 layers, 12 heads, 768 dim) |
| Parameters | 124M with weight tying (about 85.4M excluding the tied token-embedding matrix) |
| Context length | 512 tokens |
| Tokenizer | tiktoken GPT-2 BPE (50,257 tokens) |
| Attention | Flash Attention when available, causal mask |
| Normalization | Pre-norm (LayerNorm before attention/MLP) |
| KV Cache | `GPTKV` variant included: caches past keys/values so each decode step processes only the new token instead of re-running the full prefix |
| EOS token | `<\|endoftext\|>` (50256) - learned story boundary |
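
As a sanity check on the parameter count, the sketch below tallies the weights of a GPT-2-small-shaped model with a 512-token context. It assumes nanoGPT's usual defaults (biases on every Linear and LayerNorm, a 4x MLP expansion, and an LM head tied to the token embedding). The actual `GPTConfig` in the inference script is not reproduced here, so treat the exact totals as illustrative; `gpt2_param_count` is a hypothetical helper.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=512):
    """Count parameters of a GPT-2-style model (assumes biases, 4x MLP, tied LM head)."""
    d = n_embd
    attn  = (d * 3 * d + 3 * d) + (d * d + d)      # fused QKV projection + output projection
    mlp   = (d * 4 * d + 4 * d) + (4 * d * d + d)  # up-projection + down-projection
    norms = 2 * (2 * d)                            # two LayerNorms per block (weight + bias)
    per_block = attn + mlp + norms
    wte  = vocab_size * d                          # token embedding, shared with the LM head
    wpe  = block_size * d                          # learned position embedding
    ln_f = 2 * d                                   # final LayerNorm
    total = n_layer * per_block + wte + wpe + ln_f
    return {"total": total, "without_token_embedding": total - wte}

counts = gpt2_param_count()
print(f"total: {counts['total'] / 1e6:.1f}M")
print(f"excluding token embedding: {counts['without_token_embedding'] / 1e6:.1f}M")
```

Under these assumptions the total comes to roughly 124M parameters, of which about 85.4M sit outside the token-embedding matrix.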

## Training Details

### Stage 1 -- Pretraining
| Attribute | Value |
|:---|:---|
| Data | TinyStories (~2.1M stories, ~470M tokens) |
| Iterations | 70,000 |
| Optimizer | AdamW (lr `6e-4` -> `1e-5` cosine, betas `(0.9, 0.95)`, wd `0.1`) |
| Batch | 32 x 512 tokens, grad-accum 4 |
| Precision | bfloat16 (A100) |
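
The cosine schedule in the table can be sketched as follows. This assumes a plain cosine decay from the peak to the floor over all 70,000 iterations with no warmup phase; the card does not say whether warmup was used, so `cosine_lr` is a hypothetical helper and not the training script's scheduler.

```python
import math

def cosine_lr(it, max_iters=70_000, lr_max=6e-4, lr_min=1e-5):
    """Cosine decay from lr_max (it=0) to lr_min (it=max_iters); no warmup assumed."""
    progress = min(max(it / max_iters, 0.0), 1.0)        # clamp to [0, 1]
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 to 0 over training
    return lr_min + coeff * (lr_max - lr_min)

for it in (0, 35_000, 70_000):
    print(it, f"{cosine_lr(it):.3e}")
```

The same function called with `max_iters=12_952, lr_max=5e-5, lr_min=5e-6` would describe the SFT schedule under the same assumptions.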

### Stage 2 -- Supervised Fine-Tuning (SFT)
| Attribute | Value |
|:---|:---|
| Data | Positive-sentiment subset of TinyStories (VADER compound > `+0.05`) -- 1.91M stories (90.2%), ~424M tokens |
| Iterations | 12,952 (~2 epochs) |
| Optimizer | AdamW, peak lr `5e-5` -> `5e-6` cosine (about 12x below pretraining) |
| Batch | 32 x 512 tokens, grad-accum 4 |
| Best val loss | 1.2037 |
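
A quick token-budget check ties the batch settings to the epoch counts. It assumes each iteration processes `batch x context x grad-accum` tokens, which the tables imply but do not state outright, and it uses the approximate token totals quoted above. The helper names are illustrative.

```python
def tokens_per_iteration(batch_size=32, context=512, grad_accum=4):
    """Tokens consumed by one optimizer step, assuming grad-accum multiplies the batch."""
    return batch_size * context * grad_accum

def epochs(iterations, dataset_tokens, tokens_per_iter):
    """Approximate number of passes over the dataset."""
    return iterations * tokens_per_iter / dataset_tokens

tpi = tokens_per_iteration()                      # 32 * 512 * 4 tokens per iteration
pretrain_epochs = epochs(70_000, 470e6, tpi)      # ~470M-token TinyStories corpus
sft_epochs      = epochs(12_952, 424e6, tpi)      # ~424M-token positive subset

print(f"tokens/iter: {tpi:,}")
print(f"pretraining: ~{pretrain_epochs:.1f} epochs, SFT: ~{sft_epochs:.1f} epochs")
```

Under this assumption the SFT run lands at about 2 epochs, matching the table, and pretraining works out to a little under 10 passes over the corpus.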

### Stage 3 -- Reinforcement Learning with Verifiable Rewards (RLVR)
| Attribute | Value |
|:---|:---|
| Algorithm | Vanilla policy gradient |
| Reward | VADER `compound` sentiment of the completed story (verifiable, deterministic) |
| Reward broadcasting | Sequence-level reward applied to every token in the trajectory |
| KL penalty | Against a **frozen SFT reference** (`beta = 0.1`) -- discourages reward hacking |
| Generation batch | 16 trajectories, 200 tokens each |
| Iterations | 200 |
| Optimizer | AdamW, lr `5e-6` |
| Mean reward | `+0.6485` -> `+0.8652` (KL stays bounded, ~`0.022`) |

**How RLVR works (one paragraph):** each iteration the policy samples a batch of stories;
a VADER sentiment analyzer scores each completed story (one scalar reward); that scalar is
broadcast to every generated token; a KL penalty against the frozen SFT model is subtracted
so the policy cannot drift into degenerate text that merely games the scorer; and the
vanilla policy-gradient loss `-(log_probs * final_rewards).mean()` is back-propagated.
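
The reward shaping described above can be sketched in a few lines. This is a plain-Python illustration on lists of per-token log-probabilities, not the training code. It assumes the KL penalty is estimated per token as the log-ratio between the policy and the frozen SFT reference, and that the shaped reward is treated as a constant when gradients are taken. The function name `rlvr_loss` and its arguments are hypothetical; in training these values would be PyTorch tensors.

```python
def rlvr_loss(log_probs, ref_log_probs, reward, beta=0.1):
    """Scalar policy-gradient loss for one sampled story.

    log_probs / ref_log_probs: log-probabilities of the sampled tokens under the
    policy and under the frozen SFT reference. reward: VADER compound score.
    """
    # Per-token KL estimate (assumption): log pi(token) - log pi_ref(token)
    kl = [lp - ref for lp, ref in zip(log_probs, ref_log_probs)]
    # Broadcast the sequence-level reward to every token, minus the KL penalty
    final_rewards = [reward - beta * k for k in kl]
    # Vanilla policy gradient: -(log_probs * final_rewards).mean()
    loss = -sum(lp * r for lp, r in zip(log_probs, final_rewards)) / len(log_probs)
    return loss, final_rewards

loss, final_rewards = rlvr_loss(
    log_probs=[-1.0, -2.0],        # toy values, not real model outputs
    ref_log_probs=[-1.0, -2.5],
    reward=0.9,
)
print(final_rewards, loss)
```

Minimizing this loss raises the log-probability of tokens in high-reward stories, while the `beta * kl` term lowers the effective reward for tokens the policy makes much more likely than the reference does.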

## Files

| File | Description |
|:---|:---|
| `nanogpt_slm_tinystories_best.pth` | Stage 1 -- pretrained weights |
| `nanogpt_slm_sft_best.pth` | Stage 2 -- SFT (positive-sentiment) weights |
| `nanogpt_slm_rlvr_final.pth` | Stage 3 -- RLVR weights (**primary model**) |
| `nanogpt_slm_rlvr_inference_tinystories.py` | Standalone inference script (RLVR + 3-model compare) |
| `config.json` | Architecture, pipeline, and training metadata |

## API Reference (`nanogpt_slm_rlvr_inference_tinystories.py`)

| Function | Description |
|:---|:---|
| `tell_story(beginning, max_tokens=500, temperature=0.8, top_k=40)` | Generate a cheerful story from an opening line (RLVR model) |
| `ask(prompt, ...)` | General text completion (alias of `generate_text`, RLVR model) |
| `generate_text(prompt, ...)` | Low-level generation with full parameter control (RLVR model) |
| `compare_models(prompt, ...)` | Generate the same prompt from all 3 stages and return stories + VADER scores |
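
For readers unfamiliar with the `temperature` and `top_k` parameters, the sketch below shows what they conventionally mean when picking the next token. It is a generic, dependency-free illustration, not the sampling code from the inference script; `sample_top_k` is a hypothetical helper.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample a token index from logits using temperature scaling and top-k filtering."""
    # Keep only the top_k highest-scoring token indices
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [logits[i] / temperature for i in top]
    # Numerically stable softmax weights over the surviving logits
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]            # toy values for a 5-token vocabulary
print(sample_top_k(toy_logits, top_k=1))           # top_k=1 reduces to greedy decoding
```

Lower `temperature` and smaller `top_k` make generations more predictable; higher values trade coherence for variety.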

## Example Outputs (RLVR Model)

**Prompt:** "Once upon a time"  *(sentiment +0.99)*
> Once upon a time, there was a little girl named Lily. She loved to play with her toys
> and her friends. One day, Lily's mommy gave her a present... She hugged the doll and
> said, "I love you, doll!"

**Prompt:** "On a bright morning"  *(sentiment +0.99)*
> On a bright morning, Molly was very excited for her first day ever. She put on her dress
> and ran outside to the garden... The rabbit smiled and said, "Thank you for coming down
> to play with us!"

## Limitations

- Short stories only (512-token context window)
- Simple vocabulary and narrative structures (by design -- TinyStories style)
- No instruction-following ability
- Strongly biased toward positive sentiment (that is the goal of the pipeline)
- English only; may occasionally repeat or produce minor inconsistencies

## Related Models (Vizuara SLM Family)

| Variant | Type | Repo |
|:---|:---|:---|
| Pretrained (TinyStories) | Base LM | [nishantup/nanogpt-pretrained-slm-tinystories-124m](https://huggingface.co/nishantup/nanogpt-pretrained-slm-tinystories-124m) |
| Instruction-tuned | SFT | [nishantup/nanogpt-slm-tinystories-instruct](https://huggingface.co/nishantup/nanogpt-slm-tinystories-instruct) |
| **This model -- pretrained, then aligned with SFT + RLHF-style RLVR** | **RLVR (RLHF-style)** | **nishantup/nanogpt-rlvr-slm-tinystories-124m** |

## Citation

```
Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be
and Still Speak Coherent English? arXiv preprint arXiv:2305.07759.
```

## Notes

- Big shout-out to **Dr. Raj Dandekar** (vizuara.ai), whose RL/RLHF workshop this pipeline follows.
- Trained completely from scratch (no pretrained initialization).
- Architecture follows Karpathy's nanoGPT; weight tying between token embeddings and LM head.
- RLVR uses a *verifiable* reward (VADER) -- deterministic, CPU-only, no reward model to train.
- All three checkpoints are provided so the full Pretrain -> SFT -> RLVR progression is reproducible.

## Author

[HF Profile](https://huggingface.co/nishantup)  
[LinkedIn Profile](https://linkedin.com/in/dr-nishant-upadhyay)