---
title: Jungle Story Time
emoji: 🦁
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: "6.15.2"
app_file: app.py
pinned: false
license: apache-2.0
short_description: Personalized kids' stories with AI voice narration
---

# Jungle Story Time — Kids Story Studio 🦁📖

> **3 taps and GO.** Pick a friend, pick a place, pick story-or-poem — a fine-tuned 1B model writes a gentle, personalized tale for your child, an illustration is painted for it, and it's read aloud in a sweet voice (or your own family's voice).

Built for the **Build Small Hackathon (June 2026)**.

---

## What it does

A bedtime-story machine for children aged 2–5. The child (or parent) makes a few taps:

| Step | Choice | Options |
|------|--------|---------|
| 1 | **Friend** | Simba 🦁 · Tiger 🐯 · Panda 🐼 · Bhalu 🐻 · Parrot 🦜 · Elephant 🐘 · Bunny 🐰 · Duck 🦆 · or type any animal |
| 2 | **Place** | Home 🏠 · Jungle 🌳 · Pond 🌊 · Mango Tree 🥭 · Night Sky 🌙 |
| 3 | **Type** | Story or Poem |
| 4 | **Voice** | Sunny 🌞 · Koyal 🐦 · Dadu 🌙 · Robo (instant) 🤖 · or 🎙️ your family's voice |

The app picks a surprise lesson (sharing, patience, counting, colors, animal sounds…), writes the story, streams it back word-by-word into an animated storybook poster, paints a matching watercolor illustration, and reads it aloud. It can also stitch a **narrated story "movie"** (one illustration + one narration clip per scene, joined with a Ken-Burns pan).

---

## Architecture

![Jungle Story Time architecture](./architecture.png)

A child makes a few taps in the **Gradio UI** (friend, place, story type, voice). The request goes to the **Modal serverless platform**, which runs three scale-to-zero services:

- **Story Agent** — the fine-tuned **MiniCPM5-1B** GGUF, served on CPU via `llama.cpp`.
- **Voice Agent** — **VoxCPM2** for narration, with designed voices and zero-shot family-voice cloning.
- **Illustrator** — **FLUX.2-klein-4B** (4-bit GGUF) paints one storybook picture per tale.

A fast session cache reuses identical prompts so repeats are instant. The same fine-tuned story model also runs **fully offline** on a laptop CPU via `llama-cpp-python` + the Q4_K_M GGUF (`LOCAL_MODE=1`) — no cloud, no API, the child's name never leaves the machine.

---

## Models

Three models power the app — one fine-tuned, two strong off-the-shelf bases:

| Service | Model | Fine-tuned? | Notes |
|---------|-------|:---:|-------|
| **Story** | `ThePradip/minicpm5-1b-kids-storyteller-GGUF` (1B, Q4_K_M, llama.cpp, CPU) | ✅ Unsloth LoRA on `openbmb/MiniCPM5-1B` (A10G) | published with full-precision sibling `ThePradip/minicpm5-1b-kids-storyteller` |
| **Voice** | `openbmb/VoxCPM2` (2B) | ❌ off-the-shelf | designed voices + zero-shot family-voice cloning, no training needed |
| **Image** | `unsloth/FLUX.2-klein-4B-GGUF` (4B, Q4_K_M, diffusers) | ❌ off-the-shelf | kid-safe by prompt construction, few-step distilled |

The fine-tune wired into production is the **story model**; the voice and image models are used as-is.

### Why a 1B story model?

A 2–5 year old needs **fast** (no 30s waits), **private** (it hears your child's name), **cheap** (runs on the family laptop), and **stylistically reliable** (tiny sentences, sound words, a refrain repeated 3×). Style is exactly what small-model fine-tuning is good at. `MiniCPM5-1B` is a plain `LlamaForCausalLM`, so the whole Unsloth toolchain works with zero patches, and the Q4_K_M GGUF runs at **~60 tokens/sec on an M-series MacBook CPU**.

---

## Dataset

The story model is trained on **hand-authored** stories — not scraped, not bulk-generated by a frontier API.

| Dataset | Size | Use |
|---------|------|-----|
| Hand-authored (`build-small-hackathon/kids-story`) | 129 plots → 258 examples (name variants) | **SFT for the 1B story model** — 12 categories, style-validated |

**Categories (12):** animals · birds · colors · shapes · surroundings · family · friends · environment · morals · speech practice · early learning · rhymes

**Style contract** — machine-enforced by `finetune/author_kit.py` before publishing:
- 30–170 words · average sentence ≤ 9 words
- a refrain repeated exactly 3×
- at least one sound word (quack, splash, whoosh…)
- a visualizability score — concrete picture-words a toddler can see; abstract stories rejected
- nothing scary · clean ending · the child's name appears

---

## Fine-tuning pipeline

```
finetune/
├── author_kit.py          ← validate + push the hand-authored dataset
├── generate_dataset.py    ← build the personalized SFT set
├── modal_finetune.py      ← MiniCPM5-1B LoRA training job
├── eval_base_vs_tuned.py  ← before/after quality comparison
└── data/                  ← authored stories (markdown blocks)
```

### Training jobs

Each job below maps to a function in `finetune/modal_finetune.py` (GPU and method
as declared there):

| Job | GPU | Method |
|-----|-----|--------|
| Dataset prep | CPU (Modal) | author-kit validation |
| MiniCPM5-1B LoRA | A10G | Unsloth |
| Serving endpoints (story · voice · image) | A10G | scale-to-zero (≈$0 idle, billed per second) |

### Key training decisions

- **`transformers==4.57.3` is the one pin that matters** — Unsloth's patcher only supports ≤4.57.x. On Modal's clean image, pin only this.
- **Model's own chat template**, `enable_thinking=False` so stories stream with no visible `<think>` block.
- **`train_on_responses_only`** — loss on the story tokens, not the prompt.
- **LoRA r=16, α=32, dropout 0.05** on attention+MLP (`q,k,v,o,gate,up,down`). ~1% of weights.
- **Validation gate before every Hub push** — generates 5 unseen test stories, checks name faithfulness + style + safety wordlist, and requires loss ≥10% below start. Fail → nothing publishes.

### Training results (1B story model)

| Round | Examples | Epochs | Loss | Validation |
|-------|----------|--------|------|-----------|
| 1 | 69 | 5 | 2.61 → 2.04 (−22%) | 2/3 pass |
| 2 | 258 | 3 | 2.64 → 1.54 (−42%) | 5/5 pass ✅ |

Round 1 exposed the classic small-model failure (asked for *Veer*, wrote a perfect story starring *Sam*). Style transfers from a tiny set; name-faithfulness needed more data — round 2 fixed it.

---

## Running locally

```bash
pip install gradio llama-cpp-python requests

# cloud story + voice + image endpoints (default)
python app_jungle.py            # → http://localhost:7868

# fully offline story generation (no internet after the model download)
LOCAL_MODE=1 python app_jungle.py
```

The first run downloads the Q4_K_M GGUF (~700 MB) and caches it. After that, story generation is fully offline.

### Environment variables

| Var | Default | Purpose |
|-----|---------|---------|
| `STORY_ENDPOINT` | the deployed Modal URL | cloud story generation |
| `VOICE_ENDPOINT` | the deployed Modal URL | VoxCPM2 narration |
| `IMAGE_ENDPOINT` | the deployed Modal URL | FLUX.2-klein illustration |
| `LOCAL_MODE` | unset | `1` = generate stories on-device via llama.cpp |

---

## Deploying to Modal

```bash
# one-time
pip install modal && modal setup
modal secret create huggingface HF_TOKEN=hf_xxx     # write token

# fine-tune (optional — the story model is already published)
modal run finetune/modal_finetune.py::prepare_dataset
modal run finetune/modal_finetune.py::train_minicpm5

# deploy the three serving endpoints
modal deploy serve_story_modal.py     # MiniCPM5-1B GGUF  → ...-storyteller-narrate
modal deploy serve_voice_modal.py     # VoxCPM2           → ...-voxnarrator-speak
modal deploy serve_image_modal.py     # FLUX.2-klein-4B   → ...-illustrator-draw

# OR deploy the whole Gradio UI on Modal (GGUF baked in)
modal deploy deploy_jungle_modal.py   # → https://<you>--jungle-story-time-ui.modal.run
```

All endpoints are **scale-to-zero** (≈$0 idle, billed per second) and bake their weights into the image so cold starts skip the download.

---

## Endpoints (API)

```
POST  …kids-story-api-storyteller-narrate.modal.run
      {"kid":"Riya","age":3,"characters":"a yellow duck",
       "place":"the pond","lesson":"the quack sound","kind":"story"}
   →  {"story":"…","tokens":123,"seconds":8.1}

POST  …kids-voice-tts-voxnarrator-speak.modal.run
      {"text":"…","voice":"sunny|koyal|dadu|clone","reference_b64":"…"}
   →  audio/wav

POST  …kids-image-gen-illustrator-draw.modal.run
      {"prompt":"children's picture-book illustration of …"}
   →  image/png  (1024×1024)
```

---

## Features

- **Animated storybook poster** — per-scene backdrop (jungle/pond/night…), bouncing friend, floating decorations, staggered story lines, high-contrast cards.
- **Word-by-word streaming** while the model writes.
- **AI illustration** painted for each tale (cached so repeats are instant).
- **Voices** — three designed narrator voices (synthesized by VoxCPM2 Voice Design, no real person), an instant browser-speech "Robo" option, and **family-voice cloning** (record ~15s → the story is read in that voice; opt-in, session-scoped).
- **Story movie** — one picture + one narration clip per scene, stitched into a Ken-Burns video.
- **Offline mode** — `LOCAL_MODE=1` runs the GGUF on the laptop CPU.

---

## App modes

| Mode | File | Story engine | Voice | Use case |
|------|------|-------------|-------|---------|
| Jungle UI | `app_jungle.py` | Modal endpoint → llama.cpp fallback | VoxCPM2 or browser speech | Main app for kids |
| Badge-sweep build | `app_server.py` + `index.html` | local llama.cpp GGUF | browser speech (Off-the-Grid) | Hackathon submission |
| Classic Gradio | `app.py` | Modal endpoint | — | model-comparison / ZeroGPU demo |

---

## Hackathon badges

| Badge | Evidence |
|-------|---------|
| 🔌 **Off the Grid** | `app_server.py` runs the GGUF locally via llama.cpp; read-aloud uses the browser's speech engine. Cloud is opt-in and labeled. |
| 🦙 **Llama Champion** | `llama-cpp-python` is the inference runtime (`Llama.from_pretrained(…q4_k_m.gguf)`) |
| 🎯 **Well-Tuned** | published LoRA fine-tune of the MiniCPM5-1B story model (+ GGUF) |
| 🎨 **Off-Brand** | `gr.Server` + hand-built `index.html` storybook UI — no default Gradio interface |
| 📡 **Sharing is Caring** | every generation logged to `traces/*.jsonl` → `push_traces.py` publishes them as a Hub dataset |
| 📓 **Field Notes** | `FIELD_NOTES.md` + `TUTORIAL.md` blog drafts |

Bonus targets: **Tiny Titan** (the 1B is the default brain) · **Off-Brand Award** ($1,500)

---

## Voice & data ethics

- The narrator voices are **designed/synthetic** (VoxCPM2 Voice Design from text descriptions) — no real person's voice is used by default.
- **Family-voice cloning** is explicit opt-in: a parent records their own clip; it is stored on the user's own Modal volume and used only to read that family's stories.
- We do **not** clone children's voices. The fine-tuned narrator voice was trained from **consented adult narrator** datasets.
- In offline mode the model that hears the child's name never leaves the device.

---

## Repo structure

```
07_kids_story_studio/
├── app_jungle.py          ← main Jungle UI (Gradio, ages 2–5)
├── app.py                 ← classic Gradio (model comparison)
├── app_server.py          ← badge-sweep build (gr.Server + vanilla HTML)
├── index.html             ← hand-built storybook frontend
├── serve_story_modal.py   ← Modal story endpoint   (MiniCPM5-1B GGUF, llama.cpp)
├── serve_voice_modal.py   ← Modal voice endpoint    (VoxCPM2)
├── serve_image_modal.py   ← Modal image endpoint    (FLUX.2-klein-4B GGUF)
├── deploy_jungle_modal.py ← deploy the full Gradio UI on Modal
├── push_traces.py         ← publish generation traces to the Hub
├── architecture.png       ← system diagram (above)
├── requirements.txt
├── TUTORIAL.md            ← full fine-tuning walkthrough (blog-ready)
├── FIELD_NOTES.md         ← hackathon blog draft
├── authoring/             ← hand-authored story markdown files
└── finetune/
    ├── author_kit.py      ← validate + push dataset
    ├── modal_finetune.py  ← MiniCPM5-1B LoRA training
    ├── eval_base_vs_tuned.py
    └── data/              ← raw story drafts
```

---

## Links

- Story model + GGUF: [ThePradip/minicpm5-1b-kids-storyteller](https://huggingface.co/ThePradip/minicpm5-1b-kids-storyteller)
- Dataset: [build-small-hackathon/kids-story](https://huggingface.co/datasets/build-small-hackathon/kids-story)
- Demo Space: [build-small-hackathon/jungle-story-time](https://huggingface.co/spaces/build-small-hackathon/jungle-story-time)
- Full tutorial: [TUTORIAL.md](./TUTORIAL.md) · Field notes: [FIELD_NOTES.md](./FIELD_NOTES.md)
```