--- title: Jungle Story Time emoji: 🦁 colorFrom: green colorTo: yellow sdk: gradio sdk_version: "6.15.2" app_file: app.py pinned: false license: apache-2.0 short_description: Personalized kids' stories with AI voice narration --- # Jungle Story Time β€” Kids Story Studio πŸ¦πŸ“– > **3 taps and GO.** Pick a friend, pick a place, pick story-or-poem β€” a fine-tuned 1B model writes a gentle, personalized tale for your child, an illustration is painted for it, and it's read aloud in a sweet voice (or your own family's voice). Built for the **Build Small Hackathon (June 2026)**. --- ## What it does A bedtime-story machine for children aged 2–5. The child (or parent) makes a few taps: | Step | Choice | Options | |------|--------|---------| | 1 | **Friend** | Simba 🦁 Β· Tiger 🐯 Β· Panda 🐼 Β· Bhalu 🐻 Β· Parrot 🦜 Β· Elephant 🐘 Β· Bunny 🐰 Β· Duck πŸ¦† Β· or type any animal | | 2 | **Place** | Home 🏠 Β· Jungle 🌳 Β· Pond 🌊 Β· Mango Tree πŸ₯­ Β· Night Sky πŸŒ™ | | 3 | **Type** | Story or Poem | | 4 | **Voice** | Sunny 🌞 Β· Koyal 🐦 Β· Dadu πŸŒ™ Β· Robo (instant) πŸ€– Β· or πŸŽ™οΈ your family's voice | The app picks a surprise lesson (sharing, patience, counting, colors, animal sounds…), writes the story, streams it back word-by-word into an animated storybook poster, paints a matching watercolor illustration, and reads it aloud. It can also stitch a **narrated story "movie"** (one illustration + one narration clip per scene, joined with a Ken-Burns pan). --- ## Architecture ![Jungle Story Time architecture](./architecture.png) A child makes a few taps in the **Gradio UI** (friend, place, story type, voice). The request goes to the **Modal serverless platform**, which runs three scale-to-zero services: - **Story Agent** β€” the fine-tuned **MiniCPM5-1B** GGUF, served on CPU via `llama.cpp`. - **Voice Agent** β€” **VoxCPM2** for narration, with designed voices and zero-shot family-voice cloning. - **Illustrator** β€” **FLUX.2-klein-4B** (4-bit GGUF) paints one storybook picture per tale. A fast session cache reuses identical prompts so repeats are instant. The same fine-tuned story model also runs **fully offline** on a laptop CPU via `llama-cpp-python` + the Q4_K_M GGUF (`LOCAL_MODE=1`) β€” no cloud, no API, the child's name never leaves the machine. --- ## Models Three models power the app β€” one fine-tuned, two strong off-the-shelf bases: | Service | Model | Fine-tuned? | Notes | |---------|-------|:---:|-------| | **Story** | `ThePradip/minicpm5-1b-kids-storyteller-GGUF` (1B, Q4_K_M, llama.cpp, CPU) | βœ… Unsloth LoRA on `openbmb/MiniCPM5-1B` (A10G) | published with full-precision sibling `ThePradip/minicpm5-1b-kids-storyteller` | | **Voice** | `openbmb/VoxCPM2` (2B) | ❌ off-the-shelf | designed voices + zero-shot family-voice cloning, no training needed | | **Image** | `unsloth/FLUX.2-klein-4B-GGUF` (4B, Q4_K_M, diffusers) | ❌ off-the-shelf | kid-safe by prompt construction, few-step distilled | The fine-tune wired into production is the **story model**; the voice and image models are used as-is. ### Why a 1B story model? A 2–5 year old needs **fast** (no 30s waits), **private** (it hears your child's name), **cheap** (runs on the family laptop), and **stylistically reliable** (tiny sentences, sound words, a refrain repeated 3Γ—). Style is exactly what small-model fine-tuning is good at. `MiniCPM5-1B` is a plain `LlamaForCausalLM`, so the whole Unsloth toolchain works with zero patches, and the Q4_K_M GGUF runs at **~60 tokens/sec on an M-series MacBook CPU**. --- ## Dataset The story model is trained on **hand-authored** stories β€” not scraped, not bulk-generated by a frontier API. | Dataset | Size | Use | |---------|------|-----| | Hand-authored (`build-small-hackathon/kids-story`) | 129 plots β†’ 258 examples (name variants) | **SFT for the 1B story model** β€” 12 categories, style-validated | **Categories (12):** animals Β· birds Β· colors Β· shapes Β· surroundings Β· family Β· friends Β· environment Β· morals Β· speech practice Β· early learning Β· rhymes **Style contract** β€” machine-enforced by `finetune/author_kit.py` before publishing: - 30–170 words Β· average sentence ≀ 9 words - a refrain repeated exactly 3Γ— - at least one sound word (quack, splash, whoosh…) - a visualizability score β€” concrete picture-words a toddler can see; abstract stories rejected - nothing scary Β· clean ending Β· the child's name appears --- ## Fine-tuning pipeline ``` finetune/ β”œβ”€β”€ author_kit.py ← validate + push the hand-authored dataset β”œβ”€β”€ generate_dataset.py ← build the personalized SFT set β”œβ”€β”€ modal_finetune.py ← MiniCPM5-1B LoRA training job β”œβ”€β”€ eval_base_vs_tuned.py ← before/after quality comparison └── data/ ← authored stories (markdown blocks) ``` ### Training jobs Each job below maps to a function in `finetune/modal_finetune.py` (GPU and method as declared there): | Job | GPU | Method | |-----|-----|--------| | Dataset prep | CPU (Modal) | author-kit validation | | MiniCPM5-1B LoRA | A10G | Unsloth | | Serving endpoints (story Β· voice Β· image) | A10G | scale-to-zero (β‰ˆ$0 idle, billed per second) | ### Key training decisions - **`transformers==4.57.3` is the one pin that matters** β€” Unsloth's patcher only supports ≀4.57.x. On Modal's clean image, pin only this. - **Model's own chat template**, `enable_thinking=False` so stories stream with no visible `` block. - **`train_on_responses_only`** β€” loss on the story tokens, not the prompt. - **LoRA r=16, Ξ±=32, dropout 0.05** on attention+MLP (`q,k,v,o,gate,up,down`). ~1% of weights. - **Validation gate before every Hub push** β€” generates 5 unseen test stories, checks name faithfulness + style + safety wordlist, and requires loss β‰₯10% below start. Fail β†’ nothing publishes. ### Training results (1B story model) | Round | Examples | Epochs | Loss | Validation | |-------|----------|--------|------|-----------| | 1 | 69 | 5 | 2.61 β†’ 2.04 (βˆ’22%) | 2/3 pass | | 2 | 258 | 3 | 2.64 β†’ 1.54 (βˆ’42%) | 5/5 pass βœ… | Round 1 exposed the classic small-model failure (asked for *Veer*, wrote a perfect story starring *Sam*). Style transfers from a tiny set; name-faithfulness needed more data β€” round 2 fixed it. --- ## Running locally ```bash pip install gradio llama-cpp-python requests # cloud story + voice + image endpoints (default) python app_jungle.py # β†’ http://localhost:7868 # fully offline story generation (no internet after the model download) LOCAL_MODE=1 python app_jungle.py ``` The first run downloads the Q4_K_M GGUF (~700 MB) and caches it. After that, story generation is fully offline. ### Environment variables | Var | Default | Purpose | |-----|---------|---------| | `STORY_ENDPOINT` | the deployed Modal URL | cloud story generation | | `VOICE_ENDPOINT` | the deployed Modal URL | VoxCPM2 narration | | `IMAGE_ENDPOINT` | the deployed Modal URL | FLUX.2-klein illustration | | `LOCAL_MODE` | unset | `1` = generate stories on-device via llama.cpp | --- ## Deploying to Modal ```bash # one-time pip install modal && modal setup modal secret create huggingface HF_TOKEN=hf_xxx # write token # fine-tune (optional β€” the story model is already published) modal run finetune/modal_finetune.py::prepare_dataset modal run finetune/modal_finetune.py::train_minicpm5 # deploy the three serving endpoints modal deploy serve_story_modal.py # MiniCPM5-1B GGUF β†’ ...-storyteller-narrate modal deploy serve_voice_modal.py # VoxCPM2 β†’ ...-voxnarrator-speak modal deploy serve_image_modal.py # FLUX.2-klein-4B β†’ ...-illustrator-draw # OR deploy the whole Gradio UI on Modal (GGUF baked in) modal deploy deploy_jungle_modal.py # β†’ https://--jungle-story-time-ui.modal.run ``` All endpoints are **scale-to-zero** (β‰ˆ$0 idle, billed per second) and bake their weights into the image so cold starts skip the download. --- ## Endpoints (API) ``` POST …kids-story-api-storyteller-narrate.modal.run {"kid":"Riya","age":3,"characters":"a yellow duck", "place":"the pond","lesson":"the quack sound","kind":"story"} β†’ {"story":"…","tokens":123,"seconds":8.1} POST …kids-voice-tts-voxnarrator-speak.modal.run {"text":"…","voice":"sunny|koyal|dadu|clone","reference_b64":"…"} β†’ audio/wav POST …kids-image-gen-illustrator-draw.modal.run {"prompt":"children's picture-book illustration of …"} β†’ image/png (1024Γ—1024) ``` --- ## Features - **Animated storybook poster** β€” per-scene backdrop (jungle/pond/night…), bouncing friend, floating decorations, staggered story lines, high-contrast cards. - **Word-by-word streaming** while the model writes. - **AI illustration** painted for each tale (cached so repeats are instant). - **Voices** β€” three designed narrator voices (synthesized by VoxCPM2 Voice Design, no real person), an instant browser-speech "Robo" option, and **family-voice cloning** (record ~15s β†’ the story is read in that voice; opt-in, session-scoped). - **Story movie** β€” one picture + one narration clip per scene, stitched into a Ken-Burns video. - **Offline mode** β€” `LOCAL_MODE=1` runs the GGUF on the laptop CPU. --- ## App modes | Mode | File | Story engine | Voice | Use case | |------|------|-------------|-------|---------| | Jungle UI | `app_jungle.py` | Modal endpoint β†’ llama.cpp fallback | VoxCPM2 or browser speech | Main app for kids | | Badge-sweep build | `app_server.py` + `index.html` | local llama.cpp GGUF | browser speech (Off-the-Grid) | Hackathon submission | | Classic Gradio | `app.py` | Modal endpoint | β€” | model-comparison / ZeroGPU demo | --- ## Hackathon badges | Badge | Evidence | |-------|---------| | πŸ”Œ **Off the Grid** | `app_server.py` runs the GGUF locally via llama.cpp; read-aloud uses the browser's speech engine. Cloud is opt-in and labeled. | | πŸ¦™ **Llama Champion** | `llama-cpp-python` is the inference runtime (`Llama.from_pretrained(…q4_k_m.gguf)`) | | 🎯 **Well-Tuned** | published LoRA fine-tune of the MiniCPM5-1B story model (+ GGUF) | | 🎨 **Off-Brand** | `gr.Server` + hand-built `index.html` storybook UI β€” no default Gradio interface | | πŸ“‘ **Sharing is Caring** | every generation logged to `traces/*.jsonl` β†’ `push_traces.py` publishes them as a Hub dataset | | πŸ““ **Field Notes** | `FIELD_NOTES.md` + `TUTORIAL.md` blog drafts | Bonus targets: **Tiny Titan** (the 1B is the default brain) Β· **Off-Brand Award** ($1,500) --- ## Voice & data ethics - The narrator voices are **designed/synthetic** (VoxCPM2 Voice Design from text descriptions) β€” no real person's voice is used by default. - **Family-voice cloning** is explicit opt-in: a parent records their own clip; it is stored on the user's own Modal volume and used only to read that family's stories. - We do **not** clone children's voices. The fine-tuned narrator voice was trained from **consented adult narrator** datasets. - In offline mode the model that hears the child's name never leaves the device. --- ## Repo structure ``` 07_kids_story_studio/ β”œβ”€β”€ app_jungle.py ← main Jungle UI (Gradio, ages 2–5) β”œβ”€β”€ app.py ← classic Gradio (model comparison) β”œβ”€β”€ app_server.py ← badge-sweep build (gr.Server + vanilla HTML) β”œβ”€β”€ index.html ← hand-built storybook frontend β”œβ”€β”€ serve_story_modal.py ← Modal story endpoint (MiniCPM5-1B GGUF, llama.cpp) β”œβ”€β”€ serve_voice_modal.py ← Modal voice endpoint (VoxCPM2) β”œβ”€β”€ serve_image_modal.py ← Modal image endpoint (FLUX.2-klein-4B GGUF) β”œβ”€β”€ deploy_jungle_modal.py ← deploy the full Gradio UI on Modal β”œβ”€β”€ push_traces.py ← publish generation traces to the Hub β”œβ”€β”€ architecture.png ← system diagram (above) β”œβ”€β”€ requirements.txt β”œβ”€β”€ TUTORIAL.md ← full fine-tuning walkthrough (blog-ready) β”œβ”€β”€ FIELD_NOTES.md ← hackathon blog draft β”œβ”€β”€ authoring/ ← hand-authored story markdown files └── finetune/ β”œβ”€β”€ author_kit.py ← validate + push dataset β”œβ”€β”€ modal_finetune.py ← MiniCPM5-1B LoRA training β”œβ”€β”€ eval_base_vs_tuned.py └── data/ ← raw story drafts ``` --- ## Links - Story model + GGUF: [ThePradip/minicpm5-1b-kids-storyteller](https://huggingface.co/ThePradip/minicpm5-1b-kids-storyteller) - Dataset: [build-small-hackathon/kids-story](https://huggingface.co/datasets/build-small-hackathon/kids-story) - Demo Space: [build-small-hackathon/jungle-story-time](https://huggingface.co/spaces/build-small-hackathon/jungle-story-time) - Full tutorial: [TUTORIAL.md](./TUTORIAL.md) Β· Field notes: [FIELD_NOTES.md](./FIELD_NOTES.md) ```