Three Tiny Models, One Bedtime Story

Published June 15, 2026


How three small, open AI models, one that writes, one that speaks, and one that draws, come together into a private, low-cost storyteller for young children. Built for the Build Small Hackathon, June 2026.

A four-year-old taps a tiger, then a jungle, then "story." A few seconds later her own name appears on screen inside a tale about a brave little cub, and a warm voice begins reading it aloud. None of the giant AI models you read about in the news are involved. The story comes from a small AI model we trained ourselves for about two dollars. The voice and the picture come from two more small, free-to-use models that cost almost nothing to run.

This article explains how the three fit together, the decisions that mattered, and the bugs that cost us an afternoon, so you can rebuild it or borrow the parts.

Everything here is open: the dataset, the fine-tuned model and its compact offline version, and the supporting models are all freely available. There are no paid API calls, only open models running on rented GPUs.

Why a small model is the right choice

A storyteller for a two-to-five-year-old doesn't need a frontier model. It needs four things: speed (a toddler won't wait thirty seconds), privacy (it hears your child's name), low cost (it should run on a family laptop), and a reliable style built from very short sentences, playful sound words, and a refrain repeated three times.

That last point is the key. Getting the style consistent is exactly what a small model is good at, once you fine-tune it. Fine-tuning means taking an existing open model and training it a little further on your own examples until it picks up the style you want. We started from MiniCPM5-1B, a small, freely available model that the standard open-source training tools work with out of the box.

Step 1 — Curate the dataset (the part most people skip)

The single biggest lever on quality was the training data. We tried three ways of producing children's stories, in increasing order of quality:

  1. Using a free, small AI model as the "teacher." A small vision model wrote the stories. Only 8–17% were usable: refrains broke down, and instructions leaked into the text. A small model can't teach good style.
  2. Using a strong open model as the teacher. A 32-billion-parameter model, prompted carefully and filtered hard, produced roughly 67% usable stories at a cost of about one dollar per three thousand stories. Better, but still inconsistent.
  3. Curating and augmenting a small set ourselves. We kept only the stories that passed a strict style contract, edited them into shape, and augmented the set with name variants. Every story is usable and on-style. This is what we shipped.

The curated set is 129 stories and poems across twelve categories: animals, birds, colors, shapes, surroundings, family, friends, the environment, simple morals, speech practice ("Ba, ba, ball!"), early learning (counting, big and small), and rhymes. Each story is a plain text block with a short header, so anyone can add or edit one:

### kid: Diya | age: 2 | characters: a little duck | place: the pond | category: speak | teach: the quack sound
Diya is at the pond.
A yellow duck swims by.
Quack, quack, little duck!
...
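A minimal sketch of how a block like this could be parsed, assuming every header field is a `key: value` pair separated by `|` after a leading `###`. The function name `parse_story` and the returned layout are illustrative and are not taken from `author_kit.py`.

```python
def parse_story(block: str) -> dict:
    """Split one story block into header fields and body lines.

    Assumes the first line looks like '### key: value | key: value | ...'.
    """
    header, *body = block.strip().splitlines()
    if not header.startswith("###"):
        raise ValueError("story block must start with a '###' header")

    fields = {}
    for part in header[3:].split("|"):
        key, sep, value = part.partition(":")
        if not sep:
            raise ValueError(f"malformed header field: {part!r}")
        fields[key.strip()] = value.strip()

    # Keep only non-empty body lines, one sentence or phrase per line.
    lines = [ln.strip() for ln in body if ln.strip()]
    return {"meta": fields, "lines": lines}


example = """### kid: Diya | age: 2 | characters: a little duck | place: the pond | category: speak | teach: the quack sound
Diya is at the pond.
A yellow duck swims by.
Quack, quack, little duck!"""

story = parse_story(example)
```

Keeping the header on a single line means a story can be edited in any text editor and still be checked by a script.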

A small validation script checks every story against the same rules before anything is published: 30–170 words, an average of nine words per sentence or fewer, a refrain repeated three times, sound words present, concrete and easy-to-picture content (abstract stories are rejected), nothing frightening, and a clean ending.
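Three of these rules are easy to express in code. The sketch below is an illustration, not the project's actual validator: it assumes the refrain is a full line repeated verbatim at least three times, counts sentences by splitting on end punctuation and line breaks, and uses a hypothetical function name, `style_problems`.

```python
import re
from collections import Counter


def style_problems(text, min_words=30, max_words=170,
                   max_avg_sentence_len=9.0, refrain_repeats=3):
    """Return a list of human-readable problems; an empty list means the story passes."""
    problems = []

    # Rule: 30 to 170 words.
    words = re.findall(r"[A-Za-z']+", text)
    if not min_words <= len(words) <= max_words:
        problems.append(f"word count {len(words)} outside {min_words}-{max_words}")

    # Rule: nine words per sentence or fewer, on average.
    sentences = [s for s in re.split(r"[.!?\n]+", text) if s.strip()]
    if sentences:
        avg = len(words) / len(sentences)
        if avg > max_avg_sentence_len:
            problems.append(f"average sentence length {avg:.1f} exceeds {max_avg_sentence_len}")

    # Rule: a refrain repeated three times (assumed here to be an identical line).
    line_counts = Counter(ln.strip().lower() for ln in text.splitlines() if ln.strip())
    if not any(n >= refrain_repeats for n in line_counts.values()):
        problems.append(f"no line repeated {refrain_repeats} times (refrain missing)")

    return problems


good_story = "\n".join([
    "Diya is at the pond.",
    "Quack, quack, little duck!",
    "A yellow duck swims by.",
    "Quack, quack, little duck!",
    "The duck splashes in the water.",
    "Quack, quack, little duck!",
    "Diya waves bye to the duck.",
    "The duck swims home to sleep.",
])
```

Returning a list of problems rather than a single pass/fail flag lets the tool report exactly what to fix in each story.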

One lesson worth repeating: word-blocking filters need to respect word boundaries. Our first version rejected a perfectly safe story because the word "soldiers" contains the letters "die."
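The difference is easy to see in a few lines. This sketch uses a tiny illustrative blocklist (the project's real list is not shown in this article) and compares a naive substring check with a regular expression that uses `\b` word boundaries.

```python
import re

# Illustrative entries only; the real safety list is not reproduced here.
BLOCKED = ["die", "scary"]

# Naive check: matches letters anywhere, so "soldiers" trips on "die".
naive_hit = any(word in "the toy soldiers march" for word in BLOCKED)

# Boundary-aware check: \b only matches at the edges of whole words.
BLOCKED_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in BLOCKED) + r")\b",
    re.IGNORECASE,
)


def blocked_words(text: str) -> list:
    """Return the blocked words that appear in the text as whole words."""
    return [m.group(0).lower() for m in BLOCKED_RE.finditer(text)]
```

With the boundary-aware pattern, "soldiers" passes while a standalone "die" is still caught.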

python author_kit.py validate     # reports exactly what to fix, per story
python author_kit.py push your-org/your-dataset

Step 2 — Set up the cloud account (about ten minutes, once)

We trained and serve everything on Modal, a service that runs your code on cloud GPUs and bills only for the seconds it actually runs. New accounts include a monthly free compute allowance.

Setup is short. Install and sign in to the command-line tool, then give it permission to publish the finished model to the Hugging Face model hub. The access token is stored as a secret, never pasted into code or chat. There are no servers to manage, no GPU drivers to install, and nothing to remember to shut down.

Step 3 — The training script

A single file describes the whole training run: which model, which cloud machine, and the settings. A few lessons stuck:

  • Change as little as possible. We locked exactly one software version, the only one the training tools fully supported. Every extra thing you lock down is something that breaks later.
  • Let the model learn from the answers, not the questions, so it learns to write stories rather than echo the prompts back.
  • Train light. Instead of retraining the entire model, we updated only a tiny slice of it, about 1%. That's fast, cheap, and runs on a modest machine.
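The second point is often implemented by masking the prompt tokens out of the loss. The sketch below shows the idea in plain Python, assuming token IDs are already available; the article does not say which training tool or setting the project used, and `build_example` is a hypothetical helper. The value `-100` is the label that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_example(prompt_ids: list, answer_ids: list) -> dict:
    """Concatenate prompt and answer, but only let the answer contribute to the loss."""
    input_ids = list(prompt_ids) + list(answer_ids)
    # Prompt positions get IGNORE_INDEX, so the model is never rewarded
    # for reproducing the request, only for writing the story.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return {"input_ids": input_ids, "labels": labels}


# Toy token IDs: three prompt tokens followed by two story tokens.
example = build_example([1, 2, 3], [7, 8])
```

The model still reads the prompt as context; it simply receives no training signal for predicting it.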

Step 4 — Run it

A single command builds the environment once (and caches it for next time), downloads the dataset, and trains on an A10G, a mid-range cloud GPU. Logs stream to your terminal and dashboard. Most of the elapsed time is the one-time setup; the fine-tuning itself is a short job, because the model is small.

Step 5 — Validate before publishing

The script refuses to publish unless the model proves itself first. It writes several fresh test stories with new child names and checks the length, that the child's name actually shows up, and that the text passes a safety word list. It also checks that the model clearly improved during training. If any check fails, nothing is published.
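A gate like this can be sketched in a few lines. Everything below is an assumption made for illustration: the function names, the 30 to 170 word bounds (borrowed from the dataset rules), the two-word safety list, and the use of a before-and-after loss comparison as the "model improved" check.

```python
import re


def gate_failures(samples, loss_before, loss_after,
                  blocked=("scary", "monster"), min_words=30, max_words=170):
    """samples is a list of (child_name, story_text) pairs from fresh test prompts."""
    failures = []
    blocked_re = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in blocked) + r")\b", re.IGNORECASE
    )
    for name, text in samples:
        n_words = len(text.split())
        if not min_words <= n_words <= max_words:
            failures.append(f"{name}: length {n_words} words out of range")
        if not re.search(rf"\b{re.escape(name)}\b", text):
            failures.append(f"{name}: child's name missing from story")
        if blocked_re.search(text):
            failures.append(f"{name}: blocked word found")
    if not loss_after < loss_before:
        failures.append("training loss did not improve")
    return failures


def publish_if_valid(samples, loss_before, loss_after, publish):
    """Call publish() only when every check passes; otherwise return the failures."""
    failures = gate_failures(samples, loss_before, loss_after)
    if failures:
        return failures  # nothing is published
    publish()
    return []
```

The important property is structural: the publish step is only reachable through the checks, so a silent failure upstream cannot overwrite a good artifact.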

This matters more than it looks. Earlier in the project, a data step failed silently and the pipeline cheerfully published a tiny, broken dataset over a good one. A validation gate between "the pipeline produced something" and "the something is now public" is the cheapest insurance in this kind of work.

The first round revealed an honest limitation: asked for a story about a child named Veer, the model wrote a perfectly on-style story, but named the child Sam instead. The style transferred reliably. The specific name did not always transfer. More data was the fix, and a second round confirmed it: we grew the 129 stories to 258 (adding a name-swapped variant of each), and the retrained model passed every check, including the exact name test that had failed before.
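Name-swap augmentation of this kind can be sketched as follows. The article does not describe how the variants were produced, so the helper names, the way a replacement name is chosen, and the name pool are all assumptions. The swap uses word boundaries so a name is never replaced inside a longer word.

```python
import re


def swap_name(story: str, old: str, new: str) -> str:
    """Replace every whole-word occurrence of the child's name, header included."""
    return re.sub(rf"\b{re.escape(old)}\b", new, story)


def augment(stories, names):
    """stories is a list of (kid_name, text); returns originals plus one variant each."""
    out = list(stories)
    for i, (kid, text) in enumerate(stories):
        candidates = [n for n in names if n != kid]
        new = candidates[i % len(candidates)]  # deterministic pick, for reproducibility
        out.append((new, swap_name(text, kid, new)))
    return out


stories = [("Diya", "### kid: Diya | age: 2\nDiya is at the pond.")]
augmented = augment(stories, ["Diya", "Veer", "Riya"])
```

One variant per story doubles the set, which matches the jump from 129 to 258. A plain swap like this does not adjust pronouns, so stories that use "he" or "she" would need a manual check.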

A passing sample (from the unseen prompt "Riya, age 3, a yellow duck, the quack sound"):

Riya sees a yellow duck at the pond. The duck quacks, quacks, quacks! His legs wiggle like little frogs. The water is warm and cool. See, see! Riya hops over one leg. The duck quacks, quacks, quacks! …

Refrain, sound words, very short sentences, and the child's name: the full style, learned from a small, carefully curated set.

Step 6 — Ship it small, so it runs offline

The pipeline publishes two versions of the finished model: the full-size version, and a compact GGUF file (about 700 MB). GGUF is a format designed to run language models efficiently on an ordinary computer's processor, with no GPU required.

The compact version is the one we deploy. On a modern laptop it writes about sixty words a second, faster than a child can listen. Story generation sends nothing to an outside service, so the model that sees a child's name runs entirely on the local machine.

The voice: a narrator with no fine-tuning needed

A story is only half the experience; a young child wants to hear it. For narration we use VoxCPM2 (an open, 2-billion-parameter speech model), and here we did no fine-tuning at all. Two built-in features do the work:

  • Designed voices. You describe a voice in words and the model produces a consistent narrator. We wrote six: a warm storyteller, a bright sing-song voice, a kind grandfather, a playful cartoon voice, a gentle teacher, and a deep, calm narrator. Each description is rendered once into a short reference clip and reused, so the voice sounds identical every time. No real person is involved.
  • Family voice (optional). A parent records about fifteen seconds of audio and the model reads stories in that voice, with no training. It's opt-in, stored privately for that family, and used only for their stories.

One bug cost us an afternoon: don't put the voice description ("warm, melodious…") into the text you ask the model to speak, because it will read the description out loud. The description shapes the voice. The text is only what should be spoken. Keep them separate.
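One way to make this mistake hard to repeat is to keep the two strings in separate fields and reject any request where they overlap. The sketch below is hypothetical and does not use the VoxCPM2 API: `NarrationRequest` and `build_request` are illustrative names, and how the fields map onto the speech model's real inputs is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NarrationRequest:
    voice_description: str  # shapes how the narrator sounds; never spoken
    text: str               # the only string that should be read aloud


def build_request(voice_description: str, story: str) -> NarrationRequest:
    """Build a request, refusing stories that contain the voice description."""
    description = voice_description.strip()
    if description and description.lower() in story.lower():
        raise ValueError("voice description leaked into the text to be spoken")
    return NarrationRequest(voice_description=description, text=story.strip())


request = build_request("warm, melodious storyteller", "Riya sees a yellow duck at the pond.")
```

A guard this small would have turned an afternoon of listening to odd audio into an immediate error message.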

We don't clone children's voices. The designed voices are synthetic, and voice cloning is limited to a consenting adult's recording of their own voice.

The picture: a small illustrator that fits alongside the rest

Each tale gets one soft, watercolor-style illustration from FLUX.2 [klein], a small open image model from the FLUX 2 family, run with 4-bit quantized weights so it stays cheap to serve.

The three models run as separate scale-to-zero services on Modal. Each spins up only when it's needed and costs nothing while idle, so a single mid-range GPU is enough.

The illustrations are kept child-safe by design rather than by retraining. Every prompt is wrapped in a soft picture-book style, plus instructions that block anything frightening, any text in the image, and photorealism. A short generation run produces a usable image in a few seconds, and each picture is cached, so repeats are instant.
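The wrap-and-cache pattern can be sketched like this. The style wording, the function names, and the in-memory dictionary cache are assumptions for illustration; the `generate` argument stands in for whatever call actually reaches the image model.

```python
import hashlib

# Illustrative wording only; the project's real style wrapper is not shown in the article.
STYLE_PREFIX = "soft watercolor picture-book illustration for toddlers, "
STYLE_SUFFIX = (", gentle colors, friendly characters, nothing frightening, "
                "no text in the image, not photorealistic")


def wrap_prompt(subject: str) -> str:
    """Surround the story's subject with the fixed child-safe style instructions."""
    return STYLE_PREFIX + subject.strip() + STYLE_SUFFIX


def cache_key(prompt: str) -> str:
    """Stable key so the same prompt always maps to the same cached picture."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def illustrate(subject: str, generate, cache: dict):
    """Return a cached image if one exists; otherwise generate and store it."""
    prompt = wrap_prompt(subject)
    key = cache_key(prompt)
    if key not in cache:
        cache[key] = generate(prompt)
    return cache[key]
```

Because the key is derived from the full wrapped prompt, changing the style wrapper invalidates old pictures automatically.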

Step 7 — Release the resources (no surprise bills)

Because billing is per-second and the GPU shuts down the moment a job finishes, idle time costs nothing. A quick check confirms nothing is left running. The only leftovers, a cached environment and a saved checkpoint, sit in free storage and can be deleted entirely if you want a zero footprint.

Four things worth borrowing

  1. The curation pattern. Plain-text stories plus automatic style checks let anyone build and vet the dataset, and the checks document the data standard better than any written guide.
  2. Validation gates before every irreversible step: before publishing the dataset, and before publishing the model. They're cheap and they prevent quiet disasters.
  3. Pin only what you must. Every unnecessary version lock is a future failure. We kept exactly one.
  4. For a narrow task, a small model plus good examples beats a large model plus a vague prompt. Small models are excellent at absorbing a consistent style.

Built for the Build Small Hackathon, June 2026, using MiniCPM5-1B (OpenBMB), VoxCPM2 (OpenBMB), FLUX.2 [klein] (Black Forest Labs), and Modal.
