Spaces:
Running on Zero
A newer version of the Gradio SDK is available: 6.19.0
Field Notes — Sofía
What we actually learned building Sofía in ~10 days: why a small fine-tuned 7B beats a bigger generic model for a 3-year-old, why "curated content + LLM as glue" kills hallucination, the two ZeroGPU gotchas that looked like billing bugs but weren't, an abandoned experiment with Nemotron 4B that validated our core design principle, and a per-visitor ZeroGPU quota that explained a "the LLM never responds" scare two days before the deadline.
The problem, in one line
A ~3-year-old who needs to play and learn nonstop ("why is the sun like that?", "how are forks
made?", "are cats and dogs siblings?") and two parents who work from home and
don't always have 100% of their creativity available. Sofía doesn't replace
that — it complements it on the days we're running a bit "flat". The full
story is in README.md.
Decision #1: the LLM is never the source of truth
From day one we decided that all content (stories, songs, learning
activities) lives curated in content/ as JSON, and the LLM only presents it
and adds conversational warmth. This isn't premature optimization: for a
3-year-old, the universe of things she needs to "know" is genuinely small and
known in advance. That constraint is what makes a 7B model (instead of a
frontier one) the right choice, not a limitation.
The pivot: from llama.cpp to ZeroGPU (06/12-06/13)
We started with llama.cpp on the Space. Result: 34-180
seconds per response — unusable for a toddler who loses patience in 3
seconds. On 06/12 we pivoted to transformers + ZeroGPU (@spaces.GPU). This
meant losing the Llama Champion merit, but keeping Off the Grid: the model
still runs inside the Space, with zero external APIs.
Two gotchas each cost us almost half a day, and both humbled us, like Topuria in white house:
- Import order.
import spacesmust be the first import inapp.py— beforetorch,transformers,gradio, and before any of our own imports that themselves importtorch. Iftorchtouches CUDA beforespacespatches it, 100% of@spaces.GPUcalls fail withRuntimeError: No CUDA GPUs are available. It's not intermittent, it's not a quota issue — it's import order. - Kokoro (TTS) must be forced onto CPU.
KPipeline(device=None)auto-detectstorch.cuda.is_available()(which isTrueunder ZeroGPU even at module level) and moves the model to CUDA. Sincetts.warmup()runs outside@spaces.GPU, this breaks the Space's startup with "Expected all tensors to be on the same device". TTS doesn't need a GPU — we forceddevice="cpu"and it was fixed.
Once both were resolved, we verified live on 06/13: the first chat after
startup takes ~7-8s, chat with the LLM ~6-12s, speak (Kokoro/CPU) ~8s. Well
below the 34-180s of cpu-basic.
The experiment we abandoned (and why that's good news)
On 06/15, with everything working, we tried whether smaller models (4B:
nemotron-mini and nemotron-3-nano:4b via Ollama) could replace Qwen2.5-7B
to save latency/compute. We gave them the same SYSTEM_PROMPT and the same
<content> blocks that Sofía uses in production.
Both broke the inviolable principle: instead of presenting the curated
<content> verbatim, they paraphrased it or invented new text on top. For a
story or a counting activity, that's exactly what we can't allow — the parents
validated that text, not the model.
We dropped the experiment (main stayed intact, the branch was deleted), but
it left us with something valuable: concrete evidence for why the fine-tune
matters. It's not just "talks more warmly" — the fine-tune teaches the model
to not touch the curated content, something that the base model (even a
smaller one) doesn't do reliably out of the box.
QLoRA on Modal: the numbers
finetune/build_dataset.py generated 196 examples (greetings/persona,
delivering <content> x3 phrasings per activity, color changes, stories, and
~62 refusals/redirections that don't repeat the exact terms from
safety/blocklist.txt — defense in depth). QLoRA r=16/alpha=32 on
q/k/v/o_proj, 4-bit NF4, 3 epochs, on a Modal A10G: 72 steps, 408 seconds,
loss 2.51 → 0.14.
The post-merge smoke test (finetune/smoke_test_modal.py, 5 turns) confirmed
from the positive side what the Nemotron experiment showed from the negative
side: the fine-tuned model greets in character, repeats counting <content>
verbatim, refuses a question about guns without engaging with the topic,
redirects fear of monsters with warmth, and when asked for an invented story
("a dragon that eats cars") says it doesn't have that one and offers a curated
alternative — instead of making one up. Published model:
build-small-hackathon/sofia-qwen2.5-7b.
The scare in the final two days: per-visitor ZeroGPU quota
On the afternoon of 06/13, after uploading the Space, everything failed again — but now 100% of calls, not intermittently, with the same "No CUDA GPUs are available" traceback as the import-order gotcha (which was already fixed). For a moment we thought we'd broken something.
After 2 hours we get the real cause: ZeroGPU has a daily quota per visitor (not per Space):
~2 min/day for anonymous users, 5 min/day with a free account, 40 min/day with
PRO — and the reset is 24h after the first use, not at midnight. At ~6-12s
per turn, an anonymous visitor exhausts the 2 minutes in 10-20 messages,
something very easy to do during hackathon testing. Once exhausted, every
@spaces.GPU call from that visitor fails with the same traceback until the
reset — indistinguishable from the import-order bug unless you know it exists.
The thesis, confirmed
For a 3-year-old, a fine-tuned 7B model, tightly scoped by curated content, is the right tool — not a frontier model, and also not (as the Nemotron experiment showed) a smaller model without that fine-tune. Size alone doesn't solve "don't invent the story the parents already reviewed"; the style+safety fine-tune on a right-sized model does.