# Field Notes: Sofía

What we actually learned building Sofía in ~10 days: why a small fine-tuned 7B beats a bigger generic model for a 3-year-old, why "curated content + LLM as glue" kills hallucination, the two ZeroGPU gotchas that looked like billing bugs but weren't, an abandoned experiment with Nemotron 4B that validated our core design principle, and a per-visitor ZeroGPU quota that explained a "the LLM never responds" scare two days before the deadline.

## The problem, in one line

A ~3-year-old who needs to play and learn nonstop ("why is the sun like that?", "how are forks made?", "are cats and dogs siblings?") and two parents who work from home and don't always have 100% of their creativity available. Sofía doesn't replace that; it complements it on the days we're running a bit "flat". The full story is in [`README.md`](https://huggingface.co/spaces/build-small-hackathon/sofia-educational-companion/blob/main/README.md).

## Decision #1: the LLM is never the source of truth

From day one we decided that **all content** (stories, songs, learning activities) lives curated in `content/` as JSON, and the LLM only presents it and adds conversational warmth. This isn't premature optimization: for a 3-year-old, the universe of things she needs to "know" is genuinely small and known in advance. That constraint is what makes a 7B model (instead of a frontier one) the **right** choice, not a limitation.

## The pivot: from llama.cpp to ZeroGPU (06/12-06/13)

We started with `llama.cpp` on the Space. Result: **34-180 seconds per response**, unusable for a toddler who loses patience in 3 seconds. On 06/12 we pivoted to `transformers` + ZeroGPU (`@spaces.GPU`). This meant losing the *Llama Champion* merit, but keeping *Off the Grid*: the model still runs inside the Space, with zero external APIs.

Two gotchas each cost us almost half a day, and both humbled us, like Topuria at the White House:

1. **Import order.** `import spaces` must be the *first* import in `app.py`: before `torch`, `transformers`, `gradio`, and before any of our own imports that themselves import `torch`. If `torch` touches CUDA before `spaces` patches it, **100% of `@spaces.GPU` calls fail** with `RuntimeError: No CUDA GPUs are available`. It's not intermittent and it's not a quota issue; it's import order.
2. **Kokoro (TTS) must be forced onto CPU.** `KPipeline(device=None)` auto-detects `torch.cuda.is_available()` (which is `True` under ZeroGPU even at module level) and moves the model to CUDA. Since `tts.warmup()` runs *outside* `@spaces.GPU`, this breaks the Space's startup with "Expected all tensors to be on the same device". TTS doesn't need a GPU, so we forced `device="cpu"` and that fixed it.

Once both were resolved, we verified live on 06/13: the first `chat` after startup takes ~7-8s, `chat` with the LLM ~6-12s, `speak` (Kokoro/CPU) ~8s. Well below the 34-180s of `cpu-basic`.

## The experiment we abandoned (and why that's good news)

On 06/15, with everything working, we tested whether *smaller* models (4B: `nemotron-mini` and `nemotron-3-nano:4b` via Ollama) could replace Qwen2.5-7B to save latency/compute. We gave them the same `SYSTEM_PROMPT` and the same `` blocks that Sofía uses in production.

**Both broke the inviolable principle**: instead of presenting the curated `` *verbatim*, they paraphrased it or invented new text on top. For a story or a counting activity, that's exactly what we can't allow: the parents validated that text, not the model.

We dropped the experiment (`main` stayed intact, the branch was deleted), but it left us with something valuable: **concrete evidence for why the fine-tune matters**. It's not just "talks more warmly". The fine-tune teaches the model to *not touch* the curated content, something that the base model (even a smaller one) doesn't do reliably out of the box.
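The "verbatim or not" criterion used to judge the smaller models can be expressed as a simple containment check. The sketch below is only an illustration: the helper names and the sample strings are hypothetical and are not taken from the Sofía codebase. It assumes that differences in whitespace (line wrapping, double spaces) should not count as a violation, while any change in wording should.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace runs so line wrapping alone is not flagged."""
    return re.sub(r"\s+", " ", text).strip()


def presents_verbatim(curated_text: str, model_reply: str) -> bool:
    """Return True if the curated text appears unchanged inside the reply.

    Warm-up or follow-up sentences around the curated text are allowed;
    paraphrasing or rewording the curated text itself is not.
    """
    return _normalize(curated_text) in _normalize(model_reply)


# Hypothetical sample data, invented for this sketch.
curated = "One little duck, two little ducks, three little ducks!"
faithful = (
    "Let's count together! One little duck, two little ducks,\n"
    "three little ducks! Great job!"
)
paraphrased = "Let's count together! One duck, then two ducks, then three!"

print(presents_verbatim(curated, faithful))
print(presents_verbatim(curated, paraphrased))
```

A check like this is deliberately strict: it accepts added warmth before and after the curated text, but any edit inside it counts as a failure, which matches the principle that the parents validated that text, not the model.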
## QLoRA on Modal: the numbers

`finetune/build_dataset.py` generated 196 examples (greetings/persona, delivering `` in 3 phrasings per activity, color changes, stories, and ~62 refusals/redirections that **don't** repeat the exact terms from `safety/blocklist.txt`, as defense in depth). QLoRA `r=16`/`alpha=32` on `q/k/v/o_proj`, 4-bit NF4, 3 epochs, on a Modal A10G: **72 steps, 408 seconds, loss 2.51 → 0.14**.

The post-merge smoke test (`finetune/smoke_test_modal.py`, 5 turns) confirmed from the positive side what the Nemotron experiment showed from the negative side: the fine-tuned model greets in character, repeats counting `` **verbatim**, refuses a question about guns without engaging with the topic, redirects fear of monsters with warmth, and when asked for an invented story ("a dragon that eats cars") says it doesn't have that one and offers a curated alternative instead of making one up.

Published model: [`build-small-hackathon/sofia-qwen2.5-7b`](https://huggingface.co/build-small-hackathon/sofia-qwen2.5-7b).

## The scare in the final two days: per-visitor ZeroGPU quota

On the afternoon of 06/13, after uploading the Space, **everything failed again**. This time it was 100% of calls, not intermittent failures, with the same "No CUDA GPUs are available" traceback as the import-order gotcha (which was already fixed). For a moment we thought we'd broken something.

After two hours we found the real cause: ZeroGPU has a **daily quota per visitor** (not per Space): ~2 min/day for anonymous users, 5 min/day with a free account, 40 min/day with PRO. The reset comes 24h after the *first use*, not at midnight. At ~6-12s per turn, an anonymous visitor exhausts the 2 minutes in 10-20 messages, something very easy to do during hackathon testing. Once the quota is exhausted, **every** `@spaces.GPU` call from that visitor fails with the same traceback until the reset, which makes it indistinguishable from the import-order bug unless you know the quota exists.
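The "10-20 messages" figure follows directly from the quota and the per-turn latency. The sketch below redoes that arithmetic for the three tiers quoted above. It assumes, as the estimate above does, that each turn consumes its full ~6-12s from the visitor's quota; the function and constant names are illustrative only.

```python
def turns_before_exhaustion(quota_seconds: float, seconds_per_turn: float) -> int:
    """Number of whole turns that fit into a daily GPU quota."""
    return int(quota_seconds // seconds_per_turn)


# Daily quotas as quoted in these notes, converted to seconds.
QUOTA_SECONDS = {
    "anonymous": 2 * 60,
    "free account": 5 * 60,
    "PRO": 40 * 60,
}

# Observed latency range per LLM turn, in seconds (fastest, slowest).
TURN_SECONDS = (6, 12)

for tier, quota in QUOTA_SECONDS.items():
    fewest = turns_before_exhaustion(quota, max(TURN_SECONDS))
    most = turns_before_exhaustion(quota, min(TURN_SECONDS))
    print(f"{tier}: {fewest}-{most} turns per day")
```

For the anonymous tier this gives 10-20 turns, which is why a single testing session was enough to hit the limit.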
## The thesis, confirmed

For a 3-year-old, **a fine-tuned 7B model, tightly scoped by curated content, is the right tool**: not a frontier model, and also not (as the Nemotron experiment showed) a smaller model without that fine-tune. Size alone doesn't solve "don't invent the story the parents already reviewed"; the style+safety fine-tune on a right-sized model does.