# Field Notes — Sofía

What we actually learned building Sofía in ~10 days: why a small fine-tuned 7B
beats a bigger generic model for a 3-year-old, why "curated content + LLM as
glue" kills hallucination, the two ZeroGPU gotchas that looked like billing
bugs but weren't, an abandoned experiment with Nemotron 4B that validated our
core design principle, and a per-visitor ZeroGPU quota that explained a "the
LLM never responds" scare two days before the deadline.

## The problem, in one line

A ~3-year-old who needs to play and learn nonstop ("why is the sun like that?", "how are forks
made?", "are cats and dogs siblings?") and two parents who work from home and
don't always have 100% of their creativity available. Sofía doesn't replace
that — it complements it on the days we're running a bit "flat". The full
story is in [`README.md`](https://huggingface.co/spaces/build-small-hackathon/sofia-educational-companion/blob/main/README.md).

## Decision #1: the LLM is never the source of truth

From day one we decided that **all content** (stories, songs, learning
activities) lives curated in `content/` as JSON, and the LLM only presents it
and adds conversational warmth. This isn't premature optimization: for a
3-year-old, the universe of things she needs to "know" is genuinely small and
known in advance. That constraint is what makes a 7B model (instead of a
frontier one) the **right** choice, not a limitation.

## The pivot: from llama.cpp to ZeroGPU (06/12-06/13)

We started with `llama.cpp` on the Space. Result: **34-180
seconds per response** — unusable for a toddler who loses patience in 3
seconds. On 06/12 we pivoted to `transformers` + ZeroGPU (`@spaces.GPU`). This
meant losing the *Llama Champion* merit, but keeping *Off the Grid*: the model
still runs inside the Space, with zero external APIs.

Two gotchas each cost us almost half a day, and both humbled us, like Topuria in white house:

1. **Import order.** `import spaces` must be the *first* import in `app.py` —
   before `torch`, `transformers`, `gradio`, and before any of our own imports
   that themselves import `torch`. If `torch` touches CUDA before `spaces`
   patches it, **100% of `@spaces.GPU` calls fail** with
   `RuntimeError: No CUDA GPUs are available`. It's not intermittent, it's not
   a quota issue — it's import order.
2. **Kokoro (TTS) must be forced onto CPU.** `KPipeline(device=None)`
   auto-detects `torch.cuda.is_available()` (which is `True` under ZeroGPU
   even at module level) and moves the model to CUDA. Since `tts.warmup()`
   runs *outside* `@spaces.GPU`, this breaks the Space's startup with
   "Expected all tensors to be on the same device". TTS doesn't need a GPU —
   we forced `device="cpu"` and it was fixed.

Once both were resolved, we verified live on 06/13: the first `chat` after
startup takes ~7-8s, `chat` with the LLM ~6-12s, `speak` (Kokoro/CPU) ~8s. Well
below the 34-180s of `cpu-basic`.

## The experiment we abandoned (and why that's good news)

On 06/15, with everything working, we tried whether *smaller* models (4B:
`nemotron-mini` and `nemotron-3-nano:4b` via Ollama) could replace Qwen2.5-7B
to save latency/compute. We gave them the same `SYSTEM_PROMPT` and the same
`<content>` blocks that Sofía uses in production.

**Both broke the inviolable principle**: instead of presenting the curated
`<content>` *verbatim*, they paraphrased it or invented new text on top. For a
story or a counting activity, that's exactly what we can't allow — the parents
validated that text, not the model.

We dropped the experiment (`main` stayed intact, the branch was deleted), but
it left us with something valuable: **concrete evidence for why the fine-tune
matters**. It's not just "talks more warmly" — the fine-tune teaches the model
to *not touch* the curated content, something that the base model (even a
smaller one) doesn't do reliably out of the box.

## QLoRA on Modal: the numbers

`finetune/build_dataset.py` generated 196 examples (greetings/persona,
delivering `<content>` x3 phrasings per activity, color changes, stories, and
~62 refusals/redirections that **don't** repeat the exact terms from
`safety/blocklist.txt` — defense in depth). QLoRA `r=16`/`alpha=32` on
`q/k/v/o_proj`, 4-bit NF4, 3 epochs, on a Modal A10G: **72 steps, 408 seconds,
loss 2.51 → 0.14**.

The post-merge smoke test (`finetune/smoke_test_modal.py`, 5 turns) confirmed
from the positive side what the Nemotron experiment showed from the negative
side: the fine-tuned model greets in character, repeats counting `<content>`
**verbatim**, refuses a question about guns without engaging with the topic,
redirects fear of monsters with warmth, and when asked for an invented story
("a dragon that eats cars") says it doesn't have that one and offers a curated
alternative — instead of making one up. Published model:
[`build-small-hackathon/sofia-qwen2.5-7b`](https://huggingface.co/build-small-hackathon/sofia-qwen2.5-7b).

## The scare in the final two days: per-visitor ZeroGPU quota

On the afternoon of 06/13, after uploading the Space, **everything failed
again** — but now 100% of calls, not intermittently, with the same "No CUDA
GPUs are available" traceback as the import-order gotcha (which was already
fixed). For a moment we thought we'd broken something.

After 2 hours we get the real cause: ZeroGPU has a **daily quota per visitor** (not per Space):
~2 min/day for anonymous users, 5 min/day with a free account, 40 min/day with
PRO — and the reset is 24h after the *first use*, not at midnight. At ~6-12s
per turn, an anonymous visitor exhausts the 2 minutes in 10-20 messages,
something very easy to do during hackathon testing. Once exhausted, **every**
`@spaces.GPU` call from that visitor fails with the same traceback until the
reset — indistinguishable from the import-order bug unless you know it exists.

## The thesis, confirmed

For a 3-year-old, **a fine-tuned 7B model, tightly scoped by curated content,
is the right tool** — not a frontier model, and also not (as the Nemotron
experiment showed) a smaller model without that fine-tune. Size alone doesn't
solve "don't invent the story the parents already reviewed"; the style+safety
fine-tune on a right-sized model does.