Spaces:

build-small-hackathon
/

sofia-educational-companion

Running on Zero

App Files Files Community

sofia-educational-companion / FIELD_NOTES.md

EstebanBarac

Fix broken relative links in README and FIELD_NOTES

f745214 6 days ago

preview code

Raw

History Blame Contribute Delete

6.26 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

Field Notes — Sofía

What we actually learned building Sofía in ~10 days: why a small fine-tuned 7B beats a bigger generic model for a 3-year-old, why "curated content + LLM as glue" kills hallucination, the two ZeroGPU gotchas that looked like billing bugs but weren't, an abandoned experiment with Nemotron 4B that validated our core design principle, and a per-visitor ZeroGPU quota that explained a "the LLM never responds" scare two days before the deadline.

The problem, in one line

A ~3-year-old who needs to play and learn nonstop ("why is the sun like that?", "how are forks made?", "are cats and dogs siblings?") and two parents who work from home and don't always have 100% of their creativity available. Sofía doesn't replace that — it complements it on the days we're running a bit "flat". The full story is in README.md.

Decision #1: the LLM is never the source of truth

From day one we decided that all content (stories, songs, learning activities) lives curated in content/ as JSON, and the LLM only presents it and adds conversational warmth. This isn't premature optimization: for a 3-year-old, the universe of things she needs to "know" is genuinely small and known in advance. That constraint is what makes a 7B model (instead of a frontier one) the right choice, not a limitation.

The pivot: from llama.cpp to ZeroGPU (06/12-06/13)

We started with llama.cpp on the Space. Result: 34-180 seconds per response — unusable for a toddler who loses patience in 3 seconds. On 06/12 we pivoted to transformers + ZeroGPU (@spaces.GPU). This meant losing the Llama Champion merit, but keeping Off the Grid: the model still runs inside the Space, with zero external APIs.

Two gotchas each cost us almost half a day, and both humbled us, like Topuria in white house:

Import order. import spaces must be the first import in app.py — before torch, transformers, gradio, and before any of our own imports that themselves import torch. If torch touches CUDA before spaces patches it, 100% of @spaces.GPU calls fail with RuntimeError: No CUDA GPUs are available. It's not intermittent, it's not a quota issue — it's import order.
Kokoro (TTS) must be forced onto CPU. KPipeline(device=None) auto-detects torch.cuda.is_available() (which is True under ZeroGPU even at module level) and moves the model to CUDA. Since tts.warmup() runs outside @spaces.GPU, this breaks the Space's startup with "Expected all tensors to be on the same device". TTS doesn't need a GPU — we forced device="cpu" and it was fixed.

Once both were resolved, we verified live on 06/13: the first chat after startup takes ~7-8s, chat with the LLM ~6-12s, speak (Kokoro/CPU) ~8s. Well below the 34-180s of cpu-basic.

The experiment we abandoned (and why that's good news)

On 06/15, with everything working, we tried whether smaller models (4B: nemotron-mini and nemotron-3-nano:4b via Ollama) could replace Qwen2.5-7B to save latency/compute. We gave them the same SYSTEM_PROMPT and the same <content> blocks that Sofía uses in production.

Both broke the inviolable principle: instead of presenting the curated <content> verbatim, they paraphrased it or invented new text on top. For a story or a counting activity, that's exactly what we can't allow — the parents validated that text, not the model.

We dropped the experiment (main stayed intact, the branch was deleted), but it left us with something valuable: concrete evidence for why the fine-tune matters. It's not just "talks more warmly" — the fine-tune teaches the model to not touch the curated content, something that the base model (even a smaller one) doesn't do reliably out of the box.

QLoRA on Modal: the numbers

finetune/build_dataset.py generated 196 examples (greetings/persona, delivering <content> x3 phrasings per activity, color changes, stories, and ~62 refusals/redirections that don't repeat the exact terms from safety/blocklist.txt — defense in depth). QLoRA r=16/alpha=32 on q/k/v/o_proj, 4-bit NF4, 3 epochs, on a Modal A10G: 72 steps, 408 seconds, loss 2.51 → 0.14.

The post-merge smoke test (finetune/smoke_test_modal.py, 5 turns) confirmed from the positive side what the Nemotron experiment showed from the negative side: the fine-tuned model greets in character, repeats counting <content> verbatim, refuses a question about guns without engaging with the topic, redirects fear of monsters with warmth, and when asked for an invented story ("a dragon that eats cars") says it doesn't have that one and offers a curated alternative — instead of making one up. Published model: build-small-hackathon/sofia-qwen2.5-7b.

The scare in the final two days: per-visitor ZeroGPU quota

On the afternoon of 06/13, after uploading the Space, everything failed again — but now 100% of calls, not intermittently, with the same "No CUDA GPUs are available" traceback as the import-order gotcha (which was already fixed). For a moment we thought we'd broken something.

After 2 hours we get the real cause: ZeroGPU has a daily quota per visitor (not per Space): ~2 min/day for anonymous users, 5 min/day with a free account, 40 min/day with PRO — and the reset is 24h after the first use, not at midnight. At ~6-12s per turn, an anonymous visitor exhausts the 2 minutes in 10-20 messages, something very easy to do during hackathon testing. Once exhausted, every @spaces.GPU call from that visitor fails with the same traceback until the reset — indistinguishable from the import-order bug unless you know it exists.

The thesis, confirmed

For a 3-year-old, a fine-tuned 7B model, tightly scoped by curated content, is the right tool — not a frontier model, and also not (as the Nemotron experiment showed) a smaller model without that fine-tune. Size alone doesn't solve "don't invent the story the parents already reviewed"; the style+safety fine-tune on a right-sized model does.