Yui Home Assistant — teaching a 3B model to write Home Assistant automations that actually work
TL;DR — Yui is a local voice assistant that hears "Hey Yui, turn on the porch light when motion is detected," writes the Home Assistant automation YAML, loads it into a live home, and fires the trigger so you watch the home react. The headline result: a heldout works-rate of 0% → 94.4%, because we stopped training the model to write YAML that looks right and started training it against a verifier that checks whether the automation actually loads, fires, and moves the device.
Why we built this
Home Assistant is wonderful and a little brutal. The automation engine is YAML, the entity IDs are unforgiving, and "it loaded without error" tells you almost nothing — HA will happily accept an automation that references a device that doesn't exist and silently never fire. Every LLM we tried produced automations that looked plausible and didn't work.
We wanted three things at once, and the hackathon angle ("build small") forced the discipline:
- Fully local. No frontier API in the loop. Everything runs on a ≤3B model with transformers+peft, adapters pulled from the Hub at boot.
- Voice-first. Wake word → speech → action, the way you'd actually use it.
- Verified, not vibed. The automation has to do the thing, not just parse.
What it does
Yui is a two-stage assistant:
- Stage 1 — routing/actions (MiniCPM 1B + SFT): decides whether what you said is an intent ("turn off the kitchen light"), a question ("is the front door locked?"), or an automation request, and emits structured JSON. It also learns to say no — ask it to "open the garage" in a home with no garage and it declines instead of hallucinating a device.
- Stage 2 — automation YAML (MiniCPM 2B + GRPO): writes a real HA automation, which we then load into an in-process Home Assistant sandbox and fire so the affected device changes state on screen.
Every session spins up a fresh synthetic home, so the demo is never canned — the model is reasoning about devices it's seeing for the first time.
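As a rough illustration of the Stage 1 contract, the sketch below parses a routed JSON message and turns a reference to a device the home does not have into a decline. It is a guard around the model's output, not the learned refusal itself. The field names (kind, entity_id, reason) and the helper check_route are hypothetical; the actual schema is not shown here.

```python
import json

# Hypothetical route kinds: the three cases named above, plus a refusal.
ROUTE_KINDS = {"intent", "question", "automation", "decline"}


def check_route(raw_json: str, known_entities: set[str]) -> dict:
    """Parse Stage 1 output; decline if it is malformed or names an unknown device."""
    try:
        msg = json.loads(raw_json)
    except json.JSONDecodeError:
        return {"kind": "decline", "reason": "unparseable output"}
    if not isinstance(msg, dict) or msg.get("kind") not in ROUTE_KINDS:
        return {"kind": "decline", "reason": "unknown route kind"}
    entity = msg.get("entity_id")
    if entity is not None and entity not in known_entities:
        # The request refers to a device this home does not contain.
        return {"kind": "decline", "reason": f"no such device: {entity}"}
    return msg


home = {"light.kitchen", "lock.front_door"}
print(check_route('{"kind": "intent", "entity_id": "cover.garage"}', home))
```

A check like this is cheap to run on every request, and it keeps a hallucinated entity ID from ever reaching Home Assistant even when the model slips.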
How we trained it — the verifier is the whole trick
The thing we're proud of is the reward. Stage 2 was trained with GRPO against a live Home Assistant verifier. The reward is not "does this YAML look like training data." It's:
- Does it load into Home Assistant?
- Does the correct trigger fire?
- Does the correct actuator actually move?
Because HA loads broken automations without complaint, "it loaded" earns nothing. You only score by changing the world the way the prompt asked. We warmed up with SFT, then ran GRPO with k=8 rollouts per prompt against the verifier.
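The reward logic above can be sketched in a few lines. This is a minimal sketch, not our training code: VerifierResult and works_reward are hypothetical names, the all-or-nothing 1.0/0.0 scoring is an assumption (the real reward may give partial credit), and group_advantages shows one common formulation of GRPO's group-relative advantage over the k=8 rollouts for a single prompt.

```python
from dataclasses import dataclass


@dataclass
class VerifierResult:
    """Hypothetical summary of one sandbox run of a generated automation."""
    loaded: bool          # Home Assistant accepted the YAML
    trigger_fired: bool   # the intended trigger fired
    actuator_moved: bool  # the intended device changed state as asked


def works_reward(result: VerifierResult) -> float:
    """Assumed binary reward: 1.0 only if all three checks pass; loading alone earns 0.0."""
    passed = result.loaded and result.trigger_fired and result.actuator_moved
    return 1.0 if passed else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one prompt's group of rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, k=8 rollouts: only the first automation actually moved the device.
rollouts = [VerifierResult(True, True, True)] + [VerifierResult(True, False, False)] * 7
rewards = [works_reward(r) for r in rollouts]
advantages = group_advantages(rewards)
```

With this shape, a rollout that merely loads is indistinguishable from one that fails to parse, so the only way to earn a positive advantage within a group is to be one of the rollouts that works.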
The result, measured on 76 prompts across 10 reserved homes the model never trained on:
Heldout works-rate: 0% (base) → 94.4% (warmup SFT + GRPO). YAML-load failures went to zero.
That's the number that matters: not BLEU against a gold YAML, but "the light actually came on."
Things we learned the hard way
A hackathon is mostly learnings. Ours clustered in three places.
1. ZeroGPU has opinions, and they're load-bearing. This ate a day:
- ZeroGPU only ships Python 3.10.13 / 3.12.12, and numpy>=2.3 needs ≥3.11, so we pinned numpy<2.3.
- Home Assistant 2025.2+ requires Python ≥3.13.2, which ZeroGPU can't run, so the whole HA/sandbox stack is pinned to the 2025.1 line.
- The builder force-injects torch<=2.11 and gradio[oauth,mcp], which collides with pinned deps. We solved a pydantic version war with a two-pass install (a pre-requirements.txt that installs the strict pin first).
- You cannot touch CUDA outside a @spaces.GPU slice; even torch.cuda.is_available() trips the emulation guard. So we build every model on CPU and move it onto the GPU inside the decorated call.
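The last point, build on CPU and move inside the decorated call, looks roughly like the sketch below. It is an illustration under assumptions rather than our app code: make_handler and build_on_cpu are hypothetical names, and the fallback decorator exists only so the snippet runs on a machine without the spaces package.

```python
try:
    import spaces                 # present on Hugging Face ZeroGPU Spaces
    gpu_slice = spaces.GPU
except ImportError:               # local fallback so the sketch runs anywhere
    def gpu_slice(fn):
        return fn


def make_handler(build_on_cpu, gpu_slice=gpu_slice):
    """Build the model at startup on CPU; touch CUDA only inside the GPU slice."""
    # Runs outside any GPU slice, so build_on_cpu must not call torch.cuda.*.
    model = build_on_cpu()

    @gpu_slice
    def generate(prompt):
        # First CUDA touch happens here, inside the decorated call.
        gpu_model = model.to("cuda")
        return gpu_model.generate(prompt)

    return generate
```

Here build_on_cpu would be something like a function that loads the base model and adapters with device placement left at the default, and generate is what the UI event calls.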
2. A demo that works offline does not prove the live mic works. Our demo video runs the real pipeline — real wake-word model, real Whisper, real two-stage brain — but it feeds a clean, full-length, silence-padded TTS clip to the detector in one shot. The live app streams 0.25s mic chunks from the browser. Those are completely different code paths: real-room audio at a real distance peaks lower than clean TTS, "Hey Yui" can straddle chunk boundaries, and per-chunk resampling adds artifacts. Lesson: validate the streaming path explicitly, tune the wake threshold for real mics (we dropped it to 0.35), and always ship a manual fallback button.
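One way to handle a wake phrase that straddles chunk boundaries is to score a sliding window of recent chunks rather than each 0.25s chunk alone. The sketch below assumes that approach; it is not necessarily what the live app does. StreamingWakeDetector and score_fn are hypothetical names, and the 1.5s window length is an assumption. The chunk size and the 0.35 threshold are the values mentioned above.

```python
from collections import deque

CHUNK_SECONDS = 0.25    # mic chunk size streamed from the browser
WAKE_THRESHOLD = 0.35   # threshold tuned for real mics
WINDOW_SECONDS = 1.5    # assumption: long enough to hold the whole wake phrase


class StreamingWakeDetector:
    """Score a sliding window of recent chunks instead of each chunk in isolation."""

    def __init__(self, score_fn):
        # Hypothetical scorer: takes a list of samples, returns a score in [0, 1].
        self.score_fn = score_fn
        max_chunks = int(WINDOW_SECONDS / CHUNK_SECONDS)
        self.chunks = deque(maxlen=max_chunks)  # oldest chunk drops off automatically

    def feed(self, chunk):
        """Add one mic chunk and report whether the window now crosses the threshold."""
        self.chunks.append(list(chunk))
        window = [sample for c in self.chunks for sample in c]
        return self.score_fn(window) >= WAKE_THRESHOLD
```

Feeding the offline demo clip through the same feed method, cut into 0.25s chunks, is a simple way to exercise the streaming path with a known-good input before testing with a real microphone.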
3. Gradio's "helpful" UI chrome can look like a bug. Users reported the
UI "flickering with a ton of errors." There were no errors — it was
Gradio's status-tracker progress overlay repainting ~60×/second because
our 0.25s streaming event had show_progress unset. One