Yui Home Assistant — teaching a 3B model to write Home Assistant automations that actually work

Community Article
Published June 15, 2026

Build-log for Yui, our Build Small hackathon entry (Backyard AI track). Fully local, ≤3B params, runs on ZeroGPU.

TL;DR: Yui is a local voice assistant that hears "Hey Yui, turn on the porch light when motion is detected," writes the Home Assistant automation YAML, loads it into a live home, and fires the trigger so you watch the home react. The headline result: the heldout works-rate, meaning the share of unseen prompts whose automation actually does what was asked, went from 0% to 94.4%, because we stopped training the model to write YAML that looks right and started training it against a verifier that checks whether the automation actually loads, fires, and moves the device.


Why we built this

Home Assistant is wonderful and a little brutal. The automation engine is YAML, the entity IDs are unforgiving, and "it loaded without error" tells you almost nothing: HA will happily accept an automation that references a device that doesn't exist, and that automation will silently never fire. Every LLM we tried produced automations that looked plausible and didn't work.

We wanted three things at once, and the hackathon angle ("build small") forced the discipline:

  1. Fully local. No frontier API in the loop. Everything runs on a ≤3B model with transformers + peft, adapters pulled from the Hub at boot.
  2. Voice-first. Wake word → speech → action, the way you'd actually use it.
  3. Verified, not vibed. The automation has to do the thing, not just parse.

What it does

Yui is a two-stage assistant:

  • Stage 1 — routing/actions (MiniCPM 1B + SFT): decides whether what you said is an intent ("turn off the kitchen light"), a question ("is the front door locked?"), or an automation request, and emits structured JSON. It also learns to say no — ask it to "open the garage" in a home with no garage and it declines instead of hallucinating a device.
  • Stage 2 — automation YAML (MiniCPM 2B + GRPO): writes a real HA automation, which we then load into an in-process Home Assistant sandbox and fire so the affected device changes state on screen.
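
To make the Stage 1 contract concrete, here is a minimal sketch of how a router output could be checked against the devices in the current home. The JSON field names (`type`, `entity_id`, `service`), the `decline` label, and the `validate_route` helper are illustrative assumptions, not Yui's actual schema.

```python
import json

ALLOWED_TYPES = {"intent", "question", "automation", "decline"}  # assumed label set


def validate_route(raw_json, home_entities):
    """Parse router output; turn references to unknown devices into a decline."""
    try:
        msg = json.loads(raw_json)
    except json.JSONDecodeError:
        return {"type": "decline", "reason": "unparseable output"}
    if not isinstance(msg, dict) or msg.get("type") not in ALLOWED_TYPES:
        return {"type": "decline", "reason": "unknown route type"}
    entity = msg.get("entity_id")
    if entity is not None and entity not in home_entities:
        return {"type": "decline", "reason": f"no such device: {entity}"}
    return msg


# A synthetic home with no garage (hypothetical entity IDs).
home = {"light.kitchen", "lock.front_door"}
ok = validate_route(
    '{"type": "intent", "service": "light.turn_off", "entity_id": "light.kitchen"}',
    home,
)
missing = validate_route(
    '{"type": "intent", "service": "cover.open_cover", "entity_id": "cover.garage"}',
    home,
)
print(ok["type"], missing["type"])
```

A guard like this would complement the learned refusal rather than replace it: the model still has to decide to decline, and the check only catches the cases where it names a device the home does not have.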

Every session spins up a fresh synthetic home, so the demo is never canned — the model is reasoning about devices it's seeing for the first time.

How we trained it — the verifier is the whole trick

The thing we're proud of is the reward. Stage 2 was trained with GRPO against a live Home Assistant verifier. The reward is not "does this YAML look like training data." It's:

  1. Does it load into Home Assistant?
  2. Does the correct trigger fire?
  3. Does the correct actuator actually move?
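
The three checks above can be sketched as a gated reward. This is a toy stand-in, not the real verifier: the home is a plain dict of entity states, the automation is an already-parsed dict, and `load_automation`, `fire`, and `reward` are hypothetical helpers. The all-or-nothing scoring is also an assumption.

```python
def load_automation(automation):
    """Toy loader: checks shape only. Like HA, it does not check that entities exist."""
    return (
        isinstance(automation, dict)
        and isinstance(automation.get("trigger"), dict)
        and isinstance(automation.get("action"), dict)
    )


def fire(automation, home, event):
    """Run the action if the event matches the trigger; return True if it fired."""
    trig = automation["trigger"]
    if (trig.get("entity_id"), trig.get("to")) != event:
        return False
    act = automation["action"]
    if act.get("entity_id") in home:  # actions on unknown devices change nothing
        home[act["entity_id"]] = act.get("state")
    return True


def reward(automation, home, event, expected):
    """1.0 only if the automation loads, fires on the event, and moves the device."""
    sandbox = dict(home)  # fresh copy per rollout
    if not load_automation(automation):
        return 0.0
    if not fire(automation, sandbox, event):
        return 0.0
    entity, state = expected
    return 1.0 if sandbox.get(entity) == state else 0.0


home = {"binary_sensor.porch_motion": "off", "light.porch": "off"}
event = ("binary_sensor.porch_motion", "on")
expected = ("light.porch", "on")

good = {
    "trigger": {"entity_id": "binary_sensor.porch_motion", "to": "on"},
    "action": {"entity_id": "light.porch", "state": "on"},
}
# Loads and fires, but targets a device that does not exist in this home.
ghost = {
    "trigger": {"entity_id": "binary_sensor.porch_motion", "to": "on"},
    "action": {"entity_id": "light.patio", "state": "on"},
}
print(reward(good, home, event, expected), reward(ghost, home, event, expected))
```

The `ghost` case is the failure mode that matters: it passes the load check and the trigger check, and still scores zero because the porch light never changes state.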

Because HA loads broken automations without complaint, loading is required but earns nothing by itself. You only score by changing the world the way the prompt asked. We warmed up with SFT, then ran GRPO with k=8 rollouts per prompt against the verifier.
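
For readers new to GRPO, the sketch below shows how verifier rewards for one prompt's k=8 rollouts are typically turned into group-relative advantages. It uses the common formulation (subtract the group mean, divide by the group standard deviation). Whether our trainer used exactly this normalization is not stated here, and the reward values are made up for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifier rewards for k=8 rollouts of a single prompt.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
adv = group_advantages(rewards)
print([round(a, 2) for a in adv])
```

Rollouts that made the device move get a positive advantage and the rest get a negative one. When every rollout in a group scores the same, all advantages are zero and that prompt contributes no gradient, which is one reason a warmup that lifts the model off 0% matters before GRPO.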

The result, measured on 76 prompts across 10 reserved homes the model never trained on:

Heldout works-rate: 0% (base) → 94.4% (warmup SFT + GRPO). YAML-load failures went to zero.

That's the number that matters: not BLEU against a gold YAML, but "the light actually came on."

Things we learned the hard way

A hackathon is mostly learnings. Ours clustered in three places.

1. ZeroGPU has opinions, and they're load-bearing. This ate a day:

  • ZeroGPU only ships Python 3.10.13 / 3.12.12, and numpy>=2.3 needs Python ≥3.11, which rules it out on 3.10, so we pinned numpy<2.3.
  • Home Assistant 2025.2+ requires Python ≥3.13.2, which ZeroGPU can't run — so the whole HA/sandbox stack is pinned to the 2025.1 line.
  • The builder force-injects torch<=2.11 and gradio[oauth,mcp], which collides with pinned deps. We solved a pydantic version war with a two-pass install (a pre-requirements.txt that installs the strict pin first).
  • You cannot touch CUDA outside a @spaces.GPU slice — even torch.cuda.is_available() trips the emulation guard. So we build every model on CPU and move it onto the GPU inside the decorated call.
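
The version constraints in the list above are easy to trip over again later, so a small preflight check can help. This is a minimal sketch using only the minimum versions stated above. The `MIN_PYTHON` table and `blocked_requirements` helper are hypothetical, and `homeassistant` is assumed to be the package name being pinned.

```python
import sys

# Minimum Python versions as stated in the list above.
MIN_PYTHON = {
    "numpy>=2.3": (3, 11, 0),
    "homeassistant>=2025.2": (3, 13, 2),
}


def blocked_requirements(python_version):
    """Return the requirements that cannot be installed on this interpreter."""
    current = tuple(python_version[:3])
    return sorted(req for req, minimum in MIN_PYTHON.items() if current < minimum)


# Requirements that the interpreter running this script cannot satisfy.
print(blocked_requirements(sys.version_info))
```

Calling `blocked_requirements((3, 10, 13))` reports both entries, while `(3, 12, 12)` reports only the Home Assistant one, which matches the pins described above.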

2. A demo that works offline does not prove the live mic works. Our demo video runs the real pipeline — real wake-word model, real Whisper, real two-stage brain — but it feeds a clean, full-length, silence-padded TTS clip to the detector in one shot. The live app streams 0.25s mic chunks from the browser. Those are completely different code paths: real-room audio at a real distance peaks lower than clean TTS, "Hey Yui" can straddle chunk boundaries, and per-chunk resampling adds artifacts. Lesson: validate the streaming path explicitly, tune the wake threshold for real mics (we dropped it to 0.35), and always ship a manual fallback button.
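
The chunk-boundary problem can be illustrated with a rolling buffer: score a window of recent audio rather than each 0.25s chunk in isolation. This is a sketch under assumptions, not our detector. The sample rate, the window length, the `StreamingWakeDetector` class, and the `toy_score` stand-in are all hypothetical; only the 0.25s chunk size and the 0.35 threshold come from the text above.

```python
from collections import deque

SAMPLE_RATE = 16_000    # assumed detector sample rate
CHUNK_SECONDS = 0.25    # chunk length from the article
WINDOW_SECONDS = 1.5    # assumed context window for the detector
WAKE_THRESHOLD = 0.35   # threshold from the article


class StreamingWakeDetector:
    """Scores a rolling window so a phrase split across chunks is still seen."""

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.window = deque(maxlen=int(SAMPLE_RATE * WINDOW_SECONDS))

    def push(self, chunk):
        self.window.extend(chunk)
        detected = self.score_fn(list(self.window)) >= WAKE_THRESHOLD
        if detected:
            self.window.clear()  # avoid re-firing on the same audio
        return detected


def toy_score(samples, phrase_len=3000):
    """Stand-in scorer: 1.0 if `phrase_len` consecutive nonzero samples occur."""
    run = best = 0
    for s in samples:
        run = run + 1 if s else 0
        best = max(best, run)
    return 1.0 if best >= phrase_len else 0.0


chunk_len = int(SAMPLE_RATE * CHUNK_SECONDS)
# A 3000-sample "phrase" that straddles two consecutive chunks.
chunk_a = [0.0] * (chunk_len - 1500) + [1.0] * 1500
chunk_b = [1.0] * 1500 + [0.0] * (chunk_len - 1500)

detector = StreamingWakeDetector(toy_score)
first = detector.push(chunk_a)
second = detector.push(chunk_b)
print(toy_score(chunk_a), toy_score(chunk_b), first, second)
```

Neither chunk scores above the threshold when judged alone, but the rolling window catches the phrase once the second chunk arrives. A test like this exercises the streaming path directly instead of relying on a one-shot clip.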

3. Gradio's "helpful" UI chrome can look like a bug. Users reported the UI "flickering with a ton of errors." There were no errors — it was Gradio's status-tracker progress overlay repainting ~60×/second because our 0.25s streaming event had show_progress unset. One
