# 📓 Field Notes — Building *UX Crime Scene*

*A report for the Build Small Hackathon (Gradio × Hugging Face), Thousand Token Wood track.*

---

## The idea

Every designer has felt it: you open a website and something is *wrong*. A button
whispering when it should shout. An action buried like a body under three menus. You
can't always name it — but a good UX critic can.

So I built one. Not a dashboard, not another B2B SaaS — **a film-noir detective** who
takes a screenshot of any interface, works it like a crime scene, and files a verdict.
Every UX flaw gets circled as evidence. Every charge gets a name. The user gets a case
file and a letter grade.

The whole thing is a joke that takes itself completely seriously — which is exactly the
energy *Thousand Token Wood* asked for. The AI isn't *helping* me build the experience;
the AI **is** the detective. Remove the model and there's no case.

## The small-model bet (×3)

The cap was 32B. The easy move is to grab the biggest model that fits, but "build small"
is the whole point, so I went the other way: **three** small models, each doing one
job: **`Qwen2.5-VL-7B`** (8.3B) as the detective's eyes, **`FLUX.2-klein-4B`** as the
forensics lab that *rebuilds* each flawed element with the flaw fixed (a before/after you
compare with a draggable slider), and **`Kokoro-82M`** as the voice that reads the verdict aloud. After
the case is filed you can even **interrogate the Inspector** — the same 7B defends each
charge from the visible evidence, or concedes when you're right.

The eyes deserved the most care: Qwen2.5-VL-7B is one of the strongest open vision-language
models for *bbox grounding* at this size. UX critique needs
two things — *grounding* (knowing **where** the flaw is in the pixels) and
*instruction-following* (staying in character, returning clean structured evidence) — and
the real work became making a 7B punch above its weight (that's what the agent loop below
is for).

It runs on **Modal** (vLLM, L40S, scale-to-zero) behind a FastAPI endpoint — ~25s warm.
The Gradio app on Hugging Face Spaces is CPU-only — it never touches a GPU. Idle cost: $0.

## The hard part: making the circles land

The detective is only convincing if the evidence markers sit **exactly** on the guilty
element. That turned out to be the deepest rabbit hole of the project.

Qwen2.5-VL returns `bbox_2d` coordinates — but in its *smart-resized* pixel space, not
your original screenshot's. Get the rescale wrong and the Inspector circles empty air. I
had to:

- capture the exact resized dimensions the model saw,
- rescale every box back into original-image pixels,
- and tune temperature down so the grounding stopped drifting.
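
The rescale step above can be sketched roughly like this. It is a minimal illustration, not the app's actual code: `rescale_bbox` is a hypothetical helper, and it assumes you already captured the `(width, height)` the model saw after smart-resizing and that boxes arrive as `[x1, y1, x2, y2]`.

```python
from typing import Sequence, Tuple

Box = Tuple[int, int, int, int]


def rescale_bbox(
    bbox_2d: Sequence[float],
    resized_size: Tuple[int, int],
    original_size: Tuple[int, int],
) -> Box:
    """Map [x1, y1, x2, y2] from the model's resized space to original pixels.

    Assumption: both sizes are (width, height) and resized_size is exactly
    what the model saw.
    """
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    scale_x = original_w / resized_w
    scale_y = original_h / resized_h

    # Tolerate boxes whose corners come back in the wrong order.
    x1, x2 = sorted((bbox_2d[0], bbox_2d[2]))
    y1, y2 = sorted((bbox_2d[1], bbox_2d[3]))

    def clamp(value: float, upper: int) -> int:
        # Keep the marker inside the original screenshot.
        return max(0, min(int(round(value)), upper))

    return (
        clamp(x1 * scale_x, original_w),
        clamp(y1 * scale_y, original_h),
        clamp(x2 * scale_x, original_w),
        clamp(y2 * scale_y, original_h),
    )


# A box found in a 1000x500 resized view of a 2000x1000 screenshot.
print(rescale_bbox([100, 50, 200, 100], (1000, 500), (2000, 1000)))
```

Scaling x and y separately matters because the resize does not have to preserve the aspect ratio exactly.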

When the first circle snapped perfectly onto a tiny "VER EQUIPOS" button, the whole thing
came alive.

And the single biggest lever turned out to be **resolution**, not the model. Running a
*small* model frees a lot of GPU memory — so instead of feeding the detective a compressed
~1 MP image, I could hand it ~2.8 MP of detail. The grounding sharpened dramatically: a 7B
that "circles near the button" becomes a 7B that circles *the button*. For the rare
panoramic screenshot, the app goes further and **tiles** the image into a grid, investigating
each quadrant at native resolution so nothing in the body gets missed. The lesson: a small
model wasn't the bottleneck — a blurry photo was. Give the detective a sharp one.
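
For the tiling idea, here is a hedged sketch of the bookkeeping involved. The helper names (`tile_grid`, `tile_box_to_image`), the 2×2 default, and the optional overlap are assumptions for illustration; the important part is that a box found inside a tile has to be shifted by that tile's origin before it is drawn on the full screenshot.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]


def tile_grid(
    width: int, height: int, rows: int = 2, cols: int = 2, overlap: int = 0
) -> List[Box]:
    """Return (left, top, right, bottom) rectangles that cover the image.

    Assumption: a small pixel overlap can be added so an element sitting on
    a seam still appears whole in at least one tile.
    """
    tiles = []
    for row in range(rows):
        for col in range(cols):
            left = col * width // cols
            right = (col + 1) * width // cols
            top = row * height // rows
            bottom = (row + 1) * height // rows
            tiles.append(
                (
                    max(0, left - overlap),
                    max(0, top - overlap),
                    min(width, right + overlap),
                    min(height, bottom + overlap),
                )
            )
    return tiles


def tile_box_to_image(box: Box, tile: Box) -> Box:
    """Shift a tile-local box back into full-image coordinates."""
    left, top = tile[0], tile[1]
    x1, y1, x2, y2 = box
    return (x1 + left, y1 + top, x2 + left, y2 + top)


# A wide 4000x1000 screenshot split into two side-by-side tiles.
tiles = tile_grid(4000, 1000, rows=1, cols=2)
print(tiles)
print(tile_box_to_image((10, 20, 30, 40), tiles[1]))
```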

## Making it investigate, not just answer

A single call to a vision model is just that — a call. A *detective* works the scene in
steps. So the Inspector became a small visual **agent**:

1. **Sweep** — one full-scene pass flags the suspects (the candidate crimes + rough regions).
2. **Close-up** — for the most serious suspects, the app **crops and zooms into each region**
   and re-examines it up close, with a focused prompt, to **confirm or clear** the charge and
   read off a **tighter** evidence box.
3. **File** — the confirmed crimes (with their sharpened boxes) become the verdict.
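
The three steps above can be sketched as a small loop. This is an illustration under stated assumptions, not the Inspector's real code: `sweep` and `close_up` are hypothetical callables standing in for the two model calls, suspects are plain dicts with `severity` and `box` keys, and suspects that never get a close-up are simply left out of the verdict.

```python
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]


def padded_crop(box: Box, image_size: Tuple[int, int], pad: float = 0.25) -> Box:
    """Grow a rough region by a margin so the close-up keeps some context."""
    x1, y1, x2, y2 = box
    width, height = image_size
    pad_x = int((x2 - x1) * pad)
    pad_y = int((y2 - y1) * pad)
    return (
        max(0, x1 - pad_x),
        max(0, y1 - pad_y),
        min(width, x2 + pad_x),
        min(height, y2 + pad_y),
    )


def investigate(
    image_size: Tuple[int, int],
    sweep: Callable[[], List[Dict]],
    close_up: Callable[[Dict, Box], Optional[Box]],
    max_closeups: int = 3,
) -> List[Dict]:
    """Sweep, then zoom into the most serious suspects, then file the verdict."""
    # 1. Sweep: one full-scene pass, most serious suspects first.
    suspects = sorted(sweep(), key=lambda s: s["severity"], reverse=True)

    verdict = []
    for suspect in suspects[:max_closeups]:
        # 2. Close-up: re-examine a padded crop around the rough region.
        crop = padded_crop(suspect["box"], image_size)
        local_box = close_up(suspect, crop)  # None means the charge is cleared
        if local_box is None:
            continue
        # The tighter box is in crop coordinates; shift it back to the image.
        tight = (
            local_box[0] + crop[0],
            local_box[1] + crop[1],
            local_box[2] + crop[0],
            local_box[3] + crop[1],
        )
        # 3. File: keep the confirmed charge with its sharpened box.
        verdict.append({**suspect, "box": tight})
    return verdict
```

Passing the model calls in as functions keeps the loop easy to test with stubs before any GPU is involved.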

Plan → act (zoom) → verify → synthesize. It's slower (a few calls instead of one), but it
clears false positives and tightens the boxes. Testing it live on real sites also caught
a real bug I'd never have found otherwise: a *very* dense page (Hacker News) blew the token
budget and truncated the JSON, crashing the parse. Now the parser salvages every complete
piece of evidence from a cut-off response. Live testing earns its keep.
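
One way to salvage a cut-off response, sketched under assumptions: the model's reply is taken to be a JSON array of evidence objects, and `salvage_evidence` is a hypothetical name. It walks the array with the standard library's `json.JSONDecoder.raw_decode` and keeps every object that parses completely, stopping at the first one that was truncated.

```python
import json
from typing import Dict, List


def salvage_evidence(raw: str) -> List[Dict]:
    """Recover every complete object from a possibly truncated JSON array."""
    decoder = json.JSONDecoder()
    start = raw.find("[")
    if start == -1:
        return []

    items = []
    pos = start + 1
    while pos < len(raw):
        # raw_decode does not skip separators, so step over them by hand.
        while pos < len(raw) and raw[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(raw) or raw[pos] == "]":
            break
        try:
            item, pos = decoder.raw_decode(raw, pos)
        except json.JSONDecodeError:
            break  # the response was cut off inside this item
        if isinstance(item, dict):
            items.append(item)
    return items


truncated = '[{"charge": "a", "bbox_2d": [1, 2, 3, 4]}, {"charge": "b", "bbox_2d": [5, 6'
print(salvage_evidence(truncated))
```

A well-formed response passes through the same path unchanged, so there is no need for a separate "happy path" parser.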

## The craft: pushing past default Gradio

This is where I spent the most love (and where I'm aiming at the **Off-Brand** badge). The
app doesn't look like a Gradio app — it looks like a noir film:

- a **cinematic intro** down a rain-soaked alley to Precinct 7,
- a top-down **detective's desk** where you drop the evidence onto a Polaroid,
- a live **investigation** scene — the Inspector working a laptop, magnifying glass in hand,
- a typewritten **case file** with charges, a verdict, a grade seal, and a shareable link.

All custom CSS, HTML and JS layered over Gradio, with original generated art, voice lines,
and sound design. Each verdict is stored on a Modal volume and gets a unique `?case=ID`
link so anyone can reopen the case.
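
The shareable link boils down to "write the verdict under a fresh ID, read it back by ID". A minimal sketch, assuming the volume is mounted as an ordinary directory and each case is one JSON file; `file_case`, `reopen_case`, and the 12-character ID are hypothetical choices, not the app's real ones.

```python
import json
import uuid
from pathlib import Path
from typing import Dict, Optional


def file_case(verdict: Dict, root: Path) -> str:
    """Store a verdict and return the ID to put in the ?case=ID link."""
    case_id = uuid.uuid4().hex[:12]
    root.mkdir(parents=True, exist_ok=True)
    (root / f"{case_id}.json").write_text(json.dumps(verdict), encoding="utf-8")
    return case_id


def reopen_case(case_id: str, root: Path) -> Optional[Dict]:
    """Load a stored verdict, or return None if the ID is unknown or malformed."""
    # The ID comes from a URL, so reject anything that could escape the folder.
    if not case_id.isalnum():
        return None
    path = root / f"{case_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

Validating the ID before touching the filesystem matters because it arrives straight from a query string.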

## What I learned

1. **Small models are plenty — if you respect the task.** A 7B VLM, pushed hard, does grounded visual
   critique genuinely well. The ceiling wasn't intelligence; it was *plumbing* (coordinate
   spaces) and *craft* (making it feel like something).
2. **Experience is the product.** The model's JSON is maybe 10% of why the app works. The
   other 90% is the desk, the smoke, the Inspector's voice — the *feeling* that you brought
   a case to a detective and he actually took it seriously.
3. **Constraints are a creative gift.** "A small model on your laptop" pushed me toward something
   sharp and weird instead of bloated. That's the whole spirit of the trail.

## Try it

- 🔎 **App:** https://huggingface.co/spaces/build-small-hackathon/ux-crime-scene
- ▶️ **Trailer:** https://youtu.be/JJOMKEcX0Ws
- 📹 **Full walkthrough (66s):** https://youtu.be/kju7LiAXGC0

*Filed at Precinct 7 · The Inspector's eyes: Qwen2.5-VL-7B on Modal.*