# 📓 Field Notes: Building *UX Crime Scene*

*A report for the Build Small Hackathon (Gradio × Hugging Face), Thousand Token Wood track.*

---

## The idea

Every designer has felt it: you open a website and something is *wrong*. A button whispering when it should shout. An action buried like a body under three menus. You can't always name it, but a good UX critic can. So I built one.

Not a dashboard, not another B2B SaaS: **a film-noir detective** who takes a screenshot of any interface, works it like a crime scene, and files a verdict. Every UX flaw gets circled as evidence. Every charge gets a name. The user gets a case file and a letter grade.

The whole thing is a joke that takes itself completely seriously, which is exactly the energy *Thousand Token Wood* asked for. The AI isn't *helping* me build the experience; the AI **is** the detective. Remove the model and there's no case.

## The small-model bet (×3)

The cap was 32B. The easy move is to grab the biggest model that fits, but "build small" is the whole point, so I went the other way, with **three** small models, each doing one job: **`Qwen2.5-VL-7B`** (8.3B) as the detective's eyes, **`FLUX.2-klein-4B`** as the forensics lab that *rebuilds* each flawed element fixed (a before/after you compare with a draggable slider), and **`Kokoro-82M`** as the voice that reads the verdict aloud. After the case is filed you can even **interrogate the Inspector**: the same 7B defends each charge from the visible evidence, or concedes when you're right.

The eyes deserved the most care: Qwen2.5-VL-7B is one of the strongest open vision-language models for *bbox grounding* at this size. UX critique needs two things: *grounding* (knowing **where** the flaw is in the pixels) and *instruction-following* (staying in character, returning clean structured evidence). The real work became making a 7B punch above its weight (that's what the agent loop below is for).
It runs on **Modal** (vLLM, L40S, scale-to-zero) behind a FastAPI endpoint, taking about 25s when warm. The Gradio app on Hugging Face Spaces is CPU-only; it never touches a GPU. Idle cost: $0.

## The hard part: making the circles land

The detective is only convincing if the evidence markers sit **exactly** on the guilty element. That turned out to be the deepest rabbit hole of the project.

Qwen2.5-VL returns `bbox_2d` coordinates, but in its *smart-resized* pixel space, not your original screenshot's. Get the rescale wrong and the Inspector circles empty air. I had to:

- capture the exact resized dimensions the model saw,
- rescale every box back into original-image pixels,
- and tune temperature down so the grounding stopped drifting.

When the first circle snapped perfectly onto a tiny "VER EQUIPOS" button, the whole thing came alive.

And the single biggest lever turned out to be **resolution**, not the model. Running a *small* model frees a lot of GPU memory, so instead of feeding the detective a compressed ~1 MP image, I could hand it ~2.8 MP of detail. The grounding sharpened dramatically: a 7B that "circles near the button" becomes a 7B that circles *the button*. For the rare panoramic screenshot, the app goes further and **tiles** the image into a grid, investigating each quadrant at native resolution so nothing in the body gets missed.

The lesson: a small model wasn't the bottleneck; a blurry photo was. Give the detective a sharp one.

## Making it investigate, not just answer

A single call to a vision model is just that: a call. A *detective* works the scene in steps. So the Inspector became a small visual **agent**:

1. **Sweep**: one full-scene pass flags the suspects (the candidate crimes + rough regions).
2. **Close-up**: for the most serious suspects, the app **crops and zooms into each region** and re-examines it up close, with a focused prompt, to **confirm or clear** the charge and read off a **tighter** evidence box.
3. **File**: the confirmed crimes (with their sharpened boxes) become the verdict.

Plan → act (zoom) → verify → synthesize. It's slower (a few calls instead of one), but it clears false positives and tightens the boxes. Testing it live on real sites also caught a real bug I'd never have found otherwise: a *very* dense page (Hacker News) blew the token budget and truncated the JSON, crashing the parse. Now the parser salvages every complete piece of evidence from a cut-off response. Live testing earns its keep.

## The craft: pushing past default Gradio

This is where I spent the most love (and where I'm aiming at the **Off-Brand** badge). The app doesn't look like a Gradio app; it looks like a noir film:

- a **cinematic intro** down a rain-soaked alley to Precinct 7,
- a top-down **detective's desk** where you drop the evidence onto a Polaroid,
- a live **investigation** scene, with the Inspector working a laptop, magnifying glass in hand,
- a typewritten **case file** with charges, a verdict, a grade seal, and a shareable link.

All custom CSS, HTML and JS layered over Gradio, with original generated art, voice lines, and sound design. Each verdict is stored on a Modal volume and gets a unique `?case=ID` link so anyone can reopen the case.

## What I learned

1. **Small models are plenty, if you respect the task.** A 7B VLM, pushed hard, does grounded visual critique genuinely well. The ceiling wasn't intelligence; it was *plumbing* (coordinate spaces) and *craft* (making it feel like something).
2. **Experience is the product.** The model's JSON is maybe 10% of why the app works. The other 90% is the desk, the smoke, the Inspector's voice: the *feeling* that you brought a case to a detective and he actually took it seriously.
3. **Constraints are a creative gift.** "A small model on your laptop" pushed me toward something sharp and weird instead of bloated. That's the whole spirit of the trail.
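
For anyone curious what the coordinate-space plumbing from lesson 1 might look like, here is a minimal sketch. It assumes the model's `bbox_2d` is `[x1, y1, x2, y2]` in the pixel space of the resized image, and that the resized width and height were captured at inference time. The name `rescale_bbox` and its parameters are illustrative only, not the app's actual code.

```python
from typing import Sequence, Tuple

Box = Tuple[int, int, int, int]


def rescale_bbox(
    bbox_2d: Sequence[float],
    resized_size: Tuple[int, int],
    original_size: Tuple[int, int],
) -> Box:
    """Map [x1, y1, x2, y2] from resized-image pixels to original-image pixels.

    Hypothetical helper: sizes are (width, height) tuples.
    """
    resized_w, resized_h = resized_size
    orig_w, orig_h = original_size
    if resized_w <= 0 or resized_h <= 0:
        raise ValueError("resized_size must be positive")

    # Independent scale factors per axis, in case the resize
    # did not preserve the aspect ratio exactly.
    sx = orig_w / resized_w
    sy = orig_h / resized_h

    x1, y1, x2, y2 = bbox_2d
    # Normalise corner order so (x1, y1) is always the top-left corner.
    x1, x2 = sorted((x1, x2))
    y1, y2 = sorted((y1, y2))

    def clamp(value: float, upper: int) -> int:
        # Round to whole pixels and keep the box inside the image.
        return max(0, min(upper, round(value)))

    return (
        clamp(x1 * sx, orig_w),
        clamp(y1 * sy, orig_h),
        clamp(x2 * sx, orig_w),
        clamp(y2 * sy, orig_h),
    )
```

With a 2000×1000 screenshot that the model saw at 1000×500, `rescale_bbox([100, 50, 200, 100], (1000, 500), (2000, 1000))` gives `(200, 100, 400, 200)`. Clamping is a defensive choice here, so a box that drifts slightly past the edge still draws inside the image.
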
## Try it

- 🔎 **App:** https://huggingface.co/spaces/build-small-hackathon/ux-crime-scene
- ▶️ **Trailer:** https://youtu.be/JJOMKEcX0Ws
- 📹 **Full walkthrough (66s):** https://youtu.be/kju7LiAXGC0

*Filed at Precinct 7 · The Inspector's eyes: Qwen2.5-VL-7B on Modal.*