# 📊 Quantitative Evaluation — does the Inspector actually land his charges? A small model that *claims* to circle UX flaws is a parlor trick unless the circles land. So we measured it, the honest way: a **senior UX designer** graded every single charge the live pipeline produced, blind to no one — the full evidence (annotated screenshots, raw cases, and per-charge ratings) is public in the [traces dataset](https://huggingface.co/datasets/build-small-hackathon/ux-crime-scene-traces). ## Method - **16 real pages**, captured at 1440×900 (en-US): apple, cnn, ebay (×2 views), espn (×2 views), foxnews, github, huggingface, nasa, nytimes, usatoday, weather.com, wikipedia, yahoo, and spacejam.com (1996). - The **production pipeline, unmodified** (`Qwen2.5-VL-7B`, agentic sweep → zoom → verify → file) investigated each page once. No retries, no cherry-picking: every charge it filed was graded. - For each of the **38 charges**, a senior UX designer answered two questions: 1. **Grounding** — is the evidence circle on the exact element the charge names? 2. **Validity** — is it a real design issue you would flag in a professional audit? ## Results | Metric | Score | | --- | --- | | **Grounding** (circle on the named element) | **32/38 — 84.2%** | | **Validity** (a real design issue, per a senior UX designer) | **35/38 — 92.1%** | | Fully correct (both) | 32/38 — 84.2% | | Pages with a flawless report | **12/16 — 75%** | ## Failure analysis (the honest part) All 6 imperfect charges are concentrated in 4 pages (apple, espn, nasa, wikipedia) — dark, sparse or atypically structured layouts where the sweep has the least anchor text. The most useful finding: errors correlate with **severity over-reach**. Charges the model filed as `medium` were nearly perfect (**27/28 fully correct, 96%**), while charges it escalated to `high` dropped to 5/10. When the Inspector gets dramatic, he gets sloppier — a calibration target for future work, and exactly the kind of thing you only learn by measuring. ## Reproduce it The eval pack (16 annotated screenshots, raw case JSON, per-charge human ratings, and the generation script) lives in the dataset's `eval/` folder. Point `run_eval.py` at any folder of screenshots to grade your own set. *Graded June 13, 2026 against the deployed production pipeline.*