# 📊 Quantitative Evaluation — does the Inspector actually land his charges?

A small model that *claims* to circle UX flaws is a parlor trick unless the circles land.
So we measured it the honest way: a **senior UX designer** graded every single charge
the live pipeline produced, with nothing hidden. The full evidence (annotated screenshots,
raw cases, and per-charge ratings) is public in the
[traces dataset](https://huggingface.co/datasets/build-small-hackathon/ux-crime-scene-traces).

## Method

- **16 real pages**, captured at 1440×900 (en-US): apple, cnn, ebay (×2 views),
  espn (×2 views), foxnews, github, huggingface, nasa, nytimes, usatoday, weather.com,
  wikipedia, yahoo, and spacejam.com (1996).
- The **production pipeline, unmodified** (`Qwen2.5-VL-7B`, agentic sweep → zoom → verify
  → file) investigated each page once. No retries, no cherry-picking: every charge it
  filed was graded.
- For each of the **38 charges**, a senior UX designer answered two questions:
  1. **Grounding** — is the evidence circle on the exact element the charge names?
  2. **Validity** — is it a real design issue you would flag in a professional audit?

## Results

| Metric | Score |
| --- | --- |
| **Grounding** (circle on the named element) | **32/38 — 84.2%** |
| **Validity** (a real design issue, per a senior UX designer) | **35/38 — 92.1%** |
| Fully correct (grounded **and** valid) | 32/38 — 84.2% |
| Pages with a flawless report | **12/16 — 75%** |
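
The percentages above are plain ratios over the per-charge ratings. As a minimal sketch of that arithmetic, assuming each rating is a record with `page`, `grounded`, and `valid` fields (hypothetical names; the actual schema in the dataset may differ), the table could be recomputed like this:

```python
from collections import defaultdict


def pct(hits, total):
    """Format a ratio as a one-decimal percentage string."""
    return f"{100 * hits / total:.1f}%"


def summarize(ratings):
    """Aggregate per-charge ratings into the four headline counts.

    `ratings` is a list of dicts with the hypothetical keys
    'page' (str), 'grounded' (bool) and 'valid' (bool).
    """
    grounded = sum(bool(r["grounded"]) for r in ratings)
    valid = sum(bool(r["valid"]) for r in ratings)
    both = sum(bool(r["grounded"] and r["valid"]) for r in ratings)

    # A page is flawless when every charge filed on it is grounded and valid.
    # Pages with no charges never appear in `ratings` and are not counted here.
    page_ok = defaultdict(lambda: True)
    for r in ratings:
        ok = bool(r["grounded"] and r["valid"])
        page_ok[r["page"]] = page_ok[r["page"]] and ok

    return {
        "charges": len(ratings),
        "grounded": grounded,
        "valid": valid,
        "fully_correct": both,
        "pages": len(page_ok),
        "flawless_pages": sum(page_ok.values()),
    }


# Toy records for illustration only; these are NOT the real ratings.
toy = [
    {"page": "a", "grounded": True, "valid": True},
    {"page": "a", "grounded": False, "valid": True},
    {"page": "b", "grounded": True, "valid": True},
    {"page": "c", "grounded": False, "valid": False},
]
summary = summarize(toy)
```

Fed the reported counts, `pct(32, 38)` and `pct(35, 38)` give the grounding and validity percentages shown in the table.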

## Failure analysis (the honest part)

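All 6 imperfect charges are concentrated in 4 pages (apple, espn, nasa, wikipedia):
dark, sparse, or atypically structured layouts where the sweep has the least anchor text.
Because the fully-correct count (32) equals the grounding count (32), every well-grounded
charge was also valid. All 6 failures are therefore grounding misses, and 3 of them were
also judged invalid.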

The most useful finding: in this sample, errors correlate with **severity over-reach**.
Charges the model filed as `medium` were nearly perfect (**27/28 fully correct, 96.4%**),
while charges it escalated to `high` dropped to 5/10 (50%). When the Inspector gets
dramatic, he gets sloppier. With only 10 `high` charges the gap is a signal rather than a
precise estimate, but it is a clear calibration target for future work, and exactly the
kind of thing you only learn by measuring.
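
The severity split can be checked against the headline numbers with a few lines of arithmetic. This sketch uses only the counts reported above; the function name `severity_rates` is illustrative and not part of the eval pack:

```python
def severity_rates(counts):
    """Map each severity to its fully-correct rate, in percent."""
    return {
        severity: round(100 * correct / total, 1)
        for severity, (correct, total) in counts.items()
    }


# (fully correct, filed) per severity, as reported in this section.
reported = {"medium": (27, 28), "high": (5, 10)}

rates = severity_rates(reported)

# The two buckets should add back up to the overall 32/38 in the results table.
total_correct = sum(correct for correct, _ in reported.values())
total_filed = sum(total for _, total in reported.values())
```

The two severities sum to 32 fully correct out of 38 filed, so together they account for every graded charge.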

## Reproduce it

The eval pack (16 annotated screenshots, raw case JSON, per-charge human ratings, and the
generation script) lives in the dataset's `eval/` folder. Point `run_eval.py` at any
folder of screenshots to grade your own set.

*Graded June 13, 2026 against the deployed production pipeline.*