Mapping Conflict to Motion: building Dempster's Court
A courtroom puzzle game where AI witnesses each believe a different version of the truth, and the rule you pick to combine their testimony decides who hangs. Built for the Hugging Face Build Small Hackathon, June 2026.
- Play it: https://huggingface.co/spaces/build-small-hackathon/dempsters-court
- Witness trace dataset: https://huggingface.co/datasets/thangvip/dempsters-court-witness-traces
- Source / GDD / spec / plan: in the Space repo
- Demo: https://youtu.be/DsrdlaRrnIc
The thesis the game tries to land
Most "evidence" games turn certainty into a knowledge problem: find the clue, you win. Dempster's Court asks something stranger — given that you already have all the testimony, how much certainty are you entitled to? In an age of AI systems that are fluently, confidently wrong, the answer can flip the verdict on the same facts.
The game is built on Dempster–Shafer theory, but the player never sees a Greek letter. They pull a brass lever labelled Dempster (the Zealot), Yager (the Agnostic), PCR5 (the Diplomat), or Cautious (the Skeptic), and watch the Suspicion bars on each suspect rearrange themselves.
The hard wall
The architecture has exactly one rule: the model never does arithmetic, and the math engine never improvises.
- Witness Engine — a small instruct model that speaks in character, hedges, lies, remembers. Gives prose.
- Belief Engine — ~250 lines of pure Python that does every Dempster–Shafer calculation deterministically. Gives numbers.
Numbers shown to the player always come from the Belief Engine. Sentences shown to the player always come from the Witness Engine. No crossing.
The crucial de-risking move for v1: in story-mode cases, the model's proposed mass is discarded, and the case YAML's authored target_mass is used directly for the math. The model produces the dialogue; the math produces the bars. The lesson cannot be broken by a flaky model emission.
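A minimal sketch of that story-mode rule, under stated assumptions: `Deposition`, `mass_for_belief_engine`, and the dict-of-frozensets mass representation are hypothetical names for illustration, not the Space repo's actual code, and the non-story branch (using the model's proposal) is my guess at what procedural mode does.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed representation: a mass function maps subsets of suspects to weights.
Mass = dict[frozenset[str], float]


@dataclass(frozen=True)
class Deposition:
    prose: str            # in-character text, shown to the player as-is
    proposed_mass: Mass   # numbers the model suggested, possibly unreliable


def mass_for_belief_engine(
    deposition: Deposition,
    target_mass: Optional[Mass],
    story_mode: bool,
) -> Mass:
    """Pick the mass that feeds the math; the prose is never consulted."""
    if story_mode:
        if target_mass is None:
            raise ValueError("story-mode cases need an authored target_mass")
        # The model's proposal is discarded; the authored numbers win.
        return dict(target_mass)
    # Assumption: outside story mode the model's proposal is used.
    return dict(deposition.proposed_mass)


authored = {frozenset({"Gardener"}): 0.99, frozenset({"Secretary"}): 0.01}
flaky = Deposition(
    prose="I saw him pour the wine, I am almost sure of it.",
    proposed_mass={frozenset({"Gardener"}): 0.3, frozenset({"Maid"}): 0.7},
)
used = mass_for_belief_engine(flaky, authored, story_mode=True)
```

However far the model's numbers drift, `used` equals the authored mass, which is the whole point of the wall.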
Two backends, one engine
Witness inference runs through a single WitnessEngine interface with two interchangeable backends:
| Backend | Where it runs | Model | When to use |
|---|---|---|---|
| InferenceProvidersWitnessEngine | Hugging Face Inference Providers (OpenAI-compatible router) | Qwen2.5-7B-Instruct (or any ≤32B chat model) | The hosted Space on cpu-basic: no local GPU needed. |
| WitnessEngine (llama.cpp) | Locally on your laptop | Qwen2.5-3B-Instruct, GGUF Q4_K_M | "Off the Grid" play: no cloud calls, the whole stack on disk. |
Switching is a single env var (WITNESS_BACKEND=providers or llama_cpp). The hosted Space defaults to providers so judges and players don't hit a cold-start CPU-llama.cpp boot. The local install path defaults to llama_cpp so the laptop demo is genuinely offline.
This dual path also lets us keep two badges honest: the local-mode build still earns Off the Grid + Llama Champion, while the hosted Space stays snappy on free hardware.
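The env-var switch could be sketched roughly like this. Only the `WITNESS_BACKEND` variable and its two values come from the project; `make_witness_engine`, the factory registry, and the stub classes are hypothetical stand-ins so the example runs without either real backend installed.

```python
import os
from typing import Callable, Mapping, Optional


def make_witness_engine(
    factories: Mapping[str, Callable[[], object]],
    default: str,
    env: Optional[Mapping[str, str]] = None,
) -> object:
    """Build the backend named by WITNESS_BACKEND, falling back to `default`."""
    env = os.environ if env is None else env
    name = env.get("WITNESS_BACKEND", default)
    try:
        factory = factories[name]
    except KeyError:
        raise ValueError(f"unknown WITNESS_BACKEND: {name!r}") from None
    # The factory is only called for the chosen backend, so the other
    # backend's heavy imports or model loads never have to happen.
    return factory()


class StubProvidersEngine:   # stand-in for the hosted backend
    pass


class StubLlamaCppEngine:    # stand-in for the local backend
    pass


FACTORIES = {"providers": StubProvidersEngine, "llama_cpp": StubLlamaCppEngine}

hosted = make_witness_engine(FACTORIES, default="providers", env={})
local = make_witness_engine(
    FACTORIES, default="providers", env={"WITNESS_BACKEND": "llama_cpp"}
)
```

Passing the default in from the caller is one way to let the hosted Space and the local install disagree about it without touching the selection logic.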
Structured output without GBNF
The local backend uses GBNF to constrain the witness's deposition JSON to a per-case grammar that enumerates the allowed suspect names. The hosted backend can't ship a GBNF grammar to providers, so instead it asks for a JSON object via response_format={"type": "json_object"}, embeds the schema in the prompt, and routes everything through the same parse_and_validate_mass cleanup pass that the local backend already used (drop empty subsets, reject unknown suspect names, clamp + renormalize). One retry on parse failure with a "your previous response was invalid" nudge handles the long tail.
The point: grammars are a luxury, not a load-bearing dependency. The Belief Engine doesn't care where the mass came from, as long as the values are in [0, 1] and sum to one.
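Here is a rough sketch of what such a cleanup pass and single retry might look like. The JSON shape (`{"mass": [{"suspects": [...], "value": ...}]}`), the `_sketch` function, `deposition_with_retry`, and the `ask` callable are all assumptions for illustration; the real `parse_and_validate_mass` and its schema live in the Space repo. I also assume "reject" means raising an error, so that an unknown name triggers the same retry as malformed JSON.

```python
import json
from typing import Callable

Mass = dict[frozenset[str], float]


def parse_and_validate_mass_sketch(raw: str, suspects: set[str]) -> Mass:
    """Turn a raw model reply into a mass function, or raise ValueError."""
    try:
        entries = json.loads(raw)["mass"]  # malformed JSON raises ValueError
        pairs = [(frozenset(e["suspects"]), float(e["value"])) for e in entries]
    except (KeyError, TypeError) as exc:
        raise ValueError("reply does not match the assumed schema") from exc

    mass: Mass = {}
    for subset, value in pairs:
        if not subset:
            continue  # drop empty subsets
        unknown = subset - suspects
        if unknown:
            raise ValueError(f"unknown suspects: {sorted(unknown)}")
        clamped = min(1.0, max(0.0, value))  # clamp into [0, 1]
        mass[subset] = mass.get(subset, 0.0) + clamped

    total = sum(mass.values())
    if total <= 0.0:
        raise ValueError("no usable mass in reply")
    return {s: v / total for s, v in mass.items()}  # renormalize to sum to one


def deposition_with_retry(
    ask: Callable[[str], str], prompt: str, suspects: set[str]
) -> Mass:
    """Ask once; on a bad reply, nudge the model and ask exactly once more."""
    try:
        return parse_and_validate_mass_sketch(ask(prompt), suspects)
    except ValueError:
        nudge = prompt + "\nYour previous response was invalid. Reply with JSON only."
        return parse_and_validate_mass_sketch(ask(nudge), suspects)


SUSPECTS = {"Gardener", "Maid", "Secretary"}
messy = (
    '{"mass": [{"suspects": ["Gardener"], "value": 1.5},'
    ' {"suspects": [], "value": 0.3},'
    ' {"suspects": ["Gardener", "Maid"], "value": 1.0}]}'
)
cleaned = parse_and_validate_mass_sketch(messy, SUSPECTS)
```

In `messy`, the out-of-range 1.5 is clamped to 1.0, the empty subset is dropped, and the two survivors are renormalized to 0.5 each.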
The signature moment
It's Case 1, "Two Truths." The Butler is highly credible: he saw the Gardener pour the wine. He allows a 1% sliver that maybe it was the Secretary. The Cook is equally credible: she saw the Maid in the pantry. She also allows a 1% sliver for the Secretary.
The player pulls Dempster. The witnesses' two strong accusations name different people, so they conflict and cancel; all that survives normalization is the shared 1% on the Secretary. Dempster crowns the Secretary with Suspicion = 1.00. The Doubt meter pulses red. The lever is offering certain conviction of an innocent.
The player pulls Yager. All that conflict gets poured into "I'm not sure." Every bar collapses to near zero. The game refuses to convict.
The lesson the player feels in their gut: the same facts, combined two different ways, produce two different verdicts. One rule convicts with certainty; the other refuses to convict at all.
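The arithmetic behind that moment fits in a few lines. This is a sketch of the two textbook combination rules, not the game's Belief Engine: it assumes the frame is exactly {Gardener, Maid, Secretary}, takes each witness's mass as 0.99 and 0.01 straight from the story, and ignores any credibility discounting the real engine may apply.

```python
from itertools import product

Mass = dict[frozenset[str], float]


def conjunctive(m1: Mass, m2: Mass) -> tuple[Mass, float]:
    """Multiply masses pairwise; return (mass on non-empty sets, conflict K)."""
    out: Mass = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            out[common] = out.get(common, 0.0) + x * y
        else:
            conflict += x * y  # the two witnesses flatly contradict each other
    return out, conflict


def dempster(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule: throw the conflict away and renormalize the rest."""
    out, _ = conjunctive(m1, m2)
    if not out:
        raise ValueError("total conflict: Dempster's rule is undefined")
    # Dividing by the surviving mass equals dividing by 1 - K
    # when both inputs sum to one.
    total = sum(out.values())
    return {s: v / total for s, v in out.items()}


def yager(m1: Mass, m2: Mass, frame: frozenset[str]) -> Mass:
    """Yager's rule: move the conflict onto the whole frame ("I'm not sure")."""
    out, conflict = conjunctive(m1, m2)
    out[frame] = out.get(frame, 0.0) + conflict
    return out


FRAME = frozenset({"Gardener", "Maid", "Secretary"})
butler = {frozenset({"Gardener"}): 0.99, frozenset({"Secretary"}): 0.01}
cook = {frozenset({"Maid"}): 0.99, frozenset({"Secretary"}): 0.01}

by_dempster = dempster(butler, cook)      # Secretary ends at 1.0
by_yager = yager(butler, cook, FRAME)     # about 0.9999 lands on the whole frame
```

The only pair of claims that overlaps is Secretary with Secretary, worth 0.01 × 0.01 = 0.0001. Dempster divides that sliver by itself and gets 1.0; Yager leaves it at 0.0001 and parks the remaining 0.9999 on "any of them".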
Things I'd do differently
The frontend is hand-written vanilla HTML/JS. No bundler. No framework. In a 9-day window this was the right call: a gr.Server-style FastAPI mount plus a frontend/ directory of static files left zero time for build tooling and yielded the Off-Brand badge cleanly. For anything bigger, I'd reach for Svelte.
Story mode wins; procedural mode is the cherry. The four hand-authored cases (Garden Path → Two Truths → Echo → Banquet) are tuned to a specific lesson; their numbers were chosen, validated, retuned, and locked. Endless Mode + a Daily Case give replay value, but the emotional payload is in the authored cases.
The biggest risk was the CSS, not the math. The Belief Engine was the first thing I built (one afternoon, ~250 lines, every GDD §B number reproduced by pytest), and it never broke again. The frontend, by contrast, ate a day of fiddling — listener accumulation across navigations, Tailwind-CDN specificity races against the custom stylesheet, the SSE reader that silently swallowed errors. Reviews caught all of them; without two-stage subagent review I'd have shipped broken.
Day 9: deployment chose itself. The original plan was a t4-small Space running Qwen2.5 via llama.cpp with CUDA. Moving to the hackathon org for submission day forced a hardware reset, and that reset turned into a feature: the providers backend means a casual visitor gets a 1-second witness reply on free hardware, while the offline-first crowd can still pip install -e .[local], pull the GGUF, and play with no network.
Badges
| Badge | Where it lives |
|---|---|
| ⚖️ Tiny Titan | The local llama.cpp build runs Qwen2.5-3B-Instruct (≤4B). The hosted Space's providers model is Qwen2.5-7B-Instruct — also well under 32B. |
| 🎨 Off-Brand | Custom HTML/JS courtroom mounted on FastAPI — no Gradio default UI. |
| 📡 Sharing is Caring | Synthetic witness interrogation traces published as a public HF dataset. |
| 📓 Field Notes | This post. |
| 🔌 Off the Grid (local mode) | WITNESS_BACKEND=llama_cpp + the GGUF on disk — no cloud calls at runtime. |
| 🦙 Llama Champion (local mode) | llama-cpp-python runtime under the local backend. |
Credits
Built by thangvip on a MacBook, June 2026. With apologies to Arthur Dempster, Glenn Shafer, Ronald Yager, Florentin Smarandache, and Thierry Denœux, whose mathematics the game (badly) personifies.
If you played a case and got a different verdict than your friend, you understood the point.
