Upload folder using huggingface_hub
- README.md +13 -8
- matchday/wc2026.py +1 -1
README.md
CHANGED
@@ -60,7 +60,7 @@ trip planner that treats a small model as a genuine decision-maker, not a chatbo

 The standout agent behavior: **MatchDay corrects you when you're wrong and
 refuses to plan when a match doesn't exist.** Ask for *"Canada vs Qatar, June
-26"* and it tells you the real match is **June 18 at BC Place,
+26"* and it tells you the real match is **June 18 at BC Place, 3:00 PM PT** and
 re-plans around it. Ask for *"Canada vs Morocco"* and it won't pretend — that
 match doesn't exist, so it offers the real alternatives instead. That grounding
 is the difference between an agent and a form.
@@ -76,10 +76,12 @@ is the difference between an agent and a form.
 (flights, hotels, weather, nearby spots), fans them out concurrently, and
 scores each package with a fixed formula (cost / arrival-buffer /
 stadium-proximity). Every value gets a provenance badge.
-- **🔁 Loop:** a bounded
-Pydantic argument validation,
-
-
+- **🔁 Loop:** a bounded agent loop (**≤5 tool rounds**) with a tool allowlist,
+Pydantic argument validation, one malformed-call self-correction pass,
+per-tool timeouts, a **cold-start retry** (a round-1 Modal timeout is retried
+once so the agent actually runs instead of silently degrading to the parser),
+and an honest, user-visible deterministic fallback. Nemotron emits structured
+tool calls via SGLang's `qwen3_coder` + `nemotron_3` parsers.

 ## 🤖 Best Agent — multi-step tool use & planning (under the 32B cap)

@@ -97,11 +99,14 @@ an agent and not a pipeline:
 fixture table is the ground truth. The agent re-centers the trip on the *real*
 match date (preserving the user's nights) and refuses nonexistent matchups
 with honest alternatives — proven by `tests/test_wc2026_grounding.py`
-(6/6 zero-network checks: Canada vs Qatar → Jun 18 /
+(6/6 zero-network checks: Canada vs Qatar → Jun 18 / 3:00 PM PT / 3 nights;
 Brazil vs Germany and Canada vs Morocco refused).
 - **Guardrails that keep it honest:** tool allowlist, Pydantic arg validation,
-
+one malformed-call self-correction, per-tool timeouts, cold-start retry, and a
 user-visible fallback to deterministic parsing when Modal is cold-starting.
+The loop's agentic behavior — tool dispatch, self-correction, deterministic
+fallback, cold-start retry, and trace recording — is proven by
+`tests/test_agent_loop.py` (9 zero-network checks).
 - **Brain + Hands separation:** the model decides and explains; Python executes
 every external call and scores every price — so the model can't hallucinate a
 flight number or invent a rate.
@@ -179,7 +184,7 @@ Nemotron-3-Nano-30B-A3B (3B-active MoE) · Modal A100-80GB + SGLang v0.5.12

 | Prize | Why MatchDay qualifies |
 | --- | --- |
-| 🤖 **Best Agent** | Bounded
+| 🤖 **Best Agent** | Bounded agent loop (≤5 rounds), 3 tools chosen autonomously, genuine multi-step turns (search → build), schedule grounding + honest refusal, guardrails. Proven by 9 zero-network loop tests. 30B < 32B. |
 | 🎨 **Off-Brand** | Bespoke Layla-style UI on `gradio.Server` — custom HTML/CSS/JS, photo cards, Leaflet map, provenance pills. Not stock Gradio. |
 | 🟢 **NVIDIA Nemotron Quest** | Nemotron-3-Nano-30B is the Brain; SGLang tool-calling verified live; reasoning mode wired. |
 | 🟣 **Modal** | A100 inference runtime, documented above (`matchday/modal_spike.py`). |
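For readers who want a concrete picture of the loop the README change describes, the following is a minimal sketch, not the project's actual code. It assumes a model that is a plain callable returning a dict, a single hypothetical tool named `search_flights`, and a hand-written `validate` function standing in for the Pydantic validation the README mentions. The names `run_agent`, `call_model`, `Trace`, and `TOOLS` are illustrative, per-tool timeouts are left out for brevity, and whether a self-correction pass consumes one of the five rounds in the real loop is not stated in the diff; here it does.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

MAX_ROUNDS = 5  # mirrors the README's "≤5 tool rounds" bound


def _search_flights(origin: str, dest: str) -> str:
    # Hypothetical tool body; a real tool would call an external API.
    return f"{origin}->{dest}"


# Hypothetical allowlist: tool name -> (argument schema, implementation).
TOOLS: dict = {"search_flights": ({"origin": str, "dest": str}, _search_flights)}


@dataclass
class Trace:
    steps: list = field(default_factory=list)
    used_fallback: bool = False


def validate(name: Any, args: Any) -> Optional[str]:
    """Return an error message, or None if the call is allowed and well-typed."""
    if name not in TOOLS:
        return f"tool {name!r} is not on the allowlist"
    schema = TOOLS[name][0]
    if not isinstance(args, dict) or set(args) != set(schema):
        return f"expected arguments {sorted(schema)}"
    for key, typ in schema.items():
        if not isinstance(args[key], typ):
            return f"argument {key!r} must be {typ.__name__}"
    return None


def call_model(model: Callable, history: list, round_no: int, trace: Trace):
    """Call the model; a timeout is retried once, and only in round 1."""
    try:
        return model(history)
    except TimeoutError:
        if round_no != 1:
            return None
    trace.steps.append("cold-start retry")
    try:
        return model(history)
    except TimeoutError:
        return None


def run_agent(model: Callable, fallback: Callable, prompt: str):
    trace = Trace()
    history = [prompt]
    corrected = False  # only one self-correction pass is allowed
    for round_no in range(1, MAX_ROUNDS + 1):
        reply = call_model(model, history, round_no, trace)
        if reply is None:
            break  # model unavailable: drop to the fallback below
        if "tool" not in reply:
            trace.steps.append("final")
            return reply["answer"], trace
        error = validate(reply["tool"], reply.get("args", {}))
        if error is not None:
            if corrected:
                break  # second malformed call: stop trusting the model
            corrected = True
            trace.steps.append("self-correction")
            history.append(f"error: {error}")
            continue
        impl = TOOLS[reply["tool"]][1]
        history.append(f"result: {impl(**reply['args'])}")
        trace.steps.append(f"tool:{reply['tool']}")
    # Reached on timeout, repeated malformed calls, or round exhaustion.
    trace.used_fallback = True
    trace.steps.append("fallback")
    return fallback(prompt), trace
```

Recording `used_fallback` in the trace is one way to keep the fallback user-visible rather than silent: the caller can show it in the UI instead of presenting parser output as if the agent had produced it.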
matchday/wc2026.py
CHANGED
@@ -53,7 +53,7 @@ class Fixture:
 _FIXTURES: list[Fixture] = [
     # Group B — Canada's group (Canada, Bosnia & Herzegovina, Qatar, Switzerland)
     Fixture("Canada vs Bosnia and Herzegovina", date(2026, 6, 12), "BMO Field", "Toronto", "15:00 ET", "B"),
-    Fixture("Canada vs Qatar", date(2026, 6, 18), "BC Place", "Vancouver", "
+    Fixture("Canada vs Qatar", date(2026, 6, 18), "BC Place", "Vancouver", "15:00 PT", "B"),
     Fixture("Canada vs Switzerland", date(2026, 6, 24), "BC Place", "Vancouver", "12:00 PT", "B"),
     Fixture("Qatar vs Switzerland", date(2026, 6, 13), "Levi's Stadium", "San Francisco Bay Area", "", "B"),
     Fixture("Switzerland vs Bosnia and Herzegovina", date(2026, 6, 18), "SoFi Stadium", "Los Angeles", "", "B"),
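The grounding behavior this fixture row supports (correct a wrong date, refuse a matchup that is not in the table) could look roughly like the sketch below. The `Fixture` field names are guesses inferred from the positional arguments in the diff, and `ground` with its return shape is a hypothetical helper, not the function in `matchday/wc2026.py`. Only three of the fixture rows are copied in.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fixture:
    # Field names are assumptions; the diff only shows positional values.
    matchup: str
    day: date
    stadium: str
    city: str
    kickoff: str
    group: str


FIXTURES = [
    Fixture("Canada vs Bosnia and Herzegovina", date(2026, 6, 12), "BMO Field", "Toronto", "15:00 ET", "B"),
    Fixture("Canada vs Qatar", date(2026, 6, 18), "BC Place", "Vancouver", "15:00 PT", "B"),
    Fixture("Canada vs Switzerland", date(2026, 6, 24), "BC Place", "Vancouver", "12:00 PT", "B"),
]


def _teams(fx: Fixture) -> list:
    # "Canada vs Qatar" -> ["canada", "qatar"]
    return [t.lower() for t in fx.matchup.split(" vs ")]


def ground(team_a: str, team_b: str, requested: date) -> dict:
    """Check a requested match against the fixture table.

    Returns status "ok" (date matches), "corrected" (match exists on another
    date), or "refused" (no such match, with real alternatives attached).
    """
    wanted = {team_a.lower(), team_b.lower()}
    for fx in FIXTURES:
        if set(_teams(fx)) == wanted:
            status = "ok" if fx.day == requested else "corrected"
            return {"status": status, "fixture": fx}
    # No such matchup: offer real fixtures involving either requested team.
    alternatives = [fx.matchup for fx in FIXTURES if wanted & set(_teams(fx))]
    return {"status": "refused", "alternatives": alternatives}


if __name__ == "__main__":
    print(ground("Canada", "Qatar", date(2026, 6, 26)))
    print(ground("Canada", "Morocco", date(2026, 6, 26)))
```

Because the lookup reads only the in-memory table, it can be exercised with no network access, which is the property the README's zero-network grounding tests rely on.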