Commit b9f2ba1 (verified) by mzidan000
1 Parent(s): 02b8de5

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +13 -8
  2. matchday/wc2026.py +1 -1
README.md CHANGED
@@ -60,7 +60,7 @@ trip planner that treats a small model as a genuine decision-maker, not a chatbo
 
  The standout agent behavior: **MatchDay corrects you when you're wrong and
  refuses to plan when a match doesn't exist.** Ask for *"Canada vs Qatar, June
- 26"* and it tells you the real match is **June 18 at BC Place, 12:00 PM PT** and
  re-plans around it. Ask for *"Canada vs Morocco"* and it won't pretend — that
  match doesn't exist, so it offers the real alternatives instead. That grounding
  is the difference between an agent and a form.
@@ -76,10 +76,12 @@ is the difference between an agent and a form.
  (flights, hotels, weather, nearby spots), fans them out concurrently, and
  scores each package with a fixed formula (cost / arrival-buffer /
  stadium-proximity). Every value gets a provenance badge.
- - **🔁 Loop:** a bounded ReAct agent loop (**≤5 tool rounds**) with an allowlist,
- Pydantic argument validation, duplicate-call detection, one self-correction
- pass, per-tool timeouts, and an honest deterministic fallback. Nemotron emits
- structured tool calls via SGLang's `qwen3_coder` + `nemotron_3` parsers.
 
  ## 🤖 Best Agent — multi-step tool use & planning (under the 32B cap)
 
@@ -97,11 +99,14 @@ an agent and not a pipeline:
  fixture table is the ground truth. The agent re-centers the trip on the *real*
  match date (preserving the user's nights) and refuses nonexistent matchups
  with honest alternatives — proven by `tests/test_wc2026_grounding.py`
- (6/6 zero-network checks: Canada vs Qatar → Jun 18 / 12:00 PT / 3 nights;
  Brazil vs Germany and Canada vs Morocco refused).
  - **Guardrails that keep it honest:** tool allowlist, Pydantic arg validation,
- duplicate suppression, one malformed-call correction, timeouts, and a
  user-visible fallback to deterministic parsing when Modal is cold-starting.
  - **Brain + Hands separation:** the model decides and explains; Python executes
  every external call and scores every price — so the model can't hallucinate a
  flight number or invent a rate.
@@ -179,7 +184,7 @@ Nemotron-3-Nano-30B-A3B (3B-active MoE) · Modal A100-80GB + SGLang v0.5.12
 
  | Prize | Why MatchDay qualifies |
  | --- | --- |
- | 🤖 **Best Agent** | Bounded ReAct loop (≤5 rounds), 3 tools chosen autonomously, genuine multi-step turns (search → build), schedule grounding + honest refusal, guardrails. 30B < 32B. |
  | 🎨 **Off-Brand** | Bespoke Layla-style UI on `gradio.Server` — custom HTML/CSS/JS, photo cards, Leaflet map, provenance pills. Not stock Gradio. |
  | 🟢 **NVIDIA Nemotron Quest** | Nemotron-3-Nano-30B is the Brain; SGLang tool-calling verified live; reasoning mode wired. |
  | 🟣 **Modal** | A100 inference runtime, documented above (`matchday/modal_spike.py`). |
 
 
  The standout agent behavior: **MatchDay corrects you when you're wrong and
  refuses to plan when a match doesn't exist.** Ask for *"Canada vs Qatar, June
+ 26"* and it tells you the real match is **June 18 at BC Place, 3:00 PM PT** and
  re-plans around it. Ask for *"Canada vs Morocco"* and it won't pretend — that
  match doesn't exist, so it offers the real alternatives instead. That grounding
  is the difference between an agent and a form.
 
  (flights, hotels, weather, nearby spots), fans them out concurrently, and
  scores each package with a fixed formula (cost / arrival-buffer /
  stadium-proximity). Every value gets a provenance badge.
+ - **🔁 Loop:** a bounded agent loop (**≤5 tool rounds**) with a tool allowlist,
+ Pydantic argument validation, one malformed-call self-correction pass,
+ per-tool timeouts, a **cold-start retry** (a round-1 Modal timeout is retried
+ once so the agent actually runs instead of silently degrading to the parser),
+ and an honest, user-visible deterministic fallback. Nemotron emits structured
+ tool calls via SGLang's `qwen3_coder` + `nemotron_3` parsers.
 
  ## 🤖 Best Agent — multi-step tool use & planning (under the 32B cap)
 
 
  fixture table is the ground truth. The agent re-centers the trip on the *real*
  match date (preserving the user's nights) and refuses nonexistent matchups
  with honest alternatives — proven by `tests/test_wc2026_grounding.py`
+ (6/6 zero-network checks: Canada vs Qatar → Jun 18 / 3:00 PM PT / 3 nights;
  Brazil vs Germany and Canada vs Morocco refused).
  - **Guardrails that keep it honest:** tool allowlist, Pydantic arg validation,
+ one malformed-call self-correction, per-tool timeouts, cold-start retry, and a
  user-visible fallback to deterministic parsing when Modal is cold-starting.
+ The loop's agentic behavior — tool dispatch, self-correction, deterministic
+ fallback, cold-start retry, and trace recording — is proven by
+ `tests/test_agent_loop.py` (9 zero-network checks).
  - **Brain + Hands separation:** the model decides and explains; Python executes
  every external call and scores every price — so the model can't hallucinate a
  flight number or invent a rate.
 
 
  | Prize | Why MatchDay qualifies |
  | --- | --- |
+ | 🤖 **Best Agent** | Bounded agent loop (≤5 rounds), 3 tools chosen autonomously, genuine multi-step turns (search → build), schedule grounding + honest refusal, guardrails. Proven by 9 zero-network loop tests. 30B < 32B. |
  | 🎨 **Off-Brand** | Bespoke Layla-style UI on `gradio.Server` — custom HTML/CSS/JS, photo cards, Leaflet map, provenance pills. Not stock Gradio. |
  | 🟢 **NVIDIA Nemotron Quest** | Nemotron-3-Nano-30B is the Brain; SGLang tool-calling verified live; reasoning mode wired. |
  | 🟣 **Modal** | A100 inference runtime, documented above (`matchday/modal_spike.py`). |
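
The loop bullet in the README diff above describes a bounded tool loop with an allowlist, a one-time cold-start retry, and a deterministic fallback. A minimal sketch of that control flow, assuming nothing about the real implementation: `run_agent`, `ColdStartTimeout`, `LoopResult`, and the dict shape returned by `ask_model` are all hypothetical names, and the Pydantic validation, self-correction pass, and per-tool timeouts are left out to keep the example short.

```python
from dataclasses import dataclass, field

MAX_ROUNDS = 5  # the README states at most 5 tool rounds


class ColdStartTimeout(Exception):
    """Hypothetical error raised when the model endpoint times out."""


@dataclass
class LoopResult:
    answer: str
    used_fallback: bool
    trace: list = field(default_factory=list)


def run_agent(ask_model, tools, fallback, max_rounds=MAX_ROUNDS):
    """Run a bounded tool loop.

    ask_model(history) is assumed to return either
    {"tool": name, "args": {...}} or {"final": text}.
    """
    history, trace = [], []
    for round_no in range(1, max_rounds + 1):
        try:
            step = ask_model(history)
        except ColdStartTimeout:
            if round_no != 1:
                return LoopResult(fallback(), True, trace)
            # Cold-start retry: a round-1 timeout gets exactly one more attempt.
            trace.append("cold-start retry")
            try:
                step = ask_model(history)
            except ColdStartTimeout:
                return LoopResult(fallback(), True, trace)
        if "final" in step:
            return LoopResult(step["final"], False, trace)
        name = step.get("tool")
        if name not in tools:
            # Allowlist: unknown tools are recorded and never executed.
            trace.append(f"rejected {name}")
            history.append({"error": f"tool {name!r} is not allowed"})
            continue
        result = tools[name](**step.get("args", {}))
        trace.append(f"called {name}")
        history.append({"tool": name, "result": result})
    # Round budget exhausted without a final answer: fall back visibly.
    return LoopResult(fallback(), True, trace)
```

The `used_fallback` flag is what would let a UI tell the user that the deterministic parser answered instead of the model, which is the "user-visible" part of the fallback the README mentions.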
matchday/wc2026.py CHANGED
@@ -53,7 +53,7 @@ class Fixture:
  _FIXTURES: list[Fixture] = [
  # Group B — Canada's group (Canada, Bosnia & Herzegovina, Qatar, Switzerland)
  Fixture("Canada vs Bosnia and Herzegovina", date(2026, 6, 12), "BMO Field", "Toronto", "15:00 ET", "B"),
- Fixture("Canada vs Qatar", date(2026, 6, 18), "BC Place", "Vancouver", "12:00 PT", "B"),
  Fixture("Canada vs Switzerland", date(2026, 6, 24), "BC Place", "Vancouver", "12:00 PT", "B"),
  Fixture("Qatar vs Switzerland", date(2026, 6, 13), "Levi's Stadium", "San Francisco Bay Area", "", "B"),
  Fixture("Switzerland vs Bosnia and Herzegovina", date(2026, 6, 18), "SoFi Stadium", "Los Angeles", "", "B"),
 
  _FIXTURES: list[Fixture] = [
  # Group B — Canada's group (Canada, Bosnia & Herzegovina, Qatar, Switzerland)
  Fixture("Canada vs Bosnia and Herzegovina", date(2026, 6, 12), "BMO Field", "Toronto", "15:00 ET", "B"),
+ Fixture("Canada vs Qatar", date(2026, 6, 18), "BC Place", "Vancouver", "15:00 PT", "B"),
  Fixture("Canada vs Switzerland", date(2026, 6, 24), "BC Place", "Vancouver", "12:00 PT", "B"),
  Fixture("Qatar vs Switzerland", date(2026, 6, 13), "Levi's Stadium", "San Francisco Bay Area", "", "B"),
  Fixture("Switzerland vs Bosnia and Herzegovina", date(2026, 6, 18), "SoFi Stadium", "Los Angeles", "", "B"),
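
Because the README text quotes the kickoff time from this table, a grounding lookup is where the one-line fix becomes visible to users. The following is a rough sketch of how a request could be checked against the fixture table. The field names on `Fixture` and the `ground` helper are assumptions, since the diff shows only positional arguments and not the class body; the two rows are copied from the updated table.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fixture:
    # Field names are assumptions; the diff only shows positional arguments.
    matchup: str
    day: date
    venue: str
    city: str
    kickoff: str
    group: str


FIXTURES = [
    Fixture("Canada vs Qatar", date(2026, 6, 18), "BC Place", "Vancouver", "15:00 PT", "B"),
    Fixture("Canada vs Switzerland", date(2026, 6, 24), "BC Place", "Vancouver", "12:00 PT", "B"),
]


def ground(team_a, team_b, requested):
    """Return (fixture, corrected) for a real matchup, or (None, False) to refuse.

    `corrected` is True when the requested date differs from the fixture date.
    """
    wanted = {team_a.lower(), team_b.lower()}
    for fx in FIXTURES:
        # Compare team sets so "Qatar vs Canada" matches "Canada vs Qatar".
        teams = {t.strip().lower() for t in fx.matchup.split(" vs ")}
        if teams == wanted:
            return fx, fx.day != requested
    return None, False
```

With this shape, a request for Canada vs Qatar on June 26 resolves to the June 18 fixture with `corrected` set, while Canada vs Morocco returns `None` so the caller can refuse and offer real alternatives.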