build-small-hackathon/matchday


mzidan000 committed 19 days ago

Commit b9f2ba1 (verified) · 1 parent: 02b8de5

Upload folder using huggingface_hub

Files changed (2)

README.md +13 -8
matchday/wc2026.py +1 -1

README.md CHANGED

@@ -60,7 +60,7 @@ trip planner that treats a small model as a genuine decision-maker, not a chatbo
 The standout agent behavior: **MatchDay corrects you when you're wrong and
 refuses to plan when a match doesn't exist.** Ask for *"Canada vs Qatar, June
-26"* and it tells you the real match is **June 18 at BC Place, 12:00 PM PT** and
 re-plans around it. Ask for *"Canada vs Morocco"* and it won't pretend — that
 match doesn't exist, so it offers the real alternatives instead. That grounding
 is the difference between an agent and a form.
@@ -76,10 +76,12 @@ is the difference between an agent and a form.
   (flights, hotels, weather, nearby spots), fans them out concurrently, and
   scores each package with a fixed formula (cost / arrival-buffer /
   stadium-proximity). Every value gets a provenance badge.
-- **🔁 Loop:** a bounded ReAct agent loop (**≤5 tool rounds**) with an allowlist,
-  Pydantic argument validation, duplicate-call detection, one self-correction
-  pass, per-tool timeouts, and an honest deterministic fallback. Nemotron emits
-  structured tool calls via SGLang's `qwen3_coder` + `nemotron_3` parsers.
 ## 🤖 Best Agent — multi-step tool use & planning (under the 32B cap)
@@ -97,11 +99,14 @@ an agent and not a pipeline:
   fixture table is the ground truth. The agent re-centers the trip on the *real*
   match date (preserving the user's nights) and refuses nonexistent matchups
   with honest alternatives — proven by `tests/test_wc2026_grounding.py`
-  (6/6 zero-network checks: Canada vs Qatar → Jun 18 / 12:00 PT / 3 nights;
   Brazil vs Germany and Canada vs Morocco refused).
 - **Guardrails that keep it honest:** tool allowlist, Pydantic arg validation,
-  duplicate suppression, one malformed-call correction, timeouts, and a
   user-visible fallback to deterministic parsing when Modal is cold-starting.
 - **Brain + Hands separation:** the model decides and explains; Python executes
   every external call and scores every price — so the model can't hallucinate a
   flight number or invent a rate.
@@ -179,7 +184,7 @@ Nemotron-3-Nano-30B-A3B (3B-active MoE) · Modal A100-80GB + SGLang v0.5.12
 | Prize | Why MatchDay qualifies |
 | --- | --- |
-| 🤖 **Best Agent** | Bounded ReAct loop (≤5 rounds), 3 tools chosen autonomously, genuine multi-step turns (search → build), schedule grounding + honest refusal, guardrails. 30B < 32B. |
 | 🎨 **Off-Brand** | Bespoke Layla-style UI on `gradio.Server` — custom HTML/CSS/JS, photo cards, Leaflet map, provenance pills. Not stock Gradio. |
 | 🟢 **NVIDIA Nemotron Quest** | Nemotron-3-Nano-30B is the Brain; SGLang tool-calling verified live; reasoning mode wired. |
 | 🟣 **Modal** | A100 inference runtime, documented above (`matchday/modal_spike.py`). |

 The standout agent behavior: **MatchDay corrects you when you're wrong and
 refuses to plan when a match doesn't exist.** Ask for *"Canada vs Qatar, June
+26"* and it tells you the real match is **June 18 at BC Place, 3:00 PM PT** and
 re-plans around it. Ask for *"Canada vs Morocco"* and it won't pretend — that
 match doesn't exist, so it offers the real alternatives instead. That grounding
 is the difference between an agent and a form.
   (flights, hotels, weather, nearby spots), fans them out concurrently, and
   scores each package with a fixed formula (cost / arrival-buffer /
   stadium-proximity). Every value gets a provenance badge.
+- **🔁 Loop:** a bounded agent loop (**≤5 tool rounds**) with a tool allowlist,
+  Pydantic argument validation, one malformed-call self-correction pass,
+  per-tool timeouts, a **cold-start retry** (a round-1 Modal timeout is retried
+  once so the agent actually runs instead of silently degrading to the parser),
+  and an honest, user-visible deterministic fallback. Nemotron emits structured
+  tool calls via SGLang's `qwen3_coder` + `nemotron_3` parsers.
 ## 🤖 Best Agent — multi-step tool use & planning (under the 32B cap)
   fixture table is the ground truth. The agent re-centers the trip on the *real*
   match date (preserving the user's nights) and refuses nonexistent matchups
   with honest alternatives — proven by `tests/test_wc2026_grounding.py`
+  (6/6 zero-network checks: Canada vs Qatar → Jun 18 / 3:00 PM PT / 3 nights;
   Brazil vs Germany and Canada vs Morocco refused).
 - **Guardrails that keep it honest:** tool allowlist, Pydantic arg validation,
+  one malformed-call self-correction, per-tool timeouts, cold-start retry, and a
   user-visible fallback to deterministic parsing when Modal is cold-starting.
+  The loop's agentic behavior — tool dispatch, self-correction, deterministic
+  fallback, cold-start retry, and trace recording — is proven by
+  `tests/test_agent_loop.py` (9 zero-network checks).
 - **Brain + Hands separation:** the model decides and explains; Python executes
   every external call and scores every price — so the model can't hallucinate a
   flight number or invent a rate.
 | Prize | Why MatchDay qualifies |
 | --- | --- |
+| 🤖 **Best Agent** | Bounded agent loop (≤5 rounds), 3 tools chosen autonomously, genuine multi-step turns (search → build), schedule grounding + honest refusal, guardrails. Proven by 9 zero-network loop tests. 30B < 32B. |
 | 🎨 **Off-Brand** | Bespoke Layla-style UI on `gradio.Server` — custom HTML/CSS/JS, photo cards, Leaflet map, provenance pills. Not stock Gradio. |
 | 🟢 **NVIDIA Nemotron Quest** | Nemotron-3-Nano-30B is the Brain; SGLang tool-calling verified live; reasoning mode wired. |
 | 🟣 **Modal** | A100 inference runtime, documented above (`matchday/modal_spike.py`). |
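
The Loop bullet added in this commit describes a bounded loop with a tool allowlist, one malformed-call self-correction, a round-1 cold-start retry, and a user-visible deterministic fallback. The sketch below is one way those rules could fit together. It is not the code in this repository: `run_agent_loop`, `LoopResult`, `ALLOWED_TOOLS`, and the tool names are hypothetical, a plain `isinstance` check stands in for the Pydantic validation, and per-tool timeouts are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_ROUNDS = 5  # the README's "≤5 tool rounds" bound
# Hypothetical tool names; the README only says there are 3 tools.
ALLOWED_TOOLS = {"search_flights", "search_hotels", "build_package"}


@dataclass
class LoopResult:
    answer: str
    used_fallback: bool = False
    trace: list[str] = field(default_factory=list)


def run_agent_loop(call_model: Callable[[list[str]], dict],
                   tools: dict[str, Callable[..., str]],
                   fallback: Callable[[], str]) -> LoopResult:
    trace: list[str] = []
    corrected = False  # at most one self-correction pass
    retried = False    # at most one cold-start retry, round 1 only
    round_no = 1
    while round_no <= MAX_ROUNDS:
        try:
            step = call_model(trace)
        except TimeoutError:
            if round_no == 1 and not retried:
                retried = True
                trace.append("cold-start retry")
                continue  # retry round 1 without spending a round
            trace.append("fallback: model timeout")
            return LoopResult(fallback(), True, trace)
        if "final" in step:
            return LoopResult(step["final"], False, trace)
        name, args = step.get("tool"), step.get("args", {})
        valid = name in ALLOWED_TOOLS and name in tools and isinstance(args, dict)
        if not valid:
            if corrected:
                trace.append("fallback: malformed call")
                return LoopResult(fallback(), True, trace)
            corrected = True
            trace.append(f"rejected call: {name!r}")
        else:
            trace.append(f"{name} -> {tools[name](**args)}")
        round_no += 1
    trace.append("fallback: round budget exhausted")
    return LoopResult(fallback(), True, trace)
```

Recording every decision in `trace`, including the retry and the reason for any fallback, is what makes the degradation visible to the user rather than silent.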

matchday/wc2026.py CHANGED

@@ -53,7 +53,7 @@ class Fixture:
 _FIXTURES: list[Fixture] = [
     # Group B — Canada's group (Canada, Bosnia & Herzegovina, Qatar, Switzerland)
     Fixture("Canada vs Bosnia and Herzegovina", date(2026, 6, 12), "BMO Field", "Toronto", "15:00 ET", "B"),
-    Fixture("Canada vs Qatar", date(2026, 6, 18), "BC Place", "Vancouver", "12:00 PT", "B"),
     Fixture("Canada vs Switzerland", date(2026, 6, 24), "BC Place", "Vancouver", "12:00 PT", "B"),
     Fixture("Qatar vs Switzerland", date(2026, 6, 13), "Levi's Stadium", "San Francisco Bay Area", "", "B"),
     Fixture("Switzerland vs Bosnia and Herzegovina", date(2026, 6, 18), "SoFi Stadium", "Los Angeles", "", "B"),

 _FIXTURES: list[Fixture] = [
     # Group B — Canada's group (Canada, Bosnia & Herzegovina, Qatar, Switzerland)
     Fixture("Canada vs Bosnia and Herzegovina", date(2026, 6, 12), "BMO Field", "Toronto", "15:00 ET", "B"),
+    Fixture("Canada vs Qatar", date(2026, 6, 18), "BC Place", "Vancouver", "15:00 PT", "B"),
     Fixture("Canada vs Switzerland", date(2026, 6, 24), "BC Place", "Vancouver", "12:00 PT", "B"),
     Fixture("Qatar vs Switzerland", date(2026, 6, 13), "Levi's Stadium", "San Francisco Bay Area", "", "B"),
     Fixture("Switzerland vs Bosnia and Herzegovina", date(2026, 6, 18), "SoFi Stadium", "Los Angeles", "", "B"),
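
The README hunks above describe the fixture table as ground truth: a wrong date is corrected to the real one, and a matchup that is not in the table is refused with alternatives. A minimal sketch of that lookup follows, using two rows from the post-commit table. The `Fixture` field names and the `ground` function are assumptions, since the diff only shows positional arguments and does not include the lookup code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fixture:
    # Field names are guesses; the diff shows positional arguments only.
    match: str
    day: date
    venue: str
    city: str
    kickoff: str
    group: str


FIXTURES = [
    Fixture("Canada vs Qatar", date(2026, 6, 18), "BC Place", "Vancouver", "15:00 PT", "B"),
    Fixture("Canada vs Switzerland", date(2026, 6, 24), "BC Place", "Vancouver", "12:00 PT", "B"),
]


def ground(team_a: str, team_b: str, requested: date):
    """Return (fixture, corrected, alternatives) for a requested matchup.

    corrected is True when the matchup exists but on a different date.
    fixture is None when the matchup is not in the table; alternatives
    then lists real fixtures involving team_a.
    """
    wanted = {team_a.lower(), team_b.lower()}
    for fx in FIXTURES:
        teams = {t.strip().lower() for t in fx.match.split(" vs ")}
        if teams == wanted:
            return fx, fx.day != requested, []
    alternatives = [fx for fx in FIXTURES if team_a.lower() in fx.match.lower()]
    return None, False, alternatives


# "Canada vs Qatar, June 26" resolves to the real June 18 fixture.
fixture, corrected, _ = ground("Canada", "Qatar", date(2026, 6, 26))
print(fixture.day, fixture.kickoff, corrected)

# "Canada vs Morocco" is not in the table, so only alternatives come back.
missing, _, options = ground("Canada", "Morocco", date(2026, 6, 20))
print(missing, [fx.match for fx in options])
```

Because the kickoff string lives only in the table, a one-line change like the one in this commit propagates to every answer that cites the match time.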