# CrisisInbox Hackathon Roadmap

**Event:** OpenEnv Hackathon @ Shack15, SF

**Dates:** Saturday March 7 – Sunday March 8, 2026

**Team Size:** 2

**Problem Statement:** 3.2 (Personalized Tasks) + Patronus AI Sub-Theme (Schema Drift)

---

## PHASE 1: Foundation (11:30 AM – 2:00 PM Saturday)

### Person A — Environment Setup

- [x] Read OpenEnv 0.2.1 docs thoroughly — understand environment structure, observation/action spaces, step function
- [x] Get a minimal "hello world" OpenEnv environment running locally
- [x] Define the core data model:
  - [x] `Message` object: sender, channel (6 types), content, urgency (critical/high/medium/low), deadline, dependencies, drift_flag, supersedes
  - [x] Task state tracked via environment internals (`_handled`, `_visible_messages`, deadline expiry) — no separate Task object needed since messages and tasks are 1:1
  - [x] World state tracked via environment instance vars (`_current_hour`, `_all_messages`, `_visible_messages`, `_handled`, `_score`, `_fired_drifts`, `_superseded`) — exposed to agent through `get_status` tool
- [x] Deploy bare-bones environment to HF Spaces — confirmed at eptan-crisis-inbox.hf.space
- [ ] Push initial scaffold to GitHub

### Person B — Training Pipeline Setup

- [x] Open Unsloth GRPO Colab notebook and run with a toy example end-to-end
- [x] Confirm training loop works: environment → agent action → reward → update
- [x] Define the action space — simplified to free-form: model outputs `respond_to_message(msg_id, "response")`, parsed via regex
- [x] Define observation space: full inbox snapshot as text prompt with urgency grouping, deadline warnings, drift flags, stale markers
- [x] Implement reward function (integrated in notebook `score_action()`):
  - [x] Urgency base: critical=10, high=5, medium=3, low=1
  - [x] Deadline timing: early bonus (up to +50%), late penalty (-75%)
  - [x] Schema drift adaptation: +50% for handling drift-flagged messages
  - [x] Stale info penalty: -50% for acting on superseded messages
  - [x] Response quality: -50% for short/empty responses
  - [x] Priority penalty: -70% for choosing low-urgency when critical messages exist
- [ ] Push training scaffold to GitHub

### Checkpoint — 2:00 PM

- [x] ✅ OpenEnv minimal environment runs locally
- [x] ✅ HF Spaces deployment pipeline confirmed
- [x] ✅ Unsloth GRPO training loop configured in notebook (Qwen2.5-0.5B + LoRA + GRPO)
- [x] ✅ Data model and reward function defined
- [ ] 🚨 If OpenEnv is blocking: simplify to gym-style env, wrap in OpenEnv later
- [ ] 🚨 If Unsloth is blocking: fall back to HF TRL directly

---

## PHASE 2: Core Build (2:00 PM – 6:00 PM Saturday)

### Person A — Environment Logic

- [x] Build message generation system:
  - [x] Create 18 sender profiles (Mom, Boss, Sister, Neighbor Dave, FEMA, NWS, State Farm, Delta Airlines, Oakwood Elementary, HR, Landlord, Coworker Sarah, PG&E, Red Cross, Bank of America, CVS Pharmacy, Sacramento County, Comcast)
  - [x] Write 76 messages across 48-hour simulated timeline
  - [x] Each message has: sender, channel, timestamp, urgency, deadline, content, dependencies, drift_flag, supersedes
  - [x] Organized into waves: initial crisis (hour 0-2), escalation (2-6), post-evacuation chaos (6-12), conflicting demands (12-20), drift events (20-30), recovery (26-36), ongoing management (32-40), final stretch (44-48)
- [x] Implement episode timeline engine:
  - [x] Messages arrive at scheduled sim-hours
  - [x] Time costs per action: reading = 0.1h, responding = 0.25h, advance_time tool = 0.5-4.0h
  - [x] Dependencies gate actions (must handle prerequisites first)
  - [x] Episode ends at hour 48
- [x] Build 5 schema drift events:
  - [x] Hour 20: Insurance deadline shortened from 72h to 48h
  - [x] Hour 21: Evacuation zone expanded to include Zone B (workplace)
  - [x] Hour 22.5: Employer emergency leave expanded (PTO → 5 days paid leave)
  - [x] Hour 24.5: Airline extends free rebooking window from 48h to 7 days
  - [x] Hour 34: FEMA adds new documentation requirements
  - [x] Randomization: each episode triggers 3 of 5 drift events (seed-controlled)
- [x] Add episode variation: +/-15% jitter on arrival times, +/-10% on deadlines (seed-controlled)
- [x] Integrate with OpenEnv API: `reset()`, `step()` implemented; 5 MCP tools (get_inbox, read_message, respond_to_message, get_status, advance_time)

### Person B — Reward & Training Integration

- [x] Reward function implemented in two places:
  - [x] Environment-side: `_calculate_reward()` in `crisis_inbox_environment.py` (used during live episodes)
  - [x] Training-side: `score_action()` in notebook (parses model output, scores against inbox state)
- [x] Episode generator (`generate_episodes.py`) produces offline training data:
  - [x] 50 episodes, 803 training prompts across 16 decision points per episode
  - [x] Full message content, drift flags, superseded markers, dependency info
  - [x] Prompts include urgency grouping, deadline warnings, stale markers
- [x] Connect to Unsloth GRPO: notebook loads episodes.json, builds HF Dataset, configures GRPOTrainer
- [ ] Run training on Colab with GPU and capture reward curves
- [ ] Log reward components separately to identify which signals are working

### Checkpoint — 6:00 PM (Dinner)

- [x] ✅ Environment generates realistic message streams (76 messages, 19 senders)
- [x] ✅ Schema drift events fire correctly (tested: superseded messages marked, drift rewards working)
- [x] ✅ Training notebook configured with GRPO, reward function, and evaluation
- [ ] ✅ First reward curves from actual GPU training run
- [ ] 🚨 If messages aren't generating: reduce to 30 messages, fewer senders
- [ ] 🚨 If training isn't connecting: hardcode environment responses, focus on reward signal

---

## PHASE 3: Integration & Iteration (6:00 PM – 10:00 PM Saturday)

### Person A — Environment Polish

- [x] Polish message content for realism and emotional impact (20+ messages rewritten)
  - [x] Mom: panicked texting, crying voicemails, medical anxiety, guilt-trip to Tahoe
  - [x] Boss (Greg): passive-aggressive emails with signature, softens after Meridian postpones
  - [x] Sister: desperation about kids, voice cracking in voicemail, genuine gratitude
  - [x] Neighbor Dave: casual bro tone, guilt about Whiskers, neighborhood solidarity
  - [x] Emma (niece, age 7): 3 kid-perspective messages — pillow fights, scared in the dark, rainbow drawing
  - [x] FEMA/NWS: kept formal and information-dense (already good)
- [x] Test all 5 drift events fire correctly and change environment state
- [x] Verify dependency chains work (23 messages with dependencies, gated in respond_to_message)
- [x] Edge case handling: stale info penalized (-50% reward), expired deadlines tracked in get_status
- [x] Redeploy updated environment to HF Spaces (73 messages, timeline engine, drift events)
- [x] Test HF Spaces deployment works end-to-end remotely (verified via client)

### Person B — Training Optimization

- [ ] Analyze initial reward curves — identify flat or noisy components
- [ ] Tune reward weights: if one signal dominates, rebalance
- [ ] If training is flat overall:
  - [ ] Simplify action space (reduce to 4 actions)
  - [ ] Increase reward magnitudes
  - [ ] Reduce episode length
- [ ] Log specific before/after agent behaviors:
  - [ ] Capture untrained agent: responds in order, ignores evacuation alert
  - [ ] Capture trained agent: triages safety first, adapts to drift
- [ ] Save training checkpoints and reward curve data for demo
- [ ] Target: 200-500 episodes with visible upward trend

### Checkpoint — 10:00 PM (Doors Close)

- [x] ✅ Environment fully functional with drift events on HF Spaces
- [ ] ✅ Training curves show upward trend (pending GPU run)
- [ ] ✅ At least 2 clear before/after behavior examples captured
- [ ] ✅ All code pushed to GitHub
- [ ] 🚨 If reward curves are flat: simplify environment overnight, retrain Sunday AM
- [ ] 🚨 If HF Spaces is broken: prepare local demo as backup
- [ ] Both teammates agree on Sunday morning priority list before leaving

---

## PHASE 4: Sunday Polish (9:00 AM – 12:00 PM Sunday)

### Person A — Demo & Presentation

- [x] Build demo display (`demo.py`): terminal-based visualization
  - [x] Color-coded urgency levels (red bg = critical, red = high, yellow = medium, green = low)
  - [x] Schema drift notifications with magenta banner
  - [x] Agent action visualization with blue banner and reward display
  - [x] Two strategies: smart triage vs naive (arrival order)
  - [x] Comparison mode shows 55% improvement (157.8 vs 101.8 pts)
  - [x] Coverage breakdown by urgency with ASCII progress bars
- [x] HF Spaces deployment live with polished environment
- [x] Write repo README:
  - [x] Scenario hook, three layers of difficulty, sender profiles table
  - [x] MCP tools table, reward function table, episode variation details
  - [x] Quick start (hosted + local), repo structure, tech stack
- [ ] Draft the 3-minute pitch outline:
  - [ ] 0:00-0:30 — The scenario hook ("A wildfire just hit. You have 47 unread messages.")
  - [ ] 0:30-1:30 — Show the environment: message stream, drift events, conflicting tasks
  - [ ] 1:30-2:15 — Untrained vs trained agent comparison
  - [ ] 2:15-2:45 — Reward curves and training results
  - [ ] 2:45-3:00 — Why this matters (real-world impact close)

### Person B — Training & Artifacts

- [ ] Run final training session for cleanest possible reward curves
- [ ] Export reward curve plots (all 5 components + composite)
- [ ] Export 3 specific before/after examples with clear narrative:
  - [ ] Example 1: Untrained ignores FEMA evacuation → Trained prioritizes it first
  - [ ] Example 2: Untrained misses insurance deadline after policy change → Trained adapts
  - [ ] Example 3: Untrained sends form-letter reply to Mom → Trained matches emotional tone
- [x] Finalize Colab notebook:
  - [x] Clean, commented code with markdown sections
  - [x] Reward function with test cases (good action vs bad action vs junk)
  - [x] Evaluation cell comparing trained model choices
  - [ ] Reward curves visible when run (pending GPU run)
- [ ] Verify all required artifacts exist:
  - [ ] Public GitHub repo
  - [ ] HF Spaces deployment
  - [ ] Colab training notebook
  - [ ] YouTube video (not yet recorded)

### Checkpoint — 12:00 PM

- [x] ✅ Demo runs smoothly end-to-end (smart vs naive comparison, 55% improvement)
- [ ] ✅ Reward curves are clean and trend upward (pending GPU run)
- [ ] ✅ Before/after examples are compelling (pending GPU run)
- [x] ✅ README is complete
- [x] ✅ Colab notebook is clean (14 cells, markdown sections, reward function tests)
- [ ] 🚨 If demo is buggy: simplify to scripted walkthrough of one episode
- [ ] 🚨 If reward curves are still flat: show qualitative behavior improvement instead

---

## PHASE 5: Record & Submit (12:00 PM – 1:00 PM Sunday)

### Both Together

- [ ] Record one-minute YouTube video:
  - [ ] Person A: voiceover narration
  - [ ] Person B: screen capture / screen share
  - [ ] Structure:
    - [ ] 0:00-0:15 — "CrisisInbox: training AI to triage personal tasks during disasters"
    - [ ] 0:15-0:35 — Show environment: messages arriving, drift event firing
    - [ ] 0:35-0:50 — Before/after: untrained fails → trained succeeds
    - [ ] 0:50-1:00 — Flash reward curves, close with impact statement
- [ ] Upload to YouTube (unlisted is fine)
- [ ] Rehearse 3-minute live pitch at least twice out loud
- [ ] Prepare for Q&A — practice answers to:
  - [ ] "Why is this hard for LLMs?"
  - [ ] "How does schema drift work specifically?"
  - [ ] "What would you build next with more time?"
  - [ ] "How is this different from a standard chatbot?"
  - [ ] "How does the reward function handle edge cases?"
- [ ] **SUBMIT via Cerebral Valley form:**
  - [ ] GitHub repo URL (public)
  - [ ] HF Spaces URL
  - [ ] YouTube video URL
  - [ ] Colab notebook URL
  - [ ] Select problem statement 3.2 + Patronus AI sub-theme
- [ ] Confirm submission received

---

## FINAL CHECKLIST — Before Judging Starts (1:15 PM Sunday)

- [ ] GitHub repo is public and all code is pushed
- [ ] HF Spaces deployment is live and accessible
- [ ] Colab notebook runs end-to-end
- [ ] YouTube video is uploaded and link works
- [ ] Submission form completed
- [ ] Both teammates can explain every part of the project
- [ ] Laptop charged for demo
- [ ] Demo runs without internet (backup plan if WiFi fails)

---

## EMERGENCY FALLBACKS

| Problem | Fallback |
|---------|----------|
| OpenEnv won't work | Build gym-style env, wrap in OpenEnv interface last |
| HF Spaces deploy fails | Run demo locally, screenshot HF attempt for judges |
| Training won't converge | Simplify to 3 actions, 20 messages, 2 drift events |
| Reward curves flat | Show qualitative behavior change, explain reward design |
| YouTube upload fails | Screen-record on phone, upload from mobile |
| One teammate burns out | Other covers — both should understand full stack |

---

## KEY RULE

**If something isn't working after 45 minutes, simplify. A working simple version always beats a broken ambitious one.**
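
## APPENDIX: REWARD SKETCH

The reward rules listed under Phase 1 (urgency base, deadline timing, drift bonus, stale penalty, response quality, priority penalty) can be illustrated with a minimal sketch. This is not the notebook's `score_action()` or the environment's `_calculate_reward()`. The roadmap does not say how the percentages combine, so the sketch assumes they multiply; it also assumes the early bonus scales linearly with the fraction of the deadline window still remaining. The names `Msg`, `sketch_score`, and `MIN_RESPONSE_CHARS` (and its value of 20) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Base reward per urgency level, as listed in the roadmap.
URGENCY_BASE = {"critical": 10.0, "high": 5.0, "medium": 3.0, "low": 1.0}

# Hypothetical cutoff for a "short/empty" response; the roadmap gives no number.
MIN_RESPONSE_CHARS = 20


@dataclass
class Msg:
    """Hypothetical, trimmed-down stand-in for the roadmap's Message object."""
    urgency: str
    arrival_hour: float
    deadline_hour: Optional[float] = None
    drift_flag: bool = False
    superseded: bool = False


def sketch_score(msg: Msg, response: str, now_hour: float,
                 critical_pending: bool) -> float:
    """Score one response; modifiers are assumed to combine multiplicatively."""
    reward = URGENCY_BASE[msg.urgency]

    if msg.deadline_hour is not None:
        if now_hour > msg.deadline_hour:
            reward *= 0.25  # late penalty: -75%
        else:
            # Assumed linear early bonus: +50% at arrival, +0% at the deadline.
            window = msg.deadline_hour - msg.arrival_hour
            remaining = msg.deadline_hour - now_hour
            frac = remaining / window if window > 0 else 0.0
            reward *= 1.0 + 0.5 * min(max(frac, 0.0), 1.0)

    if msg.drift_flag:
        reward *= 1.5  # schema drift adaptation: +50%
    if msg.superseded:
        reward *= 0.5  # stale info penalty: -50%
    if len(response.strip()) < MIN_RESPONSE_CHARS:
        reward *= 0.5  # response quality: -50% for short/empty responses
    if msg.urgency == "low" and critical_pending:
        reward *= 0.3  # priority penalty: -70%

    return reward


if __name__ == "__main__":
    evac = Msg(urgency="critical", arrival_hour=0.0, deadline_hour=4.0)
    print(sketch_score(evac, "Evacuating now, heading to the shelter.", 2.0, False))
```

Under these assumptions, a critical message answered halfway through its deadline window earns the base 10 plus half of the maximum early bonus, while the same answer after the deadline keeps only a quarter of the base. Logging each multiplier separately, rather than only the product, would match the roadmap's goal of seeing which reward signals are working.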