kunalkachru23 committed on
Commit b168831 · verified · 1 Parent(s): cc6bb3c

Upload folder using huggingface_hub

README.md CHANGED
@@ -30,7 +30,7 @@ Use this section first during judging/review.
30
  - **Live environment (HF Space):** https://kunalkachru23-nexus-enhanced-stage.hf.space/
31
  - **3-minute pitch script:** [`docs/pitch/PITCH.md`](docs/pitch/PITCH.md)
32
  - **2-minute demo walkthrough:** [`docs/pitch/DEMO_WALKTHROUGH.md`](docs/pitch/DEMO_WALKTHROUGH.md)
33
- - **Hard-gate + rubric evidence index:** [`docs/project/JUDGING_EVIDENCE_INDEX.md`](docs/project/JUDGING_EVIDENCE_INDEX.md)
34
  - **Behavioral delta (before vs after):** [`docs/project/BEHAVIORAL_DELTA_PROOF.md`](docs/project/BEHAVIORAL_DELTA_PROOF.md)
35
  - **Compliance lock matrix:** [`docs/project/COMPLIANCE_LOCK_MATRIX.md`](docs/project/COMPLIANCE_LOCK_MATRIX.md)
36
  - **HF blog draft (publish-ready):** [`docs/blog/blog_post_hf.md`](docs/blog/blog_post_hf.md)
@@ -139,9 +139,9 @@ python scripts/export_reward_plot.py \
139
 
140
  Caption: blue line is per-episode reward, green is rolling average, red dashed line is baseline (`0.265`).
141
 
142
- ## BRD hard gate — OpenEnv (reproduce)
143
 
144
- Per [`../design/hackathon_brd.md`](../design/hackathon_brd.md) Section 17, judges expect **OpenEnv (latest release)** usage, not only a custom HTTP server.
145
 
146
  **Local (dev machine, after `pip install "openenv>=0.2.3"`):**
147
 
@@ -163,7 +163,7 @@ openenv validate --url https://kunalkachru23-nexus-enhanced-stage.hf.space
163
 
164
  **Deploying with OpenEnv:** use `openenv push . --repo-id <user>/<space> --exclude .hfignore` (or **`./gate.sh --push`**, which adds `--exclude` for you). OpenEnv does not load `.hfignore` unless you pass it via `--exclude`; omitting it does **not** break the build, it only uploads extra paths (less lean). See `docs/guides/QUICK_START.md` for a short rationale.
165
 
166
- `requirements.txt` **omits** `openenv` on the Space Docker image to keep builds reliable; the **Colab notebook** installs `openenv>=0.2.3` for the training hard gate. Contract-only routes (`/metadata`, `/schema`, `GET /state`, `POST /mcp`) satisfy `openenv validate --url`; episode logic uses **`/reset`**, **`/step/{session_id}`**, **`/state/{session_id}`** only.
167
 
168
  ## API Endpoints
169
 
@@ -192,15 +192,15 @@ openenv validate --url https://kunalkachru23-nexus-enhanced-stage.hf.space
192
  - **Snorkel AI** — Rotating expert review board (4 criteria)
193
  - **Patronus AI** — Live schema drift in INC007 at step 18
194
 
195
- ## Pitch, plan, and BRD evidence
196
 
197
  Documentation lives under [`docs/`](docs/) (guides, deployment, project status, pitch/demo scripts, blog drafts).
198
 
199
- - **[`docs/pitch/PITCH.md`](docs/pitch/PITCH.md)** — 3-minute spoken script + 2-minute Q&A bullets (BRD §18.1).
200
- - **[`docs/project/PLAN_OF_ACTION.md`](docs/project/PLAN_OF_ACTION.md)** — BRD compliance matrix + prioritized todo table.
201
- - **`scripts/export_reward_plot.py`** — export reward curve PNG from `--url` or `episode_rewards.json` (Criterion 3 slides). Canonical chart (tracked in git): **`docs/images/training_reward_curve.png`** (see section above).
202
 
203
- ## Final submission checklist (hard-gate safe)
204
 
205
  - [ ] Space URL is live and included in final form: `https://kunalkachru23-nexus-enhanced-stage.hf.space/`
206
  - [ ] `openenv validate .` passes locally.
@@ -216,7 +216,7 @@ Documentation lives under [`docs/`](docs/) (guides, deployment, project status,
216
 
217
  ## Blog Post
218
 
219
- See [`docs/blog/blog_post_hf.md`](docs/blog/blog_post_hf.md) for the publish-ready HuggingFace blog draft (includes reward model deep-dive, training methodology, and demo walkthrough). Publish and add the public URL for BRD §17.3.
220
 
221
  ## Team
222
 
 
30
  - **Live environment (HF Space):** https://kunalkachru23-nexus-enhanced-stage.hf.space/
31
  - **3-minute pitch script:** [`docs/pitch/PITCH.md`](docs/pitch/PITCH.md)
32
  - **2-minute demo walkthrough:** [`docs/pitch/DEMO_WALKTHROUGH.md`](docs/pitch/DEMO_WALKTHROUGH.md)
33
+ - **Compliance + judging evidence index:** [`docs/project/JUDGING_EVIDENCE_INDEX.md`](docs/project/JUDGING_EVIDENCE_INDEX.md)
34
  - **Behavioral delta (before vs after):** [`docs/project/BEHAVIORAL_DELTA_PROOF.md`](docs/project/BEHAVIORAL_DELTA_PROOF.md)
35
  - **Compliance lock matrix:** [`docs/project/COMPLIANCE_LOCK_MATRIX.md`](docs/project/COMPLIANCE_LOCK_MATRIX.md)
36
  - **HF blog draft (publish-ready):** [`docs/blog/blog_post_hf.md`](docs/blog/blog_post_hf.md)
 
139
 
140
  Caption: blue line is per-episode reward, green is rolling average, red dashed line is baseline (`0.265`).
141
 
142
+ ## OpenEnv (reproduce)
143
 
144
+ Per **hackathon compliance criteria**, the submission uses **OpenEnv (latest release)** in the toolchain—not only a custom HTTP server. Reproduce validation with the commands below.
145
 
146
  **Local (dev machine, after `pip install "openenv>=0.2.3"`):**
147
 
 
163
 
164
  **Deploying with OpenEnv:** use `openenv push . --repo-id <user>/<space> --exclude .hfignore` (or **`./gate.sh --push`**, which adds `--exclude` for you). OpenEnv does not load `.hfignore` unless you pass it via `--exclude`; omitting it does **not** break the build, it only uploads extra paths (less lean). See `docs/guides/QUICK_START.md` for a short rationale.
165
 
166
+ `requirements.txt` **omits** `openenv` on the Space Docker image to keep builds reliable; the **Colab notebook** installs `openenv>=0.2.3` to satisfy the **Colab + OpenEnv** portion of compliance. Contract-only routes (`/metadata`, `/schema`, `GET /state`, `POST /mcp`) satisfy `openenv validate --url`; episode logic uses **`/reset`**, **`/step/{session_id}`**, **`/state/{session_id}`** only.
167
 
168
  ## API Endpoints
169
 
 
192
  - **Snorkel AI** — Rotating expert review board (4 criteria)
193
  - **Patronus AI** — Live schema drift in INC007 at step 18
194
 
195
+ ## Pitch, plan, and compliance evidence
196
 
197
  Documentation lives under [`docs/`](docs/) (guides, deployment, project status, pitch/demo scripts, blog drafts).
198
 
199
+ - **[`docs/pitch/PITCH.md`](docs/pitch/PITCH.md)** — 3-minute spoken script + 2-minute Q&A bullets (organizer pitch format).
200
+ - **[`docs/project/PLAN_OF_ACTION.md`](docs/project/PLAN_OF_ACTION.md)** — hackathon compliance matrix + prioritized todo table.
201
+ - **`scripts/export_reward_plot.py`** — export reward curve PNG from `--url` or `episode_rewards.json` (slides / observable improvement evidence). Canonical chart (tracked in git): **`docs/images/training_reward_curve.png`** (see section above).
202
 
203
+ ## Final submission checklist (compliance-ready)
204
 
205
  - [ ] Space URL is live and included in final form: `https://kunalkachru23-nexus-enhanced-stage.hf.space/`
206
  - [ ] `openenv validate .` passes locally.
 
216
 
217
  ## Blog Post
218
 
219
+ See [`docs/blog/blog_post_hf.md`](docs/blog/blog_post_hf.md) for the publish-ready HuggingFace blog draft (includes reward model deep-dive, training methodology, and demo walkthrough). Publish and add the public URL to your submission package (blog or short video, per organizer requirements).
220
 
221
  ## Team
222
 
docs/blog/blog_post_hf.md CHANGED
@@ -9,8 +9,8 @@ authors:
9
 
10
  *Team Falcons — Meta PyTorch OpenEnv Hackathon Grand Finale, April 2026*
11
 
12
- **Live Demo:** [huggingface.co/spaces/kunalkachru23/nexus-enhanced](https://huggingface.co/spaces/kunalkachru23/nexus-enhanced)
13
- **Training Notebook:** [grpo_colab_v2.ipynb](https://huggingface.co/spaces/kunalkachru23/nexus-enhanced/blob/main/notebooks/grpo_colab_v2.ipynb)
14
 
15
  ---
16
 
@@ -98,7 +98,7 @@ Before on-site GPU training, 30 scripted baseline episodes established the floor
98
 
99
  Expected trained model performance after 200 GRPO steps: **0.55–0.75**
100
 
101
- The gap shows clear, observable training signal satisfying BRD Criterion 3 (observable reward improvement evidence).
102
 
103
  ---
104
 
@@ -141,10 +141,10 @@ The expert board notices IC weaknesses and shifts its evaluation focus — simul
141
 
142
  ```bash
143
  # Live demo — no install needed
144
- curl -X POST https://kunalkachru23-nexus-enhanced.hf.space/demo/run/INC003 | python -m json.tool
145
 
146
  # Or open the web dashboard
147
- https://huggingface.co/spaces/kunalkachru23/nexus-enhanced
148
  ```
149
 
150
  **Training notebook** (GRPO on Qwen2.5-1.5B, cells 1–3 run without GPU):
 
9
 
10
  *Team Falcons — Meta PyTorch OpenEnv Hackathon Grand Finale, April 2026*
11
 
12
+ **Live Demo:** [huggingface.co/spaces/kunalkachru23/nexus-enhanced-stage](https://huggingface.co/spaces/kunalkachru23/nexus-enhanced-stage)
13
+ **Training Notebook:** [grpo_colab_v2.ipynb](https://huggingface.co/spaces/kunalkachru23/nexus-enhanced-stage/blob/main/notebooks/grpo_colab_v2.ipynb)
14
 
15
  ---
16
 
 
98
 
99
  Expected trained model performance after 200 GRPO steps: **0.55–0.75**
100
 
101
+ The gap shows clear, observable training signal—supporting the hackathon rubric emphasis on **observable reward improvement**.
102
 
103
  ---
104
 
 
141
 
142
  ```bash
143
  # Live demo — no install needed
144
+ curl -X POST https://kunalkachru23-nexus-enhanced-stage.hf.space/demo/run/INC003 | python -m json.tool
145
 
146
  # Or open the web dashboard
147
+ https://huggingface.co/spaces/kunalkachru23/nexus-enhanced-stage
148
  ```
149
 
150
  **Training notebook** (GRPO on Qwen2.5-1.5B, cells 1–3 run without GPU):
docs/deployment/DEPLOYMENT_CHECKLIST.md CHANGED
@@ -85,7 +85,7 @@ python deploy_to_hf_spaces.py
85
  # 🎉 Deployment complete!
86
 
87
  # Monitor build progress:
88
- # https://huggingface.co/spaces/kunalkachru23/nexus-enhanced
89
  # Look for "Building" → "Running" status (5-10 min)
90
  ```
91
 
@@ -94,12 +94,12 @@ Once HF Space shows "Running" status:
94
 
95
  ```bash
96
  # Test public health endpoint
97
- curl https://kunalkachru23-nexus-enhanced.hf.space/health
98
 
99
  # Expected: {"status": "healthy", "environment": "nexus-enhanced", ...}
100
 
101
  # Run full remote test suite
102
- python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced.hf.space
103
 
104
  # Expected: ✅ ALL 7 TESTS PASS
105
  ```
@@ -107,7 +107,7 @@ python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced.hf
107
  ### Step 5: Verify Judge Dashboard
108
  Open in browser:
109
  ```
110
- https://kunalkachru23-nexus-enhanced.hf.space/
111
  ```
112
 
113
  **Expected to see**:
@@ -125,13 +125,13 @@ After deployment, run:
125
 
126
  ```bash
127
  # Full test suite against deployed environment
128
- python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced.hf.space
129
 
130
  # Individual endpoint tests:
131
- curl https://kunalkachru23-nexus-enhanced.hf.space/health | jq .
132
- curl https://kunalkachru23-nexus-enhanced.hf.space/metrics | jq .
133
- curl https://kunalkachru23-nexus-enhanced.hf.space/learning-curve | jq .
134
- curl -X POST https://kunalkachru23-nexus-enhanced.hf.space/reset \
135
  -H "Content-Type: application/json" \
136
  -d '{"incident_id": "INC003"}' | jq .
137
  ```
@@ -166,14 +166,14 @@ curl -X POST https://kunalkachru23-nexus-enhanced.hf.space/reset \
166
  ### After Phase 7 Passes:
167
 
168
  1. **Update Colab Notebook** (grpo_colab_v2.ipynb)
169
- - Verify `BASE_URL = "https://kunalkachru23-nexus-enhanced.hf.space"`
170
  - Run connectivity check cell
171
  - Expected: ✅ Connected message
172
 
173
  2. **Start Training**
174
  - Run all cells in Colab
175
  - Training produces reward curves
176
- - Monitor: https://kunalkachru23-nexus-enhanced.hf.space/learning-curve
177
 
178
  3. **Expected Training Results**
179
  - First 5-10 episodes: ~0.2 reward (baseline)
@@ -181,7 +181,7 @@ curl -X POST https://kunalkachru23-nexus-enhanced.hf.space/reset \
181
  - Episodes 30+: Convergence around 0.6-0.8
182
  - Total training time: ~6 hours for 50 episodes on Colab GPU
183
 
184
- 4. **Blog Post** (Hard Gate)
185
  - Topic: "How NEXUS Enhanced Trains Multi-Agent Incident Response via GRPO"
186
  - Sections:
187
  - Problem statement (CrowdStrike scale)
@@ -191,7 +191,7 @@ curl -X POST https://kunalkachru23-nexus-enhanced.hf.space/reset \
191
  - Length: ~800-1200 words (< 2 min read)
192
  - Publish to HF blog or Medium
193
 
194
- 5. **Pitch Video** (Hard Gate)
195
  - Duration: 3 minutes max
196
  - Content:
197
  - Show judge dashboard at `/`
@@ -222,7 +222,7 @@ curl -X POST https://kunalkachru23-nexus-enhanced.hf.space/reset \
222
 
223
  | Service | URL | Purpose |
224
  |---------|-----|---------|
225
- | Judge Dashboard | `https://kunalkachru23-nexus-enhanced.hf.space/` | Live metrics + curves |
226
  | API Health | `.../health` | Connectivity check |
227
  | Metrics | `.../metrics` | Training stats |
228
  | Learning Curve | `.../learning-curve` | Reward history |
 
85
  # 🎉 Deployment complete!
86
 
87
  # Monitor build progress:
88
+ # https://huggingface.co/spaces/kunalkachru23/nexus-enhanced-stage
89
  # Look for "Building" → "Running" status (5-10 min)
90
  ```
91
 
 
94
 
95
  ```bash
96
  # Test public health endpoint
97
+ curl https://kunalkachru23-nexus-enhanced-stage.hf.space/health
98
 
99
  # Expected: {"status": "healthy", "environment": "nexus-enhanced", ...}
100
 
101
  # Run full remote test suite
102
+ python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space
103
 
104
  # Expected: ✅ ALL 7 TESTS PASS
105
  ```
 
107
  ### Step 5: Verify Judge Dashboard
108
  Open in browser:
109
  ```
110
+ https://kunalkachru23-nexus-enhanced-stage.hf.space/
111
  ```
112
 
113
  **Expected to see**:
 
125
 
126
  ```bash
127
  # Full test suite against deployed environment
128
+ python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space
129
 
130
  # Individual endpoint tests:
131
+ curl https://kunalkachru23-nexus-enhanced-stage.hf.space/health | jq .
132
+ curl https://kunalkachru23-nexus-enhanced-stage.hf.space/metrics | jq .
133
+ curl https://kunalkachru23-nexus-enhanced-stage.hf.space/learning-curve | jq .
134
+ curl -X POST https://kunalkachru23-nexus-enhanced-stage.hf.space/reset \
135
  -H "Content-Type: application/json" \
136
  -d '{"incident_id": "INC003"}' | jq .
137
  ```
 
166
  ### After Phase 7 Passes:
167
 
168
  1. **Update Colab Notebook** (grpo_colab_v2.ipynb)
169
+ - Verify `BASE_URL = "https://kunalkachru23-nexus-enhanced-stage.hf.space"`
170
  - Run connectivity check cell
171
  - Expected: ✅ Connected message
172
 
173
  2. **Start Training**
174
  - Run all cells in Colab
175
  - Training produces reward curves
176
+ - Monitor: https://kunalkachru23-nexus-enhanced-stage.hf.space/learning-curve
177
 
178
  3. **Expected Training Results**
179
  - First 5-10 episodes: ~0.2 reward (baseline)
 
181
  - Episodes 30+: Convergence around 0.6-0.8
182
  - Total training time: ~6 hours for 50 episodes on Colab GPU
183
 
184
+ 4. **Blog Post** (submission requirement)
185
  - Topic: "How NEXUS Enhanced Trains Multi-Agent Incident Response via GRPO"
186
  - Sections:
187
  - Problem statement (CrowdStrike scale)
 
191
  - Length: ~800-1200 words (< 2 min read)
192
  - Publish to HF blog or Medium
193
 
194
+ 5. **Pitch Video** (submission requirement)
195
  - Duration: 3 minutes max
196
  - Content:
197
  - Show judge dashboard at `/`
 
222
 
223
  | Service | URL | Purpose |
224
  |---------|-----|---------|
225
+ | Judge Dashboard | `https://kunalkachru23-nexus-enhanced-stage.hf.space/` | Live metrics + curves |
226
  | API Health | `.../health` | Connectivity check |
227
  | Metrics | `.../metrics` | Training stats |
228
  | Learning Curve | `.../learning-curve` | Reward history |
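The hunks above replace the old Space slug (`nexus-enhanced`) with `nexus-enhanced-stage` one URL at a time, so a leftover reference is easy to miss. A minimal sketch of a scan for stale references follows. The helper name `find_stale_urls` and the regex are illustrative assumptions, not part of the repository.

```python
import re

# Old Space slug that is NOT followed by "-stage". Covers the hub form
# (kunalkachru23/nexus-enhanced) and the direct form
# (kunalkachru23-nexus-enhanced.hf.space).
STALE_SPACE = re.compile(r"kunalkachru23[/-]nexus-enhanced(?!-stage)")


def find_stale_urls(text: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that still mention the old Space."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if STALE_SPACE.search(line)
    ]


# Hypothetical sample: one stale line, one migrated line, one unrelated line.
sample = "\n".join(
    [
        "curl https://kunalkachru23-nexus-enhanced.hf.space/health",
        "curl https://kunalkachru23-nexus-enhanced-stage.hf.space/health",
        '# Expected: {"status": "healthy", "environment": "nexus-enhanced"}',
    ]
)
for number, line in find_stale_urls(sample):
    print(f"{number}: {line}")
```

Anchoring the pattern on the account name keeps the `"environment": "nexus-enhanced"` value in the expected health payload from being flagged, since that string is an environment name rather than a URL.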
docs/deployment/HF_SPACES_DEPLOYMENT.md CHANGED
@@ -63,18 +63,18 @@ Once HF Spaces shows "Running" status:
63
 
64
  ```bash
65
  # Test judge dashboard endpoint
66
- curl -s https://kunalkachru23-nexus-enhanced.hf.space/health | jq .
67
 
68
  # Test reset endpoint
69
- curl -s -X POST https://kunalkachru23-nexus-enhanced.hf.space/reset \
70
  -H "Content-Type: application/json" \
71
  -d '{"incident_id": "INC003"}' | jq .
72
 
73
  # Test metrics endpoint
74
- curl -s https://kunalkachru23-nexus-enhanced.hf.space/metrics | jq .
75
 
76
  # Test learning curve
77
- curl -s https://kunalkachru23-nexus-enhanced.hf.space/learning-curve | jq .
78
  ```
79
 
80
  ### Step 6: Update Colab Notebook
@@ -82,7 +82,7 @@ curl -s https://kunalkachru23-nexus-enhanced.hf.space/learning-curve | jq .
82
  In `notebooks/grpo_colab_v2.ipynb`, ensure BASE_URL points to deployed space:
83
 
84
  ```python
85
- BASE_URL = "https://kunalkachru23-nexus-enhanced.hf.space" # YOUR SPACE URL
86
  ```
87
 
88
  Then run connectivity check:
@@ -102,7 +102,7 @@ cat > test_hf_space_deployment.py << 'EOF'
102
  import requests
103
  import json
104
 
105
- BASE_URL = "https://kunalkachru23-nexus-enhanced.hf.space"
106
 
107
  def test_health():
108
  resp = requests.get(f"{BASE_URL}/health")
@@ -193,8 +193,8 @@ python test_hf_space_deployment.py
193
 
194
  While Colab trains:
195
 
196
- 1. **Watch reward curves**: https://kunalkachru23-nexus-enhanced.hf.space/learning-curve
197
- 2. **Check metrics**: `curl https://kunalkachru23-nexus-enhanced.hf.space/metrics`
198
  3. **Monitor Colab logs** for reward_fn errors
199
  4. **Expected pattern**: First 5-10 episodes ~0.2 reward, then gradual improvement to 0.6-0.8
200
 
@@ -212,5 +212,5 @@ git push origin main
212
  ## Next Steps
213
 
214
  - **Phase 7**: Run full regression tests against deployed HF Space
215
- - **Blog Post**: Write HF blog explaining NEXUS architecture (hard gate)
216
- - **Pitch**: Prepare 3-minute demo for judges (hard gate)
 
63
 
64
  ```bash
65
  # Test judge dashboard endpoint
66
+ curl -s https://kunalkachru23-nexus-enhanced-stage.hf.space/health | jq .
67
 
68
  # Test reset endpoint
69
+ curl -s -X POST https://kunalkachru23-nexus-enhanced-stage.hf.space/reset \
70
  -H "Content-Type: application/json" \
71
  -d '{"incident_id": "INC003"}' | jq .
72
 
73
  # Test metrics endpoint
74
+ curl -s https://kunalkachru23-nexus-enhanced-stage.hf.space/metrics | jq .
75
 
76
  # Test learning curve
77
+ curl -s https://kunalkachru23-nexus-enhanced-stage.hf.space/learning-curve | jq .
78
  ```
79
 
80
  ### Step 6: Update Colab Notebook
 
82
  In `notebooks/grpo_colab_v2.ipynb`, ensure BASE_URL points to deployed space:
83
 
84
  ```python
85
+ BASE_URL = "https://kunalkachru23-nexus-enhanced-stage.hf.space" # YOUR SPACE URL
86
  ```
87
 
88
  Then run connectivity check:
 
102
  import requests
103
  import json
104
 
105
+ BASE_URL = "https://kunalkachru23-nexus-enhanced-stage.hf.space"
106
 
107
  def test_health():
108
  resp = requests.get(f"{BASE_URL}/health")
 
193
 
194
  While Colab trains:
195
 
196
+ 1. **Watch reward curves**: https://kunalkachru23-nexus-enhanced-stage.hf.space/learning-curve
197
+ 2. **Check metrics**: `curl https://kunalkachru23-nexus-enhanced-stage.hf.space/metrics`
198
  3. **Monitor Colab logs** for reward_fn errors
199
  4. **Expected pattern**: First 5-10 episodes ~0.2 reward, then gradual improvement to 0.6-0.8
200
 
 
212
  ## Next Steps
213
 
214
  - **Phase 7**: Run full regression tests against deployed HF Space
215
+ - **Blog Post**: Write HF blog explaining NEXUS architecture (per submission requirements)
216
+ - **Pitch**: Prepare 3-minute demo for judges (per submission requirements)
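The curl checks in this file can also be expressed as a small standard-library script. The sketch below assumes only what the docs state: `/health`, `/metrics`, and `/learning-curve` answer GET requests with JSON, and `/health` reports `"status": "healthy"`. The names `fetch_json` and `smoke_check` are hypothetical and do not come from `test_hf_space_deployment.py`.

```python
import json
from typing import Any, Callable
from urllib.request import urlopen

BASE_URL = "https://kunalkachru23-nexus-enhanced-stage.hf.space"
GET_ENDPOINTS = ("/health", "/metrics", "/learning-curve")


def fetch_json(url: str, timeout: float = 10.0) -> Any:
    """GET `url` and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def smoke_check(
    base_url: str, fetch: Callable[[str], Any] = fetch_json
) -> dict[str, bool]:
    """Map each GET endpoint to True when it returns decodable JSON.

    /health must additionally report status == "healthy".
    """
    results: dict[str, bool] = {}
    for path in GET_ENDPOINTS:
        try:
            payload = fetch(base_url.rstrip("/") + path)
        except (OSError, ValueError):
            # Network failures are OSError subclasses; bad JSON is a ValueError.
            results[path] = False
            continue
        if path == "/health":
            results[path] = isinstance(payload, dict) and payload.get("status") == "healthy"
        else:
            results[path] = True
    return results
```

Calling `smoke_check(BASE_URL)` hits the live Space. Passing a stub as `fetch` allows the same logic to run offline, which is why the fetcher is injectable.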
docs/guides/QUICK_START.md CHANGED
@@ -48,7 +48,7 @@ openenv push . --repo-id kunalkachru23/nexus-enhanced-stage --exclude .hfignore
48
  4. Watch dashboard: https://kunalkachru23-nexus-enhanced-stage.hf.space/
49
  ```
50
 
51
- ### Export reward plot for slides (BRD Criterion 3)
52
  ```bash
53
  python scripts/export_reward_plot.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space
54
  # or from local episode_rewards.json:
@@ -188,7 +188,7 @@ python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced-st
188
  | Reward Progress | **20%** | Chart.js curves on dashboard ← **KEY** |
189
  | Pipeline | 10% | GRPO on Colab GPU → HF Space API |
190
 
191
- **🎯 Priority**: Ensure reward curves are visible and improving to maximize Criterion 3 score.
192
 
193
  ---
194
 
 
48
  4. Watch dashboard: https://kunalkachru23-nexus-enhanced-stage.hf.space/
49
  ```
50
 
51
+ ### Export reward plot for slides (observable improvement evidence)
52
  ```bash
53
  python scripts/export_reward_plot.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space
54
  # or from local episode_rewards.json:
 
188
  | Reward Progress | **20%** | Chart.js curves on dashboard ← **KEY** |
189
  | Pipeline | 10% | GRPO on Colab GPU → HF Space API |
190
 
191
+ **🎯 Priority**: Ensure reward curves are visible and improving to support the observable-improvement rubric row.
192
 
193
  ---
194
 
docs/pitch/DEMO_WALKTHROUGH.md CHANGED
@@ -69,7 +69,7 @@ Say:
69
  ## [1:50-2:00] Close
70
 
71
  Say:
72
- "We satisfy OpenEnv hard gates, show live reward improvement, and can directly inspect behavior delta in the running environment."
73
 
74
  ---
75
 
 
69
  ## [1:50-2:00] Close
70
 
71
  Say:
72
+ "We satisfy OpenEnv validation requirements, show live reward improvement, and can directly inspect behavior delta in the running environment."
73
 
74
  ---
75
 
docs/pitch/PITCH.md CHANGED
@@ -1,7 +1,7 @@
1
- # NEXUS Enhanced — 3-minute pitch + 2-minute Q&A (BRD §18.1)
2
 
3
  **Event:** Meta PyTorch OpenEnv Hackathon × Scaler — Grand Finale
4
- **Format (verbatim BRD):** 3 min pitch + 2 min Q&A = 5 min total.
5
 
6
  ---
7
 
@@ -29,7 +29,7 @@
29
 
30
  ---
31
 
32
- ## Observable evidence (~35 s) — Criterion 3
33
 
34
  - Show **dashboard** reward curve and rolling average vs **baseline** (pre-event benchmark).
35
  - Use one canonical metrics callout (snapshot `2026-04-24T16:48:26Z`, stage URL): **episodes 387**, **avg 0.4634**, **best 1.0032**, **+74.9% vs baseline 0.265**.
@@ -38,7 +38,7 @@
38
 
39
  ---
40
 
41
- ## Training & improvement (~30 s) — Criteria 3 & 4
42
 
43
  - **Colab** runs minimal GRPO training against the **real** remote environment API (not a mocked reward).
44
  - Improvement is **observable** on the curve and in **behaviour** (shorter paths, better notifications, fewer oversight violations) — tie any checkpoint story to **what the IC does differently**, not only the scalar.
@@ -47,7 +47,7 @@
47
 
48
  ## Close (~20 s)
49
 
50
- > “NEXUS is **OpenEnv-shaped**: isolated episodes, structured actions, measurable outcomes, and a problem that stays hard after the novelty wears off. We meet the **hard gates** OpenEnv latest in the toolchain + Colab TRL/Unsloth + HF blog/video slot and we optimised for the **40% environment** and **30% storytelling** bars with a demo you can **drive live** in under two minutes.”
51
 
52
  **Stop at 3:00. Breathe. Hand off for Q&A.**
53
 
@@ -65,7 +65,7 @@ Answer in **short paragraphs**; do not invent numbers not on the dashboard.
65
  | **Reward hacking?** | Sparse terminal reward + oversight + coalition + tool budgets; red herrings in harder incidents. |
66
  | **What is partial observability?** | IC observation is a slice; specialists see tool outputs for their role; IC never sees everything at once. |
67
  | **INC007 in 30 s?** | Nightmare incident: multi-region blast radius; **schema drift** forces contract change mid-episode — reserved for sharp Q&A, not the full live path if time is short. |
68
- | **Why GRPO / TRL / Unsloth?** | BRD hard gate #2: minimal training in Colab with **HF TRL** and **Unsloth** for efficient QLoRA on Qwen-class IC policy. |
69
  | **What if the Space is slow?** | Training is async from Colab; dashboard refreshes on timer; auto-demo is one POST chain. |
70
  | **Baseline 0.265?** | Pre-event scripted benchmark documented in server; curve compares **trained vs that baseline** for “observable improvement.” |
71
  | **Single strongest differentiator?** | Multi-agent + sparse reward + **schema drift** on INC007 in one OpenEnv-hosted stack judges can open in the browser. |
 
1
+ # NEXUS Enhanced — 3-minute pitch + 2-minute Q&A
2
 
3
  **Event:** Meta PyTorch OpenEnv Hackathon × Scaler — Grand Finale
4
+ **Format (per hackathon compliance):** 3 min pitch + 2 min Q&A = 5 min total.
5
 
6
  ---
7
 
 
29
 
30
  ---
31
 
32
+ ## Observable evidence (~35 s) — reward improvement (judging rubric)
33
 
34
  - Show **dashboard** reward curve and rolling average vs **baseline** (pre-event benchmark).
35
  - Use one canonical metrics callout (snapshot `2026-04-24T16:48:26Z`, stage URL): **episodes 387**, **avg 0.4634**, **best 1.0032**, **+74.9% vs baseline 0.265**.
 
38
 
39
  ---
40
 
41
+ ## Training & improvement (~30 s) — improvement + pipeline coherence
42
 
43
  - **Colab** runs minimal GRPO training against the **real** remote environment API (not a mocked reward).
44
  - Improvement is **observable** on the curve and in **behaviour** (shorter paths, better notifications, fewer oversight violations) — tie any checkpoint story to **what the IC does differently**, not only the scalar.
 
47
 
48
  ## Close (~20 s)
49
 
50
+ > “NEXUS is **OpenEnv-shaped**: isolated episodes, structured actions, measurable outcomes, and a problem that stays hard after the novelty wears off. We meet **hackathon compliance**: OpenEnv latest in the toolchain, Colab TRL/Unsloth training, and the required HF blog or short video slot—and we optimised for the **40% environment** and **30% storytelling** rubric weights with a demo you can **drive live** in under two minutes.”
51
 
52
  **Stop at 3:00. Breathe. Hand off for Q&A.**
53
 
 
65
  | **Reward hacking?** | Sparse terminal reward + oversight + coalition + tool budgets; red herrings in harder incidents. |
66
  | **What is partial observability?** | IC observation is a slice; specialists see tool outputs for their role; IC never sees everything at once. |
67
  | **INC007 in 30 s?** | Nightmare incident: multi-region blast radius; **schema drift** forces contract change mid-episode — reserved for sharp Q&A, not the full live path if time is short. |
68
+ | **Why GRPO / TRL / Unsloth?** | Per compliance: minimal training in Colab with **HF TRL** and **Unsloth** for efficient QLoRA on Qwen-class IC policy. |
69
  | **What if the Space is slow?** | Training is async from Colab; dashboard refreshes on timer; auto-demo is one POST chain. |
70
  | **Baseline 0.265?** | Pre-event scripted benchmark documented in server; curve compares **trained vs that baseline** for “observable improvement.” |
71
  | **Single strongest differentiator?** | Multi-agent + sparse reward + **schema drift** on INC007 in one OpenEnv-hosted stack judges can open in the browser. |
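The **+74.9% vs baseline 0.265** callout in the pitch is a relative uplift of the average reward over the scripted baseline. A short sketch of that arithmetic, using only the numbers quoted in the snapshot (the function name `percent_uplift` is an illustrative assumption):

```python
def percent_uplift(average: float, baseline: float) -> float:
    """Relative improvement of `average` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (average - baseline) / baseline * 100.0


# Values quoted in the pitch callout (snapshot 2026-04-24T16:48:26Z).
uplift = percent_uplift(average=0.4634, baseline=0.265)
print(f"{uplift:+.1f}%")  # prints +74.9%
```

Recomputing the figure from the dashboard values before presenting keeps the spoken number consistent with the frozen snapshot.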
docs/pitch/PITCH_3MIN.md CHANGED
@@ -86,7 +86,7 @@ The trained model learns to query the right service, notify customers proactivel
86
 
87
  For academia, it's a benchmarkable environment. For enterprise, it's a path toward AI-assisted incident management. For Meta and PyTorch, it demonstrates OpenEnv's potential for real-world complexity.
88
 
89
- **Live demo:** https://kunalkachru23-nexus-enhanced.hf.space/
90
 
91
  Thank you."
92
 
 
86
 
87
  For academia, it's a benchmarkable environment. For enterprise, it's a path toward AI-assisted incident management. For Meta and PyTorch, it demonstrates OpenEnv's potential for real-world complexity.
88
 
89
+ **Live demo:** https://kunalkachru23-nexus-enhanced-stage.hf.space/
90
 
91
  Thank you."
92
 
docs/pitch/VIDEO_RECORD_NOW_PACK.md ADDED
@@ -0,0 +1,144 @@
1
+ # NEXUS Enhanced — Record-Now Video Pack
2
+
3
+ Target length: 1:45 to 2:00
4
+ Presenter style: calm, clear, outcome-first
5
+ Audience: judges / hackathon reviewers
6
+
7
+ ---
8
+
9
+ ## 1) Pre-record setup (2-3 minutes)
10
+
11
+ Keep only these windows ready:
12
+ - Browser tab 1: `https://kunalkachru23-nexus-enhanced-stage.hf.space/web`
13
+ - Browser tab 2: `https://kunalkachru23-nexus-enhanced-stage.hf.space/metrics`
14
+ - Terminal tab with this command pre-pasted:
15
+
16
+ ```bash
17
+ curl -s -X POST https://kunalkachru23-nexus-enhanced-stage.hf.space/demo/run/INC003 | python -m json.tool
18
+ ```
19
+
20
+ Visual quality checklist:
21
+ - Screen recording at 1080p.
22
+ - Terminal font size: 16+.
23
+ - Browser zoom: 125%.
24
+ - Hide desktop notifications.
25
+ - Use dark mode consistently (optional, but cleaner).
26
+
27
+ Canonical narration numbers (freeze these):
28
+ - Episodes: `387`
29
+ - Average reward: `0.4634`
30
+ - Best reward: `1.0032`
31
+ - Baseline: `0.265`
32
+ - Improvement: `+74.9%`
33
+
34
+ Source: `docs/project/snapshots/submission_snapshot_20260424T164826Z.md`
35
+
36
+ ---
37
+
38
+ ## 2) One-take timeline (timestamped)
39
+
40
+ ### 0:00-0:12 — Hook (camera or title card)
41
+
42
+ Say:
43
+ "Most incident-response AI demos are single-agent and unrealistic. NEXUS Enhanced is a multi-agent OpenEnv environment where an Incident Commander coordinates specialists under partial observability, business constraints, and schema drift."
44
+
45
+ On screen:
46
+ - Title card or the `/web` dashboard landing view.
47
+
48
+ ### 0:12-0:35 — What you built
49
+
50
+ Say:
51
+ "We built seven incident scenarios, from easier outages to a nightmare schema-drift incident. The system is deployed on Hugging Face Spaces and validated with OpenEnv-compatible workflow checks. Training runs through TRL GRPO with Unsloth in Colab."
52
+
53
+ On screen:
54
+ - Briefly show `/metrics` page and then back to `/web`.
55
+
56
+ ### 0:35-1:00 — Measurable improvement
57
+
58
+ Say:
59
+ "In the current frozen snapshot we have 387 completed episodes, average reward 0.4634 versus baseline 0.265, and best reward 1.0032. That's a 74.9% uplift over baseline."
60
+
61
+ On screen:
62
+ - Show dashboard training metrics and curve.
63
+
64
+ ### 1:00-1:30 — Behavioral evidence
65
+
66
+ Say:
67
+ "The key claim is behavior change, not only scalar gain. In INC003, the policy commits earlier to the memory-leak hypothesis, sequences runbook steps cleanly, sends proactive customer notifications, and reaches postmortem with fewer redundant actions."
68
+
69
+ On screen:
70
+ - Run the terminal command:
71
+
72
+ ```bash
73
+ curl -s -X POST https://kunalkachru23-nexus-enhanced-stage.hf.space/demo/run/INC003 | python -m json.tool
74
+ ```
75
+
76
+ - Point to:
77
+ - `final_phase: "postmortem"`
78
+ - `reward_breakdown.total`
79
+ - `coalition_correct: true`
80
+ - `notifications_sent`
81
+
82
+ ### 1:30-1:52 — Safety and robustness
83
+
84
+ Say:
85
+ "To reduce reward hacking, diagnosis is evidence-gated, customer score requires actual notification actions, and coordination penalizes duplicate tool calls. Oversight checks also affect final scoring."
86
+
87
+ On screen:
88
+ - Keep terminal result visible (reward breakdown + oversight report).
89
+
90
+ ### 1:52-2:00 — Close
91
+
92
+ Say:
93
+ "NEXUS combines innovation, observable improvement, and reproducible deployment for real incident-management RL. Thanks for watching."
94
+
95
+ On screen:
96
+ - End card with:
97
+ - Stage URL
98
+ - Repo URL
99
+ - Evidence docs path
100
+
101
+ ---
102
+
103
+ ## 3) Backup branch (if UI or network is slow)
104
+
105
+ If dashboard lags, use terminal-only sequence:
106
+
107
+ ```bash
108
+ curl -s https://kunalkachru23-nexus-enhanced-stage.hf.space/health | python -m json.tool
109
+ curl -s https://kunalkachru23-nexus-enhanced-stage.hf.space/metrics | python -m json.tool
110
+ curl -s https://kunalkachru23-nexus-enhanced-stage.hf.space/learning-curve | python -m json.tool
111
+ curl -s -X POST https://kunalkachru23-nexus-enhanced-stage.hf.space/demo/run/INC003 | python -m json.tool
112
+ ```
113
+
114
+ Narration line for fallback:
115
+ "Even without the UI, these live endpoints show health, metrics, learning progression, and the full behavioral transcript from the INC003 auto-demo."
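
The same fallback sequence can be scripted in Python where `curl` is unavailable. This is a sketch under stated assumptions: the endpoint paths and HTTP methods are copied from the commands above, while `FALLBACK_CALLS`, `http_fetch`, and `run_fallback` are illustrative names that do not exist in the repository.

```python
from typing import Callable
from urllib.request import Request, urlopen

STAGE_URL = "https://kunalkachru23-nexus-enhanced-stage.hf.space"

# (method, path) pairs in the same order as the curl commands above.
FALLBACK_CALLS = [
    ("GET", "/health"),
    ("GET", "/metrics"),
    ("GET", "/learning-curve"),
    ("POST", "/demo/run/INC003"),
]


def http_fetch(method: str, url: str) -> str:
    """Perform one request and return the response body as text."""
    with urlopen(Request(url, method=method), timeout=30) as response:
        return response.read().decode("utf-8")


def run_fallback(fetch: Callable[[str, str], str] = http_fetch) -> dict[str, str]:
    """Call each fallback endpoint in order and map path -> raw body."""
    return {path: fetch(method, STAGE_URL + path) for method, path in FALLBACK_CALLS}
```

Calling `run_fallback()` with no arguments hits the live Space; passing a stub `fetch` allows a dry run of the sequence offline.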
116
+
117
+ ---
118
+
119
+ ## 4) Retake checklist (30 seconds)
120
+
121
+ Before final export, verify:
122
+ - Audio is clear and steady.
123
+ - Duration stays under 2:00.
124
+ - Numbers match frozen snapshot.
125
+ - Stage URL shown at least once.
126
+ - Demo command succeeds in-frame.
127
+ - No mention of internal-only terminology.
128
+
129
+ ---
130
+
131
+ ## 5) Upload metadata (copy/paste)
132
+
133
+ Published video URL:
134
+ `https://www.youtube.com/watch?v=a9YZF30tomw`
135
+
136
+ Title:
137
+ `NEXUS Enhanced Demo — Multi-Agent Incident Response RL (OpenEnv + GRPO)`
138
+
139
+ Description:
140
+ `Live demo of NEXUS Enhanced on HF Space. Shows OpenEnv-compatible environment checks, observable reward improvement, and transcript-level behavioral evidence on INC003. Stage URL: https://kunalkachru23-nexus-enhanced-stage.hf.space`
141
+
142
+ Tags:
143
+ `openenv, reinforcement learning, multi-agent, incident response, grpo, trl, unsloth, huggingface`
144
+
docs/project/BEHAVIORAL_DELTA_PROOF.md CHANGED
@@ -1,6 +1,6 @@
1
- # Criterion 4 Behavioral Delta Proof
2
 
3
- This sheet demonstrates BRD §18.2 Criterion 4 intent: measurable improvement in **how the agent acts**, not only reward numbers.
4
 
5
  ## Fixed evaluation set (canonical)
6
 
@@ -60,9 +60,9 @@ These support that gain is not only speed; diagnosis/coordination/customer dimen
60
  - `GET /learning-curve`
61
  - confirm aggregate improvement trend and rolling average.
62
 
63
- ## Why this satisfies Criterion 4
64
 
65
- Criterion 4 asks for coherent reward logic and meaningful pipeline-driven behavior change.
66
  NEXUS evidence shows:
67
 
68
  - Coherent multi-dimensional reward decomposition (MTTR, diagnosis, customer, coordination, oversight, depth).
 
1
+ # Behavioral delta proof (pipeline coherence)
2
 
3
+ This sheet demonstrates **hackathon judging intent** for reward-and-pipeline coherence: measurable improvement in **how the agent acts**, not only reward numbers.
4
 
5
  ## Fixed evaluation set (canonical)
6
 
 
60
  - `GET /learning-curve`
61
  - confirm aggregate improvement trend and rolling average.
62
 
63
+ ## Why this satisfies the pipeline-coherence rubric row
64
 
65
+ The judging rubric asks for coherent reward logic and meaningful pipeline-driven behavior change.
66
  NEXUS evidence shows:
67
 
68
  - Coherent multi-dimensional reward decomposition (MTTR, diagnosis, customer, coordination, oversight, depth).
docs/project/COMPLIANCE_LOCK_MATRIX.md CHANGED
@@ -1,8 +1,8 @@
1
- # Compliance Lock Matrix (BRD-Aligned)
2
 
3
- Purpose: freeze hard-gate and scoring-criterion traceability so implementation changes remain compliant.
4
 
5
- ## Hard gates (pass/fail)
6
 
7
  | Gate | Requirement | Project evidence | Verification command |
8
  |---|---|---|---|
@@ -10,11 +10,11 @@ Purpose: freeze hard-gate and scoring-criterion traceability so implementation c
10
  | Colab training script | Minimal Colab script with TRL/Unsloth path | `notebooks/grpo_colab_v2.ipynb` | Notebook config + run cells |
11
  | Public artifact | HF blog or <2 min video | `docs/blog/*`, `docs/pitch/YOUTUBE_RECORDING_SCRIPT.md` | Submission URL checklist |
12
 
13
- ## Weighted scoring criteria map (BRD §18 — evaluation criteria)
14
 
15
  These four rows are what Cerebral Valley–aggregated scoring uses. The **Demonstration** column is how a judge should *see* each criterion in the live Space or repo in under five minutes.
16
 
17
- | Criterion | BRD weight | What judges need (verbatim intent) | NEXUS evidence | How the design demonstrates it (demo / artefact) |
18
  |---|---:|---|---|---|
19
  | **1 — Environment Innovation** | 40% | Novel, creative, or challenging; meaningfully tests agent behaviour | `server/environment.py`, `server/incidents.py`, `server/agents.py`, `server/tools.py`, `docs/project/SUBTHEME_EVIDENCE_MATRIX.md` | Open the dashboard **Training** tab → **manual validation** on **INC008** (Theme 3.2) and INC004–INC007; show **coalition**, **role-scoped tooling**, **INC007 schema drift** in Q&A or deep demo. Eight incidents = difficulty ladder + operational nuance + personalized track. |
20
  | **2 — Storytelling** | 30% | Clear explanation of problem, environment, and agent behaviour; demo **engaging and easy to follow** | `docs/pitch/PITCH.md`, `docs/pitch/DEMO_WALKTHROUGH.md`, `docs/project/FINAL_OPERATIONS_RUNBOOK.md` | Follow `DEMO_WALKTHROUGH.md`: metrics -> **Validation tab auto-demo** (INC003) -> optional **Guided** steps to completion. One sentence hook: “IC coordinates specialists under partial observability and contract drift.” |
@@ -23,22 +23,22 @@ These four rows are what Cerebral Valley–aggregated scoring uses. The **Demons
23
 
24
  ---
25
 
26
- ## BRD content themes (§9–14 + §14 Theme 5) — how NEXUS maps
27
 
28
- Official hackathon **themes** (Google Doc / BRD) are distinct from the **§18 scoring criteria** above. NEXUS is architected as an **enterprise incident-command** environment; the table below states where each parent theme is **primary** (core loop), **secondary** (explicit mechanic but not the headline), or **bridge** (honest pitch link without claiming a different product genre).
29
 
30
- | BRD theme | § | Primary requirement (summary) | NEXUS demonstration | Verification |
31
  |---|---|---|---|---|
32
- | **Theme 1 — Multi-agent** | §9 | Cooperation / competition / negotiation / **coalition**; **partial observability**; theory-of-mind style incentives | Five specialist roles + IC; coalition votes on harder incidents; IC observation is a slice, not full state | `server/environment.py`, `server/agents.py`, INC003+ in `server/incidents.py`, manual validation + `/state/{session_id}` |
33
- | **Theme 2 — Long horizon & instruction following** | §10 | **Sparse / delayed** reward; task **beyond one context**; decomposition & recovery; Scale sub-theme: **non-code** business workflows (incl. **HR & IT**) | Episode-level reward on `done`; long `max_steps` incidents; **server-side** session state and tool history so the task cannot fit in one static transcript; **IT ops** coordination (not a coding benchmark) | `server/reward.py`, `server/app.py` session store, INC006–INC007 length/complexity; see Scale AI row in `SUBTHEME_EVIDENCE_MATRIX.md` |
34
- | **Theme 3.1 — World modeling (professional)** | §11 | Realistic tools/workflows; **anti-shortcut**; causal / persistent world | Datadog / runbook / portal-style tools; evidence-gated diagnosis; runbook steps; red herrings on harder tracks | `server/tools.py`, `server/reward.py`, `docs/project/REWARD_HACKING_DEFENSE.md` |
35
- | **Theme 3.2 — World modeling (personalized)** | §12 | Personal delegation / conflicts / messaging-style tasks | **Dedicated incident INC008** (EA calendar: board prep vs school concert, family thread, auto-accept root cause) using `IncidentType.PERSONAL_ASSISTANT`, same multi-agent tool loop as ops incidents. **Plus** enterprise paths: customer **notifications** and SLA framing on INC001–INC007. | Manual validation **INC008** on dashboard; `server/incidents.py` (`INC008`), `server/data_models.py` |
36
- | **Theme 4 — Self-improvement** | §13 | Curriculum / adaptive difficulty; recursive capability growth | **Process-wide adaptive tier:** `server/global_curriculum.py` + `GET /curriculum` — last-5 rolling avg ≥ 0.55 promotes difficulty across **HTTP/Colab sessions** (not lost per `NexusEnvironment()`). **Plus** seven-incident ladder + **Snorkel-style** rotating `expert_criteria`. Full recursive self-play is out of scope; GRPO improves policy externally. | `server/difficulty.py`, `server/global_curriculum.py`, `server/app.py` `/curriculum`, Colab GRPO |
37
- | **Theme 5 — Wild card** | §14 | Creative value for LLM training on a defined task | **Primary positioning option:** “Out-of-box” fusion of **multi-agent ops + schema drift + oversight + token-scaled depth bonus** in one OpenEnv-deployable stack | Pitch close in `docs/pitch/PITCH.md`; innovation narrative in criterion 1 |
38
 
39
- ### Sub-theme bonuses (§15) — already locked in `SUBTHEME_EVIDENCE_MATRIX.md`
40
 
41
- Fleet, Halluminate, Snorkel, Patronus, Mercor, Scaler AI Labs rows remain the detailed sponsor map. This matrix ties **parent themes §9–14** to the same implementation so evaluators see both **theme** and **sponsor** coverage.
42
 
43
  ### One-line pitch bank (theme → sentence)
44
 
@@ -47,7 +47,7 @@ Use in Q&A if asked “which themes?”
47
  1. **Theme 1:** “IC coordinates five roles under partial observability and coalition mechanics.”
48
  2. **Theme 2:** “Sparse end-of-episode reward and persistent server state force long-horizon planning beyond a single context.”
49
  3. **Theme 3.1:** “Tool-bound enterprise workflows with anti-shortcut evidence and runbook discipline.”
50
- 4. **Theme 3.2:** “**INC008** is a BRD-style personal/EA conflict (calendar + family messaging) on the same engine; other incidents stress customer delegation under SLA.”
51
  5. **Theme 4:** “**`/curriculum`** shows live adaptive difficulty from rolling rewards; expert criteria rotate; GRPO improves the policy.”
52
  6. **Theme 5:** “Wild-card angle: one deployable environment that fuses the hardest ops themes for LLM incident command training.”
53
 
 
1
+ # Compliance Lock Matrix (hackathon-aligned)
2
 
3
+ Purpose: freeze **mandatory requirements** and **judging rubric** traceability so implementation changes stay aligned with hackathon compliance.
4
 
5
+ ## Mandatory requirements (pass/fail)
6
 
7
  | Gate | Requirement | Project evidence | Verification command |
8
  |---|---|---|---|
 
10
  | Colab training script | Minimal Colab script with TRL/Unsloth path | `notebooks/grpo_colab_v2.ipynb` | Notebook config + run cells |
11
  | Public artifact | HF blog or <2 min video | `docs/blog/*`, `docs/pitch/YOUTUBE_RECORDING_SCRIPT.md` | Submission URL checklist |
12
 
13
+ ## Weighted scoring criteria map (judging rubric)
14
 
15
  These four rows are what Cerebral Valley–aggregated scoring uses. The **Demonstration** column is how a judge should *see* each criterion in the live Space or repo in under five minutes.
16
 
17
+ | Criterion | Weight | What judges need (intent) | NEXUS evidence | How the design demonstrates it (demo / artefact) |
18
  |---|---:|---|---|---|
19
  | **1 — Environment Innovation** | 40% | Novel, creative, or challenging; meaningfully tests agent behaviour | `server/environment.py`, `server/incidents.py`, `server/agents.py`, `server/tools.py`, `docs/project/SUBTHEME_EVIDENCE_MATRIX.md` | Open the dashboard **Training** tab → **manual validation** on **INC008** (Theme 3.2) and INC004–INC007; show **coalition**, **role-scoped tooling**, **INC007 schema drift** in Q&A or deep demo. Eight incidents = difficulty ladder + operational nuance + personalized track. |
20
  | **2 — Storytelling** | 30% | Clear explanation of problem, environment, and agent behaviour; demo **engaging and easy to follow** | `docs/pitch/PITCH.md`, `docs/pitch/DEMO_WALKTHROUGH.md`, `docs/project/FINAL_OPERATIONS_RUNBOOK.md` | Follow `DEMO_WALKTHROUGH.md`: metrics -> **Validation tab auto-demo** (INC003) -> optional **Guided** steps to completion. One sentence hook: “IC coordinates specialists under partial observability and contract drift.” |
 
23
 
24
  ---
25
 
26
+ ## Hackathon content themes — how NEXUS maps
27
 
28
+ Official hackathon **themes** (organizer brief) are distinct from the **four scoring criteria** above. NEXUS is architected as an **enterprise incident-command** environment; the table below states where each parent theme is **primary** (core loop), **secondary** (explicit mechanic but not the headline), or **bridge** (honest pitch link without claiming a different product genre).
29
 
30
+ | Parent theme | Track | Primary requirement (summary) | NEXUS demonstration | Verification |
31
  |---|---|---|---|---|
32
+ | **Theme 1 — Multi-agent** | Multi-agent track | Cooperation / competition / negotiation / **coalition**; **partial observability**; theory-of-mind style incentives | Five specialist roles + IC; coalition votes on harder incidents; IC observation is a slice, not full state | `server/environment.py`, `server/agents.py`, INC003+ in `server/incidents.py`, manual validation + `/state/{session_id}` |
33
+ | **Theme 2 — Long horizon & instruction following** | Long-horizon track | **Sparse / delayed** reward; task **beyond one context**; decomposition & recovery; Scale sub-theme: **non-code** business workflows (incl. **HR & IT**) | Episode-level reward on `done`; long `max_steps` incidents; **server-side** session state and tool history so the task cannot fit in one static transcript; **IT ops** coordination (not a coding benchmark) | `server/reward.py`, `server/app.py` session store, INC006–INC007 length/complexity; see Scale AI row in `SUBTHEME_EVIDENCE_MATRIX.md` |
34
+ | **Theme 3.1 — World modeling (professional)** | World modeling | Realistic tools/workflows; **anti-shortcut**; causal / persistent world | Datadog / runbook / portal-style tools; evidence-gated diagnosis; runbook steps; red herrings on harder tracks | `server/tools.py`, `server/reward.py`, `docs/project/REWARD_HACKING_DEFENSE.md` |
35
+ | **Theme 3.2 — World modeling (personalized)** | Personalized track | Personal delegation / conflicts / messaging-style tasks | **Dedicated incident INC008** (EA calendar: board prep vs school concert, family thread, auto-accept root cause) using `IncidentType.PERSONAL_ASSISTANT`, same multi-agent tool loop as ops incidents. **Plus** enterprise paths: customer **notifications** and SLA framing on INC001–INC007. | Manual validation **INC008** on dashboard; `server/incidents.py` (`INC008`), `server/data_models.py` |
36
+ | **Theme 4 — Self-improvement** | Self-improvement track | Curriculum / adaptive difficulty; recursive capability growth | **Process-wide adaptive tier:** `server/global_curriculum.py` + `GET /curriculum` — last-5 rolling avg ≥ 0.55 promotes difficulty across **HTTP/Colab sessions** (not lost per `NexusEnvironment()`). **Plus** seven-incident ladder + **Snorkel-style** rotating `expert_criteria`. Full recursive self-play is out of scope; GRPO improves policy externally. | `server/difficulty.py`, `server/global_curriculum.py`, `server/app.py` `/curriculum`, Colab GRPO |
37
+ | **Theme 5 — Wild card** | Wild card | Creative value for LLM training on a defined task | **Primary positioning option:** “Out-of-box” fusion of **multi-agent ops + schema drift + oversight + token-scaled depth bonus** in one OpenEnv-deployable stack | Pitch close in `docs/pitch/PITCH.md`; innovation narrative in rubric row 1 |
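
The Theme 4 row describes a promotion rule: a rolling average over the last five rewards that reaches 0.55 raises the difficulty tier. A minimal sketch of such a rule follows. It is not the code in `server/global_curriculum.py`; the class name, the tier cap, and the choice to clear the window after a promotion are all assumptions made for illustration.

```python
from collections import deque


class RollingPromotion:
    """Promote one tier when the mean of the last `window` rewards reaches `threshold`."""

    def __init__(self, window: int = 5, threshold: float = 0.55, max_tier: int = 3):
        self.rewards = deque(maxlen=window)
        self.threshold = threshold
        self.max_tier = max_tier  # hypothetical cap; the source does not state a tier count
        self.tier = 0

    def record(self, reward: float) -> int:
        """Add one episode reward and return the (possibly updated) tier."""
        self.rewards.append(reward)
        window_full = len(self.rewards) == self.rewards.maxlen
        if window_full and sum(self.rewards) / len(self.rewards) >= self.threshold:
            if self.tier < self.max_tier:
                self.tier += 1
                self.rewards.clear()  # assumption: restart the window after promoting
        return self.tier
```

Keeping one such object at process scope, rather than one per environment instance, is what would let the tier persist across sessions as the row describes.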
38
 
39
+ ### Sub-theme bonuses — already locked in `SUBTHEME_EVIDENCE_MATRIX.md`
40
 
41
+ Fleet, Halluminate, Snorkel, Patronus, Mercor, Scaler AI Labs rows remain the detailed sponsor map. This matrix ties **parent themes** to the same implementation so evaluators see both **theme** and **sponsor** coverage.
42
 
43
  ### One-line pitch bank (theme → sentence)
44
 
 
47
  1. **Theme 1:** “IC coordinates five roles under partial observability and coalition mechanics.”
48
  2. **Theme 2:** “Sparse end-of-episode reward and persistent server state force long-horizon planning beyond a single context.”
49
  3. **Theme 3.1:** “Tool-bound enterprise workflows with anti-shortcut evidence and runbook discipline.”
50
+ 4. **Theme 3.2:** “**INC008** is a personalized EA-style conflict (calendar + family messaging) on the same engine; other incidents stress customer delegation under SLA.”
51
  5. **Theme 4:** “**`/curriculum`** shows live adaptive difficulty from rolling rewards; expert criteria rotate; GRPO improves the policy.”
52
  6. **Theme 5:** “Wild-card angle: one deployable environment that fuses the hardest ops themes for LLM incident command training.”
53
 
docs/project/CURRICULUM_AND_ABLATION.md CHANGED
@@ -17,7 +17,7 @@ NEXUS follows a staged difficulty pattern:
17
 
18
  The environment and reward system are explicitly structured to support this progression (`server/environment.py`, `server/difficulty.py`, `server/incidents.py`).
19
 
20
- ## Why this curriculum is valid for BRD + Self-Serve guidance
21
 
22
  - Keeps success probability > 0 early (prevents RL stall).
23
  - Increases branching complexity only after stable basic policy behavior.
 
17
 
18
  The environment and reward system are explicitly structured to support this progression (`server/environment.py`, `server/difficulty.py`, `server/incidents.py`).
19
 
20
+ ## Why this curriculum is valid for hackathon compliance + Self-Serve guidance
21
 
22
  - Keeps success probability > 0 early (prevents RL stall).
23
  - Increases branching complexity only after stable basic policy behavior.
docs/project/FINAL_OPERATIONS_RUNBOOK.md CHANGED
@@ -65,7 +65,7 @@ Purpose: deterministic final-day operations with explicit fallback paths and no
65
  - `/health`
66
  - `/metadata`
67
  - `/schema`
68
- - Pivot to hard-gate proof + archived transcript evidence.
69
  - Do not attempt risky hotfixes during final window.
70
 
71
  ## Roles and ownership
 
65
  - `/health`
66
  - `/metadata`
67
  - `/schema`
68
+ - Pivot to compliance proof + archived transcript evidence.
69
  - Do not attempt risky hotfixes during final window.
70
 
71
  ## Roles and ownership
docs/project/FINAL_READINESS_REPORT.md CHANGED
@@ -45,12 +45,12 @@ Completed deliverables:
45
  - Network traces show successful API polling and interactions (`200` status on judge-critical endpoints).
46
  - No blocking console errors observed.
47
 
48
- ## BRD compliance status
49
 
50
  - OpenEnv workflow compliance: **pass**
51
  - Colab training script path (TRL/Unsloth): **present and documented**
52
  - Public artifact path (blog/video): **script and references prepared**
53
- - Criterion evidence mapping and traceability: **locked in compliance and evidence docs**
54
 
55
  ## Residual risks
56
 
 
45
  - Network traces show successful API polling and interactions (`200` status on judge-critical endpoints).
46
  - No blocking console errors observed.
47
 
48
+ ## Hackathon compliance status
49
 
50
  - OpenEnv workflow compliance: **pass**
51
  - Colab training script path (TRL/Unsloth): **present and documented**
52
  - Public artifact path (blog/video): **script and references prepared**
53
+ - Rubric evidence mapping and traceability: **locked in compliance and evidence docs**
54
 
55
  ## Residual risks
56
 
docs/project/IMPLEMENTATION_SUMMARY.md CHANGED
@@ -76,7 +76,7 @@ NEXUS Enhanced is a multi-agent incident response RL environment for the Meta Py
76
  **7 Cells**:
77
  1. Install: unsloth, trl, transformers, matplotlib
78
  2. Connectivity check: Verify HF Space reachable
79
- 3. `NexusRemoteEnv`: Reset/step interface to PUBLIC `https://kunalkachru23-nexus-enhanced.hf.space`
80
  4. `reward_fn`: Parse IC action → call remote env → collect reward
81
  5. Load Qwen2.5-1.5B: Unsloth QLoRA (rank=16, 4-bit, targets q_proj/k_proj/v_proj/o_proj)
82
  6. GRPOTrainer: learning_rate=5e-5, batch_size=2, num_generations=4
@@ -104,10 +104,10 @@ NEXUS Enhanced is a multi-agent incident response RL environment for the Meta Py
104
  ### Phase 6: HF Spaces Deployment (Ready) 🚀
105
 
106
  **Steps**:
107
- 1. Push code to https://huggingface.co/spaces/kunalkachru23/nexus-enhanced
108
  2. HF Spaces auto-builds Docker image
109
  3. Services available at:
110
- - Judge dashboard: `https://kunalkachru23-nexus-enhanced.hf.space/` (port 7860)
111
  - Metrics: `/metrics`, `/learning-curve`, `/health`
112
  - API: `/reset`, `/step/{session_id}`
113
 
@@ -128,7 +128,7 @@ NEXUS Enhanced is a multi-agent incident response RL environment for the Meta Py
128
  6. ✅ HTML dashboard (`GET /`)
129
  7. ✅ Full episode execution (20 steps)
130
 
131
- **Run**: `python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced.hf.space`
132
 
133
  ---
134
 
@@ -239,7 +239,7 @@ git commit -m "Phase 5-7: Docker multi-service setup + deployment tests"
239
  # Push to HF Spaces repo
240
  git push origin main
241
 
242
- # Monitor build: https://huggingface.co/spaces/kunalkachru23/nexus-enhanced
243
  # Takes ~5-10 minutes for Docker build
244
  ```
245
 
@@ -249,7 +249,7 @@ Once HF Spaces shows "Running":
249
 
250
  ```bash
251
  # Test all endpoints
252
- python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced.hf.space
253
 
254
  # Expected: ✅ ALL TESTS PASS
255
  ```
@@ -260,9 +260,9 @@ Once Phase 7 tests pass:
260
 
261
  ```
262
  1. Open notebooks/grpo_colab_v2.ipynb
263
- 2. Verify BASE_URL = "https://kunalkachru23-nexus-enhanced.hf.space"
264
  3. Run all cells (Unsloth + TRL GRPO training)
265
- 4. Monitor reward curves at: https://kunalkachru23-nexus-enhanced.hf.space/learning-curve
266
  5. Expected trajectory: baseline 0.28 → improve to 0.6-0.8 over 50-100 episodes
267
  ```
268
 
@@ -277,7 +277,7 @@ Once Phase 7 tests pass:
277
  | **Reward Progress** | 20% | Observable Chart.js curves + MTTR improvements | ✅ Dashboard ready |
278
  | **Pipeline** | 10% | GRPO on Colab GPU → HF Space API | ✅ Tests ready |
279
 
280
- **Hard Gates**:
281
  - ✅ OpenEnv v0.2.3 compatible
282
  - ✅ HuggingFace TRL GRPO training
283
  - ✅ Trained checkpoint (TODO: save during training)
@@ -300,19 +300,19 @@ bash test_local_deployment.sh
300
  ### HF Space Testing
301
  ```bash
302
  # Against deployed environment
303
- python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced.hf.space
304
  ```
305
 
306
  ### Manual Verification
307
  ```bash
308
  # FastAPI health
309
- curl https://kunalkachru23-nexus-enhanced.hf.space/health
310
 
311
  # Judge dashboard
312
- open https://kunalkachru23-nexus-enhanced.hf.space/
313
 
314
  # Metrics snapshot
315
- curl https://kunalkachru23-nexus-enhanced.hf.space/metrics
316
  ```
317
 
318
  ---
@@ -350,7 +350,7 @@ curl https://kunalkachru23-nexus-enhanced.hf.space/metrics
350
 
351
  ## Questions for User
352
 
353
- 1. **HF Space URL**: Is `kunalkachru23-nexus-enhanced` the correct space slug?
354
  2. **Training time**: Target training duration on Colab GPU (default ~6 hours for 50 episodes)?
355
  3. **Checkpoint save**: Should checkpoint be saved to HF model hub or kept local?
356
  4. **Blog post**: Topic preference (technical deep-dive vs. storytelling narrative)?
 
76
  **7 Cells**:
77
  1. Install: unsloth, trl, transformers, matplotlib
78
  2. Connectivity check: Verify HF Space reachable
79
+ 3. `NexusRemoteEnv`: Reset/step interface to PUBLIC `https://kunalkachru23-nexus-enhanced-stage.hf.space`
80
  4. `reward_fn`: Parse IC action → call remote env → collect reward
81
  5. Load Qwen2.5-1.5B: Unsloth QLoRA (rank=16, 4-bit, targets q_proj/k_proj/v_proj/o_proj)
82
  6. GRPOTrainer: learning_rate=5e-5, batch_size=2, num_generations=4
 
104
  ### Phase 6: HF Spaces Deployment (Ready) 🚀
105
 
106
  **Steps**:
107
+ 1. Push code to https://huggingface.co/spaces/kunalkachru23/nexus-enhanced-stage
108
  2. HF Spaces auto-builds Docker image
109
  3. Services available at:
110
+ - Judge dashboard: `https://kunalkachru23-nexus-enhanced-stage.hf.space/` (port 7860)
111
  - Metrics: `/metrics`, `/learning-curve`, `/health`
112
  - API: `/reset`, `/step/{session_id}`
113
 
 
128
  6. ✅ HTML dashboard (`GET /`)
129
  7. ✅ Full episode execution (20 steps)
130
 
131
+ **Run**: `python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space`
132
 
133
  ---
134
 
 
239
  # Push to HF Spaces repo
240
  git push origin main
241
 
242
+ # Monitor build: https://huggingface.co/spaces/kunalkachru23/nexus-enhanced-stage
243
  # Takes ~5-10 minutes for Docker build
244
  ```
245
 
 
249
 
250
  ```bash
251
  # Test all endpoints
252
+ python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space
253
 
254
  # Expected: ✅ ALL TESTS PASS
255
  ```
 
260
 
261
  ```
262
  1. Open notebooks/grpo_colab_v2.ipynb
263
+ 2. Verify BASE_URL = "https://kunalkachru23-nexus-enhanced-stage.hf.space"
264
  3. Run all cells (Unsloth + TRL GRPO training)
265
+ 4. Monitor reward curves at: https://kunalkachru23-nexus-enhanced-stage.hf.space/learning-curve
266
  5. Expected trajectory: baseline 0.28 → improve to 0.6-0.8 over 50-100 episodes
267
  ```
268
 
 
277
  | **Reward Progress** | 20% | Observable Chart.js curves + MTTR improvements | ✅ Dashboard ready |
278
  | **Pipeline** | 10% | GRPO on Colab GPU → HF Space API | ✅ Tests ready |
279
 
280
+ **Mandatory submission requirements**:
281
  - ✅ OpenEnv v0.2.3 compatible
282
  - ✅ HuggingFace TRL GRPO training
283
  - ✅ Trained checkpoint (TODO: save during training)
 
300
  ### HF Space Testing
301
  ```bash
302
  # Against deployed environment
303
+ python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space
304
  ```
305
 
306
  ### Manual Verification
307
  ```bash
308
  # FastAPI health
309
+ curl https://kunalkachru23-nexus-enhanced-stage.hf.space/health
310
 
311
  # Judge dashboard
312
+ open https://kunalkachru23-nexus-enhanced-stage.hf.space/
313
 
314
  # Metrics snapshot
315
+ curl https://kunalkachru23-nexus-enhanced-stage.hf.space/metrics
316
  ```
317
 
318
  ---
 
350
 
351
  ## Questions for User
352
 
353
+ 1. **HF Space URL**: Canonical judge/demo Space is `kunalkachru23/nexus-enhanced-stage` (`kunalkachru23-nexus-enhanced-stage.hf.space`).
354
  2. **Training time**: Target training duration on Colab GPU (default ~6 hours for 50 episodes)?
355
  3. **Checkpoint save**: Should checkpoint be saved to HF model hub or kept local?
356
  4. **Blog post**: Topic preference (technical deep-dive vs. storytelling narrative)?
docs/project/JUDGING_EVIDENCE_INDEX.md CHANGED
@@ -3,7 +3,7 @@
3
  Snapshot timestamp (UTC): `2026-04-24T16:48:26Z`
4
  Stage URL: `https://kunalkachru23-nexus-enhanced-stage.hf.space`
5
 
6
- ## Hard-gate evidence (BRD Section 17)
7
 
8
  1. OpenEnv latest-release workflow in use
9
  - Local package validate: `openenv validate .`
@@ -20,9 +20,9 @@ Stage URL: `https://kunalkachru23-nexus-enhanced-stage.hf.space`
20
  - Owner action: publish final link and add URL to submission package.
21
 
22
  4. Compliance lock matrix
23
- - BRD criterion and hard-gate mapping: `docs/project/COMPLIANCE_LOCK_MATRIX.md`
24
 
25
- ## Live metrics snapshot (Criterion 3 evidence)
26
 
27
  Source endpoints:
28
  - `GET /metrics`
@@ -78,7 +78,7 @@ Canonical demo-day snapshot set (stage URL only):
78
  - 2-minute live walkthrough: `docs/pitch/DEMO_WALKTHROUGH.md`
79
  - <2 minute recording script: `docs/pitch/YOUTUBE_RECORDING_SCRIPT.md`
80
  - Manual demo test cases: `docs/pitch/DEMO_MANUAL_TEST_CASES.md`
81
- - Criterion-4 behavior proof sheet: `docs/project/BEHAVIORAL_DELTA_PROOF.md`
82
  - Sub-theme matrix: `docs/project/SUBTHEME_EVIDENCE_MATRIX.md`
83
  - Reward-hacking defense: `docs/project/REWARD_HACKING_DEFENSE.md`
84
  - Training audit ledger: `docs/project/TRAINING_AUDIT_LOG.md`
 
3
  Snapshot timestamp (UTC): `2026-04-24T16:48:26Z`
4
  Stage URL: `https://kunalkachru23-nexus-enhanced-stage.hf.space`
5
 
6
+ ## Mandatory compliance evidence (OpenEnv + submission artifacts)
7
 
8
  1. OpenEnv latest-release workflow in use
9
  - Local package validate: `openenv validate .`
 
20
  - Owner action: publish final link and add URL to submission package.
21
 
22
  4. Compliance lock matrix
23
+ - Rubric and mandatory-requirement mapping: `docs/project/COMPLIANCE_LOCK_MATRIX.md`
24
 
25
+ ## Live metrics snapshot (observable improvement evidence)
26
 
27
  Source endpoints:
28
  - `GET /metrics`
 
78
  - 2-minute live walkthrough: `docs/pitch/DEMO_WALKTHROUGH.md`
79
  - <2 minute recording script: `docs/pitch/YOUTUBE_RECORDING_SCRIPT.md`
80
  - Manual demo test cases: `docs/pitch/DEMO_MANUAL_TEST_CASES.md`
81
+ - Behavior delta proof sheet: `docs/project/BEHAVIORAL_DELTA_PROOF.md`
82
  - Sub-theme matrix: `docs/project/SUBTHEME_EVIDENCE_MATRIX.md`
83
  - Reward-hacking defense: `docs/project/REWARD_HACKING_DEFENSE.md`
84
  - Training audit ledger: `docs/project/TRAINING_AUDIT_LOG.md`
docs/project/PLAN_OF_ACTION.md CHANGED
@@ -1,23 +1,23 @@
1
- # Plan of action + BRD compliance matrix
2
 
3
- **Authority:** [`../../../design/hackathon_brd.md`](../../../design/hackathon_brd.md) (Section 17 hard gates, Section 18 rubric, §18.1 pitch format).
4
 
5
  ---
6
 
7
- ## BRD compliance — strict review (evidence-based)
8
 
9
  | Ref | Requirement | Repo / runtime evidence | Status |
10
  |-----|----------------|-------------------------|--------|
11
- | **§17.1** | OpenEnv **(latest release)** — not fork, not old | `openenv validate .` OK; `openenv validate --url` OK after contract routes; `openenv push` workflow; Colab `pip install openenv>=0.2.3`; README “BRD hard gate — OpenEnv” | **Pass** — document `openenv --version` on submission day |
12
- | **§17.2** | Minimal training in **Colab** with **Unsloth** or **HF TRL** | `notebooks/grpo_colab_v2.ipynb` installs TRL + Unsloth + trains GRPO | **Pass pending** — you must execute notebook once on GPU before final submit |
13
- | **§17.3** | **HF blog** *or* **YouTube video &lt; 2 min** | `docs/blog/blog_post.md` draft exists; **publish** + URL in README/submission | **Gap** — publishing is owner action |
14
- | **§18.1** | **3 min** pitch + **2 min** Q&A | `docs/pitch/PITCH.md` script + Q&A table timed to format | **Pass** (content) — rehearsal is owner action |
15
- | **§18.2 C1** | Environment innovation **40%** | Multi-agent, partial observability, 7 incidents, INC007 schema drift, coalition | **Strong** — rehearse one INC007 sentence |
16
- | **§18.2 C2** | Storytelling **30%** | Dashboard + demo flow in `docs/pitch/DEMO_MANUAL_TEST_CASES.md` + `docs/pitch/PITCH.md` | **Pass** — practice run |
17
- | **§18.2 C3** | Observable reward improvement **20%** | `/learning-curve`, dashboard, `scripts/export_reward_plot.py` | **Pass** — keep Space populated for live curve |
18
- | **§18.2 C4** | Reward + pipeline coherence **10%** | Sparse reward, dimensions in README; trained **behaviour** narrative | **Medium** — tie checkpoint to different IC actions, not only reward |
19
 
20
- **Non-BRD but operational:** `pytest tests/` green; `test_hf_space_deployment.py` 8/8 on stage URL; `./gate.sh` or `scripts/shell/gate.sh` optional full run before deploy.
21
 
22
  ---
23
 
@@ -30,7 +30,7 @@
30
  | 3 | Execute **`grpo_colab_v2.ipynb`** on Colab T4+ end-to-end | Team | Notebook completes; curve updates on stage |
31
  | 4 | **`python scripts/export_reward_plot.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space`** → drop PNG into deck | Team | `docs/images/training_reward_curve.png` in slide asset folder |
32
  | 5 | Rehearse **`docs/pitch/PITCH.md`** with live demo (timer **3:00**) | Team | No overrun; Q&A 2:00 bank ready |
33
- | 6 | **Publish** HF blog *or* record **≤2 min** YouTube; add URL to README | Team | BRD §17.3 satisfied |
34
  | 7 | Submission package: Space URL, Colab link, blog/video link, `openenv --version` screenshot | Team | Checklist complete |
35
  | 8 | (Optional) INC007 **60 s** clip for innovation Q&A | Team | Recorded path in repo or drive link |
36
 
 
1
+ # Plan of action + hackathon compliance matrix
2
 
3
+ **Scope:** This matrix tracks evidence against **hackathon compliance criteria** (OpenEnv toolchain, Colab training with TRL/Unsloth, published blog or short video, pitch format, and judging rubric dimensions). Align deliverables with the official organizer requirements for your submission wave.
4
 
5
  ---
6
 
7
+ ## Compliance — strict review (evidence-based)
8
 
9
  | Ref | Requirement | Repo / runtime evidence | Status |
10
  |-----|----------------|-------------------------|--------|
11
+ | **C1** | OpenEnv **(latest release)** — not fork, not old | `openenv validate .` OK; `openenv validate --url` OK after contract routes; `openenv push` workflow; Colab `pip install openenv>=0.2.3`; README **OpenEnv (reproduce)** section | **Pass** — document `openenv --version` on submission day |
12
+ | **C2** | Minimal training in **Colab** with **Unsloth** or **HF TRL** | `notebooks/grpo_colab_v2.ipynb` installs TRL + Unsloth and trains with GRPO | **Pass pending** — you must execute the notebook once on a GPU before the final submission |
13
+ | **C3** | **HF blog** *or* **YouTube video &lt; 2 min** | `docs/blog/blog_post_hf.md` draft exists; **publish** + URL in README/submission | **Gap** — publishing is owner action |
14
+ | **C4** | **3 min** pitch + **2 min** Q&A | `docs/pitch/PITCH.md` script + Q&A table timed to format | **Pass** (content) — rehearsal is owner action |
15
+ | **R1** | Environment innovation **40%** | Multi-agent, partial observability, 7 incidents, INC007 schema drift, coalition | **Strong** — rehearse one INC007 sentence |
16
+ | **R2** | Storytelling **30%** | Dashboard + demo flow in `docs/pitch/DEMO_MANUAL_TEST_CASES.md` + `docs/pitch/PITCH.md` | **Pass** — practice run |
17
+ | **R3** | Observable reward improvement **20%** | `/learning-curve`, dashboard, `scripts/export_reward_plot.py` | **Pass** — keep Space populated for live curve |
18
+ | **R4** | Reward + pipeline coherence **10%** | Sparse reward, dimensions in README; trained **behaviour** narrative | **Medium** — tie checkpoint to different IC actions, not only reward |
19
 
20
+ **Additional quality gates:** `pytest tests/` green; `test_hf_space_deployment.py` 8/8 on stage URL; `./gate.sh` or `scripts/shell/gate.sh` optional full run before deploy.
21
 
22
  ---
23
 
 
30
  | 3 | Execute **`grpo_colab_v2.ipynb`** on Colab T4+ end-to-end | Team | Notebook completes; curve updates on stage |
31
  | 4 | **`python scripts/export_reward_plot.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space`** → drop PNG into deck | Team | `docs/images/training_reward_curve.png` in slide asset folder |
32
  | 5 | Rehearse **`docs/pitch/PITCH.md`** with live demo (timer **3:00**) | Team | No overrun; Q&A 2:00 bank ready |
33
+ | 6 | **Publish** HF blog *or* record **≤2 min** YouTube; add URL to README | Team | Hackathon submission artifact (blog or video) satisfied |
34
  | 7 | Submission package: Space URL, Colab link, blog/video link, `openenv --version` screenshot | Team | Checklist complete |
35
  | 8 | (Optional) INC007 **60 s** clip for innovation Q&A | Team | Recorded path in repo or drive link |
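
The reward-curve step in row 4 can be sanity-checked offline before exporting the PNG. The following is a minimal sketch, not the logic of `scripts/export_reward_plot.py` (which is not shown here): it assumes records shaped like the entries in `episode_rewards.json` (a `reward` field per episode), uses the `0.265` baseline quoted in the README caption, and picks a trailing window of 10 episodes purely for illustration. The helper names `rolling_average` and `summarize` are hypothetical.

```python
import json
from statistics import mean

BASELINE = 0.265  # baseline value quoted in the README plot caption


def rolling_average(values, window=10):
    """Trailing mean over at most `window` most recent values (window size is an assumption)."""
    averages = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        averages.append(mean(values[start:i + 1]))
    return averages


def summarize(records, window=10, baseline=BASELINE):
    """Summarize per-episode rewards from records that carry a `reward` field."""
    rewards = [float(r["reward"]) for r in records]
    rolling = rolling_average(rewards, window)
    return {
        "episodes": len(rewards),
        "mean_reward": mean(rewards) if rewards else None,
        "last_rolling": rolling[-1] if rolling else None,
        "above_baseline": sum(r > baseline for r in rewards),
    }


# First three legacy rewards from episode_rewards.json, trimmed to the fields used here.
sample = json.loads(
    '[{"session_id": "legacy-1", "reward": 0.3047},'
    ' {"session_id": "legacy-2", "reward": 0.2568},'
    ' {"session_id": "legacy-3", "reward": 0.3226}]'
)
print(summarize(sample, window=2))
```

To run it on the full file, replace `sample` with `json.load(open("episode_rewards.json"))`; a flat or falling `last_rolling` value is a cue to rerun training before regenerating the slide asset.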
36
 
docs/project/PROJECT_STATUS.md CHANGED
@@ -1,7 +1,8 @@
1
  # NEXUS Enhanced — project status & backlog
2
 
3
- **Source of truth for judging:** [`../../../design/hackathon_brd.md`](../../../design/hackathon_brd.md) (Section 17 hard gates + Section 18 rubric).
4
- **Last reviewed:** [`../pitch/PITCH.md`](../pitch/PITCH.md), [`PLAN_OF_ACTION.md`](PLAN_OF_ACTION.md), [`../../scripts/export_reward_plot.py`](../../scripts/export_reward_plot.py), BRD compliance matrix.
 
5
 
6
  **See also:** [`../pitch/DEMO_MANUAL_TEST_CASES.md`](../pitch/DEMO_MANUAL_TEST_CASES.md).
7
 
@@ -22,24 +23,24 @@
22
 
23
  ---
24
 
25
- ## BRD hard gates (Section 17) — checklist
26
 
27
  | # | Requirement | Status |
28
  |---|-------------|--------|
29
- | 1 | OpenEnv **latest release** | **Evidence in repo:** README “BRD hard gate — OpenEnv commands; `openenv validate .` + `openenv validate --url` green after stubs; Colab still `pip install openenv>=0.2.3`. Record `openenv --version` in your pitch appendix. |
30
  | 2 | Minimal **Colab** training script (**Unsloth** or **HF TRL**) | **Notebook aligned:** `grpo_colab_v2.ipynb` now defaults `BASE_URL` to **stage** (`kunalkachru23-nexus-enhanced-stage.hf.space`). You still need one successful T4+ run before submission. |
31
  | 3 | **Blog (HF)** or **Video (YouTube, &lt;2 min)** | **You own:** publish + link in submission. |
32
 
33
  ---
34
 
35
- ## Judging rubric (Section 18) — quick gap scan
36
 
37
  | Criterion | Weight | Focus next |
38
  |-----------|--------|------------|
39
  | Environment innovation | 40% | One sharp “why NEXUS is hard” story (partial observability, schema drift, coalitions) backed by INC007 / live UI. |
40
  | Storytelling | 30% | 3-minute pitch script rehearsed; demo path: metrics → auto-demo → guided manual complete. |
41
  | Observable reward improvement | 20% | Keep dashboard + `/learning-curve` honest; optional: export static plot artifact for slides. |
42
- | Reward / pipeline coherence | 10% | Tie reward dimensions to BRD wording; show before/after behaviour if you have checkpoints. |
43
 
44
  ---
45
 
@@ -47,7 +48,7 @@
47
 
48
  1. ~~**Submission hygiene:** OpenEnv reproduce block in README + `outputs/` for clean `openenv validate .`~~ (done this iteration).
49
  2. **Colab:** Run `grpo_colab_v2.ipynb` once on T4+; capture reward curve screenshot for slides.
50
- 3. **Judge artifacts (BRD gate):** Publish HF **blog** or **YouTube &lt;2 min** and add the link next to README “Blog Post” section.
51
  4. **Pitch (30% storytelling):** 3-minute path: Training tab metrics → **Run Auto-Demo** → Manual **Guided: fill + execute** to complete (see [`../pitch/DEMO_MANUAL_TEST_CASES.md`](../pitch/DEMO_MANUAL_TEST_CASES.md)).
52
  5. **Optional hardening:** Richer `/schema` from Pydantic models (cosmetic).
53
 
 
1
  # NEXUS Enhanced — project status & backlog
2
 
3
+ **Compliance reference:** Use **hackathon compliance criteria** (OpenEnv latest in toolchain, Colab training with TRL/Unsloth, published blog or ≤2 min video, pitch format, judging rubric). This file tracks status against those expectations; compliance wording lives in-repo only (no linked external design doc).
4
+
5
+ **Last reviewed:** [`../pitch/PITCH.md`](../pitch/PITCH.md), [`PLAN_OF_ACTION.md`](PLAN_OF_ACTION.md), [`../../scripts/export_reward_plot.py`](../../scripts/export_reward_plot.py), [`COMPLIANCE_LOCK_MATRIX.md`](COMPLIANCE_LOCK_MATRIX.md).
6
 
7
  **See also:** [`../pitch/DEMO_MANUAL_TEST_CASES.md`](../pitch/DEMO_MANUAL_TEST_CASES.md).
8
 
 
23
 
24
  ---
25
 
26
+ ## Mandatory submission checklist (OpenEnv + artifacts)
27
 
28
  | # | Requirement | Status |
29
  |---|-------------|--------|
30
+ | 1 | OpenEnv **latest release** | **Evidence in repo:** README **OpenEnv (reproduce)** commands; `openenv validate .` + `openenv validate --url` green after stubs; Colab still `pip install "openenv>=0.2.3"` (quoted so the shell does not treat `>=` as a redirect). Record `openenv --version` in your pitch appendix. |
31
  | 2 | Minimal **Colab** training script (**Unsloth** or **HF TRL**) | **Notebook aligned:** `grpo_colab_v2.ipynb` now defaults `BASE_URL` to **stage** (`kunalkachru23-nexus-enhanced-stage.hf.space`). You still need one successful T4+ run before submission. |
32
  | 3 | **Blog (HF)** or **Video (YouTube, &lt;2 min)** | **You own:** publish + link in submission. |
33
 
34
  ---
35
 
36
+ ## Judging rubric — quick gap scan
37
 
38
  | Criterion | Weight | Focus next |
39
  |-----------|--------|------------|
40
  | Environment innovation | 40% | One sharp “why NEXUS is hard” story (partial observability, schema drift, coalitions) backed by INC007 / live UI. |
41
  | Storytelling | 30% | 3-minute pitch script rehearsed; demo path: metrics → auto-demo → guided manual complete. |
42
  | Observable reward improvement | 20% | Keep dashboard + `/learning-curve` honest; optional: export static plot artifact for slides. |
43
+ | Reward / pipeline coherence | 10% | Tie reward dimensions to the published reward model; show before/after behaviour if you have checkpoints. |
44
 
45
  ---
46
 
 
48
 
49
  1. ~~**Submission hygiene:** OpenEnv reproduce block in README + `outputs/` for clean `openenv validate .`~~ (done this iteration).
50
  2. **Colab:** Run `grpo_colab_v2.ipynb` once on T4+; capture reward curve screenshot for slides.
51
+ 3. **Submission artifacts:** Publish HF **blog** or **YouTube &lt;2 min** and add the link next to README “Blog Post” section.
52
  4. **Pitch (30% storytelling):** 3-minute path: Training tab metrics → **Run Auto-Demo** → Manual **Guided: fill + execute** to complete (see [`../pitch/DEMO_MANUAL_TEST_CASES.md`](../pitch/DEMO_MANUAL_TEST_CASES.md)).
53
  5. **Optional hardening:** Richer `/schema` from Pydantic models (cosmetic).
54
 
docs/project/SUBTHEME_EVIDENCE_MATRIX.md CHANGED
@@ -1,14 +1,14 @@
1
  # Sub-Theme Evidence Matrix (Judge-Ready)
2
 
3
- This matrix maps implemented mechanics to BRD wording and where judges can verify each claim.
4
 
5
- **Parent BRD themes (§9–14)** and the **four §18 scoring criteria** are mapped end-to-end in `docs/project/COMPLIANCE_LOCK_MATRIX.md` (theme demonstration + demo beats). This file focuses on **§15 sponsor sub-themes** and cross-links to the same implementation paths.
6
 
7
  ## Targeted sub-themes
8
 
9
- | Sponsor / Sub-theme | BRD wording to satisfy | Implemented evidence | Where to verify |
10
  |---|---|---|---|
11
- | **Theme 3.2 — Personalized (BRD §12)** | Personal tasks, delegation, conflicting priorities | **INC008** — executive EA calendar conflict (family vs board), smart-scheduler auto-accept root cause; `IncidentType.PERSONAL_ASSISTANT` | Dashboard manual validation select **INC008**; `server/incidents.py` |
12
  | Fleet AI — Scalable Oversight | "monitor, analyze, and explain" | Oversight-oriented behavior + oversight reward component in final score model | `server/reward.py`, `server/agents.py`, live run transcript from `/demo/run/INC003` |
13
  | Halluminate — Multi-Actor Environments | "interacts with and manages multiple actors ... to discover and achieve task" | IC orchestrates L1/L2/SRE/PM actions with partial observability; coalition mechanics present | `server/environment.py`, `server/agents.py`, `server/incidents.py` (INC003+), dashboard manual flow |
14
  | Snorkel AI — Simulated Experts | "changing requirements/preferences" | Rotating expert criteria and adaptive scoring emphasis over episodes | `server/reward.py`, project docs (`README.md`, `docs/project/PLAN_OF_ACTION.md`) |
@@ -17,9 +17,9 @@ This matrix maps implemented mechanics to BRD wording and where judges can verif
17
  | Scale AI — Non-code business (HR & IT) | Long-horizon **non-code** workflows in Sales / PM / **HR & IT** only | **IT / on-call incident command** (status pages, escalations, runbooks, customer comms)—no code-writing task as the core object | Multi-step dashboard validation, SLA/revenue semantics in `server/incidents.py`, L1 customer paths |
18
  | Scaler AI Labs — Multi-App Enterprise RL | "business rule nuances" in enterprise multi-app world | Datadog/Jira/Runbook/Customer interactions with operational constraints and role-specific visibility | `server/tools.py`, `server/incidents.py`, dashboard and auto-demo flow |
19
 
20
- ## Cross-criterion reinforcement
21
 
22
- - Criterion 1 (Innovation 40%): multi-agent + partial observability + schema drift + business-rule constraints.
23
- - Criterion 2 (Storytelling 30%): deterministic live flow in `docs/pitch/PITCH.md` and `docs/pitch/DEMO_WALKTHROUGH.md`.
24
- - Criterion 3 (Improvement 20%): `/learning-curve`, `/metrics`, `docs/images/training_reward_curve.png`.
25
- - Criterion 4 (Pipeline 10%): Colab GRPO script + behavior delta sheet (`docs/project/BEHAVIORAL_DELTA_PROOF.md`).
 
1
  # Sub-Theme Evidence Matrix (Judge-Ready)
2
 
3
+ This matrix maps implemented mechanics to **organizer theme wording** and where judges can verify each claim.
4
 
5
+ **Parent hackathon themes** and the **four judging rubric rows** are mapped end-to-end in `docs/project/COMPLIANCE_LOCK_MATRIX.md` (theme demonstration + demo beats). This file focuses on **sponsor sub-themes** and cross-links to the same implementation paths.
6
 
7
  ## Targeted sub-themes
8
 
9
+ | Sponsor / Sub-theme | Theme wording to satisfy | Implemented evidence | Where to verify |
10
  |---|---|---|---|
11
+ | **Theme 3.2 — Personalized** | Personal tasks, delegation, conflicting priorities | **INC008** — executive EA calendar conflict (family vs board), smart-scheduler auto-accept root cause; `IncidentType.PERSONAL_ASSISTANT` | Dashboard manual validation select **INC008**; `server/incidents.py` |
12
  | Fleet AI — Scalable Oversight | "monitor, analyze, and explain" | Oversight-oriented behavior + oversight reward component in final score model | `server/reward.py`, `server/agents.py`, live run transcript from `/demo/run/INC003` |
13
  | Halluminate — Multi-Actor Environments | "interacts with and manages multiple actors ... to discover and achieve task" | IC orchestrates L1/L2/SRE/PM actions with partial observability; coalition mechanics present | `server/environment.py`, `server/agents.py`, `server/incidents.py` (INC003+), dashboard manual flow |
14
  | Snorkel AI — Simulated Experts | "changing requirements/preferences" | Rotating expert criteria and adaptive scoring emphasis over episodes | `server/reward.py`, project docs (`README.md`, `docs/project/PLAN_OF_ACTION.md`) |
 
17
  | Scale AI — Non-code business (HR & IT) | Long-horizon **non-code** workflows in Sales / PM / **HR & IT** only | **IT / on-call incident command** (status pages, escalations, runbooks, customer comms)—no code-writing task as the core object | Multi-step dashboard validation, SLA/revenue semantics in `server/incidents.py`, L1 customer paths |
18
  | Scaler AI Labs — Multi-App Enterprise RL | "business rule nuances" in enterprise multi-app world | Datadog/Jira/Runbook/Customer interactions with operational constraints and role-specific visibility | `server/tools.py`, `server/incidents.py`, dashboard and auto-demo flow |
19
 
20
+ ## Cross-rubric reinforcement
21
 
22
+ - Innovation (40%): multi-agent + partial observability + schema drift + business-rule constraints.
23
+ - Storytelling (30%): deterministic live flow in `docs/pitch/PITCH.md` and `docs/pitch/DEMO_WALKTHROUGH.md`.
24
+ - Observable improvement (20%): `/learning-curve`, `/metrics`, `docs/images/training_reward_curve.png`.
25
+ - Pipeline coherence (10%): Colab GRPO script + behavior delta sheet (`docs/project/BEHAVIORAL_DELTA_PROOF.md`).
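
The four rubric weights above sum to 100%, so a rough self-assessment can be computed as a weighted sum. This is a sketch under a stated assumption: the organizers' actual aggregation method is not described in these documents, and a linear combination is only the simplest reading of the percentages. The dictionary keys and the `weighted_score` helper are hypothetical names, and the example scores are placeholders rather than real judging results.

```python
# Rubric weights as listed in the matrix (40 / 30 / 20 / 10).
RUBRIC_WEIGHTS = {
    "innovation": 0.40,
    "storytelling": 0.30,
    "observable_improvement": 0.20,
    "pipeline_coherence": 0.10,
}


def weighted_score(scores, weights=RUBRIC_WEIGHTS):
    """Combine per-criterion scores in [0, 1] linearly (linear aggregation is an assumption)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"score for {name!r} must be within [0, 1]")
    return sum(weights[name] * scores[name] for name in weights)


# Placeholder self-assessment, not real judging data.
example = {
    "innovation": 0.9,
    "storytelling": 0.7,
    "observable_improvement": 0.8,
    "pipeline_coherence": 0.5,
}
print(round(weighted_score(example), 3))
```

Because innovation carries four times the weight of pipeline coherence, an equal improvement on the first row moves the total four times as much as the same improvement on the last.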
docs/project/TEST_RESULTS_SUMMARY.md CHANGED
@@ -474,7 +474,7 @@ Episodes will complete naturally without needing the "End Episode" button workar
474
  | **API Tests** | ✅ PASS | Episode completion with reward |
475
  | **UI Playwright** | ✅ PASS | 8-iteration state management |
476
  | **Manual Testing** | ⚠️ PARTIAL | Phase progression works, reward needs workaround |
477
- | **Environment Audit** | ✅ PASS | All BRD requirements met |
478
 
479
  ---
480
 
 
474
  | **API Tests** | ✅ PASS | Episode completion with reward |
475
  | **UI Playwright** | ✅ PASS | 8-iteration state management |
476
  | **Manual Testing** | ⚠️ PARTIAL | Phase progression works, reward needs workaround |
477
+ | **Environment Audit** | ✅ PASS | All hackathon compliance requirements met |
478
 
479
  ---
480
 
episode_rewards.json CHANGED
@@ -1 +1 @@
1
- [{"session_id": "legacy-1", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3047, "timestamp": 0.0}, {"session_id": "legacy-2", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2568, "timestamp": 0.0}, {"session_id": "legacy-3", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3226, "timestamp": 0.0}, {"session_id": "legacy-4", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3956, "timestamp": 0.0}, {"session_id": "legacy-5", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2579, "timestamp": 0.0}, {"session_id": "legacy-6", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2608, "timestamp": 0.0}, {"session_id": "legacy-7", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4088, "timestamp": 0.0}, {"session_id": "legacy-8", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3468, "timestamp": 0.0}, {"session_id": "legacy-9", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2507, "timestamp": 0.0}, {"session_id": "legacy-10", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3346, "timestamp": 0.0}, {"session_id": "legacy-11", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.257, "timestamp": 0.0}, {"session_id": "legacy-12", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2597, "timestamp": 0.0}, {"session_id": "legacy-13", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3193, "timestamp": 0.0}, {"session_id": "legacy-14", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.1497, "timestamp": 0.0}, {"session_id": "legacy-15", "run_id": "default", "incident_id": "unknown", "difficulty": 
"unknown", "reward": 0.1677, "timestamp": 0.0}, {"session_id": "legacy-16", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2636, "timestamp": 0.0}, {"session_id": "legacy-17", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2305, "timestamp": 0.0}, {"session_id": "legacy-18", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3396, "timestamp": 0.0}, {"session_id": "legacy-19", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2447, "timestamp": 0.0}, {"session_id": "legacy-20", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2073, "timestamp": 0.0}, {"session_id": "legacy-21", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4404, "timestamp": 0.0}, {"session_id": "legacy-22", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.308, "timestamp": 0.0}, {"session_id": "legacy-23", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3344, "timestamp": 0.0}, {"session_id": "legacy-24", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2179, "timestamp": 0.0}, {"session_id": "legacy-25", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2912, "timestamp": 0.0}, {"session_id": "legacy-26", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3466, "timestamp": 0.0}, {"session_id": "legacy-27", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2485, "timestamp": 0.0}, {"session_id": "legacy-28", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3736, "timestamp": 0.0}, {"session_id": "legacy-29", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2984, "timestamp": 0.0}, {"session_id": "legacy-30", 
"run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.326, "timestamp": 0.0}, {"session_id": "legacy-31", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3041, "timestamp": 0.0}, {"session_id": "legacy-32", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5033, "timestamp": 0.0}, {"session_id": "legacy-33", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.357, "timestamp": 0.0}, {"session_id": "legacy-34", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2764, "timestamp": 0.0}, {"session_id": "legacy-35", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4297, "timestamp": 0.0}, {"session_id": "legacy-36", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2691, "timestamp": 0.0}, {"session_id": "legacy-37", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3864, "timestamp": 0.0}, {"session_id": "legacy-38", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2158, "timestamp": 0.0}, {"session_id": "legacy-39", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2693, "timestamp": 0.0}, {"session_id": "legacy-40", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3942, "timestamp": 0.0}, {"session_id": "legacy-41", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4404, "timestamp": 0.0}, {"session_id": "legacy-42", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3979, "timestamp": 0.0}, {"session_id": "legacy-43", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3779, "timestamp": 0.0}, {"session_id": "legacy-44", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 
0.366, "timestamp": 0.0}, {"session_id": "legacy-45", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2747, "timestamp": 0.0}, {"session_id": "legacy-46", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3383, "timestamp": 0.0}, {"session_id": "legacy-47", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3619, "timestamp": 0.0}, {"session_id": "legacy-48", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4863, "timestamp": 0.0}, {"session_id": "legacy-49", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4321, "timestamp": 0.0}, {"session_id": "legacy-50", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2665, "timestamp": 0.0}, {"session_id": "legacy-51", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4363, "timestamp": 0.0}, {"session_id": "legacy-52", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3825, "timestamp": 0.0}, {"session_id": "legacy-53", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3621, "timestamp": 0.0}, {"session_id": "legacy-54", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4681, "timestamp": 0.0}, {"session_id": "legacy-55", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5045, "timestamp": 0.0}, {"session_id": "legacy-56", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4995, "timestamp": 0.0}, {"session_id": "legacy-57", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3607, "timestamp": 0.0}, {"session_id": "legacy-58", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.406, "timestamp": 0.0}, {"session_id": "legacy-59", "run_id": "default", 
"incident_id": "unknown", "difficulty": "unknown", "reward": 0.4602, "timestamp": 0.0}, {"session_id": "legacy-60", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5146, "timestamp": 0.0}, {"session_id": "legacy-61", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4012, "timestamp": 0.0}, {"session_id": "legacy-62", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4275, "timestamp": 0.0}, {"session_id": "legacy-63", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3568, "timestamp": 0.0}, {"session_id": "legacy-64", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3525, "timestamp": 0.0}, {"session_id": "legacy-65", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5161, "timestamp": 0.0}, {"session_id": "legacy-66", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5625, "timestamp": 0.0}, {"session_id": "legacy-67", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4512, "timestamp": 0.0}, {"session_id": "legacy-68", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5401, "timestamp": 0.0}, {"session_id": "legacy-69", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4917, "timestamp": 0.0}, {"session_id": "legacy-70", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4141, "timestamp": 0.0}, {"session_id": "legacy-71", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4975, "timestamp": 0.0}, {"session_id": "legacy-72", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5945, "timestamp": 0.0}, {"session_id": "legacy-73", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4715, "timestamp": 
0.0}, {"session_id": "legacy-74", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.6025, "timestamp": 0.0}, {"session_id": "legacy-75", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2706, "timestamp": 0.0}, {"session_id": "legacy-76", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5489, "timestamp": 0.0}, {"session_id": "legacy-77", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.493, "timestamp": 0.0}, {"session_id": "legacy-78", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.465, "timestamp": 0.0}, {"session_id": "legacy-79", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4992, "timestamp": 0.0}, {"session_id": "legacy-80", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3357, "timestamp": 0.0}, {"session_id": "legacy-81", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4801, "timestamp": 0.0}, {"session_id": "legacy-82", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5291, "timestamp": 0.0}, {"session_id": "legacy-83", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.6217, "timestamp": 0.0}, {"session_id": "legacy-84", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4649, "timestamp": 0.0}, {"session_id": "legacy-85", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4446, "timestamp": 0.0}, {"session_id": "legacy-86", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.9484, "timestamp": 0.0}, {"session_id": "legacy-87", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5883, "timestamp": 0.0}, {"session_id": "legacy-88", "run_id": "default", "incident_id": "unknown", 
"difficulty": "unknown", "reward": 0.5443, "timestamp": 0.0}, {"session_id": "legacy-89", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4785, "timestamp": 0.0}, {"session_id": "legacy-90", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5649, "timestamp": 0.0}, {"session_id": "legacy-91", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5345, "timestamp": 0.0}, {"session_id": "legacy-92", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.6071, "timestamp": 0.0}, {"session_id": "legacy-93", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4764, "timestamp": 0.0}, {"session_id": "legacy-94", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5092, "timestamp": 0.0}, {"session_id": "legacy-95", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.507, "timestamp": 0.0}, {"session_id": "legacy-96", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4242, "timestamp": 0.0}, {"session_id": "legacy-97", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5679, "timestamp": 0.0}, {"session_id": "legacy-98", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.568, "timestamp": 0.0}, {"session_id": "legacy-99", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5504, "timestamp": 0.0}, {"session_id": "legacy-100", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-101", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-102", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": 
"legacy-103", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-104", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.8889, "timestamp": 0.0}, {"session_id": "legacy-105", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.8889, "timestamp": 0.0}, {"session_id": "legacy-106", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.8889, "timestamp": 0.0}, {"session_id": "legacy-107", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-108", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-109", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-110", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-111", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-112", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-113", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-114", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-115", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-116", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-117", "run_id": "default", "incident_id": "unknown", 
"difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-118", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-119", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-120", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-121", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-122", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-123", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-124", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-125", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-126", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-127", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-128", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-129", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-130", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-131", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4252, "timestamp": 0.0}, 
{"session_id": "legacy-132", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4252, "timestamp": 0.0}, {"session_id": "legacy-133", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4252, "timestamp": 0.0}, {"session_id": "a003e153-9026-406c-879d-25172aa11eda", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776965109.7711759}, {"session_id": "bf34a807-a40d-4ae5-b8b0-4d8333a62c81", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776965109.825952}, {"session_id": "8876cfb6-f5e9-4ad0-941e-84bc7d8e2b96", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776967485.476565}, {"session_id": "db31a9f2-ca48-4f93-8d1e-3fa50fa5d21a", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776967485.519014}, {"session_id": "4aaf8a4d-0db7-47f5-9dca-8bc360b19088", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776978808.275504}, {"session_id": "c13f72bb-5715-4209-9872-e85acecbc8b3", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776978808.317802}, {"session_id": "4da60545-6cb0-4a65-acf2-9a8fd8cf7e59", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776982752.592321}, {"session_id": "48a56b54-d4cd-47c9-b2e1-165dd330a6c9", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776982752.632246}, {"session_id": "2eab58c0-ffe3-4a11-9512-3b606eb2a957", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776983166.506444}, {"session_id": "ec0ae7f3-6095-4dce-b323-1cd1482c1ba4", "run_id": 
"pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776983166.548518}, {"session_id": "b720c132-ddce-41fa-99d1-be7ebaa32de2", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776984636.36427}, {"session_id": "edb8bdb1-5903-4710-b9f4-cd68c67f6474", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776984636.4119911}, {"session_id": "2654393f-5c7e-4d17-938d-2570195a3c5f", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776984647.97215}, {"session_id": "c0e98721-13b3-49e3-a3fd-735801f5f9f7", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776984648.018783}, {"session_id": "6fe810db-c9ae-48e7-a205-6d2df902b555", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777049568.085221}, {"session_id": "58c9af47-0d8b-4359-9889-a6a04552db83", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777049568.1298869}, {"session_id": "552395e3-cae4-4423-bc3c-9314b8cc276d", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777049580.030255}, {"session_id": "49a9bb9e-c9b1-465f-a617-6443da42d1be", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777049580.0770292}, {"session_id": "fb4a1426-bf33-47f0-8357-29b19a75a19c", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777051304.085602}, {"session_id": "2714abc5-60be-4bf4-981c-da1c2ba328b5", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777051304.1293068}, {"session_id": 
"69407c50-1635-4bd8-b1b9-e1017cbb5297", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777070951.992904}, {"session_id": "9fe3899c-455e-483b-b5ce-19240df9f0e6", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777070952.039953}]
 
1
+ [{"session_id": "legacy-1", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3047, "timestamp": 0.0}, {"session_id": "legacy-2", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2568, "timestamp": 0.0}, {"session_id": "legacy-3", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3226, "timestamp": 0.0}, {"session_id": "legacy-4", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3956, "timestamp": 0.0}, {"session_id": "legacy-5", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2579, "timestamp": 0.0}, {"session_id": "legacy-6", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2608, "timestamp": 0.0}, {"session_id": "legacy-7", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4088, "timestamp": 0.0}, {"session_id": "legacy-8", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3468, "timestamp": 0.0}, {"session_id": "legacy-9", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2507, "timestamp": 0.0}, {"session_id": "legacy-10", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3346, "timestamp": 0.0}, {"session_id": "legacy-11", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.257, "timestamp": 0.0}, {"session_id": "legacy-12", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2597, "timestamp": 0.0}, {"session_id": "legacy-13", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3193, "timestamp": 0.0}, {"session_id": "legacy-14", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.1497, "timestamp": 0.0}, {"session_id": "legacy-15", "run_id": "default", "incident_id": "unknown", "difficulty": 
"unknown", "reward": 0.1677, "timestamp": 0.0}, {"session_id": "legacy-16", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2636, "timestamp": 0.0}, {"session_id": "legacy-17", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2305, "timestamp": 0.0}, {"session_id": "legacy-18", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3396, "timestamp": 0.0}, {"session_id": "legacy-19", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2447, "timestamp": 0.0}, {"session_id": "legacy-20", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2073, "timestamp": 0.0}, {"session_id": "legacy-21", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4404, "timestamp": 0.0}, {"session_id": "legacy-22", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.308, "timestamp": 0.0}, {"session_id": "legacy-23", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3344, "timestamp": 0.0}, {"session_id": "legacy-24", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2179, "timestamp": 0.0}, {"session_id": "legacy-25", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2912, "timestamp": 0.0}, {"session_id": "legacy-26", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3466, "timestamp": 0.0}, {"session_id": "legacy-27", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2485, "timestamp": 0.0}, {"session_id": "legacy-28", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3736, "timestamp": 0.0}, {"session_id": "legacy-29", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2984, "timestamp": 0.0}, {"session_id": "legacy-30", 
"run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.326, "timestamp": 0.0}, {"session_id": "legacy-31", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3041, "timestamp": 0.0}, {"session_id": "legacy-32", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5033, "timestamp": 0.0}, {"session_id": "legacy-33", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.357, "timestamp": 0.0}, {"session_id": "legacy-34", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2764, "timestamp": 0.0}, {"session_id": "legacy-35", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4297, "timestamp": 0.0}, {"session_id": "legacy-36", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2691, "timestamp": 0.0}, {"session_id": "legacy-37", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3864, "timestamp": 0.0}, {"session_id": "legacy-38", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2158, "timestamp": 0.0}, {"session_id": "legacy-39", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2693, "timestamp": 0.0}, {"session_id": "legacy-40", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3942, "timestamp": 0.0}, {"session_id": "legacy-41", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4404, "timestamp": 0.0}, {"session_id": "legacy-42", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3979, "timestamp": 0.0}, {"session_id": "legacy-43", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3779, "timestamp": 0.0}, {"session_id": "legacy-44", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 
0.366, "timestamp": 0.0}, {"session_id": "legacy-45", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2747, "timestamp": 0.0}, {"session_id": "legacy-46", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3383, "timestamp": 0.0}, {"session_id": "legacy-47", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3619, "timestamp": 0.0}, {"session_id": "legacy-48", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4863, "timestamp": 0.0}, {"session_id": "legacy-49", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4321, "timestamp": 0.0}, {"session_id": "legacy-50", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2665, "timestamp": 0.0}, {"session_id": "legacy-51", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4363, "timestamp": 0.0}, {"session_id": "legacy-52", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3825, "timestamp": 0.0}, {"session_id": "legacy-53", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3621, "timestamp": 0.0}, {"session_id": "legacy-54", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4681, "timestamp": 0.0}, {"session_id": "legacy-55", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5045, "timestamp": 0.0}, {"session_id": "legacy-56", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4995, "timestamp": 0.0}, {"session_id": "legacy-57", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3607, "timestamp": 0.0}, {"session_id": "legacy-58", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.406, "timestamp": 0.0}, {"session_id": "legacy-59", "run_id": "default", 
"incident_id": "unknown", "difficulty": "unknown", "reward": 0.4602, "timestamp": 0.0}, {"session_id": "legacy-60", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5146, "timestamp": 0.0}, {"session_id": "legacy-61", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4012, "timestamp": 0.0}, {"session_id": "legacy-62", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4275, "timestamp": 0.0}, {"session_id": "legacy-63", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3568, "timestamp": 0.0}, {"session_id": "legacy-64", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3525, "timestamp": 0.0}, {"session_id": "legacy-65", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5161, "timestamp": 0.0}, {"session_id": "legacy-66", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5625, "timestamp": 0.0}, {"session_id": "legacy-67", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4512, "timestamp": 0.0}, {"session_id": "legacy-68", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5401, "timestamp": 0.0}, {"session_id": "legacy-69", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4917, "timestamp": 0.0}, {"session_id": "legacy-70", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4141, "timestamp": 0.0}, {"session_id": "legacy-71", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4975, "timestamp": 0.0}, {"session_id": "legacy-72", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5945, "timestamp": 0.0}, {"session_id": "legacy-73", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4715, "timestamp": 
0.0}, {"session_id": "legacy-74", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.6025, "timestamp": 0.0}, {"session_id": "legacy-75", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2706, "timestamp": 0.0}, {"session_id": "legacy-76", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5489, "timestamp": 0.0}, {"session_id": "legacy-77", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.493, "timestamp": 0.0}, {"session_id": "legacy-78", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.465, "timestamp": 0.0}, {"session_id": "legacy-79", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4992, "timestamp": 0.0}, {"session_id": "legacy-80", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3357, "timestamp": 0.0}, {"session_id": "legacy-81", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4801, "timestamp": 0.0}, {"session_id": "legacy-82", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5291, "timestamp": 0.0}, {"session_id": "legacy-83", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.6217, "timestamp": 0.0}, {"session_id": "legacy-84", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4649, "timestamp": 0.0}, {"session_id": "legacy-85", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4446, "timestamp": 0.0}, {"session_id": "legacy-86", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.9484, "timestamp": 0.0}, {"session_id": "legacy-87", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5883, "timestamp": 0.0}, {"session_id": "legacy-88", "run_id": "default", "incident_id": "unknown", 
"difficulty": "unknown", "reward": 0.5443, "timestamp": 0.0}, {"session_id": "legacy-89", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4785, "timestamp": 0.0}, {"session_id": "legacy-90", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5649, "timestamp": 0.0}, {"session_id": "legacy-91", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5345, "timestamp": 0.0}, {"session_id": "legacy-92", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.6071, "timestamp": 0.0}, {"session_id": "legacy-93", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4764, "timestamp": 0.0}, {"session_id": "legacy-94", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5092, "timestamp": 0.0}, {"session_id": "legacy-95", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.507, "timestamp": 0.0}, {"session_id": "legacy-96", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4242, "timestamp": 0.0}, {"session_id": "legacy-97", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5679, "timestamp": 0.0}, {"session_id": "legacy-98", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.568, "timestamp": 0.0}, {"session_id": "legacy-99", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5504, "timestamp": 0.0}, {"session_id": "legacy-100", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-101", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-102", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": 
"legacy-103", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-104", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.8889, "timestamp": 0.0}, {"session_id": "legacy-105", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.8889, "timestamp": 0.0}, {"session_id": "legacy-106", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.8889, "timestamp": 0.0}, {"session_id": "legacy-107", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-108", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-109", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-110", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-111", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-112", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-113", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-114", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-115", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-116", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-117", "run_id": "default", "incident_id": "unknown", 
"difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-118", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-119", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-120", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-121", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-122", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-123", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-124", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-125", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-126", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-127", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-128", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-129", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-130", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-131", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4252, "timestamp": 0.0}, 
{"session_id": "legacy-132", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4252, "timestamp": 0.0}, {"session_id": "legacy-133", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4252, "timestamp": 0.0}, {"session_id": "a003e153-9026-406c-879d-25172aa11eda", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776965109.7711759}, {"session_id": "bf34a807-a40d-4ae5-b8b0-4d8333a62c81", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776965109.825952}, {"session_id": "8876cfb6-f5e9-4ad0-941e-84bc7d8e2b96", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776967485.476565}, {"session_id": "db31a9f2-ca48-4f93-8d1e-3fa50fa5d21a", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776967485.519014}, {"session_id": "4aaf8a4d-0db7-47f5-9dca-8bc360b19088", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776978808.275504}, {"session_id": "c13f72bb-5715-4209-9872-e85acecbc8b3", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776978808.317802}, {"session_id": "4da60545-6cb0-4a65-acf2-9a8fd8cf7e59", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776982752.592321}, {"session_id": "48a56b54-d4cd-47c9-b2e1-165dd330a6c9", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776982752.632246}, {"session_id": "2eab58c0-ffe3-4a11-9512-3b606eb2a957", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776983166.506444}, {"session_id": "ec0ae7f3-6095-4dce-b323-1cd1482c1ba4", "run_id": 
"pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776983166.548518}, {"session_id": "b720c132-ddce-41fa-99d1-be7ebaa32de2", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776984636.36427}, {"session_id": "edb8bdb1-5903-4710-b9f4-cd68c67f6474", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776984636.4119911}, {"session_id": "2654393f-5c7e-4d17-938d-2570195a3c5f", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776984647.97215}, {"session_id": "c0e98721-13b3-49e3-a3fd-735801f5f9f7", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776984648.018783}, {"session_id": "6fe810db-c9ae-48e7-a205-6d2df902b555", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777049568.085221}, {"session_id": "58c9af47-0d8b-4359-9889-a6a04552db83", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777049568.1298869}, {"session_id": "552395e3-cae4-4423-bc3c-9314b8cc276d", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777049580.030255}, {"session_id": "49a9bb9e-c9b1-465f-a617-6443da42d1be", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777049580.0770292}, {"session_id": "fb4a1426-bf33-47f0-8357-29b19a75a19c", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777051304.085602}, {"session_id": "2714abc5-60be-4bf4-981c-da1c2ba328b5", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777051304.1293068}, {"session_id": 
"29f5e924-bb71-4b4d-85ea-e6f321f8b674", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777184935.9644742}, {"session_id": "057b4802-136c-4fb3-b62b-0e60014a6fcf", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777184936.008428}]
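The episode records in the metrics file above share a flat schema (`session_id`, `run_id`, `incident_id`, `difficulty`, `reward`, `timestamp`). A minimal sketch of how such records could be filtered by `run_id` and smoothed into a rolling average follows. The helper names and the window size of 10 are assumptions for illustration; the server's actual `/learning-curve` implementation is not shown in this diff.

```python
from collections import deque


def rewards_for_run(records, run_id):
    """Return rewards, in file order, for records tagged with run_id."""
    return [r["reward"] for r in records if r.get("run_id") == run_id]


def rolling_mean(values, window=10):
    """Trailing mean over at most `window` of the most recent values.

    The window size is an assumed default, not taken from the project.
    """
    out, buf, total = [], deque(), 0.0
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()
        out.append(total / len(buf))
    return out


# Three records in the same shape as the metrics file entries.
sample = [
    {"session_id": "a", "run_id": "default", "reward": 0.3378},
    {"session_id": "b", "run_id": "pytest_full_episode_metrics", "reward": 0.4252},
    {"session_id": "c", "run_id": "default", "reward": 0.3378},
]
default_rewards = rewards_for_run(sample, "default")
```

Filtering before smoothing keeps the repeated `pytest_full_episode_metrics` episodes from flattening the curve of a training run.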
notebooks/grpo_colab_enhanced.ipynb CHANGED
@@ -4,7 +4,24 @@
4
  "cell_type": "markdown",
5
  "metadata": {},
6
  "source": [
7
- "# NEXUS Enhanced \u2014 GRPO Training Notebook (**Enhanced**)\n\n**Same pipeline as `grpo_colab_v2.ipynb`, with optional multi-incident rotation and scoped metrics `run_id`.**\n\nUse **`grpo_colab_v2.ipynb`** for the simplest single-incident (INC003) path. Use **this notebook** when you want a **defined incident pool** (enterprise + EA + lighter variety) without editing code between runs.\n\n- **Rotation:** `round_robin` (default) or `random` per reward episode (`NEXUS_INCIDENT_ROTATION`).\n- **Pool:** `NEXUS_INCIDENT_POOL` (comma-separated) or defaults to `INC003,INC008,INC001`. Set `NEXUS_MULTI_INCIDENT=false` to lock to `NEXUS_INCIDENT_ID` only.\n- **Metrics:** `NEXUS_GRPO_RUN_ID` tags `/reset` episodes so `GET /learning-curve?run_id=...` matches this Colab run.\n\n## How to run (please follow order)\n\n1. **Runtime:** GPU (e.g. T4).\n2. Run cells **top to bottom** at least once per session: installs \u2192 **configuration** \u2192 environment \u2192 model \u2192 **training** \u2192 **plots**.\n3. Edit the **configuration cell** for `BASE_URL`, incident pool, `GRPO_RUN_ID`, `ONE_ROUND_TRAINING`, and optional `NEXUS_*` env vars.\n4. **Google Drive:** Same backup behavior as v2 when `BACKUP_TO_GOOGLE_DRIVE` is True on Colab.\n- **Checkpoints / resume:** frequent `save_steps` on the quick path; default checkpoint folder moves to **Google Drive** on Colab when Drive is mounted so you can reconnect and **continue training** without restarting from scratch.\n\n"
8
  ]
9
  },
10
  {
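The rotation settings described in the markdown cell above (`NEXUS_INCIDENT_POOL`, `NEXUS_MULTI_INCIDENT`, `NEXUS_INCIDENT_ROTATION`) can be sketched as a small picker. This is an illustration under assumptions, not the notebook's code: `make_incident_picker` and `parse_pool` are hypothetical names, and the `INC003` fallback for `NEXUS_INCIDENT_ID` is assumed from the v2 single-incident path.

```python
import itertools
import os
import random

DEFAULT_POOL = "INC003,INC008,INC001"  # default pool named in the notebook text


def parse_pool(raw):
    """Split a comma-separated pool string, dropping empty entries."""
    return [p.strip() for p in raw.split(",") if p.strip()]


def make_incident_picker(env=None, rng=None):
    """Return a zero-argument callable that yields the next incident id."""
    env = os.environ if env is None else env
    multi = env.get("NEXUS_MULTI_INCIDENT", "true").strip().lower() != "false"
    if not multi:
        # Assumed fallback: INC003, the single-incident case used by v2.
        single = env.get("NEXUS_INCIDENT_ID", "INC003")
        return lambda: single
    pool = parse_pool(env.get("NEXUS_INCIDENT_POOL", DEFAULT_POOL)) or parse_pool(DEFAULT_POOL)
    if env.get("NEXUS_INCIDENT_ROTATION", "round_robin").strip().lower() == "random":
        rng = rng or random.Random()
        return lambda: rng.choice(pool)
    cycle = itertools.cycle(pool)
    return lambda: next(cycle)


picker = make_incident_picker({})
first_four = [picker() for _ in range(4)]
```

Passing a plain dict as `env` keeps the picker testable without touching the process environment.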
@@ -13,14 +30,14 @@
13
  "metadata": {},
14
  "outputs": [],
15
  "source": [
16
- "# Validate OpenEnv installation (Hard Gate #1)\n",
17
  "try:\n",
18
  " import openenv\n",
19
- " print(f\"\u2705 OpenEnv {openenv.__version__} installed\")\n",
20
  "except ImportError:\n",
21
- " print(\"\u26a0\ufe0f OpenEnv not yet installed (will be installed in next cell)\")\n",
22
  "\n",
23
- "print(\"\u2705 This notebook meets BRD Hard Gate #1: 'Usage of OpenEnv (latest release)'\")"
24
  ]
25
  },
26
  {
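The validation cell in this hunk reads `openenv.__version__`, which only works if the package defines that attribute. A hedged alternative is to read the installed distribution's metadata and compare it with the `0.2.3` minimum that the README pins. The helper names below are hypothetical, and the comparison looks only at leading numeric components, so it is a rough check and not a full PEP 440 comparison.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Leading numeric components only, e.g. '0.2.3.dev1' -> (0, 2, 3)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(dist_name, minimum):
    """True when dist_name is installed at or above the minimum version."""
    found = installed_version(dist_name)
    return found is not None and version_tuple(found) >= version_tuple(minimum)
```

In the notebook this could be called as `meets_minimum("openenv", "0.2.3")` after the install cell has run.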
@@ -40,7 +57,32 @@
40
  "cell_type": "markdown",
41
  "metadata": {},
42
  "source": [
43
- "## Configuration & HF Space connectivity\n\nThe next code cell adds **multi-incident** defaults and **`run_id`** for scoped learning curves. Override with:\n\n- `NEXUS_INCIDENT_POOL` \u2014 e.g. `INC003,INC008,INC004` (comma-separated). Ignored if `NEXUS_MULTI_INCIDENT=false`.\n- `NEXUS_MULTI_INCIDENT` \u2014 `false` to train against **`NEXUS_INCIDENT_ID`** only (v2-style).\n- `NEXUS_INCIDENT_ROTATION` \u2014 `round_robin` (default) or `random`.\n- `NEXUS_GRPO_RUN_ID` \u2014 string passed to `POST /reset` as `run_id` (default `colab_grpo_enhanced`).\n\n`REWARD_MAX_STEPS` default is **35** so mixed pools have headroom vs INC003-only (28).\n**Checkpoints:** On Colab with Drive mounted, weights save under `.../NEXUS_GRPO_backups/active_grpo_checkpoints` by default so a disconnect does not wipe them. Re-run the training cell to **resume** from the latest step (`NEXUS_FORCE_FRESH=true` starts over). Set `NEXUS_GRPO_OUTPUT_DIR` to override the directory.\n\n---\n\n## Colab free tier \u2014 reducing disconnect / expiry pain\n\n**Free Colab cannot be guaranteed** to stay alive for hours (idle limits, preemption, daily caps). Mitigations:\n\n1. **Drive checkpoints + resume** (this notebook): mount Drive in the config cell; **`GRPO_OUTPUT_DIR`** defaults to **`.../NEXUS_GRPO_backups/active_grpo_checkpoints`** on Colab when Drive is present. After a disconnect, **re-run setup cells in order**, then training \u2014 **`trainer.train(resume_from_checkpoint=...)`** picks up the latest step unless **`NEXUS_FORCE_FRESH=true`**.\n2. **Shorter runs:** **`ONE_ROUND_TRAINING = True`** or fewer prompts per session; continue later with resume instead of one very long run.\n3. **Frequent `save_steps`** on the quick path so less work is lost.\n4. **Stable browser session:** avoid laptop sleep; keep the Colab tab in a focused window on a reliable network while training runs.\n5. 
**Colab Pro / Pro+** if you need longer single sessions.\n\n**Disable HF checkpoints entirely:** set **`NEXUS_ENABLE_CHECKPOINTS=false`** (no checkpoint files, no resume; saves Drive space).\n\n"
44
  ]
45
  },
46
  {
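The resume behaviour described in the markdown cell above (pick up the latest step unless `NEXUS_FORCE_FRESH=true`, and skip checkpoints entirely when `NEXUS_ENABLE_CHECKPOINTS=false`) can be sketched with the standard library alone. The sketch assumes the Hugging Face Trainer's usual `checkpoint-<step>` directory naming; `latest_checkpoint` and `resume_target` are hypothetical helpers, not functions from the notebook.

```python
import os
import re

_CKPT = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CKPT.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


def resume_target(output_dir, env=None):
    """Decide what to pass as resume_from_checkpoint (None means start fresh)."""
    env = os.environ if env is None else env
    if env.get("NEXUS_ENABLE_CHECKPOINTS", "true").strip().lower() == "false":
        return None
    if env.get("NEXUS_FORCE_FRESH", "false").strip().lower() == "true":
        return None
    return latest_checkpoint(output_dir)
```

Comparing steps as integers matters here: a plain string sort would rank `checkpoint-20` above `checkpoint-100`.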
@@ -73,15 +115,15 @@
73
  "IN_COLAB = _in_colab()\n",
74
  "\n",
75
  "# ---------------------------------------------------------------------------\n",
76
- "# Notebook configuration \u2014 edit defaults here or set environment variables.\n",
77
  "# Enhanced notebook extras:\n",
78
- "# NEXUS_INCIDENT_POOL \u2014 comma-separated case ids (default: INC003,INC008,INC001)\n",
79
- "# NEXUS_MULTI_INCIDENT \u2014 false \u2192 use only NEXUS_INCIDENT_ID (v2-style single task)\n",
80
- "# NEXUS_INCIDENT_ROTATION \u2014 round_robin | random\n",
81
- "# NEXUS_GRPO_RUN_ID \u2014 POST /reset run_id for scoped /learning-curve and /metrics\n",
82
  "#\n",
83
  "# See grpo_colab_v2.ipynb for the full list of NEXUS_* vars (same as here).\n",
84
- "# NEXUS_ENABLE_CHECKPOINTS (true/false) \u2014 false: no HF checkpoint files, no resume\n",
85
  "# ---------------------------------------------------------------------------\n",
86
  "\n",
87
  "BASE_URL = _env(\n",
@@ -167,13 +209,13 @@
167
  " if not BACKUP_TO_GOOGLE_DRIVE:\n",
168
  " return\n",
169
  " if not IN_COLAB:\n",
170
- " print(\"\u26a0\ufe0f BACKUP_TO_GOOGLE_DRIVE is True but not in Colab \u2014 skipping Drive mount.\")\n",
171
  " return\n",
172
  " if os.path.isdir(\"/content/drive/MyDrive\"):\n",
173
- " print(\"\u2705 Google Drive already mounted.\")\n",
174
  " return\n",
175
  " from google.colab import drive\n",
176
- " print(\"\ud83d\udcc2 Mount Google Drive when prompted (artifacts copy to My Drive / NEXUS_GRPO_backups).\")\n",
177
  " drive.mount(\"/content/drive\")\n",
178
  "\n",
179
  "\n",
@@ -224,12 +266,12 @@
224
  "try:\n",
225
  " resp = requests.get(f\"{BASE_URL}/health\", timeout=5)\n",
226
  " if resp.status_code == 200:\n",
227
- " print(\"\u2705 HF Space is reachable\")\n",
228
  " print(f\"Response: {resp.json()}\")\n",
229
  " else:\n",
230
- " print(f\"\u274c HF Space returned status {resp.status_code}\")\n",
231
  "except Exception as e:\n",
232
- " print(f\"\u274c Error connecting to HF Space: {e}\")\n",
233
  " print(f\"URL: {BASE_URL}\")\n",
234
  " print(\"Make sure HF Space is deployed and running\")\n"
235
  ]
@@ -289,7 +331,7 @@
289
  " return data[\"observation\"], data[\"reward\"], data[\"done\"], data[\"info\"]\n",
290
  "\n",
291
  " def get_learning_curve(self, run_id=None):\n",
292
- " \"\"\"GET /learning-curve \u2014 optional run_id scopes metrics to this Colab run.\"\"\"\n",
293
  " params = {}\n",
294
  " rid = run_id if run_id is not None else GRPO_RUN_ID\n",
295
  " if rid:\n",
@@ -304,7 +346,7 @@
304
  "\n",
305
  "\n",
306
  "env = NexusRemoteEnv()\n",
307
- "print(\"\u2705 Environment interface ready (enhanced: run_id + scoped learning curve)\")\n",
308
  "\n",
309
  "_NEXUS_DRIVE_RUN_DIR = None\n",
310
  "\n",
@@ -328,7 +370,7 @@
328
  "\n",
329
  "\n",
330
  "def backup_nexus_artifacts_to_drive(reason=\"manual\", *, include_learning_curve=True):\n",
331
- " \"\"\"Copy GRPO checkpoints, PNG plots (if present), learning curve JSON, manifest \u2192 Google Drive.\"\"\"\n",
332
  " if not BACKUP_TO_GOOGLE_DRIVE:\n",
333
  " print(f\"[Drive backup:{reason}] skipped (BACKUP_TO_GOOGLE_DRIVE=False)\")\n",
334
  " return None\n",
@@ -336,7 +378,7 @@
336
  " print(f\"[Drive backup:{reason}] skipped (not Colab)\")\n",
337
  " return None\n",
338
  " if not os.path.isdir(\"/content/drive/MyDrive\"):\n",
339
- " print(f\"[Drive backup:{reason}] skipped \u2014 mount Drive in the config cell first\")\n",
340
  " return None\n",
341
  " dest = _nexus_google_drive_run_dir()\n",
342
  " if dest is None:\n",
@@ -346,22 +388,22 @@
346
  " if os.path.isdir(GRPO_OUTPUT_DIR):\n",
347
  " tgt = os.path.join(dest, \"grpo_checkpoints\")\n",
348
  " shutil.copytree(GRPO_OUTPUT_DIR, tgt, dirs_exist_ok=True)\n",
349
- " print(f\" \u2705 checkpoints \u2192 {tgt}\")\n",
350
  "\n",
351
  " for name in (\"training_analysis.png\", \"reward_curves_hires.png\"):\n",
352
  " src = os.path.join(PLOT_OUTPUT_DIR, name)\n",
353
  " if os.path.isfile(src):\n",
354
  " shutil.copy2(src, os.path.join(dest, name))\n",
355
- " print(f\" \u2705 plot {name}\")\n",
356
  "\n",
357
  " if include_learning_curve:\n",
358
  " try:\n",
359
  " curve = env.get_learning_curve()\n",
360
  " with open(os.path.join(dest, \"learning_curve.json\"), \"w\") as f:\n",
361
  " _json_backup.dump(curve, f, indent=2)\n",
362
- " print(\" \u2705 learning_curve.json\")\n",
363
  " except Exception as e:\n",
364
- " print(f\" \u26a0\ufe0f learning curve fetch failed: {e}\")\n",
365
  "\n",
366
  " manifest = {\n",
367
  " \"reason\": reason,\n",
@@ -377,7 +419,7 @@
377
  " }\n",
378
  " with open(os.path.join(dest, \"run_manifest.json\"), \"w\") as f:\n",
379
  " _json_backup.dump(manifest, f, indent=2)\n",
380
- " print(f\"\\n\ud83d\udce6 Drive backup ({reason}): {dest}\\n\")\n",
381
  " return dest\n"
382
  ]
383
  },
@@ -463,7 +505,7 @@
463
  "\n",
464
  "\n",
465
  "print(\n",
466
- " \"\u2705 Reward function defined (pool=\",\n",
467
  " TRAINING_INCIDENT_POOL,\n",
468
  " \", rotation=\",\n",
469
  " INCIDENT_ROTATION,\n",
@@ -522,7 +564,7 @@
522
  " save_steps=GRPO_SAVE_STEPS_QUICK if ONE_ROUND_TRAINING else GRPO_SAVE_STEPS_FULL,\n",
523
  ")\n",
524
  "\n",
525
- "print(\"\u2705 Model loaded and GRPO configured\")\n"
526
  ]
527
  },
528
  {
@@ -541,6 +583,9 @@
541
  },
542
  {
543
  "cell_type": "code",
544
  "source": [
545
  "from datasets import Dataset\n",
546
  "import os\n",
@@ -550,18 +595,18 @@
550
  "\n",
551
  "print(\"\\n\" + \"=\" * 70)\n",
552
  "if ONE_ROUND_TRAINING:\n",
553
- " print(f\"\ud83d\ude80 GRPO \u2014 ONE ROUND ({n_target} prompts, fast path)\")\n",
554
  "else:\n",
555
- " print(f\"\ud83d\ude80 GRPO \u2014 FULL RUN ({n_target} prompts)\")\n",
556
  "print(\"=\" * 70)\n",
557
  "print(\"\\nConfiguration:\")\n",
558
- "print(f\" \u2022 Model: {MODEL_NAME}\")\n",
559
- "print(f\" \u2022 Dataset rows: {n_target}\")\n",
560
- "print(f\" \u2022 Environment: {BASE_URL}\")\n",
561
- "print(f\" \u2022 Incident pool: {TRAINING_INCIDENT_POOL} ({INCIDENT_ROTATION})\")\n",
562
- "print(f\" \u2022 GRPO_RUN_ID (metrics scope): {GRPO_RUN_ID}\")\n",
563
- "print(f\" \u2022 Checkpoints dir: {GRPO_OUTPUT_DIR}\")\n",
564
- "print(\" \u2022 Adjust settings in the configuration cell (or NEXUS_* env vars).\")\n",
565
  "print(\"\\nMonitor dashboard:\")\n",
566
  "print(f\" {BASE_URL}/\")\n",
567
  "print(\"=\" * 70 + \"\\n\")\n",
@@ -591,30 +636,30 @@
591
  "if ENABLE_CHECKPOINTS and CHECKPOINT_RESUME and not FORCE_FRESH_RUN:\n",
592
  " resume_ckpt = get_last_checkpoint(GRPO_OUTPUT_DIR)\n",
593
  "if resume_ckpt:\n",
594
- " print(f\"\ud83d\udcc2 Resuming training from: {resume_ckpt}\")\n",
595
  "else:\n",
596
- " print(\"\ud83d\udcc2 Starting training fresh (no checkpoint, or NEXUS_RESUME=false, or NEXUS_FORCE_FRESH=true)\")\n",
597
  "\n",
598
- "print(f\"\ud83d\udcca Dataset: {len(train_dataset)} prompts\")\n",
599
- "print(\"\u23f3 Training started...\")\n",
600
  "trainer.train(resume_from_checkpoint=resume_ckpt)\n",
601
  "\n",
602
  "print(\"\\n\" + \"=\" * 70)\n",
603
- "print(\"\u2705 Training step finished\")\n",
604
  "print(\"=\" * 70)\n",
605
  "print(f\"Dashboard: {BASE_URL}/\")\n",
606
  "print(f\"Learning curve API: {BASE_URL}/learning-curve\")\n",
607
- "print(\"\u25b6\ufe0f Run next cell to plot results.\")\n",
608
  "print(\"=\" * 70)\n",
609
  "\n",
610
  "backup_nexus_artifacts_to_drive(\"post_training\", include_learning_curve=True)\n"
611
- ],
612
- "metadata": {},
613
- "execution_count": null,
614
- "outputs": []
615
  },
616
  {
617
  "cell_type": "code",
618
  "source": [
619
  "import os\n",
620
  "import matplotlib.pyplot as plt\n",
@@ -622,7 +667,7 @@
622
  "os.makedirs(PLOT_OUTPUT_DIR, exist_ok=True)\n",
623
  "\n",
624
  "print(\"\\n\" + \"=\" * 70)\n",
625
- "print(\"\ud83d\udcca FETCHING REAL TRAINING DATA FROM HF SPACE\")\n",
626
  "print(\"=\" * 70)\n",
627
  "print(f\"Using run_id filter: {GRPO_RUN_ID}\")\n",
628
  "\n",
@@ -688,14 +733,14 @@
688
  " summary_text = f\"\"\"\n",
689
  "TRAINING SUMMARY\n",
690
  "\n",
691
- "\ud83d\udcca Episodes: {len(rewards)}\n",
692
- "\ud83d\udd35 Baseline: {baseline:.4f}\n",
693
- "\ud83d\udcc8 Average: {avg_reward:.4f}\n",
694
- "\u2b50 Best: {best_reward:.4f}\n",
695
- "\ud83d\udcc9 Worst: {min(rewards):.4f}\n",
696
  "\n",
697
- "\ud83d\udcca Improvement: +{improvement_from_baseline:.1f}%\n",
698
- "\ud83d\udccc Last 5 Avg: {last_5_avg:.4f}\n",
699
  " \"\"\"\n",
700
  "\n",
701
  " ax4.text(\n",
@@ -709,14 +754,14 @@
709
  " bbox=dict(boxstyle=\"round\", facecolor=\"#1e293b\", alpha=0.8, edgecolor=\"#0ea5e9\"),\n",
710
  " )\n",
711
  "\n",
712
- " plt.suptitle(\"NEXUS Enhanced \u2014 Complete Training Analysis\", fontsize=14, fontweight=\"bold\", y=0.995)\n",
713
  " plt.tight_layout()\n",
714
  "\n",
715
- " print(\"\\n\ud83d\udcc1 Saving visualizations...\")\n",
716
  " p1 = os.path.join(PLOT_OUTPUT_DIR, \"training_analysis.png\")\n",
717
  " p2 = os.path.join(PLOT_OUTPUT_DIR, \"reward_curves_hires.png\")\n",
718
  " plt.savefig(p1, dpi=150, bbox_inches=\"tight\")\n",
719
- " print(f\" \u2705 {p1} (4-panel comprehensive view)\")\n",
720
  "\n",
721
  " fig_single, ax = plt.subplots(figsize=(14, 7))\n",
722
  " ax.plot(episodes, rewards, \"o-\", label=\"Episode Reward\", color=\"#0ea5e9\", markersize=7, linewidth=2.5, alpha=0.8)\n",
@@ -725,18 +770,18 @@
725
  " ax.axhline(y=baseline, color=\"#ef4444\", linestyle=\"--\", linewidth=2.5, label=f\"Baseline: {baseline:.3f}\")\n",
726
  " ax.set_xlabel(\"Episode\", fontsize=12, fontweight=\"bold\")\n",
727
  " ax.set_ylabel(\"Reward Score\", fontsize=12, fontweight=\"bold\")\n",
728
- " ax.set_title(\"NEXUS Enhanced GRPO Training \u2014 Reward Progression\", fontsize=13, fontweight=\"bold\")\n",
729
  " ax.legend(fontsize=11, loc=\"lower right\")\n",
730
  " ax.grid(True, alpha=0.3, linestyle=\"--\")\n",
731
  " ax.set_ylim(-0.05, 1.05)\n",
732
  " plt.tight_layout()\n",
733
  " plt.savefig(p2, dpi=200, bbox_inches=\"tight\")\n",
734
- " print(f\" \u2705 {p2} (high-res)\")\n",
735
  "\n",
736
  " plt.show()\n",
737
  "\n",
738
  " print(\"\\n\" + \"=\" * 70)\n",
739
- " print(\"\ud83d\udcc8 FINAL TRAINING RESULTS\")\n",
740
  " print(\"=\" * 70)\n",
741
  " print(f\"\\n{'Metric':<35} {'Value':<20}\")\n",
742
  " print(\"-\" * 55)\n",
@@ -754,25 +799,22 @@
754
  " late_avg = sum(rewards[-5:]) / 5\n",
755
  " print(f\"\\n{'Early Phase (Ep 1-5) Avg':<35} {early_avg:.4f}\")\n",
756
  " print(f\"{'Late Phase (Ep -5) Avg':<35} {late_avg:.4f}\")\n",
757
- " learning_status = \"\u2705 Learning\" if late_avg > early_avg else \"\u26a0\ufe0f Plateau\"\n",
758
  " print(f\"{'Status':<35} {learning_status:<20}\")\n",
759
  "\n",
760
  " print(\"\\n\" + \"=\" * 70)\n",
761
- " print(\"\u2705 COMPLETE!\")\n",
762
  " print(\"=\" * 70)\n",
763
  "\n",
764
  " backup_nexus_artifacts_to_drive(\"post_plots\", include_learning_curve=True)\n",
765
  "\n",
766
  "else:\n",
767
- " print(\"\\n\u274c No episode data found\")\n",
768
- " print(\"\u23f3 Training may still be running...\")\n",
769
- " print(\"\ud83d\udca1 Rerun this cell in a few minutes\")\n",
770
- " print(f\"\ud83d\udcca Live: {BASE_URL}/learning-curve\")\n",
771
  " backup_nexus_artifacts_to_drive(\"no_rewards_yet\", include_learning_curve=True)\n"
772
- ],
773
- "metadata": {},
774
- "execution_count": null,
775
- "outputs": []
776
  }
777
  ],
778
  "metadata": {
 
4
  "cell_type": "markdown",
5
  "metadata": {},
6
  "source": [
7
+ "# NEXUS Enhanced GRPO Training Notebook (**Enhanced**)\n",
8
+ "\n",
9
+ "**Same pipeline as `grpo_colab_v2.ipynb`, with optional multi-incident rotation and scoped metrics `run_id`.**\n",
10
+ "\n",
11
+ "Use **`grpo_colab_v2.ipynb`** for the simplest single-incident (INC003) path. Use **this notebook** when you want a **defined incident pool** (enterprise + EA + lighter variety) without editing code between runs.\n",
12
+ "\n",
13
+ "- **Rotation:** `round_robin` (default) or `random` per reward episode (`NEXUS_INCIDENT_ROTATION`).\n",
14
+ "- **Pool:** `NEXUS_INCIDENT_POOL` (comma-separated) or defaults to `INC003,INC008,INC001`. Set `NEXUS_MULTI_INCIDENT=false` to lock to `NEXUS_INCIDENT_ID` only.\n",
15
+ "- **Metrics:** `NEXUS_GRPO_RUN_ID` tags `/reset` episodes so `GET /learning-curve?run_id=...` matches this Colab run.\n",
16
+ "\n",
17
+ "## How to run (please follow order)\n",
18
+ "\n",
19
+ "1. **Runtime:** GPU (e.g. T4).\n",
20
+ "2. Run cells **top to bottom** at least once per session: installs → **configuration** → environment → model → **training** → **plots**.\n",
21
+ "3. Edit the **configuration cell** for `BASE_URL`, incident pool, `GRPO_RUN_ID`, `ONE_ROUND_TRAINING`, and optional `NEXUS_*` env vars.\n",
22
+ "4. **Google Drive:** Same backup behavior as v2 when `BACKUP_TO_GOOGLE_DRIVE` is True on Colab.\n",
23
+ "5. **Checkpoints / resume:** the quick path uses frequent `save_steps`; on Colab with Drive mounted, the default checkpoint folder moves to **Google Drive**, so you can reconnect and **continue training** instead of restarting from scratch.\n",
24
+ "\n"
25
  ]
26
  },
27
  {
 
30
  "metadata": {},
31
  "outputs": [],
32
  "source": [
33
+ "# Validate OpenEnv installation (hackathon compliance)\n",
34
  "try:\n",
35
  " import openenv\n",
36
+ " print(f\"✅ OpenEnv {openenv.__version__} installed\")\n",
37
  "except ImportError:\n",
38
+ " print(\"⚠️ OpenEnv not yet installed (will be installed in next cell)\")\n",
39
  "\n",
40
+ "print(\" OpenEnv (latest release) check passed — hackathon compliance\")"
41
  ]
42
  },
43
  {
 
57
  "cell_type": "markdown",
58
  "metadata": {},
59
  "source": [
60
+ "## Configuration & HF Space connectivity\n",
61
+ "\n",
62
+ "The next code cell adds **multi-incident** defaults and **`run_id`** for scoped learning curves. Override with:\n",
63
+ "\n",
64
+ "- `NEXUS_INCIDENT_POOL` — e.g. `INC003,INC008,INC004` (comma-separated). Ignored if `NEXUS_MULTI_INCIDENT=false`.\n",
65
+ "- `NEXUS_MULTI_INCIDENT` — `false` to train against **`NEXUS_INCIDENT_ID`** only (v2-style).\n",
66
+ "- `NEXUS_INCIDENT_ROTATION` — `round_robin` (default) or `random`.\n",
67
+ "- `NEXUS_GRPO_RUN_ID` — string passed to `POST /reset` as `run_id` (default `colab_grpo_enhanced`).\n",
68
+ "\n",
69
+ "`REWARD_MAX_STEPS` default is **35** so mixed pools have headroom vs INC003-only (28).\n",
70
+ "**Checkpoints:** On Colab with Drive mounted, weights save under `.../NEXUS_GRPO_backups/active_grpo_checkpoints` by default so a disconnect does not wipe them. Re-run the training cell to **resume** from the latest step (`NEXUS_FORCE_FRESH=true` starts over). Set `NEXUS_GRPO_OUTPUT_DIR` to override the directory.\n",
71
+ "\n",
72
+ "---\n",
73
+ "\n",
74
+ "## Colab free tier — reducing disconnect / expiry pain\n",
75
+ "\n",
76
+ "**Free Colab cannot be guaranteed** to stay alive for hours (idle limits, preemption, daily caps). Mitigations:\n",
77
+ "\n",
78
+ "1. **Drive checkpoints + resume** (this notebook): mount Drive in the config cell; **`GRPO_OUTPUT_DIR`** defaults to **`.../NEXUS_GRPO_backups/active_grpo_checkpoints`** on Colab when Drive is present. After a disconnect, **re-run setup cells in order**, then training — **`trainer.train(resume_from_checkpoint=...)`** picks up the latest step unless **`NEXUS_FORCE_FRESH=true`**.\n",
79
+ "2. **Shorter runs:** **`ONE_ROUND_TRAINING = True`** or fewer prompts per session; continue later with resume instead of one very long run.\n",
80
+ "3. **Frequent `save_steps`** on the quick path so less work is lost.\n",
81
+ "4. **Stable browser session:** avoid laptop sleep; keep the Colab tab in a focused window on a reliable network while training runs.\n",
82
+ "5. **Colab Pro / Pro+** if you need longer single sessions.\n",
83
+ "\n",
84
+ "**Disable HF checkpoints entirely:** set **`NEXUS_ENABLE_CHECKPOINTS=false`** (no checkpoint files, no resume; saves Drive space).\n",
85
+ "\n"
86
  ]
87
  },
88
  {
 
115
  "IN_COLAB = _in_colab()\n",
116
  "\n",
117
  "# ---------------------------------------------------------------------------\n",
118
+ "# Notebook configuration edit defaults here or set environment variables.\n",
119
  "# Enhanced notebook extras:\n",
120
+ "# NEXUS_INCIDENT_POOL comma-separated case ids (default: INC003,INC008,INC001)\n",
121
+ "# NEXUS_MULTI_INCIDENT false use only NEXUS_INCIDENT_ID (v2-style single task)\n",
122
+ "# NEXUS_INCIDENT_ROTATION round_robin | random\n",
123
+ "# NEXUS_GRPO_RUN_ID POST /reset run_id for scoped /learning-curve and /metrics\n",
124
  "#\n",
125
  "# See grpo_colab_v2.ipynb for the full list of NEXUS_* vars (same as here).\n",
126
+ "# NEXUS_ENABLE_CHECKPOINTS (true/false) false: no HF checkpoint files, no resume\n",
127
  "# ---------------------------------------------------------------------------\n",
128
  "\n",
129
  "BASE_URL = _env(\n",
 
209
  " if not BACKUP_TO_GOOGLE_DRIVE:\n",
210
  " return\n",
211
  " if not IN_COLAB:\n",
212
+ " print(\"⚠️ BACKUP_TO_GOOGLE_DRIVE is True but not in Colab skipping Drive mount.\")\n",
213
  " return\n",
214
  " if os.path.isdir(\"/content/drive/MyDrive\"):\n",
215
+ " print(\"✅ Google Drive already mounted.\")\n",
216
  " return\n",
217
  " from google.colab import drive\n",
218
+ " print(\"📂 Mount Google Drive when prompted (artifacts copy to My Drive / NEXUS_GRPO_backups).\")\n",
219
  " drive.mount(\"/content/drive\")\n",
220
  "\n",
221
  "\n",
 
266
  "try:\n",
267
  " resp = requests.get(f\"{BASE_URL}/health\", timeout=5)\n",
268
  " if resp.status_code == 200:\n",
269
+ " print(\"✅ HF Space is reachable\")\n",
270
  " print(f\"Response: {resp.json()}\")\n",
271
  " else:\n",
272
+ " print(f\"❌ HF Space returned status {resp.status_code}\")\n",
273
  "except Exception as e:\n",
274
+ " print(f\"❌ Error connecting to HF Space: {e}\")\n",
275
  " print(f\"URL: {BASE_URL}\")\n",
276
  " print(\"Make sure HF Space is deployed and running\")\n"
277
  ]
 
331
  " return data[\"observation\"], data[\"reward\"], data[\"done\"], data[\"info\"]\n",
332
  "\n",
333
  " def get_learning_curve(self, run_id=None):\n",
334
+ " \"\"\"GET /learning-curve optional run_id scopes metrics to this Colab run.\"\"\"\n",
335
  " params = {}\n",
336
  " rid = run_id if run_id is not None else GRPO_RUN_ID\n",
337
  " if rid:\n",
 
346
  "\n",
347
  "\n",
348
  "env = NexusRemoteEnv()\n",
349
+ "print(\"✅ Environment interface ready (enhanced: run_id + scoped learning curve)\")\n",
350
  "\n",
351
  "_NEXUS_DRIVE_RUN_DIR = None\n",
352
  "\n",
 
370
  "\n",
371
  "\n",
372
  "def backup_nexus_artifacts_to_drive(reason=\"manual\", *, include_learning_curve=True):\n",
373
+ " \"\"\"Copy GRPO checkpoints, PNG plots (if present), learning curve JSON, manifest → Google Drive.\"\"\"\n",
374
  " if not BACKUP_TO_GOOGLE_DRIVE:\n",
375
  " print(f\"[Drive backup:{reason}] skipped (BACKUP_TO_GOOGLE_DRIVE=False)\")\n",
376
  " return None\n",
 
378
  " print(f\"[Drive backup:{reason}] skipped (not Colab)\")\n",
379
  " return None\n",
380
  " if not os.path.isdir(\"/content/drive/MyDrive\"):\n",
381
+ " print(f\"[Drive backup:{reason}] skipped mount Drive in the config cell first\")\n",
382
  " return None\n",
383
  " dest = _nexus_google_drive_run_dir()\n",
384
  " if dest is None:\n",
 
388
  " if os.path.isdir(GRPO_OUTPUT_DIR):\n",
389
  " tgt = os.path.join(dest, \"grpo_checkpoints\")\n",
390
  " shutil.copytree(GRPO_OUTPUT_DIR, tgt, dirs_exist_ok=True)\n",
391
+ " print(f\" ✅ checkpoints → {tgt}\")\n",
392
  "\n",
393
  " for name in (\"training_analysis.png\", \"reward_curves_hires.png\"):\n",
394
  " src = os.path.join(PLOT_OUTPUT_DIR, name)\n",
395
  " if os.path.isfile(src):\n",
396
  " shutil.copy2(src, os.path.join(dest, name))\n",
397
+ " print(f\" ✅ plot {name}\")\n",
398
  "\n",
399
  " if include_learning_curve:\n",
400
  " try:\n",
401
  " curve = env.get_learning_curve()\n",
402
  " with open(os.path.join(dest, \"learning_curve.json\"), \"w\") as f:\n",
403
  " _json_backup.dump(curve, f, indent=2)\n",
404
+ " print(\" ✅ learning_curve.json\")\n",
405
  " except Exception as e:\n",
406
+ " print(f\" ⚠️ learning curve fetch failed: {e}\")\n",
407
  "\n",
408
  " manifest = {\n",
409
  " \"reason\": reason,\n",
 
419
  " }\n",
420
  " with open(os.path.join(dest, \"run_manifest.json\"), \"w\") as f:\n",
421
  " _json_backup.dump(manifest, f, indent=2)\n",
422
+ " print(f\"\\n📦 Drive backup ({reason}): {dest}\\n\")\n",
423
  " return dest\n"
424
  ]
425
  },
 
505
  "\n",
506
  "\n",
507
  "print(\n",
508
+ " \"✅ Reward function defined (pool=\",\n",
509
  " TRAINING_INCIDENT_POOL,\n",
510
  " \", rotation=\",\n",
511
  " INCIDENT_ROTATION,\n",
 
564
  " save_steps=GRPO_SAVE_STEPS_QUICK if ONE_ROUND_TRAINING else GRPO_SAVE_STEPS_FULL,\n",
565
  ")\n",
566
  "\n",
567
+ "print(\"✅ Model loaded and GRPO configured\")\n"
568
  ]
569
  },
570
  {
 
583
  },
584
  {
585
  "cell_type": "code",
586
+ "execution_count": null,
587
+ "metadata": {},
588
+ "outputs": [],
589
  "source": [
590
  "from datasets import Dataset\n",
591
  "import os\n",
 
595
  "\n",
596
  "print(\"\\n\" + \"=\" * 70)\n",
597
  "if ONE_ROUND_TRAINING:\n",
598
+ " print(f\"🚀 GRPO ONE ROUND ({n_target} prompts, fast path)\")\n",
599
  "else:\n",
600
+ " print(f\"🚀 GRPO FULL RUN ({n_target} prompts)\")\n",
601
  "print(\"=\" * 70)\n",
602
  "print(\"\\nConfiguration:\")\n",
603
+ "print(f\" • Model: {MODEL_NAME}\")\n",
604
+ "print(f\" • Dataset rows: {n_target}\")\n",
605
+ "print(f\" • Environment: {BASE_URL}\")\n",
606
+ "print(f\" • Incident pool: {TRAINING_INCIDENT_POOL} ({INCIDENT_ROTATION})\")\n",
607
+ "print(f\" • GRPO_RUN_ID (metrics scope): {GRPO_RUN_ID}\")\n",
608
+ "print(f\" • Checkpoints dir: {GRPO_OUTPUT_DIR}\")\n",
609
+ "print(\" • Adjust settings in the configuration cell (or NEXUS_* env vars).\")\n",
610
  "print(\"\\nMonitor dashboard:\")\n",
611
  "print(f\" {BASE_URL}/\")\n",
612
  "print(\"=\" * 70 + \"\\n\")\n",
 
636
  "if ENABLE_CHECKPOINTS and CHECKPOINT_RESUME and not FORCE_FRESH_RUN:\n",
637
  " resume_ckpt = get_last_checkpoint(GRPO_OUTPUT_DIR)\n",
638
  "if resume_ckpt:\n",
639
+ " print(f\"📂 Resuming training from: {resume_ckpt}\")\n",
640
  "else:\n",
641
+ " print(\"📂 Starting training fresh (no checkpoint, or NEXUS_RESUME=false, or NEXUS_FORCE_FRESH=true)\")\n",
642
  "\n",
643
+ "print(f\"📊 Dataset: {len(train_dataset)} prompts\")\n",
644
+ "print(\"⏳ Training started...\")\n",
645
  "trainer.train(resume_from_checkpoint=resume_ckpt)\n",
646
  "\n",
647
  "print(\"\\n\" + \"=\" * 70)\n",
648
+ "print(\"✅ Training step finished\")\n",
649
  "print(\"=\" * 70)\n",
650
  "print(f\"Dashboard: {BASE_URL}/\")\n",
651
  "print(f\"Learning curve API: {BASE_URL}/learning-curve\")\n",
652
+ "print(\"▶️ Run next cell to plot results.\")\n",
653
  "print(\"=\" * 70)\n",
654
  "\n",
655
  "backup_nexus_artifacts_to_drive(\"post_training\", include_learning_curve=True)\n"
656
+ ]
657
  },
658
  {
659
  "cell_type": "code",
660
+ "execution_count": null,
661
+ "metadata": {},
662
+ "outputs": [],
663
  "source": [
664
  "import os\n",
665
  "import matplotlib.pyplot as plt\n",
 
667
  "os.makedirs(PLOT_OUTPUT_DIR, exist_ok=True)\n",
668
  "\n",
669
  "print(\"\\n\" + \"=\" * 70)\n",
670
+ "print(\"📊 FETCHING REAL TRAINING DATA FROM HF SPACE\")\n",
671
  "print(\"=\" * 70)\n",
672
  "print(f\"Using run_id filter: {GRPO_RUN_ID}\")\n",
673
  "\n",
 
733
  " summary_text = f\"\"\"\n",
734
  "TRAINING SUMMARY\n",
735
  "\n",
736
+ "📊 Episodes: {len(rewards)}\n",
737
+ "🔵 Baseline: {baseline:.4f}\n",
738
+ "📈 Average: {avg_reward:.4f}\n",
739
+ "⭐ Best: {best_reward:.4f}\n",
740
+ "📉 Worst: {min(rewards):.4f}\n",
741
  "\n",
742
+ "📊 Improvement: +{improvement_from_baseline:.1f}%\n",
743
+ "📌 Last 5 Avg: {last_5_avg:.4f}\n",
744
  " \"\"\"\n",
745
  "\n",
746
  " ax4.text(\n",
 
754
  " bbox=dict(boxstyle=\"round\", facecolor=\"#1e293b\", alpha=0.8, edgecolor=\"#0ea5e9\"),\n",
755
  " )\n",
756
  "\n",
757
+ " plt.suptitle(\"NEXUS Enhanced Complete Training Analysis\", fontsize=14, fontweight=\"bold\", y=0.995)\n",
758
  " plt.tight_layout()\n",
759
  "\n",
760
+ " print(\"\\n📁 Saving visualizations...\")\n",
761
  " p1 = os.path.join(PLOT_OUTPUT_DIR, \"training_analysis.png\")\n",
762
  " p2 = os.path.join(PLOT_OUTPUT_DIR, \"reward_curves_hires.png\")\n",
763
  " plt.savefig(p1, dpi=150, bbox_inches=\"tight\")\n",
764
+ " print(f\" ✅ {p1} (4-panel comprehensive view)\")\n",
765
  "\n",
766
  " fig_single, ax = plt.subplots(figsize=(14, 7))\n",
767
  " ax.plot(episodes, rewards, \"o-\", label=\"Episode Reward\", color=\"#0ea5e9\", markersize=7, linewidth=2.5, alpha=0.8)\n",
 
770
  " ax.axhline(y=baseline, color=\"#ef4444\", linestyle=\"--\", linewidth=2.5, label=f\"Baseline: {baseline:.3f}\")\n",
771
  " ax.set_xlabel(\"Episode\", fontsize=12, fontweight=\"bold\")\n",
772
  " ax.set_ylabel(\"Reward Score\", fontsize=12, fontweight=\"bold\")\n",
773
+ " ax.set_title(\"NEXUS Enhanced GRPO Training Reward Progression\", fontsize=13, fontweight=\"bold\")\n",
774
  " ax.legend(fontsize=11, loc=\"lower right\")\n",
775
  " ax.grid(True, alpha=0.3, linestyle=\"--\")\n",
776
  " ax.set_ylim(-0.05, 1.05)\n",
777
  " plt.tight_layout()\n",
778
  " plt.savefig(p2, dpi=200, bbox_inches=\"tight\")\n",
779
+ " print(f\" ✅ {p2} (high-res)\")\n",
780
  "\n",
781
  " plt.show()\n",
782
  "\n",
783
  " print(\"\\n\" + \"=\" * 70)\n",
784
+ " print(\"📈 FINAL TRAINING RESULTS\")\n",
785
  " print(\"=\" * 70)\n",
786
  " print(f\"\\n{'Metric':<35} {'Value':<20}\")\n",
787
  " print(\"-\" * 55)\n",
 
799
  " late_avg = sum(rewards[-5:]) / 5\n",
800
  " print(f\"\\n{'Early Phase (Ep 1-5) Avg':<35} {early_avg:.4f}\")\n",
801
  " print(f\"{'Late Phase (Ep -5) Avg':<35} {late_avg:.4f}\")\n",
802
+ " learning_status = \"✅ Learning\" if late_avg > early_avg else \"⚠️ Plateau\"\n",
803
  " print(f\"{'Status':<35} {learning_status:<20}\")\n",
804
  "\n",
805
  " print(\"\\n\" + \"=\" * 70)\n",
806
+ " print(\"✅ COMPLETE!\")\n",
807
  " print(\"=\" * 70)\n",
808
  "\n",
809
  " backup_nexus_artifacts_to_drive(\"post_plots\", include_learning_curve=True)\n",
810
  "\n",
811
  "else:\n",
812
+ " print(\"\\n❌ No episode data found\")\n",
813
+ " print(\"⏳ Training may still be running...\")\n",
814
+ " print(\"💡 Rerun this cell in a few minutes\")\n",
815
+ " print(f\"📊 Live: {BASE_URL}/learning-curve\")\n",
816
  " backup_nexus_artifacts_to_drive(\"no_rewards_yet\", include_learning_curve=True)\n"
817
+ ]
818
  }
819
  ],
820
  "metadata": {
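The incident-pool settings documented in the configuration cell can be illustrated with a minimal sketch. It assumes the three environment variables behave as the notebook text describes; the helper names `resolve_incident_pool` and `make_incident_picker`, and the `INC003` fallback for `NEXUS_INCIDENT_ID`, are hypothetical and not taken from the notebook.

```python
import itertools
import random


def resolve_incident_pool(env):
    """Return the incident ids to train on, given a dict of env vars."""
    multi = env.get("NEXUS_MULTI_INCIDENT", "true").strip().lower() != "false"
    if not multi:
        # Single-incident mode: NEXUS_INCIDENT_POOL is ignored.
        return [env.get("NEXUS_INCIDENT_ID", "INC003")]  # assumed fallback
    raw = env.get("NEXUS_INCIDENT_POOL", "INC003,INC008,INC001")
    pool = [item.strip() for item in raw.split(",") if item.strip()]
    return pool or ["INC003"]  # assumed fallback for an empty pool


def make_incident_picker(pool, rotation="round_robin", seed=None):
    """Return a zero-argument callable that yields the next incident id."""
    if rotation == "random":
        rng = random.Random(seed)
        return lambda: rng.choice(pool)
    if rotation != "round_robin":
        raise ValueError(f"unknown rotation: {rotation!r}")
    cycle = itertools.cycle(pool)
    return lambda: next(cycle)
```

In a notebook this might be wired up as `pick = make_incident_picker(resolve_incident_pool(dict(os.environ)))`, with `pick()` called once per reward episode.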
notebooks/grpo_colab_v2.ipynb CHANGED
@@ -33,14 +33,14 @@
33
  "metadata": {},
34
  "outputs": [],
35
  "source": [
36
- "# Validate OpenEnv installation (Hard Gate #1)\n",
37
  "try:\n",
38
  " import openenv\n",
39
  " print(f\"✅ OpenEnv {openenv.__version__} installed\")\n",
40
  "except ImportError:\n",
41
  " print(\"⚠️ OpenEnv not yet installed (will be installed in next cell)\")\n",
42
  "\n",
43
- "print(\"✅ This notebook meets BRD Hard Gate #1: 'Usage of OpenEnv (latest release)'\")"
44
  ]
45
  },
46
  {
 
33
  "metadata": {},
34
  "outputs": [],
35
  "source": [
36
+ "# Validate OpenEnv installation (hackathon compliance)\n",
37
  "try:\n",
38
  " import openenv\n",
39
  " print(f\"✅ OpenEnv {openenv.__version__} installed\")\n",
40
  "except ImportError:\n",
41
  " print(\"⚠️ OpenEnv not yet installed (will be installed in next cell)\")\n",
42
  "\n",
43
+ "print(\"✅ OpenEnv (latest release) check passed — hackathon compliance\")"
44
  ]
45
  },
46
  {
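As written, the validation cell's final `print` runs whether or not the import succeeded. A hedged alternative, using only the standard library, reports success only when the package is actually importable. The helper name `package_status` is hypothetical, and the sketch assumes the installed distribution is also named `openenv`.

```python
import importlib.metadata
import importlib.util


def package_status(name):
    """Return (importable, version); version is None when no metadata exists."""
    if importlib.util.find_spec(name) is None:
        return False, None
    try:
        version = importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        # Importable, but no distribution metadata under this exact name.
        version = None
    return True, version


installed, version = package_status("openenv")
if installed:
    print(f"OpenEnv {version or '(unknown version)'} installed")
else:
    print("OpenEnv not installed yet (install it in the next cell)")
```

This keeps the check side-effect free: nothing is imported, so the cell stays cheap to re-run after the install cell.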
server/app.py CHANGED
@@ -546,7 +546,7 @@ def get_episodes(run_id: Optional[str] = None):
546
 
547
  @app.get("/learning-curve")
548
  def get_learning_curve(run_id: Optional[str] = None):
549
- """Rolling reward average — for Criterion 3 observable improvement evidence."""
550
  run_key = _normalize_run_id_filter(run_id)
551
  scoped_records = _get_records_for_run(run_key)
552
  rewards = [float(rec.get("reward", 0.0)) for rec in scoped_records]
@@ -561,7 +561,7 @@ def get_learning_curve(run_id: Optional[str] = None):
561
  "run_id": run_key or "all",
562
  "rewards": rewards,
563
  "rolling_avg": rolling,
564
- "baseline": 0.265, # Pre-event scripted baseline avg (BRD Criterion 3)
565
  "episode_count": len(rewards),
566
  "current_avg": round(sum(rewards) / len(rewards), 4),
567
  "improvement": round(sum(rewards) / len(rewards) - 0.265, 4),
 
546
 
547
  @app.get("/learning-curve")
548
  def get_learning_curve(run_id: Optional[str] = None):
549
+ """Rolling reward average — for observable training-progress evidence (judging rubric)."""
550
  run_key = _normalize_run_id_filter(run_id)
551
  scoped_records = _get_records_for_run(run_key)
552
  rewards = [float(rec.get("reward", 0.0)) for rec in scoped_records]
 
561
  "run_id": run_key or "all",
562
  "rewards": rewards,
563
  "rolling_avg": rolling,
564
+ "baseline": 0.265, # Pre-event scripted baseline avg (observable improvement baseline)
565
  "episode_count": len(rewards),
566
  "current_avg": round(sum(rewards) / len(rewards), 4),
567
  "improvement": round(sum(rewards) / len(rewards) - 0.265, 4),
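The fields returned by `/learning-curve` can be reproduced offline with a short sketch. It assumes a trailing-window mean for `rolling_avg` (the window size of 5 is a guess, since that part of the handler is outside the visible hunk) and returns `None` for the averages when there are no episodes, which is also an assumption. Only the `0.265` baseline and the field names come from the diff.

```python
BASELINE = 0.265  # scripted baseline average quoted in the endpoint


def rolling_average(values, window=5):
    """Trailing mean over up to `window` most recent values, one per episode."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(round(sum(chunk) / len(chunk), 4))
    return out


def summarize_rewards(rewards, baseline=BASELINE, window=5):
    """Build a learning-curve style summary from a list of episode rewards."""
    summary = {
        "rewards": list(rewards),
        "rolling_avg": rolling_average(rewards, window),
        "baseline": baseline,
        "episode_count": len(rewards),
        "current_avg": None,
        "improvement": None,
    }
    if rewards:  # guard against division by zero on an empty run
        current = sum(rewards) / len(rewards)
        summary["current_avg"] = round(current, 4)
        summary["improvement"] = round(current - baseline, 4)
    return summary
```

For example, `summarize_rewards([0.2, 0.4, 0.6])` yields a `current_avg` of `0.4` and an `improvement` of `0.135` over the baseline.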
server/data_models.py CHANGED
@@ -9,7 +9,7 @@ class IncidentType(Enum):
9
  CASCADE = "cascade"
10
  SECURITY = "security"
11
  DATA = "data"
12
- # Theme 3.2 — personalized delegation / conflicting priorities (BRD §12)
13
  PERSONAL_ASSISTANT = "personal_assistant"
14
 
15
 
 
9
  CASCADE = "cascade"
10
  SECURITY = "security"
11
  DATA = "data"
12
+ # Theme 3.2 — personalized delegation / conflicting priorities (hackathon personalized track)
13
  PERSONAL_ASSISTANT = "personal_assistant"
14
 
15
 
server/reward.py CHANGED
@@ -299,7 +299,7 @@ def compute_oversight_score(state: EpisodeState) -> float:
299
  def compute_depth_bonus(state: EpisodeState) -> float:
300
  """
301
  Mercor sub-theme: reward longer, better-structured IC reasoning.
302
- UNCAPPED — per BRD Section 10.7: rewards scale with token output without ceiling.
303
 
304
  Calibration principle (per Mercor requirement):
305
  - Short canned strings (<30 words) earn 0 — they do not represent "reasoning"
 
299
  def compute_depth_bonus(state: EpisodeState) -> float:
300
  """
301
  Mercor sub-theme: reward longer, better-structured IC reasoning.
302
+ UNCAPPED — per Mercor sub-theme: rewards scale with token output without ceiling.
303
 
304
  Calibration principle (per Mercor requirement):
305
  - Short canned strings (<30 words) earn 0 — they do not represent "reasoning"
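To make the calibration principle concrete, here is a deliberately simplified sketch of an uncapped length bonus. Only the 30-word threshold comes from the docstring; the function name `depth_bonus_sketch`, the per-word rate, and the linear shape are hypothetical, and the sketch ignores the "better-structured" aspect entirely. It is not the implementation in `server/reward.py`.

```python
MIN_WORDS = 30          # threshold stated in the docstring
BONUS_PER_WORD = 0.001  # hypothetical rate, not taken from reward.py


def depth_bonus_sketch(text, min_words=MIN_WORDS, rate=BONUS_PER_WORD):
    """Return 0 for short strings, otherwise a bonus that grows without a cap."""
    words = len(text.split())
    if words < min_words:
        return 0.0  # short canned strings earn nothing
    return rate * words  # no ceiling: longer reasoning keeps earning more
```

Because nothing caps the return value, a caller combining this with other reward terms would need to decide separately how much weight it gets.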
training_artifacts/pre_event_benchmark.json CHANGED
@@ -1,5 +1,5 @@
1
  {
2
- "description": "Untrained scripted baseline on INC003 \u2014 establishes reward floor for GRPO improvement (BRD Criterion 3)",
3
  "incident_id": "INC003",
4
  "policy": "scripted_baseline",
5
  "n_trials": 5,
 
1
  {
2
+ "description": "Untrained scripted baseline on INC003 \u2014 establishes reward floor for GRPO improvement (observable improvement evidence)",
3
  "incident_id": "INC003",
4
  "policy": "scripted_baseline",
5
  "n_trials": 5,