Upload folder using huggingface_hub
- README.md +10 -10
- docs/blog/blog_post_hf.md +5 -5
- docs/deployment/DEPLOYMENT_CHECKLIST.md +14 -14
- docs/deployment/HF_SPACES_DEPLOYMENT.md +10 -10
- docs/guides/QUICK_START.md +2 -2
- docs/pitch/DEMO_WALKTHROUGH.md +1 -1
- docs/pitch/PITCH.md +6 -6
- docs/pitch/PITCH_3MIN.md +1 -1
- docs/pitch/VIDEO_RECORD_NOW_PACK.md +144 -0
- docs/project/BEHAVIORAL_DELTA_PROOF.md +4 -4
- docs/project/COMPLIANCE_LOCK_MATRIX.md +17 -17
- docs/project/CURRICULUM_AND_ABLATION.md +1 -1
- docs/project/FINAL_OPERATIONS_RUNBOOK.md +1 -1
- docs/project/FINAL_READINESS_REPORT.md +2 -2
- docs/project/IMPLEMENTATION_SUMMARY.md +14 -14
- docs/project/JUDGING_EVIDENCE_INDEX.md +4 -4
- docs/project/PLAN_OF_ACTION.md +13 -13
- docs/project/PROJECT_STATUS.md +8 -7
- docs/project/SUBTHEME_EVIDENCE_MATRIX.md +9 -9
- docs/project/TEST_RESULTS_SUMMARY.md +1 -1
- episode_rewards.json +1 -1
- notebooks/grpo_colab_enhanced.ipynb +114 -72
- notebooks/grpo_colab_v2.ipynb +2 -2
- server/app.py +2 -2
- server/data_models.py +1 -1
- server/reward.py +1 -1
- training_artifacts/pre_event_benchmark.json +1 -1
README.md
CHANGED
|
@@ -30,7 +30,7 @@ Use this section first during judging/review.
|
|
| 30 |
- **Live environment (HF Space):** https://kunalkachru23-nexus-enhanced-stage.hf.space/
|
| 31 |
- **3-minute pitch script:** [`docs/pitch/PITCH.md`](docs/pitch/PITCH.md)
|
| 32 |
- **2-minute demo walkthrough:** [`docs/pitch/DEMO_WALKTHROUGH.md`](docs/pitch/DEMO_WALKTHROUGH.md)
|
| 33 |
-
- **
|
| 34 |
- **Behavioral delta (before vs after):** [`docs/project/BEHAVIORAL_DELTA_PROOF.md`](docs/project/BEHAVIORAL_DELTA_PROOF.md)
|
| 35 |
- **Compliance lock matrix:** [`docs/project/COMPLIANCE_LOCK_MATRIX.md`](docs/project/COMPLIANCE_LOCK_MATRIX.md)
|
| 36 |
- **HF blog draft (publish-ready):** [`docs/blog/blog_post_hf.md`](docs/blog/blog_post_hf.md)
|
|
@@ -139,9 +139,9 @@ python scripts/export_reward_plot.py \
|
|
| 139 |
|
| 140 |
Caption: blue line is per-episode reward, green is rolling average, red dashed line is baseline (`0.265`).
|
| 141 |
|
| 142 |
-
##
|
| 143 |
|
| 144 |
-
Per
|
| 145 |
|
| 146 |
**Local (dev machine, after `pip install "openenv>=0.2.3"`):**
|
| 147 |
|
|
@@ -163,7 +163,7 @@ openenv validate --url https://kunalkachru23-nexus-enhanced-stage.hf.space
|
|
| 163 |
|
| 164 |
**Deploying with OpenEnv:** use `openenv push . --repo-id <user>/<space> --exclude .hfignore` (or **`./gate.sh --push`**, which adds `--exclude` for you). OpenEnv does not load `.hfignore` unless you pass it via `--exclude`; omitting it does **not** break the build, it only uploads extra paths (less lean). See `docs/guides/QUICK_START.md` for a short rationale.
|
| 165 |
|
| 166 |
-
`requirements.txt` **omits** `openenv` on the Space Docker image to keep builds reliable; the **Colab notebook** installs `openenv>=0.2.3`
|
| 167 |
|
| 168 |
## API Endpoints
|
| 169 |
|
|
@@ -192,15 +192,15 @@ openenv validate --url https://kunalkachru23-nexus-enhanced-stage.hf.space
|
|
| 192 |
- **Snorkel AI** — Rotating expert review board (4 criteria)
|
| 193 |
- **Patronus AI** — Live schema drift in INC007 at step 18
|
| 194 |
|
| 195 |
-
## Pitch, plan, and
|
| 196 |
|
| 197 |
Documentation lives under [`docs/`](docs/) (guides, deployment, project status, pitch/demo scripts, blog drafts).
|
| 198 |
|
| 199 |
-
- **[`docs/pitch/PITCH.md`](docs/pitch/PITCH.md)** — 3-minute spoken script + 2-minute Q&A bullets (
|
| 200 |
-
- **[`docs/project/PLAN_OF_ACTION.md`](docs/project/PLAN_OF_ACTION.md)** —
|
| 201 |
-
- **`scripts/export_reward_plot.py`** — export reward curve PNG from `--url` or `episode_rewards.json` (
|
| 202 |
|
| 203 |
-
## Final submission checklist (
|
| 204 |
|
| 205 |
- [ ] Space URL is live and included in final form: `https://kunalkachru23-nexus-enhanced-stage.hf.space/`
|
| 206 |
- [ ] `openenv validate .` passes locally.
|
|
@@ -216,7 +216,7 @@ Documentation lives under [`docs/`](docs/) (guides, deployment, project status,
|
|
| 216 |
|
| 217 |
## Blog Post
|
| 218 |
|
| 219 |
-
See [`docs/blog/blog_post_hf.md`](docs/blog/blog_post_hf.md) for the publish-ready HuggingFace blog draft (includes reward model deep-dive, training methodology, and demo walkthrough). Publish and add the public URL
|
| 220 |
|
| 221 |
## Team
|
| 222 |
|
|
|
|
| 30 |
- **Live environment (HF Space):** https://kunalkachru23-nexus-enhanced-stage.hf.space/
|
| 31 |
- **3-minute pitch script:** [`docs/pitch/PITCH.md`](docs/pitch/PITCH.md)
|
| 32 |
- **2-minute demo walkthrough:** [`docs/pitch/DEMO_WALKTHROUGH.md`](docs/pitch/DEMO_WALKTHROUGH.md)
|
| 33 |
+
- **Compliance + judging evidence index:** [`docs/project/JUDGING_EVIDENCE_INDEX.md`](docs/project/JUDGING_EVIDENCE_INDEX.md)
|
| 34 |
- **Behavioral delta (before vs after):** [`docs/project/BEHAVIORAL_DELTA_PROOF.md`](docs/project/BEHAVIORAL_DELTA_PROOF.md)
|
| 35 |
- **Compliance lock matrix:** [`docs/project/COMPLIANCE_LOCK_MATRIX.md`](docs/project/COMPLIANCE_LOCK_MATRIX.md)
|
| 36 |
- **HF blog draft (publish-ready):** [`docs/blog/blog_post_hf.md`](docs/blog/blog_post_hf.md)
|
|
|
|
| 139 |
|
| 140 |
Caption: blue line is per-episode reward, green is rolling average, red dashed line is baseline (`0.265`).
|
| 141 |
|
| 142 |
+
## OpenEnv (reproduce)
|
| 143 |
|
| 144 |
+
Per **hackathon compliance criteria**, the submission uses **OpenEnv (latest release)** in the toolchain—not only a custom HTTP server. Reproduce validation with the commands below.
|
| 145 |
|
| 146 |
**Local (dev machine, after `pip install "openenv>=0.2.3"`):**
|
| 147 |
|
|
|
|
| 163 |
|
| 164 |
**Deploying with OpenEnv:** use `openenv push . --repo-id <user>/<space> --exclude .hfignore` (or **`./gate.sh --push`**, which adds `--exclude` for you). OpenEnv does not load `.hfignore` unless you pass it via `--exclude`; omitting it does **not** break the build, it only uploads extra paths (less lean). See `docs/guides/QUICK_START.md` for a short rationale.
|
| 165 |
|
| 166 |
+
`requirements.txt` **omits** `openenv` on the Space Docker image to keep builds reliable; the **Colab notebook** installs `openenv>=0.2.3` to satisfy the **Colab + OpenEnv** portion of compliance. Contract-only routes (`/metadata`, `/schema`, `GET /state`, `POST /mcp`) satisfy `openenv validate --url`; episode logic uses **`/reset`**, **`/step/{session_id}`**, **`/state/{session_id}`** only.
|
| 167 |
|
| 168 |
## API Endpoints
|
| 169 |
|
|
|
|
| 192 |
- **Snorkel AI** — Rotating expert review board (4 criteria)
|
| 193 |
- **Patronus AI** — Live schema drift in INC007 at step 18
|
| 194 |
|
| 195 |
+
## Pitch, plan, and compliance evidence
|
| 196 |
|
| 197 |
Documentation lives under [`docs/`](docs/) (guides, deployment, project status, pitch/demo scripts, blog drafts).
|
| 198 |
|
| 199 |
+
- **[`docs/pitch/PITCH.md`](docs/pitch/PITCH.md)** — 3-minute spoken script + 2-minute Q&A bullets (organizer pitch format).
|
| 200 |
+
- **[`docs/project/PLAN_OF_ACTION.md`](docs/project/PLAN_OF_ACTION.md)** — hackathon compliance matrix + prioritized todo table.
|
| 201 |
+
- **`scripts/export_reward_plot.py`** — export reward curve PNG from `--url` or `episode_rewards.json` (slides / observable improvement evidence). Canonical chart (tracked in git): **`docs/images/training_reward_curve.png`** (see section above).
|
| 202 |
|
| 203 |
+
## Final submission checklist (compliance-ready)
|
| 204 |
|
| 205 |
- [ ] Space URL is live and included in final form: `https://kunalkachru23-nexus-enhanced-stage.hf.space/`
|
| 206 |
- [ ] `openenv validate .` passes locally.
|
|
|
|
| 216 |
|
| 217 |
## Blog Post
|
| 218 |
|
| 219 |
+
See [`docs/blog/blog_post_hf.md`](docs/blog/blog_post_hf.md) for the publish-ready HuggingFace blog draft (includes reward model deep-dive, training methodology, and demo walkthrough). Publish and add the public URL to your submission package (blog or short video, per organizer requirements).
|
| 220 |
|
| 221 |
## Team
|
| 222 |
|
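The README hunk above describes the reward plot as a per-episode line, a rolling average, and a `0.265` baseline. A minimal sketch of how such a rolling average might be computed follows. The function name, the default window of 10, and the sample rewards are illustrative assumptions, not taken from `scripts/export_reward_plot.py`.

```python
from collections import deque

BASELINE = 0.265  # baseline value quoted in the README caption


def rolling_average(rewards, window=10):
    """Return the trailing mean of `rewards` over at most `window` episodes.

    Hypothetical helper: the real plotting script may use a different window
    or smoothing method.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    recent = deque(maxlen=window)
    total = 0.0
    averages = []
    for reward in rewards:
        if len(recent) == window:
            # The oldest value is about to be evicted by append().
            total -= recent[0]
        recent.append(reward)
        total += reward
        averages.append(total / len(recent))
    return averages


if __name__ == "__main__":
    sample = [0.2, 0.25, 0.3, 0.5, 0.6]  # made-up rewards for illustration
    smoothed = rolling_average(sample, window=3)
    above = [avg > BASELINE for avg in smoothed]
    print(smoothed, above)
```

Comparing the smoothed series against `BASELINE`, rather than single episodes, is one way to keep a noisy per-episode reward from overstating or understating the trend.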
docs/blog/blog_post_hf.md
CHANGED
|
@@ -9,8 +9,8 @@ authors:
|
|
| 9 |
|
| 10 |
*Team Falcons — Meta PyTorch OpenEnv Hackathon Grand Finale, April 2026*
|
| 11 |
|
| 12 |
-
**Live Demo:** [huggingface.co/spaces/kunalkachru23/nexus-enhanced](https://huggingface.co/spaces/kunalkachru23/nexus-enhanced)
|
| 13 |
-
**Training Notebook:** [grpo_colab_v2.ipynb](https://huggingface.co/spaces/kunalkachru23/nexus-enhanced/blob/main/notebooks/grpo_colab_v2.ipynb)
|
| 14 |
|
| 15 |
---
|
| 16 |
|
|
@@ -98,7 +98,7 @@ Before on-site GPU training, 30 scripted baseline episodes established the floor
|
|
| 98 |
|
| 99 |
Expected trained model performance after 200 GRPO steps: **0.55–0.75**
|
| 100 |
|
| 101 |
-
The gap shows clear, observable training signal
|
| 102 |
|
| 103 |
---
|
| 104 |
|
|
@@ -141,10 +141,10 @@ The expert board notices IC weaknesses and shifts its evaluation focus — simul
|
|
| 141 |
|
| 142 |
```bash
|
| 143 |
# Live demo — no install needed
|
| 144 |
-
curl -X POST https://kunalkachru23-nexus-enhanced.hf.space/demo/run/INC003 | python -m json.tool
|
| 145 |
|
| 146 |
# Or open the web dashboard
|
| 147 |
-
https://huggingface.co/spaces/kunalkachru23/nexus-enhanced
|
| 148 |
```
|
| 149 |
|
| 150 |
**Training notebook** (GRPO on Qwen2.5-1.5B, cells 1–3 run without GPU):
|
|
|
|
| 9 |
|
| 10 |
*Team Falcons — Meta PyTorch OpenEnv Hackathon Grand Finale, April 2026*
|
| 11 |
|
| 12 |
+
**Live Demo:** [huggingface.co/spaces/kunalkachru23/nexus-enhanced-stage](https://huggingface.co/spaces/kunalkachru23/nexus-enhanced-stage)
|
| 13 |
+
**Training Notebook:** [grpo_colab_v2.ipynb](https://huggingface.co/spaces/kunalkachru23/nexus-enhanced-stage/blob/main/notebooks/grpo_colab_v2.ipynb)
|
| 14 |
|
| 15 |
---
|
| 16 |
|
|
|
|
| 98 |
|
| 99 |
Expected trained model performance after 200 GRPO steps: **0.55–0.75**
|
| 100 |
|
| 101 |
+
The gap shows clear, observable training signal—supporting the hackathon rubric emphasis on **observable reward improvement**.
|
| 102 |
|
| 103 |
---
|
| 104 |
|
|
|
|
| 141 |
|
| 142 |
```bash
|
| 143 |
# Live demo — no install needed
|
| 144 |
+
curl -X POST https://kunalkachru23-nexus-enhanced-stage.hf.space/demo/run/INC003 | python -m json.tool
|
| 145 |
|
| 146 |
# Or open the web dashboard
|
| 147 |
+
https://huggingface.co/spaces/kunalkachru23/nexus-enhanced-stage
|
| 148 |
```
|
| 149 |
|
| 150 |
**Training notebook** (GRPO on Qwen2.5-1.5B, cells 1–3 run without GPU):
|
docs/deployment/DEPLOYMENT_CHECKLIST.md
CHANGED
|
@@ -85,7 +85,7 @@ python deploy_to_hf_spaces.py
|
|
| 85 |
# 🎉 Deployment complete!
|
| 86 |
|
| 87 |
# Monitor build progress:
|
| 88 |
-
# https://huggingface.co/spaces/kunalkachru23/nexus-enhanced
|
| 89 |
# Look for "Building" → "Running" status (5-10 min)
|
| 90 |
```
|
| 91 |
|
|
@@ -94,12 +94,12 @@ Once HF Space shows "Running" status:
|
|
| 94 |
|
| 95 |
```bash
|
| 96 |
# Test public health endpoint
|
| 97 |
-
curl https://kunalkachru23-nexus-enhanced.hf.space/health
|
| 98 |
|
| 99 |
# Expected: {"status": "healthy", "environment": "nexus-enhanced", ...}
|
| 100 |
|
| 101 |
# Run full remote test suite
|
| 102 |
-
python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced.hf.space
|
| 103 |
|
| 104 |
# Expected: ✅ ALL 7 TESTS PASS
|
| 105 |
```
|
|
@@ -107,7 +107,7 @@ python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced.hf
|
|
| 107 |
### Step 5: Verify Judge Dashboard
|
| 108 |
Open in browser:
|
| 109 |
```
|
| 110 |
-
https://kunalkachru23-nexus-enhanced.hf.space/
|
| 111 |
```
|
| 112 |
|
| 113 |
**Expected to see**:
|
|
@@ -125,13 +125,13 @@ After deployment, run:
|
|
| 125 |
|
| 126 |
```bash
|
| 127 |
# Full test suite against deployed environment
|
| 128 |
-
python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced.hf.space
|
| 129 |
|
| 130 |
# Individual endpoint tests:
|
| 131 |
-
curl https://kunalkachru23-nexus-enhanced.hf.space/health | jq .
|
| 132 |
-
curl https://kunalkachru23-nexus-enhanced.hf.space/metrics | jq .
|
| 133 |
-
curl https://kunalkachru23-nexus-enhanced.hf.space/learning-curve | jq .
|
| 134 |
-
curl -X POST https://kunalkachru23-nexus-enhanced.hf.space/reset \
|
| 135 |
-H "Content-Type: application/json" \
|
| 136 |
-d '{"incident_id": "INC003"}' | jq .
|
| 137 |
```
|
|
@@ -166,14 +166,14 @@ curl -X POST https://kunalkachru23-nexus-enhanced.hf.space/reset \
|
|
| 166 |
### After Phase 7 Passes:
|
| 167 |
|
| 168 |
1. **Update Colab Notebook** (grpo_colab_v2.ipynb)
|
| 169 |
-
- Verify `BASE_URL = "https://kunalkachru23-nexus-enhanced.hf.space"`
|
| 170 |
- Run connectivity check cell
|
| 171 |
- Expected: ✅ Connected message
|
| 172 |
|
| 173 |
2. **Start Training**
|
| 174 |
- Run all cells in Colab
|
| 175 |
- Training produces reward curves
|
| 176 |
-
- Monitor: https://kunalkachru23-nexus-enhanced.hf.space/learning-curve
|
| 177 |
|
| 178 |
3. **Expected Training Results**
|
| 179 |
- First 5-10 episodes: ~0.2 reward (baseline)
|
|
@@ -181,7 +181,7 @@ curl -X POST https://kunalkachru23-nexus-enhanced.hf.space/reset \
|
|
| 181 |
- Episodes 30+: Convergence around 0.6-0.8
|
| 182 |
- Total training time: ~6 hours for 50 episodes on Colab GPU
|
| 183 |
|
| 184 |
-
4. **Blog Post** (
|
| 185 |
- Topic: "How NEXUS Enhanced Trains Multi-Agent Incident Response via GRPO"
|
| 186 |
- Sections:
|
| 187 |
- Problem statement (CrowdStrike scale)
|
|
@@ -191,7 +191,7 @@ curl -X POST https://kunalkachru23-nexus-enhanced.hf.space/reset \
|
|
| 191 |
- Length: ~800-1200 words (roughly a 4-6 min read)
|
| 192 |
- Publish to HF blog or Medium
|
| 193 |
|
| 194 |
-
5. **Pitch Video** (
|
| 195 |
- Duration: 3 minutes max
|
| 196 |
- Content:
|
| 197 |
- Show judge dashboard at `/`
|
|
@@ -222,7 +222,7 @@ curl -X POST https://kunalkachru23-nexus-enhanced.hf.space/reset \
|
|
| 222 |
|
| 223 |
| Service | URL | Purpose |
|
| 224 |
|---------|-----|---------|
|
| 225 |
-
| Judge Dashboard | `https://kunalkachru23-nexus-enhanced.hf.space/` | Live metrics + curves |
|
| 226 |
| API Health | `.../health` | Connectivity check |
|
| 227 |
| Metrics | `.../metrics` | Training stats |
|
| 228 |
| Learning Curve | `.../learning-curve` | Reward history |
|
|
|
|
| 85 |
# 🎉 Deployment complete!
|
| 86 |
|
| 87 |
# Monitor build progress:
|
| 88 |
+
# https://huggingface.co/spaces/kunalkachru23/nexus-enhanced-stage
|
| 89 |
# Look for "Building" → "Running" status (5-10 min)
|
| 90 |
```
|
| 91 |
|
|
|
|
| 94 |
|
| 95 |
```bash
|
| 96 |
# Test public health endpoint
|
| 97 |
+
curl https://kunalkachru23-nexus-enhanced-stage.hf.space/health
|
| 98 |
|
| 99 |
# Expected: {"status": "healthy", "environment": "nexus-enhanced", ...}
|
| 100 |
|
| 101 |
# Run full remote test suite
|
| 102 |
+
python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space
|
| 103 |
|
| 104 |
# Expected: ✅ ALL 7 TESTS PASS
|
| 105 |
```
|
|
|
|
| 107 |
### Step 5: Verify Judge Dashboard
|
| 108 |
Open in browser:
|
| 109 |
```
|
| 110 |
+
https://kunalkachru23-nexus-enhanced-stage.hf.space/
|
| 111 |
```
|
| 112 |
|
| 113 |
**Expected to see**:
|
|
|
|
| 125 |
|
| 126 |
```bash
|
| 127 |
# Full test suite against deployed environment
|
| 128 |
+
python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space
|
| 129 |
|
| 130 |
# Individual endpoint tests:
|
| 131 |
+
curl https://kunalkachru23-nexus-enhanced-stage.hf.space/health | jq .
|
| 132 |
+
curl https://kunalkachru23-nexus-enhanced-stage.hf.space/metrics | jq .
|
| 133 |
+
curl https://kunalkachru23-nexus-enhanced-stage.hf.space/learning-curve | jq .
|
| 134 |
+
curl -X POST https://kunalkachru23-nexus-enhanced-stage.hf.space/reset \
|
| 135 |
-H "Content-Type: application/json" \
|
| 136 |
-d '{"incident_id": "INC003"}' | jq .
|
| 137 |
```
|
|
|
|
| 166 |
### After Phase 7 Passes:
|
| 167 |
|
| 168 |
1. **Update Colab Notebook** (grpo_colab_v2.ipynb)
|
| 169 |
+
- Verify `BASE_URL = "https://kunalkachru23-nexus-enhanced-stage.hf.space"`
|
| 170 |
- Run connectivity check cell
|
| 171 |
- Expected: ✅ Connected message
|
| 172 |
|
| 173 |
2. **Start Training**
|
| 174 |
- Run all cells in Colab
|
| 175 |
- Training produces reward curves
|
| 176 |
+
- Monitor: https://kunalkachru23-nexus-enhanced-stage.hf.space/learning-curve
|
| 177 |
|
| 178 |
3. **Expected Training Results**
|
| 179 |
- First 5-10 episodes: ~0.2 reward (baseline)
|
|
|
|
| 181 |
- Episodes 30+: Convergence around 0.6-0.8
|
| 182 |
- Total training time: ~6 hours for 50 episodes on Colab GPU
|
| 183 |
|
| 184 |
+
4. **Blog Post** (submission requirement)
|
| 185 |
- Topic: "How NEXUS Enhanced Trains Multi-Agent Incident Response via GRPO"
|
| 186 |
- Sections:
|
| 187 |
- Problem statement (CrowdStrike scale)
|
|
|
|
| 191 |
- Length: ~800-1200 words (roughly a 4-6 min read)
|
| 192 |
- Publish to HF blog or Medium
|
| 193 |
|
| 194 |
+
5. **Pitch Video** (submission requirement)
|
| 195 |
- Duration: 3 minutes max
|
| 196 |
- Content:
|
| 197 |
- Show judge dashboard at `/`
|
|
|
|
| 222 |
|
| 223 |
| Service | URL | Purpose |
|
| 224 |
|---------|-----|---------|
|
| 225 |
+
| Judge Dashboard | `https://kunalkachru23-nexus-enhanced-stage.hf.space/` | Live metrics + curves |
|
| 226 |
| API Health | `.../health` | Connectivity check |
|
| 227 |
| Metrics | `.../metrics` | Training stats |
|
| 228 |
| Learning Curve | `.../learning-curve` | Reward history |
|
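The checklist above expects `/health` to return JSON containing `"status": "healthy"`. The sketch below shows one way that check could be scripted with only the standard library. The helper names are hypothetical, and only the `status` field is inspected because the rest of the payload is elided in the checklist.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """GET `url` and decode the body as JSON (standard library only)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def is_healthy(payload):
    """True when the payload matches the documented `"status": "healthy"`."""
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def check_health(base_url, fetch=fetch_json):
    """Query `<base_url>/health`; `fetch` is injectable for offline tests."""
    return is_healthy(fetch(base_url.rstrip("/") + "/health"))
```

Against the live Space this would be called as `check_health("https://kunalkachru23-nexus-enhanced-stage.hf.space")`. Passing a stub as `fetch` lets the logic be exercised without network access.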
docs/deployment/HF_SPACES_DEPLOYMENT.md
CHANGED
|
@@ -63,18 +63,18 @@ Once HF Spaces shows "Running" status:
|
|
| 63 |
|
| 64 |
```bash
|
| 65 |
# Test health endpoint
|
| 66 |
-
curl -s https://kunalkachru23-nexus-enhanced.hf.space/health | jq .
|
| 67 |
|
| 68 |
# Test reset endpoint
|
| 69 |
-
curl -s -X POST https://kunalkachru23-nexus-enhanced.hf.space/reset \
|
| 70 |
-H "Content-Type: application/json" \
|
| 71 |
-d '{"incident_id": "INC003"}' | jq .
|
| 72 |
|
| 73 |
# Test metrics endpoint
|
| 74 |
-
curl -s https://kunalkachru23-nexus-enhanced.hf.space/metrics | jq .
|
| 75 |
|
| 76 |
# Test learning curve
|
| 77 |
-
curl -s https://kunalkachru23-nexus-enhanced.hf.space/learning-curve | jq .
|
| 78 |
```
|
| 79 |
|
| 80 |
### Step 6: Update Colab Notebook
|
|
@@ -82,7 +82,7 @@ curl -s https://kunalkachru23-nexus-enhanced.hf.space/learning-curve | jq .
|
|
| 82 |
In `notebooks/grpo_colab_v2.ipynb`, ensure BASE_URL points to deployed space:
|
| 83 |
|
| 84 |
```python
|
| 85 |
-
BASE_URL = "https://kunalkachru23-nexus-enhanced.hf.space" # YOUR SPACE URL
|
| 86 |
```
|
| 87 |
|
| 88 |
Then run connectivity check:
|
|
@@ -102,7 +102,7 @@ cat > test_hf_space_deployment.py << 'EOF'
|
|
| 102 |
import requests
|
| 103 |
import json
|
| 104 |
|
| 105 |
-
BASE_URL = "https://kunalkachru23-nexus-enhanced.hf.space"
|
| 106 |
|
| 107 |
def test_health():
|
| 108 |
resp = requests.get(f"{BASE_URL}/health")
|
|
@@ -193,8 +193,8 @@ python test_hf_space_deployment.py
|
|
| 193 |
|
| 194 |
While Colab trains:
|
| 195 |
|
| 196 |
-
1. **Watch reward curves**: https://kunalkachru23-nexus-enhanced.hf.space/learning-curve
|
| 197 |
-
2. **Check metrics**: `curl https://kunalkachru23-nexus-enhanced.hf.space/metrics`
|
| 198 |
3. **Monitor Colab logs** for reward_fn errors
|
| 199 |
4. **Expected pattern**: First 5-10 episodes ~0.2 reward, then gradual improvement to 0.6-0.8
|
| 200 |
|
|
@@ -212,5 +212,5 @@ git push origin main
|
|
| 212 |
## Next Steps
|
| 213 |
|
| 214 |
- **Phase 7**: Run full regression tests against deployed HF Space
|
| 215 |
-
- **Blog Post**: Write HF blog explaining NEXUS architecture (
|
| 216 |
-
- **Pitch**: Prepare 3-minute demo for judges (
|
|
|
|
| 63 |
|
| 64 |
```bash
|
| 65 |
# Test health endpoint
|
| 66 |
+
curl -s https://kunalkachru23-nexus-enhanced-stage.hf.space/health | jq .
|
| 67 |
|
| 68 |
# Test reset endpoint
|
| 69 |
+
curl -s -X POST https://kunalkachru23-nexus-enhanced-stage.hf.space/reset \
|
| 70 |
-H "Content-Type: application/json" \
|
| 71 |
-d '{"incident_id": "INC003"}' | jq .
|
| 72 |
|
| 73 |
# Test metrics endpoint
|
| 74 |
+
curl -s https://kunalkachru23-nexus-enhanced-stage.hf.space/metrics | jq .
|
| 75 |
|
| 76 |
# Test learning curve
|
| 77 |
+
curl -s https://kunalkachru23-nexus-enhanced-stage.hf.space/learning-curve | jq .
|
| 78 |
```
|
| 79 |
|
| 80 |
### Step 6: Update Colab Notebook
|
|
|
|
| 82 |
In `notebooks/grpo_colab_v2.ipynb`, ensure BASE_URL points to deployed space:
|
| 83 |
|
| 84 |
```python
|
| 85 |
+
BASE_URL = "https://kunalkachru23-nexus-enhanced-stage.hf.space" # YOUR SPACE URL
|
| 86 |
```
|
| 87 |
|
| 88 |
Then run connectivity check:
|
|
|
|
| 102 |
import requests
|
| 103 |
import json
|
| 104 |
|
| 105 |
+
BASE_URL = "https://kunalkachru23-nexus-enhanced-stage.hf.space"
|
| 106 |
|
| 107 |
def test_health():
|
| 108 |
resp = requests.get(f"{BASE_URL}/health")
|
|
|
|
| 193 |
|
| 194 |
While Colab trains:
|
| 195 |
|
| 196 |
+
1. **Watch reward curves**: https://kunalkachru23-nexus-enhanced-stage.hf.space/learning-curve
|
| 197 |
+
2. **Check metrics**: `curl https://kunalkachru23-nexus-enhanced-stage.hf.space/metrics`
|
| 198 |
3. **Monitor Colab logs** for reward_fn errors
|
| 199 |
4. **Expected pattern**: First 5-10 episodes ~0.2 reward, then gradual improvement to 0.6-0.8
|
| 200 |
|
|
|
|
| 212 |
## Next Steps
|
| 213 |
|
| 214 |
- **Phase 7**: Run full regression tests against deployed HF Space
|
| 215 |
+
- **Blog Post**: Write HF blog explaining NEXUS architecture (per submission requirements)
|
| 216 |
+
- **Pitch**: Prepare 3-minute demo for judges (per submission requirements)
|
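The deployment guide's generated `test_hf_space_deployment.py` hard-codes `BASE_URL`, while the checklist invokes the script with `--url`. A minimal sketch of how the script could accept both forms is shown below. It is an assumption about how the flag might be wired, not a copy of the actual script.

```python
import argparse

# Default taken from the BASE_URL shown in the deployment guide.
DEFAULT_BASE_URL = "https://kunalkachru23-nexus-enhanced-stage.hf.space"


def parse_args(argv=None):
    """Parse an optional --url flag, falling back to the documented Space."""
    parser = argparse.ArgumentParser(description="Smoke-test a deployed Space.")
    parser.add_argument(
        "--url",
        default=DEFAULT_BASE_URL,
        help="Base URL of the Space under test.",
    )
    args = parser.parse_args(argv)
    # Normalise so that f"{args.url}/health" never produces a double slash.
    args.url = args.url.rstrip("/")
    return args
```

With this shape, `python test_hf_space_deployment.py` and `python test_hf_space_deployment.py --url <other-space>` would both work, which keeps the script reusable when the Space is renamed again.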
docs/guides/QUICK_START.md
CHANGED
|
@@ -48,7 +48,7 @@ openenv push . --repo-id kunalkachru23/nexus-enhanced-stage --exclude .hfignore
|
|
| 48 |
4. Watch dashboard: https://kunalkachru23-nexus-enhanced-stage.hf.space/
|
| 49 |
```
|
| 50 |
|
| 51 |
-
### Export reward plot for slides (
|
| 52 |
```bash
|
| 53 |
python scripts/export_reward_plot.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space
|
| 54 |
# or from local episode_rewards.json:
|
|
@@ -188,7 +188,7 @@ python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced-st
|
|
| 188 |
| Reward Progress | **20%** | Chart.js curves on dashboard ← **KEY** |
|
| 189 |
| Pipeline | 10% | GRPO on Colab GPU → HF Space API |
|
| 190 |
|
| 191 |
-
**🎯 Priority**: Ensure reward curves are visible and improving to
|
| 192 |
|
| 193 |
---
|
| 194 |
|
|
|
|
| 48 |
4. Watch dashboard: https://kunalkachru23-nexus-enhanced-stage.hf.space/
|
| 49 |
```
|
| 50 |
|
| 51 |
+
### Export reward plot for slides (observable improvement evidence)
|
| 52 |
```bash
|
| 53 |
python scripts/export_reward_plot.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space
|
| 54 |
# or from local episode_rewards.json:
|
|
|
|
| 188 |
| Reward Progress | **20%** | Chart.js curves on dashboard ← **KEY** |
|
| 189 |
| Pipeline | 10% | GRPO on Colab GPU → HF Space API |
|
| 190 |
|
| 191 |
+
**🎯 Priority**: Ensure reward curves are visible and improving to support the observable-improvement rubric row.
|
| 192 |
|
| 193 |
---
|
| 194 |
|
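This commit renames every Space URL from `nexus-enhanced` to `nexus-enhanced-stage`. Deriving the URL from the repo id in one place could avoid such repo-wide edits. The sketch below assumes the `https://<owner>-<space>.hf.space` pattern that can be inferred from the URLs in this diff. The function name is hypothetical, and any further normalisation Hugging Face may apply to unusual characters is not handled.

```python
def space_base_url(repo_id):
    """Build the direct Space URL from an `<owner>/<space>` repo id.

    Assumes the `<owner>-<space>.hf.space` pattern seen in this repository.
    """
    owner, sep, name = repo_id.partition("/")
    if not sep or not owner or not name or "/" in name:
        raise ValueError(f"expected '<owner>/<space>', got {repo_id!r}")
    return f"https://{owner}-{name}.hf.space"


BASE_URL = space_base_url("kunalkachru23/nexus-enhanced-stage")
```

Docs cannot import a Python constant, but notebooks and test scripts could, which would shrink the set of files touched by a rename.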
docs/pitch/DEMO_WALKTHROUGH.md
CHANGED
|
@@ -69,7 +69,7 @@ Say:
|
|
| 69 |
## [1:50-2:00] Close
|
| 70 |
|
| 71 |
Say:
|
| 72 |
-
"We satisfy OpenEnv
|
| 73 |
|
| 74 |
---
|
| 75 |
|
|
|
|
| 69 |
## [1:50-2:00] Close
|
| 70 |
|
| 71 |
Say:
|
| 72 |
+
"We satisfy OpenEnv validation requirements, show live reward improvement, and can directly inspect behavior delta in the running environment."
|
| 73 |
|
| 74 |
---
|
| 75 |
|
docs/pitch/PITCH.md
CHANGED
|
@@ -1,7 +1,7 @@
|
|
| 1 |
-
# NEXUS Enhanced — 3-minute pitch + 2-minute Q&A
|
| 2 |
|
| 3 |
**Event:** Meta PyTorch OpenEnv Hackathon × Scaler — Grand Finale
|
| 4 |
-
**Format (
|
| 5 |
|
| 6 |
---
|
| 7 |
|
|
@@ -29,7 +29,7 @@
|
|
| 29 |
|
| 30 |
---
|
| 31 |
|
| 32 |
-
## Observable evidence (~35 s) —
|
| 33 |
|
| 34 |
- Show **dashboard** reward curve and rolling average vs **baseline** (pre-event benchmark).
|
| 35 |
- Use one canonical metrics callout (snapshot `2026-04-24T16:48:26Z`, stage URL): **episodes 387**, **avg 0.4634**, **best 1.0032**, **+74.9% vs baseline 0.265**.
|
|
@@ -38,7 +38,7 @@
|
|
| 38 |
|
| 39 |
---
|
| 40 |
|
| 41 |
-
## Training & improvement (~30 s) —
|
| 42 |
|
| 43 |
- **Colab** runs minimal GRPO training against the **real** remote environment API (not a mocked reward).
|
| 44 |
- Improvement is **observable** on the curve and in **behaviour** (shorter paths, better notifications, fewer oversight violations) — tie any checkpoint story to **what the IC does differently**, not only the scalar.
|
|
@@ -47,7 +47,7 @@
|
|
| 47 |
|
| 48 |
## Close (~20 s)
|
| 49 |
|
| 50 |
-
> “NEXUS is **OpenEnv-shaped**: isolated episodes, structured actions, measurable outcomes, and a problem that stays hard after the novelty wears off. We meet
|
| 51 |
|
| 52 |
**Stop at 3:00. Breathe. Hand off for Q&A.**
|
| 53 |
|
|
@@ -65,7 +65,7 @@ Answer in **short paragraphs**; do not invent numbers not on the dashboard.
|
|
| 65 |
| **Reward hacking?** | Sparse terminal reward + oversight + coalition + tool budgets; red herrings in harder incidents. |
|
| 66 |
| **What is partial observability?** | IC observation is a slice; specialists see tool outputs for their role; IC never sees everything at once. |
|
| 67 |
| **INC007 in 30 s?** | Nightmare incident: multi-region blast radius; **schema drift** forces contract change mid-episode — reserved for sharp Q&A, not the full live path if time is short. |
|
| 68 |
-
| **Why GRPO / TRL / Unsloth?** |
|
| 69 |
| **What if the Space is slow?** | Training is async from Colab; dashboard refreshes on timer; auto-demo is one POST chain. |
|
| 70 |
| **Baseline 0.265?** | Pre-event scripted benchmark documented in server; curve compares **trained vs that baseline** for “observable improvement.” |
|
| 71 |
| **Single strongest differentiator?** | Multi-agent + sparse reward + **schema drift** on INC007 in one OpenEnv-hosted stack judges can open in the browser. |
|
|
|
|
| 1 |
+
# NEXUS Enhanced — 3-minute pitch + 2-minute Q&A
|
| 2 |
|
| 3 |
**Event:** Meta PyTorch OpenEnv Hackathon × Scaler — Grand Finale
|
| 4 |
+
**Format (per hackathon compliance):** 3 min pitch + 2 min Q&A = 5 min total.
|
| 5 |
|
| 6 |
---
|
| 7 |
|
|
|
|
| 29 |
|
| 30 |
---
|
| 31 |
|
| 32 |
+
## Observable evidence (~35 s) — reward improvement (judging rubric)
|
| 33 |
|
| 34 |
- Show **dashboard** reward curve and rolling average vs **baseline** (pre-event benchmark).
|
| 35 |
- Use one canonical metrics callout (snapshot `2026-04-24T16:48:26Z`, stage URL): **episodes 387**, **avg 0.4634**, **best 1.0032**, **+74.9% vs baseline 0.265**.
|
|
|
|
| 38 |
|
| 39 |
---
|
| 40 |
|
| 41 |
+
## Training & improvement (~30 s) — improvement + pipeline coherence
|
| 42 |
|
| 43 |
- **Colab** runs minimal GRPO training against the **real** remote environment API (not a mocked reward).
|
| 44 |
- Improvement is **observable** on the curve and in **behaviour** (shorter paths, better notifications, fewer oversight violations) — tie any checkpoint story to **what the IC does differently**, not only the scalar.
|
|
|
|
| 47 |
|
| 48 |
## Close (~20 s)
|
| 49 |
|
| 50 |
+
> “NEXUS is **OpenEnv-shaped**: isolated episodes, structured actions, measurable outcomes, and a problem that stays hard after the novelty wears off. We meet **hackathon compliance**: OpenEnv latest in the toolchain, Colab TRL/Unsloth training, and the required HF blog or short video slot—and we optimised for the **40% environment** and **30% storytelling** rubric weights with a demo you can **drive live** in under two minutes.”
|
| 51 |
|
| 52 |
**Stop at 3:00. Breathe. Hand off for Q&A.**
|
| 53 |
|
|
|
|
| 65 |
| **Reward hacking?** | Sparse terminal reward + oversight + coalition + tool budgets; red herrings in harder incidents. |
|
| 66 |
| **What is partial observability?** | IC observation is a slice; specialists see tool outputs for their role; IC never sees everything at once. |
|
| 67 |
| **INC007 in 30 s?** | Nightmare incident: multi-region blast radius; **schema drift** forces contract change mid-episode — reserved for sharp Q&A, not the full live path if time is short. |
|
| 68 |
+
| **Why GRPO / TRL / Unsloth?** | Per compliance: minimal training in Colab with **HF TRL** and **Unsloth** for efficient QLoRA on Qwen-class IC policy. |
|
| 69 |
| **What if the Space is slow?** | Training is async from Colab; dashboard refreshes on timer; auto-demo is one POST chain. |
|
| 70 |
| **Baseline 0.265?** | Pre-event scripted benchmark documented in server; curve compares **trained vs that baseline** for “observable improvement.” |
|
| 71 |
| **Single strongest differentiator?** | Multi-agent + sparse reward + **schema drift** on INC007 in one OpenEnv-hosted stack judges can open in the browser. |
|
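The pitch quotes **+74.9% vs baseline 0.265** for an average reward of 0.4634. The arithmetic behind that callout can be reproduced with a short helper, sketched here. The function name is illustrative, and the inputs are the figures stated in the pitch.

```python
def percent_improvement(value, baseline):
    """Relative improvement of `value` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline * 100.0


# Figures from the pitch's canonical metrics callout.
uplift = percent_improvement(0.4634, 0.265)
print(f"{uplift:.1f}%")  # (0.4634 - 0.265) / 0.265 is about 0.7487, so 74.9%
```

Recomputing the percentage from the two raw numbers whenever the snapshot changes helps keep the spoken figure and the dashboard in agreement.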
docs/pitch/PITCH_3MIN.md
CHANGED
|
@@ -86,7 +86,7 @@ The trained model learns to query the right service, notify customers proactivel
|
|
| 86 |
|
| 87 |
For academia, it's a benchmarkable environment. For enterprise, it's a path toward AI-assisted incident management. For Meta and PyTorch, it demonstrates OpenEnv's potential for real-world complexity.
|
| 88 |
|
| 89 |
-
**Live demo:** https://kunalkachru23-nexus-enhanced.hf.space/
|
| 90 |
|
| 91 |
Thank you."
|
| 92 |
|
|
|
|
| 86 |
|
| 87 |
For academia, it's a benchmarkable environment. For enterprise, it's a path toward AI-assisted incident management. For Meta and PyTorch, it demonstrates OpenEnv's potential for real-world complexity.
|
| 88 |
|
| 89 |
+
**Live demo:** https://kunalkachru23-nexus-enhanced-stage.hf.space/
|
| 90 |
|
| 91 |
Thank you."
|
| 92 |
|
docs/pitch/VIDEO_RECORD_NOW_PACK.md
ADDED
|
@@ -0,0 +1,144 @@
|
|
| 1 |
+
# NEXUS Enhanced — Record-Now Video Pack
|
| 2 |
+
|
| 3 |
+
Target length: 1:45 to 2:00
|
| 4 |
+
Presenter style: calm, clear, outcome-first
|
| 5 |
+
Audience: judges / hackathon reviewers
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## 1) Pre-record setup (2-3 minutes)
|
| 10 |
+
|
| 11 |
+
Keep only these windows ready:
|
| 12 |
+
- Browser tab 1: `https://kunalkachru23-nexus-enhanced-stage.hf.space/web`
|
| 13 |
+
- Browser tab 2: `https://kunalkachru23-nexus-enhanced-stage.hf.space/metrics`
|
| 14 |
+
- Terminal tab with this command pre-pasted:
|
| 15 |
+
|
| 16 |
+
```bash
|
| 17 |
+
curl -s -X POST https://kunalkachru23-nexus-enhanced-stage.hf.space/demo/run/INC003 | python -m json.tool
|
| 18 |
+
```
|
| 19 |
+
|
| 20 |
+
Visual quality checklist:
|
| 21 |
+
- Screen recording at 1080p.
|
| 22 |
+
- Terminal font size: 16+.
|
| 23 |
+
- Browser zoom: 125%.
|
| 24 |
+
- Hide desktop notifications.
|
| 25 |
+
- Use dark mode consistently (optional, but cleaner).
|
| 26 |
+
|
| 27 |
+
Canonical narration numbers (freeze these):
|
| 28 |
+
- Episodes: `387`
|
| 29 |
+
- Average reward: `0.4634`
|
| 30 |
+
- Best reward: `1.0032`
|
| 31 |
+
- Baseline: `0.265`
|
| 32 |
+
- Improvement: `+74.9%`
|
| 33 |
+
|
| 34 |
+
Source: `docs/project/snapshots/submission_snapshot_20260424T164826Z.md`
|
| 35 |
+
|
| 36 |
+
---
|
| 37 |
+
|
| 38 |
+
## 2) One-take timeline (timestamped)
|
| 39 |
+
|
| 40 |
+
### 0:00-0:12 — Hook (camera or title card)
|
| 41 |
+
|
| 42 |
+
Say:
|
| 43 |
+
"Most incident-response AI demos are single-agent and unrealistic. NEXUS Enhanced is a multi-agent OpenEnv environment where an Incident Commander coordinates specialists under partial observability, business constraints, and schema drift."
|
| 44 |
+
|
| 45 |
+
On screen:
|
| 46 |
+
- Title card or the `/web` dashboard landing view.
|
| 47 |
+
|
| 48 |
+
### 0:12-0:35 — What you built
|
| 49 |
+
|
| 50 |
+
Say:
|
| 51 |
+
"We built seven incident scenarios, from easier outages to a nightmare schema-drift incident. The system is deployed on Hugging Face Spaces and validated with OpenEnv-compatible workflow checks. Training runs through TRL GRPO with Unsloth in Colab."
|
| 52 |
+
|
| 53 |
+
On screen:
|
| 54 |
+
- Briefly show the `/metrics` page, then return to `/web`.
|
| 55 |
+
|
| 56 |
+
### 0:35-1:00 — Measurable improvement
|
| 57 |
+
|
| 58 |
+
Say:
|
| 59 |
+
"In the current frozen snapshot we have 387 completed episodes, average reward 0.4634 versus baseline 0.265, and best reward 1.0032. That's a 74.9% uplift over baseline."
|
| 60 |
+
|
| 61 |
+
On screen:
|
| 62 |
+
- Show dashboard training metrics and curve.
|
| 63 |
+
|
| 64 |
+
### 1:00-1:30 — Behavioral evidence
|
| 65 |
+
|
| 66 |
+
Say:
|
| 67 |
+
"The key claim is behavior change, not only scalar gain. In INC003, the policy commits earlier to the memory-leak hypothesis, sequences runbook steps cleanly, sends proactive customer notifications, and reaches postmortem with fewer redundant actions."
|
| 68 |
+
|
| 69 |
+
On screen:
|
| 70 |
+
- Run the terminal command:
|
| 71 |
+
|
| 72 |
+
```bash
|
| 73 |
+
curl -s -X POST https://kunalkachru23-nexus-enhanced-stage.hf.space/demo/run/INC003 | python -m json.tool
|
| 74 |
+
```
|
| 75 |
+
|
| 76 |
+
- Point to:
|
| 77 |
+
- `final_phase: "postmortem"`
|
| 78 |
+
- `reward_breakdown.total`
|
| 79 |
+
- `coalition_correct: true`
|
| 80 |
+
- `notifications_sent`
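To make the four fields easy to point at on screen, the demo response could be reduced to a short summary. This is a sketch only: the field names come from the list above, but the exact JSON layout of the `/demo/run/INC003` response is an assumption, and `summarize_demo` plus the sample payload are hypothetical.

```python
import json


def summarize_demo(payload: dict) -> dict:
    """Pick out the fields highlighted in the walkthrough.

    Missing keys come back as None instead of raising, so the summary
    still prints if the response shape differs from this assumption.
    """
    breakdown = payload.get("reward_breakdown") or {}
    return {
        "final_phase": payload.get("final_phase"),
        "reward_total": breakdown.get("total"),
        "coalition_correct": payload.get("coalition_correct"),
        "notifications_sent": payload.get("notifications_sent"),
    }


# Hypothetical payload for illustration only; not real output from the Space.
sample = json.loads(
    '{"final_phase": "postmortem", "reward_breakdown": {"total": 0.5},'
    ' "coalition_correct": true, "notifications_sent": 2}'
)
print(json.dumps(summarize_demo(sample), indent=2))
```

In practice the curl output would be saved to a file and loaded with `json.load` in place of the hard-coded sample.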
|
| 81 |
+
|
| 82 |
+
### 1:30-1:52 — Safety and robustness
|
| 83 |
+
|
| 84 |
+
Say:
|
| 85 |
+
"To reduce reward hacking, diagnosis is evidence-gated, customer score requires actual notification actions, and coordination penalizes duplicate tool calls. Oversight checks also affect final scoring."
|
| 86 |
+
|
| 87 |
+
On screen:
|
| 88 |
+
- Keep terminal result visible (reward breakdown + oversight report).
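The three guards named in the narration can be illustrated with a toy scorer. This is not the code in `server/reward.py`; the function name, action format, and penalty value below are assumptions chosen only to show the idea of gating each score on observed actions.

```python
from collections import Counter


def guarded_scores(actions, diagnosis_correct, required_evidence,
                   duplicate_penalty=0.05):
    """Toy reward guards over an ordered list of (kind, name) actions.

    - Diagnosis credit is given only if every required evidence tool
      was actually called (evidence gating).
    - Customer credit requires at least one explicit notify action.
    - Coordination loses `duplicate_penalty` per repeated tool call.
    """
    tools_called = [name for kind, name in actions if kind == "tool"]
    evidence_ok = set(required_evidence).issubset(tools_called)
    notified = any(kind == "notify" for kind, _ in actions)
    duplicates = sum(count - 1 for count in Counter(tools_called).values())
    return {
        "diagnosis": 1.0 if (diagnosis_correct and evidence_ok) else 0.0,
        "customer": 1.0 if notified else 0.0,
        "coordination": max(0.0, 1.0 - duplicate_penalty * duplicates),
    }


# Hypothetical episode: a correct guess without the metrics evidence.
episode = [("tool", "logs"), ("tool", "logs"), ("notify", "customers")]
print(guarded_scores(episode, True, ["logs", "metrics"]))
```

Here a correct diagnosis earns nothing because one evidence tool was never called, which is the behavior the evidence gate is meant to enforce.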
|
| 89 |
+
|
| 90 |
+
### 1:52-2:00 — Close
|
| 91 |
+
|
| 92 |
+
Say:
|
| 93 |
+
"NEXUS combines innovation, observable improvement, and reproducible deployment for real incident-management RL. Thanks for watching."
|
| 94 |
+
|
| 95 |
+
On screen:
|
| 96 |
+
- End card with:
|
| 97 |
+
- Stage URL
|
| 98 |
+
- Repo URL
|
| 99 |
+
- Evidence docs path
|
| 100 |
+
|
| 101 |
+
---
|
| 102 |
+
|
| 103 |
+
## 3) Backup branch (if UI or network is slow)
|
| 104 |
+
|
| 105 |
+
If the dashboard lags, use this terminal-only sequence:
|
| 106 |
+
|
| 107 |
+
```bash
|
| 108 |
+
curl -s https://kunalkachru23-nexus-enhanced-stage.hf.space/health | python -m json.tool
|
| 109 |
+
curl -s https://kunalkachru23-nexus-enhanced-stage.hf.space/metrics | python -m json.tool
|
| 110 |
+
curl -s https://kunalkachru23-nexus-enhanced-stage.hf.space/learning-curve | python -m json.tool
|
| 111 |
+
curl -s -X POST https://kunalkachru23-nexus-enhanced-stage.hf.space/demo/run/INC003 | python -m json.tool
|
| 112 |
+
```
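The same four calls can be scripted so that one failing endpoint does not stop the rest. This is a sketch under assumptions: the endpoint paths and HTTP methods mirror the curl commands above, every endpoint is assumed to return JSON, and the helper names `http_fetch` and `run_checks` are illustrative.

```python
import json
import urllib.request

BASE_URL = "https://kunalkachru23-nexus-enhanced-stage.hf.space"

# (method, path) pairs taken from the curl sequence above.
CHECKS = [
    ("GET", "/health"),
    ("GET", "/metrics"),
    ("GET", "/learning-curve"),
    ("POST", "/demo/run/INC003"),
]


def http_fetch(method, url, timeout=30):
    """Send one request with no body and decode the JSON response."""
    request = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_checks(fetch=http_fetch, base_url=BASE_URL):
    """Call every endpoint, recording errors instead of aborting."""
    results = {}
    for method, path in CHECKS:
        try:
            results[path] = fetch(method, base_url + path)
        except Exception as error:  # keep going so later checks still run
            results[path] = {"error": str(error)}
    return results
```

Calling `run_checks()` and printing each entry with `json.dumps(body, indent=2)` gives roughly what the piped `python -m json.tool` output shows; the `fetch` parameter allows a stub to be passed in when rehearsing offline.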
|
| 113 |
+
|
| 114 |
+
Narration line for fallback:
|
| 115 |
+
"Even without the UI, these live endpoints show health, metrics, learning progression, and the full behavioral transcript from the INC003 auto-demo."
|
| 116 |
+
|
| 117 |
+
---
|
| 118 |
+
|
| 119 |
+
## 4) Retake checklist (30 seconds)
|
| 120 |
+
|
| 121 |
+
Before final export, verify:
|
| 122 |
+
- Audio is clear and steady.
|
| 123 |
+
- Duration stays under 2:00.
|
| 124 |
+
- Numbers match frozen snapshot.
|
| 125 |
+
- Stage URL shown at least once.
|
| 126 |
+
- Demo command succeeds in-frame.
|
| 127 |
+
- No mention of internal-only terminology.
|
| 128 |
+
|
| 129 |
+
---
|
| 130 |
+
|
| 131 |
+
## 5) Upload metadata (copy/paste)
|
| 132 |
+
|
| 133 |
+
Published video URL:
|
| 134 |
+
`https://www.youtube.com/watch?v=a9YZF30tomw`
|
| 135 |
+
|
| 136 |
+
Title:
|
| 137 |
+
`NEXUS Enhanced Demo — Multi-Agent Incident Response RL (OpenEnv + GRPO)`
|
| 138 |
+
|
| 139 |
+
Description:
|
| 140 |
+
`Live demo of NEXUS Enhanced on HF Space. Shows OpenEnv-compatible environment checks, observable reward improvement, and transcript-level behavioral evidence on INC003. Stage URL: https://kunalkachru23-nexus-enhanced-stage.hf.space`
|
| 141 |
+
|
| 142 |
+
Tags:
|
| 143 |
+
`openenv, reinforcement learning, multi-agent, incident response, grpo, trl, unsloth, huggingface`
|
| 144 |
+
|
docs/project/BEHAVIORAL_DELTA_PROOF.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
-
#
|
| 2 |
|
| 3 |
-
This sheet demonstrates
|
| 4 |
|
| 5 |
## Fixed evaluation set (canonical)
|
| 6 |
|
|
@@ -60,9 +60,9 @@ These support that gain is not only speed; diagnosis/coordination/customer dimen
|
|
| 60 |
- `GET /learning-curve`
|
| 61 |
- confirm aggregate improvement trend and rolling average.
|
| 62 |
|
| 63 |
-
## Why this satisfies
|
| 64 |
|
| 65 |
-
|
| 66 |
NEXUS evidence shows:
|
| 67 |
|
| 68 |
- Coherent multi-dimensional reward decomposition (MTTR, diagnosis, customer, coordination, oversight, depth).
|
|
|
|
| 1 |
+
# Behavioral delta proof (pipeline coherence)
|
| 2 |
|
| 3 |
+
This sheet addresses the **hackathon judging intent** for reward-and-pipeline coherence: measurable improvement in **how the agent acts**, not only in reward numbers.
|
| 4 |
|
| 5 |
## Fixed evaluation set (canonical)
|
| 6 |
|
|
|
|
| 60 |
- `GET /learning-curve`
|
| 61 |
- confirm aggregate improvement trend and rolling average.
|
| 62 |
|
| 63 |
+
## Why this satisfies the pipeline-coherence rubric row
|
| 64 |
|
| 65 |
+
The judging rubric asks for coherent reward logic and meaningful pipeline-driven behavior change.
|
| 66 |
NEXUS evidence shows:
|
| 67 |
|
| 68 |
- Coherent multi-dimensional reward decomposition (MTTR, diagnosis, customer, coordination, oversight, depth).
|
docs/project/COMPLIANCE_LOCK_MATRIX.md
CHANGED
|
@@ -1,8 +1,8 @@
|
|
| 1 |
-
# Compliance Lock Matrix (
|
| 2 |
|
| 3 |
-
Purpose: freeze
|
| 4 |
|
| 5 |
-
##
|
| 6 |
|
| 7 |
| Gate | Requirement | Project evidence | Verification command |
|
| 8 |
|---|---|---|---|
|
|
@@ -10,11 +10,11 @@ Purpose: freeze hard-gate and scoring-criterion traceability so implementation c
|
|
| 10 |
| Colab training script | Minimal Colab script with TRL/Unsloth path | `notebooks/grpo_colab_v2.ipynb` | Notebook config + run cells |
|
| 11 |
| Public artifact | HF blog or <2 min video | `docs/blog/*`, `docs/pitch/YOUTUBE_RECORDING_SCRIPT.md` | Submission URL checklist |
|
| 12 |
|
| 13 |
-
## Weighted scoring criteria map (
|
| 14 |
|
| 15 |
These four rows are what Cerebral Valley–aggregated scoring uses. The **Demonstration** column is how a judge should *see* each criterion in the live Space or repo in under five minutes.
|
| 16 |
|
| 17 |
-
| Criterion |
|
| 18 |
|---|---:|---|---|---|
|
| 19 |
| **1 — Environment Innovation** | 40% | Novel, creative, or challenging; meaningfully tests agent behaviour | `server/environment.py`, `server/incidents.py`, `server/agents.py`, `server/tools.py`, `docs/project/SUBTHEME_EVIDENCE_MATRIX.md` | Open the dashboard **Training** tab → **manual validation** on **INC008** (Theme 3.2) and INC004–INC007; show **coalition**, **role-scoped tooling**, **INC007 schema drift** in Q&A or deep demo. Eight incidents = difficulty ladder + operational nuance + personalized track. |
|
| 20 |
| **2 — Storytelling** | 30% | Clear explanation of problem, environment, and agent behaviour; demo **engaging and easy to follow** | `docs/pitch/PITCH.md`, `docs/pitch/DEMO_WALKTHROUGH.md`, `docs/project/FINAL_OPERATIONS_RUNBOOK.md` | Follow `DEMO_WALKTHROUGH.md`: metrics -> **Validation tab auto-demo** (INC003) -> optional **Guided** steps to completion. One sentence hook: “IC coordinates specialists under partial observability and contract drift.” |
|
|
@@ -23,22 +23,22 @@ These four rows are what Cerebral Valley–aggregated scoring uses. The **Demons
|
|
| 23 |
|
| 24 |
---
|
| 25 |
|
| 26 |
-
##
|
| 27 |
|
| 28 |
-
Official hackathon **themes** (
|
| 29 |
|
| 30 |
-
|
|
| 31 |
|---|---|---|---|---|
|
| 32 |
-
| **Theme 1 — Multi-agent** |
|
| 33 |
-
| **Theme 2 — Long horizon & instruction following** |
|
| 34 |
-
| **Theme 3.1 — World modeling (professional)** |
|
| 35 |
-
| **Theme 3.2 — World modeling (personalized)** |
|
| 36 |
-
| **Theme 4 — Self-improvement** |
|
| 37 |
-
| **Theme 5 — Wild card** |
|
| 38 |
|
| 39 |
-
### Sub-theme bonuses
|
| 40 |
|
| 41 |
-
Fleet, Halluminate, Snorkel, Patronus, Mercor, Scaler AI Labs rows remain the detailed sponsor map. This matrix ties **parent themes
|
| 42 |
|
| 43 |
### One-line pitch bank (theme → sentence)
|
| 44 |
|
|
@@ -47,7 +47,7 @@ Use in Q&A if asked “which themes?”
|
|
| 47 |
1. **Theme 1:** “IC coordinates five roles under partial observability and coalition mechanics.”
|
| 48 |
2. **Theme 2:** “Sparse end-of-episode reward and persistent server state force long-horizon planning beyond a single context.”
|
| 49 |
3. **Theme 3.1:** “Tool-bound enterprise workflows with anti-shortcut evidence and runbook discipline.”
|
| 50 |
-
4. **Theme 3.2:** “**INC008** is a
|
| 51 |
5. **Theme 4:** “**`/curriculum`** shows live adaptive difficulty from rolling rewards; expert criteria rotate; GRPO improves the policy.”
|
| 52 |
6. **Theme 5:** “Wild-card angle: one deployable environment that fuses the hardest ops themes for LLM incident command training.”
|
| 53 |
|
|
|
|
| 1 |
+
# Compliance Lock Matrix (hackathon-aligned)
|
| 2 |
|
| 3 |
+
Purpose: freeze **mandatory requirements** and **judging rubric** traceability so implementation changes stay aligned with hackathon compliance.
|
| 4 |
|
| 5 |
+
## Mandatory requirements (pass/fail)
|
| 6 |
|
| 7 |
| Gate | Requirement | Project evidence | Verification command |
|
| 8 |
|---|---|---|---|
|
|
|
|
| 10 |
| Colab training script | Minimal Colab script with TRL/Unsloth path | `notebooks/grpo_colab_v2.ipynb` | Notebook config + run cells |
|
| 11 |
| Public artifact | HF blog or <2 min video | `docs/blog/*`, `docs/pitch/YOUTUBE_RECORDING_SCRIPT.md` | Submission URL checklist |
|
| 12 |
|
| 13 |
+
## Weighted scoring criteria map (judging rubric)
|
| 14 |
|
| 15 |
These four rows are what Cerebral Valley–aggregated scoring uses. The **Demonstration** column is how a judge should *see* each criterion in the live Space or repo in under five minutes.
|
| 16 |
|
| 17 |
+
| Criterion | Weight | What judges need (intent) | NEXUS evidence | How the design demonstrates it (demo / artefact) |
|
| 18 |
|---|---:|---|---|---|
|
| 19 |
| **1 — Environment Innovation** | 40% | Novel, creative, or challenging; meaningfully tests agent behaviour | `server/environment.py`, `server/incidents.py`, `server/agents.py`, `server/tools.py`, `docs/project/SUBTHEME_EVIDENCE_MATRIX.md` | Open the dashboard **Training** tab → **manual validation** on **INC008** (Theme 3.2) and INC004–INC007; show **coalition**, **role-scoped tooling**, **INC007 schema drift** in Q&A or deep demo. Eight incidents = difficulty ladder + operational nuance + personalized track. |
|
| 20 |
| **2 — Storytelling** | 30% | Clear explanation of problem, environment, and agent behaviour; demo **engaging and easy to follow** | `docs/pitch/PITCH.md`, `docs/pitch/DEMO_WALKTHROUGH.md`, `docs/project/FINAL_OPERATIONS_RUNBOOK.md` | Follow `DEMO_WALKTHROUGH.md`: metrics -> **Validation tab auto-demo** (INC003) -> optional **Guided** steps to completion. One sentence hook: “IC coordinates specialists under partial observability and contract drift.” |
|
|
|
|
| 23 |
|
| 24 |
---
|
| 25 |
|
| 26 |
+
## Hackathon content themes — how NEXUS maps
|
| 27 |
|
| 28 |
+
Official hackathon **themes** (organizer brief) are distinct from the **four scoring criteria** above. NEXUS is architected as an **enterprise incident-command** environment; the table below states where each parent theme is **primary** (core loop), **secondary** (explicit mechanic but not the headline), or **bridge** (honest pitch link without claiming a different product genre).
|
| 29 |
|
| 30 |
+
| Parent theme | Track | Primary requirement (summary) | NEXUS demonstration | Verification |
|
| 31 |
|---|---|---|---|---|
|
| 32 |
+
| **Theme 1 — Multi-agent** | Multi-agent track | Cooperation / competition / negotiation / **coalition**; **partial observability**; theory-of-mind style incentives | Five specialist roles + IC; coalition votes on harder incidents; IC observation is a slice, not full state | `server/environment.py`, `server/agents.py`, INC003+ in `server/incidents.py`, manual validation + `/state/{session_id}` |
|
| 33 |
+
| **Theme 2 — Long horizon & instruction following** | Long-horizon track | **Sparse / delayed** reward; task **beyond one context**; decomposition & recovery; Scale sub-theme: **non-code** business workflows (incl. **HR & IT**) | Episode-level reward on `done`; long `max_steps` incidents; **server-side** session state and tool history so the task cannot fit in one static transcript; **IT ops** coordination (not a coding benchmark) | `server/reward.py`, `server/app.py` session store, INC006–INC007 length/complexity; see Scale AI row in `SUBTHEME_EVIDENCE_MATRIX.md` |
|
| 34 |
+
| **Theme 3.1 — World modeling (professional)** | World modeling | Realistic tools/workflows; **anti-shortcut**; causal / persistent world | Datadog / runbook / portal-style tools; evidence-gated diagnosis; runbook steps; red herrings on harder tracks | `server/tools.py`, `server/reward.py`, `docs/project/REWARD_HACKING_DEFENSE.md` |
|
| 35 |
+
| **Theme 3.2 — World modeling (personalized)** | Personalized track | Personal delegation / conflicts / messaging-style tasks | **Dedicated incident INC008** (EA calendar: board prep vs school concert, family thread, auto-accept root cause) using `IncidentType.PERSONAL_ASSISTANT`, same multi-agent tool loop as ops incidents. **Plus** enterprise paths: customer **notifications** and SLA framing on INC001–INC007. | Manual validation **INC008** on dashboard; `server/incidents.py` (`INC008`), `server/data_models.py` |
|
| 36 |
+
| **Theme 4 — Self-improvement** | Self-improvement track | Curriculum / adaptive difficulty; recursive capability growth | **Process-wide adaptive tier:** `server/global_curriculum.py` + `GET /curriculum` — last-5 rolling avg ≥ 0.55 promotes difficulty across **HTTP/Colab sessions** (not lost per `NexusEnvironment()`). **Plus** seven-incident ladder + **Snorkel-style** rotating `expert_criteria`. Full recursive self-play is out of scope; GRPO improves policy externally. | `server/difficulty.py`, `server/global_curriculum.py`, `server/app.py` `/curriculum`, Colab GRPO |
|
| 37 |
+
| **Theme 5 — Wild card** | Wild card | Creative value for LLM training on a defined task | **Primary positioning option:** “Out-of-box” fusion of **multi-agent ops + schema drift + oversight + token-scaled depth bonus** in one OpenEnv-deployable stack | Pitch close in `docs/pitch/PITCH.md`; innovation narrative in rubric row 1 |
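The Theme 4 row describes promotion when the last-5 rolling average reaches 0.55. A minimal sketch of that rule follows. It is not the code in `server/global_curriculum.py`: the class name, the `max_tier` cap, and the choice to clear the window after a promotion are assumptions made for illustration.

```python
from collections import deque


class RollingCurriculum:
    """Promote the difficulty tier when the rolling mean clears a threshold."""

    def __init__(self, window=5, threshold=0.55, max_tier=3):
        self.window = window
        self.threshold = threshold
        self.max_tier = max_tier  # hypothetical cap, not from the source
        self.tier = 0
        self.rewards = deque(maxlen=window)

    def record(self, reward):
        """Add one episode reward and return the (possibly updated) tier."""
        self.rewards.append(reward)
        window_full = len(self.rewards) == self.window
        if window_full and self.tier < self.max_tier:
            average = sum(self.rewards) / self.window
            if average >= self.threshold:
                self.tier += 1
                # Assumption: start a fresh window after each promotion.
                self.rewards.clear()
        return self.tier


curriculum = RollingCurriculum()
for reward in [0.6, 0.6, 0.6, 0.6, 0.6]:
    tier = curriculum.record(reward)
print(tier)  # prints 1
```

Keeping a single module-level instance of such a tracker is one way to make the tier survive across sessions, which is the property the row attributes to the process-wide curriculum.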
|
| 38 |
|
| 39 |
+
### Sub-theme bonuses — already locked in `SUBTHEME_EVIDENCE_MATRIX.md`
|
| 40 |
|
| 41 |
+
Fleet, Halluminate, Snorkel, Patronus, Mercor, Scaler AI Labs rows remain the detailed sponsor map. This matrix ties **parent themes** to the same implementation so evaluators see both **theme** and **sponsor** coverage.
|
| 42 |
|
| 43 |
### One-line pitch bank (theme → sentence)
|
| 44 |
|
|
|
|
| 47 |
1. **Theme 1:** “IC coordinates five roles under partial observability and coalition mechanics.”
|
| 48 |
2. **Theme 2:** “Sparse end-of-episode reward and persistent server state force long-horizon planning beyond a single context.”
|
| 49 |
3. **Theme 3.1:** “Tool-bound enterprise workflows with anti-shortcut evidence and runbook discipline.”
|
| 50 |
+
4. **Theme 3.2:** “**INC008** is a personalized EA-style conflict (calendar + family messaging) on the same engine; other incidents stress customer delegation under SLA.”
|
| 51 |
5. **Theme 4:** “**`/curriculum`** shows live adaptive difficulty from rolling rewards; expert criteria rotate; GRPO improves the policy.”
|
| 52 |
6. **Theme 5:** “Wild-card angle: one deployable environment that fuses the hardest ops themes for LLM incident command training.”
|
| 53 |
|
docs/project/CURRICULUM_AND_ABLATION.md
CHANGED
|
@@ -17,7 +17,7 @@ NEXUS follows a staged difficulty pattern:
|
|
| 17 |
|
| 18 |
The environment and reward system are explicitly structured to support this progression (`server/environment.py`, `server/difficulty.py`, `server/incidents.py`).
|
| 19 |
|
| 20 |
-
## Why this curriculum is valid for
|
| 21 |
|
| 22 |
- Keeps success probability > 0 early (prevents RL stall).
|
| 23 |
- Increases branching complexity only after stable basic policy behavior.
|
|
|
|
| 17 |
|
| 18 |
The environment and reward system are explicitly structured to support this progression (`server/environment.py`, `server/difficulty.py`, `server/incidents.py`).
|
| 19 |
|
| 20 |
+
## Why this curriculum is valid for hackathon compliance + Self-Serve guidance
|
| 21 |
|
| 22 |
- Keeps success probability > 0 early (prevents RL stall).
|
| 23 |
- Increases branching complexity only after stable basic policy behavior.
|
docs/project/FINAL_OPERATIONS_RUNBOOK.md
CHANGED
|
@@ -65,7 +65,7 @@ Purpose: deterministic final-day operations with explicit fallback paths and no
|
|
| 65 |
- `/health`
|
| 66 |
- `/metadata`
|
| 67 |
- `/schema`
|
| 68 |
-
- Pivot to
|
| 69 |
- Do not attempt risky hotfixes during final window.
|
| 70 |
|
| 71 |
## Roles and ownership
|
|
|
|
| 65 |
- `/health`
|
| 66 |
- `/metadata`
|
| 67 |
- `/schema`
|
| 68 |
+
- Pivot to compliance proof + archived transcript evidence.
|
| 69 |
- Do not attempt risky hotfixes during final window.
|
| 70 |
|
| 71 |
## Roles and ownership
|
docs/project/FINAL_READINESS_REPORT.md
CHANGED
|
@@ -45,12 +45,12 @@ Completed deliverables:
|
|
| 45 |
- Network traces show successful API polling and interactions (`200` status on judge-critical endpoints).
|
| 46 |
- No blocking console errors observed.
|
| 47 |
|
| 48 |
-
##
|
| 49 |
|
| 50 |
- OpenEnv workflow compliance: **pass**
|
| 51 |
- Colab training script path (TRL/Unsloth): **present and documented**
|
| 52 |
- Public artifact path (blog/video): **script and references prepared**
|
| 53 |
-
-
|
| 54 |
|
| 55 |
## Residual risks
|
| 56 |
|
|
|
|
| 45 |
- Network traces show successful API polling and interactions (`200` status on judge-critical endpoints).
|
| 46 |
- No blocking console errors observed.
|
| 47 |
|
| 48 |
+
## Hackathon compliance status
|
| 49 |
|
| 50 |
- OpenEnv workflow compliance: **pass**
|
| 51 |
- Colab training script path (TRL/Unsloth): **present and documented**
|
| 52 |
- Public artifact path (blog/video): **script and references prepared**
|
| 53 |
+
- Rubric evidence mapping and traceability: **locked in compliance and evidence docs**
|
| 54 |
|
| 55 |
## Residual risks
|
| 56 |
|
docs/project/IMPLEMENTATION_SUMMARY.md
CHANGED
|
@@ -76,7 +76,7 @@ NEXUS Enhanced is a multi-agent incident response RL environment for the Meta Py
|
|
| 76 |
**7 Cells**:
|
| 77 |
1. Install: unsloth, trl, transformers, matplotlib
|
| 78 |
2. Connectivity check: Verify HF Space reachable
|
| 79 |
-
3. `NexusRemoteEnv`: Reset/step interface to PUBLIC `https://kunalkachru23-nexus-enhanced.hf.space`
|
| 80 |
4. `reward_fn`: Parse IC action → call remote env → collect reward
|
| 81 |
5. Load Qwen2.5-1.5B: Unsloth QLoRA (rank=16, 4-bit, targets q_proj/k_proj/v_proj/o_proj)
|
| 82 |
6. GRPOTrainer: learning_rate=5e-5, batch_size=2, num_generations=4
|
|
@@ -104,10 +104,10 @@ NEXUS Enhanced is a multi-agent incident response RL environment for the Meta Py
|
|
| 104 |
### Phase 6: HF Spaces Deployment (Ready) 🚀
|
| 105 |
|
| 106 |
**Steps**:
|
| 107 |
-
1. Push code to https://huggingface.co/spaces/kunalkachru23/nexus-enhanced
|
| 108 |
2. HF Spaces auto-builds Docker image
|
| 109 |
3. Services available at:
|
| 110 |
-
- Judge dashboard: `https://kunalkachru23-nexus-enhanced.hf.space/` (port 7860)
|
| 111 |
- Metrics: `/metrics`, `/learning-curve`, `/health`
|
| 112 |
- API: `/reset`, `/step/{session_id}`
|
| 113 |
|
|
@@ -128,7 +128,7 @@ NEXUS Enhanced is a multi-agent incident response RL environment for the Meta Py
|
|
| 128 |
6. ✅ HTML dashboard (`GET /`)
|
| 129 |
7. ✅ Full episode execution (20 steps)
|
| 130 |
|
| 131 |
-
**Run**: `python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced.hf.space`
|
| 132 |
|
| 133 |
---
|
| 134 |
|
|
@@ -239,7 +239,7 @@ git commit -m "Phase 5-7: Docker multi-service setup + deployment tests"
|
|
| 239 |
# Push to HF Spaces repo
|
| 240 |
git push origin main
|
| 241 |
|
| 242 |
-
# Monitor build: https://huggingface.co/spaces/kunalkachru23/nexus-enhanced
|
| 243 |
# Takes ~5-10 minutes for Docker build
|
| 244 |
```
|
| 245 |
|
|
@@ -249,7 +249,7 @@ Once HF Spaces shows "Running":
|
|
| 249 |
|
| 250 |
```bash
|
| 251 |
# Test all endpoints
|
| 252 |
-
python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced.hf.space
|
| 253 |
|
| 254 |
# Expected: ✅ ALL TESTS PASS
|
| 255 |
```
|
|
@@ -260,9 +260,9 @@ Once Phase 7 tests pass:
|
|
| 260 |
|
| 261 |
```
|
| 262 |
1. Open notebooks/grpo_colab_v2.ipynb
|
| 263 |
-
2. Verify BASE_URL = "https://kunalkachru23-nexus-enhanced.hf.space"
|
| 264 |
3. Run all cells (Unsloth + TRL GRPO training)
|
| 265 |
-
4. Monitor reward curves at: https://kunalkachru23-nexus-enhanced.hf.space/learning-curve
|
| 266 |
5. Expected trajectory: baseline 0.28 → improve to 0.6-0.8 over 50-100 episodes
|
| 267 |
```
|
| 268 |
|
|
@@ -277,7 +277,7 @@ Once Phase 7 tests pass:
|
|
| 277 |
| **Reward Progress** | 20% | Observable Chart.js curves + MTTR improvements | ✅ Dashboard ready |
|
| 278 |
| **Pipeline** | 10% | GRPO on Colab GPU → HF Space API | ✅ Tests ready |
|
| 279 |
|
| 280 |
-
**
|
| 281 |
- ✅ OpenEnv v0.2.3 compatible
|
| 282 |
- ✅ HuggingFace TRL GRPO training
|
| 283 |
- ✅ Trained checkpoint (TODO: save during training)
|
|
@@ -300,19 +300,19 @@ bash test_local_deployment.sh
|
|
| 300 |
### HF Space Testing
|
| 301 |
```bash
|
| 302 |
# Against deployed environment
|
| 303 |
-
python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced.hf.space
|
| 304 |
```
|
| 305 |
|
| 306 |
### Manual Verification
|
| 307 |
```bash
|
| 308 |
# FastAPI health
|
| 309 |
-
curl https://kunalkachru23-nexus-enhanced.hf.space/health
|
| 310 |
|
| 311 |
# Judge dashboard
|
| 312 |
-
open https://kunalkachru23-nexus-enhanced.hf.space/
|
| 313 |
|
| 314 |
# Metrics snapshot
|
| 315 |
-
curl https://kunalkachru23-nexus-enhanced.hf.space/metrics
|
| 316 |
```
|
| 317 |
|
| 318 |
---
|
|
@@ -350,7 +350,7 @@ curl https://kunalkachru23-nexus-enhanced.hf.space/metrics
|
|
| 350 |
|
| 351 |
## Questions for User
|
| 352 |
|
| 353 |
-
1. **HF Space URL**:
|
| 354 |
2. **Training time**: Target training duration on Colab GPU (default ~6 hours for 50 episodes)?
|
| 355 |
3. **Checkpoint save**: Should checkpoint be saved to HF model hub or kept local?
|
| 356 |
4. **Blog post**: Topic preference (technical deep-dive vs. storytelling narrative)?
|
|
|
|
| 76 |
**7 Cells**:
|
| 77 |
1. Install: unsloth, trl, transformers, matplotlib
|
| 78 |
2. Connectivity check: Verify HF Space reachable
|
| 79 |
+
3. `NexusRemoteEnv`: Reset/step interface to PUBLIC `https://kunalkachru23-nexus-enhanced-stage.hf.space`
|
| 80 |
4. `reward_fn`: Parse IC action → call remote env → collect reward
|
| 81 |
5. Load Qwen2.5-1.5B: Unsloth QLoRA (rank=16, 4-bit, targets q_proj/k_proj/v_proj/o_proj)
|
| 82 |
6. GRPOTrainer: learning_rate=5e-5, batch_size=2, num_generations=4
|
|
|
|
| 104 |
### Phase 6: HF Spaces Deployment (Ready) 🚀
|
| 105 |
|
| 106 |
**Steps**:
|
| 107 |
+
1. Push code to https://huggingface.co/spaces/kunalkachru23/nexus-enhanced-stage
|
| 108 |
2. HF Spaces auto-builds Docker image
|
| 109 |
3. Services available at:
|
| 110 |
+
- Judge dashboard: `https://kunalkachru23-nexus-enhanced-stage.hf.space/` (port 7860)
|
| 111 |
- Metrics: `/metrics`, `/learning-curve`, `/health`
|
| 112 |
- API: `/reset`, `/step/{session_id}`
|
| 113 |
|
|
|
|
| 128 |
6. ✅ HTML dashboard (`GET /`)
|
| 129 |
7. ✅ Full episode execution (20 steps)
|
| 130 |
|
| 131 |
+
**Run**: `python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space`
|
| 132 |
|
| 133 |
---
|
| 134 |
|
|
|
|
| 239 |
# Push to HF Spaces repo
|
| 240 |
git push origin main
|
| 241 |
|
| 242 |
+
# Monitor build: https://huggingface.co/spaces/kunalkachru23/nexus-enhanced-stage
|
| 243 |
# Takes ~5-10 minutes for Docker build
|
| 244 |
```
|
| 245 |
|
|
|
|
| 249 |
|
| 250 |
```bash
|
| 251 |
# Test all endpoints
|
| 252 |
+
python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space
|
| 253 |
|
| 254 |
# Expected: ✅ ALL TESTS PASS
|
| 255 |
```
|
|
|
|
| 260 |
|
| 261 |
```
|
| 262 |
1. Open notebooks/grpo_colab_v2.ipynb
|
| 263 |
+
2. Verify BASE_URL = "https://kunalkachru23-nexus-enhanced-stage.hf.space"
|
| 264 |
3. Run all cells (Unsloth + TRL GRPO training)
|
| 265 |
+
4. Monitor reward curves at: https://kunalkachru23-nexus-enhanced-stage.hf.space/learning-curve
|
| 266 |
5. Expected trajectory: baseline 0.28 → improve to 0.6-0.8 over 50-100 episodes
|
| 267 |
```
|
| 268 |
|
|
|
|
| 277 |
| **Reward Progress** | 20% | Observable Chart.js curves + MTTR improvements | ✅ Dashboard ready |
|
| 278 |
| **Pipeline** | 10% | GRPO on Colab GPU → HF Space API | ✅ Tests ready |
|
| 279 |
|
| 280 |
+
**Mandatory submission requirements**:
|
| 281 |
- ✅ OpenEnv v0.2.3 compatible
|
| 282 |
- ✅ HuggingFace TRL GRPO training
|
| 283 |
- ✅ Trained checkpoint (TODO: save during training)
|
|
|
|
| 300 |
### HF Space Testing
|
| 301 |
```bash
|
| 302 |
# Against deployed environment
|
| 303 |
+
python test_hf_space_deployment.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space
|
| 304 |
```
|
| 305 |
|
| 306 |
### Manual Verification
|
| 307 |
```bash
|
| 308 |
# FastAPI health
|
| 309 |
+
curl https://kunalkachru23-nexus-enhanced-stage.hf.space/health
|
| 310 |
|
| 311 |
# Judge dashboard
|
| 312 |
+
open https://kunalkachru23-nexus-enhanced-stage.hf.space/
|
| 313 |
|
| 314 |
# Metrics snapshot
|
| 315 |
+
curl https://kunalkachru23-nexus-enhanced-stage.hf.space/metrics
|
| 316 |
```
|
| 317 |
|
| 318 |
---
|
|
|
|
| 350 |
|
| 351 |
## Questions for User
|
| 352 |
|
| 353 |
+
1. **HF Space URL**: Canonical judge/demo Space is `kunalkachru23/nexus-enhanced-stage` (`kunalkachru23-nexus-enhanced-stage.hf.space`).
|
| 354 |
2. **Training time**: Target training duration on Colab GPU (default ~6 hours for 50 episodes)?
|
| 355 |
3. **Checkpoint save**: Should checkpoint be saved to HF model hub or kept local?
|
| 356 |
4. **Blog post**: Topic preference (technical deep-dive vs. storytelling narrative)?
|
docs/project/JUDGING_EVIDENCE_INDEX.md
CHANGED
|
@@ -3,7 +3,7 @@
|
|
| 3 |
Snapshot timestamp (UTC): `2026-04-24T16:48:26Z`
|
| 4 |
Stage URL: `https://kunalkachru23-nexus-enhanced-stage.hf.space`
|
| 5 |
|
| 6 |
-
##
|
| 7 |
|
| 8 |
1. OpenEnv latest-release workflow in use
|
| 9 |
- Local package validate: `openenv validate .`
|
|
@@ -20,9 +20,9 @@ Stage URL: `https://kunalkachru23-nexus-enhanced-stage.hf.space`
|
|
| 20 |
- Owner action: publish final link and add URL to submission package.
|
| 21 |
|
| 22 |
4. Compliance lock matrix
|
| 23 |
-
-
|
| 24 |
|
| 25 |
-
## Live metrics snapshot (
|
| 26 |
|
| 27 |
Source endpoints:
|
| 28 |
- `GET /metrics`
|
|
@@ -78,7 +78,7 @@ Canonical demo-day snapshot set (stage URL only):
|
|
| 78 |
- 2-minute live walkthrough: `docs/pitch/DEMO_WALKTHROUGH.md`
|
| 79 |
- <2 minute recording script: `docs/pitch/YOUTUBE_RECORDING_SCRIPT.md`
|
| 80 |
- Manual demo test cases: `docs/pitch/DEMO_MANUAL_TEST_CASES.md`
|
| 81 |
-
-
|
| 82 |
- Sub-theme matrix: `docs/project/SUBTHEME_EVIDENCE_MATRIX.md`
|
| 83 |
- Reward-hacking defense: `docs/project/REWARD_HACKING_DEFENSE.md`
|
| 84 |
- Training audit ledger: `docs/project/TRAINING_AUDIT_LOG.md`
|
|
|
|
| 3 |
Snapshot timestamp (UTC): `2026-04-24T16:48:26Z`
|
| 4 |
Stage URL: `https://kunalkachru23-nexus-enhanced-stage.hf.space`
|
| 5 |
|
| 6 |
+
## Mandatory compliance evidence (OpenEnv + submission artifacts)
|
| 7 |
|
| 8 |
1. OpenEnv latest-release workflow in use
|
| 9 |
- Local package validate: `openenv validate .`
|
|
|
|
| 20 |
- Owner action: publish final link and add URL to submission package.
|
| 21 |
|
| 22 |
4. Compliance lock matrix
|
| 23 |
+
- Rubric and mandatory-requirement mapping: `docs/project/COMPLIANCE_LOCK_MATRIX.md`
|
| 24 |
|
| 25 |
+
## Live metrics snapshot (observable improvement evidence)
|
| 26 |
|
| 27 |
Source endpoints:
|
| 28 |
- `GET /metrics`
|
|
|
|
| 78 |
- 2-minute live walkthrough: `docs/pitch/DEMO_WALKTHROUGH.md`
|
| 79 |
- <2 minute recording script: `docs/pitch/YOUTUBE_RECORDING_SCRIPT.md`
|
| 80 |
- Manual demo test cases: `docs/pitch/DEMO_MANUAL_TEST_CASES.md`
|
| 81 |
+
- Behavior delta proof sheet: `docs/project/BEHAVIORAL_DELTA_PROOF.md`
|
| 82 |
- Sub-theme matrix: `docs/project/SUBTHEME_EVIDENCE_MATRIX.md`
|
| 83 |
- Reward-hacking defense: `docs/project/REWARD_HACKING_DEFENSE.md`
|
| 84 |
- Training audit ledger: `docs/project/TRAINING_AUDIT_LOG.md`
|
docs/project/PLAN_OF_ACTION.md
CHANGED
|
@@ -1,23 +1,23 @@
|
|
| 1 |
-
# Plan of action +
|
| 2 |
|
| 3 |
-
**
|
| 4 |
|
| 5 |
---
|
| 6 |
|
| 7 |
-
##
|
| 8 |
|
| 9 |
| Ref | Requirement | Repo / runtime evidence | Status |
|
| 10 |
|-----|----------------|-------------------------|--------|
|
| 11 |
-
| **
|
| 12 |
-
| **
|
| 13 |
-
| **
|
| 14 |
-
| **
|
| 15 |
-
| **
|
| 16 |
-
| **
|
| 17 |
-
| **
|
| 18 |
-
| **
|
| 19 |
|
| 20 |
-
**
|
| 21 |
|
| 22 |
---
|
| 23 |
|
|
@@ -30,7 +30,7 @@
|
|
| 30 |
| 3 | Execute **`grpo_colab_v2.ipynb`** on Colab T4+ end-to-end | Team | Notebook completes; curve updates on stage |
|
| 31 |
| 4 | **`python scripts/export_reward_plot.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space`** → drop PNG into deck | Team | `docs/images/training_reward_curve.png` in slide asset folder |
|
| 32 |
| 5 | Rehearse **`docs/pitch/PITCH.md`** with live demo (timer **3:00**) | Team | No overrun; Q&A 2:00 bank ready |
|
| 33 |
-
| 6 | **Publish** HF blog *or* record **≤2 min** YouTube; add URL to README | Team |
|
| 34 |
| 7 | Submission package: Space URL, Colab link, blog/video link, `openenv --version` screenshot | Team | Checklist complete |
|
| 35 |
| 8 | (Optional) INC007 **60 s** clip for innovation Q&A | Team | Recorded path in repo or drive link |
|
| 36 |
|
|
|
|
| 1 |
+
# Plan of action + hackathon compliance matrix
|
| 2 |
|
| 3 |
+
**Scope:** This matrix tracks evidence against **hackathon compliance criteria** (OpenEnv toolchain, Colab training with TRL/Unsloth, published blog or short video, pitch format, and judging rubric dimensions). Align deliverables with the official organizer requirements for your submission wave.
|
| 4 |
|
| 5 |
---
|
| 6 |
|
| 7 |
+
## Compliance — strict review (evidence-based)
|
| 8 |
|
| 9 |
| Ref | Requirement | Repo / runtime evidence | Status |
|
| 10 |
|-----|----------------|-------------------------|--------|
|
| 11 |
+
| **C1** | OpenEnv **(latest release)** — not fork, not old | `openenv validate .` OK; `openenv validate --url` OK after contract routes; `openenv push` workflow; Colab `pip install openenv>=0.2.3`; README **OpenEnv (reproduce)** section | **Pass** — document `openenv --version` on submission day |
|
| 12 |
+
| **C2** | Minimal training in **Colab** with **Unsloth** or **HF TRL** | `notebooks/grpo_colab_v2.ipynb` installs TRL + Unsloth + trains GRPO | **Pass pending** — you must execute notebook once on GPU before final submit |
|
| 13 |
+
| **C3** | **HF blog** *or* **YouTube video < 2 min** | `docs/blog/blog_post_hf.md` draft exists; **publish** + URL in README/submission | **Gap** — publishing is owner action |
|
| 14 |
+
| **C4** | **3 min** pitch + **2 min** Q&A | `docs/pitch/PITCH.md` script + Q&A table timed to format | **Pass** (content) — rehearsal is owner action |
|
| 15 |
+
| **R1** | Environment innovation **40%** | Multi-agent, partial observability, 7 incidents, INC007 schema drift, coalition | **Strong** — rehearse one INC007 sentence |
|
| 16 |
+
| **R2** | Storytelling **30%** | Dashboard + demo flow in `docs/pitch/DEMO_MANUAL_TEST_CASES.md` + `docs/pitch/PITCH.md` | **Pass** — practice run |
|
| 17 |
+
| **R3** | Observable reward improvement **20%** | `/learning-curve`, dashboard, `scripts/export_reward_plot.py` | **Pass** — keep Space populated for live curve |
|
| 18 |
+
| **R4** | Reward + pipeline coherence **10%** | Sparse reward, dimensions in README; trained **behaviour** narrative | **Medium** — tie checkpoint to different IC actions, not only reward |
|
| 19 |
|
| 20 |
+
**Additional quality gates:** `pytest tests/` green; `test_hf_space_deployment.py` 8/8 on stage URL; `./gate.sh` or `scripts/shell/gate.sh` optional full run before deploy.
|
| 21 |
|
| 22 |
---
|
| 23 |
|
|
|
|
| 30 |
| 3 | Execute **`grpo_colab_v2.ipynb`** on Colab T4+ end-to-end | Team | Notebook completes; curve updates on stage |
|
| 31 |
| 4 | **`python scripts/export_reward_plot.py --url https://kunalkachru23-nexus-enhanced-stage.hf.space`** → drop PNG into deck | Team | `docs/images/training_reward_curve.png` in slide asset folder |
|
| 32 |
| 5 | Rehearse **`docs/pitch/PITCH.md`** with live demo (timer **3:00**) | Team | No overrun; Q&A 2:00 bank ready |
|
| 33 |
+
| 6 | **Publish** HF blog *or* record **≤2 min** YouTube; add URL to README | Team | Hackathon submission artifact (blog or video) satisfied |
|
| 34 |
| 7 | Submission package: Space URL, Colab link, blog/video link, `openenv --version` screenshot | Team | Checklist complete |
|
| 35 |
| 8 | (Optional) INC007 **60 s** clip for innovation Q&A | Team | Recorded path in repo or drive link |
|
| 36 |
|
docs/project/PROJECT_STATUS.md
CHANGED
|
@@ -1,7 +1,8 @@
|
|
| 1 |
# NEXUS Enhanced — project status & backlog
|
| 2 |
|
| 3 |
-
**
|
| 4 |
-
|
|
|
|
| 5 |
|
| 6 |
**See also:** [`../pitch/DEMO_MANUAL_TEST_CASES.md`](../pitch/DEMO_MANUAL_TEST_CASES.md).
|
| 7 |
|
|
@@ -22,24 +23,24 @@
|
|
| 22 |
|
| 23 |
---
|
| 24 |
|
| 25 |
-
##
|
| 26 |
|
| 27 |
| # | Requirement | Status |
|
| 28 |
|---|-------------|--------|
|
| 29 |
-
| 1 | OpenEnv **latest release** | **Evidence in repo:** README
|
| 30 |
| 2 | Minimal **Colab** training script (**Unsloth** or **HF TRL**) | **Notebook aligned:** `grpo_colab_v2.ipynb` now defaults `BASE_URL` to **stage** (`kunalkachru23-nexus-enhanced-stage.hf.space`). You still need one successful T4+ run before submission. |
|
| 31 |
| 3 | **Blog (HF)** or **Video (YouTube, <2 min)** | **You own:** publish + link in submission. |
|
| 32 |
|
| 33 |
---
|
| 34 |
|
| 35 |
-
## Judging rubric
|
| 36 |
|
| 37 |
| Criterion | Weight | Focus next |
|
| 38 |
|-----------|--------|------------|
|
| 39 |
| Environment innovation | 40% | One sharp “why NEXUS is hard” story (partial observability, schema drift, coalitions) backed by INC007 / live UI. |
|
| 40 |
| Storytelling | 30% | 3-minute pitch script rehearsed; demo path: metrics → auto-demo → guided manual complete. |
|
| 41 |
| Observable reward improvement | 20% | Keep dashboard + `/learning-curve` honest; optional: export static plot artifact for slides. |
|
| 42 |
-
| Reward / pipeline coherence | 10% | Tie reward dimensions to
|
| 43 |
|
| 44 |
---
|
| 45 |
|
|
@@ -47,7 +48,7 @@
|
|
| 47 |
|
| 48 |
1. ~~**Submission hygiene:** OpenEnv reproduce block in README + `outputs/` for clean `openenv validate .`~~ (done this iteration).
|
| 49 |
2. **Colab:** Run `grpo_colab_v2.ipynb` once on T4+; capture reward curve screenshot for slides.
|
| 50 |
-
3. **
|
| 51 |
4. **Pitch (30% storytelling):** 3-minute path: Training tab metrics → **Run Auto-Demo** → Manual **Guided: fill + execute** to complete (see [`../pitch/DEMO_MANUAL_TEST_CASES.md`](../pitch/DEMO_MANUAL_TEST_CASES.md)).
|
| 52 |
5. **Optional hardening:** Richer `/schema` from Pydantic models (cosmetic).
|
| 53 |
|
|
|
|
| 1 |
# NEXUS Enhanced — project status & backlog
|
| 2 |
|
| 3 |
+
**Compliance reference:** Use **hackathon compliance criteria** (OpenEnv latest in toolchain, Colab training with TRL/Unsloth, published blog or ≤2 min video, pitch format, judging rubric). This file tracks status against those expectations; compliance wording lives in-repo only (no linked external design doc).
|
| 4 |
+
|
| 5 |
+
**Last reviewed:** [`../pitch/PITCH.md`](../pitch/PITCH.md), [`PLAN_OF_ACTION.md`](PLAN_OF_ACTION.md), [`../../scripts/export_reward_plot.py`](../../scripts/export_reward_plot.py), [`COMPLIANCE_LOCK_MATRIX.md`](COMPLIANCE_LOCK_MATRIX.md).
|
| 6 |
|
| 7 |
**See also:** [`../pitch/DEMO_MANUAL_TEST_CASES.md`](../pitch/DEMO_MANUAL_TEST_CASES.md).
|
| 8 |
|
|
|
|
| 23 |
|
| 24 |
---
|
| 25 |
|
| 26 |
+
## Mandatory submission checklist (OpenEnv + artifacts)
|
| 27 |
|
| 28 |
| # | Requirement | Status |
|
| 29 |
|---|-------------|--------|
|
| 30 |
+
| 1 | OpenEnv **latest release** | **Evidence in repo:** README **OpenEnv (reproduce)** commands; `openenv validate .` + `openenv validate --url` green after stubs; Colab still `pip install openenv>=0.2.3`. Record `openenv --version` in your pitch appendix. |
|
| 31 |
| 2 | Minimal **Colab** training script (**Unsloth** or **HF TRL**) | **Notebook aligned:** `grpo_colab_v2.ipynb` now defaults `BASE_URL` to **stage** (`kunalkachru23-nexus-enhanced-stage.hf.space`). You still need one successful T4+ run before submission. |
|
| 32 |
| 3 | **Blog (HF)** or **Video (YouTube, <2 min)** | **You own:** publish + link in submission. |
|
| 33 |
|
| 34 |
---
|
| 35 |
|
| 36 |
+
## Judging rubric — quick gap scan
|
| 37 |
|
| 38 |
| Criterion | Weight | Focus next |
|
| 39 |
|-----------|--------|------------|
|
| 40 |
| Environment innovation | 40% | One sharp “why NEXUS is hard” story (partial observability, schema drift, coalitions) backed by INC007 / live UI. |
|
| 41 |
| Storytelling | 30% | 3-minute pitch script rehearsed; demo path: metrics → auto-demo → guided manual complete. |
|
| 42 |
| Observable reward improvement | 20% | Keep dashboard + `/learning-curve` honest; optional: export static plot artifact for slides. |
|
| 43 |
+
| Reward / pipeline coherence | 10% | Tie reward dimensions to the published reward model; show before/after behaviour if you have checkpoints. |
|
| 44 |
|
| 45 |
---
|
| 46 |
|
|
|
|
| 48 |
|
| 49 |
1. ~~**Submission hygiene:** OpenEnv reproduce block in README + `outputs/` for clean `openenv validate .`~~ (done this iteration).
|
| 50 |
2. **Colab:** Run `grpo_colab_v2.ipynb` once on T4+; capture reward curve screenshot for slides.
|
| 51 |
+
3. **Submission artifacts:** Publish HF **blog** or **YouTube <2 min** and add the link next to README “Blog Post” section.
|
| 52 |
4. **Pitch (30% storytelling):** 3-minute path: Training tab metrics → **Run Auto-Demo** → Manual **Guided: fill + execute** to complete (see [`../pitch/DEMO_MANUAL_TEST_CASES.md`](../pitch/DEMO_MANUAL_TEST_CASES.md)).
|
| 53 |
5. **Optional hardening:** Richer `/schema` from Pydantic models (cosmetic).
|
| 54 |
|
docs/project/SUBTHEME_EVIDENCE_MATRIX.md
CHANGED
|
@@ -1,14 +1,14 @@
|
|
| 1 |
# Sub-Theme Evidence Matrix (Judge-Ready)
|
| 2 |
|
| 3 |
-
This matrix maps implemented mechanics to
|
| 4 |
|
| 5 |
-
**Parent
|
| 6 |
|
| 7 |
## Targeted sub-themes
|
| 8 |
|
| 9 |
-
| Sponsor / Sub-theme |
|
| 10 |
|---|---|---|---|
|
| 11 |
-
| **Theme 3.2 — Personalized
|
| 12 |
| Fleet AI — Scalable Oversight | "monitor, analyze, and explain" | Oversight-oriented behavior + oversight reward component in final score model | `server/reward.py`, `server/agents.py`, live run transcript from `/demo/run/INC003` |
|
| 13 |
| Halluminate — Multi-Actor Environments | "interacts with and manages multiple actors ... to discover and achieve task" | IC orchestrates L1/L2/SRE/PM actions with partial observability; coalition mechanics present | `server/environment.py`, `server/agents.py`, `server/incidents.py` (INC003+), dashboard manual flow |
|
| 14 |
| Snorkel AI — Simulated Experts | "changing requirements/preferences" | Rotating expert criteria and adaptive scoring emphasis over episodes | `server/reward.py`, project docs (`README.md`, `docs/project/PLAN_OF_ACTION.md`) |
|
|
@@ -17,9 +17,9 @@ This matrix maps implemented mechanics to BRD wording and where judges can verif
|
|
| 17 |
| Scale AI — Non-code business (HR & IT) | Long-horizon **non-code** workflows in Sales / PM / **HR & IT** only | **IT / on-call incident command** (status pages, escalations, runbooks, customer comms)—no code-writing task as the core object | Multi-step dashboard validation, SLA/revenue semantics in `server/incidents.py`, L1 customer paths |
|
| 18 |
| Scaler AI Labs — Multi-App Enterprise RL | "business rule nuances" in enterprise multi-app world | Datadog/Jira/Runbook/Customer interactions with operational constraints and role-specific visibility | `server/tools.py`, `server/incidents.py`, dashboard and auto-demo flow |
|
| 19 |
|
| 20 |
-
## Cross-
|
| 21 |
|
| 22 |
-
-
|
| 23 |
-
-
|
| 24 |
-
-
|
| 25 |
-
-
|
|
|
|
| 1 |
# Sub-Theme Evidence Matrix (Judge-Ready)
|
| 2 |
|
| 3 |
+
This matrix maps implemented mechanics to **organizer theme wording** and where judges can verify each claim.
|
| 4 |
|
| 5 |
+
**Parent hackathon themes** and the **four judging rubric rows** are mapped end-to-end in `docs/project/COMPLIANCE_LOCK_MATRIX.md` (theme demonstration + demo beats). This file focuses on **sponsor sub-themes** and cross-links to the same implementation paths.
|
| 6 |
|
| 7 |
## Targeted sub-themes
|
| 8 |
|
| 9 |
+
| Sponsor / Sub-theme | Theme wording to satisfy | Implemented evidence | Where to verify |
|
| 10 |
|---|---|---|---|
|
| 11 |
+
| **Theme 3.2 — Personalized** | Personal tasks, delegation, conflicting priorities | **INC008** — executive EA calendar conflict (family vs board), smart-scheduler auto-accept root cause; `IncidentType.PERSONAL_ASSISTANT` | Dashboard manual validation select **INC008**; `server/incidents.py` |
|
| 12 |
| Fleet AI — Scalable Oversight | "monitor, analyze, and explain" | Oversight-oriented behavior + oversight reward component in final score model | `server/reward.py`, `server/agents.py`, live run transcript from `/demo/run/INC003` |
|
| 13 |
| Halluminate — Multi-Actor Environments | "interacts with and manages multiple actors ... to discover and achieve task" | IC orchestrates L1/L2/SRE/PM actions with partial observability; coalition mechanics present | `server/environment.py`, `server/agents.py`, `server/incidents.py` (INC003+), dashboard manual flow |
|
| 14 |
| Snorkel AI — Simulated Experts | "changing requirements/preferences" | Rotating expert criteria and adaptive scoring emphasis over episodes | `server/reward.py`, project docs (`README.md`, `docs/project/PLAN_OF_ACTION.md`) |
|
|
|
|
| 17 |
| Scale AI — Non-code business (HR & IT) | Long-horizon **non-code** workflows in Sales / PM / **HR & IT** only | **IT / on-call incident command** (status pages, escalations, runbooks, customer comms)—no code-writing task as the core object | Multi-step dashboard validation, SLA/revenue semantics in `server/incidents.py`, L1 customer paths |
|
| 18 |
| Scaler AI Labs — Multi-App Enterprise RL | "business rule nuances" in enterprise multi-app world | Datadog/Jira/Runbook/Customer interactions with operational constraints and role-specific visibility | `server/tools.py`, `server/incidents.py`, dashboard and auto-demo flow |
|
| 19 |
|
| 20 |
+
## Cross-rubric reinforcement
|
| 21 |
|
| 22 |
+
- Innovation (40%): multi-agent + partial observability + schema drift + business-rule constraints.
|
| 23 |
+
- Storytelling (30%): deterministic live flow in `docs/pitch/PITCH.md` and `docs/pitch/DEMO_WALKTHROUGH.md`.
|
| 24 |
+
- Observable improvement (20%): `/learning-curve`, `/metrics`, `docs/images/training_reward_curve.png`.
|
| 25 |
+
- Pipeline coherence (10%): Colab GRPO script + behavior delta sheet (`docs/project/BEHAVIORAL_DELTA_PROOF.md`).
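The "Observable improvement" row above rests on per-episode rewards and a rolling average compared against the baseline of `0.265` quoted in the README plot caption. A minimal sketch of that check follows, assuming records shaped like the entries in `episode_rewards.json` (each with a `reward` field). The helper names `rolling_mean` and `summarize` and the window size are illustrative assumptions; this is not the logic in `scripts/export_reward_plot.py`.

```python
import json
from statistics import mean

# Baseline quoted in the README plot caption; treated here as a fixed constant.
BASELINE = 0.265


def rolling_mean(values, window):
    """Trailing mean over at most `window` of the most recent values."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def summarize(records, window=10):
    """Hypothetical helper: summarize rewards from a non-empty list of records."""
    rewards = [float(r["reward"]) for r in records]
    smoothed = rolling_mean(rewards, window)
    return {
        "episodes": len(rewards),
        "first_window_mean": mean(rewards[:window]),
        "last_window_mean": mean(rewards[-window:]),
        "final_rolling_mean": smoothed[-1],
        "above_baseline": smoothed[-1] > BASELINE,
    }


# Four records copied from the start of episode_rewards.json, trimmed to two fields.
sample = json.loads(
    '[{"session_id": "legacy-1", "reward": 0.3047},'
    ' {"session_id": "legacy-2", "reward": 0.2568},'
    ' {"session_id": "legacy-3", "reward": 0.3226},'
    ' {"session_id": "legacy-4", "reward": 0.3956}]'
)

if __name__ == "__main__":
    print(summarize(sample, window=2))
```

To run it on the full file, replace `sample` with `json.load(open("episode_rewards.json"))`. Comparing `first_window_mean` with `last_window_mean` gives a rough before/after figure, but it says nothing about which incidents or run IDs produced the rewards, so filter the records first if that matters.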
|
docs/project/TEST_RESULTS_SUMMARY.md
CHANGED
|
@@ -474,7 +474,7 @@ Episodes will complete naturally without needing the "End Episode" button workar
|
|
| 474 |
| **API Tests** | ✅ PASS | Episode completion with reward |
|
| 475 |
| **UI Playwright** | ✅ PASS | 8-iteration state management |
|
| 476 |
| **Manual Testing** | ⚠️ PARTIAL | Phase progression works, reward needs workaround |
|
| 477 |
-
| **Environment Audit** | ✅ PASS | All
|
| 478 |
|
| 479 |
---
|
| 480 |
|
|
|
|
| 474 |
| **API Tests** | ✅ PASS | Episode completion with reward |
|
| 475 |
| **UI Playwright** | ✅ PASS | 8-iteration state management |
|
| 476 |
| **Manual Testing** | ⚠️ PARTIAL | Phase progression works, reward needs workaround |
|
| 477 |
+
| **Environment Audit** | ✅ PASS | All hackathon compliance requirements met |
|
| 478 |
|
| 479 |
---
|
| 480 |
|
episode_rewards.json
CHANGED
|
@@ -1 +1 @@
|
|
| 1 |
-
[{"session_id": "legacy-1", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3047, "timestamp": 0.0}, {"session_id": "legacy-2", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2568, "timestamp": 0.0}, {"session_id": "legacy-3", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3226, "timestamp": 0.0}, {"session_id": "legacy-4", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3956, "timestamp": 0.0}, {"session_id": "legacy-5", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2579, "timestamp": 0.0}, {"session_id": "legacy-6", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2608, "timestamp": 0.0}, {"session_id": "legacy-7", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4088, "timestamp": 0.0}, {"session_id": "legacy-8", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3468, "timestamp": 0.0}, {"session_id": "legacy-9", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2507, "timestamp": 0.0}, {"session_id": "legacy-10", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3346, "timestamp": 0.0}, {"session_id": "legacy-11", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.257, "timestamp": 0.0}, {"session_id": "legacy-12", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2597, "timestamp": 0.0}, {"session_id": "legacy-13", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3193, "timestamp": 0.0}, {"session_id": "legacy-14", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.1497, "timestamp": 0.0}, {"session_id": "legacy-15", "run_id": "default", "incident_id": "unknown", "difficulty": 
"unknown", "reward": 0.1677, "timestamp": 0.0}, {"session_id": "legacy-16", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2636, "timestamp": 0.0}, {"session_id": "legacy-17", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2305, "timestamp": 0.0}, {"session_id": "legacy-18", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3396, "timestamp": 0.0}, {"session_id": "legacy-19", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2447, "timestamp": 0.0}, {"session_id": "legacy-20", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2073, "timestamp": 0.0}, {"session_id": "legacy-21", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4404, "timestamp": 0.0}, {"session_id": "legacy-22", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.308, "timestamp": 0.0}, {"session_id": "legacy-23", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3344, "timestamp": 0.0}, {"session_id": "legacy-24", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2179, "timestamp": 0.0}, {"session_id": "legacy-25", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2912, "timestamp": 0.0}, {"session_id": "legacy-26", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3466, "timestamp": 0.0}, {"session_id": "legacy-27", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2485, "timestamp": 0.0}, {"session_id": "legacy-28", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3736, "timestamp": 0.0}, {"session_id": "legacy-29", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2984, "timestamp": 0.0}, {"session_id": "legacy-30", 
"run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.326, "timestamp": 0.0}, {"session_id": "legacy-31", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3041, "timestamp": 0.0}, {"session_id": "legacy-32", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5033, "timestamp": 0.0}, {"session_id": "legacy-33", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.357, "timestamp": 0.0}, {"session_id": "legacy-34", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2764, "timestamp": 0.0}, {"session_id": "legacy-35", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4297, "timestamp": 0.0}, {"session_id": "legacy-36", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2691, "timestamp": 0.0}, {"session_id": "legacy-37", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3864, "timestamp": 0.0}, {"session_id": "legacy-38", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2158, "timestamp": 0.0}, {"session_id": "legacy-39", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2693, "timestamp": 0.0}, {"session_id": "legacy-40", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3942, "timestamp": 0.0}, {"session_id": "legacy-41", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4404, "timestamp": 0.0}, {"session_id": "legacy-42", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3979, "timestamp": 0.0}, {"session_id": "legacy-43", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3779, "timestamp": 0.0}, {"session_id": "legacy-44", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 
0.366, "timestamp": 0.0}, {"session_id": "legacy-45", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2747, "timestamp": 0.0}, {"session_id": "legacy-46", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3383, "timestamp": 0.0}, {"session_id": "legacy-47", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3619, "timestamp": 0.0}, {"session_id": "legacy-48", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4863, "timestamp": 0.0}, {"session_id": "legacy-49", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4321, "timestamp": 0.0}, {"session_id": "legacy-50", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2665, "timestamp": 0.0}, {"session_id": "legacy-51", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4363, "timestamp": 0.0}, {"session_id": "legacy-52", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3825, "timestamp": 0.0}, {"session_id": "legacy-53", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3621, "timestamp": 0.0}, {"session_id": "legacy-54", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4681, "timestamp": 0.0}, {"session_id": "legacy-55", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5045, "timestamp": 0.0}, {"session_id": "legacy-56", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4995, "timestamp": 0.0}, {"session_id": "legacy-57", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3607, "timestamp": 0.0}, {"session_id": "legacy-58", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.406, "timestamp": 0.0}, {"session_id": "legacy-59", "run_id": "default", 
"incident_id": "unknown", "difficulty": "unknown", "reward": 0.4602, "timestamp": 0.0}, {"session_id": "legacy-60", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5146, "timestamp": 0.0}, {"session_id": "legacy-61", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4012, "timestamp": 0.0}, {"session_id": "legacy-62", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4275, "timestamp": 0.0}, {"session_id": "legacy-63", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3568, "timestamp": 0.0}, {"session_id": "legacy-64", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3525, "timestamp": 0.0}, {"session_id": "legacy-65", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5161, "timestamp": 0.0}, {"session_id": "legacy-66", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5625, "timestamp": 0.0}, {"session_id": "legacy-67", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4512, "timestamp": 0.0}, {"session_id": "legacy-68", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5401, "timestamp": 0.0}, {"session_id": "legacy-69", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4917, "timestamp": 0.0}, {"session_id": "legacy-70", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4141, "timestamp": 0.0}, {"session_id": "legacy-71", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4975, "timestamp": 0.0}, {"session_id": "legacy-72", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5945, "timestamp": 0.0}, {"session_id": "legacy-73", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4715, "timestamp": 
0.0}, {"session_id": "legacy-74", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.6025, "timestamp": 0.0}, {"session_id": "legacy-75", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2706, "timestamp": 0.0}, {"session_id": "legacy-76", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5489, "timestamp": 0.0}, {"session_id": "legacy-77", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.493, "timestamp": 0.0}, {"session_id": "legacy-78", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.465, "timestamp": 0.0}, {"session_id": "legacy-79", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4992, "timestamp": 0.0}, {"session_id": "legacy-80", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3357, "timestamp": 0.0}, {"session_id": "legacy-81", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4801, "timestamp": 0.0}, {"session_id": "legacy-82", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5291, "timestamp": 0.0}, {"session_id": "legacy-83", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.6217, "timestamp": 0.0}, {"session_id": "legacy-84", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4649, "timestamp": 0.0}, {"session_id": "legacy-85", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4446, "timestamp": 0.0}, {"session_id": "legacy-86", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.9484, "timestamp": 0.0}, {"session_id": "legacy-87", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5883, "timestamp": 0.0}, {"session_id": "legacy-88", "run_id": "default", "incident_id": "unknown", 
"difficulty": "unknown", "reward": 0.5443, "timestamp": 0.0}, {"session_id": "legacy-89", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4785, "timestamp": 0.0}, {"session_id": "legacy-90", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5649, "timestamp": 0.0}, {"session_id": "legacy-91", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5345, "timestamp": 0.0}, {"session_id": "legacy-92", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.6071, "timestamp": 0.0}, {"session_id": "legacy-93", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4764, "timestamp": 0.0}, {"session_id": "legacy-94", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5092, "timestamp": 0.0}, {"session_id": "legacy-95", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.507, "timestamp": 0.0}, {"session_id": "legacy-96", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4242, "timestamp": 0.0}, {"session_id": "legacy-97", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5679, "timestamp": 0.0}, {"session_id": "legacy-98", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.568, "timestamp": 0.0}, {"session_id": "legacy-99", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5504, "timestamp": 0.0}, {"session_id": "legacy-100", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-101", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-102", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": 
"legacy-103", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-104", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.8889, "timestamp": 0.0}, {"session_id": "legacy-105", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.8889, "timestamp": 0.0}, {"session_id": "legacy-106", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.8889, "timestamp": 0.0}, {"session_id": "legacy-107", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-108", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-109", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-110", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-111", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-112", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-113", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-114", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-115", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-116", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-117", "run_id": "default", "incident_id": "unknown", 
"difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-118", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-119", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-120", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-121", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-122", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-123", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-124", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-125", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-126", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-127", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-128", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-129", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-130", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-131", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4252, "timestamp": 0.0}, 
{"session_id": "legacy-132", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4252, "timestamp": 0.0}, {"session_id": "legacy-133", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4252, "timestamp": 0.0}, {"session_id": "a003e153-9026-406c-879d-25172aa11eda", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776965109.7711759}, {"session_id": "bf34a807-a40d-4ae5-b8b0-4d8333a62c81", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776965109.825952}, {"session_id": "8876cfb6-f5e9-4ad0-941e-84bc7d8e2b96", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776967485.476565}, {"session_id": "db31a9f2-ca48-4f93-8d1e-3fa50fa5d21a", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776967485.519014}, {"session_id": "4aaf8a4d-0db7-47f5-9dca-8bc360b19088", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776978808.275504}, {"session_id": "c13f72bb-5715-4209-9872-e85acecbc8b3", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776978808.317802}, {"session_id": "4da60545-6cb0-4a65-acf2-9a8fd8cf7e59", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776982752.592321}, {"session_id": "48a56b54-d4cd-47c9-b2e1-165dd330a6c9", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776982752.632246}, {"session_id": "2eab58c0-ffe3-4a11-9512-3b606eb2a957", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776983166.506444}, {"session_id": "ec0ae7f3-6095-4dce-b323-1cd1482c1ba4", "run_id": 
"pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776983166.548518}, {"session_id": "b720c132-ddce-41fa-99d1-be7ebaa32de2", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776984636.36427}, {"session_id": "edb8bdb1-5903-4710-b9f4-cd68c67f6474", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776984636.4119911}, {"session_id": "2654393f-5c7e-4d17-938d-2570195a3c5f", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776984647.97215}, {"session_id": "c0e98721-13b3-49e3-a3fd-735801f5f9f7", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776984648.018783}, {"session_id": "6fe810db-c9ae-48e7-a205-6d2df902b555", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777049568.085221}, {"session_id": "58c9af47-0d8b-4359-9889-a6a04552db83", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777049568.1298869}, {"session_id": "552395e3-cae4-4423-bc3c-9314b8cc276d", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777049580.030255}, {"session_id": "49a9bb9e-c9b1-465f-a617-6443da42d1be", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777049580.0770292}, {"session_id": "fb4a1426-bf33-47f0-8357-29b19a75a19c", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777051304.085602}, {"session_id": "2714abc5-60be-4bf4-981c-da1c2ba328b5", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777051304.1293068}, {"session_id": 
"69407c50-1635-4bd8-b1b9-e1017cbb5297", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777070951.992904}, {"session_id": "9fe3899c-455e-483b-b5ce-19240df9f0e6", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777070952.039953}]
|
| 1 |
+
[{"session_id": "legacy-1", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3047, "timestamp": 0.0}, {"session_id": "legacy-2", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2568, "timestamp": 0.0}, {"session_id": "legacy-3", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3226, "timestamp": 0.0}, {"session_id": "legacy-4", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3956, "timestamp": 0.0}, {"session_id": "legacy-5", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2579, "timestamp": 0.0}, {"session_id": "legacy-6", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2608, "timestamp": 0.0}, {"session_id": "legacy-7", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4088, "timestamp": 0.0}, {"session_id": "legacy-8", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3468, "timestamp": 0.0}, {"session_id": "legacy-9", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2507, "timestamp": 0.0}, {"session_id": "legacy-10", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3346, "timestamp": 0.0}, {"session_id": "legacy-11", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.257, "timestamp": 0.0}, {"session_id": "legacy-12", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2597, "timestamp": 0.0}, {"session_id": "legacy-13", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3193, "timestamp": 0.0}, {"session_id": "legacy-14", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.1497, "timestamp": 0.0}, {"session_id": "legacy-15", "run_id": "default", "incident_id": "unknown", "difficulty": 
"unknown", "reward": 0.1677, "timestamp": 0.0}, {"session_id": "legacy-16", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2636, "timestamp": 0.0}, {"session_id": "legacy-17", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2305, "timestamp": 0.0}, {"session_id": "legacy-18", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3396, "timestamp": 0.0}, {"session_id": "legacy-19", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2447, "timestamp": 0.0}, {"session_id": "legacy-20", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2073, "timestamp": 0.0}, {"session_id": "legacy-21", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4404, "timestamp": 0.0}, {"session_id": "legacy-22", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.308, "timestamp": 0.0}, {"session_id": "legacy-23", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3344, "timestamp": 0.0}, {"session_id": "legacy-24", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2179, "timestamp": 0.0}, {"session_id": "legacy-25", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2912, "timestamp": 0.0}, {"session_id": "legacy-26", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3466, "timestamp": 0.0}, {"session_id": "legacy-27", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2485, "timestamp": 0.0}, {"session_id": "legacy-28", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3736, "timestamp": 0.0}, {"session_id": "legacy-29", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2984, "timestamp": 0.0}, {"session_id": "legacy-30", 
"run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.326, "timestamp": 0.0}, {"session_id": "legacy-31", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3041, "timestamp": 0.0}, {"session_id": "legacy-32", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5033, "timestamp": 0.0}, {"session_id": "legacy-33", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.357, "timestamp": 0.0}, {"session_id": "legacy-34", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2764, "timestamp": 0.0}, {"session_id": "legacy-35", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4297, "timestamp": 0.0}, {"session_id": "legacy-36", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2691, "timestamp": 0.0}, {"session_id": "legacy-37", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3864, "timestamp": 0.0}, {"session_id": "legacy-38", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2158, "timestamp": 0.0}, {"session_id": "legacy-39", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2693, "timestamp": 0.0}, {"session_id": "legacy-40", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3942, "timestamp": 0.0}, {"session_id": "legacy-41", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4404, "timestamp": 0.0}, {"session_id": "legacy-42", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3979, "timestamp": 0.0}, {"session_id": "legacy-43", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3779, "timestamp": 0.0}, {"session_id": "legacy-44", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 
0.366, "timestamp": 0.0}, {"session_id": "legacy-45", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2747, "timestamp": 0.0}, {"session_id": "legacy-46", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3383, "timestamp": 0.0}, {"session_id": "legacy-47", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3619, "timestamp": 0.0}, {"session_id": "legacy-48", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4863, "timestamp": 0.0}, {"session_id": "legacy-49", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4321, "timestamp": 0.0}, {"session_id": "legacy-50", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2665, "timestamp": 0.0}, {"session_id": "legacy-51", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4363, "timestamp": 0.0}, {"session_id": "legacy-52", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3825, "timestamp": 0.0}, {"session_id": "legacy-53", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3621, "timestamp": 0.0}, {"session_id": "legacy-54", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4681, "timestamp": 0.0}, {"session_id": "legacy-55", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5045, "timestamp": 0.0}, {"session_id": "legacy-56", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4995, "timestamp": 0.0}, {"session_id": "legacy-57", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3607, "timestamp": 0.0}, {"session_id": "legacy-58", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.406, "timestamp": 0.0}, {"session_id": "legacy-59", "run_id": "default", 
"incident_id": "unknown", "difficulty": "unknown", "reward": 0.4602, "timestamp": 0.0}, {"session_id": "legacy-60", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5146, "timestamp": 0.0}, {"session_id": "legacy-61", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4012, "timestamp": 0.0}, {"session_id": "legacy-62", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4275, "timestamp": 0.0}, {"session_id": "legacy-63", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3568, "timestamp": 0.0}, {"session_id": "legacy-64", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3525, "timestamp": 0.0}, {"session_id": "legacy-65", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5161, "timestamp": 0.0}, {"session_id": "legacy-66", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5625, "timestamp": 0.0}, {"session_id": "legacy-67", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4512, "timestamp": 0.0}, {"session_id": "legacy-68", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5401, "timestamp": 0.0}, {"session_id": "legacy-69", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4917, "timestamp": 0.0}, {"session_id": "legacy-70", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4141, "timestamp": 0.0}, {"session_id": "legacy-71", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4975, "timestamp": 0.0}, {"session_id": "legacy-72", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5945, "timestamp": 0.0}, {"session_id": "legacy-73", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4715, "timestamp": 
0.0}, {"session_id": "legacy-74", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.6025, "timestamp": 0.0}, {"session_id": "legacy-75", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.2706, "timestamp": 0.0}, {"session_id": "legacy-76", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5489, "timestamp": 0.0}, {"session_id": "legacy-77", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.493, "timestamp": 0.0}, {"session_id": "legacy-78", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.465, "timestamp": 0.0}, {"session_id": "legacy-79", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4992, "timestamp": 0.0}, {"session_id": "legacy-80", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3357, "timestamp": 0.0}, {"session_id": "legacy-81", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4801, "timestamp": 0.0}, {"session_id": "legacy-82", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5291, "timestamp": 0.0}, {"session_id": "legacy-83", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.6217, "timestamp": 0.0}, {"session_id": "legacy-84", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4649, "timestamp": 0.0}, {"session_id": "legacy-85", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4446, "timestamp": 0.0}, {"session_id": "legacy-86", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.9484, "timestamp": 0.0}, {"session_id": "legacy-87", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5883, "timestamp": 0.0}, {"session_id": "legacy-88", "run_id": "default", "incident_id": "unknown", 
"difficulty": "unknown", "reward": 0.5443, "timestamp": 0.0}, {"session_id": "legacy-89", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4785, "timestamp": 0.0}, {"session_id": "legacy-90", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5649, "timestamp": 0.0}, {"session_id": "legacy-91", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5345, "timestamp": 0.0}, {"session_id": "legacy-92", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.6071, "timestamp": 0.0}, {"session_id": "legacy-93", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4764, "timestamp": 0.0}, {"session_id": "legacy-94", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5092, "timestamp": 0.0}, {"session_id": "legacy-95", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.507, "timestamp": 0.0}, {"session_id": "legacy-96", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4242, "timestamp": 0.0}, {"session_id": "legacy-97", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5679, "timestamp": 0.0}, {"session_id": "legacy-98", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.568, "timestamp": 0.0}, {"session_id": "legacy-99", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.5504, "timestamp": 0.0}, {"session_id": "legacy-100", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-101", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-102", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": 
"legacy-103", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-104", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.8889, "timestamp": 0.0}, {"session_id": "legacy-105", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.8889, "timestamp": 0.0}, {"session_id": "legacy-106", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.8889, "timestamp": 0.0}, {"session_id": "legacy-107", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-108", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-109", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-110", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-111", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-112", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-113", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-114", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-115", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-116", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-117", "run_id": "default", "incident_id": "unknown", 
"difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-118", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-119", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-120", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-121", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-122", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-123", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-124", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-125", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-126", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-127", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-128", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-129", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-130", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.3378, "timestamp": 0.0}, {"session_id": "legacy-131", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4252, "timestamp": 0.0}, 
{"session_id": "legacy-132", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4252, "timestamp": 0.0}, {"session_id": "legacy-133", "run_id": "default", "incident_id": "unknown", "difficulty": "unknown", "reward": 0.4252, "timestamp": 0.0}, {"session_id": "a003e153-9026-406c-879d-25172aa11eda", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776965109.7711759}, {"session_id": "bf34a807-a40d-4ae5-b8b0-4d8333a62c81", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776965109.825952}, {"session_id": "8876cfb6-f5e9-4ad0-941e-84bc7d8e2b96", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776967485.476565}, {"session_id": "db31a9f2-ca48-4f93-8d1e-3fa50fa5d21a", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776967485.519014}, {"session_id": "4aaf8a4d-0db7-47f5-9dca-8bc360b19088", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776978808.275504}, {"session_id": "c13f72bb-5715-4209-9872-e85acecbc8b3", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776978808.317802}, {"session_id": "4da60545-6cb0-4a65-acf2-9a8fd8cf7e59", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776982752.592321}, {"session_id": "48a56b54-d4cd-47c9-b2e1-165dd330a6c9", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776982752.632246}, {"session_id": "2eab58c0-ffe3-4a11-9512-3b606eb2a957", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776983166.506444}, {"session_id": "ec0ae7f3-6095-4dce-b323-1cd1482c1ba4", "run_id": 
"pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776983166.548518}, {"session_id": "b720c132-ddce-41fa-99d1-be7ebaa32de2", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776984636.36427}, {"session_id": "edb8bdb1-5903-4710-b9f4-cd68c67f6474", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776984636.4119911}, {"session_id": "2654393f-5c7e-4d17-938d-2570195a3c5f", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1776984647.97215}, {"session_id": "c0e98721-13b3-49e3-a3fd-735801f5f9f7", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1776984648.018783}, {"session_id": "6fe810db-c9ae-48e7-a205-6d2df902b555", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777049568.085221}, {"session_id": "58c9af47-0d8b-4359-9889-a6a04552db83", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777049568.1298869}, {"session_id": "552395e3-cae4-4423-bc3c-9314b8cc276d", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777049580.030255}, {"session_id": "49a9bb9e-c9b1-465f-a617-6443da42d1be", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777049580.0770292}, {"session_id": "fb4a1426-bf33-47f0-8357-29b19a75a19c", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777051304.085602}, {"session_id": "2714abc5-60be-4bf4-981c-da1c2ba328b5", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777051304.1293068}, {"session_id": 
"29f5e924-bb71-4b4d-85ea-e6f321f8b674", "run_id": "default", "incident_id": "INC001", "difficulty": "easy", "reward": 0.3378, "timestamp": 1777184935.9644742}, {"session_id": "057b4802-136c-4fb3-b62b-0e60014a6fcf", "run_id": "pytest_full_episode_metrics", "incident_id": "INC008", "difficulty": "easy", "reward": 0.4252, "timestamp": 1777184936.008428}]
|
notebooks/grpo_colab_enhanced.ipynb
CHANGED
|
@@ -4,7 +4,24 @@
|
|
| 4 |
"cell_type": "markdown",
|
| 5 |
"metadata": {},
|
| 6 |
"source": [
|
| 7 |
-
"# NEXUS Enhanced
|
| 8 |
]
|
| 9 |
},
|
| 10 |
{
|
|
@@ -13,14 +30,14 @@
|
|
| 13 |
"metadata": {},
|
| 14 |
"outputs": [],
|
| 15 |
"source": [
|
| 16 |
-
"# Validate OpenEnv installation (
|
| 17 |
"try:\n",
|
| 18 |
" import openenv\n",
|
| 19 |
-
" print(f\"
|
| 20 |
"except ImportError:\n",
|
| 21 |
-
" print(\"
|
| 22 |
"\n",
|
| 23 |
-
"print(\"
|
| 24 |
]
|
| 25 |
},
|
| 26 |
{
|
|
@@ -40,7 +57,32 @@
|
|
| 40 |
"cell_type": "markdown",
|
| 41 |
"metadata": {},
|
| 42 |
"source": [
|
| 43 |
-
"## Configuration & HF Space connectivity\n
|
| 44 |
]
|
| 45 |
},
|
| 46 |
{
|
|
@@ -73,15 +115,15 @@
|
|
| 73 |
"IN_COLAB = _in_colab()\n",
|
| 74 |
"\n",
|
| 75 |
"# ---------------------------------------------------------------------------\n",
|
| 76 |
-
"# Notebook configuration
|
| 77 |
"# Enhanced notebook extras:\n",
|
| 78 |
-
"# NEXUS_INCIDENT_POOL
|
| 79 |
-
"# NEXUS_MULTI_INCIDENT
|
| 80 |
-
"# NEXUS_INCIDENT_ROTATION
|
| 81 |
-
"# NEXUS_GRPO_RUN_ID
|
| 82 |
"#\n",
|
| 83 |
"# See grpo_colab_v2.ipynb for the full list of NEXUS_* vars (same as here).\n",
|
| 84 |
-
"# NEXUS_ENABLE_CHECKPOINTS (true/false)
|
| 85 |
"# ---------------------------------------------------------------------------\n",
|
| 86 |
"\n",
|
| 87 |
"BASE_URL = _env(\n",
|
|
@@ -167,13 +209,13 @@
|
|
| 167 |
" if not BACKUP_TO_GOOGLE_DRIVE:\n",
|
| 168 |
" return\n",
|
| 169 |
" if not IN_COLAB:\n",
|
| 170 |
-
" print(\"
|
| 171 |
" return\n",
|
| 172 |
" if os.path.isdir(\"/content/drive/MyDrive\"):\n",
|
| 173 |
-
" print(\"
|
| 174 |
" return\n",
|
| 175 |
" from google.colab import drive\n",
|
| 176 |
-
" print(\"
|
| 177 |
" drive.mount(\"/content/drive\")\n",
|
| 178 |
"\n",
|
| 179 |
"\n",
|
|
@@ -224,12 +266,12 @@
|
|
| 224 |
"try:\n",
|
| 225 |
" resp = requests.get(f\"{BASE_URL}/health\", timeout=5)\n",
|
| 226 |
" if resp.status_code == 200:\n",
|
| 227 |
-
" print(\"
|
| 228 |
" print(f\"Response: {resp.json()}\")\n",
|
| 229 |
" else:\n",
|
| 230 |
-
" print(f\"
|
| 231 |
"except Exception as e:\n",
|
| 232 |
-
" print(f\"
|
| 233 |
" print(f\"URL: {BASE_URL}\")\n",
|
| 234 |
" print(\"Make sure HF Space is deployed and running\")\n"
|
| 235 |
]
|
|
@@ -289,7 +331,7 @@
|
|
| 289 |
" return data[\"observation\"], data[\"reward\"], data[\"done\"], data[\"info\"]\n",
|
| 290 |
"\n",
|
| 291 |
" def get_learning_curve(self, run_id=None):\n",
|
| 292 |
-
" \"\"\"GET /learning-curve
|
| 293 |
" params = {}\n",
|
| 294 |
" rid = run_id if run_id is not None else GRPO_RUN_ID\n",
|
| 295 |
" if rid:\n",
|
|
@@ -304,7 +346,7 @@
|
|
| 304 |
"\n",
|
| 305 |
"\n",
|
| 306 |
"env = NexusRemoteEnv()\n",
|
| 307 |
-
"print(\"
|
| 308 |
"\n",
|
| 309 |
"_NEXUS_DRIVE_RUN_DIR = None\n",
|
| 310 |
"\n",
|
|
@@ -328,7 +370,7 @@
|
|
| 328 |
"\n",
|
| 329 |
"\n",
|
| 330 |
"def backup_nexus_artifacts_to_drive(reason=\"manual\", *, include_learning_curve=True):\n",
|
| 331 |
-
" \"\"\"Copy GRPO checkpoints, PNG plots (if present), learning curve JSON, manifest
|
| 332 |
" if not BACKUP_TO_GOOGLE_DRIVE:\n",
|
| 333 |
" print(f\"[Drive backup:{reason}] skipped (BACKUP_TO_GOOGLE_DRIVE=False)\")\n",
|
| 334 |
" return None\n",
|
|
@@ -336,7 +378,7 @@
|
|
| 336 |
" print(f\"[Drive backup:{reason}] skipped (not Colab)\")\n",
|
| 337 |
" return None\n",
|
| 338 |
" if not os.path.isdir(\"/content/drive/MyDrive\"):\n",
|
| 339 |
-
" print(f\"[Drive backup:{reason}] skipped
|
| 340 |
" return None\n",
|
| 341 |
" dest = _nexus_google_drive_run_dir()\n",
|
| 342 |
" if dest is None:\n",
|
|
@@ -346,22 +388,22 @@
|
|
| 346 |
" if os.path.isdir(GRPO_OUTPUT_DIR):\n",
|
| 347 |
" tgt = os.path.join(dest, \"grpo_checkpoints\")\n",
|
| 348 |
" shutil.copytree(GRPO_OUTPUT_DIR, tgt, dirs_exist_ok=True)\n",
|
| 349 |
-
" print(f\"
|
| 350 |
"\n",
|
| 351 |
" for name in (\"training_analysis.png\", \"reward_curves_hires.png\"):\n",
|
| 352 |
" src = os.path.join(PLOT_OUTPUT_DIR, name)\n",
|
| 353 |
" if os.path.isfile(src):\n",
|
| 354 |
" shutil.copy2(src, os.path.join(dest, name))\n",
|
| 355 |
-
" print(f\"
|
| 356 |
"\n",
|
| 357 |
" if include_learning_curve:\n",
|
| 358 |
" try:\n",
|
| 359 |
" curve = env.get_learning_curve()\n",
|
| 360 |
" with open(os.path.join(dest, \"learning_curve.json\"), \"w\") as f:\n",
|
| 361 |
" _json_backup.dump(curve, f, indent=2)\n",
|
| 362 |
-
" print(\"
|
| 363 |
" except Exception as e:\n",
|
| 364 |
-
" print(f\"
|
| 365 |
"\n",
|
| 366 |
" manifest = {\n",
|
| 367 |
" \"reason\": reason,\n",
|
|
@@ -377,7 +419,7 @@
|
|
| 377 |
" }\n",
|
| 378 |
" with open(os.path.join(dest, \"run_manifest.json\"), \"w\") as f:\n",
|
| 379 |
" _json_backup.dump(manifest, f, indent=2)\n",
|
| 380 |
-
" print(f\"\\n
|
| 381 |
" return dest\n"
|
| 382 |
]
|
| 383 |
},
|
|
@@ -463,7 +505,7 @@
|
|
| 463 |
"\n",
|
| 464 |
"\n",
|
| 465 |
"print(\n",
|
| 466 |
-
" \"
|
| 467 |
" TRAINING_INCIDENT_POOL,\n",
|
| 468 |
" \", rotation=\",\n",
|
| 469 |
" INCIDENT_ROTATION,\n",
|
|
@@ -522,7 +564,7 @@
|
|
| 522 |
" save_steps=GRPO_SAVE_STEPS_QUICK if ONE_ROUND_TRAINING else GRPO_SAVE_STEPS_FULL,\n",
|
| 523 |
")\n",
|
| 524 |
"\n",
|
| 525 |
-
"print(\"
|
| 526 |
]
|
| 527 |
},
|
| 528 |
{
|
|
@@ -541,6 +583,9 @@
|
|
| 541 |
},
|
| 542 |
{
|
| 543 |
"cell_type": "code",
|
| 544 |
"source": [
|
| 545 |
"from datasets import Dataset\n",
|
| 546 |
"import os\n",
|
|
@@ -550,18 +595,18 @@
|
|
| 550 |
"\n",
|
| 551 |
"print(\"\\n\" + \"=\" * 70)\n",
|
| 552 |
"if ONE_ROUND_TRAINING:\n",
|
| 553 |
-
" print(f\"
|
| 554 |
"else:\n",
|
| 555 |
-
" print(f\"
|
| 556 |
"print(\"=\" * 70)\n",
|
| 557 |
"print(\"\\nConfiguration:\")\n",
|
| 558 |
-
"print(f\"
|
| 559 |
-
"print(f\"
|
| 560 |
-
"print(f\"
|
| 561 |
-
"print(f\"
|
| 562 |
-
"print(f\"
|
| 563 |
-
"print(f\"
|
| 564 |
-
"print(\"
|
| 565 |
"print(\"\\nMonitor dashboard:\")\n",
|
| 566 |
"print(f\" {BASE_URL}/\")\n",
|
| 567 |
"print(\"=\" * 70 + \"\\n\")\n",
|
|
@@ -591,30 +636,30 @@
|
|
| 591 |
"if ENABLE_CHECKPOINTS and CHECKPOINT_RESUME and not FORCE_FRESH_RUN:\n",
|
| 592 |
" resume_ckpt = get_last_checkpoint(GRPO_OUTPUT_DIR)\n",
|
| 593 |
"if resume_ckpt:\n",
|
| 594 |
-
" print(f\"
|
| 595 |
"else:\n",
|
| 596 |
-
" print(\"
|
| 597 |
"\n",
|
| 598 |
-
"print(f\"
|
| 599 |
-
"print(\"
|
| 600 |
"trainer.train(resume_from_checkpoint=resume_ckpt)\n",
|
| 601 |
"\n",
|
| 602 |
"print(\"\\n\" + \"=\" * 70)\n",
|
| 603 |
-
"print(\"
|
| 604 |
"print(\"=\" * 70)\n",
|
| 605 |
"print(f\"Dashboard: {BASE_URL}/\")\n",
|
| 606 |
"print(f\"Learning curve API: {BASE_URL}/learning-curve\")\n",
|
| 607 |
-
"print(\"
|
| 608 |
"print(\"=\" * 70)\n",
|
| 609 |
"\n",
|
| 610 |
"backup_nexus_artifacts_to_drive(\"post_training\", include_learning_curve=True)\n"
|
| 611 |
-
]
|
| 612 |
-
"metadata": {},
|
| 613 |
-
"execution_count": null,
|
| 614 |
-
"outputs": []
|
| 615 |
},
|
| 616 |
{
|
| 617 |
"cell_type": "code",
|
| 618 |
"source": [
|
| 619 |
"import os\n",
|
| 620 |
"import matplotlib.pyplot as plt\n",
|
|
@@ -622,7 +667,7 @@
|
|
| 622 |
"os.makedirs(PLOT_OUTPUT_DIR, exist_ok=True)\n",
|
| 623 |
"\n",
|
| 624 |
"print(\"\\n\" + \"=\" * 70)\n",
|
| 625 |
-
"print(\"
|
| 626 |
"print(\"=\" * 70)\n",
|
| 627 |
"print(f\"Using run_id filter: {GRPO_RUN_ID}\")\n",
|
| 628 |
"\n",
|
|
@@ -688,14 +733,14 @@
|
|
| 688 |
" summary_text = f\"\"\"\n",
|
| 689 |
"TRAINING SUMMARY\n",
|
| 690 |
"\n",
|
| 691 |
-
"
|
| 692 |
-
"
|
| 693 |
-
"
|
| 694 |
-
"
|
| 695 |
-
"
|
| 696 |
"\n",
|
| 697 |
-
"
|
| 698 |
-
"
|
| 699 |
" \"\"\"\n",
|
| 700 |
"\n",
|
| 701 |
" ax4.text(\n",
|
|
@@ -709,14 +754,14 @@
|
|
| 709 |
" bbox=dict(boxstyle=\"round\", facecolor=\"#1e293b\", alpha=0.8, edgecolor=\"#0ea5e9\"),\n",
|
| 710 |
" )\n",
|
| 711 |
"\n",
|
| 712 |
-
" plt.suptitle(\"NEXUS Enhanced
|
| 713 |
" plt.tight_layout()\n",
|
| 714 |
"\n",
|
| 715 |
-
" print(\"\\n
|
| 716 |
" p1 = os.path.join(PLOT_OUTPUT_DIR, \"training_analysis.png\")\n",
|
| 717 |
" p2 = os.path.join(PLOT_OUTPUT_DIR, \"reward_curves_hires.png\")\n",
|
| 718 |
" plt.savefig(p1, dpi=150, bbox_inches=\"tight\")\n",
|
| 719 |
-
" print(f\"
|
| 720 |
"\n",
|
| 721 |
" fig_single, ax = plt.subplots(figsize=(14, 7))\n",
|
| 722 |
" ax.plot(episodes, rewards, \"o-\", label=\"Episode Reward\", color=\"#0ea5e9\", markersize=7, linewidth=2.5, alpha=0.8)\n",
|
|
@@ -725,18 +770,18 @@
|
|
| 725 |
" ax.axhline(y=baseline, color=\"#ef4444\", linestyle=\"--\", linewidth=2.5, label=f\"Baseline: {baseline:.3f}\")\n",
|
| 726 |
" ax.set_xlabel(\"Episode\", fontsize=12, fontweight=\"bold\")\n",
|
| 727 |
" ax.set_ylabel(\"Reward Score\", fontsize=12, fontweight=\"bold\")\n",
|
| 728 |
-
" ax.set_title(\"NEXUS Enhanced GRPO Training
|
| 729 |
" ax.legend(fontsize=11, loc=\"lower right\")\n",
|
| 730 |
" ax.grid(True, alpha=0.3, linestyle=\"--\")\n",
|
| 731 |
" ax.set_ylim(-0.05, 1.05)\n",
|
| 732 |
" plt.tight_layout()\n",
|
| 733 |
" plt.savefig(p2, dpi=200, bbox_inches=\"tight\")\n",
|
| 734 |
-
" print(f\"
|
| 735 |
"\n",
|
| 736 |
" plt.show()\n",
|
| 737 |
"\n",
|
| 738 |
" print(\"\\n\" + \"=\" * 70)\n",
|
| 739 |
-
" print(\"
|
| 740 |
" print(\"=\" * 70)\n",
|
| 741 |
" print(f\"\\n{'Metric':<35} {'Value':<20}\")\n",
|
| 742 |
" print(\"-\" * 55)\n",
|
|
@@ -754,25 +799,22 @@
|
|
| 754 |
" late_avg = sum(rewards[-5:]) / 5\n",
|
| 755 |
" print(f\"\\n{'Early Phase (Ep 1-5) Avg':<35} {early_avg:.4f}\")\n",
|
| 756 |
" print(f\"{'Late Phase (Ep -5) Avg':<35} {late_avg:.4f}\")\n",
|
| 757 |
-
" learning_status = \"
|
| 758 |
" print(f\"{'Status':<35} {learning_status:<20}\")\n",
|
| 759 |
"\n",
|
| 760 |
" print(\"\\n\" + \"=\" * 70)\n",
|
| 761 |
-
" print(\"
|
| 762 |
" print(\"=\" * 70)\n",
|
| 763 |
"\n",
|
| 764 |
" backup_nexus_artifacts_to_drive(\"post_plots\", include_learning_curve=True)\n",
|
| 765 |
"\n",
|
| 766 |
"else:\n",
|
| 767 |
-
" print(\"\\n
|
| 768 |
-
" print(\"
|
| 769 |
-
" print(\"
|
| 770 |
-
" print(f\"
|
| 771 |
" backup_nexus_artifacts_to_drive(\"no_rewards_yet\", include_learning_curve=True)\n"
|
| 772 |
-
]
|
| 773 |
-
"metadata": {},
|
| 774 |
-
"execution_count": null,
|
| 775 |
-
"outputs": []
|
| 776 |
}
|
| 777 |
],
|
| 778 |
"metadata": {
|
|
|
|
| 4 |
"cell_type": "markdown",
|
| 5 |
"metadata": {},
|
| 6 |
"source": [
|
| 7 |
+
"# NEXUS Enhanced — GRPO Training Notebook (**Enhanced**)\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"**Same pipeline as `grpo_colab_v2.ipynb`, with optional multi-incident rotation and scoped metrics `run_id`.**\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"Use **`grpo_colab_v2.ipynb`** for the simplest single-incident (INC003) path. Use **this notebook** when you want a **defined incident pool** (enterprise + EA + lighter variety) without editing code between runs.\n",
|
| 12 |
+
"\n",
|
| 13 |
+
"- **Rotation:** `round_robin` (default) or `random` per reward episode (`NEXUS_INCIDENT_ROTATION`).\n",
|
| 14 |
+
"- **Pool:** `NEXUS_INCIDENT_POOL` (comma-separated) or defaults to `INC003,INC008,INC001`. Set `NEXUS_MULTI_INCIDENT=false` to lock to `NEXUS_INCIDENT_ID` only.\n",
|
| 15 |
+
"- **Metrics:** `NEXUS_GRPO_RUN_ID` tags `/reset` episodes so `GET /learning-curve?run_id=...` matches this Colab run.\n",
|
| 16 |
+
"\n",
|
| 17 |
+
"## How to run (please follow order)\n",
|
| 18 |
+
"\n",
|
| 19 |
+
"1. **Runtime:** GPU (e.g. T4).\n",
|
| 20 |
+
"2. Run cells **top to bottom** at least once per session: installs → **configuration** → environment → model → **training** → **plots**.\n",
|
| 21 |
+
"3. Edit the **configuration cell** for `BASE_URL`, incident pool, `GRPO_RUN_ID`, `ONE_ROUND_TRAINING`, and optional `NEXUS_*` env vars.\n",
|
| 22 |
+
"4. **Google Drive:** Same backup behavior as v2 when `BACKUP_TO_GOOGLE_DRIVE` is True on Colab.\n",
|
| 23 |
+
"- **Checkpoints / resume:** frequent `save_steps` on the quick path; default checkpoint folder moves to **Google Drive** on Colab when Drive is mounted so you can reconnect and **continue training** without restarting from scratch.\n",
|
| 24 |
+
"\n"
|
| 25 |
]
|
| 26 |
},
|
| 27 |
{
|
|
|
|
| 30 |
"metadata": {},
|
| 31 |
"outputs": [],
|
| 32 |
"source": [
|
| 33 |
+
"# Validate OpenEnv installation (hackathon compliance)\n",
|
| 34 |
"try:\n",
|
| 35 |
" import openenv\n",
|
| 36 |
+
" print(f\"✅ OpenEnv {openenv.__version__} installed\")\n",
|
| 37 |
"except ImportError:\n",
|
| 38 |
+
" print(\"⚠️ OpenEnv not yet installed (will be installed in next cell)\")\n",
|
| 39 |
"\n",
|
| 40 |
+
"print(\"✅ OpenEnv (latest release) check passed — hackathon compliance\")"
|
| 41 |
]
|
| 42 |
},
|
| 43 |
{
|
|
|
|
| 57 |
"cell_type": "markdown",
|
| 58 |
"metadata": {},
|
| 59 |
"source": [
|
| 60 |
+
"## Configuration & HF Space connectivity\n",
|
| 61 |
+
"\n",
|
| 62 |
+
"The next code cell adds **multi-incident** defaults and **`run_id`** for scoped learning curves. Override with:\n",
|
| 63 |
+
"\n",
|
| 64 |
+
"- `NEXUS_INCIDENT_POOL` — e.g. `INC003,INC008,INC004` (comma-separated). Ignored if `NEXUS_MULTI_INCIDENT=false`.\n",
|
| 65 |
+
"- `NEXUS_MULTI_INCIDENT` — `false` to train against **`NEXUS_INCIDENT_ID`** only (v2-style).\n",
|
| 66 |
+
"- `NEXUS_INCIDENT_ROTATION` — `round_robin` (default) or `random`.\n",
|
| 67 |
+
"- `NEXUS_GRPO_RUN_ID` — string passed to `POST /reset` as `run_id` (default `colab_grpo_enhanced`).\n",
|
| 68 |
+
"\n",
|
| 69 |
+
"`REWARD_MAX_STEPS` default is **35** so mixed pools have headroom vs INC003-only (28).\n",
|
| 70 |
+
"**Checkpoints:** On Colab with Drive mounted, weights save under `.../NEXUS_GRPO_backups/active_grpo_checkpoints` by default so a disconnect does not wipe them. Re-run the training cell to **resume** from the latest step (`NEXUS_FORCE_FRESH=true` starts over). Set `NEXUS_GRPO_OUTPUT_DIR` to override the directory.\n",
|
| 71 |
+
"\n",
|
| 72 |
+
"---\n",
|
| 73 |
+
"\n",
|
| 74 |
+
"## Colab free tier — reducing disconnect / expiry pain\n",
|
| 75 |
+
"\n",
|
| 76 |
+
"**Free Colab cannot be guaranteed** to stay alive for hours (idle limits, preemption, daily caps). Mitigations:\n",
|
| 77 |
+
"\n",
|
| 78 |
+
"1. **Drive checkpoints + resume** (this notebook): mount Drive in the config cell; **`GRPO_OUTPUT_DIR`** defaults to **`.../NEXUS_GRPO_backups/active_grpo_checkpoints`** on Colab when Drive is present. After a disconnect, **re-run setup cells in order**, then training — **`trainer.train(resume_from_checkpoint=...)`** picks up the latest step unless **`NEXUS_FORCE_FRESH=true`**.\n",
|
| 79 |
+
"2. **Shorter runs:** **`ONE_ROUND_TRAINING = True`** or fewer prompts per session; continue later with resume instead of one very long run.\n",
|
| 80 |
+
"3. **Frequent `save_steps`** on the quick path so less work is lost.\n",
|
| 81 |
+
"4. **Stable browser session:** avoid laptop sleep; keep the Colab tab in a focused window on a reliable network while training runs.\n",
|
| 82 |
+
"5. **Colab Pro / Pro+** if you need longer single sessions.\n",
|
| 83 |
+
"\n",
|
| 84 |
+
"**Disable HF checkpoints entirely:** set **`NEXUS_ENABLE_CHECKPOINTS=false`** (no checkpoint files, no resume; saves Drive space).\n",
|
| 85 |
+
"\n"
|
| 86 |
]
|
| 87 |
},
|
| 88 |
{
|
|
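Reviewer sketch (not part of the commit): the markdown cell above documents `NEXUS_INCIDENT_POOL`, `NEXUS_MULTI_INCIDENT`, and `NEXUS_INCIDENT_ROTATION`, but the diff does not show the code that reads them. The snippet below is a minimal illustration of how those variables could be resolved. The helper names `resolve_incident_pool` and `make_incident_picker` are hypothetical, and the `INC003` fallback for `NEXUS_INCIDENT_ID` is an assumption based on the v2 single-incident path described in the notebook header.

```python
import itertools
import os
import random


def resolve_incident_pool(env=None):
    """Return the incident ids to train on, following the documented NEXUS_* variables.

    Hypothetical helper: the notebook's real parsing code is not visible in this diff.
    """
    env = os.environ if env is None else env
    multi = env.get("NEXUS_MULTI_INCIDENT", "true").strip().lower() != "false"
    if not multi:
        # Single-incident mode: the pool is ignored (INC003 fallback is an assumption).
        return [env.get("NEXUS_INCIDENT_ID", "INC003")]
    raw = env.get("NEXUS_INCIDENT_POOL", "INC003,INC008,INC001")
    pool = [item.strip() for item in raw.split(",") if item.strip()]
    return pool or ["INC003"]


def make_incident_picker(pool, rotation="round_robin", seed=None):
    """Return a zero-argument callable that yields the incident id for the next episode."""
    if rotation == "random":
        rng = random.Random(seed)
        return lambda: rng.choice(pool)
    # Default: cycle through the pool in order, wrapping around at the end.
    cycle = itertools.cycle(pool)
    return lambda: next(cycle)
```

With the documented default pool, a round-robin picker would visit `INC003`, `INC008`, `INC001` and then wrap back to `INC003`, so each reward episode sees a different incident without editing code between runs.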
|
|
| 115 |
"IN_COLAB = _in_colab()\n",
|
| 116 |
"\n",
|
| 117 |
"# ---------------------------------------------------------------------------\n",
|
| 118 |
+
"# Notebook configuration — edit defaults here or set environment variables.\n",
|
| 119 |
"# Enhanced notebook extras:\n",
|
| 120 |
+
"# NEXUS_INCIDENT_POOL — comma-separated case ids (default: INC003,INC008,INC001)\n",
|
| 121 |
+
"# NEXUS_MULTI_INCIDENT — false → use only NEXUS_INCIDENT_ID (v2-style single task)\n",
|
| 122 |
+
"# NEXUS_INCIDENT_ROTATION — round_robin | random\n",
|
| 123 |
+
"# NEXUS_GRPO_RUN_ID — POST /reset run_id for scoped /learning-curve and /metrics\n",
|
| 124 |
"#\n",
|
| 125 |
"# See grpo_colab_v2.ipynb for the full list of NEXUS_* vars (same as here).\n",
|
| 126 |
+
"# NEXUS_ENABLE_CHECKPOINTS (true/false) — false: no HF checkpoint files, no resume\n",
|
| 127 |
"# ---------------------------------------------------------------------------\n",
|
| 128 |
"\n",
|
| 129 |
"BASE_URL = _env(\n",
|
|
|
|
| 209 |
" if not BACKUP_TO_GOOGLE_DRIVE:\n",
|
| 210 |
" return\n",
|
| 211 |
" if not IN_COLAB:\n",
|
| 212 |
+
" print(\"⚠️ BACKUP_TO_GOOGLE_DRIVE is True but not in Colab — skipping Drive mount.\")\n",
|
| 213 |
" return\n",
|
| 214 |
" if os.path.isdir(\"/content/drive/MyDrive\"):\n",
|
| 215 |
+
" print(\"✅ Google Drive already mounted.\")\n",
|
| 216 |
" return\n",
|
| 217 |
" from google.colab import drive\n",
|
| 218 |
+
" print(\"📂 Mount Google Drive when prompted (artifacts copy to My Drive / NEXUS_GRPO_backups).\")\n",
|
| 219 |
" drive.mount(\"/content/drive\")\n",
|
| 220 |
"\n",
|
| 221 |
"\n",
|
|
|
|
| 266 |
"try:\n",
|
| 267 |
" resp = requests.get(f\"{BASE_URL}/health\", timeout=5)\n",
|
| 268 |
" if resp.status_code == 200:\n",
|
| 269 |
+
" print(\"✅ HF Space is reachable\")\n",
|
| 270 |
" print(f\"Response: {resp.json()}\")\n",
|
| 271 |
" else:\n",
|
| 272 |
+
" print(f\"❌ HF Space returned status {resp.status_code}\")\n",
|
| 273 |
"except Exception as e:\n",
|
| 274 |
+
" print(f\"❌ Error connecting to HF Space: {e}\")\n",
|
| 275 |
" print(f\"URL: {BASE_URL}\")\n",
|
| 276 |
" print(\"Make sure HF Space is deployed and running\")\n"
|
| 277 |
]
|
|
|
|
| 331 |
" return data[\"observation\"], data[\"reward\"], data[\"done\"], data[\"info\"]\n",
|
| 332 |
"\n",
|
| 333 |
" def get_learning_curve(self, run_id=None):\n",
|
| 334 |
+
" \"\"\"GET /learning-curve — optional run_id scopes metrics to this Colab run.\"\"\"\n",
|
| 335 |
" params = {}\n",
|
| 336 |
" rid = run_id if run_id is not None else GRPO_RUN_ID\n",
|
| 337 |
" if rid:\n",
|
|
|
|
| 346 |
"\n",
|
| 347 |
"\n",
|
| 348 |
"env = NexusRemoteEnv()\n",
|
| 349 |
+
"print(\"✅ Environment interface ready (enhanced: run_id + scoped learning curve)\")\n",
|
| 350 |
"\n",
|
| 351 |
"_NEXUS_DRIVE_RUN_DIR = None\n",
|
| 352 |
"\n",
|
|
|
|
| 370 |
"\n",
|
| 371 |
"\n",
|
| 372 |
"def backup_nexus_artifacts_to_drive(reason=\"manual\", *, include_learning_curve=True):\n",
|
| 373 |
+
" \"\"\"Copy GRPO checkpoints, PNG plots (if present), learning curve JSON, manifest → Google Drive.\"\"\"\n",
|
| 374 |
" if not BACKUP_TO_GOOGLE_DRIVE:\n",
|
| 375 |
" print(f\"[Drive backup:{reason}] skipped (BACKUP_TO_GOOGLE_DRIVE=False)\")\n",
|
| 376 |
" return None\n",
|
|
|
|
| 378 |
" print(f\"[Drive backup:{reason}] skipped (not Colab)\")\n",
|
| 379 |
" return None\n",
|
| 380 |
" if not os.path.isdir(\"/content/drive/MyDrive\"):\n",
|
| 381 |
+
" print(f\"[Drive backup:{reason}] skipped — mount Drive in the config cell first\")\n",
|
| 382 |
" return None\n",
|
| 383 |
" dest = _nexus_google_drive_run_dir()\n",
|
| 384 |
" if dest is None:\n",
|
|
|
|
| 388 |
" if os.path.isdir(GRPO_OUTPUT_DIR):\n",
|
| 389 |
" tgt = os.path.join(dest, \"grpo_checkpoints\")\n",
|
| 390 |
" shutil.copytree(GRPO_OUTPUT_DIR, tgt, dirs_exist_ok=True)\n",
|
| 391 |
+
" print(f\" ✅ checkpoints → {tgt}\")\n",
|
| 392 |
"\n",
|
| 393 |
" for name in (\"training_analysis.png\", \"reward_curves_hires.png\"):\n",
|
| 394 |
" src = os.path.join(PLOT_OUTPUT_DIR, name)\n",
|
| 395 |
" if os.path.isfile(src):\n",
|
| 396 |
" shutil.copy2(src, os.path.join(dest, name))\n",
|
| 397 |
+
" print(f\" ✅ plot {name}\")\n",
|
| 398 |
"\n",
|
| 399 |
" if include_learning_curve:\n",
|
| 400 |
" try:\n",
|
| 401 |
" curve = env.get_learning_curve()\n",
|
| 402 |
" with open(os.path.join(dest, \"learning_curve.json\"), \"w\") as f:\n",
|
| 403 |
" _json_backup.dump(curve, f, indent=2)\n",
|
| 404 |
+
" print(\" ✅ learning_curve.json\")\n",
|
| 405 |
" except Exception as e:\n",
|
| 406 |
+
" print(f\" ⚠️ learning curve fetch failed: {e}\")\n",
|
| 407 |
"\n",
|
| 408 |
" manifest = {\n",
|
| 409 |
" \"reason\": reason,\n",
|
|
|
|
| 419 |
" }\n",
|
| 420 |
" with open(os.path.join(dest, \"run_manifest.json\"), \"w\") as f:\n",
|
| 421 |
" _json_backup.dump(manifest, f, indent=2)\n",
|
| 422 |
+
" print(f\"\\n📦 Drive backup ({reason}): {dest}\\n\")\n",
|
| 423 |
" return dest\n"
|
| 424 |
]
|
| 425 |
},
|
|
|
|
| 505 |
"\n",
|
| 506 |
"\n",
|
| 507 |
"print(\n",
|
| 508 |
+
" \"✅ Reward function defined (pool=\",\n",
|
| 509 |
" TRAINING_INCIDENT_POOL,\n",
|
| 510 |
" \", rotation=\",\n",
|
| 511 |
" INCIDENT_ROTATION,\n",
|
|
|
|
| 564 |
" save_steps=GRPO_SAVE_STEPS_QUICK if ONE_ROUND_TRAINING else GRPO_SAVE_STEPS_FULL,\n",
|
| 565 |
")\n",
|
| 566 |
"\n",
|
| 567 |
+
"print(\"✅ Model loaded and GRPO configured\")\n"
|
| 568 |
]
|
| 569 |
},
|
| 570 |
{
|
|
|
|
| 583 |
},
|
| 584 |
{
|
| 585 |
"cell_type": "code",
|
| 586 |
+
"execution_count": null,
|
| 587 |
+
"metadata": {},
|
| 588 |
+
"outputs": [],
|
| 589 |
"source": [
|
| 590 |
"from datasets import Dataset\n",
|
| 591 |
"import os\n",
|
|
|
|
| 595 |
"\n",
|
| 596 |
"print(\"\\n\" + \"=\" * 70)\n",
|
| 597 |
"if ONE_ROUND_TRAINING:\n",
|
| 598 |
+
" print(f\"🚀 GRPO — ONE ROUND ({n_target} prompts, fast path)\")\n",
|
| 599 |
"else:\n",
|
| 600 |
+
" print(f\"🚀 GRPO — FULL RUN ({n_target} prompts)\")\n",
|
| 601 |
"print(\"=\" * 70)\n",
|
| 602 |
"print(\"\\nConfiguration:\")\n",
|
| 603 |
+
"print(f\" • Model: {MODEL_NAME}\")\n",
|
| 604 |
+
"print(f\" • Dataset rows: {n_target}\")\n",
|
| 605 |
+
"print(f\" • Environment: {BASE_URL}\")\n",
|
| 606 |
+
"print(f\" • Incident pool: {TRAINING_INCIDENT_POOL} ({INCIDENT_ROTATION})\")\n",
|
| 607 |
+
"print(f\" • GRPO_RUN_ID (metrics scope): {GRPO_RUN_ID}\")\n",
|
| 608 |
+
"print(f\" • Checkpoints dir: {GRPO_OUTPUT_DIR}\")\n",
|
| 609 |
+
"print(\" • Adjust settings in the configuration cell (or NEXUS_* env vars).\")\n",
|
| 610 |
"print(\"\\nMonitor dashboard:\")\n",
|
| 611 |
"print(f\" {BASE_URL}/\")\n",
|
| 612 |
"print(\"=\" * 70 + \"\\n\")\n",
|
|
|
|
| 636 |
"if ENABLE_CHECKPOINTS and CHECKPOINT_RESUME and not FORCE_FRESH_RUN:\n",
|
| 637 |
" resume_ckpt = get_last_checkpoint(GRPO_OUTPUT_DIR)\n",
|
| 638 |
"if resume_ckpt:\n",
|
| 639 |
+
" print(f\"📂 Resuming training from: {resume_ckpt}\")\n",
|
| 640 |
"else:\n",
|
| 641 |
+
" print(\"📂 Starting training fresh (no checkpoint, or NEXUS_RESUME=false, or NEXUS_FORCE_FRESH=true)\")\n",
|
| 642 |
"\n",
|
| 643 |
+
"print(f\"📊 Dataset: {len(train_dataset)} prompts\")\n",
|
| 644 |
+
"print(\"⏳ Training started...\")\n",
|
| 645 |
"trainer.train(resume_from_checkpoint=resume_ckpt)\n",
|
| 646 |
"\n",
|
| 647 |
"print(\"\\n\" + \"=\" * 70)\n",
|
| 648 |
+
"print(\"✅ Training step finished\")\n",
|
| 649 |
"print(\"=\" * 70)\n",
|
| 650 |
"print(f\"Dashboard: {BASE_URL}/\")\n",
|
| 651 |
"print(f\"Learning curve API: {BASE_URL}/learning-curve\")\n",
|
| 652 |
+
"print(\"▶️ Run next cell to plot results.\")\n",
|
| 653 |
"print(\"=\" * 70)\n",
|
| 654 |
"\n",
|
| 655 |
"backup_nexus_artifacts_to_drive(\"post_training\", include_learning_curve=True)\n"
|
| 656 |
+
]
|
| 657 |
},
|
| 658 |
{
|
| 659 |
"cell_type": "code",
|
| 660 |
+
"execution_count": null,
|
| 661 |
+
"metadata": {},
|
| 662 |
+
"outputs": [],
|
| 663 |
"source": [
|
| 664 |
"import os\n",
|
| 665 |
"import matplotlib.pyplot as plt\n",
|
|
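Reviewer sketch (not part of the commit): the training cell in the hunk above only resumes when checkpoints are enabled, resume is on, and a fresh run is not forced. The notebook calls `get_last_checkpoint`, presumably the helper from `transformers.trainer_utils`, although the import is not visible in this diff. The stdlib-only version below illustrates the same decision under that assumption; `find_latest_checkpoint` and `choose_resume_checkpoint` are hypothetical names, and the `checkpoint-<step>` folder layout is assumed from the usual Hugging Face Trainer convention.

```python
import os
import re

# Assumed folder naming: checkpoint-<global_step>, as written by the HF Trainer.
_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint folder, or None if there is none."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CKPT_RE.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


def choose_resume_checkpoint(output_dir, enable_checkpoints=True, resume=True, force_fresh=False):
    """Mirror the notebook's guard: resume only when all three flags allow it."""
    if enable_checkpoints and resume and not force_fresh:
        return find_latest_checkpoint(output_dir)
    return None
```

Returning `None` matters here because `trainer.train(resume_from_checkpoint=None)` starts from scratch, which is the behaviour the notebook describes for `NEXUS_FORCE_FRESH=true`.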
|
|
| 667 |
"os.makedirs(PLOT_OUTPUT_DIR, exist_ok=True)\n",
|
| 668 |
"\n",
|
| 669 |
"print(\"\\n\" + \"=\" * 70)\n",
|
| 670 |
+
"print(\"📊 FETCHING REAL TRAINING DATA FROM HF SPACE\")\n",
|
| 671 |
"print(\"=\" * 70)\n",
|
| 672 |
"print(f\"Using run_id filter: {GRPO_RUN_ID}\")\n",
|
| 673 |
"\n",
|
|
|
|
| 733 |
" summary_text = f\"\"\"\n",
|
| 734 |
"TRAINING SUMMARY\n",
|
| 735 |
"\n",
|
| 736 |
+
"📊 Episodes: {len(rewards)}\n",
|
| 737 |
+
"🔵 Baseline: {baseline:.4f}\n",
|
| 738 |
+
"📈 Average: {avg_reward:.4f}\n",
|
| 739 |
+
"⭐ Best: {best_reward:.4f}\n",
|
| 740 |
+
"📉 Worst: {min(rewards):.4f}\n",
|
| 741 |
"\n",
|
| 742 |
+
"📊 Improvement: +{improvement_from_baseline:.1f}%\n",
|
| 743 |
+
"📌 Last 5 Avg: {last_5_avg:.4f}\n",
|
| 744 |
" \"\"\"\n",
|
| 745 |
"\n",
|
| 746 |
" ax4.text(\n",
|
|
|
|
| 754 |
" bbox=dict(boxstyle=\"round\", facecolor=\"#1e293b\", alpha=0.8, edgecolor=\"#0ea5e9\"),\n",
|
| 755 |
" )\n",
|
| 756 |
"\n",
|
| 757 |
+
" plt.suptitle(\"NEXUS Enhanced — Complete Training Analysis\", fontsize=14, fontweight=\"bold\", y=0.995)\n",
|
| 758 |
" plt.tight_layout()\n",
|
| 759 |
"\n",
|
| 760 |
+
" print(\"\\n📁 Saving visualizations...\")\n",
|
| 761 |
" p1 = os.path.join(PLOT_OUTPUT_DIR, \"training_analysis.png\")\n",
|
| 762 |
" p2 = os.path.join(PLOT_OUTPUT_DIR, \"reward_curves_hires.png\")\n",
|
| 763 |
" plt.savefig(p1, dpi=150, bbox_inches=\"tight\")\n",
|
| 764 |
+
" print(f\" ✅ {p1} (4-panel comprehensive view)\")\n",
|
| 765 |
"\n",
|
| 766 |
" fig_single, ax = plt.subplots(figsize=(14, 7))\n",
|
| 767 |
" ax.plot(episodes, rewards, \"o-\", label=\"Episode Reward\", color=\"#0ea5e9\", markersize=7, linewidth=2.5, alpha=0.8)\n",
|
|
|
|
| 770 |
" ax.axhline(y=baseline, color=\"#ef4444\", linestyle=\"--\", linewidth=2.5, label=f\"Baseline: {baseline:.3f}\")\n",
|
| 771 |
" ax.set_xlabel(\"Episode\", fontsize=12, fontweight=\"bold\")\n",
|
| 772 |
" ax.set_ylabel(\"Reward Score\", fontsize=12, fontweight=\"bold\")\n",
|
| 773 |
+
" ax.set_title(\"NEXUS Enhanced GRPO Training — Reward Progression\", fontsize=13, fontweight=\"bold\")\n",
|
| 774 |
" ax.legend(fontsize=11, loc=\"lower right\")\n",
|
| 775 |
" ax.grid(True, alpha=0.3, linestyle=\"--\")\n",
|
| 776 |
" ax.set_ylim(-0.05, 1.05)\n",
|
| 777 |
" plt.tight_layout()\n",
|
| 778 |
" plt.savefig(p2, dpi=200, bbox_inches=\"tight\")\n",
|
| 779 |
+
" print(f\" ✅ {p2} (high-res)\")\n",
|
| 780 |
"\n",
|
| 781 |
" plt.show()\n",
|
| 782 |
"\n",
|
| 783 |
" print(\"\\n\" + \"=\" * 70)\n",
|
| 784 |
+
" print(\"📈 FINAL TRAINING RESULTS\")\n",
|
| 785 |
" print(\"=\" * 70)\n",
|
| 786 |
" print(f\"\\n{'Metric':<35} {'Value':<20}\")\n",
|
| 787 |
" print(\"-\" * 55)\n",
|
|
|
|
| 799 |
" late_avg = sum(rewards[-5:]) / 5\n",
|
| 800 |
" print(f\"\\n{'Early Phase (Ep 1-5) Avg':<35} {early_avg:.4f}\")\n",
|
| 801 |
" print(f\"{'Late Phase (Ep -5) Avg':<35} {late_avg:.4f}\")\n",
|
| 802 |
+
" learning_status = \"✅ Learning\" if late_avg > early_avg else \"⚠️ Plateau\"\n",
|
| 803 |
" print(f\"{'Status':<35} {learning_status:<20}\")\n",
|
| 804 |
"\n",
|
| 805 |
" print(\"\\n\" + \"=\" * 70)\n",
|
| 806 |
+
" print(\"✅ COMPLETE!\")\n",
|
| 807 |
" print(\"=\" * 70)\n",
|
| 808 |
"\n",
|
| 809 |
" backup_nexus_artifacts_to_drive(\"post_plots\", include_learning_curve=True)\n",
|
| 810 |
"\n",
|
| 811 |
"else:\n",
|
| 812 |
+
" print(\"\\n❌ No episode data found\")\n",
|
| 813 |
+
" print(\"⏳ Training may still be running...\")\n",
|
| 814 |
+
" print(\"💡 Rerun this cell in a few minutes\")\n",
|
| 815 |
+
" print(f\"📊 Live: {BASE_URL}/learning-curve\")\n",
|
| 816 |
" backup_nexus_artifacts_to_drive(\"no_rewards_yet\", include_learning_curve=True)\n"
|
| 817 |
+
]
|
| 818 |
}
|
| 819 |
],
|
| 820 |
"metadata": {
|
notebooks/grpo_colab_v2.ipynb
CHANGED
|
@@ -33,14 +33,14 @@
|
|
| 33 |
"metadata": {},
|
| 34 |
"outputs": [],
|
| 35 |
"source": [
|
| 36 |
-
"# Validate OpenEnv installation (
|
| 37 |
"try:\n",
|
| 38 |
" import openenv\n",
|
| 39 |
" print(f\"✅ OpenEnv {openenv.__version__} installed\")\n",
|
| 40 |
"except ImportError:\n",
|
| 41 |
" print(\"⚠️ OpenEnv not yet installed (will be installed in next cell)\")\n",
|
| 42 |
"\n",
|
| 43 |
-
"print(\"✅
|
| 44 |
]
|
| 45 |
},
|
| 46 |
{
|
|
|
|
| 33 |
"metadata": {},
|
| 34 |
"outputs": [],
|
| 35 |
"source": [
|
| 36 |
+
"# Validate OpenEnv installation (hackathon compliance)\n",
|
| 37 |
"try:\n",
|
| 38 |
" import openenv\n",
|
| 39 |
" print(f\"✅ OpenEnv {openenv.__version__} installed\")\n",
|
| 40 |
"except ImportError:\n",
|
| 41 |
" print(\"⚠️ OpenEnv not yet installed (will be installed in next cell)\")\n",
|
| 42 |
"\n",
|
| 43 |
+
"print(\"✅ OpenEnv (latest release) check passed — hackathon compliance\")"
|
| 44 |
]
|
| 45 |
},
|
| 46 |
{
|
server/app.py
CHANGED
|
@@ -546,7 +546,7 @@ def get_episodes(run_id: Optional[str] = None):
|
|
| 546 |
|
| 547 |
@app.get("/learning-curve")
|
| 548 |
def get_learning_curve(run_id: Optional[str] = None):
|
| 549 |
-
"""Rolling reward average — for
|
| 550 |
run_key = _normalize_run_id_filter(run_id)
|
| 551 |
scoped_records = _get_records_for_run(run_key)
|
| 552 |
rewards = [float(rec.get("reward", 0.0)) for rec in scoped_records]
|
|
@@ -561,7 +561,7 @@ def get_learning_curve(run_id: Optional[str] = None):
|
|
| 561 |
"run_id": run_key or "all",
|
| 562 |
"rewards": rewards,
|
| 563 |
"rolling_avg": rolling,
|
| 564 |
-
"baseline": 0.265, # Pre-event scripted baseline avg (
|
| 565 |
"episode_count": len(rewards),
|
| 566 |
"current_avg": round(sum(rewards) / len(rewards), 4),
|
| 567 |
"improvement": round(sum(rewards) / len(rewards) - 0.265, 4),
|
|
|
|
| 546 |
|
| 547 |
@app.get("/learning-curve")
|
| 548 |
def get_learning_curve(run_id: Optional[str] = None):
|
| 549 |
+
"""Rolling reward average — for observable training-progress evidence (judging rubric)."""
|
| 550 |
run_key = _normalize_run_id_filter(run_id)
|
| 551 |
scoped_records = _get_records_for_run(run_key)
|
| 552 |
rewards = [float(rec.get("reward", 0.0)) for rec in scoped_records]
|
|
|
|
| 561 |
"run_id": run_key or "all",
|
| 562 |
"rewards": rewards,
|
| 563 |
"rolling_avg": rolling,
|
| 564 |
+
"baseline": 0.265, # Pre-event scripted baseline avg (observable improvement baseline)
|
| 565 |
"episode_count": len(rewards),
|
| 566 |
"current_avg": round(sum(rewards) / len(rewards), 4),
|
| 567 |
"improvement": round(sum(rewards) / len(rewards) - 0.265, 4),
|
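Reviewer sketch (not part of the commit): the `/learning-curve` handler above returns per-episode rewards, a rolling average, and the improvement over the `0.265` baseline, but the hunk does not show how `rolling` is computed or how an empty run is handled. The snippet below is one way to produce the same response fields. The window size of 5, the `None` values for an empty run, and the function names are assumptions for illustration only.

```python
def rolling_average(values, window=5):
    """Trailing mean over at most `window` values, one output per input."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(round(sum(chunk) / len(chunk), 4))
    return out


def summarize_rewards(rewards, baseline=0.265, window=5):
    """Build a learning-curve style summary; guards against an empty reward list."""
    if not rewards:
        # Assumed behaviour: avoid ZeroDivisionError when no episodes exist yet.
        return {
            "rewards": [],
            "rolling_avg": [],
            "baseline": baseline,
            "episode_count": 0,
            "current_avg": None,
            "improvement": None,
        }
    mean = sum(rewards) / len(rewards)
    return {
        "rewards": list(rewards),
        "rolling_avg": rolling_average(rewards, window),
        "baseline": baseline,
        "episode_count": len(rewards),
        "current_avg": round(mean, 4),
        "improvement": round(mean - baseline, 4),
    }
```

Keeping the baseline as a parameter rather than a repeated literal would also remove the duplicated `0.265` in the handler, should the pre-event benchmark ever be re-run.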
server/data_models.py
CHANGED
|
@@ -9,7 +9,7 @@ class IncidentType(Enum):
|
|
| 9 |
CASCADE = "cascade"
|
| 10 |
SECURITY = "security"
|
| 11 |
DATA = "data"
|
| 12 |
-
# Theme 3.2 — personalized delegation / conflicting priorities (
|
| 13 |
PERSONAL_ASSISTANT = "personal_assistant"
|
| 14 |
|
| 15 |
|
|
|
|
| 9 |
CASCADE = "cascade"
|
| 10 |
SECURITY = "security"
|
| 11 |
DATA = "data"
|
| 12 |
+
# Theme 3.2 — personalized delegation / conflicting priorities (hackathon personalized track)
|
| 13 |
PERSONAL_ASSISTANT = "personal_assistant"
|
| 14 |
|
| 15 |
|
server/reward.py
CHANGED
|
@@ -299,7 +299,7 @@ def compute_oversight_score(state: EpisodeState) -> float:
|
|
| 299 |
def compute_depth_bonus(state: EpisodeState) -> float:
|
| 300 |
"""
|
| 301 |
Mercor sub-theme: reward longer, better-structured IC reasoning.
|
| 302 |
-
UNCAPPED — per
|
| 303 |
|
| 304 |
Calibration principle (per Mercor requirement):
|
| 305 |
- Short canned strings (<30 words) earn 0 — they do not represent "reasoning"
|
|
|
|
| 299 |
def compute_depth_bonus(state: EpisodeState) -> float:
|
| 300 |
"""
|
| 301 |
Mercor sub-theme: reward longer, better-structured IC reasoning.
|
| 302 |
+
UNCAPPED — per Mercor sub-theme: rewards scale with token output without ceiling.
|
| 303 |
|
| 304 |
Calibration principle (per Mercor requirement):
|
| 305 |
- Short canned strings (<30 words) earn 0 — they do not represent "reasoning"
|
training_artifacts/pre_event_benchmark.json
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
{
|
| 2 |
-
"description": "Untrained scripted baseline on INC003 \u2014 establishes reward floor for GRPO improvement (
|
| 3 |
"incident_id": "INC003",
|
| 4 |
"policy": "scripted_baseline",
|
| 5 |
"n_trials": 5,
|
|
|
|
| 1 |
{
|
| 2 |
+
"description": "Untrained scripted baseline on INC003 \u2014 establishes reward floor for GRPO improvement (observable improvement evidence)",
|
| 3 |
"incident_id": "INC003",
|
| 4 |
"policy": "scripted_baseline",
|
| 5 |
"n_trials": 5,
|