File size: 35,256 Bytes
5dd1bb4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 | # Implementation Specification
**Change:** F006 -- GRPO Training Pipeline
**Date:** 2026-03-27
**Research Summary:** [specs/F006-RESEARCH_SUMMARY.md](F006-RESEARCH_SUMMARY.md)
**Verification Spec:** See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
**Behavior Delta:** Archived to [specs/behavior/training.md](behavior/training.md)
**Plan Status:**
- [x] Draft
- [x] Approved for Implementation
- [x] Implementation Complete
- [x] Verification Passed
---
## Core Intent (Immutable)
> **DO NOT MODIFY THIS SECTION DURING REFINEMENT**
> Changes to Core Intent mean you are describing a different feature.
> If refinement reveals the need to change this section, create a new feature instead.
**User Problem:**
Train a model that learns SQL exploration strategy through RL. The "before vs after" comparison is the competition's money shot -- untrained agent flails randomly, trained agent explores strategically.
**Success Criteria:**
- Training notebook runs end-to-end in one click
- Learning curve clearly shows improvement over episodes
- Side-by-side episode transcripts: random vs trained
- Reproducible results (deterministic given seed)
**Avoid:**
- Training that does not converge at all (no learning signal)
- Requiring an expensive GPU for hours to see any signal
- Notebook with hidden dependencies that break on fresh setup
**Out of Scope:**
- wandb / TensorBoard integration (MVP: print metrics)
- vLLM inference (use HF generate for simplicity)
- Hard-difficulty questions in training set (add later)
- WebSocket-based training (use local env)
- Multi-GPU / distributed training
- Custom RLHF algorithms beyond GRPO
---
## 0. Slicing & Scope Budget (Anti-Waterfall)
This spec must be executable in **small, mergeable increments**.
### Scope Budget
- Target: **3 slices**
- Hard max: **<= 10 steps total**
- Each step must end in: **implement -> verify -> merge**
### Slice Definition
| Slice | Name | Value |
|-------|------|-------|
| S1 | Training Config + Prompts | Configurable training setup, system prompt for SQL agent |
| S2 | Rollout + Rewards | TRL-compatible rollout function and reward callables |
| S3 | Training Notebook | End-to-end notebook with learning curve and comparison |
## Status Icons
**Step Status:**
- !! Not Started
- >> In Progress
- OK Completed
- XX Blocked/Failed
**Result Outcome:**
- OK Fully Successful (all tests passed, no issues)
- !! Completed with Issues (needs follow-up)
- XX Failed/Blocked
---
## 1. Implementation Overview
### Summary
Add a `training/` subpackage with configuration, rollout, reward wrappers, and prompt modules that integrate with TRL's GRPOTrainer. Provide a `notebooks/train_grpo.ipynb` notebook as the user-facing entry point that trains a small LLM (default: Qwen3-1.7B) to play SQLEnv, then produces learning curves and before/after episode comparisons.
### Scope
**In Scope:**
- `training/config.py` -- dataclass with all hyperparameters and model name
- `training/prompts.py` -- system prompt for SQL exploration agent
- `training/rollout.py` -- `rollout_func` that plays SQLEnv episodes via HF generate
- `training/rewards.py` -- reward callables matching TRL `reward_funcs` signature
- `notebooks/train_grpo.ipynb` -- end-to-end training notebook
- `training/__init__.py` -- public exports
**Out of Scope:**
- vLLM inference backend
- wandb/TensorBoard logging
- Training on hard-difficulty questions
- Distributed or multi-GPU training
---
## 1a. Execution Status
**Progress:** 6/6 steps complete
**Current Step:** None (implementation complete)
**Last Updated:** 2026-03-28T07:37:20Z
**Latest Result:** OK Fully Successful - Step 3.1 complete, 68/68 tests passed
**Blockers:** None
---
## 1b. Risk Assessment
**Risk Tier:** Medium
**Risk Tier Definitions:**
- **Low:** Pure logic, non-user-facing, no security implications
- **Medium:** User input handling, data validation, API changes
- **High:** Authentication, payments, secrets management, untrusted input
**High-Risk Indicators Present:** None
**Security Review Required:** No
**Justification:**
External model loading from HuggingFace Hub and GPU resource management require care, but no security-sensitive data flows. Risk is primarily around convergence and resource requirements.
---
## 2. Change Manifest
### Files to Create
| File | Purpose |
|------|---------|
| `training/__init__.py` | Package init, public exports |
| `training/config.py` | `GRPOConfig` dataclass with hyperparameters |
| `training/prompts.py` | System prompt for SQL exploration agent |
| `training/rollout.py` | `rollout_func` for TRL GRPOTrainer |
| `training/rewards.py` | Reward callables: correctness, progress, operational |
| `training/data_loading.py` | Model/question loading helpers for notebook runtime and tests |
| `training/notebook_pipeline.py` | Notebook orchestration helpers for trainer setup, baseline, and metrics |
| `notebooks/train_grpo.ipynb` | End-to-end training notebook |
| `tests/integration/test_training_pipeline.py` | Integration verification for rollout + rewards pipeline |
| `tests/e2e/test_training_e2e.py` | Notebook smoke verification and pipeline behavior checks |
| `tests/unit/test_error_handling.py` | Error-path verification for model/questions loading and fallback logging |
### Files to Modify
| File | Changes |
|------|---------|
| `pyproject.toml` | Add `trl` and training optional dependency group |
### Files to Delete
None.
---
## 3. Interface Specifications
### New Types
```python
# Location: training/config.py
from dataclasses import dataclass, field
@dataclass
class GRPOConfig:
"""All hyperparameters for GRPO training on SQLEnv."""
# Model
model_name: str = "Qwen/Qwen3-1.7B"
max_new_tokens: int = 256
# Training
num_train_epochs: int = 1
per_device_train_batch_size: int = 2
gradient_accumulation_steps: int = 4
learning_rate: float = 5e-6
num_generations: int = 4 # G in GRPO (completions per prompt)
# Environment
questions_path: str = "data/questions/questions_train.json"
db_dir: str = "data/databases"
step_budget: int = 10 # Shorter budget for training
difficulty_filter: list[str] = field(default_factory=lambda: ["easy", "medium"])
# Reproducibility
seed: int = 42
# Output
output_dir: str = "outputs/grpo_run"
logging_steps: int = 10
```
### New Functions
```python
# Location: training/prompts.py
def get_system_prompt() -> str:
"""Return the system prompt for the SQL exploration agent.
Returns:
System prompt string instructing the model on SQLEnv action format.
"""
def format_observation(obs: "SQLObservation") -> str:
"""Format an SQLObservation into a user-turn string for the model.
Args:
obs: The observation from the environment.
Returns:
Formatted string suitable as a user message in chat history.
"""
```
```python
# Location: training/rollout.py
from typing import Any
def rollout_func(
prompts: list[str],
model: Any,
tokenizer: Any,
config: "GRPOConfig",
) -> list[dict[str, Any]]:
"""Play SQLEnv episodes for a batch of question prompts.
Each prompt is a question text. The function:
1. Creates a local SQLEnvironment
2. Resets with the question
3. Loops: model.generate() -> parse action -> env.step()
4. Collects completions and metadata
Args:
prompts: List of question texts (from training dataset).
model: HuggingFace model for generation.
tokenizer: HuggingFace tokenizer.
config: Training configuration.
Returns:
List of dicts with keys:
- "prompt": str (the input prompt)
- "completion": str (full model output trajectory)
- "metadata": dict with episode_id, steps, done, answer_correct
"""
```
```python
# Location: training/rewards.py
def reward_correctness(
completions: list[list[dict[str, str]]],
**kwargs: Any,
) -> list[float]:
"""Binary reward: 1.0 if episode ended with correct answer, 0.0 otherwise.
Args:
completions: Batch of completion message lists (TRL format).
**kwargs: Additional metadata from rollout (includes 'metadata' key).
Returns:
List of float rewards, one per completion.
"""
def reward_progress(
completions: list[list[dict[str, str]]],
**kwargs: Any,
) -> list[float]:
"""Progress reward: cumulative progress score from environment.
Args:
completions: Batch of completion message lists (TRL format).
**kwargs: Additional metadata from rollout.
Returns:
List of float rewards, one per completion.
"""
def reward_operational(
completions: list[list[dict[str, str]]],
**kwargs: Any,
) -> list[float]:
"""Operational reward: sum of per-step L1 signals (exec_ok, new_info, etc.).
Args:
completions: Batch of completion message lists (TRL format).
**kwargs: Additional metadata from rollout.
Returns:
List of float rewards, one per completion.
"""
```
---
## 4. Data Flow
### Primary Flow (Training Loop)
```
1. Notebook loads GRPOConfig and model/tokenizer from HuggingFace
- Input: config.model_name
- Output: model, tokenizer, config
2. Load training questions filtered by difficulty
- Input: config.questions_path, config.difficulty_filter
- Output: list[str] of question texts as prompts
3. GRPOTrainer calls rollout_func for each batch of prompts
- Input: prompts, model, tokenizer, config
- Action: For each prompt, play a full SQLEnv episode
a. Create local SQLEnvironment
b. env.reset(question) -> initial observation
c. Loop: format obs -> model.generate() -> parse SQLAction -> env.step()
d. Collect full trajectory as completion string
- Output: completions + metadata (correctness, progress, operational signals)
4. GRPOTrainer calls each reward_func on completions
- Input: completions list, metadata kwargs
- Output: list[float] per reward function
5. GRPOTrainer computes GRPO loss and updates model weights
- Input: completions, rewards, model
- Output: updated model weights, logged metrics
6. Repeat steps 3-5 for num_train_epochs
```
### Alternative Flow: Unparseable Model Output
```
1. Model generates text that cannot be parsed as SQLAction
2. rollout_func defaults to QUERY action with raw text as argument
3. Environment returns an error observation
4. Episode continues (agent can recover in subsequent steps)
```
### Alternative Flow: Episode Exceeds Token Budget
```
1. Observation context grows beyond max_new_tokens window
2. rollout_func truncates conversation history, keeping:
a. System prompt (always)
b. Most recent 3 observation-action pairs
3. Episode continues with truncated context
```
---
## 5. Error Handling
### Error Types
| Error | When | Strategy |
|-------|------|----------|
| `ModelLoadError` | Model not found on HuggingFace | Fail fast with clear message naming model_name |
| `ActionParseError` | Model output not parseable as SQLAction | Default to QUERY with raw text, log warning |
| `OOMError` | GPU out of memory during training | Print guidance: reduce batch_size or num_generations |
| `QuestionLoadError` | Questions file missing or empty | Fail fast with path in error message |
| `EnvironmentError` | SQLEnv database missing | Fail fast pointing to data download instructions |
### Error Handling Strategy
```python
# In rollout_func: graceful degradation
try:
action = parse_action(model_output)
except ActionParseError:
action = SQLAction(action_type="QUERY", argument=model_output)
# In notebook: fail-fast on setup
try:
model = AutoModelForCausalLM.from_pretrained(config.model_name)
except Exception as e:
raise RuntimeError(f"Cannot load model '{config.model_name}': {e}")
```
### Retry Strategy
| Operation | Retry? | Strategy |
|-----------|--------|----------|
| Model download | No | Fail fast, user must fix network/model name |
| Episode rollout | No | Single attempt per episode, errors become low-reward signal |
| Training step | No | OOM is fatal for that config, must adjust params |
---
## 6. Slice Plan (What we will ship, in order)
### Slice S1 -- Training Config + Prompts
**Value:** Centralized, documented configuration and system prompt ready for training integration
**User-visible change:** No (internal infrastructure)
**Interfaces introduced/changed:** `GRPOConfig`, `get_system_prompt()`, `format_observation()`
**Rollback safety:** Additive only -- new files, no existing code changed
### Slice S2 -- Rollout + Rewards
**Value:** TRL-compatible rollout and reward functions that can drive GRPO training
**User-visible change:** No (library code)
**Interfaces introduced/changed:** `rollout_func()`, `reward_correctness()`, `reward_progress()`, `reward_operational()`
**Rollback safety:** Additive only -- new files in training/ package
### Slice S3 -- Training Notebook
**Value:** Users can run one notebook to train a model and see before/after results
**User-visible change:** Yes -- the notebook is the primary deliverable
**Interfaces introduced/changed:** `notebooks/train_grpo.ipynb`, `pyproject.toml` training deps
**Rollback safety:** Notebook is standalone; pyproject.toml change is additive (optional deps group)
---
## 7. Implementation Steps
> **VERIFICATION NOTE:** Test criteria for each step are defined in VERIFICATION_SPEC.md.
> The verification-planner (separate agent) generated independent test criteria.
> Run the tests specified there after implementing each step.
### Step 1.1: Training Config Dataclass
**Slice:** S1
**Goal:** Create `training/config.py` with `GRPOConfig` dataclass holding all hyperparameters.
**Files:**
- `training/__init__.py` - create - package init with public exports
- `training/config.py` - create - GRPOConfig dataclass
**Interface Changes:**
- New type: `GRPOConfig` with fields as specified in Section 3
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** OK Completed
**Completed:** 2026-03-28T06:44:31Z
**Changes Made:**
- Created `training/config.py` with `GRPOConfig` dataclass and input validation in `__post_init__`
- Created `training/__init__.py` exporting `GRPOConfig`
- Added `tests/unit/test_grpo_config.py` covering defaults, overrides, required fields, and validation failures
**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_grpo_config.py -v
Result: 7 passed in 17.06s
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_grpo_config.py -v`
- **Notes:**
- Added explicit validation for numeric bounds and non-empty difficulty filter to fail fast during setup
- `uv run pytest ...` failed because pytest is not installed by default; used `uv run --with pytest pytest ...` for scoped test dependency
- Kept config required fields (`questions_path`, `db_dir`, `output_dir`) positional/required per verification criteria
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- GRPOConfig available for import by prompts.py and rollout.py
---
### Step 1.2: System Prompt and Observation Formatter
**Slice:** S1
**Goal:** Create `training/prompts.py` with system prompt and observation formatting for model input.
**Files:**
- `training/prompts.py` - create - system prompt and observation formatter
**Interface Changes:**
- New functions: `get_system_prompt() -> str`, `format_observation(obs: SQLObservation) -> str`
**Details:**
- System prompt should instruct the model on:
- Available actions: DESCRIBE, SAMPLE, QUERY, ANSWER
- Action format: `ACTION_TYPE: argument`
- Exploration strategy guidance (describe tables first, then query, then answer)
- Budget awareness
- `format_observation` converts SQLObservation fields into a readable user-turn string
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** OK Completed
**Completed:** 2026-03-28T06:47:49Z
**Changes Made:**
- Created `training/prompts.py` with deterministic `get_system_prompt()` and `format_observation()` helpers
- Added truncation guard for long observation results to keep prompt payload bounded
- Updated `training/__init__.py` exports to include prompt helpers
- Added `tests/unit/test_prompts.py` covering prompt content and observation formatting edge cases
**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_prompts.py -v
Result: 8 passed in 2.92s
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_prompts.py -v`
- **Notes:**
- `uv run pytest ...` failed because pytest is not installed in the base env; used `uv run --with pytest pytest ...` for scoped dependency execution
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- Prompt module ready for use in rollout.py
---
### Step 2.1: Action Parser Utility
**Slice:** S2
**Goal:** Create a robust parser that extracts `SQLAction` from free-form model output text.
**Files:**
- `training/rollout.py` - create - contains `parse_model_output(text: str) -> SQLAction`
**Interface Changes:**
- New function: `parse_model_output(text: str) -> SQLAction`
- Parses `ACTION_TYPE: argument` format from model text
- Falls back to `SQLAction(action_type="QUERY", argument=text)` on parse failure
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** OK Completed
**Completed:** 2026-03-28T06:51:50Z
**Changes Made:**
- Created `training/rollout.py` with `parse_model_output(text)` and a focused line parser helper
- Added action parsing for DESCRIBE/SAMPLE/QUERY/ANSWER with case-insensitive matching
- Added robust fallback behavior to `SQLAction(action_type="QUERY", argument=<raw_text>)` on parse failure
- Added `tests/unit/test_rollout.py` with coverage for happy path, edge cases, multiline output, and fallback behavior
**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_rollout.py -v
Result: 11 passed in 2.44s
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_rollout.py -v`
- **Notes:**
- `uv run pytest ...` failed because pytest is not installed in the base env; used `uv run --with pytest pytest ...` for scoped dependency execution
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- parse_model_output is available in `training/rollout.py` for Step 2.2 rollout integration
---
### Step 2.2: Rollout Function
**Slice:** S2
**Goal:** Implement `rollout_func` that plays full SQLEnv episodes using HF generate.
**Files:**
- `training/rollout.py` - modify - add `rollout_func` and `play_episode` helper
**Interface Changes:**
- New function: `rollout_func(prompts, model, tokenizer, config) -> list[dict]`
- New helper: `play_episode(question_text, model, tokenizer, config, env) -> dict`
- Creates local SQLEnvironment for the episode
- Loops: format obs -> generate -> parse -> step until done or budget exhausted
- Returns completion string and metadata dict
**Details:**
- Use `model.generate()` (HF native, not vLLM) for inference
- Build chat messages using tokenizer.apply_chat_template
- Truncate conversation history if it exceeds token window (keep system prompt + last 3 turns)
- Metadata includes: episode_id, step_count, done, answer_correct, cumulative_progress, operational_signals
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Medium
> Core integration point between model and environment -- most likely source of bugs.
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** OK Completed
**Completed:** 2026-03-28T07:04:59Z
**Changes Made:**
- Expanded `training/rollout.py` with `rollout_func`, `play_episode`, message-history truncation, prompt-aware environment reset, and HF `model.generate()` integration paths for both list and tensor-like outputs.
- Added rollout metadata fields (`episode_id`, `step_count`, `done`, `answer_correct`, `cumulative_progress`, `operational_signals`) and top-level compatibility keys (`content`, `correct`, `progress`, `operational`).
- Extended `tests/unit/test_rollout.py` with Step 2.2 coverage for batch behavior, step-budget termination, metadata shape, unparseable-action fallback continuity, history truncation, HF-style generation decoding, prompt binding, and incorrect-answer correctness guard.
**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_rollout.py -v
Result: 21 passed in 2.58s
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_rollout.py -v`
- **Notes:**
- Used `uv run --with pytest ...` because `pytest` is not available in the base environment.
- Medium-risk reviewer gate executed and resolved to APPROVE after decoder/correctness fixes.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- rollout metadata now carries correctness/progress/operational signals needed by `training/rewards.py` in Step 2.3
---
### Step 2.3: Reward Functions
**Slice:** S2
**Goal:** Implement three TRL-compatible reward callables that consume rollout metadata.
**Files:**
- `training/rewards.py` - create - reward_correctness, reward_progress, reward_operational
**Interface Changes:**
- New functions (all with TRL reward_func signature):
- `reward_correctness(completions, **kwargs) -> list[float]`
- `reward_progress(completions, **kwargs) -> list[float]`
- `reward_operational(completions, **kwargs) -> list[float]`
**Details:**
- `reward_correctness`: Binary 1.0/0.0 based on metadata["answer_correct"]
- `reward_progress`: Float from metadata["cumulative_progress"], normalized to [0, 1]
- `reward_operational`: Sum of per-step operational signals from metadata["operational_signals"]
- All functions access metadata via kwargs (TRL passes extra data from rollout return)
- Each function must handle missing metadata gracefully (return 0.0)
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** OK Completed
**Completed:** 2026-03-28T07:07:32Z
**Changes Made:**
- Created `training/rewards.py` with TRL-compatible `reward_correctness`, `reward_progress`, and `reward_operational` callables
- Added robust metadata extraction paths so reward functions support both nested `metadata` payloads and flattened rollout kwargs
- Updated `training/__init__.py` exports for reward helper imports from the package root
- Added `tests/unit/test_rewards.py` covering correctness/progress/operational behavior across happy path, edge, and batch scenarios
**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_rewards.py -v
Result: 19 passed in 3.35s
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_rewards.py -v`
- **Notes:**
- Used `uv run --with pytest ...` because `pytest` is not available in the base environment.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- `training/` now exposes config, prompts, rollout parsing/execution, and reward callables; next step is notebook wiring plus optional training dependencies in `pyproject.toml`
---
### Step 3.1: Training Notebook
**Slice:** S3
**Goal:** Create end-to-end training notebook that loads model, trains with GRPO, and produces learning curves.
**Files:**
- `notebooks/train_grpo.ipynb` - create - end-to-end training notebook
- `pyproject.toml` - modify - add `[project.optional-dependencies] training` group
**Interface Changes:**
- New optional dependency group: `training = ["trl>=0.12.0", "accelerate>=0.34.0"]`
**Details:**
Notebook cells (linear flow):
1. **Setup**: Install dependencies, import modules, set seed
2. **Config**: Instantiate GRPOConfig (users can override model_name here)
3. **Load Model**: `AutoModelForCausalLM.from_pretrained(config.model_name)`
4. **Load Dataset**: Load questions, filter by difficulty, format as prompts
5. **Initialize GRPOTrainer**: Pass model, tokenizer, rollout_func, reward_funcs, config
6. **Train**: `trainer.train()` with progress bar and metric printing
7. **Learning Curve**: Plot reward over training steps (matplotlib)
8. **Comparison**: Run 5 episodes with random actions vs trained model, display side-by-side transcripts
9. **Save**: Save trained model to config.output_dir
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Medium
> User-facing deliverable; must work on fresh setup.
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** OK Completed
**Completed:** 2026-03-28T07:37:20Z
**Changes Made:**
- Created `notebooks/train_grpo.ipynb` as the primary user-facing training notebook for F006, with one-pass setup, model/question loading, trainer construction, training execution, learning-curve plotting, random-baseline vs trained transcript comparison, and artifact save steps.
- Added `[project.optional-dependencies].training` in `pyproject.toml` with `trl>=0.14.0,<0.15.0` and `accelerate>=0.34.0` to keep TRL/torch compatibility stable for this repository.
- Added `training/data_loading.py` to centralize notebook error handling for model loading and question filtering/loading.
- Added `training/notebook_pipeline.py` to centralize trainer wiring, random baseline generation, training execution, and metrics extraction.
- Updated `training/__init__.py` exports to include notebook-facing helpers.
- Added `tests/e2e/test_training_e2e.py` for notebook smoke structure + pipeline behavior checks.
- Added `tests/integration/test_training_pipeline.py` for rollout/reward integration scenarios.
- Added `tests/unit/test_error_handling.py` for model/question loading failures, OOM guidance messaging, and parse-fallback warning logging.
**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v
Result: 68 passed in 5.79s
Command: uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')"
Result: ok
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v`
- **Notes:**
- Added concrete integration/e2e/error test files that were listed in `VERIFICATION_SPEC.md` but missing from repository.
- Notebook now compares random-policy baseline transcripts against trained-policy transcripts, matching the feature's user-facing comparison goal.
- Parse fallback now emits a warning log to align behavior with error-handling verification expectations.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- All implementation deliverables complete; feature is ready for final verification/finalization bookkeeping.
---
## 8. Rollout Considerations
### Feature Flags
- [ ] Required: No
### Migration
- [ ] Data migration needed: No
### Rollback Plan
All changes are additive (new `training/` package and `notebooks/` directory). Rollback is simply removing those directories and reverting the pyproject.toml optional deps change.
---
## 9. Execution Tracking
All execution state is tracked within this document:
- **Section 1a:** Overall progress summary
- **Section 7:** Per-step completion details, test results, and handoff context
- **FEATURES.json:** Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- **Git history:** Full audit trail of changes to this file
The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
- Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history
---
## 9a. Slice Completion Protocol
After all steps in a slice pass verification:
1. **Run verifier subagent** for spec compliance
- Validates against VERIFICATION_SPEC.md criteria
- Ensures no TODOs or incomplete work in slice
2. **Run compound-engineer subagent** to extract learnings
- **Mandatory invocation** after every slice completion
- Updates CLAUDE.md Learnings section (if durable patterns found)
- May exit with "no update needed" (valid for routine work)
3. **Commit** the slice changes
- Follow commit message format in CLAUDE.md
- Each slice gets its own atomic commit
4. **Continue to next slice** (if more slices remain)
- Or proceed to final verification if all slices complete
**Note:** PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready.
---
## 10. User Value Summary
**Status:** Generated
### What Users Can Now Do
Users can now run a single notebook (`notebooks/train_grpo.ipynb`) to configure GRPO training, load a compatible TRL stack, train a model on SQLEnv prompts, and inspect both reward-curve output and transcript comparisons between random and trained policies.
### How to Access/Test
1. Install training extras: `uv sync --extra training`
2. Open `notebooks/train_grpo.ipynb`
3. Run all cells to train and save artifacts to `outputs/grpo_run`
### Demo
- **Command:** `jupyter notebook notebooks/train_grpo.ipynb`
- **Verification command:** `uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v`
### Release Notes Snippet
Add a GRPO training pipeline for SQLEnv with a runnable notebook, pinned TRL training dependencies, robust loading/error helpers, and verification coverage across unit, integration, and notebook-smoke paths.
---
## 11. PR Contract (Auto-Generated by autocode-next-step)
**Status:** Generated
### Scope
- Finalized Step 3.1 (Training Notebook) for F006.
- Added training optional dependency group in `pyproject.toml` with TRL pin compatible with repo torch version.
- Added notebook support helpers for model/question loading and trainer orchestration.
- Added/expanded verification tests for notebook smoke, pipeline integration, and error handling.
### Files Changed
- `pyproject.toml`
- `notebooks/train_grpo.ipynb`
- `training/__init__.py`
- `training/data_loading.py`
- `training/notebook_pipeline.py`
- `training/rollout.py`
- `tests/e2e/test_training_e2e.py`
- `tests/integration/test_training_pipeline.py`
- `tests/unit/test_error_handling.py`
- `specs/F006-IMPLEMENTATION_SPEC.md`
- `specs/behavior/training.md`
### Verification Evidence
- `uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v` -> 68 passed
- `uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')"` -> ok
- Verifier verdict: APPROVED (`specs/F006-VERIFICATION_REPORT.md`)
### Risk and Rollback
- Risk tier: Medium (training dependencies and user-facing notebook workflow).
- Rollback: remove notebook/training helper additions and revert `pyproject.toml` training extra.
### Ready for Next Command
All implementation and verification criteria for F006 are complete. Run `/commit-push-pr` when ready.
---
## Stop Conditions (When to Split This Spec)
Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than **3 files** in unrelated areas
- You need to introduce **multiple new abstractions** "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices
When splitting, ensure the current slice ends in a merged, stable state.
---
## Human Checkpoint
**Before handing to AI agent:**
- [ ] Interface specifications are complete
- [ ] Data flow is accurate
- [ ] Error handling is specified
- [ ] Implementation order makes sense
- [ ] VERIFICATION_SPEC.md has been generated
**Questions:**
1. Confirm Qwen3-1.7B is accessible on HuggingFace Hub for the target environment.
2. Verify TRL GRPOTrainer API matches the rollout_func / reward_funcs signatures assumed here.
---
## Handoff Notes
**For the implementing AI agent:**
```
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Key decisions:
- HF generate (not vLLM) for inference
- Model name is a config parameter (default Qwen3-1.7B)
- Start with easy+medium questions only
- Follow TRL GRPOTrainer Wordle tutorial pattern
- reward_funcs are separate callables
```
---
*Specification completed: 2026-03-27*
*Approved by: [pending]*
*Verification spec: VERIFICATION_SPEC.md*
*Target agent: Claude Code*
|