Spaces:

hjerpe
/

sql_env

Sleeping

File size: 35,256 Bytes

5dd1bb4

# Implementation Specification

**Change:** F006 -- GRPO Training Pipeline
**Date:** 2026-03-27
**Research Summary:** [specs/F006-RESEARCH_SUMMARY.md](F006-RESEARCH_SUMMARY.md)
**Verification Spec:** See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
**Behavior Delta:** Archived to [specs/behavior/training.md](behavior/training.md)

**Plan Status:**
- [x] Draft
- [x] Approved for Implementation
- [x] Implementation Complete
- [x] Verification Passed

---

## Core Intent (Immutable)

> **DO NOT MODIFY THIS SECTION DURING REFINEMENT**
> Changes to Core Intent mean you are describing a different feature.
> If refinement reveals the need to change this section, create a new feature instead.

**User Problem:**
Train a model that learns SQL exploration strategy through RL. The "before vs after" comparison is the competition's money shot -- untrained agent flails randomly, trained agent explores strategically.

**Success Criteria:**
- Training notebook runs end-to-end in one click
- Learning curve clearly shows improvement over episodes
- Side-by-side episode transcripts: random vs trained
- Reproducible results (deterministic given seed)

**Avoid:**
- Training that does not converge at all (no learning signal)
- Requiring an expensive GPU for hours to see any signal
- Notebook with hidden dependencies that break on fresh setup

**Out of Scope:**
- wandb / TensorBoard integration (MVP: print metrics)
- vLLM inference (use HF generate for simplicity)
- Hard-difficulty questions in training set (add later)
- WebSocket-based training (use local env)
- Multi-GPU / distributed training
- Custom RLHF algorithms beyond GRPO

---

## 0. Slicing & Scope Budget (Anti-Waterfall)

This spec must be executable in **small, mergeable increments**.

### Scope Budget
- Target: **3 slices**
- Hard max: **<= 10 steps total**
- Each step must end in: **implement -> verify -> merge**

### Slice Definition

| Slice | Name | Value |
|-------|------|-------|
| S1 | Training Config + Prompts | Configurable training setup, system prompt for SQL agent |
| S2 | Rollout + Rewards | TRL-compatible rollout function and reward callables |
| S3 | Training Notebook | End-to-end notebook with learning curve and comparison |

## Status Icons

**Step Status:**
- !! Not Started
- >> In Progress
- OK Completed
- XX Blocked/Failed

**Result Outcome:**
- OK Fully Successful (all tests passed, no issues)
- !! Completed with Issues (needs follow-up)
- XX Failed/Blocked

---

## 1. Implementation Overview

### Summary

Add a `training/` subpackage with configuration, rollout, reward wrappers, and prompt modules that integrate with TRL's GRPOTrainer. Provide a `notebooks/train_grpo.ipynb` notebook as the user-facing entry point that trains a small LLM (default: Qwen3-1.7B) to play SQLEnv, then produces learning curves and before/after episode comparisons.

### Scope

**In Scope:**
- `training/config.py` -- dataclass with all hyperparameters and model name
- `training/prompts.py` -- system prompt for SQL exploration agent
- `training/rollout.py` -- `rollout_func` that plays SQLEnv episodes via HF generate
- `training/rewards.py` -- reward callables matching TRL `reward_funcs` signature
- `notebooks/train_grpo.ipynb` -- end-to-end training notebook
- `training/__init__.py` -- public exports

**Out of Scope:**
- vLLM inference backend
- wandb/TensorBoard logging
- Training on hard-difficulty questions
- Distributed or multi-GPU training

---

## 1a. Execution Status

**Progress:** 6/6 steps complete
**Current Step:** None (implementation complete)
**Last Updated:** 2026-03-28T07:37:20Z
**Latest Result:** OK Fully Successful - Step 3.1 complete, 68/68 tests passed
**Blockers:** None

---

## 1b. Risk Assessment

**Risk Tier:** Medium

**Risk Tier Definitions:**
- **Low:** Pure logic, non-user-facing, no security implications
- **Medium:** User input handling, data validation, API changes
- **High:** Authentication, payments, secrets management, untrusted input

**High-Risk Indicators Present:** None

**Security Review Required:** No

**Justification:**
External model loading from HuggingFace Hub and GPU resource management require care, but no security-sensitive data flows. Risk is primarily around convergence and resource requirements.

---

## 2. Change Manifest

### Files to Create

| File | Purpose |
|------|---------|
| `training/__init__.py` | Package init, public exports |
| `training/config.py` | `GRPOConfig` dataclass with hyperparameters |
| `training/prompts.py` | System prompt for SQL exploration agent |
| `training/rollout.py` | `rollout_func` for TRL GRPOTrainer |
| `training/rewards.py` | Reward callables: correctness, progress, operational |
| `training/data_loading.py` | Model/question loading helpers for notebook runtime and tests |
| `training/notebook_pipeline.py` | Notebook orchestration helpers for trainer setup, baseline, and metrics |
| `notebooks/train_grpo.ipynb` | End-to-end training notebook |
| `tests/integration/test_training_pipeline.py` | Integration verification for rollout + rewards pipeline |
| `tests/e2e/test_training_e2e.py` | Notebook smoke verification and pipeline behavior checks |
| `tests/unit/test_error_handling.py` | Error-path verification for model/questions loading and fallback logging |

### Files to Modify

| File | Changes |
|------|---------|
| `pyproject.toml` | Add `trl` and training optional dependency group |

### Files to Delete

None.

---

## 3. Interface Specifications

### New Types

```python
# Location: training/config.py

from dataclasses import dataclass, field

@dataclass
class GRPOConfig:
    """All hyperparameters for GRPO training on SQLEnv."""

    # Model
    model_name: str = "Qwen/Qwen3-1.7B"
    max_new_tokens: int = 256

    # Training
    num_train_epochs: int = 1
    per_device_train_batch_size: int = 2
    gradient_accumulation_steps: int = 4
    learning_rate: float = 5e-6
    num_generations: int = 4          # G in GRPO (completions per prompt)

    # Environment
    questions_path: str = "data/questions/questions_train.json"
    db_dir: str = "data/databases"
    step_budget: int = 10             # Shorter budget for training
    difficulty_filter: list[str] = field(default_factory=lambda: ["easy", "medium"])

    # Reproducibility
    seed: int = 42

    # Output
    output_dir: str = "outputs/grpo_run"
    logging_steps: int = 10
```

### New Functions

```python
# Location: training/prompts.py

def get_system_prompt() -> str:
    """Return the system prompt for the SQL exploration agent.

    Returns:
        System prompt string instructing the model on SQLEnv action format.
    """


def format_observation(obs: "SQLObservation") -> str:
    """Format an SQLObservation into a user-turn string for the model.

    Args:
        obs: The observation from the environment.

    Returns:
        Formatted string suitable as a user message in chat history.
    """
```

```python
# Location: training/rollout.py

from typing import Any

def rollout_func(
    prompts: list[str],
    model: Any,
    tokenizer: Any,
    config: "GRPOConfig",
) -> list[dict[str, Any]]:
    """Play SQLEnv episodes for a batch of question prompts.

    Each prompt is a question text. The function:
    1. Creates a local SQLEnvironment
    2. Resets with the question
    3. Loops: model.generate() -> parse action -> env.step()
    4. Collects completions and metadata

    Args:
        prompts: List of question texts (from training dataset).
        model: HuggingFace model for generation.
        tokenizer: HuggingFace tokenizer.
        config: Training configuration.

    Returns:
        List of dicts with keys:
          - "prompt": str (the input prompt)
          - "completion": str (full model output trajectory)
          - "metadata": dict with episode_id, steps, done, answer_correct
    """
```

```python
# Location: training/rewards.py

def reward_correctness(
    completions: list[list[dict[str, str]]],
    **kwargs: Any,
) -> list[float]:
    """Binary reward: 1.0 if episode ended with correct answer, 0.0 otherwise.

    Args:
        completions: Batch of completion message lists (TRL format).
        **kwargs: Additional metadata from rollout (includes 'metadata' key).

    Returns:
        List of float rewards, one per completion.
    """


def reward_progress(
    completions: list[list[dict[str, str]]],
    **kwargs: Any,
) -> list[float]:
    """Progress reward: cumulative progress score from environment.

    Args:
        completions: Batch of completion message lists (TRL format).
        **kwargs: Additional metadata from rollout.

    Returns:
        List of float rewards, one per completion.
    """


def reward_operational(
    completions: list[list[dict[str, str]]],
    **kwargs: Any,
) -> list[float]:
    """Operational reward: sum of per-step L1 signals (exec_ok, new_info, etc.).

    Args:
        completions: Batch of completion message lists (TRL format).
        **kwargs: Additional metadata from rollout.

    Returns:
        List of float rewards, one per completion.
    """
```

---

## 4. Data Flow

### Primary Flow (Training Loop)

```
1. Notebook loads GRPOConfig and model/tokenizer from HuggingFace
   - Input: config.model_name
   - Output: model, tokenizer, config

2. Load training questions filtered by difficulty
   - Input: config.questions_path, config.difficulty_filter
   - Output: list[str] of question texts as prompts

3. GRPOTrainer calls rollout_func for each batch of prompts
   - Input: prompts, model, tokenizer, config
   - Action: For each prompt, play a full SQLEnv episode
     a. Create local SQLEnvironment
     b. env.reset(question) -> initial observation
     c. Loop: format obs -> model.generate() -> parse SQLAction -> env.step()
     d. Collect full trajectory as completion string
   - Output: completions + metadata (correctness, progress, operational signals)

4. GRPOTrainer calls each reward_func on completions
   - Input: completions list, metadata kwargs
   - Output: list[float] per reward function

5. GRPOTrainer computes GRPO loss and updates model weights
   - Input: completions, rewards, model
   - Output: updated model weights, logged metrics

6. Repeat steps 3-5 for num_train_epochs
```

### Alternative Flow: Unparseable Model Output

```
1. Model generates text that cannot be parsed as SQLAction
2. rollout_func defaults to QUERY action with raw text as argument
3. Environment returns an error observation
4. Episode continues (agent can recover in subsequent steps)
```

### Alternative Flow: Episode Exceeds Token Budget

```
1. Observation context grows beyond max_new_tokens window
2. rollout_func truncates conversation history, keeping:
   a. System prompt (always)
   b. Most recent 3 observation-action pairs
3. Episode continues with truncated context
```

---

## 5. Error Handling

### Error Types

| Error | When | Strategy |
|-------|------|----------|
| `ModelLoadError` | Model not found on HuggingFace | Fail fast with clear message naming model_name |
| `ActionParseError` | Model output not parseable as SQLAction | Default to QUERY with raw text, log warning |
| `OOMError` | GPU out of memory during training | Print guidance: reduce batch_size or num_generations |
| `QuestionLoadError` | Questions file missing or empty | Fail fast with path in error message |
| `EnvironmentError` | SQLEnv database missing | Fail fast pointing to data download instructions |

### Error Handling Strategy

```python
# In rollout_func: graceful degradation
try:
    action = parse_action(model_output)
except ActionParseError:
    action = SQLAction(action_type="QUERY", argument=model_output)

# In notebook: fail-fast on setup
try:
    model = AutoModelForCausalLM.from_pretrained(config.model_name)
except Exception as e:
    raise RuntimeError(f"Cannot load model '{config.model_name}': {e}")
```

### Retry Strategy

| Operation | Retry? | Strategy |
|-----------|--------|----------|
| Model download | No | Fail fast, user must fix network/model name |
| Episode rollout | No | Single attempt per episode, errors become low-reward signal |
| Training step | No | OOM is fatal for that config, must adjust params |

---

## 6. Slice Plan (What we will ship, in order)

### Slice S1 -- Training Config + Prompts
**Value:** Centralized, documented configuration and system prompt ready for training integration
**User-visible change:** No (internal infrastructure)
**Interfaces introduced/changed:** `GRPOConfig`, `get_system_prompt()`, `format_observation()`
**Rollback safety:** Additive only -- new files, no existing code changed

### Slice S2 -- Rollout + Rewards
**Value:** TRL-compatible rollout and reward functions that can drive GRPO training
**User-visible change:** No (library code)
**Interfaces introduced/changed:** `rollout_func()`, `reward_correctness()`, `reward_progress()`, `reward_operational()`
**Rollback safety:** Additive only -- new files in training/ package

### Slice S3 -- Training Notebook
**Value:** Users can run one notebook to train a model and see before/after results
**User-visible change:** Yes -- the notebook is the primary deliverable
**Interfaces introduced/changed:** `notebooks/train_grpo.ipynb`, `pyproject.toml` training deps
**Rollback safety:** Notebook is standalone; pyproject.toml change is additive (optional deps group)

---

## 7. Implementation Steps

> **VERIFICATION NOTE:** Test criteria for each step are defined in VERIFICATION_SPEC.md.
> The verification-planner (separate agent) generated independent test criteria.
> Run the tests specified there after implementing each step.

### Step 1.1: Training Config Dataclass
**Slice:** S1
**Goal:** Create `training/config.py` with `GRPOConfig` dataclass holding all hyperparameters.

**Files:**
- `training/__init__.py` - create - package init with public exports
- `training/config.py` - create - GRPOConfig dataclass

**Interface Changes:**
- New type: `GRPOConfig` with fields as specified in Section 3

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** Low

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** OK Completed

**Completed:** 2026-03-28T06:44:31Z
**Changes Made:**
- Created `training/config.py` with `GRPOConfig` dataclass and input validation in `__post_init__`
- Created `training/__init__.py` exporting `GRPOConfig`
- Added `tests/unit/test_grpo_config.py` covering defaults, overrides, required fields, and validation failures

**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/unit/test_grpo_config.py -v
  Result: 7 passed in 17.06s
  ```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_grpo_config.py -v`
- **Notes:**
  - Added explicit validation for numeric bounds and non-empty difficulty filter to fail fast during setup
  - `uv run pytest ...` failed because pytest is not installed by default; used `uv run --with pytest pytest ...` for scoped test dependency
  - Kept config required fields (`questions_path`, `db_dir`, `output_dir`) positional/required per verification criteria
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- GRPOConfig available for import by prompts.py and rollout.py

---

### Step 1.2: System Prompt and Observation Formatter
**Slice:** S1
**Goal:** Create `training/prompts.py` with system prompt and observation formatting for model input.

**Files:**
- `training/prompts.py` - create - system prompt and observation formatter

**Interface Changes:**
- New functions: `get_system_prompt() -> str`, `format_observation(obs: SQLObservation) -> str`

**Details:**
- System prompt should instruct the model on:
  - Available actions: DESCRIBE, SAMPLE, QUERY, ANSWER
  - Action format: `ACTION_TYPE: argument`
  - Exploration strategy guidance (describe tables first, then query, then answer)
  - Budget awareness
- `format_observation` converts SQLObservation fields into a readable user-turn string

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** Low

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** OK Completed

**Completed:** 2026-03-28T06:47:49Z
**Changes Made:**
- Created `training/prompts.py` with deterministic `get_system_prompt()` and `format_observation()` helpers
- Added truncation guard for long observation results to keep prompt payload bounded
- Updated `training/__init__.py` exports to include prompt helpers
- Added `tests/unit/test_prompts.py` covering prompt content and observation formatting edge cases

**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/unit/test_prompts.py -v
  Result: 8 passed in 2.92s
  ```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_prompts.py -v`
- **Notes:**
  - `uv run pytest ...` failed because pytest is not installed in the base env; used `uv run --with pytest pytest ...` for scoped dependency execution
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- Prompt module ready for use in rollout.py

---

### Step 2.1: Action Parser Utility
**Slice:** S2
**Goal:** Create a robust parser that extracts `SQLAction` from free-form model output text.

**Files:**
- `training/rollout.py` - create - contains `parse_model_output(text: str) -> SQLAction`

**Interface Changes:**
- New function: `parse_model_output(text: str) -> SQLAction`
  - Parses `ACTION_TYPE: argument` format from model text
  - Falls back to `SQLAction(action_type="QUERY", argument=text)` on parse failure

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** Low

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** OK Completed

**Completed:** 2026-03-28T06:51:50Z
**Changes Made:**
- Created `training/rollout.py` with `parse_model_output(text)` and a focused line parser helper
- Added action parsing for DESCRIBE/SAMPLE/QUERY/ANSWER with case-insensitive matching
- Added robust fallback behavior to `SQLAction(action_type="QUERY", argument=<raw_text>)` on parse failure
- Added `tests/unit/test_rollout.py` with coverage for happy path, edge cases, multiline output, and fallback behavior

**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/unit/test_rollout.py -v
  Result: 11 passed in 2.44s
  ```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_rollout.py -v`
- **Notes:**
  - `uv run pytest ...` failed because pytest is not installed in the base env; used `uv run --with pytest pytest ...` for scoped dependency execution
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- parse_model_output is available in `training/rollout.py` for Step 2.2 rollout integration

---

### Step 2.2: Rollout Function
**Slice:** S2
**Goal:** Implement `rollout_func` that plays full SQLEnv episodes using HF generate.

**Files:**
- `training/rollout.py` - modify - add `rollout_func` and `play_episode` helper

**Interface Changes:**
- New function: `rollout_func(prompts, model, tokenizer, config) -> list[dict]`
- New helper: `play_episode(question_text, model, tokenizer, config, env) -> dict`
  - Creates local SQLEnvironment for the episode
  - Loops: format obs -> generate -> parse -> step until done or budget exhausted
  - Returns completion string and metadata dict

**Details:**
- Use `model.generate()` (HF native, not vLLM) for inference
- Build chat messages using tokenizer.apply_chat_template
- Truncate conversation history if it exceeds token window (keep system prompt + last 3 turns)
- Metadata includes: episode_id, step_count, done, answer_correct, cumulative_progress, operational_signals

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** Medium
> Core integration point between model and environment -- most likely source of bugs.

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** OK Completed

**Completed:** 2026-03-28T07:04:59Z
**Changes Made:**
- Expanded `training/rollout.py` with `rollout_func`, `play_episode`, message-history truncation, prompt-aware environment reset, and HF `model.generate()` integration paths for both list and tensor-like outputs.
- Added rollout metadata fields (`episode_id`, `step_count`, `done`, `answer_correct`, `cumulative_progress`, `operational_signals`) and top-level compatibility keys (`content`, `correct`, `progress`, `operational`).
- Extended `tests/unit/test_rollout.py` with Step 2.2 coverage for batch behavior, step-budget termination, metadata shape, unparseable-action fallback continuity, history truncation, HF-style generation decoding, prompt binding, and incorrect-answer correctness guard.

**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/unit/test_rollout.py -v
  Result: 21 passed in 2.58s
  ```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_rollout.py -v`
- **Notes:**
  - Used `uv run --with pytest ...` because `pytest` is not available in the base environment.
  - Medium-risk reviewer gate executed and resolved to APPROVE after decoder/correctness fixes.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- rollout metadata now carries correctness/progress/operational signals needed by `training/rewards.py` in Step 2.3

---

### Step 2.3: Reward Functions
**Slice:** S2
**Goal:** Implement three TRL-compatible reward callables that consume rollout metadata.

**Files:**
- `training/rewards.py` - create - reward_correctness, reward_progress, reward_operational

**Interface Changes:**
- New functions (all with TRL reward_func signature):
  - `reward_correctness(completions, **kwargs) -> list[float]`
  - `reward_progress(completions, **kwargs) -> list[float]`
  - `reward_operational(completions, **kwargs) -> list[float]`

**Details:**
- `reward_correctness`: Binary 1.0/0.0 based on metadata["answer_correct"]
- `reward_progress`: Float from metadata["cumulative_progress"], normalized to [0, 1]
- `reward_operational`: Sum of per-step operational signals from metadata["operational_signals"]
- All functions access metadata via kwargs (TRL passes extra data from rollout return)
- Each function must handle missing metadata gracefully (return 0.0)

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** Low

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** OK Completed

**Completed:** 2026-03-28T07:07:32Z
**Changes Made:**
- Created `training/rewards.py` with TRL-compatible `reward_correctness`, `reward_progress`, and `reward_operational` callables
- Added robust metadata extraction paths so reward functions support both nested `metadata` payloads and flattened rollout kwargs
- Updated `training/__init__.py` exports for reward helper imports from the package root
- Added `tests/unit/test_rewards.py` covering correctness/progress/operational behavior across happy path, edge, and batch scenarios

**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/unit/test_rewards.py -v
  Result: 19 passed in 3.35s
  ```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_rewards.py -v`
- **Notes:**
  - Used `uv run --with pytest ...` because `pytest` is not available in the base environment.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- `training/` now exposes config, prompts, rollout parsing/execution, and reward callables; next step is notebook wiring plus optional training dependencies in `pyproject.toml`

---

### Step 3.1: Training Notebook
**Slice:** S3
**Goal:** Create end-to-end training notebook that loads model, trains with GRPO, and produces learning curves.

**Files:**
- `notebooks/train_grpo.ipynb` - create - end-to-end training notebook
- `pyproject.toml` - modify - add `[project.optional-dependencies] training` group

**Interface Changes:**
- New optional dependency group: `training = ["trl>=0.12.0", "accelerate>=0.34.0"]`

**Details:**
Notebook cells (linear flow):
1. **Setup**: Install dependencies, import modules, set seed
2. **Config**: Instantiate GRPOConfig (users can override model_name here)
3. **Load Model**: `AutoModelForCausalLM.from_pretrained(config.model_name)`
4. **Load Dataset**: Load questions, filter by difficulty, format as prompts
5. **Initialize GRPOTrainer**: Pass model, tokenizer, rollout_func, reward_funcs, config
6. **Train**: `trainer.train()` with progress bar and metric printing
7. **Learning Curve**: Plot reward over training steps (matplotlib)
8. **Comparison**: Run 5 episodes with random actions vs trained model, display side-by-side transcripts
9. **Save**: Save trained model to config.output_dir

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** Medium
> User-facing deliverable; must work on fresh setup.

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** OK Completed

**Completed:** 2026-03-28T07:37:20Z
**Changes Made:**
- Created `notebooks/train_grpo.ipynb` as the primary user-facing training notebook for F006, with one-pass setup, model/question loading, trainer construction, training execution, learning-curve plotting, random-baseline vs trained transcript comparison, and artifact save steps.
- Added `[project.optional-dependencies].training` in `pyproject.toml` with `trl>=0.14.0,<0.15.0` and `accelerate>=0.34.0` to keep TRL/torch compatibility stable for this repository.
- Added `training/data_loading.py` to centralize notebook error handling for model loading and question filtering/loading.
- Added `training/notebook_pipeline.py` to centralize trainer wiring, random baseline generation, training execution, and metrics extraction.
- Updated `training/__init__.py` exports to include notebook-facing helpers.
- Added `tests/e2e/test_training_e2e.py` for notebook smoke structure + pipeline behavior checks.
- Added `tests/integration/test_training_pipeline.py` for rollout/reward integration scenarios.
- Added `tests/unit/test_error_handling.py` for model/question loading failures, OOM guidance messaging, and parse-fallback warning logging.

**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v
  Result: 68 passed in 5.79s
  Command: uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')"
  Result: ok
  ```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v`
- **Notes:**
  - Added concrete integration/e2e/error test files that were listed in `VERIFICATION_SPEC.md` but missing from repository.
  - Notebook now compares random-policy baseline transcripts against trained-policy transcripts, matching the feature's user-facing comparison goal.
  - Parse fallback now emits a warning log to align behavior with error-handling verification expectations.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- All implementation deliverables complete; feature is ready for final verification/finalization bookkeeping.

---

## 8. Rollout Considerations

### Feature Flags
- [ ] Required: No

### Migration
- [ ] Data migration needed: No

### Rollback Plan
All changes are additive (new `training/` package and `notebooks/` directory). Rollback is simply removing those directories and reverting the pyproject.toml optional deps change.

---

## 9. Execution Tracking

All execution state is tracked within this document:
- **Section 1a:** Overall progress summary
- **Section 7:** Per-step completion details, test results, and handoff context
- **FEATURES.json:** Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- **Git history:** Full audit trail of changes to this file

The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
- Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history

---

## 9a. Slice Completion Protocol

After all steps in a slice pass verification:

1. **Run verifier subagent** for spec compliance
   - Validates against VERIFICATION_SPEC.md criteria
   - Ensures no TODOs or incomplete work in slice

2. **Run compound-engineer subagent** to extract learnings
   - **Mandatory invocation** after every slice completion
   - Updates CLAUDE.md Learnings section (if durable patterns found)
   - May exit with "no update needed" (valid for routine work)

3. **Commit** the slice changes
   - Follow commit message format in CLAUDE.md
   - Each slice gets its own atomic commit

4. **Continue to next slice** (if more slices remain)
   - Or proceed to final verification if all slices complete

**Note:** PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready.

---

## 10. User Value Summary

**Status:** Generated

### What Users Can Now Do
Users can now run a single notebook (`notebooks/train_grpo.ipynb`) to configure GRPO training, load a compatible TRL stack, train a model on SQLEnv prompts, and inspect both reward-curve output and transcript comparisons between random and trained policies.

### How to Access/Test
1. Install training extras: `uv sync --extra training`
2. Open `notebooks/train_grpo.ipynb`
3. Run all cells to train and save artifacts to `outputs/grpo_run`

### Demo
- **Command:** `jupyter notebook notebooks/train_grpo.ipynb`
- **Verification command:** `uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v`

### Release Notes Snippet
Add a GRPO training pipeline for SQLEnv with a runnable notebook, pinned TRL training dependencies, robust loading/error helpers, and verification coverage across unit, integration, and notebook-smoke paths.

---

## 11. PR Contract (Auto-Generated by autocode-next-step)

**Status:** Generated

### Scope
- Finalized Step 3.1 (Training Notebook) for F006.
- Added training optional dependency group in `pyproject.toml` with TRL pin compatible with repo torch version.
- Added notebook support helpers for model/question loading and trainer orchestration.
- Added/expanded verification tests for notebook smoke, pipeline integration, and error handling.

### Files Changed
- `pyproject.toml`
- `notebooks/train_grpo.ipynb`
- `training/__init__.py`
- `training/data_loading.py`
- `training/notebook_pipeline.py`
- `training/rollout.py`
- `tests/e2e/test_training_e2e.py`
- `tests/integration/test_training_pipeline.py`
- `tests/unit/test_error_handling.py`
- `specs/F006-IMPLEMENTATION_SPEC.md`
- `specs/behavior/training.md`

### Verification Evidence
- `uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v` -> 68 passed
- `uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')"` -> ok
- Verifier verdict: APPROVED (`specs/F006-VERIFICATION_REPORT.md`)

### Risk and Rollback
- Risk tier: Medium (training dependencies and user-facing notebook workflow).
- Rollback: remove notebook/training helper additions and revert `pyproject.toml` training extra.

### Ready for Next Command
All implementation and verification criteria for F006 are complete. Run `/commit-push-pr` when ready.

---

## Stop Conditions (When to Split This Spec)

Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than **3 files** in unrelated areas
- You need to introduce **multiple new abstractions** "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices

When splitting, ensure the current slice ends in a merged, stable state.

---

## Human Checkpoint

**Before handing to AI agent:**

- [ ] Interface specifications are complete
- [ ] Data flow is accurate
- [ ] Error handling is specified
- [ ] Implementation order makes sense
- [ ] VERIFICATION_SPEC.md has been generated

**Questions:**
1. Confirm Qwen3-1.7B is accessible on HuggingFace Hub for the target environment.
2. Verify TRL GRPOTrainer API matches the rollout_func / reward_funcs signatures assumed here.

---

## Handoff Notes

**For the implementing AI agent:**

```
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Key decisions:
  - HF generate (not vLLM) for inference
  - Model name is a config parameter (default Qwen3-1.7B)
  - Start with easy+medium questions only
  - Follow TRL GRPOTrainer Wordle tutorial pattern
  - reward_funcs are separate callables
```

---

*Specification completed: 2026-03-27*
*Approved by: [pending]*
*Verification spec: VERIFICATION_SPEC.md*
*Target agent: Claude Code*