# Implementation Specification

**Change:** F006 -- GRPO Training Pipeline
**Date:** 2026-03-27
**Research Summary:** [specs/F006-RESEARCH_SUMMARY.md](F006-RESEARCH_SUMMARY.md)
**Verification Spec:** See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
**Behavior Delta:** Archived to [specs/behavior/training.md](behavior/training.md)

**Plan Status:**
- [x] Draft
- [x] Approved for Implementation
- [x] Implementation Complete
- [x] Verification Passed

---

## Core Intent (Immutable)

> **DO NOT MODIFY THIS SECTION DURING REFINEMENT**
> Changes to Core Intent mean you are describing a different feature.
> If refinement reveals the need to change this section, create a new feature instead.

**User Problem:**
Train a model that learns SQL exploration strategy through RL. The "before vs after" comparison is the competition's money shot -- untrained agent flails randomly, trained agent explores strategically.

**Success Criteria:**
- Training notebook runs end-to-end in one click
- Learning curve clearly shows improvement over episodes
- Side-by-side episode transcripts: random vs trained
- Reproducible results (deterministic given seed)

**Avoid:**
- Training that does not converge at all (no learning signal)
- Requiring an expensive GPU for hours to see any signal
- Notebook with hidden dependencies that break on fresh setup

**Out of Scope:**
- wandb / TensorBoard integration (MVP: print metrics)
- vLLM inference (use HF generate for simplicity)
- Hard-difficulty questions in training set (add later)
- WebSocket-based training (use local env)
- Multi-GPU / distributed training
- Custom RLHF algorithms beyond GRPO

---

## 0. Slicing & Scope Budget (Anti-Waterfall)

This spec must be executable in **small, mergeable increments**.

### Scope Budget
- Target: **3 slices**
- Hard max: **<= 10 steps total**
- Each step must end in: **implement -> verify -> merge**

### Slice Definition

| Slice | Name | Value |
|-------|------|-------|
| S1 | Training Config + Prompts | Configurable training setup, system prompt for SQL agent |
| S2 | Rollout + Rewards | TRL-compatible rollout function and reward callables |
| S3 | Training Notebook | End-to-end notebook with learning curve and comparison |

## Status Icons

**Step Status:**
- !! Not Started
- >> In Progress
- OK Completed
- XX Blocked/Failed

**Result Outcome:**
- OK Fully Successful (all tests passed, no issues)
- !! Completed with Issues (needs follow-up)
- XX Failed/Blocked

---

## 1. Implementation Overview

### Summary

Add a `training/` subpackage with configuration, rollout, reward wrappers, and prompt modules that integrate with TRL's GRPOTrainer. Provide a `notebooks/train_grpo.ipynb` notebook as the user-facing entry point that trains a small LLM (default: Qwen3-1.7B) to play SQLEnv, then produces learning curves and before/after episode comparisons.

### Scope

**In Scope:**
- `training/config.py` -- dataclass with all hyperparameters and model name
- `training/prompts.py` -- system prompt for SQL exploration agent
- `training/rollout.py` -- `rollout_func` that plays SQLEnv episodes via HF generate
- `training/rewards.py` -- reward callables matching TRL `reward_funcs` signature
- `notebooks/train_grpo.ipynb` -- end-to-end training notebook
- `training/__init__.py` -- public exports

**Out of Scope:**
- vLLM inference backend
- wandb/TensorBoard logging
- Training on hard-difficulty questions
- Distributed or multi-GPU training

---

## 1a. Execution Status

**Progress:** 6/6 steps complete
**Current Step:** None (implementation complete)
**Last Updated:** 2026-03-28T07:37:20Z
**Latest Result:** OK Fully Successful - Step 3.1 complete, 68/68 tests passed
**Blockers:** None

---

## 1b. Risk Assessment

**Risk Tier:** Medium

**Risk Tier Definitions:**
- **Low:** Pure logic, non-user-facing, no security implications
- **Medium:** User input handling, data validation, API changes
- **High:** Authentication, payments, secrets management, untrusted input

**High-Risk Indicators Present:** None

**Security Review Required:** No

**Justification:**
Loading external models from the HuggingFace Hub and managing GPU resources require care, but no security-sensitive data flows through the pipeline. The primary risks are non-convergence and resource requirements.

---

## 2. Change Manifest

### Files to Create

| File | Purpose |
|------|---------|
| `training/__init__.py` | Package init, public exports |
| `training/config.py` | `GRPOConfig` dataclass with hyperparameters |
| `training/prompts.py` | System prompt for SQL exploration agent |
| `training/rollout.py` | `rollout_func` for TRL GRPOTrainer |
| `training/rewards.py` | Reward callables: correctness, progress, operational |
| `training/data_loading.py` | Model/question loading helpers for notebook runtime and tests |
| `training/notebook_pipeline.py` | Notebook orchestration helpers for trainer setup, baseline, and metrics |
| `notebooks/train_grpo.ipynb` | End-to-end training notebook |
| `tests/integration/test_training_pipeline.py` | Integration verification for rollout + rewards pipeline |
| `tests/e2e/test_training_e2e.py` | Notebook smoke verification and pipeline behavior checks |
| `tests/unit/test_error_handling.py` | Error-path verification for model/questions loading and fallback logging |

### Files to Modify

| File | Changes |
|------|---------|
| `pyproject.toml` | Add `trl` and training optional dependency group |

### Files to Delete

None.

---

## 3. Interface Specifications

### New Types

```python
# Location: training/config.py

from dataclasses import dataclass, field

@dataclass
class GRPOConfig:
    """All hyperparameters for GRPO training on SQLEnv."""

    # Model
    model_name: str = "Qwen/Qwen3-1.7B"
    max_new_tokens: int = 256

    # Training
    num_train_epochs: int = 1
    per_device_train_batch_size: int = 2
    gradient_accumulation_steps: int = 4
    learning_rate: float = 5e-6
    num_generations: int = 4          # G in GRPO (completions per prompt)

    # Environment
    questions_path: str = "data/questions/questions_train.json"
    db_dir: str = "data/databases"
    step_budget: int = 10             # Shorter budget for training
    difficulty_filter: list[str] = field(default_factory=lambda: ["easy", "medium"])

    # Reproducibility
    seed: int = 42

    # Output
    output_dir: str = "outputs/grpo_run"
    logging_steps: int = 10
```
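
The dataclass above can fail fast on bad values by validating in `__post_init__`. The following is a minimal sketch, not the project's implementation: `SketchConfig` is a hypothetical, reduced stand-in for `GRPOConfig`, and the specific bounds are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SketchConfig:
    """Hypothetical, reduced stand-in for GRPOConfig (illustration only)."""

    learning_rate: float = 5e-6
    num_generations: int = 4
    step_budget: int = 10
    difficulty_filter: list[str] = field(default_factory=lambda: ["easy", "medium"])

    def __post_init__(self) -> None:
        # All bounds below are assumed for illustration, not taken from the spec.
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be > 0")
        if self.num_generations < 2:
            raise ValueError("num_generations must be >= 2")
        if self.step_budget < 1:
            raise ValueError("step_budget must be >= 1")
        if not self.difficulty_filter:
            raise ValueError("difficulty_filter must not be empty")


# Overrides are validated at construction time.
cfg = SketchConfig(step_budget=5)
```

Validating at construction means a misconfigured run stops before any model download or GPU allocation happens.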

### New Functions

```python
# Location: training/prompts.py

def get_system_prompt() -> str:
    """Return the system prompt for the SQL exploration agent.

    Returns:
        System prompt string instructing the model on SQLEnv action format.
    """


def format_observation(obs: "SQLObservation") -> str:
    """Format an SQLObservation into a user-turn string for the model.

    Args:
        obs: The observation from the environment.

    Returns:
        Formatted string suitable as a user message in chat history.
    """
```
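
To make the `format_observation` contract concrete, here is a hedged sketch. The `ObsSketch` class and all of its field names are assumptions standing in for `SQLObservation`, whose real fields are not listed in this spec, and the 200-character cap is likewise illustrative.

```python
from dataclasses import dataclass

MAX_RESULT_CHARS = 200  # assumed cap; the real limit is not stated in this spec


@dataclass
class ObsSketch:
    """Hypothetical stand-in for SQLObservation; field names are assumptions."""

    question: str
    result: str = ""
    error: str = ""
    budget_remaining: int = 0


def format_observation_sketch(obs: ObsSketch) -> str:
    """Render an observation as a user-turn string, truncating long results."""
    result = obs.result
    if len(result) > MAX_RESULT_CHARS:
        result = result[:MAX_RESULT_CHARS] + "... [truncated]"
    lines = [f"Question: {obs.question}"]
    if result:
        lines.append(f"Result: {result}")
    if obs.error:
        lines.append(f"Error: {obs.error}")
    lines.append(f"Steps remaining: {obs.budget_remaining}")
    return "\n".join(lines)
```

Bounding the result text keeps a single large query result from crowding the rest of the conversation out of the prompt.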

```python
# Location: training/rollout.py

from typing import Any

def rollout_func(
    prompts: list[str],
    model: Any,
    tokenizer: Any,
    config: "GRPOConfig",
) -> list[dict[str, Any]]:
    """Play SQLEnv episodes for a batch of question prompts.

    Each prompt is a question text. The function:
    1. Creates a local SQLEnvironment
    2. Resets with the question
    3. Loops: model.generate() -> parse action -> env.step()
    4. Collects completions and metadata

    Args:
        prompts: List of question texts (from training dataset).
        model: HuggingFace model for generation.
        tokenizer: HuggingFace tokenizer.
        config: Training configuration.

    Returns:
        List of dicts with keys:
          - "prompt": str (the input prompt)
          - "completion": str (full model output trajectory)
          - "metadata": dict with episode_id, steps, done, answer_correct
    """
```

```python
# Location: training/rewards.py

def reward_correctness(
    completions: list[list[dict[str, str]]],
    **kwargs: Any,
) -> list[float]:
    """Binary reward: 1.0 if episode ended with correct answer, 0.0 otherwise.

    Args:
        completions: Batch of completion message lists (TRL format).
        **kwargs: Additional metadata from rollout (includes 'metadata' key).

    Returns:
        List of float rewards, one per completion.
    """


def reward_progress(
    completions: list[list[dict[str, str]]],
    **kwargs: Any,
) -> list[float]:
    """Progress reward: cumulative progress score from environment.

    Args:
        completions: Batch of completion message lists (TRL format).
        **kwargs: Additional metadata from rollout.

    Returns:
        List of float rewards, one per completion.
    """


def reward_operational(
    completions: list[list[dict[str, str]]],
    **kwargs: Any,
) -> list[float]:
    """Operational reward: sum of per-step L1 signals (exec_ok, new_info, etc.).

    Args:
        completions: Batch of completion message lists (TRL format).
        **kwargs: Additional metadata from rollout.

    Returns:
        List of float rewards, one per completion.
    """
```

---

## 4. Data Flow

### Primary Flow (Training Loop)

```
1. Notebook loads GRPOConfig and model/tokenizer from HuggingFace
   - Input: config.model_name
   - Output: model, tokenizer, config

2. Load training questions filtered by difficulty
   - Input: config.questions_path, config.difficulty_filter
   - Output: list[str] of question texts as prompts

3. GRPOTrainer calls rollout_func for each batch of prompts
   - Input: prompts, model, tokenizer, config
   - Action: For each prompt, play a full SQLEnv episode
     a. Create local SQLEnvironment
     b. env.reset(question) -> initial observation
     c. Loop: format obs -> model.generate() -> parse SQLAction -> env.step()
     d. Collect full trajectory as completion string
   - Output: completions + metadata (correctness, progress, operational signals)

4. GRPOTrainer calls each reward_func on completions
   - Input: completions list, metadata kwargs
   - Output: list[float] per reward function

5. GRPOTrainer computes GRPO loss and updates model weights
   - Input: completions, rewards, model
   - Output: updated model weights, logged metrics

6. Repeat steps 3-5 for num_train_epochs
```

### Alternative Flow: Unparseable Model Output

```
1. Model generates text that cannot be parsed as SQLAction
2. rollout_func defaults to QUERY action with raw text as argument
3. Environment returns an error observation
4. Episode continues (agent can recover in subsequent steps)
```

### Alternative Flow: Episode Exceeds Token Budget

```
1. Conversation context grows beyond the model's token window
2. rollout_func truncates conversation history, keeping:
   a. System prompt (always)
   b. Most recent 3 observation-action pairs
3. Episode continues with truncated context
```
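
The truncation rule above can be sketched as follows. This is an illustration under stated assumptions: messages are chat-style dicts with `role` and `content` keys, observations are `user` turns, and actions are `assistant` turns. The function name is hypothetical, and the sketch does not count tokens; a caller would invoke it once its own token check trips.

```python
def truncate_history(
    messages: list[dict[str, str]], keep_pairs: int = 3
) -> list[dict[str, str]]:
    """Keep system messages plus the most recent observation-action pairs."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    # A trailing user message is an observation still awaiting an action;
    # keep it separately so the retained pairs stay aligned.
    pending = []
    if turns and turns[-1]["role"] == "user":
        pending = [turns[-1]]
        turns = turns[:-1]

    # One pair = one observation (user) + one action (assistant).
    kept = turns[-2 * keep_pairs:] if keep_pairs > 0 else []
    return system + kept + pending
```

Keeping the pending observation outside the pair window means the retained history still starts with a `user` turn, which many chat templates expect.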

---

## 5. Error Handling

### Error Types

| Error | When | Strategy |
|-------|------|----------|
| `ModelLoadError` | Model not found on HuggingFace | Fail fast with clear message naming model_name |
| `ActionParseError` | Model output not parseable as SQLAction | Default to QUERY with raw text, log warning |
| `OOMError` | GPU out of memory during training | Print guidance: reduce batch_size or num_generations |
| `QuestionLoadError` | Questions file missing or empty | Fail fast with path in error message |
| `EnvironmentError` | SQLEnv database missing | Fail fast pointing to data download instructions |
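
As an illustration of the fail-fast strategy for `QuestionLoadError`, here is a minimal sketch. The function name `load_question_prompts`, the exception's base class, and the JSON schema (a list of objects with `question` and `difficulty` keys) are all assumptions; the real file layout is not described in this spec.

```python
import json
from pathlib import Path


class QuestionLoadError(RuntimeError):
    """Raised when the questions file is missing, invalid, or yields no prompts."""


def load_question_prompts(path: str, difficulty_filter: list[str]) -> list[str]:
    """Load question texts whose difficulty is in `difficulty_filter`."""
    file_path = Path(path)
    if not file_path.is_file():
        raise QuestionLoadError(f"Questions file not found: {file_path}")
    try:
        records = json.loads(file_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise QuestionLoadError(f"Invalid JSON in {file_path}: {exc}") from exc

    # Assumed schema: [{"question": "...", "difficulty": "easy"}, ...]
    prompts = [r["question"] for r in records if r.get("difficulty") in difficulty_filter]
    if not prompts:
        raise QuestionLoadError(
            f"No questions matching {difficulty_filter} in {file_path}"
        )
    return prompts
```

Every error message includes the path, so a user on a fresh setup can see which file to fix without reading a traceback.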

### Error Handling Strategy

```python
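import logging

from transformers import AutoModelForCausalLM

logger = logging.getLogger(__name__)

# In rollout_func: graceful degradation
try:
    action = parse_action(model_output)
except ActionParseError:
    # Fall back to QUERY with the raw text and log a warning (see Error Types)
    logger.warning("Unparseable model output; defaulting to QUERY")
    action = SQLAction(action_type="QUERY", argument=model_output)

# In notebook: fail-fast on setup
try:
    model = AutoModelForCausalLM.from_pretrained(config.model_name)
except Exception as e:
    # Chain the original exception so the root cause stays in the traceback
    raise RuntimeError(f"Cannot load model '{config.model_name}': {e}") from e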
```

### Retry Strategy

| Operation | Retry? | Strategy |
|-----------|--------|----------|
| Model download | No | Fail fast, user must fix network/model name |
| Episode rollout | No | Single attempt per episode, errors become low-reward signal |
| Training step | No | OOM is fatal for that config, must adjust params |

---

## 6. Slice Plan (What we will ship, in order)

### Slice S1 -- Training Config + Prompts
**Value:** Centralized, documented configuration and system prompt ready for training integration
**User-visible change:** No (internal infrastructure)
**Interfaces introduced/changed:** `GRPOConfig`, `get_system_prompt()`, `format_observation()`
**Rollback safety:** Additive only -- new files, no existing code changed

### Slice S2 -- Rollout + Rewards
**Value:** TRL-compatible rollout and reward functions that can drive GRPO training
**User-visible change:** No (library code)
**Interfaces introduced/changed:** `rollout_func()`, `reward_correctness()`, `reward_progress()`, `reward_operational()`
**Rollback safety:** Additive only -- new files in training/ package

### Slice S3 -- Training Notebook
**Value:** Users can run one notebook to train a model and see before/after results
**User-visible change:** Yes -- the notebook is the primary deliverable
**Interfaces introduced/changed:** `notebooks/train_grpo.ipynb`, `pyproject.toml` training deps
**Rollback safety:** Notebook is standalone; pyproject.toml change is additive (optional deps group)

---

## 7. Implementation Steps

> **VERIFICATION NOTE:** Test criteria for each step are defined in VERIFICATION_SPEC.md.
> The verification-planner (separate agent) generated independent test criteria.
> Run the tests specified there after implementing each step.

### Step 1.1: Training Config Dataclass
**Slice:** S1
**Goal:** Create `training/config.py` with `GRPOConfig` dataclass holding all hyperparameters.

**Files:**
- `training/__init__.py` - create - package init with public exports
- `training/config.py` - create - GRPOConfig dataclass

**Interface Changes:**
- New type: `GRPOConfig` with fields as specified in Section 3

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** Low

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** OK Completed

**Completed:** 2026-03-28T06:44:31Z
**Changes Made:**
- Created `training/config.py` with `GRPOConfig` dataclass and input validation in `__post_init__`
- Created `training/__init__.py` exporting `GRPOConfig`
- Added `tests/unit/test_grpo_config.py` covering defaults, overrides, required fields, and validation failures

**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/unit/test_grpo_config.py -v
  Result: 7 passed in 17.06s
  ```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_grpo_config.py -v`
- **Notes:**
  - Added explicit validation for numeric bounds and non-empty difficulty filter to fail fast during setup
  - `uv run pytest ...` failed because pytest is not installed by default; used `uv run --with pytest pytest ...` for scoped test dependency
  - Kept config required fields (`questions_path`, `db_dir`, `output_dir`) positional/required per verification criteria
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- GRPOConfig available for import by prompts.py and rollout.py

---

### Step 1.2: System Prompt and Observation Formatter
**Slice:** S1
**Goal:** Create `training/prompts.py` with system prompt and observation formatting for model input.

**Files:**
- `training/prompts.py` - create - system prompt and observation formatter

**Interface Changes:**
- New functions: `get_system_prompt() -> str`, `format_observation(obs: SQLObservation) -> str`

**Details:**
- System prompt should instruct the model on:
  - Available actions: DESCRIBE, SAMPLE, QUERY, ANSWER
  - Action format: `ACTION_TYPE: argument`
  - Exploration strategy guidance (describe tables first, then query, then answer)
  - Budget awareness
- `format_observation` converts SQLObservation fields into a readable user-turn string

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** Low

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** OK Completed

**Completed:** 2026-03-28T06:47:49Z
**Changes Made:**
- Created `training/prompts.py` with deterministic `get_system_prompt()` and `format_observation()` helpers
- Added truncation guard for long observation results to keep prompt payload bounded
- Updated `training/__init__.py` exports to include prompt helpers
- Added `tests/unit/test_prompts.py` covering prompt content and observation formatting edge cases

**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/unit/test_prompts.py -v
  Result: 8 passed in 2.92s
  ```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_prompts.py -v`
- **Notes:**
  - `uv run pytest ...` failed because pytest is not installed in the base env; used `uv run --with pytest pytest ...` for scoped dependency execution
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- Prompt module ready for use in rollout.py

---

### Step 2.1: Action Parser Utility
**Slice:** S2
**Goal:** Create a robust parser that extracts `SQLAction` from free-form model output text.

**Files:**
- `training/rollout.py` - create - contains `parse_model_output(text: str) -> SQLAction`

**Interface Changes:**
- New function: `parse_model_output(text: str) -> SQLAction`
  - Parses `ACTION_TYPE: argument` format from model text
  - Falls back to `SQLAction(action_type="QUERY", argument=text)` on parse failure
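
The parsing and fallback behavior described above could look roughly like this. It is a sketch under assumptions, not the shipped parser: `ActionSketch` is a hypothetical stand-in for `SQLAction`, and the choice to take the first matching line is illustrative.

```python
import re
from dataclasses import dataclass

# Matches "ACTION_TYPE: argument" on a single line, case-insensitively.
_ACTION_RE = re.compile(
    r"^\s*(DESCRIBE|SAMPLE|QUERY|ANSWER)\s*:\s*(\S.*)$", re.IGNORECASE
)


@dataclass
class ActionSketch:
    """Hypothetical stand-in for SQLAction."""

    action_type: str
    argument: str


def parse_model_output_sketch(text: str) -> ActionSketch:
    """Extract the first well-formed action line, else fall back to QUERY."""
    # Scan line by line so that preamble text before the action is tolerated.
    for line in text.splitlines():
        match = _ACTION_RE.match(line)
        if match:
            return ActionSketch(match.group(1).upper(), match.group(2).strip())
    # Parse failure: treat the raw text as a QUERY so the episode can continue.
    return ActionSketch("QUERY", text)
```

One limitation of this sketch is that an argument spanning several lines is cut at the first line break; a real parser may need to handle multi-line SQL.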

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** Low

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** OK Completed

**Completed:** 2026-03-28T06:51:50Z
**Changes Made:**
- Created `training/rollout.py` with `parse_model_output(text)` and a focused line parser helper
- Added action parsing for DESCRIBE/SAMPLE/QUERY/ANSWER with case-insensitive matching
- Added robust fallback behavior to `SQLAction(action_type="QUERY", argument=<raw_text>)` on parse failure
- Added `tests/unit/test_rollout.py` with coverage for happy path, edge cases, multiline output, and fallback behavior

**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/unit/test_rollout.py -v
  Result: 11 passed in 2.44s
  ```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_rollout.py -v`
- **Notes:**
  - `uv run pytest ...` failed because pytest is not installed in the base env; used `uv run --with pytest pytest ...` for scoped dependency execution
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- parse_model_output is available in `training/rollout.py` for Step 2.2 rollout integration

---

### Step 2.2: Rollout Function
**Slice:** S2
**Goal:** Implement `rollout_func` that plays full SQLEnv episodes using HF generate.

**Files:**
- `training/rollout.py` - modify - add `rollout_func` and `play_episode` helper

**Interface Changes:**
- New function: `rollout_func(prompts, model, tokenizer, config) -> list[dict]`
- New helper: `play_episode(question_text, model, tokenizer, config, env) -> dict`
  - Creates local SQLEnvironment for the episode
  - Loops: format obs -> generate -> parse -> step until done or budget exhausted
  - Returns completion string and metadata dict

**Details:**
- Use `model.generate()` (HF native, not vLLM) for inference
- Build chat messages using tokenizer.apply_chat_template
- Truncate conversation history if it exceeds token window (keep system prompt + last 3 turns)
- Metadata includes: episode_id, step_count, done, answer_correct, cumulative_progress, operational_signals
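
The generate, parse, step loop can be illustrated without a model or database. In this sketch, `ToyEnv` is a hypothetical stand-in for `SQLEnvironment` (the real API may differ), and the `generate` callable stands in for `model.generate()` plus decoding and parsing.

```python
from typing import Callable


class ToyEnv:
    """Hypothetical stand-in for SQLEnvironment; the real API may differ."""

    def __init__(self, answer: str) -> None:
        self._answer = answer

    def reset(self, question: str) -> str:
        return f"Question: {question}"

    def step(self, action: str) -> tuple[str, bool, bool]:
        # Returns (observation, done, answer_correct).
        if action.startswith("ANSWER:"):
            guess = action.split(":", 1)[1].strip()
            return "Episode finished.", True, guess == self._answer
        return f"Executed: {action}", False, False


def play_episode_sketch(
    question: str,
    generate: Callable[[list[str]], str],
    env: ToyEnv,
    step_budget: int,
) -> dict:
    """Loop generate -> step until the episode ends or the budget runs out."""
    history = [env.reset(question)]
    actions: list[str] = []
    done, correct = False, False
    while not done and len(actions) < step_budget:
        action = generate(history)  # stands in for model.generate() + decoding
        actions.append(action)
        obs, done, correct = env.step(action)
        history.append(obs)
    return {
        "completion": "\n".join(actions),
        "metadata": {
            "step_count": len(actions),
            "done": done,
            "answer_correct": correct,
        },
    }
```

Passing a scripted callable as `generate` makes the loop testable on CPU, which is one way to check step-budget termination without loading a model.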

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** Medium
> Core integration point between model and environment -- most likely source of bugs.

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** OK Completed

**Completed:** 2026-03-28T07:04:59Z
**Changes Made:**
- Expanded `training/rollout.py` with `rollout_func`, `play_episode`, message-history truncation, prompt-aware environment reset, and HF `model.generate()` integration paths for both list and tensor-like outputs.
- Added rollout metadata fields (`episode_id`, `step_count`, `done`, `answer_correct`, `cumulative_progress`, `operational_signals`) and top-level compatibility keys (`content`, `correct`, `progress`, `operational`).
- Extended `tests/unit/test_rollout.py` with Step 2.2 coverage for batch behavior, step-budget termination, metadata shape, unparseable-action fallback continuity, history truncation, HF-style generation decoding, prompt binding, and incorrect-answer correctness guard.

**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/unit/test_rollout.py -v
  Result: 21 passed in 2.58s
  ```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_rollout.py -v`
- **Notes:**
  - Used `uv run --with pytest ...` because `pytest` is not available in the base environment.
  - Medium-risk reviewer gate executed and resolved to APPROVE after decoder/correctness fixes.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- rollout metadata now carries correctness/progress/operational signals needed by `training/rewards.py` in Step 2.3

---

### Step 2.3: Reward Functions
**Slice:** S2
**Goal:** Implement three TRL-compatible reward callables that consume rollout metadata.

**Files:**
- `training/rewards.py` - create - reward_correctness, reward_progress, reward_operational

**Interface Changes:**
- New functions (all with TRL reward_func signature):
  - `reward_correctness(completions, **kwargs) -> list[float]`
  - `reward_progress(completions, **kwargs) -> list[float]`
  - `reward_operational(completions, **kwargs) -> list[float]`

**Details:**
- `reward_correctness`: Binary 1.0/0.0 based on metadata["answer_correct"]
- `reward_progress`: Float from metadata["cumulative_progress"], normalized to [0, 1]
- `reward_operational`: Sum of per-step operational signals from metadata["operational_signals"]
- All functions access metadata via kwargs (TRL passes extra data from rollout return)
- Each function must handle missing metadata gracefully (return 0.0)
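
A hedged sketch of the three callables follows. It assumes metadata arrives as `kwargs["metadata"]`, a list with one dict per completion, and that `operational_signals` is a list of numbers. Clamping is used here as one possible reading of "normalized to [0, 1]"; the spec does not say how normalization is done.

```python
from typing import Any


def _metadata_list(completions: list[Any], kwargs: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one metadata dict per completion, using {} where it is missing."""
    metadata = list(kwargs.get("metadata") or [])
    padded = metadata + [{}] * (len(completions) - len(metadata))
    return [m if isinstance(m, dict) else {} for m in padded[: len(completions)]]


def reward_correctness_sketch(completions: list[Any], **kwargs: Any) -> list[float]:
    """1.0 when the episode ended with a correct answer, else 0.0."""
    return [
        1.0 if m.get("answer_correct") else 0.0
        for m in _metadata_list(completions, kwargs)
    ]


def reward_progress_sketch(completions: list[Any], **kwargs: Any) -> list[float]:
    """Cumulative progress, clamped to [0, 1] (clamping is an assumption)."""
    return [
        min(1.0, max(0.0, float(m.get("cumulative_progress", 0.0))))
        for m in _metadata_list(completions, kwargs)
    ]


def reward_operational_sketch(completions: list[Any], **kwargs: Any) -> list[float]:
    """Sum of per-step operational signals; 0.0 when none are present."""
    return [
        float(sum(m.get("operational_signals", [])))
        for m in _metadata_list(completions, kwargs)
    ]
```

Each function returns exactly one float per completion even when metadata is absent, so a malformed rollout lowers the reward instead of crashing the training step.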

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** Low

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** OK Completed

**Completed:** 2026-03-28T07:07:32Z
**Changes Made:**
- Created `training/rewards.py` with TRL-compatible `reward_correctness`, `reward_progress`, and `reward_operational` callables
- Added robust metadata extraction paths so reward functions support both nested `metadata` payloads and flattened rollout kwargs
- Updated `training/__init__.py` exports for reward helper imports from the package root
- Added `tests/unit/test_rewards.py` covering correctness/progress/operational behavior across happy path, edge, and batch scenarios

**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/unit/test_rewards.py -v
  Result: 19 passed in 3.35s
  ```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_rewards.py -v`
- **Notes:**
  - Used `uv run --with pytest ...` because `pytest` is not available in the base environment.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- `training/` now exposes config, prompts, rollout parsing/execution, and reward callables; next step is notebook wiring plus optional training dependencies in `pyproject.toml`

---

### Step 3.1: Training Notebook
**Slice:** S3
**Goal:** Create end-to-end training notebook that loads model, trains with GRPO, and produces learning curves.

**Files:**
- `notebooks/train_grpo.ipynb` - create - end-to-end training notebook
- `pyproject.toml` - modify - add `[project.optional-dependencies] training` group

**Interface Changes:**
- New optional dependency group: `training = ["trl>=0.12.0", "accelerate>=0.34.0"]`

**Details:**
Notebook cells (linear flow):
1. **Setup**: Install dependencies, import modules, set seed
2. **Config**: Instantiate GRPOConfig (users can override model_name here)
3. **Load Model**: `AutoModelForCausalLM.from_pretrained(config.model_name)`
4. **Load Dataset**: Load questions, filter by difficulty, format as prompts
5. **Initialize GRPOTrainer**: Pass model, tokenizer, rollout_func, reward_funcs, config
6. **Train**: `trainer.train()` with progress bar and metric printing
7. **Learning Curve**: Plot reward over training steps (matplotlib)
8. **Comparison**: Run 5 episodes with random actions vs trained model, display side-by-side transcripts
9. **Save**: Save trained model to config.output_dir
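
For the learning-curve cell, the plotted series has to be pulled out of the training logs first. The sketch below assumes the logs are a list of dicts such as the `trainer.state.log_history` list kept by Hugging Face trainers, and that the reward is logged under the key `"reward"`. The key name and both function names are assumptions to verify against the installed TRL version.

```python
def extract_reward_curve(
    log_history: list[dict], key: str = "reward"
) -> tuple[list[int], list[float]]:
    """Collect (step, reward) points from log entries that contain both."""
    steps: list[int] = []
    rewards: list[float] = []
    for entry in log_history:
        if key in entry and "step" in entry:
            steps.append(int(entry["step"]))
            rewards.append(float(entry[key]))
    return steps, rewards


def moving_average(values: list[float], window: int = 3) -> list[float]:
    """Trailing moving average; early points average over the values seen so far."""
    smoothed: list[float] = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

The resulting `steps` and smoothed rewards can be passed straight to a matplotlib line plot. Smoothing matters here because per-step GRPO rewards tend to be noisy at small batch sizes.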

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** Medium
> User-facing deliverable; must work on fresh setup.

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** OK Completed

**Completed:** 2026-03-28T07:37:20Z
**Changes Made:**
- Created `notebooks/train_grpo.ipynb` as the primary user-facing training notebook for F006, with one-pass setup, model/question loading, trainer construction, training execution, learning-curve plotting, random-baseline vs trained transcript comparison, and artifact save steps.
- Added `[project.optional-dependencies].training` in `pyproject.toml` with `trl>=0.14.0,<0.15.0` and `accelerate>=0.34.0` to keep TRL/torch compatibility stable for this repository.
- Added `training/data_loading.py` to centralize notebook error handling for model loading and question filtering/loading.
- Added `training/notebook_pipeline.py` to centralize trainer wiring, random baseline generation, training execution, and metrics extraction.
- Updated `training/__init__.py` exports to include notebook-facing helpers.
- Added `tests/e2e/test_training_e2e.py` for notebook smoke structure + pipeline behavior checks.
- Added `tests/integration/test_training_pipeline.py` for rollout/reward integration scenarios.
- Added `tests/unit/test_error_handling.py` for model/question loading failures, OOM guidance messaging, and parse-fallback warning logging.
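
The question filtering centralized in `training/data_loading.py` might look roughly like the sketch below. It is an assumption-laden illustration, not the module's real code: the function name `filter_questions_by_difficulty`, the `difficulty` field, and the error wording are all hypothetical.

```python
from typing import Any, Iterable, Sequence


def filter_questions_by_difficulty(
    questions: Iterable[dict[str, Any]],
    allowed: Sequence[str] = ("easy", "medium"),
) -> list[dict[str, Any]]:
    """Keep questions whose (assumed) `difficulty` field is in `allowed`."""
    allowed_set = {level.lower() for level in allowed}
    selected = [
        q for q in questions
        if str(q.get("difficulty", "")).lower() in allowed_set
    ]
    if not selected:
        # Fail early with an actionable message instead of training on nothing.
        raise ValueError(
            f"No questions match difficulties {sorted(allowed_set)}; "
            "check the dataset path and the difficulty labels."
        )
    return selected
```

Raising on an empty result surfaces a misconfigured dataset in the loading cell rather than as an opaque failure once training starts.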

**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v
  Result: 68 passed in 5.79s
  Command: uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')"
  Result: ok
  ```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v`
- **Notes:**
  - Added concrete integration/e2e/error test files that were listed in `VERIFICATION_SPEC.md` but were missing from the repository.
  - Notebook now compares random-policy baseline transcripts against trained-policy transcripts, matching the feature's user-facing comparison goal.
  - Parse fallback now emits a warning log to align behavior with error-handling verification expectations.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
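
The random-baseline versus trained-policy comparison noted above could be rendered with a small formatter like this one. It is only a sketch: `side_by_side` is a hypothetical helper, and each transcript is assumed to be a plain list of strings, one per step.

```python
from itertools import zip_longest


def side_by_side(
    left: list[str],
    right: list[str],
    width: int = 38,
    titles: tuple[str, str] = ("random policy", "trained policy"),
) -> str:
    """Render two transcripts as aligned text columns."""

    def cell(text: str) -> str:
        # Truncate long lines so the column separator stays aligned.
        return text[:width].ljust(width)

    rows = [
        f"{cell(titles[0])} | {cell(titles[1])}".rstrip(),
        "-" * (2 * width + 3),
    ]
    # Pad the shorter transcript with empty cells.
    for l_line, r_line in zip_longest(left, right, fillvalue=""):
        rows.append(f"{cell(l_line)} | {cell(r_line)}".rstrip())
    return "\n".join(rows)
```

Calling `print(side_by_side(random_steps, trained_steps))` once per episode would give the side-by-side view the notebook aims for; the two argument names are placeholders for whatever transcript lists the pipeline produces.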

**Context for Next Step:**
- All implementation deliverables complete; feature is ready for final verification/finalization bookkeeping.

---

## 8. Rollout Considerations

### Feature Flags
- [ ] Required: No

### Migration
- [ ] Data migration needed: No

### Rollback Plan
All changes are additive (new `training/` package and `notebooks/` directory, plus the tests added alongside them). Rollback consists of removing those additions and reverting the `[project.optional-dependencies]` change in `pyproject.toml`.

---

## 9. Execution Tracking

All execution state is tracked within this document:
- **Section 1a:** Overall progress summary
- **Section 7:** Per-step completion details, test results, and handoff context
- **FEATURES.json:** Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- **Git history:** Full audit trail of changes to this file

The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
- Running `git log --oneline specs/F006-IMPLEMENTATION_SPEC.md` for change history

---

## 9a. Slice Completion Protocol

After all steps in a slice pass verification:

1. **Run verifier subagent** for spec compliance
   - Validates against VERIFICATION_SPEC.md criteria
   - Ensures no TODOs or incomplete work in slice

2. **Run compound-engineer subagent** to extract learnings
   - **Mandatory invocation** after every slice completion
   - Updates CLAUDE.md Learnings section (if durable patterns found)
   - May exit with "no update needed" (valid for routine work)

3. **Commit** the slice changes
   - Follow commit message format in CLAUDE.md
   - Each slice gets its own atomic commit

4. **Continue to next slice** (if more slices remain)
   - Or proceed to final verification if all slices complete

**Note:** PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready.

---

## 10. User Value Summary

**Status:** Generated

### What Users Can Now Do
Users can now run a single notebook (`notebooks/train_grpo.ipynb`) to configure GRPO training, load a compatible TRL stack, train a model on SQLEnv prompts, and inspect both reward-curve output and transcript comparisons between random and trained policies.

### How to Access/Test
1. Install training extras: `uv sync --extra training`
2. Open `notebooks/train_grpo.ipynb`
3. Run all cells to train and save artifacts to `outputs/grpo_run`

### Demo
- **Command:** `jupyter notebook notebooks/train_grpo.ipynb`
- **Verification command:** `uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v`

### Release Notes Snippet
Add a GRPO training pipeline for SQLEnv with a runnable notebook, pinned TRL training dependencies, robust loading/error helpers, and verification coverage across unit, integration, and notebook-smoke paths.

---

## 11. PR Contract (Auto-Generated by autocode-next-step)

**Status:** Generated

### Scope
- Finalized Step 3.1 (Training Notebook) for F006.
- Added the `training` optional dependency group in `pyproject.toml` with a TRL pin compatible with the repo's torch version.
- Added notebook support helpers for model/question loading and trainer orchestration.
- Added/expanded verification tests for notebook smoke, pipeline integration, and error handling.

### Files Changed
- `pyproject.toml`
- `notebooks/train_grpo.ipynb`
- `training/__init__.py`
- `training/data_loading.py`
- `training/notebook_pipeline.py`
- `training/rollout.py`
- `tests/e2e/test_training_e2e.py`
- `tests/integration/test_training_pipeline.py`
- `tests/unit/test_error_handling.py`
- `specs/F006-IMPLEMENTATION_SPEC.md`
- `specs/behavior/training.md`

### Verification Evidence
- `uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v` -> 68 passed
- `uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')"` -> ok
- Verifier verdict: APPROVED (`specs/F006-VERIFICATION_REPORT.md`)
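
The import smoke check above can also be expressed as a preflight helper for the notebook's setup cell. The sketch below is an assumption, not part of the shipped code: `missing_training_packages` and `training_extra_hint` are hypothetical names, and the check only confirms that the packages are discoverable, not that their versions satisfy the pin.

```python
import importlib.util
from typing import Sequence


def missing_training_packages(
    required: Sequence[str] = ("trl", "accelerate"),
) -> list[str]:
    """Return the required top-level packages that cannot be found."""
    # find_spec locates a top-level package without importing it.
    return [name for name in required if importlib.util.find_spec(name) is None]


def training_extra_hint(missing: Sequence[str]) -> str:
    """Build a short message telling the user how to fix a missing extra."""
    if not missing:
        return "Training extra is available."
    return (
        f"Missing packages: {', '.join(missing)}. "
        "Install them with `uv sync --extra training`."
    )
```

Running `print(training_extra_hint(missing_training_packages()))` before any heavy imports gives a clearer failure than a raw `ModuleNotFoundError` on a fresh setup.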

### Risk and Rollback
- Risk tier: Medium (training dependencies and user-facing notebook workflow).
- Rollback: remove notebook/training helper additions and revert `pyproject.toml` training extra.

### Ready for Next Command
All implementation and verification criteria for F006 are complete. Run `/commit-push-pr` when ready.

---

## Stop Conditions (When to Split This Spec)

Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than **3 files** in unrelated areas
- You need to introduce **multiple new abstractions** "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices

When splitting, ensure the current slice ends in a merged, stable state.

---

## Human Checkpoint

**Before handing to AI agent:**

- [ ] Interface specifications are complete
- [ ] Data flow is accurate
- [ ] Error handling is specified
- [ ] Implementation order makes sense
- [ ] VERIFICATION_SPEC.md has been generated

**Questions:**
1. Confirm Qwen3-1.7B is accessible on HuggingFace Hub for the target environment.
2. Verify TRL GRPOTrainer API matches the rollout_func / reward_funcs signatures assumed here.

---

## Handoff Notes

**For the implementing AI agent:**

```
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Key decisions:
  - HF generate (not vLLM) for inference
  - Model name is a config parameter (default Qwen3-1.7B)
  - Start with easy+medium questions only
  - Follow TRL GRPOTrainer Wordle tutorial pattern
  - reward_funcs are separate callables
```
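
To illustrate the "reward_funcs are separate callables" decision, here is a minimal sketch of two independent reward callables. It assumes the TRL convention that a reward function receives `completions` plus extra keyword arguments and returns one float per completion, and that completions arrive as plain strings. Both functions and the 200-character threshold are hypothetical examples, not the rewards defined in `training/`.

```python
def non_empty_reward(completions: list[str], **kwargs) -> list[float]:
    """Score 1.0 for completions containing any non-whitespace text."""
    return [1.0 if text.strip() else 0.0 for text in completions]


def brevity_reward(completions: list[str], **kwargs) -> list[float]:
    """Score 1.0 for completions of at most 200 characters (assumed limit)."""
    return [1.0 if len(text) <= 200 else 0.0 for text in completions]


# Each callable is passed separately, so rewards can be tested and
# logged independently, e.g. reward_funcs=[non_empty_reward, brevity_reward].
```

Keeping each signal in its own callable means a regression in one reward can be unit-tested in isolation. The exact signature should still be confirmed against the installed TRL version, as the Human Checkpoint questions request.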

---

*Specification completed: 2026-03-27*
*Approved by: [pending]*
*Verification spec: VERIFICATION_SPEC.md*
*Target agent: Claude Code*