# Research Summary
**Project:** SQLEnv
**Change:** F003 - Dense Reward System (3-layer reward architecture)
**Date:** 2026-03-27
**Status:** Draft
---
## 1. Change Overview
### What We're Changing
Implement the 3-layer reward architecture in `server/reward.py`:
- **Layer 1 (Operational):** exec_ok +0.02, new_info +0.01 (capped 0.10), repeat -0.01, step_cost -0.005
- **Layer 2 (Progress):** Weighted average of cardinality matching + value overlap + numeric range proximity, binned to 5 levels, improvement-only, ×0.15
- **Layer 3 (Terminal):** +1.0 correct, 0.0 incorrect/timeout
Wire into `step()` so non-terminal steps return meaningful reward signals.
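The Layer 1 arithmetic can be sketched as follows. This is a minimal illustration, not the planned implementation: the function name `layer1_reward` and its boolean parameters are hypothetical. It assumes that `step_cost` applies to every step, that a repeated action earns only the repeat penalty on top of the step cost, and that the 0.10 cap applies to the cumulative `new_info` bonus.

```python
# Magnitudes taken from the Layer 1 bullet above.
EXEC_OK = 0.02
NEW_INFO = 0.01
NEW_INFO_CAP = 0.10
REPEAT = -0.01
STEP_COST = -0.005


def layer1_reward(executed_ok: bool, new_info: bool, repeat: bool,
                  new_info_total: float) -> tuple[float, float]:
    """Return (step reward, updated cumulative new_info bonus).

    Assumptions (not confirmed by the spec): step_cost applies to every
    step, and a repeated action earns no exec_ok or new_info bonus.
    """
    reward = STEP_COST
    if repeat:
        return reward + REPEAT, new_info_total
    if executed_ok:
        reward += EXEC_OK
        if new_info:
            # Grant only what is left under the cumulative cap.
            bonus = max(0.0, min(NEW_INFO, NEW_INFO_CAP - new_info_total))
            reward += bonus
            new_info_total += bonus
    return reward, new_info_total
```

Under these assumptions, a first successful DESCRIBE of a new table yields -0.005 + 0.02 + 0.01 = 0.025, while a failed QUERY yields only the step cost of -0.005.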
### Why We're Changing It
Currently, every non-terminal step returns `reward=None`, so agents get no learning signal until ANSWER. Dense per-step rewards are intended to help GRPO training converge.
### Success Criteria
- Reward varies meaningfully: random exploration ~0.1, targeted queries ~0.3, correct answer ~1.3
- Anti-gaming: can't farm rewards by describing everything or repeating queries
- Progress signal coarsened (5 bins) to prevent reward hill-climbing
- Total step rewards clamped to [-0.2, +0.5]
---
## 2. System Context
### Current Behavior
- `server/reward.py` is a docstring-only stub; all reward logic needs to be built from scratch
- `step()` returns `reward=None` for DESCRIBE/SAMPLE/QUERY actions
- `_handle_answer()` returns 1.0 or 0.0, currently the only reward signal
- `EpisodeContext` tracks `described_tables` (set) and `action_log` (list) but no reward accumulators
### Architecture Context
```
step(action)
├── DESCRIBE → _handle_describe() → result string
├── SAMPLE → _handle_sample() → result string
├── QUERY → _handle_query() → result string
└── ANSWER → _handle_answer() → (bool, reward)

After action execution (NEW):
reward.compute_step_reward(episode_ctx, action_type, query_rows, error)
├── Layer 1: operational signals
├── Layer 2: progress-to-target (QUERY only)
└── clamp to [-0.2, 0.5] running total
```
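The final clamp in the flow above applies to the running total rather than to each step. One way to do that is sketched below. The helper name `clamp_step_reward` is hypothetical, and the sketch assumes that a step pushing the total past a bound is truncated so the total lands exactly on the bound. The source does not say whether the excess is truncated or the whole step is zeroed.

```python
STEP_REWARD_MIN = -0.2
STEP_REWARD_MAX = 0.5


def clamp_step_reward(raw_reward: float, cumulative: float) -> tuple[float, float]:
    """Clamp the running total and return (emitted reward, new total).

    Assumption: the emitted reward is whatever part of raw_reward still
    fits inside [STEP_REWARD_MIN, STEP_REWARD_MAX].
    """
    new_total = max(STEP_REWARD_MIN, min(STEP_REWARD_MAX, cumulative + raw_reward))
    return new_total - cumulative, new_total


# Usage: thread the total through the episode.
total = 0.0
for raw in (0.025, 0.175, 0.4):
    emitted, total = clamp_step_reward(raw, total)
```

Returning the new total alongside the emitted reward keeps the function free of hidden state, so the caller decides where the accumulator lives.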
### Entry Points
| Entry Point | Trigger | Current Flow |
|-------------|---------|--------------|
| `step()` | Every agent action | Action dispatch → observation (reward=None) |
| `compute_step_reward()` | **To be created**; called from `step()` | Per-step reward from layers 1+2 |
### Data Flow
| Data | Source | Shape/Type | Destination |
|------|--------|------------|-------------|
| Action type + result | `step()` dispatch | `str`, `list[tuple]` | Layer 1 |
| Query result rows | `_execute_sql()` | `list[tuple]` | Layer 2 progress |
| Gold result rows | `_execute_gold_sql()` at reset | `list[tuple]` | Layer 2 reference; **must store in EpisodeContext** |
| Described tables | `EpisodeContext.described_tables` | `set[str]` | Layer 1 new_info |
| Query hashes | **Need to add** to EpisodeContext | `set[str]` | Layer 1 repeat detection |
| Best progress | **Need to add** to EpisodeContext | `float` | Layer 2 improvement tracking |
| Cumulative reward | **Need to add** to EpisodeContext | `float` | Clamping |
**Critical gap:** `EpisodeContext` stores `gold_answer` as formatted string only. Layer 2 needs raw `list[tuple]` gold rows. Must add `gold_rows: list[tuple]` field and populate at `reset()`.
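The additions listed in the Data Flow table could take roughly the following shape. `EpisodeContextSketch` is a hypothetical stand-in for `models.py:EpisodeContext`, not the real class. Only `gold_answer`, `described_tables`, `action_log`, and `gold_rows` are names taken from this document. The field names `query_hashes`, `best_progress`, and `cumulative_step_reward`, the helper `is_repeat_query`, and its whitespace-only normalization are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class EpisodeContextSketch:
    """Hypothetical stand-in for EpisodeContext (other fields omitted)."""
    # Existing fields named in this document.
    gold_answer: str = ""
    described_tables: set[str] = field(default_factory=set)
    action_log: list[str] = field(default_factory=list)
    # Proposed additions from the Data Flow table (names partly assumed).
    gold_rows: list[tuple] = field(default_factory=list)
    query_hashes: set[str] = field(default_factory=set)
    best_progress: float = 0.0
    cumulative_step_reward: float = 0.0


def is_repeat_query(ctx: EpisodeContextSketch, sql: str) -> bool:
    """Record the query and report whether it was already seen.

    Assumption: queries count as repeats when they match after
    collapsing whitespace; case is left untouched so string literals
    stay distinct.
    """
    normalized = " ".join(sql.split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    seen = digest in ctx.query_hashes
    ctx.query_hashes.add(digest)
    return seen
```

Populating `gold_rows` once at `reset()` means Layer 2 never has to re-run the gold SQL during an episode.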
---
## 3. Dependencies
### Code We Depend On
| Dependency | What We Use | Risk if Changed |
|------------|-------------|-----------------|
| `models.py:EpisodeContext` | Episode state β needs new fields | Must add reward tracking fields |
| `sql_environment.py:_execute_sql()` | Returns `list[tuple]` for QUERY | Need raw rows passed to reward |
| `sql_environment.py:_execute_gold_sql()` | Returns `list[tuple]` at reset | Already returns raw rows; just store them |
| F002 (verifier.py) | Terminal correctness | Being built in parallel; Layer 3 can use naive check initially |
### Code That Depends On Us
| Dependent | How They Use Us | Impact of Our Change |
|-----------|-----------------|---------------------|
| `sql_environment.py:step()` | Calls `compute_step_reward()` | Must integrate into step flow |
| F006 (GRPO Training) | `reward_funcs` for TRL trainer | Components exposed as separate functions |
| `tests/test_smoke.py` | Asserts `reward=None` for non-ANSWER | **Will break**; tests need updating |
---
## 4. Risks & Edge Cases
### Identified Risks
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Reward hacking via progress signal | Medium | Agent exploits shaping | Coarsen to 5 bins, cap step rewards, small magnitudes |
| Test breakage | High | Existing suite of 25 tests; those asserting `reward=None` fail | Update test assertions for non-None rewards |
| Gold rows unavailable | Low | Layer 2 can't compute | Fallback: Layer 1 only |
### Edge Cases to Handle
| Edge Case | Current Behavior | Required Behavior |
|-----------|------------------|-------------------|
| QUERY returns empty result | reward=None | Layer 1: exec_ok (+0.02), Layer 2: cardinality=0 |
| QUERY fails with SQL error | reward=None, error set | Layer 1: step_cost only (-0.005) |
| DESCRIBE same table twice | reward=None | repeat penalty (-0.01), no new_info |
| Gold answer is empty | reward=None | Skip Layer 2, Layer 1 only |
| Budget exhausted without ANSWER | reward=0.0 | Terminal: 0.0 + clamped step rewards |
### Invariants to Preserve
- [ ] Terminal correctness always dominates: correct answer ≥ 1.0
- [ ] Step rewards clamped to [-0.2, +0.5] total
- [ ] Reward is deterministic given same episode state
---
## 4b. Code Shape & Design Target
### Existing Vocabulary
| Concept | Existing Name | Location |
|---------|---------------|----------|
| Episode state | `EpisodeContext` | `models.py:135` |
| Described tables | `described_tables: set[str]` | `models.py:143` |
| Action log | `action_log: list[str]` | `models.py:144` |
### Target Shape
| Component | Purpose | Why This Boundary |
|-----------|---------|-------------------|
| `compute_step_reward(ctx, action_type, rows, error)` | Main entry | Single public entry for step() |
| `_layer1_operational(ctx, action_type, sql, rows, error)` | Operational signals | Stateless except episode tracking |
| `_layer2_progress(ctx, rows)` | Progress-to-target (QUERY only) | Needs gold_rows comparison |
| `_cardinality_score(pred_rows, gold_rows)` | Row count comparison | Tier 1 metric |
| `_value_overlap_score(pred_rows, gold_rows)` | Jaccard set overlap | Tier 1 metric |
| `_numeric_range_score(pred_rows, gold_rows)` | Log-distance for numbers | Tier 1 metric |
| `_bin_progress(raw_score)` | Bin to {0, 0.25, 0.5, 0.75, 1.0} | Anti-gaming |
### Abstraction Level
- **Current level:** Flat; server modules with plain functions
- **Recommendation:** Match flat style. `server/reward.py` with plain functions.
### Anti-Patterns to Avoid
- Don't create reward strategy classes
- Don't add row-wise best match initially (add if training shows need)
- Don't import numpy/scipy; use pure Python
- Don't re-execute gold SQL per step; cache at reset()
---
## 5. Constraints
### Technical Constraints
| Constraint | Requirement | Notes |
|------------|-------------|-------|
| Performance | < 5ms per reward computation | ~15 calls per episode |
| No heavy deps | Pure Python | No numpy/scipy |
| Deterministic | Same inputs → same reward | Required for reproducible training |
### Testing Constraints
| Test Suite | Coverage Area | Notes |
|------------|---------------|-------|
| `tests/test_smoke.py` | 25 tests, some assert `reward=None` | Must update for non-None step rewards |
---
## 6. Open Questions
| Question | Why It Matters | Who Can Answer |
|----------|----------------|----------------|
| Layer 2 combination: weighted average (0.25/0.50/0.25) or adaptive? | Affects reward quality | Default: Method 1 per reward_design.md |
| Store `gold_rows` in EpisodeContext or separate cache? | Design coupling | Recommend EpisodeContext field |
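For the default named in the first question, a fixed weighted average, the Layer 2 combination might look like the sketch below. The function name `layer2_reward` is hypothetical. The sketch assumes the weights map to cardinality, value overlap, and numeric range in that order, that binning rounds to the nearest of five levels, and that "improvement-only" means paying 0.15 times the gain over the best binned progress so far.

```python
import math

# Assumed order: cardinality, value overlap, numeric range.
WEIGHTS = (0.25, 0.50, 0.25)
PROGRESS_SCALE = 0.15


def layer2_reward(scores, best_progress):
    """Return (reward, updated best_progress) for one QUERY step.

    scores: (cardinality, value_overlap, numeric_range), each in [0, 1].
    """
    # Weights sum to 1, so the weighted sum is the weighted average.
    raw = sum(w * s for w, s in zip(WEIGHTS, scores))
    # Snap to the nearest of {0, 0.25, 0.5, 0.75, 1.0}.
    binned = math.floor(min(1.0, max(0.0, raw)) * 4 + 0.5) / 4
    if binned <= best_progress:
        return 0.0, best_progress  # No improvement, no reward.
    return (binned - best_progress) * PROGRESS_SCALE, binned
```

With delta-based payment, the Layer 2 reward summed over an episode cannot exceed 0.15, which keeps it well below the terminal reward of 1.0.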
---
## 7. Context Sources
| Source | Type | Notes |
|--------|------|-------|
| `server/reward.py` | Code (stub) | Docstring describes 3-layer architecture |
| `server/sql_environment.py` | Code | step() flow, _execute_sql() |
| `models.py:EpisodeContext` | Code | Needs new reward-tracking fields |
| `docs_draft/SQLEnv_Concept_v1.md` Section 3 | Doc | Complete reward spec |
| `docs_draft/reward_design.md` | Doc | Distance metrics, combination methods |
| `docs_draft/reward-research_gpt-5-2.md` | Doc | Reward research |