# Research Summary

**Project:** SQLEnv
**Change:** F003 - Dense Reward System (3-layer reward architecture)
**Date:** 2026-03-27
**Status:** Draft

---

## 1. Change Overview

### What We're Changing
Implement the 3-layer reward architecture in `server/reward.py`:
- **Layer 1 (Operational):** exec_ok +0.02, new_info +0.01 (capped 0.10), repeat -0.01, step_cost -0.005
- **Layer 2 (Progress):** Weighted average of cardinality matching + value overlap + numeric range proximity, binned to 5 levels, improvement-only, ×0.15
- **Layer 3 (Terminal):** +1.0 correct, 0.0 incorrect/timeout

Wire into `step()` so non-terminal steps return meaningful reward signals.
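
The Layer 1 magnitudes listed above can be sketched as a small pure-Python function. This is a minimal illustration under stated assumptions, not the contents of `server/reward.py`: the `state` dict, its keys, the hashing scheme, the rule that any first-time successful action counts as new information, and the choice to withhold `exec_ok` on a repeated action are all hypothetical.

```python
import hashlib

# Magnitudes from the Layer 1 description.
EXEC_OK = 0.02
NEW_INFO = 0.01
NEW_INFO_CAP = 0.10
REPEAT_PENALTY = -0.01
STEP_COST = -0.005


def layer1_operational(state, action_type, argument, error):
    """Return the Layer 1 reward for one step, updating `state` in place.

    `state` is a hypothetical dict with two keys:
    "seen_hashes" (set[str]) and "new_info_total" (float).
    """
    reward = STEP_COST  # every step pays the step cost
    # Crude normalization (assumption): trim and lowercase before hashing.
    normalized = f"{action_type}:{argument.strip().lower()}"
    key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    if key in state["seen_hashes"]:
        # Assumption: a repeat earns neither exec_ok nor new_info.
        return reward + REPEAT_PENALTY
    state["seen_hashes"].add(key)

    if error is not None:
        return reward  # failed action: step cost only

    reward += EXEC_OK
    # Assumption: any first-time successful action counts as new information,
    # with the episode total capped at NEW_INFO_CAP.
    bonus = max(0.0, min(NEW_INFO, NEW_INFO_CAP - state["new_info_total"]))
    state["new_info_total"] += bonus
    return reward + bonus
```

Under these assumptions a first `DESCRIBE` of a table yields -0.005 + 0.02 + 0.01 = 0.025, and describing the same table again yields -0.005 - 0.01 = -0.015.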

### Why We're Changing It
Currently all non-terminal steps return `reward=None`, so agents get no learning signal until ANSWER. Dense per-step rewards are intended to help GRPO training converge.

### Success Criteria
- Reward varies meaningfully: random exploration ~0.1, targeted queries ~0.3, correct answer ~1.3
- Anti-gaming: can't farm rewards by describing everything or repeating queries
- Progress signal coarsened (5 bins) to prevent reward hill-climbing
- Total step rewards clamped to [-0.2, +0.5]
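
One way to enforce the clamp on total step rewards is to clamp the running sum and emit only the part of each step reward that still fits. The sketch below assumes that interpretation; the function name and return shape are hypothetical.

```python
STEP_REWARD_MIN = -0.2
STEP_REWARD_MAX = 0.5


def clamp_step_reward(cumulative, raw_step):
    """Clamp the running total of step rewards to [-0.2, +0.5].

    Returns (emitted_reward, new_cumulative); the emitted reward is the part
    of `raw_step` that still fits inside the bounds.
    """
    new_total = min(STEP_REWARD_MAX, max(STEP_REWARD_MIN, cumulative + raw_step))
    return new_total - cumulative, new_total
```

With this scheme an episode's total reward is bounded above by 1.0 + 0.5 = 1.5, which leaves room for the ~1.3 target for a correct answer.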

---

## 2. System Context

### Current Behavior
- `server/reward.py` is a docstring-only stub, so all reward logic needs to be built from scratch
- `step()` returns `reward=None` for DESCRIBE/SAMPLE/QUERY actions
- `_handle_answer()` returns 1.0 or 0.0, which is the only reward signal
- `EpisodeContext` tracks `described_tables` (set) and `action_log` (list) but no reward accumulators

### Architecture Context
```
step(action)
  ├── DESCRIBE → _handle_describe() → result string
  ├── SAMPLE   → _handle_sample()   → result string
  ├── QUERY    → _handle_query()    → result string
  └── ANSWER   → _handle_answer()   → (bool, reward)

  After action execution (NEW):
  reward.compute_step_reward(episode_ctx, action_type, query_rows, error)
    ├── Layer 1: operational signals
    ├── Layer 2: progress-to-target (QUERY only)
    └── clamp to [-0.2, 0.5] running total
```

### Entry Points

| Entry Point | Trigger | Current Flow |
|-------------|---------|--------------|
| `step()` | Every agent action | Action dispatch → observation (reward=None) |
| `compute_step_reward()` | **To be created**; called from `step()` | Per-step reward from layers 1+2 |

### Data Flow

| Data | Source | Shape/Type | Destination |
|------|--------|------------|-------------|
| Action type + result | `step()` dispatch | `str`, `list[tuple]` | Layer 1 |
| Query result rows | `_execute_sql()` | `list[tuple]` | Layer 2 progress |
| Gold result rows | `_execute_gold_sql()` at reset | `list[tuple]` | Layer 2 reference; **must store in EpisodeContext** |
| Described tables | `EpisodeContext.described_tables` | `set[str]` | Layer 1 new_info |
| Query hashes | **Need to add** to EpisodeContext | `set[str]` | Layer 1 repeat detection |
| Best progress | **Need to add** to EpisodeContext | `float` | Layer 2 improvement tracking |
| Cumulative reward | **Need to add** to EpisodeContext | `float` | Clamping |

**Critical gap:** `EpisodeContext` stores `gold_answer` as formatted string only. Layer 2 needs raw `list[tuple]` gold rows. Must add `gold_rows: list[tuple]` field and populate at `reset()`.
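
The additions from the data-flow table could look like the sketch below. It assumes `EpisodeContext` is a dataclass, which this summary does not confirm, and it omits any existing fields not mentioned here. Only `gold_rows` is named in the text; `query_hashes`, `best_progress`, and `cumulative_step_reward` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeContext:
    # Existing fields named in this summary (other real fields omitted).
    gold_answer: str = ""
    described_tables: set[str] = field(default_factory=set)
    action_log: list[str] = field(default_factory=list)

    # Proposed additions for reward tracking.
    gold_rows: list[tuple] = field(default_factory=list)  # raw gold rows, set at reset()
    query_hashes: set[str] = field(default_factory=set)   # hypothetical name: repeat detection
    best_progress: float = 0.0                            # hypothetical name: best binned score
    cumulative_step_reward: float = 0.0                   # hypothetical name: used for clamping
```

Using `default_factory` gives every episode its own containers, so reward state cannot leak between episodes.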

---

## 3. Dependencies

### Code We Depend On

| Dependency | What We Use | Risk if Changed |
|------------|-------------|-----------------|
| `models.py:EpisodeContext` | Episode state; needs new fields | Must add reward tracking fields |
| `sql_environment.py:_execute_sql()` | Returns `list[tuple]` for QUERY | Need raw rows passed to reward |
| `sql_environment.py:_execute_gold_sql()` | Returns `list[tuple]` at reset | Already returns raw rows; just store them |
| F002 (verifier.py) | Terminal correctness | Being built in parallel; Layer 3 can use naive check initially |

### Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
|-----------|-----------------|---------------------|
| `sql_environment.py:step()` | Calls `compute_step_reward()` | Must integrate into step flow |
| F006 (GRPO Training) | `reward_funcs` for TRL trainer | Components exposed as separate functions |
| `tests/test_smoke.py` | Asserts `reward=None` for non-ANSWER | **Will break**; tests need updating |

---

## 4. Risks & Edge Cases

### Identified Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Reward hacking via progress signal | Medium | Agent exploits shaping | Coarsen to 5 bins, cap step rewards, small magnitudes |
| Test breakage | High | 25 existing tests | Update test assertions for non-None rewards |
| Gold rows unavailable | Low | Layer 2 can't compute | Fallback: Layer 1 only |

### Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
|-----------|------------------|-------------------|
| QUERY returns empty result | reward=None | Layer 1: exec_ok (+0.02), Layer 2: cardinality=0 |
| QUERY fails with SQL error | reward=None, error set | Layer 1: step_cost only (-0.005) |
| DESCRIBE same table twice | reward=None | repeat penalty (-0.01), no new_info |
| Gold answer is empty | reward=None | Skip Layer 2, Layer 1 only |
| Budget exhausted without ANSWER | reward=0.0 | Terminal: 0.0 + clamped step rewards |

### Invariants to Preserve

- [ ] Terminal correctness always dominates: correct answer ≥ 1.0
- [ ] Step rewards clamped to [-0.2, +0.5] total
- [ ] Reward is deterministic given same episode state

---

## 4b. Code Shape & Design Target

### Existing Vocabulary

| Concept | Existing Name | Location |
|---------|---------------|----------|
| Episode state | `EpisodeContext` | `models.py:135` |
| Described tables | `described_tables: set[str]` | `models.py:143` |
| Action log | `action_log: list[str]` | `models.py:144` |

### Target Shape

| Component | Purpose | Why This Boundary |
|-----------|---------|-------------------|
| `compute_step_reward(ctx, action_type, rows, error)` | Main entry | Single public entry for step() |
| `_layer1_operational(ctx, action_type, sql, rows, error)` | Operational signals | Stateless except episode tracking |
| `_layer2_progress(ctx, rows)` | Progress-to-target (QUERY only) | Needs gold_rows comparison |
| `_cardinality_score(pred_rows, gold_rows)` | Row count comparison | Tier 1 metric |
| `_value_overlap_score(pred_rows, gold_rows)` | Jaccard set overlap | Tier 1 metric |
| `_numeric_range_score(pred_rows, gold_rows)` | Log-distance for numbers | Tier 1 metric |
| `_bin_progress(raw_score)` | Bin to {0, 0.25, 0.5, 0.75, 1.0} | Anti-gaming |
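
The three Tier 1 metrics in the table can be sketched in pure Python as follows. The exact formulas are not given in this summary, so each one here is an assumption: a min/max ratio for cardinality, Jaccard overlap over cell values compared as strings, and a `log1p`-damped distance to the nearest predicted number for numeric range. Names drop the leading underscore only to mark them as illustrative.

```python
import math


def cardinality_score(pred_rows, gold_rows):
    """Ratio of the smaller row count to the larger (1.0 when equal)."""
    p, g = len(pred_rows), len(gold_rows)
    return 1.0 if max(p, g) == 0 else min(p, g) / max(p, g)


def value_overlap_score(pred_rows, gold_rows):
    """Jaccard overlap between the sets of cell values, compared as strings."""
    pred = {str(v) for row in pred_rows for v in row}
    gold = {str(v) for row in gold_rows for v in row}
    union = pred | gold
    return len(pred & gold) / len(union) if union else 1.0


def _numbers(rows):
    """Collect int/float cells, skipping bools (which are ints in Python)."""
    return [float(v) for row in rows for v in row
            if isinstance(v, (int, float)) and not isinstance(v, bool)]


def numeric_range_score(pred_rows, gold_rows):
    """Mean closeness of each gold number to its nearest predicted number."""
    pred, gold = _numbers(pred_rows), _numbers(gold_rows)
    if not pred or not gold:
        return 0.0  # assumption: no numbers on either side means no signal
    total = 0.0
    for g in gold:
        nearest = min(abs(p - g) for p in pred)
        total += 1.0 / (1.0 + math.log1p(nearest))  # 1.0 at distance 0
    return total / len(gold)
```

All three return values in [0, 1] and use only the standard library, in line with the pure-Python constraint. Comparing values as strings treats `1` and `1.0` as different, which a real implementation may want to normalize.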

### Abstraction Level

- **Current level:** Flat (server modules with plain functions)
- **Recommendation:** Match flat style. `server/reward.py` with plain functions.

### Anti-Patterns to Avoid

- Don't create reward strategy classes
- Don't add row-wise best match initially (add if training shows need)
- Don't import numpy/scipy; use pure Python
- Don't re-execute gold SQL per step; cache at reset()

---

## 5. Constraints

### Technical Constraints

| Constraint | Requirement | Notes |
|------------|-------------|-------|
| Performance | < 5ms per reward computation | ~15 calls per episode |
| No heavy deps | Pure Python | No numpy/scipy |
| Deterministic | Same inputs → same reward | Required for reproducible training |

### Testing Constraints

| Test Suite | Coverage Area | Notes |
|------------|---------------|-------|
| `tests/test_smoke.py` | 25 tests, some assert `reward=None` | Must update for non-None step rewards |

---

## 6. Open Questions

| Question | Why It Matters | Who Can Answer |
|----------|----------------|----------------|
| Layer 2 combination: weighted average (0.25/0.50/0.25) or adaptive? | Affects reward quality | Default: Method 1 per reward_design.md |
| Store `gold_rows` in EpisodeContext or separate cache? | Design coupling | Recommend EpisodeContext field |
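
For the fixed-weight option in the first question, the combination step might look like the sketch below. It takes the three metric scores as plain floats. Several points are assumptions rather than settled design: the weight order (cardinality, value overlap, numeric range), flooring rather than rounding into the five bins, and paying out only the gain over the previous best rather than the full binned score.

```python
import math

WEIGHTS = (0.25, 0.50, 0.25)  # assumed order: cardinality, value overlap, numeric
PROGRESS_SCALE = 0.15


def bin_progress(raw_score):
    """Floor a score in [0, 1] to one of {0, 0.25, 0.5, 0.75, 1.0}."""
    clipped = min(1.0, max(0.0, raw_score))
    return math.floor(clipped * 4) / 4


def layer2_reward(best_progress, cardinality, overlap, numeric):
    """Return (reward, new_best_progress) under the fixed-weight option."""
    raw = sum(w * s for w, s in zip(WEIGHTS, (cardinality, overlap, numeric)))
    binned = bin_progress(raw)
    if binned <= best_progress:
        return 0.0, best_progress  # improvement-only: no reward, best unchanged
    # Assumption: reward the gain over the previous best, scaled by 0.15.
    return (binned - best_progress) * PROGRESS_SCALE, binned
```

Because `best_progress` only increases and tops out at 1.0, the Layer 2 rewards in one episode sum to at most 0.15 under this scheme.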

---

## 7. Context Sources

| Source | Type | Notes |
|--------|------|-------|
| `server/reward.py` | Code (stub) | Docstring describes 3-layer architecture |
| `server/sql_environment.py` | Code | step() flow, _execute_sql() |
| `models.py:EpisodeContext` | Code | Needs new reward-tracking fields |
| `docs_draft/SQLEnv_Concept_v1.md` Section 3 | Doc | Complete reward spec |
| `docs_draft/reward_design.md` | Doc | Distance metrics, combination methods |
| `docs_draft/reward-research_gpt-5-2.md` | Doc | Reward research |