| # Research Summary |
|
|
| **Project:** SQLEnv |
| **Change:** F003 β Dense Reward System (3-layer reward architecture) |
| **Date:** 2026-03-27 |
| **Status:** Draft |
|
|
| --- |
|
|
| ## 1. Change Overview |
|
|
| ### What We're Changing |
| Implement the 3-layer reward architecture in `server/reward.py`: |
| - **Layer 1 (Operational):** exec_ok +0.02, new_info +0.01 (capped 0.10), repeat -0.01, step_cost -0.005 |
| - **Layer 2 (Progress):** Weighted average of cardinality matching + value overlap + numeric range proximity, binned to 5 levels, improvement-only, Γ0.15 |
| - **Layer 3 (Terminal):** +1.0 correct, 0.0 incorrect/timeout |
| |
| Wire into `step()` so non-terminal steps return meaningful reward signals. |
| |
| ### Why We're Changing It |
| Currently all non-terminal steps return `reward=None`. Agents get no learning signal until ANSWER. Dense rewards make GRPO training converge. |
| |
| ### Success Criteria |
| - Reward varies meaningfully: random exploration ~0.1, targeted queries ~0.3, correct answer ~1.3 |
| - Anti-gaming: can't farm rewards by describing everything or repeating queries |
| - Progress signal coarsened (5 bins) to prevent reward hill-climbing |
| - Total step rewards clamped to [-0.2, +0.5] |
| |
| --- |
| |
| ## 2. System Context |
| |
| ### Current Behavior |
| - `server/reward.py` is a docstring-only stub β all reward logic needs to be built from scratch |
| - `step()` returns `reward=None` for DESCRIBE/SAMPLE/QUERY actions |
| - `_handle_answer()` returns 1.0 or 0.0 β the only reward signal |
| - `EpisodeContext` tracks `described_tables` (set) and `action_log` (list) but no reward accumulators |
| |
| ### Architecture Context |
| ``` |
| step(action) |
| βββ DESCRIBE β _handle_describe() β result string |
| βββ SAMPLE β _handle_sample() β result string |
| βββ QUERY β _handle_query() β result string |
| βββ ANSWER β _handle_answer() β (bool, reward) |
| |
| After action execution (NEW): |
| reward.compute_step_reward(episode_ctx, action_type, query_rows, error) |
| βββ Layer 1: operational signals |
| βββ Layer 2: progress-to-target (QUERY only) |
| βββ clamp to [-0.2, 0.5] running total |
| ``` |
| |
| ### Entry Points |
|
|
| | Entry Point | Trigger | Current Flow | |
| |-------------|---------|--------------| |
| | `step()` | Every agent action | Action dispatch β observation (reward=None) | |
| | `compute_step_reward()` | **To be created** β called from `step()` | Per-step reward from layers 1+2 | |
|
|
| ### Data Flow |
|
|
| | Data | Source | Shape/Type | Destination | |
| |------|--------|------------|-------------| |
| | Action type + result | `step()` dispatch | `str`, `list[tuple]` | Layer 1 | |
| | Query result rows | `_execute_sql()` | `list[tuple]` | Layer 2 progress | |
| | Gold result rows | `_execute_gold_sql()` at reset | `list[tuple]` | Layer 2 reference β **must store in EpisodeContext** | |
| | Described tables | `EpisodeContext.described_tables` | `set[str]` | Layer 1 new_info | |
| | Query hashes | **Need to add** to EpisodeContext | `set[str]` | Layer 1 repeat detection | |
| | Best progress | **Need to add** to EpisodeContext | `float` | Layer 2 improvement tracking | |
| | Cumulative reward | **Need to add** to EpisodeContext | `float` | Clamping | |
| |
| **Critical gap:** `EpisodeContext` stores `gold_answer` as formatted string only. Layer 2 needs raw `list[tuple]` gold rows. Must add `gold_rows: list[tuple]` field and populate at `reset()`. |
| |
| --- |
| |
| ## 3. Dependencies |
| |
| ### Code We Depend On |
| |
| | Dependency | What We Use | Risk if Changed | |
| |------------|-------------|-----------------| |
| | `models.py:EpisodeContext` | Episode state β needs new fields | Must add reward tracking fields | |
| | `sql_environment.py:_execute_sql()` | Returns `list[tuple]` for QUERY | Need raw rows passed to reward | |
| | `sql_environment.py:_execute_gold_sql()` | Returns `list[tuple]` at reset | Already returns raw rows β just store them | |
| | F002 (verifier.py) | Terminal correctness | Being built in parallel β Layer 3 can use naive check initially | |
|
|
| ### Code That Depends On Us |
|
|
| | Dependent | How They Use Us | Impact of Our Change | |
| |-----------|-----------------|---------------------| |
| | `sql_environment.py:step()` | Calls `compute_step_reward()` | Must integrate into step flow | |
| | F006 (GRPO Training) | `reward_funcs` for TRL trainer | Components exposed as separate functions | |
| | `tests/test_smoke.py` | Asserts `reward=None` for non-ANSWER | **Will break** β tests need updating | |
|
|
| --- |
|
|
| ## 4. Risks & Edge Cases |
|
|
| ### Identified Risks |
|
|
| | Risk | Likelihood | Impact | Mitigation | |
| |------|------------|--------|------------| |
| | Reward hacking via progress signal | Medium | Agent exploits shaping | Coarsen to 5 bins, cap step rewards, small magnitudes | |
| | Test breakage | High | 25 existing tests | Update test assertions for non-None rewards | |
| | Gold rows unavailable | Low | Layer 2 can't compute | Fallback: Layer 1 only | |
|
|
| ### Edge Cases to Handle |
|
|
| | Edge Case | Current Behavior | Required Behavior | |
| |-----------|------------------|-------------------| |
| | QUERY returns empty result | reward=None | Layer 1: exec_ok (+0.02), Layer 2: cardinality=0 | |
| | QUERY fails with SQL error | reward=None, error set | Layer 1: step_cost only (-0.005) | |
| | DESCRIBE same table twice | reward=None | repeat penalty (-0.01), no new_info | |
| | Gold answer is empty | reward=None | Skip Layer 2, Layer 1 only | |
| | Budget exhausted without ANSWER | reward=0.0 | Terminal: 0.0 + clamped step rewards | |
| |
| ### Invariants to Preserve |
| |
| - [ ] Terminal correctness always dominates β correct answer β₯ 1.0 |
| - [ ] Step rewards clamped to [-0.2, +0.5] total |
| - [ ] Reward is deterministic given same episode state |
| |
| --- |
| |
| ## 4b. Code Shape & Design Target |
| |
| ### Existing Vocabulary |
| |
| | Concept | Existing Name | Location | |
| |---------|---------------|----------| |
| | Episode state | `EpisodeContext` | `models.py:135` | |
| | Described tables | `described_tables: set[str]` | `models.py:143` | |
| | Action log | `action_log: list[str]` | `models.py:144` | |
|
|
| ### Target Shape |
|
|
| | Component | Purpose | Why This Boundary | |
| |-----------|---------|-------------------| |
| | `compute_step_reward(ctx, action_type, rows, error)` | Main entry | Single public entry for step() | |
| | `_layer1_operational(ctx, action_type, sql, rows, error)` | Operational signals | Stateless except episode tracking | |
| | `_layer2_progress(ctx, rows)` | Progress-to-target (QUERY only) | Needs gold_rows comparison | |
| | `_cardinality_score(pred_rows, gold_rows)` | Row count comparison | Tier 1 metric | |
| | `_value_overlap_score(pred_rows, gold_rows)` | Jaccard set overlap | Tier 1 metric | |
| | `_numeric_range_score(pred_rows, gold_rows)` | Log-distance for numbers | Tier 1 metric | |
| | `_bin_progress(raw_score)` | Bin to {0, 0.25, 0.5, 0.75, 1.0} | Anti-gaming | |
|
|
| ### Abstraction Level |
|
|
| - **Current level:** Flat β server modules with plain functions |
| - **Recommendation:** Match flat style. `server/reward.py` with plain functions. |
|
|
| ### Anti-Patterns to Avoid |
|
|
| - Don't create reward strategy classes |
| - Don't add row-wise best match initially (add if training shows need) |
| - Don't import numpy/scipy β pure Python |
| - Don't re-execute gold SQL per step β cache at reset() |
|
|
| --- |
|
|
| ## 5. Constraints |
|
|
| ### Technical Constraints |
|
|
| | Constraint | Requirement | Notes | |
| |------------|-------------|-------| |
| | Performance | < 5ms per reward computation | ~15 calls per episode | |
| | No heavy deps | Pure Python | No numpy/scipy | |
| | Deterministic | Same inputs β same reward | Required for reproducible training | |
|
|
| ### Testing Constraints |
|
|
| | Test Suite | Coverage Area | Notes | |
| |------------|---------------|-------| |
| | `tests/test_smoke.py` | 25 tests, some assert `reward=None` | Must update for non-None step rewards | |
|
|
| --- |
|
|
| ## 6. Open Questions |
|
|
| | Question | Why It Matters | Who Can Answer | |
| |----------|----------------|----------------| |
| | Layer 2 combination: weighted average (0.25/0.50/0.25) or adaptive? | Affects reward quality | Default: Method 1 per reward_design.md | |
| | Store `gold_rows` in EpisodeContext or separate cache? | Design coupling | Recommend EpisodeContext field | |
|
|
| --- |
|
|
| ## 7. Context Sources |
|
|
| | Source | Type | Notes | |
| |--------|------|-------| |
| | `server/reward.py` | Code (stub) | Docstring describes 3-layer architecture | |
| | `server/sql_environment.py` | Code | step() flow, _execute_sql() | |
| | `models.py:EpisodeContext` | Code | Needs new reward-tracking fields | |
| | `docs_draft/SQLEnv_Concept_v1.md` Section 3 | Doc | Complete reward spec | |
| | `docs_draft/reward_design.md` | Doc | Distance metrics, combination methods | |
| | `docs_draft/reward-research_gpt-5-2.md` | Doc | Reward research | |
|
|