sql_env / specs /F003-RESEARCH_SUMMARY.md
hjerpe's picture
Upload folder using huggingface_hub
5dd1bb4 verified
|
Raw
History Blame Contribute Delete
8.49 kB
# Research Summary
**Project:** SQLEnv
**Change:** F003 β€” Dense Reward System (3-layer reward architecture)
**Date:** 2026-03-27
**Status:** Draft
---
## 1. Change Overview
### What We're Changing
Implement the 3-layer reward architecture in `server/reward.py`:
- **Layer 1 (Operational):** exec_ok +0.02, new_info +0.01 (capped 0.10), repeat -0.01, step_cost -0.005
- **Layer 2 (Progress):** Weighted average of cardinality matching + value overlap + numeric range proximity, binned to 5 levels, improvement-only, Γ—0.15
- **Layer 3 (Terminal):** +1.0 correct, 0.0 incorrect/timeout
Wire into `step()` so non-terminal steps return meaningful reward signals.
### Why We're Changing It
Currently all non-terminal steps return `reward=None`. Agents get no learning signal until ANSWER. Dense rewards make GRPO training converge.
### Success Criteria
- Reward varies meaningfully: random exploration ~0.1, targeted queries ~0.3, correct answer ~1.3
- Anti-gaming: can't farm rewards by describing everything or repeating queries
- Progress signal coarsened (5 bins) to prevent reward hill-climbing
- Total step rewards clamped to [-0.2, +0.5]
---
## 2. System Context
### Current Behavior
- `server/reward.py` is a docstring-only stub β€” all reward logic needs to be built from scratch
- `step()` returns `reward=None` for DESCRIBE/SAMPLE/QUERY actions
- `_handle_answer()` returns 1.0 or 0.0 β€” the only reward signal
- `EpisodeContext` tracks `described_tables` (set) and `action_log` (list) but no reward accumulators
### Architecture Context
```
step(action)
β”œβ”€β”€ DESCRIBE β†’ _handle_describe() β†’ result string
β”œβ”€β”€ SAMPLE β†’ _handle_sample() β†’ result string
β”œβ”€β”€ QUERY β†’ _handle_query() β†’ result string
└── ANSWER β†’ _handle_answer() β†’ (bool, reward)
After action execution (NEW):
reward.compute_step_reward(episode_ctx, action_type, query_rows, error)
β”œβ”€β”€ Layer 1: operational signals
β”œβ”€β”€ Layer 2: progress-to-target (QUERY only)
└── clamp to [-0.2, 0.5] running total
```
### Entry Points
| Entry Point | Trigger | Current Flow |
|-------------|---------|--------------|
| `step()` | Every agent action | Action dispatch β†’ observation (reward=None) |
| `compute_step_reward()` | **To be created** β€” called from `step()` | Per-step reward from layers 1+2 |
### Data Flow
| Data | Source | Shape/Type | Destination |
|------|--------|------------|-------------|
| Action type + result | `step()` dispatch | `str`, `list[tuple]` | Layer 1 |
| Query result rows | `_execute_sql()` | `list[tuple]` | Layer 2 progress |
| Gold result rows | `_execute_gold_sql()` at reset | `list[tuple]` | Layer 2 reference β€” **must store in EpisodeContext** |
| Described tables | `EpisodeContext.described_tables` | `set[str]` | Layer 1 new_info |
| Query hashes | **Need to add** to EpisodeContext | `set[str]` | Layer 1 repeat detection |
| Best progress | **Need to add** to EpisodeContext | `float` | Layer 2 improvement tracking |
| Cumulative reward | **Need to add** to EpisodeContext | `float` | Clamping |
**Critical gap:** `EpisodeContext` stores `gold_answer` as formatted string only. Layer 2 needs raw `list[tuple]` gold rows. Must add `gold_rows: list[tuple]` field and populate at `reset()`.
---
## 3. Dependencies
### Code We Depend On
| Dependency | What We Use | Risk if Changed |
|------------|-------------|-----------------|
| `models.py:EpisodeContext` | Episode state β€” needs new fields | Must add reward tracking fields |
| `sql_environment.py:_execute_sql()` | Returns `list[tuple]` for QUERY | Need raw rows passed to reward |
| `sql_environment.py:_execute_gold_sql()` | Returns `list[tuple]` at reset | Already returns raw rows β€” just store them |
| F002 (verifier.py) | Terminal correctness | Being built in parallel β€” Layer 3 can use naive check initially |
### Code That Depends On Us
| Dependent | How They Use Us | Impact of Our Change |
|-----------|-----------------|---------------------|
| `sql_environment.py:step()` | Calls `compute_step_reward()` | Must integrate into step flow |
| F006 (GRPO Training) | `reward_funcs` for TRL trainer | Components exposed as separate functions |
| `tests/test_smoke.py` | Asserts `reward=None` for non-ANSWER | **Will break** β€” tests need updating |
---
## 4. Risks & Edge Cases
### Identified Risks
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Reward hacking via progress signal | Medium | Agent exploits shaping | Coarsen to 5 bins, cap step rewards, small magnitudes |
| Test breakage | High | 25 existing tests | Update test assertions for non-None rewards |
| Gold rows unavailable | Low | Layer 2 can't compute | Fallback: Layer 1 only |
### Edge Cases to Handle
| Edge Case | Current Behavior | Required Behavior |
|-----------|------------------|-------------------|
| QUERY returns empty result | reward=None | Layer 1: exec_ok (+0.02), Layer 2: cardinality=0 |
| QUERY fails with SQL error | reward=None, error set | Layer 1: step_cost only (-0.005) |
| DESCRIBE same table twice | reward=None | repeat penalty (-0.01), no new_info |
| Gold answer is empty | reward=None | Skip Layer 2, Layer 1 only |
| Budget exhausted without ANSWER | reward=0.0 | Terminal: 0.0 + clamped step rewards |
### Invariants to Preserve
- [ ] Terminal correctness always dominates β€” correct answer β‰₯ 1.0
- [ ] Step rewards clamped to [-0.2, +0.5] total
- [ ] Reward is deterministic given same episode state
---
## 4b. Code Shape & Design Target
### Existing Vocabulary
| Concept | Existing Name | Location |
|---------|---------------|----------|
| Episode state | `EpisodeContext` | `models.py:135` |
| Described tables | `described_tables: set[str]` | `models.py:143` |
| Action log | `action_log: list[str]` | `models.py:144` |
### Target Shape
| Component | Purpose | Why This Boundary |
|-----------|---------|-------------------|
| `compute_step_reward(ctx, action_type, rows, error)` | Main entry | Single public entry for step() |
| `_layer1_operational(ctx, action_type, sql, rows, error)` | Operational signals | Stateless except episode tracking |
| `_layer2_progress(ctx, rows)` | Progress-to-target (QUERY only) | Needs gold_rows comparison |
| `_cardinality_score(pred_rows, gold_rows)` | Row count comparison | Tier 1 metric |
| `_value_overlap_score(pred_rows, gold_rows)` | Jaccard set overlap | Tier 1 metric |
| `_numeric_range_score(pred_rows, gold_rows)` | Log-distance for numbers | Tier 1 metric |
| `_bin_progress(raw_score)` | Bin to {0, 0.25, 0.5, 0.75, 1.0} | Anti-gaming |
### Abstraction Level
- **Current level:** Flat β€” server modules with plain functions
- **Recommendation:** Match flat style. `server/reward.py` with plain functions.
### Anti-Patterns to Avoid
- Don't create reward strategy classes
- Don't add row-wise best match initially (add if training shows need)
- Don't import numpy/scipy β€” pure Python
- Don't re-execute gold SQL per step β€” cache at reset()
---
## 5. Constraints
### Technical Constraints
| Constraint | Requirement | Notes |
|------------|-------------|-------|
| Performance | < 5ms per reward computation | ~15 calls per episode |
| No heavy deps | Pure Python | No numpy/scipy |
| Deterministic | Same inputs β†’ same reward | Required for reproducible training |
### Testing Constraints
| Test Suite | Coverage Area | Notes |
|------------|---------------|-------|
| `tests/test_smoke.py` | 25 tests, some assert `reward=None` | Must update for non-None step rewards |
---
## 6. Open Questions
| Question | Why It Matters | Who Can Answer |
|----------|----------------|----------------|
| Layer 2 combination: weighted average (0.25/0.50/0.25) or adaptive? | Affects reward quality | Default: Method 1 per reward_design.md |
| Store `gold_rows` in EpisodeContext or separate cache? | Design coupling | Recommend EpisodeContext field |
---
## 7. Context Sources
| Source | Type | Notes |
|--------|------|-------|
| `server/reward.py` | Code (stub) | Docstring describes 3-layer architecture |
| `server/sql_environment.py` | Code | step() flow, _execute_sql() |
| `models.py:EpisodeContext` | Code | Needs new reward-tracking fields |
| `docs_draft/SQLEnv_Concept_v1.md` Section 3 | Doc | Complete reward spec |
| `docs_draft/reward_design.md` | Doc | Distance metrics, combination methods |
| `docs_draft/reward-research_gpt-5-2.md` | Doc | Reward research |