	# Research Summary

	Project: SQLEnv
	Change: F003 — Dense Reward System (3-layer reward architecture)
	Date: 2026-03-27
	Status: Draft

	---

	## 1. Change Overview

	### What We're Changing
	Implement the 3-layer reward architecture in `server/reward.py`:
	- Layer 1 (Operational): exec_ok +0.02, new_info +0.01 (capped 0.10), repeat -0.01, step_cost -0.005
	- Layer 2 (Progress): Weighted average of cardinality matching + value overlap + numeric range proximity, binned to 5 levels, improvement-only, ×0.15
	- Layer 3 (Terminal): +1.0 correct, 0.0 incorrect/timeout

	Wire into `step()` so non-terminal steps return meaningful reward signals.

	### Why We're Changing It
	Currently every non-terminal step returns `reward=None`, so agents get no learning signal until ANSWER. Dense per-step rewards give GRPO training a signal to follow before the episode ends, which should help it converge.

	### Success Criteria
	- Episode reward varies meaningfully: random exploration ~0.1, targeted queries ~0.3, correct answer ~1.3 (terminal +1.0 plus ~0.3 in step rewards)
	- Anti-gaming: can't farm rewards by describing everything or repeating queries
	- Progress signal coarsened (5 bins) to prevent reward hill-climbing
	- Total step rewards clamped to [-0.2, +0.5]

	---

	## 2. System Context

	### Current Behavior
	- `server/reward.py` is a docstring-only stub — all reward logic needs to be built from scratch
	- `step()` returns `reward=None` for DESCRIBE/SAMPLE/QUERY actions
	- `_handle_answer()` returns 1.0 or 0.0 — the only reward signal
	- `EpisodeContext` tracks `described_tables` (set) and `action_log` (list) but no reward accumulators

	### Architecture Context
	```
	step(action)
	├── DESCRIBE → _handle_describe() → result string
	├── SAMPLE → _handle_sample() → result string
	├── QUERY → _handle_query() → result string
	└── ANSWER → _handle_answer() → (bool, reward)

	After action execution (NEW):
	reward.compute_step_reward(episode_ctx, action_type, query_rows, error)
	├── Layer 1: operational signals
	├── Layer 2: progress-to-target (QUERY only)
	└── clamp to [-0.2, 0.5] running total
	```
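The final clamping step in the diagram above could look like the sketch below. It assumes the clamp applies to the running total and works by truncating the step reward that is actually emitted. The summary does not specify this mechanism, and `clamp_step_reward` is a hypothetical helper name.

```python
STEP_REWARD_MIN = -0.2
STEP_REWARD_MAX = 0.5


def clamp_step_reward(cumulative: float, raw_step: float) -> tuple[float, float]:
    """Return (emitted_reward, new_cumulative) with the running total kept in bounds."""
    new_total = max(STEP_REWARD_MIN, min(STEP_REWARD_MAX, cumulative + raw_step))
    # The emitted reward is whatever part of raw_step still fits inside the bounds.
    return new_total - cumulative, new_total


# Usage: feed each raw step reward through the clamp and carry the total forward.
total = 0.0
for raw in (0.3, 0.3, -0.1):
    emitted, total = clamp_step_reward(total, raw)
```

Under this reading, once the total reaches +0.5 further positive step rewards emit 0.0, while penalties still reduce the total.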

	### Entry Points

	| Entry Point | Trigger | Current Flow |
	|-------------|---------|--------------|
	| `step()` | Every agent action | Action dispatch → observation (reward=None) |
	| `compute_step_reward()` | To be created; called from `step()` | Per-step reward from layers 1+2 |

	### Data Flow

	| Data | Source | Shape/Type | Destination |
	|------|--------|------------|-------------|
	| Action type + result | `step()` dispatch | `str`, `list[tuple]` | Layer 1 |
	| Query result rows | `_execute_sql()` | `list[tuple]` | Layer 2 progress |
	| Gold result rows | `_execute_gold_sql()` at reset | `list[tuple]` | Layer 2 reference (must be stored in EpisodeContext) |
	| Described tables | `EpisodeContext.described_tables` | `set[str]` | Layer 1 new_info |
	| Query hashes | Need to add to EpisodeContext | `set[str]` | Layer 1 repeat detection |
	| Best progress | Need to add to EpisodeContext | `float` | Layer 2 improvement tracking |
	| Cumulative reward | Need to add to EpisodeContext | `float` | Clamping |

	Critical gap: `EpisodeContext` stores `gold_answer` as formatted string only. Layer 2 needs raw `list[tuple]` gold rows. Must add `gold_rows: list[tuple]` field and populate at `reset()`.
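A minimal sketch of the proposed fields, assuming `EpisodeContext` is a dataclass (the summary does not say). Only `gold_rows`, `gold_answer`, `described_tables`, and `action_log` are named in this document. `EpisodeContextSketch`, `reset_sketch`, `query_hashes`, `best_progress`, and `cumulative_step_reward` are hypothetical names, and `sqlite3` stands in for whatever `_execute_gold_sql()` uses.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class EpisodeContextSketch:
    # Fields the summary says already exist.
    gold_answer: str = ""
    described_tables: set[str] = field(default_factory=set)
    action_log: list[str] = field(default_factory=list)
    # Proposed additions for reward tracking (names are illustrative).
    gold_rows: list[tuple] = field(default_factory=list)
    query_hashes: set[str] = field(default_factory=set)
    best_progress: float = 0.0
    cumulative_step_reward: float = 0.0


def reset_sketch(conn: sqlite3.Connection, gold_sql: str) -> EpisodeContextSketch:
    """Run the gold SQL once and cache both the raw rows and a formatted answer."""
    gold_rows = conn.execute(gold_sql).fetchall()  # list of tuples
    return EpisodeContextSketch(gold_answer=str(gold_rows), gold_rows=gold_rows)
```

Caching the rows at reset means Layer 2 can compare against them on every QUERY without re-executing the gold SQL.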

	---

	## 3. Dependencies

	### Code We Depend On

	| Dependency | What We Use | Risk if Changed |
	|------------|-------------|-----------------|
	| `models.py:EpisodeContext` | Episode state (needs new fields) | Must add reward tracking fields |
	| `sql_environment.py:_execute_sql()` | Returns `list[tuple]` for QUERY | Need raw rows passed to reward |
	| `sql_environment.py:_execute_gold_sql()` | Returns `list[tuple]` at reset | Already returns raw rows; just store them |
	| F002 (verifier.py) | Terminal correctness | Being built in parallel; Layer 3 can use a naive check initially |

	### Code That Depends On Us

	| Dependent | How They Use Us | Impact of Our Change |
	|-----------|-----------------|---------------------|
	| `sql_environment.py:step()` | Calls `compute_step_reward()` | Must integrate into step flow |
	| F006 (GRPO Training) | `reward_funcs` for TRL trainer | Components exposed as separate functions |
	| `tests/test_smoke.py` | Asserts `reward=None` for non-ANSWER | Will break; tests need updating |

	---

	## 4. Risks & Edge Cases

	### Identified Risks

	| Risk | Likelihood | Impact | Mitigation |
	|------|------------|--------|------------|
	| Reward hacking via progress signal | Medium | Agent exploits shaping | Coarsen to 5 bins, cap step rewards, small magnitudes |
	| Test breakage | High | 25 existing tests | Update test assertions for non-None rewards |
	| Gold rows unavailable | Low | Layer 2 can't compute | Fallback: Layer 1 only |

	### Edge Cases to Handle

	| Edge Case | Current Behavior | Required Behavior |
	|-----------|------------------|-------------------|
	| QUERY returns empty result | reward=None | Layer 1: exec_ok (+0.02), Layer 2: cardinality=0 |
	| QUERY fails with SQL error | reward=None, error set | Layer 1: step_cost only (-0.005) |
	| DESCRIBE same table twice | reward=None | repeat penalty (-0.01), no new_info |
	| Gold answer is empty | reward=None | Skip Layer 2, Layer 1 only |
	| Budget exhausted without ANSWER | reward=0.0 | Terminal: 0.0 + clamped step rewards |

	### Invariants to Preserve

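	- [ ] Terminal correctness always dominates: the terminal reward for a correct answer is ≥ 1.0, so with step rewards in [-0.2, +0.5] a correct episode (≥ 0.8) always outscores an incorrect one (≤ 0.5)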
	- [ ] Step rewards clamped to [-0.2, +0.5] total
	- [ ] Reward is deterministic given same episode state

	---

	## 4b. Code Shape & Design Target

	### Existing Vocabulary

	| Concept | Existing Name | Location |
	|---------|---------------|----------|
	| Episode state | `EpisodeContext` | `models.py:135` |
	| Described tables | `described_tables: set[str]` | `models.py:143` |
	| Action log | `action_log: list[str]` | `models.py:144` |

	### Target Shape

	| Component | Purpose | Why This Boundary |
	|-----------|---------|-------------------|
	| `compute_step_reward(ctx, action_type, rows, error)` | Main entry | Single public entry for step() |
	| `_layer1_operational(ctx, action_type, sql, rows, error)` | Operational signals | Stateless except episode tracking; `sql` is not in the `compute_step_reward()` signature, so it must be added there or read from `ctx` |
	| `_layer2_progress(ctx, rows)` | Progress-to-target (QUERY only) | Needs gold_rows comparison |
	| `_cardinality_score(pred_rows, gold_rows)` | Row count comparison | Tier 1 metric |
	| `_value_overlap_score(pred_rows, gold_rows)` | Jaccard set overlap | Tier 1 metric |
	| `_numeric_range_score(pred_rows, gold_rows)` | Log-distance for numbers | Tier 1 metric |
	| `_bin_progress(raw_score)` | Bin to {0, 0.25, 0.5, 0.75, 1.0} | Anti-gaming |
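The three Tier 1 metrics in the table could be sketched in pure Python as below. The exact formulas live in `reward_design.md` and are not reproduced in this summary, so everything here is an assumption: cardinality as one minus the relative row-count difference, value overlap as Jaccard similarity over flattened cell values, and numeric range as an averaged `1 / (1 + distance)` over signed log-magnitudes. The helpers `_cells`, `_numbers`, and `_signed_log` are hypothetical.

```python
import math
from numbers import Real


def _cells(rows):
    """Flatten result rows into a set of cell values."""
    return {value for row in rows for value in row}


def _numbers(rows):
    """Collect numeric cells, excluding booleans."""
    return [v for row in rows for v in row
            if isinstance(v, Real) and not isinstance(v, bool)]


def _signed_log(x: float) -> float:
    """Compress magnitude with log1p while keeping the sign."""
    return math.copysign(math.log1p(abs(x)), x)


def cardinality_score(pred_rows, gold_rows) -> float:
    """1.0 when row counts match, falling to 0.0 as they diverge."""
    p, g = len(pred_rows), len(gold_rows)
    return 1.0 - abs(p - g) / max(p, g, 1)


def value_overlap_score(pred_rows, gold_rows) -> float:
    """Jaccard similarity between the sets of cell values."""
    pred, gold = _cells(pred_rows), _cells(gold_rows)
    union = pred | gold
    return len(pred & gold) / len(union) if union else 0.0  # nothing to compare


def numeric_range_score(pred_rows, gold_rows) -> float:
    """Average closeness of each gold number to its nearest predicted number."""
    pred, gold = _numbers(pred_rows), _numbers(gold_rows)
    if not pred or not gold:
        return 0.0
    total = 0.0
    for g in gold:
        distance = min(abs(_signed_log(g) - _signed_log(p)) for p in pred)
        total += 1.0 / (1.0 + distance)
    return total / len(gold)
```

The nearest-number search is quadratic in the number of numeric cells, so large result sets may need truncation to stay within the per-call time budget.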

	### Abstraction Level

	- Current level: Flat — server modules with plain functions
	- Recommendation: Match flat style. `server/reward.py` with plain functions.

	### Anti-Patterns to Avoid

	- Don't create reward strategy classes
	- Don't add row-wise best match initially (add if training shows need)
	- Don't import numpy/scipy — pure Python
	- Don't re-execute gold SQL per step — cache at reset()

	---

	## 5. Constraints

	### Technical Constraints

	| Constraint | Requirement | Notes |
	|------------|-------------|-------|
	| Performance | < 5ms per reward computation | ~15 calls per episode |
	| No heavy deps | Pure Python | No numpy/scipy |
	| Deterministic | Same inputs → same reward | Required for reproducible training |
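One way to check the 5 ms budget is a small timing helper built on `time.perf_counter`. This is an illustrative sketch: `mean_call_ms` is a hypothetical helper, and `stand_in_reward` is a placeholder for the real reward function, which does not exist yet.

```python
import time

BUDGET_MS = 5.0  # per-call budget from the constraints table


def mean_call_ms(fn, *args, repeats: int = 1000) -> float:
    """Average wall-clock milliseconds per call of fn(*args)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) * 1000.0 / repeats


def stand_in_reward(pred_rows, gold_rows) -> float:
    """Placeholder workload: Jaccard overlap of two row sets."""
    pred, gold = set(pred_rows), set(gold_rows)
    union = pred | gold
    return len(pred & gold) / len(union) if union else 0.0


# Usage: time the placeholder on 1,000-row inputs and compare to the budget.
rows_a = [(i,) for i in range(1000)]
rows_b = [(i,) for i in range(500, 1500)]
elapsed_ms = mean_call_ms(stand_in_reward, rows_a, rows_b, repeats=200)
within_budget = elapsed_ms < BUDGET_MS
```

Wall-clock timings vary by machine, so a check like this is better as a benchmark than as a hard test assertion.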

	### Testing Constraints

	| Test Suite | Coverage Area | Notes |
	|------------|---------------|-------|
	| `tests/test_smoke.py` | 25 tests, some assert `reward=None` | Must update for non-None step rewards |

	---

	## 6. Open Questions

	| Question | Why It Matters | Who Can Answer |
	|----------|----------------|----------------|
	| Layer 2 combination: weighted average (0.25/0.50/0.25) or adaptive? | Affects reward quality | Default: Method 1 per reward_design.md |
	| Store `gold_rows` in EpisodeContext or separate cache? | Design coupling | Recommend EpisodeContext field |
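The default weighted-average option could be sketched as follows. Several details are assumptions, not confirmed by this summary: that the 0.25/0.50/0.25 weights apply in the order cardinality, value overlap, numeric range; that binning rounds to the nearest of the five levels; and that the ×0.15 scale applies to the improvement over the best binned score so far. `progress_reward` and `bin_progress` are hypothetical names.

```python
import math

WEIGHTS = (0.25, 0.50, 0.25)  # assumed order: cardinality, value overlap, numeric range
PROGRESS_SCALE = 0.15
NUM_INTERVALS = 4             # 5 levels: 0, 0.25, 0.5, 0.75, 1.0


def bin_progress(raw_score: float) -> float:
    """Snap a raw score to the nearest of the five levels (ties round up)."""
    clipped = max(0.0, min(1.0, raw_score))
    return math.floor(clipped * NUM_INTERVALS + 0.5) / NUM_INTERVALS


def progress_reward(best_so_far: float, cardinality: float,
                    overlap: float, numeric: float) -> tuple[float, float]:
    """Return (layer2_reward, new_best) under the improvement-only rule."""
    raw = sum(w * s for w, s in zip(WEIGHTS, (cardinality, overlap, numeric)))
    binned = bin_progress(raw)
    improvement = max(0.0, binned - best_so_far)  # no reward for regressing or stalling
    return improvement * PROGRESS_SCALE, max(best_so_far, binned)
```

Under these assumptions the Layer 2 rewards in one episode sum to at most 0.15, since the best binned score can rise from 0 to 1.0 only once.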

	---

	## 7. Context Sources

	| Source | Type | Notes |
	|--------|------|-------|
	| `server/reward.py` | Code (stub) | Docstring describes 3-layer architecture |
	| `server/sql_environment.py` | Code | step() flow, _execute_sql() |
	| `models.py:EpisodeContext` | Code | Needs new reward-tracking fields |
	| `docs_draft/SQLEnv_Concept_v1.md` Section 3 | Doc | Complete reward spec |
	| `docs_draft/reward_design.md` | Doc | Distance metrics, combination methods |
	| `docs_draft/reward-research_gpt-5-2.md` | Doc | Reward research |
