# Research Summary

**Project:** SQLEnv - Question Dataset Expansion
**Change:** F004 - Expand from 53 questions (one DB) to 100+ questions across 5-10 Spider databases with difficulty labels, answer_type metadata, gold_answer fields, and train/eval split
**Date:** 2026-03-24
**Status:** Draft

---

## 1. Change Overview

### What We're Changing

Expanding the question dataset from the current 53 Spider questions for a single database (`student_assessment`) to 100+ curated questions spanning 5-10 Spider databases. Each question will be enriched with metadata fields: `difficulty` (easy/medium/hard at 40/40/20 split), `answer_type` (integer/float/string/list/table), `gold_answer` (pre-computed), and `tables_involved`. The dataset will be split into train (70%) and eval (30%) partitions.

### Why We're Changing It

Training on a single database schema risks overfitting. The agent needs diverse schemas, question patterns, and difficulty levels to develop generalizable SQL exploration strategies. Pre-computed gold answers avoid re-executing gold SQL every RL episode, improving training throughput.

### Success Criteria

- 100+ questions across 5-10 Spider databases
- Difficulty distribution: ~40% easy (1-2 tables), ~40% medium (2-3 tables), ~20% hard (4+ tables)
- Every question has: `gold_answer`, `answer_type`, `difficulty`, `tables_involved`
- Train/eval split at 70/30 with no cross-contamination
- No questions requiring SQL features unsupported by SQLite
- Diverse answer types (integer, float, string, list) and SQL patterns (aggregation, joins, subqueries, grouping)

---

## 2. System Context

### Current Behavior

The system stores 53 raw Spider questions in `data/questions/student_assessment.json`. These questions use the Spider dataset's native format with fields: `db_id`, `query` (gold SQL), `question` (natural language), `query_toks`, `query_toks_no_value`, and `question_toks`. There are no `difficulty`, `answer_type`, `gold_answer`, or `tables_involved` fields. There is no train/eval split.

The `SQLEnvironment` class in `server/sql_environment.py` currently hardcodes the `student_assessment` schema: it imports all 9 ORM models by name and builds a static schema description string in `_build_schema_description()`. Questions are not yet loaded or used in the environment loop (the conceptual `QuestionRecord` is defined in comments in `models.py` but not implemented).

### Architecture Context

```
data/
  questions/
    student_assessment.json    <-- Current: raw Spider format, 53 questions
    (new files per database)   <-- Target: enriched format, 100+ questions
  databases/
    models.py                  <-- Current: student_assessment ORM only
    (new models per database)  <-- Target: ORM per database OR direct SQLite

scripts/
  download_spider_data.py      <-- Downloads questions from HuggingFace
  generate_models_from_schema.py  <-- Auto-generates ORM from Spider schema

server/
  sql_environment.py           <-- Hardcoded to student_assessment schema
  verifier.py                  <-- Stub; will use answer_type for comparison

models.py                      <-- QuestionRecord conceptual design (comments)
```

F004 is a **data-layer feature** that produces the enriched question files. It does not implement the environment's question-loading logic (that belongs to F001 Core Environment Loop), but it must produce data in the format that F001/F002/F003 will consume.

### Entry Points

| Entry Point | Trigger | Current Flow |
|-------------|---------|--------------|
| `data/questions/student_assessment.json` | Read at import/reset | Raw Spider format; 53 entries with `db_id`, `query`, `question`, tokenized variants |
| `scripts/download_spider_data.py` | Manual CLI invocation | Downloads Spider questions from `xlangai/spider` HuggingFace dataset, filters by `db_id`, saves raw JSON |
| `scripts/generate_models_from_schema.py` | Manual CLI invocation | Downloads schema from `richardr1126/spider-schema`, generates SQLAlchemy model `.py` files |
| `data/databases/models.py` | Imported by `sql_environment.py` | Hand-written SQLAlchemy ORM for student_assessment (9 tables) |

### Data Flow

| Data | Source | Shape/Type | Destination |
|------|--------|------------|-------------|
| Raw Spider questions | HuggingFace `xlangai/spider` | `[{db_id, query, question, query_toks, query_toks_no_value, question_toks}]` | `data/questions/{db_id}.json` |
| Spider schema | HuggingFace `richardr1126/spider-schema` | `[{db_id, table: [{name, columns: [{name, type}]}], foreign_keys}]` | `data/models/{db_id}.py` (generated ORM) |
| Enriched questions (target) | Curation script (new) | `[{id, db_id, question, gold_sql, gold_answer, answer_type, difficulty, tables_involved, split}]` | `data/questions/` or single manifest file |
| SQLite database files | Spider dataset (to be downloaded) | `.sqlite` files | `data/databases/{db_id}/{db_id}.sqlite` |

---

## 3. Dependencies

### Code We Depend On

| Dependency | What We Use | Risk if Changed |
|------------|-------------|-----------------|
| `datasets` (HuggingFace) | `load_dataset("xlangai/spider")` for questions, `load_dataset("richardr1126/spider-schema")` for schemas | Dataset API changes could break download scripts |
| Spider dataset (`xlangai/spider`) | Raw questions with gold SQL | Dataset structure is stable (academic benchmark) |
| Spider schema dataset (`richardr1126/spider-schema`) | Table/column definitions for ORM generation | Third-party dataset; less stable than official Spider |
| `sqlite3` (stdlib) | Execute gold SQL to compute `gold_answer` | Stable (stdlib) |
| SQLAlchemy | ORM model definitions used by environment | Already a project dependency |

### Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
|-----------|-----------------|---------------------|
| F001 (Core Environment Loop) | Loads questions from JSON at `reset()`, selects question, opens SQLite database | Must produce questions in format matching `QuestionRecord` conceptual design in `models.py` |
| F002 (Answer Verification) | Uses `answer_type` and `gold_answer` to verify agent submissions | Must correctly classify answer types and pre-compute gold answers |
| F003 (Dense Reward) | Uses `gold_answer` for progress-to-target comparison (Layer 2) | Gold answers must be deterministic and correct |
| F006 (GRPO Training) | Uses train split for training, eval split for evaluation | Train/eval split must be clean |
| `server/sql_environment.py` | Currently hardcodes `student_assessment` ORM imports and schema description | Multi-database support will require changes to environment (F001 scope), but F004 must provide the data |

### External Systems

| System | Integration Point | Considerations |
|--------|-------------------|----------------|
| HuggingFace Hub | `datasets.load_dataset()` | Network required for initial download; cache locally |
| Spider SQLite databases | Direct file access | No `.sqlite` files exist in repo yet; must be downloaded or created |

---

## 4. Risks & Edge Cases

### Identified Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Gold SQL produces different results across SQLite versions | Low | Incorrect gold_answer, bad reward signal | Pin SQLite version; validate gold answers on target SQLite |
| Some Spider questions use SQL features not in SQLite (e.g., `ILIKE`, `DATEDIFF`) | Medium | Questions fail to execute | Filter questions by executing gold SQL against actual SQLite; exclude failures |
| Spider database `.sqlite` files not available via HuggingFace datasets API | Medium | Cannot execute gold SQL to compute gold_answer | Download `.sqlite` files from Spider GitHub repo or reconstruct from schema |
| Ambiguous gold answers (queries returning non-deterministic order) | Medium | Reward gives false negatives | For list/table answer types, use order-insensitive comparison; flag and review ORDER BY-dependent queries |
| Difficulty classification is subjective | Low | Uneven difficulty distribution | Use heuristic: count distinct tables in gold SQL to assign difficulty |
| Train/eval data leakage (same question rephrased across Spider train/validation) | Low | Overfitting on eval set | Use Spider's own train/validation split as basis; additionally deduplicate by gold SQL |

### Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
|-----------|------------------|-------------------|
| Gold SQL returns empty result set | N/A | Include as valid question; gold_answer = empty list/table; answer_type = "list" or "table" |
| Gold SQL returns NULL values | N/A | Normalize NULLs to Python None; handle in answer_type classification |
| Multiple valid gold SQLs for same question | Only one gold SQL per Spider question | Accept Spider's single gold SQL; note that alternative SQL may produce same answer |
| Database has no questions in Spider | N/A | Skip database during curation |
| Question text contains typos/ambiguity (Spider known issue) | N/A | Accept as-is for MVP; flag obvious issues |

### Invariants to Preserve

- [ ] Every question in the dataset has all required fields populated (no partial records)
- [ ] Every `gold_sql` executes successfully against its corresponding SQLite database
- [ ] Every `gold_answer` matches the result of executing `gold_sql`
- [ ] Train and eval splits have no overlapping question IDs
- [ ] Difficulty distribution approximates 40/40/20 (easy/medium/hard)
- [ ] No question requires SQL features unsupported by SQLite

---

## 4b. Code Shape & Design Target

### Existing Vocabulary

| Concept | Existing Name | Location |
|---------|---------------|----------|
| Question metadata (conceptual) | `QuestionRecord` | `models.py` lines 224-235 (commented design) |
| Database identifier | `db_id` | Spider format field; used throughout `download_spider_data.py` |
| Gold SQL | `query` (Spider) / `gold_sql` (QuestionRecord) | `student_assessment.json` / `models.py` |
| Answer types | `integer, float, string, list, table` | `models.py` line 233, `server/verifier.py` docstring |
| Difficulty levels | `easy, medium, hard` | `models.py` line 234 |
| ORM models dictionary | `self.db_models` | `server/sql_environment.py` line 127 |
| Spider download function | `download_spider_questions()` | `scripts/download_spider_data.py` |
| Model generation function | `generate_simplified_models()` | `scripts/generate_models_from_schema.py` |

### Language/Framework Idioms

- Python scripts with argparse CLI in `scripts/` directory
- JSON files for data storage (not YAML, not CSV)
- SQLAlchemy declarative ORM for database schema (though direct SQLite may suffice for F004)
- Pydantic models for typed data contracts (`models.py`)
- HuggingFace `datasets` library for Spider data access
- Type hints throughout; `pathlib.Path` for file operations
- Docstrings in Google style with Args/Returns sections

### Target Shape

| Component | Purpose | Why This Boundary |
|-----------|---------|-------------------|
| `scripts/curate_questions.py` | Main curation script: download questions for selected DBs, enrich with metadata, compute gold answers, assign difficulty, create splits, validate, output final dataset | Single script matching existing `scripts/` pattern; orchestrates the full pipeline |
| `data/questions/questions_train.json` | Training split (70%) of enriched questions | Consumed by F001 at reset(); separate file makes split explicit |
| `data/questions/questions_eval.json` | Evaluation split (30%) of enriched questions | Consumed by F005 Green Agent; prevents training on eval data |
| `data/databases/{db_id}/{db_id}.sqlite` | SQLite database files per Spider database | Required to execute gold SQL and compute gold_answer; also needed by F001 for live query execution |

### Abstraction Level

- **Current level:** Flat scripts in `scripts/`. Data as JSON files in `data/`. No abstraction layers for data loading.
- **Recommendation:** Match existing flat style. One script that does everything end-to-end. Output is JSON files. No data-loading library or ORM for the curation pipeline itself -- just `sqlite3` and `json`. The environment's question-loading code belongs to F001, not F004.

### Anti-Patterns to Avoid

- Do not create a complex data pipeline framework (e.g., classes like `QuestionCurator`, `DifficultyClassifier`, `AnswerTypeDetector`). A single script with clear functions is sufficient.
- Do not generate SQLAlchemy ORM models per database for F004 purposes. The curation script only needs `sqlite3` to execute gold SQL. Whether the environment needs ORM models per DB is an F001 decision.
- Do not embed the curation logic inside the server code. Keep it as a standalone script that produces static JSON files.
- Do not hardcode database selection. Use a configuration list (or CLI args) so databases can be added/removed easily.

---

## 5. Constraints

### Technical Constraints

| Constraint | Requirement | Notes |
|------------|-------------|-------|
| SQLite compatibility | All gold SQL must execute on SQLite | Spider was designed for SQLite; ~99% compatible, but verify edge cases |
| Dataset size | 100+ questions minimum | Quality over quantity; user specified 100 as sufficient for MVP |
| Difficulty split | ~40% easy / ~40% medium / ~20% hard | Hard questions (4+ tables) are rarer in Spider; may need to pull from more databases |
| Answer types | Cover integer, float, string, list at minimum | Table type can be deferred per F002 user interview |
| No network at runtime | Questions and SQLite files must be committed to repo or downloaded once | Curation script runs offline after initial download |

### Pattern Constraints

- Question ID format: Use `{db_id}_{split}_{index}` (e.g., `concert_singer_train_007`) to match the conceptual `QuestionRecord.question_id` format like `spider_dev_042`
- Output JSON must match the `QuestionRecord` fields defined in `models.py` comments: `question_id`, `question_text`, `database_name`, `gold_sql`, `gold_answer`, `answer_type`, `difficulty`, `tables_involved`
- Spider's own train/validation split should be respected as the basis (train questions -> train split, validation questions -> eval split)

### Testing Constraints

| Test Suite | Coverage Area | Notes |
|------------|---------------|-------|
| `tests/test_smoke.py` | Environment instantiation, action detection, serialization | Must still pass; F004 should not change server code |
| New: dataset validation tests | All questions valid, gold SQL executes, splits clean | Should be part of curation script's `--validate` mode or a separate test |

---

## 6. Open Questions

| Question | Why It Matters | Who Can Answer |
|----------|----------------|----------------|
| Which specific Spider databases to include? | Determines schema diversity and question count | Researcher (see analysis below) |
| Where to get SQLite database files? | Spider HuggingFace datasets may not include `.sqlite` files directly | Technical investigation |
| Should the output be per-database JSON files or a single manifest? | Affects how F001 loads questions | Architecture decision |

**Spider Database Candidates (Research-Based Recommendations):**

Good candidates for diverse schemas and well-formed questions based on Spider dataset characteristics:
1. `student_assessment` (already have; 53 questions, 9 tables) -- education domain
2. `concert_singer` (popular Spider DB; ~30 questions, 4 tables) -- entertainment domain
3. `world_1` (~30 questions, 3 tables) -- geography domain
4. `car_1` (~20 questions, 4 tables) -- automotive domain
5. `employee_hire_evaluation` (~20 questions, 4 tables) -- HR domain
6. `pets_1` (~20 questions, 3 tables) -- simple schema, good for easy questions
7. `cre_Doc_Template_Mgt` (~25 questions, 6 tables) -- document management domain
8. `dog_kennels` (~25 questions, 7 tables) -- business domain
9. `flight_2` (~20 questions, 5 tables) -- transportation domain
10. `poker_player` (~15 questions, 2 tables) -- simple, good for easy questions

These span diverse domains, table counts (2-9), and would yield ~250+ raw questions to curate down to 100+ high-quality ones.

**SQLite Database Files:**

The Spider dataset's SQLite files are typically obtained from the official Spider GitHub release (`https://github.com/taoyds/spider`), not from the HuggingFace datasets API. The `xlangai/spider` HuggingFace dataset contains questions but likely not the `.sqlite` files themselves. The curation script will need to either:
1. Download `.sqlite` files from the Spider GitHub release
2. Reconstruct databases from the schema dataset using `CREATE TABLE` + `INSERT` statements

Option 1 is more reliable. The Spider GitHub release includes a `database/` directory with all SQLite files.

---

## 7. Context Sources

| Source | Type | Notes |
|--------|------|-------|
| `data/questions/student_assessment.json` | Code/Data | Current format: raw Spider with `db_id`, `query`, `question`, tokenized variants. Missing: difficulty, answer_type, gold_answer, tables_involved |
| `scripts/download_spider_data.py` | Code | Downloads from `xlangai/spider` HuggingFace dataset. Supports `--db-id` filter and `--split` (train/validation) |
| `scripts/generate_models_from_schema.py` | Code | Downloads from `richardr1126/spider-schema`. Generates SQLAlchemy ORM files. Uses `generate_simplified_models()` |
| `data/databases/models.py` | Code | Hand-written SQLAlchemy ORM for student_assessment. 9 tables with relationships. This is the reference quality for ORM models |
| `models.py` | Code | `QuestionRecord` conceptual design (lines 224-235): defines target fields `question_id`, `question_text`, `database_name`, `gold_sql`, `gold_answer`, `answer_type`, `difficulty`, `tables_involved` |
| `server/sql_environment.py` | Code | Hardcoded to student_assessment: imports 9 specific ORM models, builds static schema string. `_build_schema_description()` must match ORM |
| `server/verifier.py` | Code | Stub with docstring defining 5 answer types: integer, float, string, list, table |
| `server/reward.py` | Code | Stub referencing 3-layer reward. Layer 2 needs gold_answer for progress comparison |
| `docs/ARCHITECTURE.md` | Doc | System map showing current single-DB architecture. Notes SQLite files not yet present |
| `specs/FEATURES.json` | Doc | F004 definition with user interview context |
| `docs_draft/sql_env_project_brief.md` | Doc | Project brief: 50-100 questions target, multi-hop insight, difficulty progression |

---

## Human Validation Checkpoint

**Before proceeding to planning, please confirm:**

- [ ] System context is accurate
- [ ] Dependencies are complete
- [ ] Risks are identified
- [ ] Constraints are correct
- [ ] Open questions can be resolved

**Questions for reviewer:**
1. Is anything incorrect or missing?
2. Are there risks I haven't identified?
3. Should we proceed to planning?

---

*Validated by: [NAME] on [DATE]*