# Research Summary **Project:** SQLEnv - Question Dataset Expansion **Change:** F004 - Expand from 53 questions (one DB) to 100+ questions across 5-10 Spider databases with difficulty labels, answer_type metadata, gold_answer fields, and train/eval split **Date:** 2026-03-24 **Status:** Draft --- ## 1. Change Overview ### What We're Changing Expanding the question dataset from the current 53 Spider questions for a single database (`student_assessment`) to 100+ curated questions spanning 5-10 Spider databases. Each question will be enriched with metadata fields: `difficulty` (easy/medium/hard at 40/40/20 split), `answer_type` (integer/float/string/list/table), `gold_answer` (pre-computed), and `tables_involved`. The dataset will be split into train (70%) and eval (30%) partitions. ### Why We're Changing It Training on a single database schema risks overfitting. The agent needs diverse schemas, question patterns, and difficulty levels to develop generalizable SQL exploration strategies. Pre-computed gold answers avoid re-executing gold SQL every RL episode, improving training throughput. ### Success Criteria - 100+ questions across 5-10 Spider databases - Difficulty distribution: ~40% easy (1-2 tables), ~40% medium (2-3 tables), ~20% hard (4+ tables) - Every question has: `gold_answer`, `answer_type`, `difficulty`, `tables_involved` - Train/eval split at 70/30 with no cross-contamination - No questions requiring SQL features unsupported by SQLite - Diverse answer types (integer, float, string, list) and SQL patterns (aggregation, joins, subqueries, grouping) --- ## 2. System Context ### Current Behavior The system stores 53 raw Spider questions in `data/questions/student_assessment.json`. These questions use the Spider dataset's native format with fields: `db_id`, `query` (gold SQL), `question` (natural language), `query_toks`, `query_toks_no_value`, and `question_toks`. There are no `difficulty`, `answer_type`, `gold_answer`, or `tables_involved` fields. There is no train/eval split. The `SQLEnvironment` class in `server/sql_environment.py` currently hardcodes the `student_assessment` schema: it imports all 9 ORM models by name and builds a static schema description string in `_build_schema_description()`. Questions are not yet loaded or used in the environment loop (the conceptual `QuestionRecord` is defined in comments in `models.py` but not implemented). ### Architecture Context ``` data/ questions/ student_assessment.json <-- Current: raw Spider format, 53 questions (new files per database) <-- Target: enriched format, 100+ questions databases/ models.py <-- Current: student_assessment ORM only (new models per database) <-- Target: ORM per database OR direct SQLite scripts/ download_spider_data.py <-- Downloads questions from HuggingFace generate_models_from_schema.py <-- Auto-generates ORM from Spider schema server/ sql_environment.py <-- Hardcoded to student_assessment schema verifier.py <-- Stub; will use answer_type for comparison models.py <-- QuestionRecord conceptual design (comments) ``` F004 is a **data-layer feature** that produces the enriched question files. It does not implement the environment's question-loading logic (that belongs to F001 Core Environment Loop), but it must produce data in the format that F001/F002/F003 will consume. ### Entry Points | Entry Point | Trigger | Current Flow | |-------------|---------|--------------| | `data/questions/student_assessment.json` | Read at import/reset | Raw Spider format; 53 entries with `db_id`, `query`, `question`, tokenized variants | | `scripts/download_spider_data.py` | Manual CLI invocation | Downloads Spider questions from `xlangai/spider` HuggingFace dataset, filters by `db_id`, saves raw JSON | | `scripts/generate_models_from_schema.py` | Manual CLI invocation | Downloads schema from `richardr1126/spider-schema`, generates SQLAlchemy model `.py` files | | `data/databases/models.py` | Imported by `sql_environment.py` | Hand-written SQLAlchemy ORM for student_assessment (9 tables) | ### Data Flow | Data | Source | Shape/Type | Destination | |------|--------|------------|-------------| | Raw Spider questions | HuggingFace `xlangai/spider` | `[{db_id, query, question, query_toks, query_toks_no_value, question_toks}]` | `data/questions/{db_id}.json` | | Spider schema | HuggingFace `richardr1126/spider-schema` | `[{db_id, table: [{name, columns: [{name, type}]}], foreign_keys}]` | `data/models/{db_id}.py` (generated ORM) | | Enriched questions (target) | Curation script (new) | `[{id, db_id, question, gold_sql, gold_answer, answer_type, difficulty, tables_involved, split}]` | `data/questions/` or single manifest file | | SQLite database files | Spider dataset (to be downloaded) | `.sqlite` files | `data/databases/{db_id}/{db_id}.sqlite` | --- ## 3. Dependencies ### Code We Depend On | Dependency | What We Use | Risk if Changed | |------------|-------------|-----------------| | `datasets` (HuggingFace) | `load_dataset("xlangai/spider")` for questions, `load_dataset("richardr1126/spider-schema")` for schemas | Dataset API changes could break download scripts | | Spider dataset (`xlangai/spider`) | Raw questions with gold SQL | Dataset structure is stable (academic benchmark) | | Spider schema dataset (`richardr1126/spider-schema`) | Table/column definitions for ORM generation | Third-party dataset; less stable than official Spider | | `sqlite3` (stdlib) | Execute gold SQL to compute `gold_answer` | Stable (stdlib) | | SQLAlchemy | ORM model definitions used by environment | Already a project dependency | ### Code That Depends On Us | Dependent | How They Use Us | Impact of Our Change | |-----------|-----------------|---------------------| | F001 (Core Environment Loop) | Loads questions from JSON at `reset()`, selects question, opens SQLite database | Must produce questions in format matching `QuestionRecord` conceptual design in `models.py` | | F002 (Answer Verification) | Uses `answer_type` and `gold_answer` to verify agent submissions | Must correctly classify answer types and pre-compute gold answers | | F003 (Dense Reward) | Uses `gold_answer` for progress-to-target comparison (Layer 2) | Gold answers must be deterministic and correct | | F006 (GRPO Training) | Uses train split for training, eval split for evaluation | Train/eval split must be clean | | `server/sql_environment.py` | Currently hardcodes `student_assessment` ORM imports and schema description | Multi-database support will require changes to environment (F001 scope), but F004 must provide the data | ### External Systems | System | Integration Point | Considerations | |--------|-------------------|----------------| | HuggingFace Hub | `datasets.load_dataset()` | Network required for initial download; cache locally | | Spider SQLite databases | Direct file access | No `.sqlite` files exist in repo yet; must be downloaded or created | --- ## 4. Risks & Edge Cases ### Identified Risks | Risk | Likelihood | Impact | Mitigation | |------|------------|--------|------------| | Gold SQL produces different results across SQLite versions | Low | Incorrect gold_answer, bad reward signal | Pin SQLite version; validate gold answers on target SQLite | | Some Spider questions use SQL features not in SQLite (e.g., `ILIKE`, `DATEDIFF`) | Medium | Questions fail to execute | Filter questions by executing gold SQL against actual SQLite; exclude failures | | Spider database `.sqlite` files not available via HuggingFace datasets API | Medium | Cannot execute gold SQL to compute gold_answer | Download `.sqlite` files from Spider GitHub repo or reconstruct from schema | | Ambiguous gold answers (queries returning non-deterministic order) | Medium | Reward gives false negatives | For list/table answer types, use order-insensitive comparison; flag and review ORDER BY-dependent queries | | Difficulty classification is subjective | Low | Uneven difficulty distribution | Use heuristic: count distinct tables in gold SQL to assign difficulty | | Train/eval data leakage (same question rephrased across Spider train/validation) | Low | Overfitting on eval set | Use Spider's own train/validation split as basis; additionally deduplicate by gold SQL | ### Edge Cases to Handle | Edge Case | Current Behavior | Required Behavior | |-----------|------------------|-------------------| | Gold SQL returns empty result set | N/A | Include as valid question; gold_answer = empty list/table; answer_type = "list" or "table" | | Gold SQL returns NULL values | N/A | Normalize NULLs to Python None; handle in answer_type classification | | Multiple valid gold SQLs for same question | Only one gold SQL per Spider question | Accept Spider's single gold SQL; note that alternative SQL may produce same answer | | Database has no questions in Spider | N/A | Skip database during curation | | Question text contains typos/ambiguity (Spider known issue) | N/A | Accept as-is for MVP; flag obvious issues | ### Invariants to Preserve - [ ] Every question in the dataset has all required fields populated (no partial records) - [ ] Every `gold_sql` executes successfully against its corresponding SQLite database - [ ] Every `gold_answer` matches the result of executing `gold_sql` - [ ] Train and eval splits have no overlapping question IDs - [ ] Difficulty distribution approximates 40/40/20 (easy/medium/hard) - [ ] No question requires SQL features unsupported by SQLite --- ## 4b. Code Shape & Design Target ### Existing Vocabulary | Concept | Existing Name | Location | |---------|---------------|----------| | Question metadata (conceptual) | `QuestionRecord` | `models.py` lines 224-235 (commented design) | | Database identifier | `db_id` | Spider format field; used throughout `download_spider_data.py` | | Gold SQL | `query` (Spider) / `gold_sql` (QuestionRecord) | `student_assessment.json` / `models.py` | | Answer types | `integer, float, string, list, table` | `models.py` line 233, `server/verifier.py` docstring | | Difficulty levels | `easy, medium, hard` | `models.py` line 234 | | ORM models dictionary | `self.db_models` | `server/sql_environment.py` line 127 | | Spider download function | `download_spider_questions()` | `scripts/download_spider_data.py` | | Model generation function | `generate_simplified_models()` | `scripts/generate_models_from_schema.py` | ### Language/Framework Idioms - Python scripts with argparse CLI in `scripts/` directory - JSON files for data storage (not YAML, not CSV) - SQLAlchemy declarative ORM for database schema (though direct SQLite may suffice for F004) - Pydantic models for typed data contracts (`models.py`) - HuggingFace `datasets` library for Spider data access - Type hints throughout; `pathlib.Path` for file operations - Docstrings in Google style with Args/Returns sections ### Target Shape | Component | Purpose | Why This Boundary | |-----------|---------|-------------------| | `scripts/curate_questions.py` | Main curation script: download questions for selected DBs, enrich with metadata, compute gold answers, assign difficulty, create splits, validate, output final dataset | Single script matching existing `scripts/` pattern; orchestrates the full pipeline | | `data/questions/questions_train.json` | Training split (70%) of enriched questions | Consumed by F001 at reset(); separate file makes split explicit | | `data/questions/questions_eval.json` | Evaluation split (30%) of enriched questions | Consumed by F005 Green Agent; prevents training on eval data | | `data/databases/{db_id}/{db_id}.sqlite` | SQLite database files per Spider database | Required to execute gold SQL and compute gold_answer; also needed by F001 for live query execution | ### Abstraction Level - **Current level:** Flat scripts in `scripts/`. Data as JSON files in `data/`. No abstraction layers for data loading. - **Recommendation:** Match existing flat style. One script that does everything end-to-end. Output is JSON files. No data-loading library or ORM for the curation pipeline itself -- just `sqlite3` and `json`. The environment's question-loading code belongs to F001, not F004. ### Anti-Patterns to Avoid - Do not create a complex data pipeline framework (e.g., classes like `QuestionCurator`, `DifficultyClassifier`, `AnswerTypeDetector`). A single script with clear functions is sufficient. - Do not generate SQLAlchemy ORM models per database for F004 purposes. The curation script only needs `sqlite3` to execute gold SQL. Whether the environment needs ORM models per DB is an F001 decision. - Do not embed the curation logic inside the server code. Keep it as a standalone script that produces static JSON files. - Do not hardcode database selection. Use a configuration list (or CLI args) so databases can be added/removed easily. --- ## 5. Constraints ### Technical Constraints | Constraint | Requirement | Notes | |------------|-------------|-------| | SQLite compatibility | All gold SQL must execute on SQLite | Spider was designed for SQLite; ~99% compatible, but verify edge cases | | Dataset size | 100+ questions minimum | Quality over quantity; user specified 100 as sufficient for MVP | | Difficulty split | ~40% easy / ~40% medium / ~20% hard | Hard questions (4+ tables) are rarer in Spider; may need to pull from more databases | | Answer types | Cover integer, float, string, list at minimum | Table type can be deferred per F002 user interview | | No network at runtime | Questions and SQLite files must be committed to repo or downloaded once | Curation script runs offline after initial download | ### Pattern Constraints - Question ID format: Use `{db_id}_{split}_{index}` (e.g., `concert_singer_train_007`) to match the conceptual `QuestionRecord.question_id` format like `spider_dev_042` - Output JSON must match the `QuestionRecord` fields defined in `models.py` comments: `question_id`, `question_text`, `database_name`, `gold_sql`, `gold_answer`, `answer_type`, `difficulty`, `tables_involved` - Spider's own train/validation split should be respected as the basis (train questions -> train split, validation questions -> eval split) ### Testing Constraints | Test Suite | Coverage Area | Notes | |------------|---------------|-------| | `tests/test_smoke.py` | Environment instantiation, action detection, serialization | Must still pass; F004 should not change server code | | New: dataset validation tests | All questions valid, gold SQL executes, splits clean | Should be part of curation script's `--validate` mode or a separate test | --- ## 6. Open Questions | Question | Why It Matters | Who Can Answer | |----------|----------------|----------------| | Which specific Spider databases to include? | Determines schema diversity and question count | Researcher (see analysis below) | | Where to get SQLite database files? | Spider HuggingFace datasets may not include `.sqlite` files directly | Technical investigation | | Should the output be per-database JSON files or a single manifest? | Affects how F001 loads questions | Architecture decision | **Spider Database Candidates (Research-Based Recommendations):** Good candidates for diverse schemas and well-formed questions based on Spider dataset characteristics: 1. `student_assessment` (already have; 53 questions, 9 tables) -- education domain 2. `concert_singer` (popular Spider DB; ~30 questions, 4 tables) -- entertainment domain 3. `world_1` (~30 questions, 3 tables) -- geography domain 4. `car_1` (~20 questions, 4 tables) -- automotive domain 5. `employee_hire_evaluation` (~20 questions, 4 tables) -- HR domain 6. `pets_1` (~20 questions, 3 tables) -- simple schema, good for easy questions 7. `cre_Doc_Template_Mgt` (~25 questions, 6 tables) -- document management domain 8. `dog_kennels` (~25 questions, 7 tables) -- business domain 9. `flight_2` (~20 questions, 5 tables) -- transportation domain 10. `poker_player` (~15 questions, 2 tables) -- simple, good for easy questions These span diverse domains, table counts (2-9), and would yield ~250+ raw questions to curate down to 100+ high-quality ones. **SQLite Database Files:** The Spider dataset's SQLite files are typically obtained from the official Spider GitHub release (`https://github.com/taoyds/spider`), not from the HuggingFace datasets API. The `xlangai/spider` HuggingFace dataset contains questions but likely not the `.sqlite` files themselves. The curation script will need to either: 1. Download `.sqlite` files from the Spider GitHub release 2. Reconstruct databases from the schema dataset using `CREATE TABLE` + `INSERT` statements Option 1 is more reliable. The Spider GitHub release includes a `database/` directory with all SQLite files. --- ## 7. Context Sources | Source | Type | Notes | |--------|------|-------| | `data/questions/student_assessment.json` | Code/Data | Current format: raw Spider with `db_id`, `query`, `question`, tokenized variants. Missing: difficulty, answer_type, gold_answer, tables_involved | | `scripts/download_spider_data.py` | Code | Downloads from `xlangai/spider` HuggingFace dataset. Supports `--db-id` filter and `--split` (train/validation) | | `scripts/generate_models_from_schema.py` | Code | Downloads from `richardr1126/spider-schema`. Generates SQLAlchemy ORM files. Uses `generate_simplified_models()` | | `data/databases/models.py` | Code | Hand-written SQLAlchemy ORM for student_assessment. 9 tables with relationships. This is the reference quality for ORM models | | `models.py` | Code | `QuestionRecord` conceptual design (lines 224-235): defines target fields `question_id`, `question_text`, `database_name`, `gold_sql`, `gold_answer`, `answer_type`, `difficulty`, `tables_involved` | | `server/sql_environment.py` | Code | Hardcoded to student_assessment: imports 9 specific ORM models, builds static schema string. `_build_schema_description()` must match ORM | | `server/verifier.py` | Code | Stub with docstring defining 5 answer types: integer, float, string, list, table | | `server/reward.py` | Code | Stub referencing 3-layer reward. Layer 2 needs gold_answer for progress comparison | | `docs/ARCHITECTURE.md` | Doc | System map showing current single-DB architecture. Notes SQLite files not yet present | | `specs/FEATURES.json` | Doc | F004 definition with user interview context | | `docs_draft/sql_env_project_brief.md` | Doc | Project brief: 50-100 questions target, multi-hop insight, difficulty progression | --- ## Human Validation Checkpoint **Before proceeding to planning, please confirm:** - [ ] System context is accurate - [ ] Dependencies are complete - [ ] Risks are identified - [ ] Constraints are correct - [ ] Open questions can be resolved **Questions for reviewer:** 1. Is anything incorrect or missing? 2. Are there risks I haven't identified? 3. Should we proceed to planning? --- *Validated by: [NAME] on [DATE]*