# Verification Specification

**Feature:** F006
**Generated from:** specs/F006-VERIFICATION_INPUT.json
**Generated:** 2026-03-27

---

## 1. Unit Tests

### 1.1 GRPOConfig

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_grpo_config_defaults | All defaults are populated when only required fields given | `GRPOConfig(questions_path="q.json", db_dir="dbs/", output_dir="out/")` | `max_new_tokens=256, num_train_epochs=1, per_device_train_batch_size=2, gradient_accumulation_steps=4, learning_rate=5e-6, num_generations=4, step_budget=10, difficulty_filter=["easy","medium"], seed=42, logging_steps=10, model_name="Qwen/Qwen3-1.7B"` | happy |
| test_grpo_config_custom_values | Custom values override defaults | `GRPOConfig(model_name="gpt2", max_new_tokens=128, ...)` | Fields match custom values | happy |
| test_grpo_config_required_fields | Missing required fields raise error | `GRPOConfig()` (no questions_path, db_dir, output_dir) | `TypeError` or validation error | error |
| test_grpo_config_negative_batch_size | Negative or zero batch size | `per_device_train_batch_size=0` | Validation error or clear failure at training time | edge |
| test_grpo_config_negative_learning_rate | Negative learning rate | `learning_rate=-1.0` | Validation error | edge |
| test_grpo_config_empty_difficulty_filter | Empty difficulty filter list | `difficulty_filter=[]` | Empty training set or clear error | edge |
| test_grpo_config_seed_reproducibility | Same seed produces same config state | `seed=42` twice | Identical configs | happy |

**Run:** `uv run pytest tests/unit/test_grpo_config.py -v`

---

### 1.2 get_system_prompt (training/prompts.py)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_system_prompt_returns_string | Function returns non-empty string | None | `isinstance(result, str) and len(result) > 0` | happy |
| test_system_prompt_mentions_action_types | Prompt documents all four action types | None | Result contains "DESCRIBE", "SAMPLE", "QUERY", "ANSWER" | happy |
| test_system_prompt_is_deterministic | Multiple calls return identical string | None | `get_system_prompt() == get_system_prompt()` | happy |

**Run:** `uv run pytest tests/unit/test_prompts.py -v`

---

### 1.3 format_observation (training/prompts.py)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_format_observation_happy | Formats a normal observation into user-turn string | `SQLObservation(question="Q?", schema_info="tables", result="25", error="", step_count=1, budget_remaining=9, action_history=["QUERY"], done=False, reward=None)` | Non-empty string containing question, result, and budget info | happy |
| test_format_observation_with_error | Error field is surfaced in formatted string | `SQLObservation(..., error="syntax error", result="")` | String contains "syntax error" or error indication | happy |
| test_format_observation_done_state | Terminal observation is properly formatted | `SQLObservation(..., done=True, reward=1.0)` | String includes reward/done indication | happy |
| test_format_observation_empty_result | Empty result is handled gracefully | `SQLObservation(..., result="", error="")` | Returns valid string without crashing | edge |
| test_format_observation_long_result | Very long result string | `SQLObservation(..., result="x" * 10000)` | Returns string (may be truncated); no crash | edge |

**Run:** `uv run pytest tests/unit/test_prompts.py -v`

---

### 1.4 parse_model_output (training/rollout.py)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_parse_describe | Parses DESCRIBE action | `"DESCRIBE employees"` | `SQLAction(action_type="DESCRIBE", argument="employees")` | happy |
| test_parse_sample | Parses SAMPLE action | `"SAMPLE departments"` | `SQLAction(action_type="SAMPLE", argument="departments")` | happy |
| test_parse_query | Parses QUERY action | `"QUERY SELECT COUNT(*) FROM employees"` | `SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employees")` | happy |
| test_parse_answer | Parses ANSWER action | `"ANSWER 42"` | `SQLAction(action_type="ANSWER", argument="42")` | happy |
| test_parse_case_insensitive | Case variations accepted | `"describe employees"` or `"Describe employees"` | Valid SQLAction with action_type="DESCRIBE" | edge |
| test_parse_with_colon_separator | Colon-separated format | `"QUERY: SELECT 1"` | `SQLAction(action_type="QUERY", argument="SELECT 1")` | edge |
| test_parse_garbage_fallback | Unparseable text falls back to QUERY | `"hello world random text"` | `SQLAction(action_type="QUERY", argument="hello world random text")` | error |
| test_parse_empty_string_fallback | Empty string falls back to QUERY | `""` | `SQLAction(action_type="QUERY", argument="")` | edge |
| test_parse_only_action_no_argument | Action keyword with no argument | `"DESCRIBE"` | Fallback or empty argument handled gracefully | edge |
| test_parse_multiline_output | Model output with multiple lines | `"Let me think...\nQUERY SELECT 1"` | Extracts QUERY action or falls back to QUERY with raw text | edge |
| test_parse_whitespace_padded | Leading/trailing whitespace | `" ANSWER 42 "` | `SQLAction(action_type="ANSWER", argument="42")` | edge |

**Run:** `uv run pytest tests/unit/test_rollout.py -v`

---

### 1.5 reward_correctness (training/rewards.py)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_correctness_correct_answer | Episode ended with correct answer | Completions with correct=True metadata | `[1.0]` | happy |
| test_correctness_wrong_answer | Episode ended with wrong answer | Completions with correct=False metadata | `[0.0]` | happy |
| test_correctness_no_answer | Episode timed out without answering | Completions with no answer metadata | `[0.0]` | edge |
| test_correctness_batch | Multiple episodes in batch | Mixed correct/wrong | `[1.0, 0.0, 1.0, 0.0]` matching per-episode correctness | happy |
| test_correctness_empty_batch | Empty completions list | `[]` | `[]` | edge |
| test_correctness_trl_compatible | Return type is list[float] | Any valid input | `all(isinstance(r, float) for r in result)` | happy |

**Run:** `uv run pytest tests/unit/test_rewards.py -v`

---

### 1.6 reward_progress (training/rewards.py)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_progress_full | Maximum progress (correct answer) | Completions with full progress metadata | Reward in `[0.0, 1.0]`, close to 1.0 | happy |
| test_progress_none | No progress toward answer | Completions with zero progress | `[0.0]` | happy |
| test_progress_partial | Partial progress | Completions with partial closeness | Reward in `(0.0, 1.0)` exclusive | happy |
| test_progress_normalized | Output is always in [0, 1] range | Various inputs | `all(0.0 <= r <= 1.0 for r in result)` | happy |
| test_progress_batch | Batch of varied progress | Multiple episodes | List of floats, length matches input | happy |
| test_progress_trl_compatible | Return type is list[float] | Any valid input | `all(isinstance(r, float) for r in result)` | happy |

**Run:** `uv run pytest tests/unit/test_rewards.py -v`

---

### 1.7 reward_operational (training/rewards.py)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_operational_good_episode | All steps execute OK, discover new info, no repeats | Completions with exec_ok=True, new_info=True per step | Positive reward | happy |
| test_operational_all_errors | Every step has execution errors | Completions with exec_ok=False per step | Low/negative reward | error |
| test_operational_repeat_penalty | Episode with repeated identical actions | Completions with repeat=True per step | Lower reward than non-repeating | happy |
| test_operational_mixed_signals | Mix of good and bad steps | Varied step signals | Reward between extremes | happy |
| test_operational_single_step | Episode with only one step | Single step completions | Valid float returned | edge |
| test_operational_batch | Multiple episodes | Batch input | List of floats, length matches | happy |
| test_operational_trl_compatible | Return type is list[float] | Any valid input | `all(isinstance(r, float) for r in result)` | happy |

**Run:** `uv run pytest tests/unit/test_rewards.py -v`

---

### 1.8 rollout_func (training/rollout.py)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_rollout_returns_completions | Returns list of dicts with expected keys | Single prompt, mock model/tokenizer | List of dicts with "content" and metadata keys | happy |
| test_rollout_batch_size | Output length matches input prompt count | N prompts | N completions returned | happy |
| test_rollout_episode_terminates | Episodes terminate within step_budget | Config with step_budget=5 | All episodes have <= 5 steps | happy |
| test_rollout_metadata_present | Completions include correctness, progress, operational metadata | Any valid input | Each completion dict has "correct", "progress", "operational" keys | happy |
| test_rollout_unparseable_action | Model generates gibberish, fallback fires | Mock model returning garbage tokens | Episode continues; no crash | error |
| test_rollout_truncation | Long history is truncated to system + last 3 pairs | Mock model, config with step_budget=20 | Context does not exceed token window | edge |

**Run:** `uv run pytest tests/unit/test_rollout.py -v`

---

## 2. Integration Tests

### Flow: End-to-End Training Episode

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Create GRPOConfig with test questions and mock DB | Config object created | Config fields match inputs |
| 2 | Load questions and filter by difficulty | Only easy+medium questions included | Assert filtered count < total if hard questions exist |
| 3 | Call rollout_func with a real SQLEnvironment and mock model | Completions returned with metadata | Each completion has "content" key |
| 4 | Pass completions to reward_correctness | Returns list[float] of 0.0/1.0 | Length matches batch size |
| 5 | Pass completions to reward_progress | Returns list[float] in [0,1] | Length matches batch size |
| 6 | Pass completions to reward_operational | Returns list[float] | Length matches batch size |

**Run:** `uv run pytest tests/integration/test_training_pipeline.py -v`

---

### Flow: Unparseable Action Recovery

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Mock model generates unparseable text | parse_model_output returns QUERY fallback | action_type == "QUERY", argument == raw text |
| 2 | SQLEnvironment.step receives fallback action | Returns error observation | observation.error is non-empty |
| 3 | Episode continues with next step | Step count increments, budget decreases | step_count > previous, budget_remaining < previous |

**Run:** `uv run pytest tests/integration/test_training_pipeline.py -v`

---

### Flow: History Truncation

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Run rollout with step_budget large enough to exceed token window | Truncation is triggered | History contains system prompt + last 3 observation-action pairs only |
| 2 | Episode completes normally after truncation | No crash; completions returned | Valid completion dicts in output |

**Run:** `uv run pytest tests/integration/test_training_pipeline.py -v`

---

## 3. API Tests

No API endpoints defined for F006. All interfaces are Python function calls.

---

## 4. E2E Tests

### Scenario: Training Notebook Smoke Test

**Setup:** Test questions JSON with 2 easy questions, test SQLite database, tiny model (or mock).

**Actions:**
1. Instantiate GRPOConfig with test paths and minimal hyperparameters (1 epoch, batch_size=1, num_generations=2).
2. Load model and tokenizer (use smallest available model or mock).
3. Create GRPOTrainer with reward functions.
4. Run trainer.train() for a single step.
5. Verify learning curve data is logged.
6. Run comparison episodes (before/after).

**Expected:**
- Training completes without error.
- At least one metric is logged (loss, reward).
- Comparison episodes produce valid SQLObservation sequences.

**Run:** `uv run pytest tests/e2e/test_training_e2e.py -v --timeout=300`

---

### Scenario: Question Filtering by Difficulty

**Setup:** Questions file with easy, medium, and hard questions.

**Actions:**
1. Create GRPOConfig with `difficulty_filter=["easy"]`.
2. Load and filter questions.

**Expected:** Only easy questions are included in training set.

**Run:** `uv run pytest tests/e2e/test_training_e2e.py -v`

---

## 5. Error Handling Tests

### ModelLoadError

| Test | Description | Trigger | Expected |
|------|-------------|---------|----------|
| test_model_load_error_bad_name | Invalid HuggingFace model name | `GRPOConfig(model_name="nonexistent/model-xyz-999")` | Fails fast; error message contains "nonexistent/model-xyz-999" |

### ActionParseError (handled via fallback)

| Test | Description | Trigger | Expected |
|------|-------------|---------|----------|
| test_action_parse_fallback_logged | Unparseable action triggers warning log | Model outputs `"¯\_(ツ)_/¯"` | Warning logged; returns QUERY fallback |

### QuestionLoadError

| Test | Description | Trigger | Expected |
|------|-------------|---------|----------|
| test_question_load_missing_file | Questions path does not exist | `GRPOConfig(questions_path="/nonexistent/q.json")` | Fails fast; error message contains the path |
| test_question_load_empty_file | Questions file is empty JSON array | `questions.json` containing `[]` | Fails fast; clear error about empty questions |
| test_question_load_invalid_json | Questions file has invalid JSON | `questions.json` containing `{broken` | Fails fast; JSON parse error |

### OOMError

| Test | Description | Trigger | Expected |
|------|-------------|---------|----------|
| test_oom_guidance | OOM during training prints guidance | (Cannot reliably trigger in test; verify message formatting only) | Error handler message mentions reducing batch_size or num_generations |

**Run:** `uv run pytest tests/unit/test_error_handling.py -v`

---

## 6. Edge Cases Checklist

- [ ] Null/None inputs to parse_model_output
- [ ] Empty string inputs to parse_model_output
- [ ] Empty completions list to all reward functions
- [ ] Single-element completions list to all reward functions
- [ ] Very large batch (100+ prompts) to rollout_func
- [ ] Questions file with only hard questions and difficulty_filter=["easy"] (zero matches)
- [ ] step_budget=1 (immediate budget exhaustion after one action)
- [ ] step_budget=0 (zero budget)
- [ ] Unicode characters in model output (e.g., CJK, emoji)
- [ ] Model output exceeding max_new_tokens
- [ ] learning_rate=0.0 (no weight updates)
- [ ] num_generations=1 (minimum GRPO completions)
- [ ] Concurrent calls to reward functions (thread safety)
- [ ] Database with no tables (empty schema)
- [ ] Database with very large tables (performance)

---

## 7. Evidence Requirements

| Category | Evidence Type | Example |
|----------|---------------|---------|
| Unit tests | pytest output | `X passed` |
| Integration | pytest output | `X passed` |
| Error handling | pytest output | `X passed` |
| E2E | pytest output + training metrics | `1 passed, loss=X.XX` |
| Reward functions | pytest output showing correct values | `reward_correctness: [1.0, 0.0]` |
| Parse fallback | pytest output + log capture | `WARNING: unparseable action, falling back to QUERY` |
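
To illustrate what the parse-fallback evidence (pytest output plus log capture) might look like in practice, here is a minimal sketch of a parser that satisfies the `parse_model_output` cases listed in this specification, followed by a test that captures the warning. It is not the project's implementation: the `SQLAction` dataclass is a hypothetical stand-in for the real model, and the logger name `training.rollout` and the regular expression are assumptions chosen for the example. For multi-line output the sketch takes the permitted "fall back to QUERY with raw text" branch rather than extracting the embedded action.

```python
import logging
import re
from dataclasses import dataclass

# Assumption: the real module may use a different logger name.
logger = logging.getLogger("training.rollout")


# Hypothetical stand-in for the project's SQLAction model.
@dataclass(frozen=True)
class SQLAction:
    action_type: str
    argument: str


# Action keyword at the start, an optional colon, then the argument.
# IGNORECASE accepts "describe"/"Describe"; DOTALL lets the argument span lines.
_ACTION_RE = re.compile(
    r"^\s*(DESCRIBE|SAMPLE|QUERY|ANSWER)\b\s*:?\s*(.*)$",
    re.IGNORECASE | re.DOTALL,
)


def parse_model_output(text: str) -> SQLAction:
    """Parse one action; fall back to QUERY with the raw text if parsing fails."""
    match = _ACTION_RE.match(text)
    if match is None:
        logger.warning("unparseable action, falling back to QUERY")
        return SQLAction(action_type="QUERY", argument=text.strip())
    return SQLAction(
        action_type=match.group(1).upper(),
        argument=match.group(2).strip(),
    )


def test_action_parse_fallback_logged(caplog):
    """Sketch of the log-capture test, using pytest's built-in caplog fixture."""
    with caplog.at_level(logging.WARNING, logger="training.rollout"):
        action = parse_model_output("hello world random text")
    assert action == SQLAction("QUERY", "hello world random text")
    assert "falling back to QUERY" in caplog.text
```

Running pytest with `-v` on a file containing this test would produce the `X passed` line, and `caplog.text` carries the warning message that the Parse fallback row asks for as evidence.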