# Verification Specification

**Feature:** F002
**Generated from:** specs/F002-VERIFICATION_INPUT.json
**Generated:** 2026-03-27

---

## 1. Unit Tests

### verify_answer (dispatcher)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_verify_integer_exact_match | Dispatches to integer comparer for exact match | `predicted="42", gold="42", answer_type="integer"` | `True` | happy |
| test_verify_float_within_tolerance | Dispatches to float comparer within 1% | `predicted="3.14", gold="3.15", answer_type="float"` | `True` | happy |
| test_verify_string_case_insensitive | Dispatches to string comparer ignoring case | `predicted="Alice", gold="alice", answer_type="string"` | `True` | happy |
| test_verify_list_order_insensitive | Dispatches to list comparer ignoring order | `predicted="a, b", gold="b, a", answer_type="list"` | `True` | happy |
| test_verify_none_type_falls_back_to_string | Falls back to string comparison when answer_type is None | `predicted="hello", gold="hello", answer_type=None` | `True` | fallback |
| test_verify_unknown_type_falls_back_to_string | Falls back to string comparison for unrecognized type | `predicted="foo", gold="foo", answer_type="table"` | `True` | fallback |
| test_verify_empty_predicted_returns_false | Empty string after strip returns False immediately | `predicted=" ", gold="42", answer_type="integer"` | `False` | edge |
| test_verify_none_predicted_returns_false | Handles None-like empty input | `predicted="", gold="42", answer_type=None` | `False` | edge |

**Run:** `uv run pytest tests/test_verifier.py -v -k "test_verify"`

---

### _compare_integer

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_int_exact_match | Both sides are integers | `predicted="25", gold="25"` | `True` | happy |
| test_int_from_float_string | Coerces "25.0" via int(float(x)) | `predicted="25.0", gold="25"` | `True` | happy |
| test_int_mismatch | Different integers | `predicted="24", gold="25"` | `False` | happy |
| test_int_negative_values | Negative integers match | `predicted="-3", gold="-3"` | `True` | happy |
| test_int_negative_mismatch | Negative vs positive | `predicted="-3", gold="3"` | `False` | happy |
| test_int_zero | Zero matches zero | `predicted="0", gold="0"` | `True` | edge |
| test_int_large_value | Large integers | `predicted="999999999", gold="999999999"` | `True` | edge |
| test_int_non_numeric_returns_false | Non-numeric predicted returns False | `predicted="abc", gold="25"` | `False` | error |
| test_int_non_numeric_gold_returns_false | Non-numeric gold returns False | `predicted="25", gold="abc"` | `False` | error |
| test_int_empty_string_returns_false | Empty string returns False | `predicted="", gold="25"` | `False` | edge |
| test_int_whitespace_only_returns_false | Whitespace-only returns False | `predicted=" ", gold="25"` | `False` | edge |
| test_int_float_truncation | "25.9" coerced to 25 matches gold "25" | `predicted="25.9", gold="25"` | `True` | edge |

**Run:** `uv run pytest tests/test_verifier.py -v -k "_compare_integer"`

---

### _compare_float

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_float_exact_match | Identical float strings | `predicted="3.14", gold="3.14"` | `True` | happy |
| test_float_within_1pct_tolerance | Difference within 1% | `predicted="100.5", gold="100.0"` | `True` | happy |
| test_float_outside_1pct_tolerance | Difference exceeds 1% | `predicted="102.0", gold="100.0"` | `False` | happy |
| test_float_boundary_exactly_1pct | Exactly at 1% boundary | `predicted="101.0", gold="100.0"` | `True` | edge |
| test_float_just_over_1pct | Just past 1% boundary | `predicted="101.01", gold="100.0"` | `False` | edge |
| test_float_gold_zero_uses_absolute_tolerance | Gold is 0, uses 1e-9 absolute | `predicted="0.0000000001", gold="0"` | `True` | edge |
| test_float_gold_zero_fails_large_diff | Gold is 0, predicted too far | `predicted="0.001", gold="0"` | `False` | edge |
| test_float_negative_values | Negative floats within tolerance | `predicted="-99.5", gold="-100.0"` | `True` | happy |
| test_float_non_numeric_returns_false | Non-numeric predicted | `predicted="abc", gold="3.14"` | `False` | error |
| test_float_non_numeric_gold_returns_false | Non-numeric gold | `predicted="3.14", gold="abc"` | `False` | error |
| test_float_integer_strings | Integer strings as floats | `predicted="42", gold="42"` | `True` | edge |
| test_float_very_small_values | Very small but non-zero | `predicted="0.0001", gold="0.0001"` | `True` | edge |

**Run:** `uv run pytest tests/test_verifier.py -v -k "_compare_float"`

---

### _compare_string

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_string_exact_match | Identical strings | `predicted="Alice", gold="Alice"` | `True` | happy |
| test_string_case_insensitive | Different casing | `predicted="ALICE", gold="alice"` | `True` | happy |
| test_string_whitespace_normalized | Leading/trailing/extra whitespace | `predicted=" Alice Bob ", gold="Alice Bob"` | `True` | happy |
| test_string_mismatch | Different strings | `predicted="Alice", gold="Bob"` | `False` | happy |
| test_string_empty_both | Both empty | `predicted="", gold=""` | `True` | edge |
| test_string_unicode | Unicode characters | `predicted="cafe\u0301", gold="cafe\u0301"` | `True` | edge |
| test_string_special_characters | Special characters match | `predicted="O'Brien", gold="O'Brien"` | `True` | edge |
| test_string_numeric_as_string | Numbers compared as strings | `predicted="42", gold="42"` | `True` | edge |

**Run:** `uv run pytest tests/test_verifier.py -v -k "_compare_string"`

---

### _compare_list

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_list_same_order | Identical lists | `predicted="a, b, c", gold="a, b, c"` | `True` | happy |
| test_list_different_order | Reordered elements | `predicted="c, a, b", gold="a, b, c"` | `True` | happy |
| test_list_mismatch | Different elements | `predicted="a, b, d", gold="a, b, c"` | `False` | happy |
| test_list_extra_element | Predicted has extra | `predicted="a, b, c, d", gold="a, b, c"` | `False` | happy |
| test_list_missing_element | Predicted is missing one | `predicted="a, b", gold="a, b, c"` | `False` | happy |
| test_list_duplicates_matter | Duplicates in one side | `predicted="a, a, b", gold="a, b"` | Defined by impl | edge |
| test_list_with_gold_rows | Uses gold_rows when provided | `predicted="a, b", gold="...", gold_rows=[("a",), ("b",)]` | `True` | happy |
| test_list_gold_rows_none_fallback | Falls back to string parsing when gold_rows is None | `predicted="a, b", gold="a, b", gold_rows=None` | `True` | fallback |
| test_list_empty | Both sides empty | `predicted="", gold=""` | Defined by impl | edge |
| test_list_single_element | Single element lists | `predicted="only", gold="only"` | `True` | edge |
| test_list_whitespace_in_elements | Elements with whitespace | `predicted=" a , b ", gold="a, b"` | `True` | edge |
| test_list_case_sensitivity | Case handling in list elements | `predicted="Alice, Bob", gold="alice, bob"` | Defined by impl | edge |

**Run:** `uv run pytest tests/test_verifier.py -v -k "_compare_list"`

---

### EpisodeContext.gold_rows field

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_episode_context_gold_rows_default | gold_rows defaults to None | `EpisodeContext(...)` | `gold_rows is None` | happy |
| test_episode_context_gold_rows_set | gold_rows can be set to list of tuples | `EpisodeContext(..., gold_rows=[(1,), (2,)])` | `gold_rows == [(1,), (2,)]` | happy |
| test_episode_context_gold_rows_empty_list | gold_rows can be empty list | `EpisodeContext(..., gold_rows=[])` | `gold_rows == []` | edge |

**Run:** `uv run pytest tests/test_verifier.py -v -k "episode_context"`

---

## 2. Integration Tests

### Flow: Primary answer verification through step()

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Agent sends ANSWER action with value string | step() dispatches to _handle_answer | `env.step(SQLAction(action_type="ANSWER", argument=value))` |
| 2 | _handle_answer calls verify_answer with predicted, gold, answer_type, gold_rows | verify_answer receives all four arguments | Correct reward returned in observation |
| 3 | verify_answer dispatches to type-specific comparer | Correct comparer chosen based on answer_type | `observation.reward == 1.0` for correct answers |
| 4 | Boolean result maps to reward | True -> 1.0, False -> 0.0 | `observation.done is True` |

### Flow: Integer answer through full environment

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Reset environment with question that has answer_type="integer" | Episode created with integer question | `observation.done is False` |
| 2 | Submit ANSWER with correct integer (possibly as float string) | verify_answer coerces and matches | `observation.reward == 1.0` |

### Flow: Float answer through full environment

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Reset with question that has answer_type="float" | Episode created with float question | `observation.done is False` |
| 2 | Submit ANSWER within 1% tolerance | verify_answer accepts within tolerance | `observation.reward == 1.0` |

### Flow: String answer through full environment

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Reset with question that has answer_type="string" | Episode created with string question | `observation.done is False` |
| 2 | Submit ANSWER with different casing/whitespace | verify_answer normalizes and matches | `observation.reward == 1.0` |

### Flow: List answer through full environment

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Reset with question that has answer_type="list" | Episode created with list question, gold_rows populated | `observation.done is False` |
| 2 | Submit ANSWER with reordered list | verify_answer compares as sets | `observation.reward == 1.0` |

### Flow: Fallback for missing answer_type

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Reset with question that has answer_type=None or missing | Episode created without explicit type | `observation.done is False` |
| 2 | Submit ANSWER matching gold exactly (modulo case/whitespace) | Falls back to string comparison | `observation.reward == 1.0` |

### Flow: Type coercion failure

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Reset with question that has answer_type="integer" | Episode created with integer question | `observation.done is False` |
| 2 | Submit ANSWER with non-numeric string | _compare_integer catches ValueError, returns False | `observation.reward == 0.0` |

**Run:** `uv run pytest tests/test_verifier_integration.py -v`

---

## 3. API Tests

No API endpoints are defined for F002. Answer verification is an internal server-side function called within the step() handler. API-level testing is covered by the integration tests above (testing through the step() interface).

---

## 4. E2E Tests

### Scenario: Correct integer answer accepted

**Setup:** Environment initialized with a question whose gold answer is "25" and answer_type is "integer".

**Actions:** Agent submits ANSWER "25".

**Expected:** observation.done is True, observation.reward is 1.0.

### Scenario: Correct float answer accepted within tolerance

**Setup:** Environment initialized with a question whose gold answer is "3.14159" and answer_type is "float".

**Actions:** Agent submits ANSWER "3.14".

**Expected:** observation.done is True, observation.reward is 1.0 (within 1% tolerance).

### Scenario: Correct string answer accepted case-insensitively

**Setup:** Environment initialized with a question whose gold answer is "Engineering" and answer_type is "string".

**Actions:** Agent submits ANSWER "engineering".

**Expected:** observation.done is True, observation.reward is 1.0.

### Scenario: Correct list answer accepted order-insensitively

**Setup:** Environment initialized with a question whose gold answer is "alice, bob, charlie" and answer_type is "list".

**Actions:** Agent submits ANSWER "charlie, alice, bob".

**Expected:** observation.done is True, observation.reward is 1.0.

### Scenario: Wrong answer rejected

**Setup:** Environment initialized with any question.

**Actions:** Agent submits ANSWER with clearly wrong value.

**Expected:** observation.done is True, observation.reward is 0.0.

### Scenario: Backward compatibility -- no answer_type field

**Setup:** Environment initialized with a legacy question record that has no answer_type (or answer_type is None).

**Actions:** Agent submits ANSWER matching gold answer exactly.

**Expected:** observation.done is True, observation.reward is 1.0 (string fallback used).

**Run:** `uv run pytest tests/test_smoke.py tests/test_verifier_integration.py -v`

---

## 5. Edge Cases Checklist

- [ ] Empty string predicted (after strip) returns False immediately
- [ ] Whitespace-only predicted returns False
- [ ] Non-numeric string for integer comparison returns False (ValueError caught)
- [ ] Non-numeric string for float comparison returns False (ValueError caught)
- [ ] Gold value of "0" for float comparison uses absolute tolerance 1e-9
- [ ] Float boundary at exactly 1% tolerance (should pass)
- [ ] Float just over 1% tolerance (should fail)
- [ ] Integer coercion via int(float(x)) handles "25.0" -> 25
- [ ] Integer coercion truncates "25.9" -> 25
- [ ] List with gold_rows=None falls back to string parsing
- [ ] List with gold_rows provided uses structured comparison
- [ ] answer_type=None dispatches to string comparison
- [ ] Unknown answer_type (e.g., "table", "unknown") dispatches to string comparison
- [ ] Very large integer values (MAX_INT range)
- [ ] Unicode characters in string comparison
- [ ] Special characters in string comparison (quotes, apostrophes)
- [ ] Negative numbers for integer and float comparisons
- [ ] List with duplicate elements
- [ ] Single-element list
- [ ] Mixed whitespace in list elements

---

## 6. Evidence Requirements

| Category | Evidence Type | Example |
|----------|---------------|---------|
| Unit tests | pytest output | `uv run pytest tests/test_verifier.py -v` -- `X passed` |
| Integration | pytest output | `uv run pytest tests/test_verifier_integration.py -v` -- `X passed` |
| E2E | pytest output via smoke tests | `uv run pytest tests/test_smoke.py -v` -- answer tests pass |
| Backward compat | pytest output | Existing test_answer_ends_episode_without_budget_decrement still passes |
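
For reference, the comparer behavior that the unit-test tables in Section 1 describe can be sketched as below. This is a minimal illustration and not the F002 implementation. The function names and the `verify_answer` argument order come from this specification; the `_normalize` helper, the constant names, and the handling of the "Defined by impl" cases (set semantics for lists, so duplicates collapse, and case-insensitive list elements) are assumptions made only for the sketch.

```python
from typing import Optional, Sequence

REL_TOL = 0.01  # 1% relative tolerance, measured against the gold value
ABS_TOL = 1e-9  # absolute tolerance used when gold is 0


def _normalize(text: str) -> str:
    """Hypothetical helper: collapse whitespace runs and lowercase."""
    return " ".join(text.split()).lower()


def _compare_integer(predicted: str, gold: str) -> bool:
    # int(float(x)) accepts "25.0" and truncates "25.9" to 25.
    try:
        return int(float(predicted)) == int(float(gold))
    except (ValueError, OverflowError):
        return False


def _compare_float(predicted: str, gold: str) -> bool:
    try:
        p, g = float(predicted), float(gold)
    except ValueError:
        return False
    if g == 0.0:
        return abs(p) <= ABS_TOL
    # Tolerance scales with gold only, so 101.0 vs 100.0 passes
    # and 101.01 vs 100.0 fails.
    return abs(p - g) <= REL_TOL * abs(g)


def _compare_string(predicted: str, gold: str) -> bool:
    return _normalize(predicted) == _normalize(gold)


def _compare_list(
    predicted: str,
    gold: str,
    gold_rows: Optional[Sequence[tuple]] = None,
) -> bool:
    # Assumption: comma-separated elements compared as sets.
    pred_items = {_normalize(item) for item in predicted.split(",")}
    if gold_rows is not None:
        # Assumption: each row is flattened to one space-joined string.
        gold_items = {
            _normalize(" ".join(str(value) for value in row))
            for row in gold_rows
        }
    else:
        gold_items = {_normalize(item) for item in gold.split(",")}
    return pred_items == gold_items


def verify_answer(
    predicted: str,
    gold: str,
    answer_type: Optional[str] = None,
    gold_rows: Optional[Sequence[tuple]] = None,
) -> bool:
    predicted = (predicted or "").strip()
    if not predicted:
        return False  # empty after strip: reject before dispatch
    if answer_type == "integer":
        return _compare_integer(predicted, gold)
    if answer_type == "float":
        return _compare_float(predicted, gold)
    if answer_type == "list":
        return _compare_list(predicted, gold, gold_rows)
    # None or an unrecognized type falls back to string comparison.
    return _compare_string(predicted, gold)
```

The float check is written out by hand instead of using `math.isclose`, because `math.isclose` scales its relative tolerance by the larger of the two magnitudes and would therefore accept `101.01` against a gold of `100.0`, which the boundary test requires to fail.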