# Claude Context: Email Triage OpenEnv

Technical reference for continuing work on this project in future conversations.

---

## What This Project Is

An **OpenEnv competition submission**: a real-world RL environment where agents triage email inboxes.
Built to the OpenEnv spec with FastAPI server + Python client library + 3 graded tasks + Docker + HF Spaces deployment.

Competition requirements satisfied:
- Real-world task (not a game/toy)
- Full OpenEnv spec: typed Pydantic models, step()/reset()/state(), openenv.yaml
- 3 tasks: easy → medium → hard with deterministic graders scoring 0.0–1.0
- Dense reward function (per-step improvement signal, non-negative)
- inference.py using OpenAI client with exact [START]/[STEP]/[END] log format
- Dockerfile exposing port 7860 (HF Spaces standard)
- README.md with HF Spaces YAML frontmatter

---

## Architecture

```
server.py                  FastAPI app - owns all episode state
  └── _sessions            Dict[session_id, _Session] - in-memory, single process
  └── _Session             Holds task config, emails, processed actions, scores

email_triage_env.py        Shared Pydantic models + async HTTP client
  └── EmailTriageEnv       .reset() .step() .state() .close()
  └── from_docker_image()  classmethod - starts container, polls /health

tasks/data.py              Static email datasets + ground truth labels
  └── TASK1_EMAILS / TASK1_GROUND_TRUTH   (5 emails, easy)
  └── TASK2_EMAILS / TASK2_GROUND_TRUTH   (8 emails, medium)
  └── TASK3_EMAILS / TASK3_GROUND_TRUTH   (12 emails, hard)
  └── TASKS dict - registry used by server and inference

tasks/graders.py           Deterministic scoring functions
  └── score_classification()    - accuracy of category labels
  └── score_priority_ranking()  - Spearman correlation (rescaled to [0,1])
  └── score_actions()           - action correctness (partial credit: read≈archive)
  └── score_responses()         - length + keyword + format (greeting/signoff)
  └── grade_task()              - weighted composite of all components

inference.py               Baseline script
  └── Runs all 3 tasks sequentially
  └── _fallback_action()   - deterministic keyword-based fallback (no LLM needed)
  └── _parse_action()      - JSON extraction with markdown fence stripping
```

---

## Key Design Choices

**Reward function:**
```python
reward_t = max(0, current_score - prev_score)
```
Non-negative, dense. Agent can resubmit actions for already-processed emails (refine).
`current_score` is computed over ALL processed emails at each step.

**Episode done condition:**
```python
done = (len(processed) >= len(emails)) or (step >= max_steps)
```

**Score = composite quality at episode end** (not sum of rewards).
Logged as `score` in `[END]` line of inference.py.

**Task weights:**

| Component | Task 1 | Task 2 | Task 3 |
|-----------|--------|--------|--------|
| classification | 0.60 | 0.25 | 0.30 |
| priority | 0.20 | 0.60 | 0.30 |
| action | 0.20 | 0.15 | 0.20 |
| response | 0.00 | 0.00 | 0.20 |

**Response scorer** (Task 3 only, per urgent/work email needing a reply):
- length ≥ 20 words → up to 0.30
- keyword coverage (task-specific keywords in ground_truth) → up to 0.40
- greeting present → 0.15; sign-off present → 0.15

**Task 3 perfect score without responses = 0.80** (response weight 0.20 × 0.0 = 0).
Full 1.0 requires providing response_draft for all `needs_response=True` emails.

---

## API Contract

### POST /reset
```json
// request (all fields optional; empty {} works)
{"task_id": "email-classify", "session_id": "abc"}

// response
{"observation": {...TriageObservation...}, "done": false}
```

### POST /step
```json
// request
{
  "session_id": "abc",
  "email_actions": [
    {"email_id": "e1", "category": "spam", "priority": 10,
     "action": "delete", "response_draft": null}
  ]
}

// response
{"observation": {...}, "reward": 0.46, "done": false,
 "info": {"component_scores": {...}, "current_score": 0.46,
          "step": 1, "processed_count": 1}}
```

### GET /state?session_id=abc
Returns raw session internals (debug use).

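To make the `reward` and `current_score` fields returned by `/step` concrete, here is a minimal sketch of how the weighted composite and the per-step reward relate, using the task weights from Key Design Choices. The names `TASK_WEIGHTS`, `composite_score`, `step_reward` and the keys `task1`..`task3` are hypothetical and are not taken from `tasks/graders.py`; the sketch also assumes `prev_score` is the composite from the immediately preceding step.

```python
from typing import Dict

# Weights copied from the task-weights table (component -> weight).
# The dict keys "task1".."task3" are illustrative labels only.
TASK_WEIGHTS: Dict[str, Dict[str, float]] = {
    "task1": {"classification": 0.60, "priority": 0.20, "action": 0.20, "response": 0.00},
    "task2": {"classification": 0.25, "priority": 0.60, "action": 0.15, "response": 0.00},
    "task3": {"classification": 0.30, "priority": 0.30, "action": 0.20, "response": 0.20},
}


def composite_score(task: str, components: Dict[str, float]) -> float:
    """Weighted sum of component scores, each expected to lie in [0, 1]."""
    weights = TASK_WEIGHTS[task]
    return sum(w * components.get(name, 0.0) for name, w in weights.items())


def step_reward(current_score: float, prev_score: float) -> float:
    """Non-negative improvement signal: max(0, current - prev)."""
    return max(0.0, current_score - prev_score)


# Task 3 with perfect classification/priority/action but no response drafts.
no_drafts = composite_score(
    "task3",
    {"classification": 1.0, "priority": 1.0, "action": 1.0, "response": 0.0},
)
print(round(no_drafts, 2))  # 0.8

# Rewards when the composite moves 0.0 -> 0.46 -> 0.40 -> 0.80.
prev = 0.0
for score in [0.46, 0.40, 0.80]:
    print(round(step_reward(score, prev), 2))  # 0.46, then 0.0, then 0.4
    prev = score
```

Under these assumptions a step that lowers the composite earns zero reward rather than a penalty, which is why the final `score` is reported separately from the sum of rewards.
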
### GET /health
Returns `{"status": "ok", "version": "1.0.0"}`.

### GET /tasks
Returns a list of task metadata dicts.

---

## Pydantic Models (email_triage_env.py)

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel

# `Email` is also defined in email_triage_env.py; its fields are not listed here.

class SingleEmailAction(BaseModel):
    email_id: str
    category: str          # spam|work|personal|newsletter|urgent
    priority: int          # 1–20, must be unique per episode
    action: str            # read|archive|delete|respond|flag
    response_draft: Optional[str]

class TriageAction(BaseModel):
    email_actions: List[SingleEmailAction]
    session_id: Optional[str]

class TriageObservation(BaseModel):
    task_id: str
    step: int
    max_steps: int
    inbox: List[Email]
    remaining_email_ids: List[str]
    processed_count: int
    total_emails: int
    current_score: float
    last_feedback: Optional[str]
    instructions: str

class StepResult(BaseModel):
    observation: TriageObservation
    reward: float
    done: bool
    info: Dict[str, Any]

class ResetResult(BaseModel):
    observation: TriageObservation
    done: bool
```

---

## Ground Truth Structure

Task 1/2 ground truth per email:
```python
{"category": "spam", "priority": 10, "action": "delete"}
```

Task 3 ground truth per email (extra fields for response grading):
```python
{
    "category": "urgent",
    "priority": 1,
    "action": "respond",
    "needs_response": True,
    "response_keywords": ["payment", "authorize", "patch", "approve"]
}
```

---

## Inference Script Key Variables

```python
import os

IMAGE_NAME   = os.getenv("IMAGE_NAME")       # Docker image (triggers from_docker_image)
ENV_URL      = os.getenv("ENV_URL", "http://localhost:7860")  # Direct server URL
API_KEY      = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME   = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
MAX_STEPS    = 15
SUCCESS_THRESHOLD = 0.5   # score >= 0.5 → success=true in [END] line
```

---

## Validated Test Results

Integration tests pass (confirmed locally):

| Task | Perfect score | Empty score | Fallback (heuristic) |
|------|--------------|------------|----------------------|
| email-classify | 1.000 | 0.000 | ~0.64 |
| email-prioritize | 1.000 | 0.000 | ~0.85 |
| email-triage-full | 0.800\* | 0.000 | ~0.56 |

\*0.80 = perfect without response drafts; add good drafts to reach ~0.94+

Multi-step reward shaping confirmed: processing emails in 2 batches gives positive reward at each step (not just at the end).

---

## Deployment Notes

**HF Spaces:**
- README.md has YAML frontmatter (`sdk: docker`, `tags: [openenv, ...]`)
- Dockerfile exposes port 7860 (HF Spaces standard)
- No ENV vars needed in the container itself (server has no external dependencies)

**Docker run:**
```bash
docker build -t email-triage-env .
docker run -p 7860:7860 email-triage-env

# PORT env var overrides the default 7860
docker run -p 8000:8000 -e PORT=8000 email-triage-env
```

**from_docker_image() behavior:**
- Runs `docker run -d -p PORT:PORT -e PORT=PORT IMAGE_NAME`
- Polls `/health` until 200 (up to 90s)
- `close()` runs `docker stop` + `docker rm`

---

## Potential Improvements

- Add session TTL / cleanup (current: sessions live forever in memory)
- Add `POST /tasks/{task_id}/reset` convenience endpoint
- Response grader is keyword-based; it could be upgraded to semantic similarity
- Add a 4th task: email thread summarization
- Add adversarial emails (phishing that looks legitimate) for robustness testing
- Persist state to Redis for multi-process / HF Spaces restart resilience
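
As an illustration of the first improvement in the list above (session TTL / cleanup), the following is one possible sketch of idle-session eviction for an in-memory store. `SessionStore`, its methods, and the TTL value are hypothetical; nothing here reflects how `server.py` currently manages `_sessions`.

```python
import time
from typing import Any, Dict, List, Optional

SESSION_TTL_SECONDS = 3600.0  # assumed TTL; not a value from the project


class SessionStore:
    """Hypothetical in-memory session dict with idle-time eviction."""

    def __init__(self, ttl_seconds: float = SESSION_TTL_SECONDS) -> None:
        self.ttl_seconds = ttl_seconds
        self._sessions: Dict[str, Any] = {}
        self._last_seen: Dict[str, float] = {}

    def put(self, session_id: str, session: Any, now: Optional[float] = None) -> None:
        """Store a session and record when it was last touched."""
        self._sessions[session_id] = session
        self._last_seen[session_id] = time.monotonic() if now is None else now

    def get(self, session_id: str, now: Optional[float] = None) -> Optional[Any]:
        """Return the session and refresh its timestamp, or None if unknown."""
        if session_id not in self._sessions:
            return None
        self._last_seen[session_id] = time.monotonic() if now is None else now
        return self._sessions[session_id]

    def purge_expired(self, now: Optional[float] = None) -> List[str]:
        """Drop sessions idle for longer than the TTL; return the removed ids."""
        current = time.monotonic() if now is None else now
        expired = [
            sid for sid, seen in self._last_seen.items()
            if current - seen > self.ttl_seconds
        ]
        for sid in expired:
            del self._sessions[sid]
            del self._last_seen[sid]
        return expired


# Explicit `now` values keep the example deterministic.
store = SessionStore(ttl_seconds=60.0)
store.put("abc", {"task_id": "email-classify"}, now=0.0)
store.put("def", {"task_id": "email-prioritize"}, now=50.0)
print(store.purge_expired(now=100.0))  # ['abc']
```

In a server, `purge_expired()` could be called at the start of each request or from a periodic background task; which is preferable depends on traffic and is left open here.
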