Claude Context: Email Triage OpenEnv
Technical reference for continuing work on this project in future conversations.
What This Project Is
An OpenEnv competition submission: a real-world RL environment in which agents triage email inboxes. Built to the OpenEnv spec, with a FastAPI server, a Python client library, 3 graded tasks, a Dockerfile, and an HF Spaces deployment.
Competition requirements satisfied:
- Real-world task (not a game/toy)
- Full OpenEnv spec: typed Pydantic models, step()/reset()/state(), openenv.yaml
- 3 tasks: easy → medium → hard, with deterministic graders scoring 0.0 to 1.0
- Dense reward function (per-step improvement signal, non-negative)
- inference.py using OpenAI client with exact [START]/[STEP]/[END] log format
- Dockerfile exposing port 7860 (HF Spaces standard)
- README.md with HF Spaces YAML frontmatter
Architecture
server.py                     FastAPI app; owns all episode state
├── _sessions                 Dict[session_id, _Session]; in-memory, single process
└── _Session                  Holds task config, emails, processed actions, scores
email_triage_env.py           Shared Pydantic models + async HTTP client
├── EmailTriageEnv            .reset() .step() .state() .close()
└── from_docker_image()       classmethod; starts container, polls /health
tasks/data.py                 Static email datasets + ground truth labels
├── TASK1_EMAILS / TASK1_GROUND_TRUTH   (5 emails, easy)
├── TASK2_EMAILS / TASK2_GROUND_TRUTH   (8 emails, medium)
├── TASK3_EMAILS / TASK3_GROUND_TRUTH   (12 emails, hard)
└── TASKS                     dict; registry used by server and inference
tasks/graders.py              Deterministic scoring functions
├── score_classification()    accuracy of category labels
├── score_priority_ranking()  Spearman correlation (rescaled to [0,1])
├── score_actions()           action correctness (partial credit: read↔archive)
├── score_responses()         length + keyword + format (greeting/signoff)
└── grade_task()              weighted composite of all components
inference.py                  Baseline script
├── Runs all 3 tasks sequentially
├── _fallback_action()        deterministic keyword-based fallback (no LLM needed)
└── _parse_action()           JSON extraction with markdown fence stripping
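The priority grader is described above as a Spearman correlation rescaled to [0,1]. A minimal sketch of that idea follows. The function names, the handling of fewer than two emails, and the restriction to emails present in both maps are assumptions, not the project's actual score_priority_ranking().

```python
from typing import Dict


def _ranks(values: Dict[str, int]) -> Dict[str, int]:
    """Map each email id to its 1-based rank (smallest priority value first)."""
    ordered = sorted(values, key=lambda email_id: values[email_id])
    return {email_id: rank for rank, email_id in enumerate(ordered, start=1)}


def spearman_score(predicted: Dict[str, int], truth: Dict[str, int]) -> float:
    """Spearman rank correlation between two priority maps, rescaled to [0, 1]."""
    shared = sorted(set(predicted) & set(truth))
    n = len(shared)
    if n < 2:
        return 0.0  # assumption: treat an undefined correlation as zero credit
    pred_rank = _ranks({e: predicted[e] for e in shared})
    true_rank = _ranks({e: truth[e] for e in shared})
    d_squared = sum((pred_rank[e] - true_rank[e]) ** 2 for e in shared)
    # Closed form for Spearman's rho; valid only when there are no ties,
    # which holds if priorities are unique within an episode.
    rho = 1.0 - (6.0 * d_squared) / (n * (n * n - 1))
    return (rho + 1.0) / 2.0  # map [-1, 1] onto [0, 1]
```

Ranking the raw priority values first means the agent's numbers do not have to be a permutation of 1..n; only their order matters.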
Key Design Choices
Reward function:
reward_t = max(0, current_score - prev_score)
Non-negative and dense. The agent can resubmit actions for already-processed emails to refine them; a resubmission that lowers the score earns 0 rather than a penalty.
current_score is recomputed over ALL processed emails at each step.
Episode done condition:
done = (len(processed) >= len(emails)) or (step >= max_steps)
Score = composite quality at episode end (not sum of rewards).
Logged as score in [END] line of inference.py.
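To make the relationship between per-step reward, the done condition, and the final score concrete, here is a small sketch. The helper names are hypothetical, and it assumes prev_score is the composite score from the immediately preceding step (starting at 0.0).

```python
from typing import List, Tuple


def step_reward(current_score: float, prev_score: float) -> float:
    """Dense, non-negative reward: only improvements are paid out."""
    return max(0.0, current_score - prev_score)


def is_done(processed: int, total_emails: int, step: int, max_steps: int) -> bool:
    """Episode ends when every email is processed or the step budget is spent."""
    return processed >= total_emails or step >= max_steps


def replay(scores: List[float]) -> Tuple[List[float], float]:
    """Return per-step rewards and the final score for a sequence of composite scores."""
    rewards: List[float] = []
    prev = 0.0
    for score in scores:
        rewards.append(step_reward(score, prev))
        prev = score  # assumption: prev_score is the previous step's score
    final_score = scores[-1] if scores else 0.0
    return rewards, final_score
```

For a score trajectory such as 0.4, 0.3, 0.6, the middle step earns no reward and the rewards sum to more than the final score of 0.6, which is why the episode score is reported separately from the reward total.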
Task weights:
| Component | Task 1 | Task 2 | Task 3 |
|---|---|---|---|
| classification | 0.60 | 0.25 | 0.30 |
| priority | 0.20 | 0.60 | 0.30 |
| action | 0.20 | 0.15 | 0.20 |
| response | 0.00 | 0.00 | 0.20 |
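The weights in the table can be read as a plain weighted sum of component scores. The sketch below illustrates that; the dictionary keys ("task1" and so on) and the function name are placeholders, and the real grade_task() may differ in detail.

```python
from typing import Dict

# Weights copied from the table; each task's weights sum to 1.0.
# The keys "task1".."task3" are illustrative, not the project's task ids.
TASK_WEIGHTS: Dict[str, Dict[str, float]] = {
    "task1": {"classification": 0.60, "priority": 0.20, "action": 0.20, "response": 0.00},
    "task2": {"classification": 0.25, "priority": 0.60, "action": 0.15, "response": 0.00},
    "task3": {"classification": 0.30, "priority": 0.30, "action": 0.20, "response": 0.20},
}


def composite_score(task: str, components: Dict[str, float]) -> float:
    """Weighted sum of component scores; a missing component counts as 0.0."""
    weights = TASK_WEIGHTS[task]
    return sum(weight * components.get(name, 0.0) for name, weight in weights.items())
```

With every other component perfect and no response score, the Task 3 composite comes out at about 0.80, matching the ceiling quoted in this document.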
Response scorer (Task 3 only, per urgent/work email needing a reply):
- length ≥ 20 words → up to 0.30
- keyword coverage (task-specific keywords in ground_truth) → up to 0.40
- greeting present → 0.15; sign-off present → 0.15
Task 3 perfect score without responses = 0.80 (response weight 0.20 × 0.0 = 0).
Full 1.0 requires providing response_draft for all needs_response=True emails.
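The response rubric above could be sketched as follows. Several details are assumptions made for illustration only: the length credit ramps linearly up to 20 words, keyword matching is a case-insensitive substring test, and the greeting and sign-off word lists are invented here. The real score_responses() may use different rules.

```python
import re
from typing import List

GREETINGS = ("hi", "hello", "dear")             # assumed greeting markers
SIGN_OFFS = ("regards", "thanks", "sincerely")  # assumed sign-off markers


def score_response_sketch(draft: str, keywords: List[str]) -> float:
    """Score one reply draft in [0, 1] using length, keywords, greeting, sign-off."""
    text = draft.lower()
    words = re.findall(r"[a-z']+", text)
    if not words:
        return 0.0
    # Assumption: partial length credit grows linearly until 20 words.
    length_part = 0.30 * min(len(words) / 20, 1.0)
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    keyword_part = 0.40 * (hits / len(keywords)) if keywords else 0.0
    greeting_part = 0.15 if words[0] in GREETINGS else 0.0
    # Assumption: a sign-off must appear within the last five words.
    sign_off_part = 0.15 if any(w in SIGN_OFFS for w in words[-5:]) else 0.0
    return length_part + keyword_part + greeting_part + sign_off_part
```

A draft that opens with a greeting, covers all response_keywords, runs past 20 words, and closes with a sign-off reaches the full 1.0 under these assumptions.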
API Contract
POST /reset
// request (all fields optional; an empty {} works)
{"task_id": "email-classify", "session_id": "abc"}
// response
{"observation": {...TriageObservation...}, "done": false}
POST /step
// request
{
  "session_id": "abc",
  "email_actions": [
    {"email_id": "e1", "category": "spam", "priority": 10,
     "action": "delete", "response_draft": null}
  ]
}
// response
{"observation": {...}, "reward": 0.46, "done": false,
 "info": {"component_scores": {...}, "current_score": 0.46, "step": 1, "processed_count": 1}}
GET /state?session_id=abc
Returns raw session internals (debug use).
GET /health → {"status": "ok", "version": "1.0.0"}
GET /tasks → list of task metadata dicts
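For quick manual testing without the bundled client, a /step call can be assembled with the standard library alone. This is a sketch based on the request shape shown above; the helper names are hypothetical and the project's own EmailTriageEnv client is the supported path.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_step_request(base_url: str, session_id: str,
                       email_actions: List[Dict[str, Any]]) -> urllib.request.Request:
    """Build (but do not send) a POST /step request with a JSON body."""
    payload = {"session_id": session_id, "email_actions": email_actions}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/step",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> Dict[str, Any]:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling send(build_step_request("http://localhost:7860", "abc", actions)) against a running server should return a dict with the observation, reward, done, and info keys described above.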
Pydantic Models (email_triage_env.py)
from typing import Any, Dict, List, Optional

from pydantic import BaseModel

# Email is a separate model whose fields are not reproduced in this reference.


class SingleEmailAction(BaseModel):
    email_id: str
    category: str  # spam|work|personal|newsletter|urgent
    priority: int  # 1-20, must be unique per episode
    action: str  # read|archive|delete|respond|flag
    response_draft: Optional[str]


class TriageAction(BaseModel):
    email_actions: List[SingleEmailAction]
    session_id: Optional[str]


class TriageObservation(BaseModel):
    task_id: str
    step: int
    max_steps: int
    inbox: List[Email]
    remaining_email_ids: List[str]
    processed_count: int
    total_emails: int
    current_score: float
    last_feedback: Optional[str]
    instructions: str


class StepResult(BaseModel):
    observation: TriageObservation
    reward: float
    done: bool
    info: Dict[str, Any]


class ResetResult(BaseModel):
    observation: TriageObservation
    done: bool
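The value constraints noted in the SingleEmailAction comments (allowed categories and actions, priorities in the 1 to 20 range and unique) are not enforced by the type annotations alone. The sketch below shows one way to check them on plain dicts before submitting. It is a hypothetical helper, and it checks uniqueness within a single batch only, not across a whole episode.

```python
from typing import Iterable, List, Mapping, Set

ALLOWED_CATEGORIES = {"spam", "work", "personal", "newsletter", "urgent"}
ALLOWED_ACTIONS = {"read", "archive", "delete", "respond", "flag"}


def validate_actions(actions: Iterable[Mapping[str, object]]) -> List[str]:
    """Return human-readable problems; an empty list means the batch looks valid."""
    problems: List[str] = []
    seen_priorities: Set[int] = set()
    for item in actions:
        email_id = item.get("email_id")
        if item.get("category") not in ALLOWED_CATEGORIES:
            problems.append(f"{email_id}: unknown category {item.get('category')!r}")
        if item.get("action") not in ALLOWED_ACTIONS:
            problems.append(f"{email_id}: unknown action {item.get('action')!r}")
        priority = item.get("priority")
        if not isinstance(priority, int) or not 1 <= priority <= 20:
            problems.append(f"{email_id}: priority {priority!r} outside 1-20")
        elif priority in seen_priorities:
            problems.append(f"{email_id}: duplicate priority {priority}")
        else:
            seen_priorities.add(priority)
    return problems
```

Running this on an LLM-produced batch before calling step() gives a cheap signal for when to fall back to the heuristic action.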
Ground Truth Structure
Task 1/2 ground truth per email:
{"category": "spam", "priority": 10, "action": "delete"}
Task 3 ground truth per email (extra fields for response grading):
{
    "category": "urgent",
    "priority": 1,
    "action": "respond",
    "needs_response": True,
    "response_keywords": ["payment", "authorize", "patch", "approve"]
}
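Given ground truth in this shape, the classification and action components can be sketched as below. The function names are hypothetical. Two details are assumptions rather than documented behavior: emails missing from the prediction count as wrong, and confusing read with archive earns 0.5 (the source only says partial credit is given).

```python
from typing import Any, Dict

Labels = Dict[str, Dict[str, Any]]  # email_id -> {"category": ..., "action": ...}

PARTIAL_CREDIT = 0.5  # assumed value for a read/archive mix-up
NEAR_MISSES = {frozenset({"read", "archive"})}


def classification_accuracy(predicted: Labels, truth: Labels) -> float:
    """Fraction of ground-truth emails whose predicted category matches."""
    if not truth:
        return 0.0
    correct = sum(
        1 for email_id, expected in truth.items()
        if predicted.get(email_id, {}).get("category") == expected["category"]
    )
    return correct / len(truth)


def action_score(predicted: Labels, truth: Labels) -> float:
    """Mean action credit: 1.0 for exact match, partial for a near miss, else 0."""
    if not truth:
        return 0.0
    total = 0.0
    for email_id, expected in truth.items():
        chosen = predicted.get(email_id, {}).get("action")
        if chosen == expected["action"]:
            total += 1.0
        elif frozenset({chosen, expected["action"]}) in NEAR_MISSES:
            total += PARTIAL_CREDIT
    return total / len(truth)
```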
Inference Script Key Variables
IMAGE_NAME = os.getenv("IMAGE_NAME") # Docker image (triggers from_docker_image)
ENV_URL = os.getenv("ENV_URL", "http://localhost:7860") # Direct server URL
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
MAX_STEPS = 15
SUCCESS_THRESHOLD = 0.5 # score >= 0.5 → success=true in [END] line
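Because the model named in MODEL_NAME often wraps its JSON in a markdown fence or surrounding prose, the inference script needs tolerant parsing before it can fall back to the heuristic. The sketch below shows one way to do the fence stripping attributed to _parse_action(); the function name and exact behavior are assumptions, not the script's actual code.

```python
import json
import re
from typing import Any, Dict, Optional

FENCE = "`" * 3  # built indirectly so this example can sit inside a fenced block
FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def parse_action_sketch(raw: str) -> Optional[Dict[str, Any]]:
    """Extract the first JSON object from an LLM reply, or None if none parses."""
    match = FENCED_BLOCK.search(raw)
    candidate = match.group(1) if match else raw
    # Trim any prose around the outermost braces.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Returning None rather than raising lets the caller switch to the deterministic fallback action without a try/except at every call site.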
Validated Test Results
Integration tests pass (confirmed locally):
| Task | Perfect score | Empty score | Fallback (heuristic) |
|---|---|---|---|
| email-classify | 1.000 | 0.000 | ~0.64 |
| email-prioritize | 1.000 | 0.000 | ~0.85 |
| email-triage-full | 0.800* | 0.000 | ~0.56 |
*0.80 = perfect without response drafts; add good drafts to reach ~0.94+
Multi-step reward shaping confirmed: processing emails in 2 batches gives positive reward at each step (not just at the end).
Deployment Notes
HF Spaces:
- README.md has YAML frontmatter (sdk: docker, tags: [openenv, ...])
- Dockerfile exposes port 7860 (HF Spaces standard)
- No ENV vars needed in the container itself (server has no external dependencies)
Docker run:
docker build -t email-triage-env .
docker run -p 7860:7860 email-triage-env
# PORT env var overrides the default 7860
docker run -p 8000:8000 -e PORT=8000 email-triage-env
from_docker_image() behavior:
- Runs docker run -d -p PORT:PORT -e PORT=PORT IMAGE_NAME
- Polls /health until 200 (up to 90s)
- close() runs docker stop + docker rm
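The start-and-poll behavior listed above can be sketched with the standard library. The helper names, the 1-second poll interval, and the injectable sleep and clock parameters are assumptions added so the loop can be exercised without Docker; the real from_docker_image() is async and may be structured differently.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List


def docker_run_command(image: str, port: int) -> List[str]:
    """Argument list for starting the container detached on the given port."""
    return ["docker", "run", "-d", "-p", f"{port}:{port}", "-e", f"PORT={port}", image]


def http_health_probe(url: str) -> Callable[[], bool]:
    """Return a probe that reports True once GET url answers with HTTP 200."""
    def probe() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=2.0) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            return False  # server not accepting connections yet
    return probe


def wait_until_healthy(probe: Callable[[], bool], timeout_s: float = 90.0,
                       interval_s: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll probe until it succeeds or timeout_s elapses."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False
```

In practice the command list would be passed to subprocess.run(), followed by wait_until_healthy(http_health_probe("http://localhost:7860/health")).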
Potential Improvements
- Add session TTL / cleanup (current: sessions live forever in memory)
- Add POST /tasks/{task_id}/reset convenience endpoint
- Response grader is keyword-based; could be upgraded to semantic similarity
- Add a 4th task: email thread summarization
- Add adversarial emails (phishing that looks legitimate) for robustness testing
- Persist state to Redis for multi-process / HF Spaces restart resilience
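As a starting point for the session TTL item, an in-memory store with idle-time eviction might look like the sketch below. The class name, the one-hour default, and the evict-on-access policy are all assumptions; nothing like this exists in the current server.

```python
import time
from typing import Any, Dict, Optional


class SessionStore:
    """In-memory session map that drops sessions idle for longer than a TTL."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self._ttl = ttl_seconds
        self._sessions: Dict[str, Any] = {}
        self._last_seen: Dict[str, float] = {}

    def put(self, session_id: str, session: Any, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._sessions[session_id] = session
        self._last_seen[session_id] = now

    def get(self, session_id: str, now: Optional[float] = None) -> Optional[Any]:
        now = time.monotonic() if now is None else now
        self.evict_expired(now)
        if session_id in self._sessions:
            self._last_seen[session_id] = now  # accessing a session keeps it alive
        return self._sessions.get(session_id)

    def evict_expired(self, now: Optional[float] = None) -> int:
        """Remove idle sessions and return how many were dropped."""
        now = time.monotonic() if now is None else now
        stale = [sid for sid, seen in self._last_seen.items() if now - seen > self._ttl]
        for sid in stale:
            del self._sessions[sid]
            del self._last_seen[sid]
        return len(stale)
```

Evicting on access avoids a background task, at the cost of stale sessions lingering while the server is idle.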