| title: Email Triage OpenEnv | |
| emoji: 📧 | |
| colorFrom: blue | |
| colorTo: green | |
| sdk: docker | |
| pinned: false | |
| tags: | |
| - openenv | |
| - triage | |
| - rl | |
| - agent-evaluation | |
| # Email Triage: OpenEnv Environment | |
| A real-world email triage environment for training and evaluating AI agents. | |
| Agents must **classify**, **prioritize**, and **respond to** a realistic inbox | |
| of emails, mirroring one of the most common productivity tasks professionals | |
| perform daily. | |
| --- | |
| ## Why Email Triage? | |
| Email triage is a genuine, high-value task: the average knowledge worker | |
| spends ~28% of their workday managing email. A capable agent that can triage | |
| an inbox correctly has immediate, measurable economic value, unlike toy grid | |
| worlds or synthetic puzzles. The graded structure (easy → hard) allows | |
| systematic capability benchmarking. | |
| --- | |
| ## Environment Overview | |
| | Field | Value | | |
| |---|---| | |
| | Domain | Productivity / Email Management | | |
| | Tasks | 3 (easy, medium, hard) | | |
| | Action space | Classify + prioritize + act + draft responses | | |
| | Observation space | Full inbox + processing state + feedback | | |
| | Reward | Dense (per-step quality improvement) | | |
| | Episode termination | All emails processed **or** max steps reached | | |
| --- | |
| ## Observation Space | |
| At each step the agent receives a `TriageObservation` object: | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `task_id` | str | Active task identifier | | |
| | `step` | int | Current step (starts at 0 after reset) | | |
| | `max_steps` | int | Episode step limit | | |
| | `inbox` | List[Email] | All emails in the episode | | |
| | `remaining_email_ids` | List[str] | IDs of unprocessed emails | | |
| | `processed_count` | int | Emails handled so far | | |
| | `total_emails` | int | Total emails in the episode | | |
| | `current_score` | float ∈ [0,1] | Composite quality score so far | | |
| | `last_feedback` | str \| None | Grader feedback from the previous step | | |
| | `instructions` | str | Task-specific natural-language instructions | | |
| Each `Email` has: `id`, `sender`, `subject`, `body`, `timestamp`. | |
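
The observation fields above can be mirrored in plain Python for quick experiments. The sketch below uses standard-library dataclasses rather than the project's own Pydantic models, so the class layout, the string type for `timestamp`, and the helper `next_unprocessed` are illustrative assumptions, not the library's actual definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Email:
    id: str
    sender: str
    subject: str
    body: str
    timestamp: str  # assumed to be a string; the README does not state the type


@dataclass
class TriageObservation:
    task_id: str
    step: int
    max_steps: int
    inbox: List[Email]
    remaining_email_ids: List[str]
    processed_count: int
    total_emails: int
    current_score: float  # documented range: [0, 1]
    instructions: str
    last_feedback: Optional[str] = None  # None before the first step


def next_unprocessed(obs: TriageObservation) -> Optional[Email]:
    """Hypothetical helper: first inbox email whose ID is still unprocessed."""
    remaining = set(obs.remaining_email_ids)
    for email in obs.inbox:
        if email.id in remaining:
            return email
    return None
```

An agent loop could call `next_unprocessed(obs)` after each step and stop once it returns `None`.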
| --- | |
| ## Action Space | |
| At each step the agent submits a `TriageAction`: | |
| ```json | |
| { | |
| "email_actions": [ | |
| { | |
| "email_id": "e1", | |
| "category": "spam", | |
| "priority": 5, | |
| "action": "delete", | |
| "response_draft": null | |
| } | |
| ] | |
| } | |
| ``` | |
| | Field | Values | Notes | | |
| |---|---|---| | |
| | `email_id` | string | Must match an inbox email ID | | |
| | `category` | `spam` \| `work` \| `personal` \| `newsletter` \| `urgent` | Required | | |
| | `priority` | integer 1–N | 1=most urgent; must be unique across the episode | | |
| | `action` | `read` \| `archive` \| `delete` \| `respond` \| `flag` | Required | | |
| | `response_draft` | string \| null | Required (≥20 words) when `action=respond` | | |
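
Checking an action against these constraints before submitting it can save a wasted step. The following is a minimal sketch of such a client-side check. The function name `validate_email_action` is hypothetical, and it assumes that N equals the number of emails in the episode; it does not check that priorities are unique across the episode.

```python
from typing import Any, Dict, List, Set

CATEGORIES = {"spam", "work", "personal", "newsletter", "urgent"}
ACTIONS = {"read", "archive", "delete", "respond", "flag"}
MIN_RESPONSE_WORDS = 20


def validate_email_action(item: Dict[str, Any], inbox_ids: Set[str], n_emails: int) -> List[str]:
    """Return a list of problems; an empty list means the item looks valid."""
    problems: List[str] = []
    if item.get("email_id") not in inbox_ids:
        problems.append("email_id does not match an inbox email")
    if item.get("category") not in CATEGORIES:
        problems.append("category is not one of the allowed values")
    priority = item.get("priority")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(priority, int) or isinstance(priority, bool) or not 1 <= priority <= n_emails:
        problems.append("priority must be an integer between 1 and N")
    if item.get("action") not in ACTIONS:
        problems.append("action is not one of the allowed values")
    elif item["action"] == "respond":
        draft = item.get("response_draft") or ""
        if len(draft.split()) < MIN_RESPONSE_WORDS:
            problems.append("response_draft needs at least 20 words when action is respond")
    return problems
```

Whether the server rejects invalid items or simply scores them as wrong is not stated here, so treat this purely as a pre-flight check.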
| --- | |
| ## Tasks | |
| ### Task 1: `email-classify` (Easy) | |
| - **Emails:** 5 (1 spam, 1 work, 1 personal, 1 newsletter, 1 urgent) | |
| - **Max steps:** 5 | |
| - **Score weights:** classification 60% · action 20% · priority 20% | |
| - **Goal:** Identify the category of each email correctly. | |
| - **Expected baseline score:** ~0.55–0.75 | |
| ### Task 2: `email-prioritize` (Medium) | |
| - **Emails:** 8 spanning all urgency levels | |
| - **Max steps:** 8 | |
| - **Score weights:** priority ranking 60% · classification 25% · action 15% | |
| - **Goal:** Rank all 8 emails by urgency (1=most urgent). Graded by Spearman rank correlation. | |
| - **Expected baseline score:** ~0.45–0.65 | |
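
For intuition about the Spearman grading in Task 2, here is a small sketch of the textbook formula for tie-free rankings, ρ = 1 − 6·Σd² / (n·(n² − 1)). It is not the code from `tasks/graders.py`; how the grader maps ρ (which ranges over [-1, 1]) onto a [0, 1] score is not documented here and is left out.

```python
from typing import Dict


def spearman_rho(predicted: Dict[str, int], truth: Dict[str, int]) -> float:
    """Spearman rank correlation for two tie-free rankings of the same email IDs."""
    if predicted.keys() != truth.keys():
        raise ValueError("both rankings must cover the same email IDs")
    n = len(truth)
    if n < 2:
        raise ValueError("need at least two emails to correlate")
    # Sum of squared rank differences across all emails.
    d_squared = sum((predicted[k] - truth[k]) ** 2 for k in truth)
    return 1.0 - (6.0 * d_squared) / (n * (n * n - 1))
```

Because priorities must be unique, the tie-free formula applies directly. Swapping two adjacent ranks in an 8-email ranking only lowers ρ slightly, while a fully reversed ranking gives -1.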
| ### Task 3: `email-triage-full` (Hard) | |
| - **Emails:** 12 (mix of all categories, several requiring responses) | |
| - **Max steps:** 12 | |
| - **Score weights:** classification 30% · priority 30% · action 20% · response quality 20% | |
| - **Goal:** Full triage: classify, prioritize, choose actions, and draft professional responses. | |
| - **Expected baseline score:** ~0.30–0.55 | |
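
The score weights listed for the three tasks suggest a weighted sum of per-component scores. The sketch below assumes exactly that: each component is a value in [0, 1], missing components count as 0, and the combination is linear. The names `TASK_WEIGHTS` and `composite_score` are illustrative, and the actual grader may combine components differently.

```python
from typing import Dict

# Weights copied from the task descriptions above.
TASK_WEIGHTS: Dict[str, Dict[str, float]] = {
    "email-classify": {"classification": 0.60, "action": 0.20, "priority": 0.20},
    "email-prioritize": {"priority": 0.60, "classification": 0.25, "action": 0.15},
    "email-triage-full": {
        "classification": 0.30,
        "priority": 0.30,
        "action": 0.20,
        "response": 0.20,
    },
}


def composite_score(task_id: str, components: Dict[str, float]) -> float:
    """Weighted sum of component scores; components not supplied count as 0."""
    weights = TASK_WEIGHTS[task_id]
    return sum(w * components.get(name, 0.0) for name, w in weights.items())
```

Under these assumptions, perfect classification with half-correct actions and no priority credit on `email-classify` gives 0.6 + 0.1 = 0.7.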
| --- | |
| ## Reward Function | |
| ``` | |
| reward_t = max(0, composite_score_t - composite_score_{t-1}) | |
| ``` | |
| - **Dense:** every step that improves the triage quality yields a positive reward. | |
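| - **Non-negative:** rewards are clipped at zero, so partial progress is always rewarded and a step that lowers the composite score earns 0 rather than a penalty; there is no per-step cost. | |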
| - **Informative:** the `last_feedback` field in the observation explains *why* the score changed. | |
| - **Final episode score:** `current_score` at termination (the composite quality, 0–1). | |
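
The reward formula can be traced over a sequence of composite scores. This is a minimal sketch, assuming the composite score before the first step is 0 (the `initial_score` default below is that assumption, not a documented value); `step_rewards` is a hypothetical helper.

```python
from typing import List


def step_rewards(composite_scores: List[float], initial_score: float = 0.0) -> List[float]:
    """Apply reward_t = max(0, score_t - score_{t-1}) to a score trajectory."""
    rewards: List[float] = []
    previous = initial_score
    for score in composite_scores:
        # Clip at zero: a drop in quality yields no reward, not a penalty.
        rewards.append(max(0.0, score - previous))
        previous = score
    return rewards
```

One consequence of the clipping: if the score dips and later recovers, the rewards can sum to more than the final `current_score`, so the final score is the better measure of episode quality.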
| Response quality (Task 3) is assessed on: word count ≥20 (30%), keyword coverage (40%), professional greeting + sign-off (30%). | |
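
To make the three response-quality criteria concrete, here is a rough sketch of how such a score could be computed. Only the 30/40/30 weights and the 20-word threshold come from the text above. The greeting and sign-off word lists, the all-or-nothing length credit, the half credit for having only one of greeting or sign-off, and the substring keyword matching are all assumptions made for illustration.

```python
from typing import Iterable

GREETINGS = ("hi", "hello", "dear")  # hypothetical list
SIGN_OFFS = ("regards", "sincerely", "thanks", "thank you")  # hypothetical list


def response_quality(draft: str, keywords: Iterable[str]) -> float:
    """Illustrative score in [0, 1]: length 30%, keyword coverage 40%, etiquette 30%."""
    text = draft.lower().strip()
    # Length: full credit at 20 words or more (assumed to be all-or-nothing).
    length_score = 1.0 if len(text.split()) >= 20 else 0.0
    # Keyword coverage: fraction of expected keywords found as substrings.
    wanted = [k.lower() for k in keywords]
    coverage = sum(k in text for k in wanted) / len(wanted) if wanted else 0.0
    # Etiquette: half credit each for an opening greeting and a sign-off.
    has_greeting = text.startswith(GREETINGS)
    has_sign_off = any(s in text for s in SIGN_OFFS)
    etiquette = (has_greeting + has_sign_off) / 2
    return 0.3 * length_score + 0.4 * coverage + 0.3 * etiquette
```

The expected keywords per email would come from the ground-truth data; they are passed in here because their format is not described in this README.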
| --- | |
| ## Setup & Usage | |
| ### Local (no Docker) | |
| ```bash | |
| pip install -r requirements.txt | |
| # Start the server | |
| python server.py # listens on port 7860 by default | |
| # or | |
| PORT=8000 python server.py | |
| # Run baseline inference (needs a running server) | |
| ENV_URL=http://localhost:7860 \ | |
| API_BASE_URL=https://router.huggingface.co/v1 \ | |
| MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \ | |
| HF_TOKEN=<your_token> \ | |
| python inference.py | |
| ``` | |
| ### Docker | |
| ```bash | |
| # Build | |
| docker build -t email-triage-env . | |
| # Run server | |
| docker run -p 7860:7860 email-triage-env | |
| # Run inference (server in Docker, inference on host) | |
| IMAGE_NAME=email-triage-env \ | |
| API_BASE_URL=https://router.huggingface.co/v1 \ | |
| MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \ | |
| HF_TOKEN=<your_token> \ | |
| python inference.py | |
| ``` | |
| ### OpenEnv validation | |
| ```bash | |
| pip install openenv-core | |
| openenv validate | |
| ``` | |
| --- | |
| ## REST API | |
| | Endpoint | Method | Description | | |
| |---|---|---| | |
| | `/health` | GET | Liveness probe; returns `{"status":"ok"}` | | |
| | `/tasks` | GET | List all available tasks with metadata | | |
| | `/reset` | POST | Start/reset an episode. Body: `{"task_id":"email-classify","session_id":"optional"}` | | |
| | `/step` | POST | Submit actions. Body: `{"email_actions":[...],"session_id":"..."}` | | |
| | `/state` | GET | Inspect current session state (debug) | | |
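
The request bodies in the table can be assembled with the standard library alone. The sketch below only builds `urllib.request.Request` objects for `/reset` and `/step`; the helper names are hypothetical, and the response schema is not documented in this table, so parsing the reply is left to the caller.

```python
import json
from typing import Any, Dict, List, Optional
from urllib.request import Request


def _post(base_url: str, path: str, payload: Dict[str, Any]) -> Request:
    """Build a JSON POST request without sending it."""
    return Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_reset_request(base_url: str, task_id: str, session_id: Optional[str] = None) -> Request:
    """Request for POST /reset; session_id is optional per the table above."""
    payload: Dict[str, Any] = {"task_id": task_id}
    if session_id is not None:
        payload["session_id"] = session_id
    return _post(base_url, "/reset", payload)


def build_step_request(base_url: str, email_actions: List[Dict[str, Any]], session_id: str) -> Request:
    """Request for POST /step carrying a list of per-email actions."""
    return _post(base_url, "/step", {"email_actions": email_actions, "session_id": session_id})
```

With a server running on the default port, a request can be sent with `urllib.request.urlopen(build_reset_request("http://localhost:7860", "email-classify"))`.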
| --- | |
| ## Baseline Scores (Qwen/Qwen2.5-72B-Instruct) | |
| | Task | Difficulty | Score | | |
| |---|---|---| | |
| | email-classify | Easy | ~0.70 | | |
| | email-prioritize | Medium | ~0.55 | | |
| | email-triage-full | Hard | ~0.40 | | |
| | **Average** | | **~0.55** | | |
| *Scores are approximate and depend on the model and API temperature.* | |
| --- | |
| ## Project Structure | |
| ``` | |
| . | |
| ├── Dockerfile              Container definition | |
| ├── README.md               This file | |
| ├── openenv.yaml            OpenEnv metadata & task registry | |
| ├── requirements.txt        Python dependencies | |
| ├── inference.py            Baseline inference script (entry point) | |
| ├── server.py               FastAPI environment server | |
| ├── email_triage_env.py     Python client library + shared Pydantic models | |
| └── tasks/ | |
|     ├── __init__.py | |
|     ├── data.py             Email datasets with ground-truth labels | |
|     └── graders.py          Deterministic scoring functions | |
| ``` | |