---
title: Email Triage OpenEnv
emoji: 📧
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
- openenv
- email
- triage
- rl
- agent-evaluation
---
# Email Triage: OpenEnv Environment
A real-world email triage environment for training and evaluating AI agents.
Agents must **classify**, **prioritize**, and **respond to** a realistic inbox
of emails, mirroring one of the most common productivity tasks professionals
perform daily.
---
## Why Email Triage?
Email triage is a genuine, high-value task: the average knowledge worker
spends ~28% of their workday managing email. A capable agent that can triage
an inbox correctly has immediate, measurable economic value, unlike toy grid
worlds or synthetic puzzles. The graded task structure (easy → hard) allows
systematic capability benchmarking.
---
## Environment Overview
| Field | Value |
|---|---|
| Domain | Productivity / Email Management |
| Tasks | 3 (easy, medium, hard) |
| Action space | Classify + prioritize + act + draft responses |
| Observation space | Full inbox + processing state + feedback |
| Reward | Dense (per-step quality improvement) |
| Episode termination | All emails processed **or** max steps reached |
---
## Observation Space
At each step the agent receives a `TriageObservation` object:
| Field | Type | Description |
|---|---|---|
| `task_id` | str | Active task identifier |
| `step` | int | Current step (starts at 0 after reset) |
| `max_steps` | int | Episode step limit |
| `inbox` | List[Email] | All emails in the episode |
| `remaining_email_ids` | List[str] | IDs of unprocessed emails |
| `processed_count` | int | Emails handled so far |
| `total_emails` | int | Total emails in the episode |
| `current_score` | float ∈ [0,1] | Composite quality score so far |
| `last_feedback` | str \| None | Grader feedback from the previous step |
| `instructions` | str | Task-specific natural-language instructions |
Each `Email` has: `id`, `sender`, `subject`, `body`, `timestamp`.
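For quick experimentation, the fields above can be mirrored in plain dataclasses. This is only a sketch: the project ships its own Pydantic models in `email_triage_env.py`, whose exact definitions are not reproduced here. The `observation_from_dict` helper and the `str` type for `timestamp` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Email:
    id: str
    sender: str
    subject: str
    body: str
    timestamp: str  # assumed to arrive as a string; the real type is not documented here


@dataclass
class TriageObservation:
    task_id: str
    step: int
    max_steps: int
    inbox: List[Email]
    remaining_email_ids: List[str]
    processed_count: int
    total_emails: int
    current_score: float
    instructions: str
    last_feedback: Optional[str] = None  # None until a step has been graded


def observation_from_dict(data: dict) -> TriageObservation:
    """Build an observation from a decoded JSON object (hypothetical helper)."""
    fields = dict(data)
    fields["inbox"] = [Email(**email) for email in data["inbox"]]
    return TriageObservation(**fields)
```

With an observation in this shape, `obs.remaining_email_ids` tells the agent which emails still need an action on the next step.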
---
## Action Space
At each step the agent submits a `TriageAction`:
```json
{
"email_actions": [
{
"email_id": "e1",
"category": "spam",
"priority": 5,
"action": "delete",
"response_draft": null
}
]
}
```
| Field | Values | Notes |
|---|---|---|
| `email_id` | string | Must match an inbox email ID |
| `category` | `spam` \| `work` \| `personal` \| `newsletter` \| `urgent` | Required |
| `priority` | integer 1–N | 1=most urgent; must be unique across the episode |
| `action` | `read` \| `archive` \| `delete` \| `respond` \| `flag` | Required |
| `response_draft` | string \| null | Required (≥20 words) when `action=respond` |
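The constraints in the table can be checked on the client before an action is submitted. The sketch below is a hypothetical pre-flight check, not the server's own validation (which is not shown in this README). The function name, the error strings, and the choice to leave the upper bound N unchecked are all assumptions.

```python
from typing import List, Set

CATEGORIES = {"spam", "work", "personal", "newsletter", "urgent"}
ACTIONS = {"read", "archive", "delete", "respond", "flag"}
MIN_RESPONSE_WORDS = 20


def validate_email_action(action: dict, inbox_ids: Set[str], used_priorities: Set[int]) -> List[str]:
    """Return a list of problems; an empty list means the action looks valid."""
    problems = []
    if action.get("email_id") not in inbox_ids:
        problems.append("unknown email_id")
    if action.get("category") not in CATEGORIES:
        problems.append("invalid category")
    if action.get("action") not in ACTIONS:
        problems.append("invalid action")

    priority = action.get("priority")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(priority, int) or isinstance(priority, bool) or priority < 1:
        problems.append("priority must be an integer >= 1")
    elif priority in used_priorities:
        problems.append("priority already used in this episode")

    if action.get("action") == "respond":
        draft = action.get("response_draft") or ""
        if len(draft.split()) < MIN_RESPONSE_WORDS:
            problems.append("response_draft needs at least 20 words")
    return problems
```

Tracking `used_priorities` on the client side is one way to honour the uniqueness rule across steps of the same episode.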
---
## Tasks
### Task 1: `email-classify` (Easy)
- **Emails:** 5 (1 spam, 1 work, 1 personal, 1 newsletter, 1 urgent)
- **Max steps:** 5
- **Score weights:** classification 60% · action 20% · priority 20%
- **Goal:** Identify the category of each email correctly.
- **Expected baseline score:** ~0.55–0.75
### Task 2: `email-prioritize` (Medium)
- **Emails:** 8 spanning all urgency levels
- **Max steps:** 8
- **Score weights:** priority ranking 60% · classification 25% · action 15%
- **Goal:** Rank all 8 emails by urgency (1=most urgent). Graded by Spearman rank correlation.
- **Expected baseline score:** ~0.45–0.65
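To make the Spearman criterion concrete, here is a minimal sketch of the textbook formula for two tie-free rankings, ρ = 1 − 6·Σd² / (n(n² − 1)). It assumes both rankings use each rank 1..n exactly once, which matches the unique-priority rule. The function name is hypothetical, and how the grader in `tasks/graders.py` maps ρ onto the 0–1 priority score is not documented here.

```python
from typing import Dict


def spearman_rho(predicted: Dict[str, int], truth: Dict[str, int]) -> float:
    """Spearman rank correlation for two tie-free rankings of the same emails."""
    if predicted.keys() != truth.keys():
        raise ValueError("both rankings must cover the same email IDs")
    n = len(truth)
    if n < 2:
        raise ValueError("need at least two emails to correlate rankings")
    # Sum of squared rank differences, one term per email.
    d_squared = sum((predicted[email_id] - truth[email_id]) ** 2 for email_id in truth)
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

A perfect ordering gives ρ = 1 and a fully reversed ordering gives ρ = −1, so swapping two adjacent emails costs far less than inverting the whole inbox.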
### Task 3: `email-triage-full` (Hard)
- **Emails:** 12 (mix of all categories, several requiring responses)
- **Max steps:** 12
- **Score weights:** classification 30% · priority 30% · action 20% · response quality 20%
- **Goal:** Full triage: classify, prioritize, choose actions, and draft professional responses.
- **Expected baseline score:** ~0.30–0.55
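The score weights listed for the three tasks suggest a weighted sum of per-component scores. The sketch below assumes exactly that: a linear combination, with each component already normalized to [0, 1]. The component keys and the `composite_score` function are illustrative names, not the grader's actual interface.

```python
# Weights copied from the task descriptions; the dictionary keys are illustrative.
TASK_WEIGHTS = {
    "email-classify": {"classification": 0.60, "action": 0.20, "priority": 0.20},
    "email-prioritize": {"priority": 0.60, "classification": 0.25, "action": 0.15},
    "email-triage-full": {
        "classification": 0.30,
        "priority": 0.30,
        "action": 0.20,
        "response": 0.20,
    },
}


def composite_score(task_id: str, components: dict) -> float:
    """Weighted sum of component scores; missing components count as 0."""
    weights = TASK_WEIGHTS[task_id]
    return sum(weight * components.get(name, 0.0) for name, weight in weights.items())
```

For example, on `email-classify` a run with perfect classification, half-correct actions, and no priority credit would score 0.6 + 0.1 + 0.0 = 0.7 under this assumption.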
---
## Reward Function
```
reward_t = max(0, composite_score_t - composite_score_{t-1})
```
- **Dense:** every step that improves the triage quality yields a positive reward.
- **Non-negative:** rewards are clipped at zero, so a step that lowers the score earns nothing rather than a penalty; there is no per-step cost.
- **Informative:** the `last_feedback` field in the observation explains *why* the score changed.
- **Final episode score:** `current_score` at termination (the composite quality, 0–1).
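The reward formula can be traced over a sequence of composite scores as follows. This is a sketch that assumes the score before the first step is 0; the function names are hypothetical.

```python
from typing import Iterable, List


def step_reward(previous_score: float, current_score: float) -> float:
    """Reward for one step: the score improvement, clipped at zero."""
    return max(0.0, current_score - previous_score)


def episode_rewards(scores: Iterable[float], initial_score: float = 0.0) -> List[float]:
    """Per-step rewards for a sequence of composite scores."""
    rewards = []
    previous = initial_score  # assumption: an untouched inbox scores 0
    for score in scores:
        rewards.append(step_reward(previous, score))
        previous = score
    return rewards
```

Because each step is compared with the immediately preceding score, a dip followed by a recovery is rewarded again. The summed rewards can therefore exceed the final `current_score`, which remains the episode's reported result.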
Response quality (Task 3) is assessed on three criteria: word count ≥20 (30%), keyword coverage (40%), and a professional greeting plus sign-off (30%).
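A rough sketch of how the three response-quality criteria could combine is shown below. The greeting and sign-off word lists, the substring-based keyword match, and the rule that both a greeting and a sign-off are needed for the etiquette credit are all assumptions; the real checks live in `tasks/graders.py` and may differ.

```python
from typing import Iterable

GREETINGS = ("hi", "hello", "dear")                           # hypothetical list
SIGN_OFFS = ("regards", "sincerely", "thanks", "thank you")   # hypothetical list


def response_quality(draft: str, keywords: Iterable[str]) -> float:
    """Score a draft in [0, 1] from length, keyword coverage, and etiquette."""
    text = draft.lower()
    length_ok = 1.0 if len(text.split()) >= 20 else 0.0

    expected = [keyword.lower() for keyword in keywords]
    # Fraction of expected keywords that appear anywhere in the draft.
    coverage = sum(k in text for k in expected) / len(expected) if expected else 0.0

    # Crude etiquette heuristics: prefix match for greetings, substring for sign-offs.
    has_greeting = text.lstrip().startswith(GREETINGS)
    has_sign_off = any(sign_off in text for sign_off in SIGN_OFFS)
    etiquette = 1.0 if has_greeting and has_sign_off else 0.0

    return 0.30 * length_ok + 0.40 * coverage + 0.30 * etiquette
```

Under these assumptions, a long but off-topic reply with no greeting tops out at 0.30, so keyword coverage is the component most worth getting right.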
---
## Setup & Usage
### Local (no Docker)
```bash
pip install -r requirements.txt
# Start the server
python server.py # listens on port 7860 by default
# or
PORT=8000 python server.py
# Run baseline inference (needs a running server)
ENV_URL=http://localhost:7860 \
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
HF_TOKEN=<your_token> \
python inference.py
```
### Docker
```bash
# Build
docker build -t email-triage-env .
# Run server
docker run -p 7860:7860 email-triage-env
# Run inference (server in Docker, inference on host)
IMAGE_NAME=email-triage-env \
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
HF_TOKEN=<your_token> \
python inference.py
```
### OpenEnv validation
```bash
pip install openenv-core
openenv validate
```
---
## REST API
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Liveness probe; returns `{"status":"ok"}` |
| `/tasks` | GET | List all available tasks with metadata |
| `/reset` | POST | Start/reset an episode. Body: `{"task_id":"email-classify","session_id":"optional"}` |
| `/step` | POST | Submit actions. Body: `{"email_actions":[...],"session_id":"..."}` |
| `/state` | GET | Inspect current session state (debug) |
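The `/reset` and `/step` endpoints can be called from Python with only the standard library. The sketch below builds the request bodies shown in the table. The helper names are hypothetical, and the shape of the JSON replies is not documented in this README, so `call` simply returns the decoded object.

```python
import json
import urllib.request
from typing import List, Optional


def build_post(base_url: str, path: str, payload: dict) -> urllib.request.Request:
    """Create a JSON POST request without sending it."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_request(base_url: str, task_id: str, session_id: Optional[str] = None) -> urllib.request.Request:
    payload = {"task_id": task_id}
    if session_id is not None:  # session_id is optional on /reset
        payload["session_id"] = session_id
    return build_post(base_url, "/reset", payload)


def step_request(base_url: str, email_actions: List[dict], session_id: str) -> urllib.request.Request:
    return build_post(base_url, "/step", {"email_actions": email_actions, "session_id": session_id})


def call(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server listening locally, `call(reset_request("http://localhost:7860", "email-classify"))` would start an episode, and subsequent `step_request` calls would submit batches of email actions for the same session.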
---
## Baseline Scores (Qwen/Qwen2.5-72B-Instruct)
| Task | Difficulty | Score |
|---|---|---|
| email-classify | Easy | ~0.70 |
| email-prioritize | Medium | ~0.55 |
| email-triage-full | Hard | ~0.40 |
| **Average** | | **~0.55** |
*Scores are approximate and depend on the model and API temperature.*
---
## Project Structure
```
.
├── Dockerfile            Container definition
├── README.md             This file
├── openenv.yaml          OpenEnv metadata & task registry
├── requirements.txt      Python dependencies
├── inference.py          Baseline inference script (entry point)
├── server.py             FastAPI environment server
├── email_triage_env.py   Python client library + shared Pydantic models
└── tasks/
    ├── __init__.py
    ├── data.py           Email datasets with ground-truth labels
    └── graders.py        Deterministic scoring functions
```