Spaces:

Harshit-Makraria
/

email-triage-env




title: Email Triage OpenEnv
emoji: 📧
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
  - openenv
  - email
  - triage
  - rl
  - agent-evaluation

Email Triage — OpenEnv Environment

A real-world email triage environment for training and evaluating AI agents. Agents must classify, prioritize, and respond to a realistic inbox of emails, mirroring one of the most common productivity tasks professionals perform daily.

Why Email Triage?

Email triage is a genuine, high-value task: the average knowledge worker spends ~28% of their workday managing email. A capable agent that can triage an inbox correctly has immediate, measurable economic value — unlike toy grid worlds or synthetic puzzles. The graded structure (easy → hard) allows systematic capability benchmarking.

Environment Overview

Field	Value
Domain	Productivity / Email Management
Tasks	3 (easy, medium, hard)
Action space	Classify + prioritize + act + draft responses
Observation space	Full inbox + processing state + feedback
Reward	Dense (per-step quality improvement)
Episode termination	All emails processed or max steps reached

Observation Space

At each step the agent receives a TriageObservation object:

Field	Type	Description
`task_id`	str	Active task identifier
`step`	int	Current step (starts at 0 after reset)
`max_steps`	int	Episode step limit
`inbox`	List[Email]	All emails in the episode
`remaining_email_ids`	List[str]	IDs of unprocessed emails
`processed_count`	int	Emails handled so far
`total_emails`	int	Total emails in the episode
`current_score`	float ∈ [0,1]	Composite quality score so far
`last_feedback`	str | None	Grader feedback from the previous step
`instructions`	str	Task-specific natural-language instructions

Each Email has: id, sender, subject, body, timestamp.
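
As a rough illustration of the fields above, the observation could be mirrored with plain dataclasses. This is only a sketch: the project's real models are Pydantic classes in `email_triage_env.py`, the `timestamp` type is an assumption (the README does not state it), and the `unprocessed` helper is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Email:
    id: str
    sender: str
    subject: str
    body: str
    timestamp: str  # assumption: a string timestamp; the README does not give the type


@dataclass
class TriageObservation:
    task_id: str
    step: int
    max_steps: int
    inbox: List[Email]
    remaining_email_ids: List[str]
    processed_count: int
    total_emails: int
    current_score: float  # documented range: [0, 1]
    instructions: str
    last_feedback: Optional[str] = None


def unprocessed(obs: TriageObservation) -> List[Email]:
    """Hypothetical helper: inbox emails whose IDs are still listed as remaining."""
    remaining = set(obs.remaining_email_ids)
    return [email for email in obs.inbox if email.id in remaining]
```

A helper like `unprocessed` lets an agent focus each step on the emails it has not yet handled, since `inbox` always contains every email in the episode.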

Action Space

At each step the agent submits a TriageAction:

{
  "email_actions": [
    {
      "email_id": "e1",
      "category": "spam",
      "priority": 5,
      "action": "delete",
      "response_draft": null
    }
  ]
}

Field	Values	Notes
`email_id`	string	Must match an inbox email ID
`category`	`spam` | `work` | `personal` | `newsletter` | `urgent`	Required
`priority`	integer 1–N	1=most urgent; must be unique across the episode
`action`	`read` | `archive` | `delete` | `respond` | `flag`	Required
`response_draft`	string | null	Required (≥20 words) when `action=respond`
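
The constraints in the table can be checked on the client side before submitting a step. The sketch below is illustrative only: `validate_email_action` is a hypothetical helper, not part of the environment, and it assumes that N in "1–N" is the number of emails in the inbox.

```python
from typing import Any, Dict, List, Set

CATEGORIES = {"spam", "work", "personal", "newsletter", "urgent"}
ACTIONS = {"read", "archive", "delete", "respond", "flag"}
MIN_RESPONSE_WORDS = 20


def validate_email_action(
    item: Dict[str, Any], inbox_ids: Set[str], used_priorities: Set[int]
) -> List[str]:
    """Return a list of problems; an empty list means the item passes these checks."""
    problems: List[str] = []
    if item.get("email_id") not in inbox_ids:
        problems.append("email_id does not match an inbox email")
    if item.get("category") not in CATEGORIES:
        problems.append("unknown category")
    if item.get("action") not in ACTIONS:
        problems.append("unknown action")

    priority = item.get("priority")
    if isinstance(priority, bool) or not isinstance(priority, int):
        problems.append("priority must be an integer")
    elif not 1 <= priority <= len(inbox_ids):  # assumption: N = number of emails
        problems.append("priority out of range")
    elif priority in used_priorities:
        problems.append("priority already used in this episode")

    if item.get("action") == "respond":
        draft = item.get("response_draft") or ""
        if len(draft.split()) < MIN_RESPONSE_WORDS:
            problems.append("response_draft needs at least 20 words")
    return problems
```

Tracking `used_priorities` across steps matters because priorities must be unique across the whole episode, not just within one `TriageAction`.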

Tasks

Task 1 — `email-classify` (Easy)

Emails: 5 (1 spam, 1 work, 1 personal, 1 newsletter, 1 urgent)
Max steps: 5
Score weights: classification 60% · action 20% · priority 20%
Goal: Identify the category of each email correctly.
Expected baseline score: ~0.55–0.75

Task 2 — `email-prioritize` (Medium)

Emails: 8 spanning all urgency levels
Max steps: 8
Score weights: priority ranking 60% · classification 25% · action 15%
Goal: Rank all 8 emails by urgency (1=most urgent). Graded by Spearman rank correlation.
Expected baseline score: ~0.45–0.65
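
For intuition about the ranking metric, here is a minimal sketch of Spearman rank correlation for rankings without ties, which applies here because priorities must be unique. The function name is hypothetical, and how the grader in `tasks/graders.py` maps the correlation (range -1 to 1) onto a 0–1 score is not stated in this README, so that step is left out.

```python
from typing import Dict


def spearman_rho(predicted: Dict[str, int], truth: Dict[str, int]) -> float:
    """Spearman correlation between two tie-free rankings keyed by email ID."""
    ids = sorted(truth)
    n = len(ids)
    if n < 2:
        raise ValueError("need at least two ranked emails")
    # Sum of squared rank differences, then the closed form for tie-free ranks.
    d_squared = sum((predicted[i] - truth[i]) ** 2 for i in ids)
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

With 8 emails, a perfect ranking gives 1.0, a fully reversed ranking gives -1.0, and swapping two adjacent emails costs only about 0.024, so near-misses are penalized lightly.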

Task 3 — `email-triage-full` (Hard)

Emails: 12 (mix of all categories, several requiring responses)
Max steps: 12
Score weights: classification 30% · priority 30% · action 20% · response quality 20%
Goal: Full triage — classify, prioritize, choose actions, and draft professional responses.
Expected baseline score: ~0.30–0.55
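
The per-task score weights can be read as a weighted sum of component scores. The sketch below assumes exactly that, with each component in [0, 1]; the component names and the `composite_score` function are illustrative, and the actual graders may combine the components differently.

```python
from typing import Dict

# Weights copied from the task descriptions above.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "email-classify": {"classification": 0.60, "action": 0.20, "priority": 0.20},
    "email-prioritize": {"priority": 0.60, "classification": 0.25, "action": 0.15},
    "email-triage-full": {
        "classification": 0.30,
        "priority": 0.30,
        "action": 0.20,
        "response": 0.20,
    },
}


def composite_score(task_id: str, components: Dict[str, float]) -> float:
    """Weighted sum of component scores; missing components count as 0 (assumption)."""
    weights = WEIGHTS[task_id]
    return sum(weight * components.get(name, 0.0) for name, weight in weights.items())
```

Under this reading, an agent that classifies every email correctly in Task 1 but gets actions and priorities entirely wrong would score 0.60.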

Reward Function

reward_t = max(0, composite_score_t − composite_score_{t-1})

Dense: every step that improves the triage quality yields a positive reward.
Non-negative: partial progress is always rewarded; a step that lowers the composite score earns 0 rather than a negative reward, and there is no step penalty.
Informative: the last_feedback field in the observation explains why the score changed.
Final episode score: current_score at termination (the composite quality, 0–1).
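
The reward formula can be traced over a sequence of composite scores. This sketch assumes the score before the first step is 0, which the README does not state explicitly; `step_rewards` is a hypothetical name.

```python
from typing import List


def step_rewards(scores: List[float], initial_score: float = 0.0) -> List[float]:
    """Apply reward_t = max(0, score_t - score_{t-1}) to a score trajectory."""
    rewards: List[float] = []
    previous = initial_score  # assumption: the episode starts at score 0
    for score in scores:
        rewards.append(max(0.0, score - previous))
        previous = score
    return rewards


# Usage: the third step lowers the score, so it earns 0 rather than -0.1.
trajectory = [0.2, 0.5, 0.4, 0.7]
rewards = step_rewards(trajectory)
```

One consequence of clipping at zero is that the summed rewards (about 0.8 here) can exceed the final score (0.7) whenever the score dips mid-episode, so the final `current_score` is the better measure of episode quality.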

Response quality (Task 3) is assessed on: word count ≥20 (30%), keyword coverage (40%), professional greeting + sign-off (30%).
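
To make the three response-quality criteria concrete, here is a rough sketch of how such a rubric could be scored. Only the 30/40/30 weights and the 20-word threshold come from this README; the greeting and sign-off lists, the substring keyword match, and the all-or-nothing treatment of word count and etiquette are assumptions for illustration, not the grader's actual rules.

```python
from typing import Sequence

GREETINGS = ("hi", "hello", "dear")  # hypothetical list
SIGN_OFFS = ("regards", "sincerely", "thanks", "thank you")  # hypothetical list


def response_quality(draft: str, keywords: Sequence[str]) -> float:
    """Score a draft in [0, 1] using the documented 30/40/30 weighting."""
    text = draft.lower().strip()
    length_score = 1.0 if len(text.split()) >= 20 else 0.0

    if keywords:
        hits = sum(1 for keyword in keywords if keyword.lower() in text)
        coverage = hits / len(keywords)
    else:
        coverage = 1.0  # assumption: nothing to cover means full coverage

    has_greeting = text.startswith(GREETINGS)
    has_sign_off = any(sign_off in text for sign_off in SIGN_OFFS)
    etiquette = 1.0 if has_greeting and has_sign_off else 0.0

    return 0.30 * length_score + 0.40 * coverage + 0.30 * etiquette
```

Because keyword coverage carries the largest weight, a polite but generic reply that ignores the email's content would top out at 0.60 under this sketch.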

Setup & Usage

Local (no Docker)

pip install -r requirements.txt

# Start the server
python server.py           # listens on port 7860 by default
# or
PORT=8000 python server.py

# Run baseline inference (needs a running server)
ENV_URL=http://localhost:7860 \
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
HF_TOKEN=<your_token> \
python inference.py

Docker

# Build
docker build -t email-triage-env .

# Run server
docker run -p 7860:7860 email-triage-env

# Run inference (server in Docker, inference on host)
IMAGE_NAME=email-triage-env \
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
HF_TOKEN=<your_token> \
python inference.py

OpenEnv validation

pip install openenv-core
openenv validate

REST API

Endpoint	Method	Description
`/health`	GET	Liveness probe — returns `{"status":"ok"}`
`/tasks`	GET	List all available tasks with metadata
`/reset`	POST	Start/reset an episode. Body: `{"task_id":"email-classify","session_id":"optional"}`
`/step`	POST	Submit actions. Body: `{"email_actions":[...],"session_id":"..."}`
`/state`	GET	Inspect current session state (debug)
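
A client can talk to these endpoints with nothing but the standard library. The sketch below builds the `/reset` and `/step` requests from the body shapes in the table; the helper names are hypothetical, and the response JSON is returned as-is because its exact shape is not documented here. The repository's own client lives in `email_triage_env.py`.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional


def build_post(base_url: str, path: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_request(
    base_url: str, task_id: str, session_id: Optional[str] = None
) -> urllib.request.Request:
    body: Dict[str, Any] = {"task_id": task_id}
    if session_id is not None:  # session_id is optional for /reset
        body["session_id"] = session_id
    return build_post(base_url, "/reset", body)


def step_request(
    base_url: str, email_actions: List[Dict[str, Any]], session_id: str
) -> urllib.request.Request:
    body = {"email_actions": email_actions, "session_id": session_id}
    return build_post(base_url, "/step", body)


def send(request: urllib.request.Request, timeout: float = 30.0) -> Any:
    """Send a request to a running server and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running locally, `send(reset_request("http://localhost:7860", "email-classify"))` would start an episode; separating request construction from sending keeps the payload logic testable without a live server.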

Baseline Scores (Qwen/Qwen2.5-72B-Instruct)

Task	Difficulty	Score
email-classify	Easy	~0.70
email-prioritize	Medium	~0.55
email-triage-full	Hard	~0.40
Average		~0.55

Scores are approximate and depend on the model and API temperature.

Project Structure

.
├── Dockerfile               Container definition
├── README.md                This file
├── openenv.yaml             OpenEnv metadata & task registry
├── requirements.txt         Python dependencies
├── inference.py             Baseline inference script (entry point)
├── server.py                FastAPI environment server
├── email_triage_env.py      Python client library + shared Pydantic models
└── tasks/
    ├── __init__.py
    ├── data.py              Email datasets with ground-truth labels
    └── graders.py           Deterministic scoring functions