
Project Guide: Email Triage OpenEnv

A plain-English walkthrough of what was built, how it works, and how to use it.


What Is This?

This is an AI training environment submitted to the OpenEnv competition.

The idea is to train AI agents on something practical rather than on games like chess or Atari: triaging an email inbox. The agent reads emails, classifies them (spam, work, urgent, and so on), prioritizes them, and drafts replies.

OpenEnv is a standard interface (like OpenAI Gym but for real-world tasks) that lets anyone plug an AI model in and measure how well it performs.


The Three Tasks

Each task is a self-contained challenge the AI must solve.

Task 1: email-classify (Easy)

  • What: 5 emails arrive. Classify each one: spam / work / personal / newsletter / urgent.
  • Scored on: How many you get right (accuracy).
  • Example: A "Congratulations you won $1M!" email should be spam. A "server is down" email should be urgent.

Task 2: email-prioritize (Medium)

  • What: 8 emails. Rank them 1–8 by urgency (1 = handle first).
  • Scored on: How close your ordering is to the correct order (Spearman rank correlation).
  • Example: "Client meeting in 30 minutes" should rank higher than "coffee chat next week."
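For rankings without ties, Spearman's rank correlation has a simple closed form. The sketch below is an illustration only, not the code in tasks/graders.py; the function name and the dictionary input (email id mapped to rank) are assumptions made for the example.

```python
def spearman_rho(predicted: dict[str, int], truth: dict[str, int]) -> float:
    """Spearman rank correlation for two tie-free rankings of the same emails.

    Both arguments map email_id -> rank (1 = handle first).
    Returns a value in [-1, 1]; 1 means the two orderings are identical.
    """
    if predicted.keys() != truth.keys():
        raise ValueError("both rankings must cover the same email ids")
    n = len(truth)
    if n < 2:
        raise ValueError("need at least two emails to compare rankings")
    # Sum of squared rank differences, one term per email.
    d_squared = sum((predicted[e] - truth[e]) ** 2 for e in truth)
    return 1 - 6 * d_squared / (n * (n * n - 1))


truth = {f"e{i}": i for i in range(1, 9)}              # e1 is the most urgent
adjacent_swap = {**truth, "e7": 8, "e8": 7}            # two neighbors swapped
reversed_order = {e: 9 - r for e, r in truth.items()}  # worst possible ordering

print(spearman_rho(truth, truth))           # identical ordering
print(spearman_rho(adjacent_swap, truth))   # small mistake, still close to 1
print(spearman_rho(reversed_order, truth))  # fully reversed ordering
```

The correlation can be negative, while rewards are described as lying between 0 and 1. How the grader maps one onto the other is not covered by this sketch.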

Task 3: email-triage-full (Hard)

  • What: 12 emails. Classify + rank + decide action (reply/archive/delete) + write a draft reply for urgent ones.
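  • Scored on: All of the above, combined using the Task 3 weights in the score table (30% classification, 30% priority ranking, 20% correct action, 20% response quality).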
  • Example: For a "payment service is down" email, the agent should mark it urgent, rank it #1, choose "respond", and write a professional reply authorizing the fix.

How the Scoring Works

Every time the AI submits decisions, it gets a reward between 0 and 1.

  • Reward = how much the composite score improved compared to the previous step
  • If the AI processes emails one at a time, it gets small rewards along the way
  • If it processes all at once, it gets one big reward
  • Final score = the composite quality at the end of the episode
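The bullets above can be sketched as a small tracker that turns a sequence of composite scores into per-step rewards. This is a minimal illustration, not the server's code: the class name is hypothetical, and clamping a score drop to a reward of 0 is an assumption based on the stated 0 to 1 reward range.

```python
class RewardTracker:
    """Turns a sequence of composite scores into per-step rewards."""

    def __init__(self) -> None:
        self.last_score = 0.0  # composite score after the previous step

    def step(self, composite_score: float) -> float:
        # Reward is the improvement over the previous step's score.
        # Assumption: a step that lowers the score earns 0, not a negative reward.
        reward = max(0.0, composite_score - self.last_score)
        self.last_score = composite_score
        return reward


tracker = RewardTracker()
# Two submissions: the first reaches a score of 0.64, the second reaches 1.0.
rewards = [round(tracker.step(score), 2) for score in (0.64, 1.0)]
print(rewards)             # [0.64, 0.36]
print(tracker.last_score)  # final score at the end of the episode: 1.0
```

As long as the score never drops, the rewards add up to the final score, which is why one big submission and several small ones end with the same total.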

Score components per task:

| Component | Task 1 | Task 2 | Task 3 |
|---|---|---|---|
| Classification | 60% | 25% | 30% |
| Priority ranking | 20% | 60% | 30% |
| Correct action | 20% | 15% | 20% |
| Response quality | n/a | n/a | 20% |
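Read as a formula, each task's composite score is a weighted sum of its component scores. The sketch below copies the weights from the table; the function name, the component keys, and the choice to count a missing component as 0 are assumptions for illustration, not the actual grader.

```python
# Weights copied from the score-components table; each task's weights sum to 1.0.
WEIGHTS = {
    "email-classify": {"classification": 0.60, "priority": 0.20, "action": 0.20},
    "email-prioritize": {"classification": 0.25, "priority": 0.60, "action": 0.15},
    "email-triage-full": {
        "classification": 0.30,
        "priority": 0.30,
        "action": 0.20,
        "response": 0.20,
    },
}


def composite_score(task_id: str, components: dict[str, float]) -> float:
    """Weighted sum of component scores, each expected to lie in [0, 1]."""
    weights = WEIGHTS[task_id]
    # Assumption: a component the agent has not addressed yet counts as 0.
    return sum(w * components.get(name, 0.0) for name, w in weights.items())


example = {"classification": 1.0, "priority": 0.5, "action": 1.0, "response": 0.0}
print(round(composite_score("email-triage-full", example), 3))  # 0.65
```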

Response quality (Task 3 only) checks:

  • Is the draft at least 20 words? (30%)
  • Does it mention relevant keywords? (40%)
  • Does it have a greeting and sign-off? (30%)
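The three checks can be sketched as a simple heuristic. Only the 30/40/30 split and the 20-word threshold come from the list above. The greeting and sign-off word lists, proportional keyword credit, and requiring both a greeting and a sign-off are assumptions, so treat this as a rough illustration rather than the grader itself.

```python
GREETINGS = ("hi", "hello", "dear")             # hypothetical word list
SIGN_OFFS = ("regards", "thanks", "sincerely")  # hypothetical word list


def response_quality(draft: str, keywords: list[str]) -> float:
    """Score a draft reply in [0, 1] using the three checks above."""
    text = draft.lower()
    words = text.split()
    score = 0.0
    # Length check (30%): the draft has at least 20 words.
    if len(words) >= 20:
        score += 0.30
    # Keyword check (40%). Assumption: credit is proportional to keywords found.
    if keywords:
        hits = sum(1 for k in keywords if k.lower() in text)
        score += 0.40 * hits / len(keywords)
    # Greeting and sign-off check (30%). Assumption: both must be present,
    # the greeting as the first word and the sign-off in the last five words.
    has_greeting = bool(words) and words[0].strip(",.!:") in GREETINGS
    tail = " ".join(words[-5:])
    has_sign_off = any(s in tail for s in SIGN_OFFS)
    if has_greeting and has_sign_off:
        score += 0.30
    return score


draft = (
    "Hi team, thanks for flagging the payment outage. Please go ahead and "
    "restart the payment service now, and send me a status update within "
    "the hour so I can inform the client. Regards, Alex"
)
print(round(response_quality(draft, ["payment", "outage"]), 2))  # 1.0
```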

How It All Connects

                  [AI Agent / LLM]
                       |
                 inference.py
                 (calls the AI,
                  formats actions)
                       |
            HTTP requests (JSON)
                       |
                  server.py
              (FastAPI web server)
                       |
           tasks/graders.py  ←  tasks/data.py
          (scores the actions)   (email datasets
                                  + ground truth)

The server runs inside a Docker container (or locally). The AI agent connects to it and interacts through three endpoints:

  • POST /reset: Start a new episode, get the inbox
  • POST /step: Submit decisions, get back a score and feedback
  • GET /state: Peek at the current state (for debugging)
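A thin Python wrapper around these three endpoints might look like the sketch below. It is not the email_triage_env.py library from the repository: the class name, method names, and the injectable transport are hypothetical. The request bodies follow the curl examples in "Running It Yourself", and the shape of the JSON that comes back is not assumed.

```python
import json
import urllib.request
from typing import Callable, Optional

# A transport takes (method, url, payload) and returns the decoded JSON reply.
Transport = Callable[[str, str, Optional[dict]], dict]


def http_transport(method: str, url: str, payload: Optional[dict]) -> dict:
    """Send one JSON request with the standard library and decode the reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url, data=data, method=method, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


class TriageClient:
    """Hypothetical wrapper around /reset, /step and /state."""

    def __init__(self, base_url: str, transport: Transport = http_transport) -> None:
        self.base_url = base_url.rstrip("/")
        self.transport = transport  # swappable so the client can be tested offline

    def reset(self, task_id: str) -> dict:
        # Start a new episode for the given task and receive the inbox.
        return self.transport("POST", f"{self.base_url}/reset", {"task_id": task_id})

    def step(self, email_actions: list[dict]) -> dict:
        # Submit decisions for one or more emails.
        payload = {"email_actions": email_actions}
        return self.transport("POST", f"{self.base_url}/step", payload)

    def state(self) -> dict:
        # Read the current state without changing it.
        return self.transport("GET", f"{self.base_url}/state", None)
```

With the server running locally, `TriageClient("http://localhost:7860").reset("email-classify")` would start the first task. Passing a fake transport makes it possible to exercise the client without a server.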

File Map

opencv/
├── server.py            The web server (brain of the environment)
├── email_triage_env.py  Python library to connect to the server
├── inference.py         The script that runs an AI model against all 3 tasks
├── openenv.yaml         Metadata file the OpenEnv platform reads
├── Dockerfile           How to package everything into a container
├── requirements.txt     Python packages needed
├── README.md            Public-facing documentation (shown on HuggingFace)
└── tasks/
    ├── data.py          The actual emails + correct answers (ground truth)
    └── graders.py       The math that converts answers into scores

Running It Yourself

Option A (direct, no Docker):

# Install dependencies
pip install -r requirements.txt

# Start the server (leave this running)
python server.py

# In another terminal, run the AI agent
ENV_URL=http://localhost:7860 \
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
HF_TOKEN=your_token_here \
python inference.py

Option B (Docker):

docker build -t email-triage-env .
docker run -p 7860:7860 email-triage-env
# Then run inference.py with ENV_URL=http://localhost:7860

To test just the server manually:

# Check it's running
curl http://localhost:7860/health

# Start a task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "email-classify"}'

# Submit an answer
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{
    "email_actions": [
      {"email_id": "e1", "category": "spam", "priority": 5, "action": "delete"}
    ]
  }'

What the AI Output Looks Like

When inference.py runs, it prints structured logs:

[START] task=email-classify env=email-triage model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action={...} reward=0.64 done=false error=null
[STEP] step=2 action={...} reward=0.36 done=true error=null
[END] success=true steps=2 score=1.000 rewards=0.64,0.36

[START] task=email-prioritize env=email-triage model=Qwen/Qwen2.5-72B-Instruct
...

The OpenEnv evaluation platform reads exactly this format to score submissions.
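To check a run locally, the final [END] line can be parsed back into numbers. The sketch below is an assumption about the line grammar, inferred only from the sample output above; the function name is hypothetical, and the platform's real parser may be stricter or looser.

```python
import re

# Pattern inferred from the sample [END] line; not an official specification.
END_LINE = re.compile(
    r"\[END\] success=(?P<success>true|false) steps=(?P<steps>\d+) "
    r"score=(?P<score>\d+(?:\.\d+)?) rewards=(?P<rewards>[\d.,]*)"
)


def parse_end_line(line: str) -> dict:
    """Parse one [END] log line into typed fields."""
    match = END_LINE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not an [END] line: {line!r}")
    # The rewards field is a comma-separated list with one entry per step.
    rewards = [float(r) for r in match["rewards"].split(",") if r]
    return {
        "success": match["success"] == "true",
        "steps": int(match["steps"]),
        "score": float(match["score"]),
        "rewards": rewards,
    }


summary = parse_end_line("[END] success=true steps=2 score=1.000 rewards=0.64,0.36")
print(summary)
# {'success': True, 'steps': 2, 'score': 1.0, 'rewards': [0.64, 0.36]}
```

A quick sanity check on a parsed line is that the number of rewards equals the step count, as it does in the sample.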


Key Design Decisions

Why email triage? It's a real task with measurable correct answers. Unlike "write a story" (subjective), email triage has ground truth: spam IS spam, a production outage IS more urgent than a newsletter.

Why dense rewards (not just end-of-episode)? Sparse rewards (only score at the end) make it hard for AI to learn. By giving reward for each improvement step, the agent gets useful feedback throughout the episode.

Why allow multi-step refinement? The agent can resubmit decisions to improve them. This mirrors how a human might reconsider a classification after seeing all the emails. It also enables interesting strategies (classify easy ones first, then tackle ambiguous ones).

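Why Spearman correlation for Task 2? It gives partial credit for orderings that are close to correct. Each email is penalized by the square of the distance between its submitted rank and its true rank, so swapping two neighboring emails costs little while ranking the most urgent email last costs a lot. Exact-match accuracy would treat both mistakes the same. Note that plain Spearman weights every position equally; it does not by itself favor the top of the list over positions 6, 7, and 8.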