# Project Guide — Email Triage OpenEnv

A plain-English walkthrough of what was built, how it works, and how to use it.

---

## What Is This?

This is an **AI training environment** submitted to the OpenEnv competition.

The idea: instead of training AI agents on games like chess or Atari, this environment
teaches agents to do something useful — **triage an email inbox**. The agent reads emails,
classifies them (spam? work? urgent?), prioritizes them, and drafts replies.

OpenEnv is a standard interface (like OpenAI Gym but for real-world tasks) that lets
anyone plug an AI model in and measure how well it performs.

---

## The Three Tasks

Each task is a self-contained challenge the AI must solve.

### Task 1 — email-classify (Easy)
- **What:** 5 emails arrive. Classify each one: spam / work / personal / newsletter / urgent.
- **Scored on:** How many you get right (accuracy).
- **Example:** A "Congratulations you won $1M!" email should be spam. A "server is down" email should be urgent.

### Task 2 — email-prioritize (Medium)
- **What:** 8 emails. Rank them 1–8 by urgency (1 = handle first).
- **Scored on:** How close your ordering is to the correct order (Spearman rank correlation).
- **Example:** "Client meeting in 30 minutes" should rank higher than "coffee chat next week."

### Task 3 — email-triage-full (Hard)
- **What:** 12 emails. Classify + rank + decide action (reply/archive/delete) + write a draft reply for urgent ones.
- **Scored on:** All of the above, weighted equally (25% each component).
- **Example:** For a "payment service is down" email, the agent should mark it urgent, rank it #1, choose "respond", and write a professional reply authorizing the fix.

---

## How the Scoring Works

Every time the AI submits decisions, it gets a **reward** between 0 and 1.

- Reward = how much the score *improved* compared to last step
- If the AI processes emails one at a time, it gets small rewards along the way
- If it processes all at once, it gets one big reward
- Final **score** = the composite quality at the end of the episode

Score components per task:

| Component | Task 1 | Task 2 | Task 3 |
|-----------|--------|--------|--------|
| Classification | 60% | 25% | 30% |
| Priority ranking | 20% | 60% | 30% |
| Correct action | 20% | 15% | 20% |
| Response quality | — | — | 20% |

**Response quality** (Task 3 only) checks:
- Is the draft at least 20 words? (30%)
- Does it mention relevant keywords? (40%)
- Does it have a greeting and sign-off? (30%)

---

## How It All Connects

```
                  [AI Agent / LLM]
                       |
                 inference.py
                 (calls the AI,
                  formats actions)
                       |
            HTTP requests (JSON)
                       |
                  server.py
              (FastAPI web server)
                       |
           tasks/graders.py  ←  tasks/data.py
          (scores the actions)   (email datasets
                                  + ground truth)
```

The server runs inside a **Docker container** (or locally). The AI agent connects to it
and interacts through three endpoints:

- `POST /reset` — Start a new episode, get the inbox
- `POST /step` — Submit decisions, get back a score and feedback
- `GET /state` — Peek at the current state (for debugging)

---

## File Map

```
opencv/
├── server.py            The web server (brain of the environment)
├── email_triage_env.py  Python library to connect to the server
├── inference.py         The script that runs an AI model against all 3 tasks
├── openenv.yaml         Metadata file the OpenEnv platform reads
├── Dockerfile           How to package everything into a container
├── requirements.txt     Python packages needed
├── README.md            Public-facing documentation (shown on HuggingFace)
└── tasks/
    ├── data.py          The actual emails + correct answers (ground truth)
    └── graders.py       The math that converts answers into scores
```

---

## Running It Yourself

**Option A — Direct (no Docker):**
```bash
# Install dependencies
pip install -r requirements.txt

# Start the server (leave this running)
python server.py

# In another terminal, run the AI agent
ENV_URL=http://localhost:7860 \
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
HF_TOKEN=your_token_here \
python inference.py
```

**Option B — Docker:**
```bash
docker build -t email-triage-env .
docker run -p 7860:7860 email-triage-env
# Then run inference.py with ENV_URL=http://localhost:7860
```

**To test just the server manually:**
```bash
# Check it's running
curl http://localhost:7860/health

# Start a task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "email-classify"}'

# Submit an answer
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{
    "email_actions": [
      {"email_id": "e1", "category": "spam", "priority": 5, "action": "delete"}
    ]
  }'
```

---

## What the AI Output Looks Like

When `inference.py` runs, it prints structured logs:

```
[START] task=email-classify env=email-triage model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action={...} reward=0.64 done=false error=null
[STEP] step=2 action={...} reward=0.36 done=true error=null
[END] success=true steps=2 score=1.000 rewards=0.64,0.36

[START] task=email-prioritize env=email-triage model=Qwen/Qwen2.5-72B-Instruct
...
```

The OpenEnv evaluation platform reads exactly this format to score submissions.

---

## Key Design Decisions

**Why email triage?**
It's a real task with measurable correct answers. Unlike "write a story" (subjective),
email triage has ground truth: spam IS spam, a production outage IS more urgent than
a newsletter.

**Why dense rewards (not just end-of-episode)?**
Sparse rewards (only score at the end) make it hard for AI to learn. By giving reward
for each improvement step, the agent gets useful feedback throughout the episode.

**Why allow multi-step refinement?**
The agent can resubmit decisions to improve them. This mirrors how a human might
reconsider a classification after seeing all the emails. It also enables interesting
strategies (classify easy ones first, then tackle ambiguous ones).

**Why Spearman correlation for Task 2?**
It measures ranking quality properly — getting the top 3 emails in the right order
matters more than getting emails 6/7/8 slightly wrong. Spearman captures this better
than simple accuracy.