# Metrics & Monitoring

Headroom exposes metrics for monitoring compression performance, cost savings, and system health through the proxy's HTTP endpoints, the SDK, and logs.

## Proxy Metrics

### Stats Endpoint

```bash
curl http://localhost:8787/stats
```

```json
{
  "requests": {
    "total": 42,
    "cached": 5,
    "rate_limited": 0,
    "failed": 0
  },
  "tokens": {
    "input": 50000,
    "output": 8000,
    "saved": 12500,
    "savings_percent": 25.0
  },
  "cost": {
    "total_cost_usd": 0.15,
    "total_savings_usd": 0.04
  },
  "cache": {
    "entries": 10,
    "total_hits": 5
  }
}
```
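
To track these numbers over time, it can help to derive a few ratios from the payload. The sketch below assumes the response has exactly the shape of the sample above and that `savings_percent` is `saved / input * 100`, which matches the sample (12500 / 50000) but is not confirmed elsewhere. The helper names `fetch_stats` and `summarize_stats` are illustrative and not part of Headroom.

```python
import json
from urllib.request import urlopen


def fetch_stats(base_url: str = "http://localhost:8787") -> dict:
    """Fetch the proxy's /stats payload and decode it as JSON."""
    with urlopen(f"{base_url}/stats", timeout=5) as resp:
        return json.load(resp)


def summarize_stats(stats: dict) -> dict:
    """Derive ratios from a /stats payload shaped like the sample above."""
    requests = stats["requests"]
    tokens = stats["tokens"]
    total = requests["total"]
    token_input = tokens["input"]
    return {
        # Share of requests answered from cache (0.0 before any traffic).
        "cached_share": requests["cached"] / total if total else 0.0,
        "failure_rate": requests["failed"] / total if total else 0.0,
        # Assumption: recomputed as saved / input; compare with tokens["savings_percent"].
        "savings_percent": 100.0 * tokens["saved"] / token_input if token_input else 0.0,
    }


# The sample payload from above, trimmed to the fields the helper reads.
sample = {
    "requests": {"total": 42, "cached": 5, "rate_limited": 0, "failed": 0},
    "tokens": {"input": 50000, "output": 8000, "saved": 12500, "savings_percent": 25.0},
}
summary = summarize_stats(sample)
print(summary)
```

Against a running proxy, `summarize_stats(fetch_stats())` would produce the same figures from live data.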

### Prometheus Metrics

```bash
curl http://localhost:8787/metrics
```

```prometheus
# HELP headroom_requests_total Total requests processed
headroom_requests_total{mode="optimize"} 1234

# HELP headroom_tokens_saved_total Total tokens saved
headroom_tokens_saved_total 5678900

# HELP headroom_compression_ratio Compression ratio histogram
headroom_compression_ratio_bucket{le="0.5"} 890
headroom_compression_ratio_bucket{le="0.7"} 1100
headroom_compression_ratio_bucket{le="0.9"} 1200

# HELP headroom_latency_seconds Request latency histogram
headroom_latency_seconds_bucket{le="0.01"} 800
headroom_latency_seconds_bucket{le="0.1"} 1150

# HELP headroom_cache_hits_total Cache hit counter
headroom_cache_hits_total 456

# HELP headroom_cache_misses_total Cache miss counter
headroom_cache_misses_total 778
```
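
If no Prometheus server is available, the text output can be read directly. The following is a minimal sketch that parses only the simple `series value` lines shown above (no timestamps, no escaped label values) and computes a cache hit rate from the two cache counters. The functions `parse_samples` and `cache_hit_rate` are hypothetical helpers, not part of Headroom.

```python
def parse_samples(text: str) -> dict:
    """Parse simple Prometheus text-format lines into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (# HELP, # TYPE).
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def cache_hit_rate(samples: dict) -> float:
    """Hits divided by total lookups, or 0.0 when there were no lookups."""
    hits = samples.get("headroom_cache_hits_total", 0.0)
    misses = samples.get("headroom_cache_misses_total", 0.0)
    lookups = hits + misses
    return hits / lookups if lookups else 0.0


# A subset of the sample output above.
EXPOSITION = """\
# HELP headroom_requests_total Total requests processed
headroom_requests_total{mode="optimize"} 1234
# HELP headroom_cache_hits_total Cache hit counter
headroom_cache_hits_total 456
# HELP headroom_cache_misses_total Cache miss counter
headroom_cache_misses_total 778
"""

samples = parse_samples(EXPOSITION)
print(f"Cache hit rate: {cache_hit_rate(samples):.1%}")
```

For anything beyond a quick check, a full Prometheus client parser is a safer choice than this line-based approach.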

### Health Check

```bash
curl http://localhost:8787/health
```

```json
{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 3600,
  "llmlingua_enabled": false
}
```

## SDK Metrics

### Session Stats

Quick stats for the current session (no database query):

```python
stats = client.get_stats()
print(stats)
```

```python
{
    "session": {
        "requests_total": 10,
        "tokens_input_before": 50000,
        "tokens_input_after": 35000,
        "tokens_saved_total": 15000,
        "tokens_output_total": 8000,
        "cache_hits": 3,
        "compression_ratio_avg": 0.70
    },
    "config": {
        "mode": "optimize",
        "provider": "openai",
        "cache_optimizer_enabled": True,
        "semantic_cache_enabled": False
    },
    "transforms": {
        "smart_crusher_enabled": True,
        "cache_aligner_enabled": True,
        "rolling_window_enabled": True
    }
}
```
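
The session counters in the sample are related: `tokens_saved_total` equals the before/after difference, and 35000 / 50000 gives the 0.70 ratio. The sketch below recomputes those figures from the raw counters. It assumes the ratio means tokens after divided by tokens before, which is inferred from the sample only, and `derive_session_figures` is an illustrative name rather than an SDK function.

```python
def derive_session_figures(session: dict) -> dict:
    """Recompute savings figures from the raw before/after token counters."""
    before = session["tokens_input_before"]
    after = session["tokens_input_after"]
    return {
        "tokens_saved": before - after,
        # Assumption: ratio = after / before, so lower means more compression.
        "compression_ratio": after / before if before else 1.0,
        "reduction_percent": 100.0 * (before - after) / before if before else 0.0,
    }


# The "session" part of the sample output above.
session = {
    "requests_total": 10,
    "tokens_input_before": 50000,
    "tokens_input_after": 35000,
    "tokens_saved_total": 15000,
}
figures = derive_session_figures(session)
print(figures)
```

Note that `compression_ratio_avg` is an average, presumably per request, so it may differ from this aggregate ratio when request sizes vary.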

### Historical Metrics

Query stored metrics from the database:

```python
from datetime import datetime, timedelta

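# Fetch up to 100 records from the last hour.
# Note: datetime.utcnow() returns a naive UTC datetime and is deprecated since Python 3.12.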
metrics = client.get_metrics(
    start_time=datetime.utcnow() - timedelta(hours=1),
    limit=100,
)

for m in metrics:
    print(f"{m.timestamp}: {m.tokens_input_before} -> {m.tokens_input_after}")
```

### Summary Statistics

Aggregate statistics across all stored metrics:

```python
summary = client.get_summary()
print(f"Total requests: {summary['total_requests']}")
print(f"Total tokens saved: {summary['total_tokens_saved']}")
print(f"Average compression: {summary['avg_compression_ratio']:.1%}")
print(f"Total cost savings: ${summary['total_cost_saved_usd']:.2f}")
```

## Logging

### Enable Logging

```python
import logging

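# Pick one level. basicConfig() only configures the root logger on its first
# call, so a second call with a different level has no effect.

# INFO level shows compression summaries
logging.basicConfig(level=logging.INFO)

# DEBUG level shows detailed transform decisions
# logging.basicConfig(level=logging.DEBUG)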
```
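
Setting the root level also changes the verbosity of every other library in the process. A narrower option, sketched below, is to set levels on Headroom's own loggers. The logger names are taken from the log output examples in this document (`headroom.transforms.smart_crusher` and so on); that all Headroom loggers sit under a `headroom` parent is an assumption based on those names.

```python
import logging

# Keep third-party libraries quiet by default.
logging.basicConfig(
    level=logging.WARNING,
    format="%(levelname)s:%(name)s:%(message)s",
)

# Compression summaries from everything under the "headroom" namespace.
logging.getLogger("headroom").setLevel(logging.INFO)

# Detailed transform decisions from a single module only.
logging.getLogger("headroom.transforms.smart_crusher").setLevel(logging.DEBUG)
```

Child loggers inherit the level of the nearest configured ancestor, so `headroom.transforms.pipeline` logs at INFO here without being configured explicitly.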

### Log Output Examples

```
INFO:headroom.transforms.pipeline:Pipeline complete: 45000 -> 4500 tokens (saved 40500, 90.0% reduction)
INFO:headroom.transforms.smart_crusher:SmartCrusher applied top_n strategy: kept 15 of 1000 items
INFO:headroom.cache.compression_store:CCR cache hit: hash=abc123, retrieved 1000 items
DEBUG:headroom.transforms.smart_crusher:Kept items: [0,1,2,42,77,97,98,99] (errors at 42, warnings at 77)
```
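
When only logs are available, the pipeline summary line can be parsed to recover token counts. This is a rough sketch that assumes the message format matches the first example line above exactly; the format is not documented as stable, and `parse_pipeline_line` is a hypothetical helper.

```python
import re

# Matches: "Pipeline complete: 45000 -> 4500 tokens (saved 40500, 90.0% reduction)"
PIPELINE_RE = re.compile(
    r"Pipeline complete: (?P<before>\d+) -> (?P<after>\d+) tokens "
    r"\(saved (?P<saved>\d+), (?P<percent>\d+(?:\.\d+)?)% reduction\)"
)


def parse_pipeline_line(line: str):
    """Return token counts from a pipeline summary line, or None if it does not match."""
    match = PIPELINE_RE.search(line)
    if match is None:
        return None
    return {
        "before": int(match["before"]),
        "after": int(match["after"]),
        "saved": int(match["saved"]),
        "reduction_percent": float(match["percent"]),
    }


line = (
    "INFO:headroom.transforms.pipeline:Pipeline complete: "
    "45000 -> 4500 tokens (saved 40500, 90.0% reduction)"
)
parsed = parse_pipeline_line(line)
print(parsed)
```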

### Proxy Logging

```bash
# Log to file
headroom proxy --log-file headroom.jsonl

# Increase verbosity
headroom proxy --log-level debug
```

## Grafana Dashboard

Example Grafana dashboard configuration for Prometheus metrics:

```json
{
  "panels": [
    {
      "title": "Tokens Saved",
      "type": "stat",
      "targets": [{"expr": "headroom_tokens_saved_total"}]
    },
    {
      "title": "Compression Ratio (median)",
      "type": "gauge",
      "targets": [{"expr": "histogram_quantile(0.5, sum by (le) (rate(headroom_compression_ratio_bucket[5m])))"}]
    },
    {
      "title": "Request Latency (p99)",
      "type": "graph",
      "targets": [{"expr": "histogram_quantile(0.99, sum by (le) (rate(headroom_latency_seconds_bucket[5m])))"}]
    },
    {
      "title": "Cache Hit Rate",
      "type": "gauge",
      "targets": [{"expr": "headroom_cache_hits_total / (headroom_cache_hits_total + headroom_cache_misses_total)"}]
    }
  ]
}
```
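
The quantile panels rely on `histogram_quantile`, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that idea on the sample compression-ratio buckets from the Prometheus section. It is a simplified illustration, not Prometheus's implementation: it treats the highest listed bucket as the total count, whereas Prometheus expects a `le="+Inf"` bucket, and it assumes the lowest bucket starts at 0.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs sorted by bound."""
    if not buckets or buckets[-1][1] == 0:
        return math.nan
    # Simplification: the last listed bucket stands in for the total count.
    rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]


# Sample buckets for headroom_compression_ratio from the Prometheus section.
ratio_buckets = [(0.5, 890.0), (0.7, 1100.0), (0.9, 1200.0)]
median = estimate_quantile(0.5, ratio_buckets)
p90 = estimate_quantile(0.9, ratio_buckets)
print(f"median={median:.3f} p90={p90:.3f}")
```

With these buckets the median falls in the first bucket (rank 600 of 890 observations), giving roughly 0.337. Because of the interpolation, the estimate is only as precise as the bucket boundaries allow.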

## Cost Tracking

### Per-Request Cost

Cost is calculated from model pricing and token counts and, where available, attached to the response as metadata:

```python
response = client.chat.completions.create(...)

# Access via response metadata (if available)
# Cost is calculated based on model pricing and token counts
```

### Budget Alerts

Set a budget limit in the proxy:

```bash
headroom proxy --budget 10.00
```

When the budget is exceeded:
- Requests return a budget exceeded error
- The `/stats` endpoint shows budget status
- Logs indicate budget state

## Validation

Check that your setup is configured correctly:

```python
result = client.validate_setup()

if result["valid"]:
    print("Setup is correct!")
else:
    print("Issues found:")
    for issue in result["issues"]:
        print(f"  - {issue}")
```

## Key Metrics to Monitor

| Metric | What It Tells You | Target |
|--------|------------------|--------|
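| `tokens_saved_total` | Tokens removed by compression (drives cost savings) | Higher is better |
| `compression_ratio_avg` | Tokens after compression relative to tokens before (lower means more savings) | 0.7-0.9 typical |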
| `cache_hit_rate` | Cache effectiveness | >20% is good |
| `latency_p99` | Performance impact | <10ms |
| `failed_requests` | Reliability | 0 |
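
The targets in the table can be turned into a simple automated check. The sketch below is illustrative only: `check_targets` and its parameter names are hypothetical, the thresholds are copied from the table, and the caller is assumed to have already derived the hit rate and p99 latency (for example from `/stats` or Prometheus).

```python
def check_targets(
    cache_hit_rate: float,
    latency_p99_seconds: float,
    failed_requests: int,
) -> list[str]:
    """Return one warning per table target that is not met."""
    warnings = []
    if cache_hit_rate <= 0.20:
        warnings.append(f"cache hit rate {cache_hit_rate:.0%} is not above 20%")
    if latency_p99_seconds >= 0.010:
        warnings.append(f"p99 latency {latency_p99_seconds * 1000:.1f} ms is not below 10 ms")
    if failed_requests > 0:
        warnings.append(f"{failed_requests} failed request(s), target is 0")
    return warnings


# Example: hit rate from the Prometheus sample (456 / 1234), other values made up.
for warning in check_targets(456 / 1234, 0.008, 0):
    print(warning)
```

An empty list means every target is met. Compression ratio is left out because the table gives a typical range for it rather than a pass/fail threshold.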