# Proxy Server Documentation The Headroom proxy server is a production-ready HTTP server that applies context optimization to all requests passing through it. ## Starting the Proxy ```bash # Basic usage headroom proxy # Custom port headroom proxy --port 8080 # With all options headroom proxy \ --host 0.0.0.0 \ --port 8787 \ --log-file /var/log/headroom.jsonl \ --budget 100.0 ``` ## Command Line Options ### Core Options | Option | Default | Description | |--------|---------|-------------| | `--host` | `127.0.0.1` | Host to bind to | | `--port` | `8787` | Port to bind to | | `--no-optimize` | `false` | Disable optimization (passthrough mode) | | `--no-cache` | `false` | Disable semantic caching | | `--no-rate-limit` | `false` | Disable rate limiting | | `--log-file` | None | Path to JSONL log file | | `--budget` | None | Daily budget limit in USD | ### LLMLingua Options (ML Compression) | Option | Default | Description | |--------|---------|-------------| | `--llmlingua` | `false` | Enable LLMLingua-2 ML-based compression | | `--llmlingua-device` | `auto` | Device for model: `auto`, `cuda`, `cpu`, `mps` | | `--llmlingua-rate` | `0.3` | Target compression rate (0.3 = keep 30% of tokens) | **Note:** LLMLingua requires additional dependencies: `pip install headroom-ai[llmlingua]` ```bash # Enable LLMLingua with GPU acceleration headroom proxy --llmlingua --llmlingua-device cuda # More aggressive compression (keep only 20%) headroom proxy --llmlingua --llmlingua-rate 0.2 # Conservative compression for code (keep 50%) headroom proxy --llmlingua --llmlingua-rate 0.5 ``` ## API Endpoints ### Health Check ```bash curl http://localhost:8787/health ``` Response: ```json { "status": "healthy", "optimize": true, "stats": { "total_requests": 42, "tokens_saved": 15000, "savings_percent": 45.2 } } ``` ### Detailed Statistics ```bash curl http://localhost:8787/stats ``` ### Prometheus Metrics ```bash curl http://localhost:8787/metrics ``` ### LLM APIs The proxy supports both Anthropic and OpenAI API formats: ```bash # Anthropic format POST /v1/messages # OpenAI format POST /v1/chat/completions ``` ## Using with Claude Code ```bash # Start proxy headroom proxy --port 8787 # In another terminal ANTHROPIC_BASE_URL=http://localhost:8787 claude ``` ## Using with Cursor 1. Start the proxy: `headroom proxy` 2. In Cursor settings, set the base URL to `http://localhost:8787` ## Using with OpenAI SDK ```python from openai import OpenAI client = OpenAI( base_url="http://localhost:8787/v1", api_key="your-api-key", # Still needed for upstream ) ``` ## Features ### LLMLingua ML Compression (Opt-In) When enabled, the proxy uses Microsoft's LLMLingua-2 model for ML-based token compression: ```bash headroom proxy --llmlingua ``` **How it works:** - LLMLinguaCompressor is added to the transform pipeline (before RollingWindow) - Automatically detects content type (JSON, code, text) and adjusts compression - Stores original content in CCR for retrieval if needed **Startup feedback:** ``` # When enabled and available: LLMLingua: ENABLED (device=cuda, rate=0.3) # When installed but not enabled (helpful hint): LLMLingua: available (enable with --llmlingua for ML compression) # When enabled but not installed: WARNING: LLMLingua requested but not installed. Install with: pip install headroom-ai[llmlingua] ``` **Why opt-in?** | Concern | Default Proxy | With LLMLingua | |---------|---------------|----------------| | Dependencies | ~50MB | +2GB (torch, transformers) | | Cold start | <1s | 10-30s (model load) | | Memory | ~100MB | +1GB (model in RAM) | | Overhead | <5ms | 50-200ms per request | Enable LLMLingua when maximum compression justifies the resource cost. ### Semantic Caching The proxy caches responses for repeated queries: - LRU eviction with configurable max entries - TTL-based expiration - Cache key based on message content hash ### Rate Limiting Token bucket rate limiting protects against runaway costs: - Configurable requests per minute - Configurable tokens per minute - Per-API-key tracking ### Cost Tracking Track spending and enforce budgets: - Real-time cost estimation - Budget periods: hourly, daily, monthly - Automatic request rejection when over budget ### Prometheus Metrics Export metrics for monitoring: ``` headroom_requests_total headroom_tokens_saved_total headroom_cost_usd_total headroom_latency_ms_sum ``` ## Configuration via Environment ```bash export HEADROOM_HOST=0.0.0.0 export HEADROOM_PORT=8787 export HEADROOM_BUDGET=100.0 headroom proxy ``` ## Running in Production For production deployments: ```bash # Use a process manager pip install gunicorn # Run with gunicorn gunicorn headroom.proxy.server:app \ --workers 4 \ --bind 0.0.0.0:8787 \ --worker-class uvicorn.workers.UvicornWorker ``` Or with Docker: ```dockerfile FROM python:3.11-slim RUN pip install headroom[proxy] EXPOSE 8787 CMD ["headroom", "proxy", "--host", "0.0.0.0"] ```