chopratejas Claude Opus 4.5 committed on Jan 10

Commit

90d3aea

1 Parent(s): c1feb60

Publish headroom-ai v0.2.0 to PyPI with DevEx fixes

- Renamed package from 'headroom' to 'headroom-ai' (PyPI name conflict)
- Fixed numpy/jinja2 imports to be lazy (core install no longer crashes)
- Fixed SQLite default path (now uses temp directory)
- Fixed f-string {tool} crash in proxy server
- Updated README with correct package name and examples
- Added quickstart and troubleshooting docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Files changed (10)

README.md +413 -100
docs/quickstart.md +330 -0
docs/troubleshooting.md +442 -0
examples/basic_usage.py +65 -1
headroom/__init__.py +50 -20
headroom/client.py +11 -2
headroom/proxy/server.py +1 -1
headroom/relevance/embedding.py +24 -6
headroom/reporting/generator.py +20 -4
pyproject.toml +2 -2

README.md CHANGED

@@ -25,7 +25,7 @@
 ---
-## The Problem
 AI coding agents and tool-using applications generate **massive contexts**:
@@ -35,9 +35,7 @@ AI coding agents and tool-using applications generate **massive contexts**:
 **Result**: You pay for tokens you don't need, and cache hits are rare.
-## The Solution
-Headroom is a **smart compression layer** that sits between your app and LLM providers. It applies three transforms:
 | Transform | What It Does | Savings |
 |-----------|--------------|---------|
@@ -47,217 +45,532 @@ Headroom is a **smart compression layer** that sits between your app and LLM pro
 **Zero accuracy loss** - we keep what matters: errors, anomalies, relevant items.
-## Quick Start
-### Option 1: Proxy (Recommended)
-Run Headroom as a proxy server - works with any client:
 ```bash
-pip install headroom
 # Start the proxy
 headroom proxy --port 8787
-# Use with Claude Code
 ANTHROPIC_BASE_URL=http://localhost:8787 claude
-# Use with any OpenAI-compatible client
 OPENAI_BASE_URL=http://localhost:8787/v1 your-app
 ```
 ### Option 2: Python SDK
-Wrap your existing client:
 ```python
-from headroom import HeadroomClient
 from openai import OpenAI
 client = HeadroomClient(
     original_client=OpenAI(),
-    default_mode="optimize",
 )
 # Use exactly like the original client
 response = client.chat.completions.create(
     model="gpt-4o",
     messages=[...],
 )
 ```
-### Option 3: LangChain Integration
 ```python
-from langchain_openai import ChatOpenAI
-from headroom.integrations import HeadroomOptimizer
-llm = ChatOpenAI(model="gpt-4o", callbacks=[HeadroomOptimizer()])
 ```
-## Features
-### Smart Tool Output Compression
 ```python
 # Before: 50KB tool response with 1000 items
-{"results": [{"id": 1, ...}, {"id": 2, ...}, ... 1000 items ...]}
 # After: ~2KB with important items preserved
 # - First 3 items (context)
 # - Last 2 items (recency)
-# - All error items
-# - Anomalous values (> 2 std dev)
-# - Items matching user's query
 ```
-### Cache-Aligned Prefixes
 ```python
 # Before: Cache miss every day due to changing date
 "You are helpful. Today is January 7, 2025."
-# After: Stable prefix (cache hit!) + dynamic context
 "You are helpful."
-# [Dynamic context moved to end]
 ```
-### Rolling Window
 ```python
-# Automatically manages context within token limits
-# - Drops oldest tool outputs first
-# - Never orphans tool call/response pairs
-# - Always preserves system prompt and recent turns
 ```
-### Production Proxy Features
-- **Semantic Caching**: LRU cache with TTL for repeated queries
-- **Rate Limiting**: Token bucket (requests + tokens per minute)
-- **Cost Tracking**: Budget enforcement (hourly/daily/monthly)
-- **Prometheus Metrics**: `/metrics` endpoint for monitoring
-- **Request Logging**: JSONL logs for debugging
-## Installation
 ```bash
-# Core (minimal dependencies)
-pip install headroom
-# With semantic relevance scoring
-pip install headroom[relevance]
-# With proxy server
-pip install headroom[proxy]
-# Everything
-pip install headroom[all]
 ```
-## Modes
-### Audit Mode (Observe Only)
 ```python
-client = HeadroomClient(original_client=base, default_mode="audit")
-# Logs metrics but doesn't modify requests
 ```
-### Optimize Mode (Apply Transforms)
 ```python
-client = HeadroomClient(original_client=base, default_mode="optimize")
-# Applies safe, deterministic transforms
 ```
-### Simulate Mode (Preview)
 ```python
-plan = client.chat.completions.simulate(model="gpt-4o", messages=[...])
-print(f"Would save {plan.tokens_saved} tokens ({plan.savings_percent:.1f}%)")
 ```
-## Configuration
 ```python
-from headroom import HeadroomClient, SmartCrusherConfig
-client = HeadroomClient(
-    original_client=base,
-    default_mode="optimize",
-    smart_crusher_config=SmartCrusherConfig(
-        min_tokens_to_crush=200,      # Only compress if > 200 tokens
-        max_items_after_crush=50,     # Keep at most 50 items
-        keep_first=3,                 # Always keep first 3
-        keep_last=2,                  # Always keep last 2
-        relevance_threshold=0.3,      # Keep items with relevance > 0.3
-    ),
 )
 ```
 ## Supported Providers
-| Provider | Token Counting | Status |
-|----------|----------------|--------|
-| OpenAI | tiktoken | Full support |
-| Anthropic | Official API | Full support |
-| Google | Official API | Full support |
-| Cohere | Official API | Full support |
-| Mistral | Official tokenizer | Full support |
-| LiteLLM | Via provider | Full support |
 ## Safety Guarantees
 Headroom follows strict safety rules:
-1. **Never removes human content** - User/assistant text is sacred
-2. **Never breaks tool ordering** - Tool calls and responses stay paired
 3. **Parse failures are no-ops** - Malformed content passes through unchanged
 4. **Preserves recency** - Last N turns are always kept
-## Benchmarks
-| Scenario | Before | After | Savings |
-|----------|--------|-------|---------|
-| Search results (1000 items) | 45,000 tokens | 4,500 tokens | 90% |
-| Log analysis (500 entries) | 22,000 tokens | 3,300 tokens | 85% |
-| API response (nested JSON) | 15,000 tokens | 2,250 tokens | 85% |
-| Long conversation (50 turns) | 80,000 tokens | 32,000 tokens | 60% |
 ## Documentation
-- [Getting Started Guide](docs/getting-started.md)
-- [Proxy Server Documentation](docs/proxy.md)
-- [Transform Reference](docs/transforms.md)
-- [API Reference](docs/api.md)
-- [Examples](examples/)
 ## Contributing
-We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
 ```bash
 # Development setup
 git clone https://github.com/headroom-sdk/headroom.git
 cd headroom
 pip install -e ".[dev]"
 pytest
 ```
-## License
-Apache License 2.0 - see [LICENSE](LICENSE) for details.
-## Links
-- [GitHub](https://github.com/headroom-sdk/headroom)
-- [PyPI](https://pypi.org/project/headroom/)
-- [Documentation](https://headroom.dev/docs)
-- [Discord](https://discord.gg/headroom)
 ---
 <p align="center">
-  <sub>Built with care for the AI developer community</sub>
 </p>

 ---
+## Why Headroom?
 AI coding agents and tool-using applications generate **massive contexts**:
 **Result**: You pay for tokens you don't need, and cache hits are rare.
+Headroom is a **smart compression layer** that sits between your app and LLM providers:
 | Transform | What It Does | Savings |
 |-----------|--------------|---------|
 **Zero accuracy loss** - we keep what matters: errors, anomalies, relevant items.
+---
+## 5-Minute Quickstart
+### Option 1: Proxy Server (Recommended)
+Works with **any** OpenAI-compatible client without code changes:
 ```bash
+# Install
+pip install "headroom-ai[proxy]"
 # Start the proxy
 headroom proxy --port 8787
+# Verify it's running
+curl http://localhost:8787/health
+# Expected: {"status": "healthy", ...}
+```
+**Use with your tools:**
+```bash
+# Claude Code
 ANTHROPIC_BASE_URL=http://localhost:8787 claude
+# Cursor / Continue / any OpenAI client
 OPENAI_BASE_URL=http://localhost:8787/v1 your-app
+# Python OpenAI SDK
+export OPENAI_BASE_URL=http://localhost:8787/v1
+python your_script.py
 ```
 ### Option 2: Python SDK
+Wrap your existing client for fine-grained control:
+```bash
+pip install headroom-ai openai
+```
 ```python
+from headroom import HeadroomClient, OpenAIProvider
 from openai import OpenAI
+# Create wrapped client
 client = HeadroomClient(
     original_client=OpenAI(),
+    provider=OpenAIProvider(),
+    default_mode="optimize",  # or "audit" to observe only
 )
 # Use exactly like the original client
+response = client.chat.completions.create(
+    model="gpt-4o-mini",
+    messages=[
+        {"role": "user", "content": "Hello!"},
+    ],
+)
+print(response.choices[0].message.content)
+# Check what happened
+stats = client.get_stats()
+print(f"Tokens saved this session: {stats['session']['tokens_saved_total']}")
+```
+**With tool outputs (where real savings happen):**
+```python
+import json
+# Conversation with large tool output
+messages = [
+    {"role": "user", "content": "Search for Python tutorials"},
+    {
+        "role": "assistant",
+        "content": None,
+        "tool_calls": [{
+            "id": "call_123",
+            "type": "function",
+            "function": {"name": "search", "arguments": '{"q": "python"}'},
+        }],
+    },
+    {
+        "role": "tool",
+        "tool_call_id": "call_123",
+        "content": json.dumps({
+            "results": [{"title": f"Tutorial {i}", "score": 100-i} for i in range(500)]
+        }),
+    },
+    {"role": "user", "content": "What are the top 3?"},
+]
+# Headroom compresses 500 results to ~15, keeping highest-scoring items
+response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
+print(f"Tokens saved: {client.get_stats()['session']['tokens_saved_total']}")
+# Typical output: "Tokens saved: 3500"
+```
+### Option 3: LangChain Integration (Coming Soon)
+```python
+# Coming soon - use proxy server for now
+# OPENAI_BASE_URL=http://localhost:8787/v1 python your_langchain_app.py
+```
+---
+## Verify It's Working
+### Check Proxy Stats
+```bash
+curl http://localhost:8787/stats
+```
+```json
+{
+  "requests": {"total": 42, "cached": 5, "rate_limited": 0, "failed": 0},
+  "tokens": {"input": 50000, "output": 8000, "saved": 12500, "savings_percent": 25.0},
+  "cost": {"total_cost_usd": 0.15, "total_savings_usd": 0.04},
+  "cache": {"entries": 10, "total_hits": 5}
+}
+```
+### Check SDK Stats
+```python
+# Quick session stats (no database query)
+stats = client.get_stats()
+print(stats)
+# {
+#   "session": {"requests_total": 10, "tokens_saved_total": 5000, ...},
+#   "config": {"mode": "optimize", "provider": "openai", ...},
+#   "transforms": {"smart_crusher_enabled": True, ...}
+# }
+# Validate setup is correct
+result = client.validate_setup()
+if not result["valid"]:
+    print("Setup issues:", result)
+```
+### Enable Logging
+```python
+import logging
+logging.basicConfig(level=logging.INFO)
+# Now you'll see:
+# INFO:headroom.transforms.pipeline:Pipeline complete: 45000 -> 4500 tokens (saved 40500, 90.0% reduction)
+# INFO:headroom.transforms.smart_crusher:SmartCrusher applied top_n strategy: kept 15 of 1000 items
+```
+---
+## Installation
+```bash
+# Core only (minimal dependencies: tiktoken, pydantic)
+pip install headroom-ai
+# With semantic relevance scoring (adds sentence-transformers)
+pip install "headroom-ai[relevance]"
+# With proxy server (adds fastapi, uvicorn)
+pip install "headroom-ai[proxy]"
+# With HTML reports (adds jinja2)
+pip install "headroom-ai[reports]"
+# Everything
+pip install "headroom-ai[all]"
+```
+**Requirements**: Python 3.10+
+---
+## Configuration
+### SDK Configuration
+```python
+from headroom import HeadroomClient, OpenAIProvider
+from openai import OpenAI
+# Full configuration example
+client = HeadroomClient(
+    original_client=OpenAI(),
+    provider=OpenAIProvider(),
+    default_mode="optimize",              # "audit" (observe only) or "optimize" (apply transforms)
+    enable_cache_optimizer=True,          # Enable provider-specific cache optimization
+    enable_semantic_cache=False,          # Enable query-level semantic caching
+    model_context_limits={                # Override default context limits
+        "gpt-4o": 128000,
+        "gpt-4o-mini": 128000,
+    },
+    # store_url defaults to temp directory; override with absolute path if needed:
+    # store_url="sqlite:////absolute/path/to/headroom.db",
+)
+```
+### Proxy Configuration
+```bash
+# Via command line
+headroom proxy \
+  --port 8787 \
+  --budget 10.00 \
+  --log-file headroom.jsonl
+# Disable optimization (passthrough mode)
+headroom proxy --no-optimize
+# Disable semantic caching
+headroom proxy --no-cache
+# See all options
+headroom proxy --help
+```
+### Per-Request Overrides
+```python
+# Override mode for specific requests
 response = client.chat.completions.create(
     model="gpt-4o",
     messages=[...],
+    headroom_mode="audit",              # Just observe, don't optimize
+    headroom_output_buffer_tokens=8000, # Reserve more for output
+    headroom_keep_turns=5,              # Keep last 5 turns
 )
 ```
+---
+## Modes
+| Mode | Behavior | Use Case |
+|------|----------|----------|
+| `audit` | Observes and logs, no modifications | Production monitoring, baseline measurement |
+| `optimize` | Applies safe, deterministic transforms | Production optimization |
+| `simulate` | Returns plan without API call | Testing, cost estimation |
 ```python
+# Simulate to see what would happen
+plan = client.chat.completions.simulate(
+    model="gpt-4o",
+    messages=large_conversation,
+)
+print(f"Would save {plan.tokens_saved} tokens")
+print(f"Transforms: {plan.transforms}")
+print(f"Estimated savings: {plan.estimated_savings}")
+```
+---
+## Error Handling
+Headroom provides explicit exceptions for debugging:
+```python
+from headroom import (
+    HeadroomClient,
+    HeadroomError,        # Base class - catch all Headroom errors
+    ConfigurationError,   # Invalid configuration
+    ProviderError,        # Provider issues (unknown model, etc.)
+    StorageError,         # Database/storage failures
+    CompressionError,     # Compression failures (rare - we fail safe)
+    ValidationError,      # Setup validation failures
+)
+try:
+    client = HeadroomClient(...)
+    response = client.chat.completions.create(...)
+except ConfigurationError as e:
+    print(f"Config issue: {e}")
+    print(f"Details: {e.details}")  # Additional context
+except StorageError as e:
+    print(f"Storage issue: {e}")
+    # Headroom continues to work, just without metrics persistence
+except HeadroomError as e:
+    print(f"Headroom error: {e}")
 ```
+**Safety guarantee**: If compression fails, the original content passes through unchanged. Your LLM calls never fail due to Headroom.
+---
+## How It Works
+### SmartCrusher: Statistical Compression
 ```python
 # Before: 50KB tool response with 1000 items
+{"results": [{"id": 1, "status": "ok", ...}, ... 1000 items ...]}
 # After: ~2KB with important items preserved
+# Headroom keeps:
 # - First 3 items (context)
 # - Last 2 items (recency)
+# - All error items (status != "ok")
+# - Statistical anomalies (values > 2 std dev from mean)
+# - Items matching user's query (BM25/embedding similarity)
 ```
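
The keep rules listed above can be sketched as a small selection function. This is a minimal illustration under stated assumptions, not Headroom's SmartCrusher: `select_items` is a hypothetical name, each item is assumed to carry a `status` string and a numeric `value` field, the population standard deviation is used for the anomaly check, and the query-relevance rule (BM25/embedding similarity) is left out.

```python
import statistics


def select_items(items, keep_first=3, keep_last=2, z_threshold=2.0):
    """Return the items kept by the rules above, in their original order."""
    # Rules: first N items (context) and last M items (recency).
    keep = set(range(min(keep_first, len(items))))
    keep |= set(range(max(len(items) - keep_last, 0), len(items)))

    # Rule: keep every item whose status is not "ok".
    keep |= {i for i, item in enumerate(items) if item.get("status") != "ok"}

    # Rule: keep values more than z_threshold standard deviations from the mean.
    # "value" is an assumed field name used only for this sketch.
    values = [item["value"] for item in items]
    if len(values) >= 2:
        mean = statistics.mean(values)
        stdev = statistics.pstdev(values)
        if stdev > 0:
            keep |= {
                i for i, v in enumerate(values)
                if abs(v - mean) > z_threshold * stdev
            }

    return [items[i] for i in sorted(keep)]
```

For example, given 100 otherwise identical items with one error at index 50 and one outlier at index 70, the function returns seven items: indices 0, 1, 2, 50, 70, 98, and 99.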
+### CacheAligner: Prefix Stabilization
 ```python
 # Before: Cache miss every day due to changing date
 "You are helpful. Today is January 7, 2025."
+# After: Stable prefix (cache hit!) + dynamic context moved to end
 "You are helpful."
+# Dynamic content: "Current date: January 7, 2025"
 ```
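
The prefix stabilization shown above can be approximated with a regular expression that lifts the date sentence out of the system prompt. This is a sketch only: `split_dynamic_date` and `DATE_SENTENCE` are hypothetical names, and the pattern assumes the date appears exactly as "Today is <Month> <day>, <year>." The real CacheAligner may detect dynamic content differently.

```python
import re

# Assumed pattern: matches sentences like "Today is January 7, 2025."
DATE_SENTENCE = re.compile(r"\s*Today is ([A-Z][a-z]+ \d{1,2}, \d{4})\.")


def split_dynamic_date(system_prompt: str):
    """Return (stable_prefix, dynamic_suffix); the suffix is None if no date is found."""
    match = DATE_SENTENCE.search(system_prompt)
    if match is None:
        return system_prompt, None
    # Remove the date sentence so the prefix is identical from day to day.
    stable = DATE_SENTENCE.sub("", system_prompt, count=1).strip()
    return stable, f"Current date: {match.group(1)}"
```

Because the returned prefix no longer changes daily, a provider-side prefix cache can match it across requests, while the dynamic suffix can be appended later in the prompt.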
+### RollingWindow: Context Management
 ```python
+# When context exceeds limit:
+# 1. Drop oldest tool outputs first (as atomic units with their calls)
+# 2. Drop oldest conversation turns
+# 3. NEVER drop the system prompt or the last N turns, and never orphan a tool response from its call
 ```
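
Step 1 of the list above, dropping the oldest tool outputs together with their calls, might look like the sketch below. Everything here is an assumption for illustration: `trim_tool_units` and `approx_tokens` are hypothetical names, token counts are estimated at roughly four characters per token instead of using a real tokenizer, and step 2 (dropping old conversation turns) is not implemented.

```python
def approx_tokens(message):
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return len(message.get("content") or "") // 4 + 1


def trim_tool_units(messages, max_tokens, keep_last=2):
    """Drop the oldest tool call/response units until the total fits max_tokens."""
    # The last keep_last messages are protected and never dropped.
    protected = set(range(max(len(messages) - keep_last, 0), len(messages)))
    dropped = set()
    total = sum(approx_tokens(m) for m in messages)

    for i, msg in enumerate(messages):
        if total <= max_tokens:
            break
        if msg.get("role") != "assistant" or not msg.get("tool_calls"):
            continue
        # A unit is the assistant tool call plus every matching tool response.
        call_ids = {call["id"] for call in msg["tool_calls"]}
        unit = {i} | {
            j for j, other in enumerate(messages)
            if other.get("role") == "tool" and other.get("tool_call_id") in call_ids
        }
        if unit & protected:
            continue  # never split a unit that reaches into the protected tail
        dropped |= unit
        total -= sum(approx_tokens(messages[j]) for j in unit)

    return [m for j, m in enumerate(messages) if j not in dropped]
```

System messages are never part of a unit, so they always survive, and a unit is removed whole or not at all, which keeps calls and responses paired.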
+---
+## Metrics & Monitoring
+### Prometheus Metrics (Proxy)
 ```bash
+curl http://localhost:8787/metrics
+```
+```
+# HELP headroom_requests_total Total requests processed
+headroom_requests_total{mode="optimize"} 1234
+# HELP headroom_tokens_saved_total Total tokens saved
+headroom_tokens_saved_total 5678900
+# HELP headroom_compression_ratio Compression ratio histogram
+headroom_compression_ratio_bucket{le="0.5"} 890
 ```
+### Query Stored Metrics (SDK)
 ```python
+from datetime import datetime, timedelta
+# Get recent metrics
+metrics = client.get_metrics(
+    start_time=datetime.utcnow() - timedelta(hours=1),
+    limit=100,
+)
+for m in metrics:
+    print(f"{m.timestamp}: {m.tokens_input_before} -> {m.tokens_input_after}")
+# Get summary statistics
+summary = client.get_summary()
+print(f"Total requests: {summary['total_requests']}")
+print(f"Total tokens saved: {summary['total_tokens_saved']}")
+```
+---
+## Troubleshooting
+### "Proxy won't start"
+```bash
+# Check if port is in use
+lsof -i :8787
+# Try a different port
+headroom proxy --port 8788
+# Check logs
+headroom proxy --log-level debug
 ```
+### "No token savings"
 ```python
+# 1. Verify mode is "optimize"
+stats = client.get_stats()
+print(stats["config"]["mode"])  # Should be "optimize"
+# 2. Check if transforms are enabled
+print(stats["transforms"])  # smart_crusher_enabled should be True
+# 3. Enable logging to see what's happening
+import logging
+logging.basicConfig(level=logging.DEBUG)
+# 4. Use simulate to see what WOULD happen
+plan = client.chat.completions.simulate(model="gpt-4o", messages=msgs)
+print(f"Transforms that would apply: {plan.transforms}")
 ```
+### "High latency"
 ```python
+# Headroom adds ~1-5ms overhead. If you see more:
+# 1. Check if embedding scorer is enabled (slower but better relevance)
+# Switch to BM25 for faster scoring:
+config.smart_crusher.relevance.tier = "bm25"
+# 2. Disable transforms you don't need
+config.cache_aligner.enabled = False  # If you don't need cache alignment
+# 3. Increase min_tokens_to_crush to skip small payloads
+config.smart_crusher.min_tokens_to_crush = 500
 ```
+### "Compression too aggressive"
 ```python
+# Keep more items
+config.smart_crusher.max_items_after_crush = 50  # Default is 15
+# Or disable compression for specific tools
+response = client.chat.completions.create(
+    model="gpt-4o",
+    messages=[...],
+    headroom_tool_profiles={
+        "important_tool": {"skip_compression": True}
+    }
 )
 ```
+---
 ## Supported Providers
+| Provider | Token Counting | Cache Optimization | Status |
+|----------|----------------|-------------------|--------|
+| OpenAI | tiktoken (exact) | Automatic prefix caching | Full |
+| Anthropic | Official API | cache_control blocks | Full |
+| Google | Official API | Context caching | Full |
+| Cohere | Official API | - | Full |
+| Mistral | Official tokenizer | - | Full |
+| LiteLLM | Via underlying provider | - | Full |
+---
 ## Safety Guarantees
 Headroom follows strict safety rules:
+1. **Never removes human content** - User/assistant messages are never compressed
+2. **Never breaks tool ordering** - Tool calls and responses stay paired as atomic units
 3. **Parse failures are no-ops** - Malformed content passes through unchanged
 4. **Preserves recency** - Last N turns are always kept
+5. **Errors surface, don't hide** - Explicit exceptions with context
+---
+## Performance
+| Scenario | Before | After | Savings | Overhead |
+|----------|--------|-------|---------|----------|
+| Search results (1000 items) | 45,000 tokens | 4,500 tokens | 90% | ~2ms |
+| Log analysis (500 entries) | 22,000 tokens | 3,300 tokens | 85% | ~1ms |
+| API response (nested JSON) | 15,000 tokens | 2,250 tokens | 85% | ~1ms |
+| Long conversation (50 turns) | 80,000 tokens | 32,000 tokens | 60% | ~3ms |
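
The Savings column in the table above follows directly from the Before and After token counts. A short check, with `savings_percent` as a hypothetical helper name:

```python
def savings_percent(tokens_before: int, tokens_after: int) -> float:
    """Percentage of input tokens removed by compression."""
    return (tokens_before - tokens_after) * 100 / tokens_before


# (scenario, tokens before, tokens after) copied from the table above
rows = [
    ("Search results", 45_000, 4_500),
    ("Log analysis", 22_000, 3_300),
    ("API response", 15_000, 2_250),
    ("Long conversation", 80_000, 32_000),
]
for name, before, after in rows:
    print(f"{name}: {savings_percent(before, after):.0f}%")
```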
+---
 ## Documentation
+- **[Quickstart Guide](docs/quickstart.md)** - Complete working examples
+- **[Proxy Documentation](docs/proxy.md)** - Production deployment
+- **[Transform Reference](docs/transforms.md)** - How each transform works
+- **[API Reference](docs/api.md)** - Complete API documentation
+- **[Troubleshooting](docs/troubleshooting.md)** - Common issues and solutions
+- **[Architecture](docs/ARCHITECTURE.md)** - How Headroom works internally
+---
+## Examples
+See the [`examples/`](examples/) directory for complete, runnable examples:
+- `basic_usage.py` - Simple SDK usage
+- `proxy_integration.py` - Using the proxy with different clients
+- `custom_compression.py` - Advanced compression configuration
+- `metrics_dashboard.py` - Building a metrics dashboard
+---
 ## Contributing
+We welcome contributions!
 ```bash
 # Development setup
 git clone https://github.com/headroom-sdk/headroom.git
 cd headroom
 pip install -e ".[dev]"
+# Run tests
 pytest
+# Run linting
+ruff check .
+mypy headroom
 ```
+See [CONTRIBUTING.md](CONTRIBUTING.md) for details.
+---
+## License
+Apache License 2.0 - see [LICENSE](LICENSE) for details.
 ---
 <p align="center">
+  <sub>Built for the AI developer community</sub>
 </p>

docs/quickstart.md ADDED

	@@ -0,0 +1,330 @@

+# Quickstart Guide
+Get Headroom running in 5 minutes with these copy-paste examples.
+---
+## Installation
+```bash
+# Core only (minimal dependencies)
+pip install headroom-ai
+# With proxy server
+pip install "headroom-ai[proxy]"
+# Everything
+pip install "headroom-ai[all]"
+```
+---
+## Option 1: Proxy Server (Zero Code Changes)
+The fastest way to start saving tokens. Works with any OpenAI-compatible client.
+### Step 1: Start the Proxy
+```bash
+headroom proxy --port 8787
+```
+### Step 2: Verify It's Running
+```bash
+curl http://localhost:8787/health
+# Expected: {"status": "healthy", "mode": "optimize", ...}
+```
+### Step 3: Point Your Client
+```bash
+# Claude Code
+ANTHROPIC_BASE_URL=http://localhost:8787 claude
+# Cursor / Continue / any OpenAI client
+OPENAI_BASE_URL=http://localhost:8787/v1 your-app
+# Python
+export OPENAI_BASE_URL=http://localhost:8787/v1
+python your_script.py
+```
+### Step 4: Check Savings
+```bash
+curl http://localhost:8787/stats
+# {"requests_total": 42, "tokens_saved_total": 125000, ...}
+```
+---
+## Option 2: Python SDK
+Wrap your existing client for fine-grained control.
+### Basic Example
+```python
+from headroom import HeadroomClient, OpenAIProvider
+from openai import OpenAI
+# Create wrapped client
+client = HeadroomClient(
+    original_client=OpenAI(),
+    provider=OpenAIProvider(),
+    default_mode="optimize",
+)
+# Use exactly like OpenAI client
+response = client.chat.completions.create(
+    model="gpt-4o",
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "Hello!"},
+    ],
+)
+print(response.choices[0].message.content)
+# Check what happened
+stats = client.get_stats()
+print(f"Tokens saved: {stats['session']['tokens_saved_total']}")
+```
+### With Tool Outputs (Where Savings Happen)
+```python
+from headroom import HeadroomClient, OpenAIProvider
+from openai import OpenAI
+import json
+client = HeadroomClient(
+    original_client=OpenAI(),
+    provider=OpenAIProvider(),
+    default_mode="optimize",
+)
+# Simulate a conversation with large tool outputs
+messages = [
+    {"role": "system", "content": "You analyze search results."},
+    {"role": "user", "content": "Search for Python tutorials."},
+    {
+        "role": "assistant",
+        "content": None,
+        "tool_calls": [{
+            "id": "call_1",
+            "type": "function",
+            "function": {"name": "search", "arguments": '{"q": "python"}'},
+        }],
+    },
+    {
+        "role": "tool",
+        "tool_call_id": "call_1",
+        # This is where Headroom shines - compressing large outputs
+        "content": json.dumps({
+            "results": [{"title": f"Result {i}", "score": 100-i} for i in range(500)]
+        }),
+    },
+    {"role": "user", "content": "What are the top 3 results?"},
+]
+# Headroom compresses the 500 results to ~15 (the default max_items_after_crush), keeping the most relevant
+response = client.chat.completions.create(
+    model="gpt-4o",
+    messages=messages,
+)
+print(response.choices[0].message.content)
+```
+### Simulate Before Sending
+Preview optimizations without making an API call:
+```python
+# See what would happen without calling the API
+plan = client.chat.completions.simulate(
+    model="gpt-4o",
+    messages=messages,
+)
+print(f"Tokens before: {plan.tokens_before}")
+print(f"Tokens after: {plan.tokens_after}")
+print(f"Would save: {plan.tokens_saved} tokens ({plan.tokens_saved/plan.tokens_before*100:.0f}%)")
+print(f"Transforms: {plan.transforms}")
+print(f"Estimated savings: {plan.estimated_savings}")
+```
+---
+## Option 3: Anthropic SDK
+```python
+from headroom import HeadroomClient, AnthropicProvider
+from anthropic import Anthropic
+client = HeadroomClient(
+    original_client=Anthropic(),
+    provider=AnthropicProvider(),
+    default_mode="optimize",
+)
+# Use Anthropic-style API
+response = client.messages.create(
+    model="claude-sonnet-4-20250514",
+    max_tokens=1024,
+    messages=[
+        {"role": "user", "content": "Hello, Claude!"},
+    ],
+)
+print(response.content[0].text)
+```
+---
+## Verify It's Working
+### Method 1: Enable Logging
+```python
+import logging
+logging.basicConfig(level=logging.INFO)
+# Now you'll see:
+# INFO:headroom.transforms.pipeline:Pipeline complete: 45000 -> 4500 tokens (saved 40500, 90.0% reduction)
+# INFO:headroom.transforms.smart_crusher:SmartCrusher: keeping 15 of 500 items
+```
+### Method 2: Check Session Stats
+```python
+stats = client.get_stats()
+print(stats)
+# {
+#   "session": {"requests_total": 10, "tokens_saved_total": 5000, ...},
+#   "config": {"mode": "optimize", "provider": "openai", ...},
+#   "transforms": {"smart_crusher_enabled": True, ...}
+# }
+```
+### Method 3: Validate Setup
+```python
+result = client.validate_setup()
+if not result["valid"]:
+    print("Setup issues:", result)
+else:
+    print("Setup OK!")
+    print(f"Provider: {result['provider']['name']}")
+    print(f"Storage: {result['storage']['url']}")
+```
+---
+## Common Configuration
+### Adjust Compression
+```python
+from headroom import HeadroomClient, OpenAIProvider, HeadroomConfig
+from openai import OpenAI
+config = HeadroomConfig()
+# Keep more items after compression (default: 15)
+config.smart_crusher.max_items_after_crush = 30
+# Only compress if tool output has > 500 tokens (default: 200)
+config.smart_crusher.min_tokens_to_crush = 500
+client = HeadroomClient(
+    original_client=OpenAI(),
+    provider=OpenAIProvider(),
+    config=config,  # Pass custom config
+    default_mode="optimize",
+)
+```
+### Skip Compression for Specific Tools
+```python
+response = client.chat.completions.create(
+    model="gpt-4o",
+    messages=messages,
+    headroom_tool_profiles={
+        "database_query": {"skip_compression": True},  # Never compress
+        "search": {"max_items": 50},  # Keep more items
+    },
+)
+```
+### Audit Mode (Observe Only)
+```python
+# Start in audit mode - see what WOULD be optimized
+client = HeadroomClient(
+    original_client=OpenAI(),
+    provider=OpenAIProvider(),
+    default_mode="audit",  # No modifications, just logging
+)
+# Override per-request
+response = client.chat.completions.create(
+    model="gpt-4o",
+    messages=messages,
+    headroom_mode="optimize",  # Enable for this request only
+)
+```
+---
+## What Gets Optimized?
+| Content Type | What Headroom Does | Typical Savings |
+|--------------|-------------------|-----------------|
+| **Tool outputs with lists** | Keeps errors, anomalies, high-score items | 70-90% |
+| **Repeated search results** | Deduplicates and samples | 60-80% |
+| **Long conversations** | Drops old turns, keeps recent | 40-60% |
+| **System prompts with dates** | Stabilizes for cache hits | Cache savings |
+---
+## Next Steps
+- **[Configuration Reference](api.md)** - All configuration options
+- **[Transform Reference](transforms.md)** - How each transform works
+- **[Troubleshooting](troubleshooting.md)** - Common issues and solutions
+- **[Examples](../examples/)** - More complete examples
+---
+## Quick Troubleshooting
+### "No token savings"
+```python
+# 1. Check mode
+stats = client.get_stats()
+print(stats["config"]["mode"])  # Should be "optimize"
+# 2. Enable logging to see what's happening
+import logging
+logging.basicConfig(level=logging.DEBUG)
+```
+### "High latency"
+```python
+# Use BM25 instead of embeddings for faster relevance scoring
+config.smart_crusher.relevance.tier = "bm25"
+```
+### "Compression too aggressive"
+```python
+# Keep more items
+config.smart_crusher.max_items_after_crush = 50
+```
+See [Troubleshooting Guide](troubleshooting.md) for more solutions.

docs/troubleshooting.md ADDED

	@@ -0,0 +1,442 @@

+# Troubleshooting Guide
+Solutions for common Headroom issues.
+---
+## Proxy Server Issues
+### "Proxy won't start"
+**Symptom**: `headroom proxy` fails or hangs.
+**Solutions**:
+```bash
+# 1. Check if port is already in use
+lsof -i :8787
+# If something is using the port, either kill it or use a different port
+# 2. Try a different port
+headroom proxy --port 8788
+# 3. Check for missing dependencies
+pip install "headroom-ai[proxy]"
+# 4. Run with debug logging
+headroom proxy --log-level debug
+```
+### "Connection refused" when calling proxy
+**Symptom**: `curl: (7) Failed to connect to localhost port 8787`
+**Solutions**:
+```bash
+# 1. Verify proxy is running
+curl http://localhost:8787/health
+# 2. Check if proxy started on a different port
+ps aux | grep headroom
+# 3. Check firewall settings (macOS)
+sudo pfctl -s rules | grep 8787
+```
+### "Proxy returns errors for some requests"
+**Symptom**: Some requests work, others fail with 502/503.
+**Solutions**:
+```bash
+# 1. Check proxy logs for the actual error
+headroom proxy --log-level debug
+# 2. Verify API key is set
+echo $OPENAI_API_KEY  # or ANTHROPIC_API_KEY
+# 3. Test the underlying API directly
+curl https://api.openai.com/v1/models -H "Authorization: Bearer $OPENAI_API_KEY"
+```
+---
+## SDK Issues
+### "No token savings"
+**Symptom**: `stats['session']['tokens_saved_total']` is 0.
+**Diagnosis**:
+```python
+# 1. Check mode
+stats = client.get_stats()
+print(f"Mode: {stats['config']['mode']}")  # Should be "optimize"
+# 2. Check transforms are enabled
+print(f"SmartCrusher: {stats['transforms']['smart_crusher_enabled']}")
+# 3. Check if content meets threshold
+# SmartCrusher only compresses tool outputs > 200 tokens by default
+```
+**Solutions**:
+```python
+# 1. Ensure mode is "optimize"
+client = HeadroomClient(
+    original_client=OpenAI(),
+    provider=OpenAIProvider(),
+    default_mode="optimize",  # NOT "audit"
+)
+# 2. Or override per-request
+response = client.chat.completions.create(
+    model="gpt-4o",
+    messages=messages,
+    headroom_mode="optimize",
+)
+# 3. Lower the compression threshold
+config = HeadroomConfig()
+config.smart_crusher.min_tokens_to_crush = 100  # Default is 200
+```
+**Why It Might Be 0**:
+- Mode is "audit" (observation only)
+- Messages don't contain tool outputs
+- Tool outputs are below the token threshold
+- Data isn't compressible (high uniqueness)
+### "Compression too aggressive"
+**Symptom**: LLM responses are missing information that was in tool outputs.
+**Solutions**:
+```python
+# 1. Keep more items
+config = HeadroomConfig()
+config.smart_crusher.max_items_after_crush = 50  # Default: 15
+# 2. Skip compression for specific tools
+response = client.chat.completions.create(
+    model="gpt-4o",
+    messages=messages,
+    headroom_tool_profiles={
+        "important_tool": {"skip_compression": True},
+    },
+)
+# 3. Disable SmartCrusher entirely
+config.smart_crusher.enabled = False
+```
+### "High latency"
+**Symptom**: Requests take longer than expected.
+**Diagnosis**:
+```python
+import time
+import logging
+logging.basicConfig(level=logging.DEBUG)
+start = time.time()
+response = client.chat.completions.create(...)
+print(f"Total time: {time.time() - start:.2f}s")
+# Check logs for:
+# - "SmartCrusher" timing
+# - "EmbeddingScorer" timing (slow if using embeddings)
+```
+**Solutions**:
+```python
+# 1. Use BM25 instead of embeddings (faster)
+config = HeadroomConfig()
+config.smart_crusher.relevance.tier = "bm25"  # Default may use embeddings
+# 2. Increase threshold to skip small payloads
+config.smart_crusher.min_tokens_to_crush = 500
+# 3. Disable transforms you don't need
+config.cache_aligner.enabled = False
+config.rolling_window.enabled = False
+```
+### "ValidationError on setup"
+**Symptom**: `validate_setup()` returns errors.
+**Common Issues**:
+```python
+result = client.validate_setup()
+print(result)
+# Provider error:
+# {"provider": {"ok": False, "error": "No API key"}}
+# → Set OPENAI_API_KEY or pass api_key to OpenAI()
+# Storage error:
+# {"storage": {"ok": False, "error": "unable to open database"}}
+# → Check path permissions, use :memory: for testing
+# Config error:
+# {"config": {"ok": False, "error": "Invalid mode"}}
+# → Use "audit" or "optimize" only
+```
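To turn that result into a short list of problems, the dictionary can be filtered for components whose `ok` flag is false. This is a minimal sketch that assumes the result shape shown above plus a top-level `valid` boolean, as used in the example script in this commit; `failed_checks` is a hypothetical helper, and the sample `result` is made up for illustration.

```python
def failed_checks(result: dict) -> dict:
    """Map each failing component name to its error message."""
    return {
        name: check.get("error", "unknown error")
        for name, check in result.items()
        # Skip non-dict entries such as the top-level "valid" flag
        if isinstance(check, dict) and not check.get("ok", False)
    }


# Hypothetical result, shaped like the examples above
result = {
    "valid": False,
    "provider": {"ok": True},
    "storage": {"ok": False, "error": "unable to open database"},
    "config": {"ok": False, "error": "Invalid mode"},
}

for name, error in failed_checks(result).items():
    print(f"{name}: {error}")
```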
+**Solutions**:
+```python
+# 1. For testing, use in-memory storage
+client = HeadroomClient(
+    original_client=OpenAI(),
+    provider=OpenAIProvider(),
+    store_url="sqlite:///:memory:",  # No file created
+)
+# 2. For temp directory storage
+import tempfile
+import os
+db_path = os.path.join(tempfile.gettempdir(), "headroom.db")
+client = HeadroomClient(
+    original_client=OpenAI(),
+    provider=OpenAIProvider(),
+    store_url=f"sqlite:///{db_path}",
+)
+```
+---
+## Import/Installation Issues
+### "ModuleNotFoundError: No module named 'headroom'"
+```bash
+# 1. Check it's installed in the right environment
+#    (the PyPI distribution is "headroom-ai"; the import name is still "headroom")
+pip show headroom-ai
+# 2. If using a virtual environment, ensure it's activated
+source venv/bin/activate  # or equivalent
+# 3. Reinstall
+pip install --upgrade headroom-ai
+```
+### "ImportError: cannot import name 'X' from 'headroom'"
+```python
+# Check available imports
+import headroom
+print(dir(headroom))
+# Common imports:
+from headroom import (
+    HeadroomClient,
+    OpenAIProvider,
+    AnthropicProvider,
+    HeadroomConfig,
+    # Exceptions
+    HeadroomError,
+    ConfigurationError,
+    ProviderError,
+)
+```
+### "Missing optional dependency"
+```bash
+# For proxy server
+pip install "headroom-ai[proxy]"
+# For embedding-based relevance scoring
+pip install "headroom-ai[relevance]"
+# For everything
+pip install "headroom-ai[all]"
+```
+---
+## Provider-Specific Issues
+### OpenAI: "Invalid API key"
+```python
+import os
+from openai import OpenAI
+from headroom import HeadroomClient, OpenAIProvider
+# Ensure key is set
+api_key = os.environ.get("OPENAI_API_KEY")
+if not api_key:
+    raise ValueError("OPENAI_API_KEY not set")
+client = HeadroomClient(
+    original_client=OpenAI(api_key=api_key),
+    provider=OpenAIProvider(),
+)
+```
+### Anthropic: "Authentication error"
+```python
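+import os
+from anthropic import Anthropic
+from headroom import HeadroomClient, AnthropicProvider
+# Ensure key is set
+api_key = os.environ.get("ANTHROPIC_API_KEY")
+if not api_key:
+    raise ValueError("ANTHROPIC_API_KEY not set")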
+client = HeadroomClient(
+    original_client=Anthropic(api_key=api_key),
+    provider=AnthropicProvider(),
+)
+```
+### "Unknown model" warnings
+```python
+# For custom/fine-tuned models, specify context limit
+client = HeadroomClient(
+    original_client=OpenAI(),
+    provider=OpenAIProvider(),
+    model_context_limits={
+        "ft:gpt-4o-2024-08-06:my-org::abc123": 128000,
+        "my-custom-model": 32000,
+    },
+)
+```
+---
+## Debugging Techniques
+### Enable Full Logging
+```python
+import logging
+# See everything
+logging.basicConfig(
+    level=logging.DEBUG,
+    format="%(asctime)s %(name)s %(levelname)s %(message)s",
+)
+# Or just Headroom logs
+logging.getLogger("headroom").setLevel(logging.DEBUG)
+```
+### Inspect Transform Results
+```python
+# Use simulate to see what would happen
+plan = client.chat.completions.simulate(
+    model="gpt-4o",
+    messages=messages,
+)
+print(f"Tokens: {plan.tokens_before} -> {plan.tokens_after}")
+print(f"Transforms: {plan.transforms}")
+print(f"Waste signals: {plan.waste_signals}")
+# See the actual optimized messages
+import json
+print(json.dumps(plan.messages_optimized, indent=2))
+```
+### Check Storage Contents
+```python
+from datetime import datetime, timedelta
+# Get recent metrics
+metrics = client.get_metrics(
+    start_time=datetime.utcnow() - timedelta(hours=1),
+    limit=10,
+)
+for m in metrics:
+    print(f"{m.timestamp}: {m.tokens_input_before} -> {m.tokens_input_after}")
+    print(f"  Transforms: {m.transforms_applied}")
+    if m.error:
+        print(f"  ERROR: {m.error}")
+```
+### Manual Transform Testing
+```python
+from headroom import SmartCrusher, Tokenizer
+from headroom.config import SmartCrusherConfig
+import json
+# Test compression directly
+config = SmartCrusherConfig()
+crusher = SmartCrusher(config)
+tokenizer = Tokenizer()
+messages = [
+    {"role": "tool", "content": json.dumps({"items": list(range(100))}), "tool_call_id": "1"}
+]
+result = crusher.apply(messages, tokenizer)
+print(f"Tokens: {result.tokens_before} -> {result.tokens_after}")
+print(f"Compressed content: {result.messages[0]['content'][:200]}...")
+```
+---
+## Error Reference
+| Exception | Meaning | Solution |
+|-----------|---------|----------|
+| `ConfigurationError` | Invalid config values | Check config parameters |
+| `ProviderError` | Provider issue (unknown model, etc.) | Set model_context_limits |
+| `StorageError` | Database issue | Check path/permissions |
+| `CompressionError` | Compression failed | Rare - check data format |
+| `TokenizationError` | Token counting failed | Check model name |
+| `ValidationError` | Setup validation failed | Run validate_setup() |
+### Handling Errors
+```python
+from headroom import (
+    HeadroomClient,
+    HeadroomError,
+    ConfigurationError,
+    StorageError,
+)
+try:
+    client = HeadroomClient(...)
+    response = client.chat.completions.create(...)
+except ConfigurationError as e:
+    print(f"Config issue: {e}")
+    print(f"Details: {e.details}")
+except StorageError as e:
+    print(f"Storage issue: {e}")
+    # Headroom continues to work, just without metrics persistence
+except HeadroomError as e:
+    print(f"Headroom error: {e}")
+```
+---
+## Getting Help
+1. **Enable debug logging** and check the output
+2. **Use simulate()** to see what transforms would apply
+3. **Check validate_setup()** for configuration issues
+4. **File an issue** at https://github.com/headroom-sdk/headroom/issues
+When filing an issue, include:
+- Headroom version (`pip show headroom-ai`)
+- Python version
+- Provider (OpenAI/Anthropic)
+- Debug log output
+- Minimal reproduction code

examples/basic_usage.py CHANGED

@@ -4,8 +4,13 @@ Basic usage example for Headroom SDK.
 This example shows how to wrap an OpenAI client with Headroom
 and use both audit and optimize modes.
 """
 import os
 import tempfile
@@ -14,6 +19,12 @@ from openai import OpenAI
 from headroom import HeadroomClient, OpenAIProvider
 # Load API key from .env.local
 load_dotenv(".env.local")
@@ -153,7 +164,7 @@ def example_get_metrics():
     print("METRICS EXAMPLE")
     print("=" * 50)
-    # Get summary statistics
     summary = client.get_summary()
     print(f"Total requests: {summary['total_requests']}")
     print(f"Total tokens saved: {summary['total_tokens_saved']}")
@@ -161,12 +172,65 @@ def example_get_metrics():
     print()
 if __name__ == "__main__":
     # Run all examples
     example_audit_mode()
     example_optimize_mode()
     example_simulate_mode()
     example_get_metrics()
     # Clean up
     client.close()

 This example shows how to wrap an OpenAI client with Headroom
 and use both audit and optimize modes.
+Run:
+    export OPENAI_API_KEY='sk-...'
+    python examples/basic_usage.py
 """
+import logging
 import os
 import tempfile
 from headroom import HeadroomClient, OpenAIProvider
+# Enable logging to see what Headroom is doing
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(name)s: %(message)s",
+)
 # Load API key from .env.local
 load_dotenv(".env.local")
     print("METRICS EXAMPLE")
     print("=" * 50)
+    # Get summary statistics from database
     summary = client.get_summary()
     print(f"Total requests: {summary['total_requests']}")
     print(f"Total tokens saved: {summary['total_tokens_saved']}")
     print()
+def example_validate_setup():
+    """Example of validating Headroom setup."""
+    print("=" * 50)
+    print("VALIDATE SETUP EXAMPLE")
+    print("=" * 50)
+    # Validate that everything is configured correctly
+    result = client.validate_setup()
+    if result["valid"]:
+        print("Setup is valid!")
+        print(f"  Provider: {result['provider']['name']}")
+        print(f"  Storage: {result['storage']['url']}")
+        print(f"  Mode: {result['config']['mode']}")
+    else:
+        print("Setup issues detected:")
+        for key, val in result.items():
+            if key != "valid" and not val.get("ok"):
+                print(f"  {key}: {val.get('error')}")
+    print()
+def example_get_stats():
+    """Example of getting quick session stats."""
+    print("=" * 50)
+    print("SESSION STATS EXAMPLE")
+    print("=" * 50)
+    # Get quick stats without database query
+    stats = client.get_stats()
+    print("Session stats:")
+    print(f"  Requests total: {stats['session']['requests_total']}")
+    print(f"  Requests optimized: {stats['session']['requests_optimized']}")
+    print(f"  Tokens saved: {stats['session']['tokens_saved_total']}")
+    print("\nConfiguration:")
+    print(f"  Mode: {stats['config']['mode']}")
+    print(f"  Provider: {stats['config']['provider']}")
+    print("\nTransforms enabled:")
+    print(f"  SmartCrusher: {stats['transforms']['smart_crusher_enabled']}")
+    print(f"  RollingWindow: {stats['transforms']['rolling_window_enabled']}")
+    print(f"  CacheAligner: {stats['transforms']['cache_aligner_enabled']}")
+    print()
 if __name__ == "__main__":
+    # First, validate the setup
+    example_validate_setup()
     # Run all examples
     example_audit_mode()
     example_optimize_mode()
     example_simulate_mode()
     example_get_metrics()
+    # Show session stats
+    example_get_stats()
     # Clean up
     client.close()

headroom/__init__.py CHANGED

@@ -1,41 +1,71 @@
 """
-Headroom - A safe, deterministic Context Budget Controller for LLM APIs.
 Headroom wraps LLM clients to provide:
-- Context waste detection and reporting
-- Tool output compression
-- Cache-aligned prefix optimization
-- Rolling window token management
-- Full streaming support
-Example usage:
     from headroom import HeadroomClient, OpenAIProvider
     from openai import OpenAI
-    base = OpenAI(api_key="...")
-    provider = OpenAIProvider()
     client = HeadroomClient(
-        original_client=base,
-        provider=provider,
-        store_url="sqlite:///headroom.db",
-        default_mode="audit",
     )
-    # Use like normal OpenAI client
-    resp = client.chat.completions.create(
         model="gpt-4o",
-        messages=[...],
-        headroom_mode="optimize",  # Enable optimization
     )
-    # Simulate without API call
     plan = client.chat.completions.simulate(
         model="gpt-4o",
-        messages=[...],
     )
     print(f"Would save {plan.tokens_saved} tokens")
 """
 from .cache import (

 """
+Headroom - The Context Optimization Layer for LLM Applications.
+Cut your LLM costs by 50-90% without losing accuracy.
 Headroom wraps LLM clients to provide:
+- Smart compression of tool outputs (keeps errors, anomalies, relevant items)
+- Cache-aligned prefix optimization for better provider cache hits
+- Rolling window token management for long conversations
+- Full streaming support with zero accuracy loss
+Quick Start:
     from headroom import HeadroomClient, OpenAIProvider
     from openai import OpenAI
+    # Wrap your existing client
     client = HeadroomClient(
+        original_client=OpenAI(),
+        provider=OpenAIProvider(),
+        default_mode="optimize",
     )
+    # Use exactly like the original client
+    response = client.chat.completions.create(
         model="gpt-4o",
+        messages=[
+            {"role": "user", "content": "Hello!"},
+        ],
     )
+    # Check savings
+    stats = client.get_stats()
+    print(f"Tokens saved: {stats['session']['tokens_saved_total']}")
+Verify It's Working:
+    # Validate configuration
+    result = client.validate_setup()
+    if not result["valid"]:
+        print("Issues:", result)
+    # Enable logging to see what's happening
+    import logging
+    logging.basicConfig(level=logging.INFO)
+    # INFO:headroom.transforms.pipeline:Pipeline complete: 45000 -> 4500 tokens
+Simulate Before Sending:
     plan = client.chat.completions.simulate(
         model="gpt-4o",
+        messages=large_messages,
     )
     print(f"Would save {plan.tokens_saved} tokens")
+    print(f"Transforms: {plan.transforms}")
+Error Handling:
+    from headroom import HeadroomError, ConfigurationError, ProviderError
+    try:
+        response = client.chat.completions.create(...)
+    except ConfigurationError as e:
+        print(f"Config issue: {e.details}")
+    except HeadroomError as e:
+        print(f"Headroom error: {e}")
+For more examples, see https://github.com/headroom-sdk/headroom/tree/main/examples
 """
 from .cache import (

headroom/client.py CHANGED

@@ -266,7 +266,7 @@ class HeadroomClient:
         self,
         original_client: Any,
         provider: Provider,
-        store_url: str = "sqlite:///headroom.db",
         default_mode: str = "audit",
         model_context_limits: dict[str, int] | None = None,
         cache_optimizer: BaseCacheOptimizer | None = None,
@@ -279,7 +279,7 @@ class HeadroomClient:
         Args:
             original_client: The underlying LLM client (OpenAI-compatible).
             provider: Provider instance for model-specific behavior.
-            store_url: Storage URL (sqlite:// or jsonl://).
             default_mode: Default mode ("audit" | "optimize").
             model_context_limits: Override context limits for models.
             cache_optimizer: Optional custom cache optimizer. If None and
@@ -289,6 +289,15 @@ class HeadroomClient:
         """
         self._original = original_client
         self._provider = provider
         self._store_url = store_url
         self._default_mode = HeadroomMode(default_mode)

         self,
         original_client: Any,
         provider: Provider,
+        store_url: str | None = None,
         default_mode: str = "audit",
         model_context_limits: dict[str, int] | None = None,
         cache_optimizer: BaseCacheOptimizer | None = None,
         Args:
             original_client: The underlying LLM client (OpenAI-compatible).
             provider: Provider instance for model-specific behavior.
+            store_url: Storage URL (sqlite:// or jsonl://). Defaults to temp dir.
             default_mode: Default mode ("audit" | "optimize").
             model_context_limits: Override context limits for models.
             cache_optimizer: Optional custom cache optimizer. If None and
         """
         self._original = original_client
         self._provider = provider
+        # Set default store_url to temp directory for better DevEx
+        if store_url is None:
+            import os
+            import tempfile
+            db_path = os.path.join(tempfile.gettempdir(), "headroom.db")
+            store_url = f"sqlite:///{db_path}"
         self._store_url = store_url
         self._default_mode = HeadroomMode(default_mode)
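The new fallback can be exercised in isolation. The sketch below pulls the same logic into a standalone function; `default_store_url` is a hypothetical name used only for illustration and is not part of the Headroom API.

```python
from __future__ import annotations

import os
import tempfile


def default_store_url(store_url: str | None = None, filename: str = "headroom.db") -> str:
    """Return store_url unchanged, or a SQLite URL inside the system temp directory."""
    if store_url is not None:
        return store_url
    db_path = os.path.join(tempfile.gettempdir(), filename)
    return f"sqlite:///{db_path}"


print(default_store_url())                      # path depends on the platform
print(default_store_url("sqlite:///:memory:"))  # explicit URLs pass through
```

One consequence of this default is that every process using the fallback on the same machine points at the same file in the temp directory. Passing an explicit `store_url` keeps metrics from separate projects apart.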

headroom/proxy/server.py CHANGED

@@ -1779,7 +1779,7 @@ def run_server(config: ProxyConfig | None = None):
 ║    /v1/retrieve/stats       CCR: Compression store stats             ║
 ║    /v1/retrieve/tool_call   CCR: Handle LLM tool calls               ║
 ║    /v1/feedback             CCR: Feedback loop stats & patterns      ║
-║    /v1/feedback/{tool}      CCR: Compression hints for a tool        ║
 ║    /v1/telemetry            Data flywheel: Telemetry stats           ║
 ║    /v1/telemetry/export     Data flywheel: Export for aggregation    ║
 ║    /v1/telemetry/tools      Data flywheel: Per-tool stats            ║

 ║    /v1/retrieve/stats       CCR: Compression store stats             ║
 ║    /v1/retrieve/tool_call   CCR: Handle LLM tool calls               ║
 ║    /v1/feedback             CCR: Feedback loop stats & patterns      ║
+║    /v1/feedback/{{tool}}    CCR: Compression hints for a tool        ║
 ║    /v1/telemetry            Data flywheel: Telemetry stats           ║
 ║    /v1/telemetry/export     Data flywheel: Export for aggregation    ║
 ║    /v1/telemetry/tools      Data flywheel: Per-tool stats            ║
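The one-character change matters because the banner is an f-string: single braces are evaluated as an expression, while doubled braces produce a literal brace pair. A minimal sketch of the behavior follows; the variable names are illustrative, and the assumption that the original crash was a `NameError` is an inference from the commit message, not something this diff shows.

```python
tool = "search"

# Single braces interpolate the variable's value
interpolated = f"/v1/feedback/{tool}"
print(interpolated)  # /v1/feedback/search

# Doubled braces emit literal braces; no variable lookup happens
literal = f"/v1/feedback/{{tool}}"
print(literal)  # /v1/feedback/{tool}


def render_unescaped() -> str:
    # No variable with this name exists, so evaluating the f-string fails
    return f"/v1/feedback/{undefined_tool}"  # noqa: F821


try:
    render_unescaped()
except NameError as exc:
    print(f"NameError: {exc}")
```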

headroom/relevance/embedding.py CHANGED

@@ -22,26 +22,44 @@ from __future__ import annotations
 import logging
 from typing import TYPE_CHECKING
-import numpy as np
 from .base import RelevanceScore, RelevanceScorer
 if TYPE_CHECKING:
     from sentence_transformers import SentenceTransformer
 logger = logging.getLogger(__name__)
-def _cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
     """Compute cosine similarity between two vectors.
     Args:
-        a: First vector.
-        b: Second vector.
     Returns:
         Cosine similarity in range [-1, 1], clamped to [0, 1].
     """
     norm_a = np.linalg.norm(a)
     norm_b = np.linalg.norm(b)
@@ -144,7 +162,7 @@ class EmbeddingScorer(RelevanceScorer):
         return self._model
-    def _encode(self, texts: list[str]) -> np.ndarray:
         """Encode texts to embeddings.
         Args:

 import logging
 from typing import TYPE_CHECKING
 from .base import RelevanceScore, RelevanceScorer
+# numpy is an optional dependency - import lazily
+_numpy = None
+def _get_numpy():
+    """Lazily import numpy."""
+    global _numpy
+    if _numpy is None:
+        try:
+            import numpy as np
+            _numpy = np
+        except ImportError as err:
+            raise ImportError(
+                "numpy is required for EmbeddingScorer. "
+                "Install with: pip install headroom-ai[relevance]"
+            ) from err
+    return _numpy
 if TYPE_CHECKING:
     from sentence_transformers import SentenceTransformer
 logger = logging.getLogger(__name__)
+def _cosine_similarity(a, b) -> float:
     """Compute cosine similarity between two vectors.
     Args:
+        a: First vector (numpy array).
+        b: Second vector (numpy array).
     Returns:
         Cosine similarity in range [-1, 1], clamped to [0, 1].
     """
+    np = _get_numpy()
     norm_a = np.linalg.norm(a)
     norm_b = np.linalg.norm(b)
         return self._model
+    def _encode(self, texts: list[str]):
         """Encode texts to embeddings.
         Args:
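The lazy-import pattern introduced in this file can be generalized to any optional dependency. The following is a sketch under stated assumptions, not the project's implementation: `lazy_import` and its `extra` parameter are hypothetical, and the install hint assumes the renamed `headroom-ai` distribution from this commit.

```python
import importlib
from types import ModuleType

# Cache of modules that have already been imported on demand
_cache: dict[str, ModuleType] = {}


def lazy_import(module_name: str, extra: str) -> ModuleType:
    """Import module_name on first use, with an actionable error if it is missing."""
    if module_name not in _cache:
        try:
            _cache[module_name] = importlib.import_module(module_name)
        except ImportError as err:
            raise ImportError(
                f"{module_name} is required for this feature. "
                f"Install with: pip install 'headroom-ai[{extra}]'"
            ) from err
    return _cache[module_name]


# A stdlib module stands in for an installed optional dependency
json_mod = lazy_import("json", extra="none")
print(json_mod.dumps({"ok": True}))

# A missing module produces the install hint instead of a bare ImportError
try:
    lazy_import("module_that_does_not_exist_xyz", extra="relevance")
except ImportError as err:
    print(err)
```

Deferring the import to the first call means that importing the package itself never touches the optional dependency, which is what keeps a core-only install from failing at import time.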

headroom/reporting/generator.py CHANGED

@@ -4,11 +4,27 @@ from __future__ import annotations
 from datetime import datetime
 from pathlib import Path
-from typing import Any
-from jinja2 import Template
 from ..storage import create_storage
 from ..utils import estimate_cost, format_cost
 # HTML template embedded as string
@@ -350,7 +366,7 @@ def generate_report(
             period = "All time"
         # Render template
-        template = Template(REPORT_TEMPLATE)
         html = template.render(
             generated_at=datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
             period=period,

 from datetime import datetime
 from pathlib import Path
+from typing import TYPE_CHECKING, Any
 from ..storage import create_storage
+if TYPE_CHECKING:
+    from jinja2 import Template
+def _get_jinja2_template(template_str: str):
+    """Lazily import jinja2 and create template."""
+    try:
+        from jinja2 import Template
+        return Template(template_str)
+    except ImportError as err:
+        raise ImportError(
+            "jinja2 is required for report generation. "
+            "Install with: pip install headroom-ai[reports]"
+        ) from err
 from ..utils import estimate_cost, format_cost
 # HTML template embedded as string
             period = "All time"
         # Render template
+        template = _get_jinja2_template(REPORT_TEMPLATE)
         html = template.render(
             generated_at=datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
             period=period,

pyproject.toml CHANGED

@@ -3,7 +3,7 @@ requires = ["hatchling"]
 build-backend = "hatchling.build"
 [project]
-name = "headroom"
 version = "0.2.0"
 description = "The Context Optimization Layer for LLM Applications - Cut costs by 50-90%"
 readme = "README.md"
@@ -76,7 +76,7 @@ dev = [
 ]
 # All optional dependencies
 all = [
-    "headroom[relevance,proxy,reports]",
 ]
 [project.scripts]

 build-backend = "hatchling.build"
 [project]
+name = "headroom-ai"
 version = "0.2.0"
 description = "The Context Optimization Layer for LLM Applications - Cut costs by 50-90%"
 readme = "README.md"
 ]
 # All optional dependencies
 all = [
+    "headroom-ai[relevance,proxy,reports]",
 ]
 [project.scripts]