# Headroom

The Context Optimization Layer for LLM Applications

Tool outputs are 70-95% redundant boilerplate. Headroom compresses that away.


---

## Does It Actually Work? A Real Test

**The setup:** 100 production log entries. One critical error buried at position 67.

**BEFORE:** 100 log entries (18,952 chars):

```json
[
  {"timestamp": "2024-12-15T00:00:00Z", "level": "INFO", "service": "api-gateway", "message": "Request processed successfully - latency=50ms", "request_id": "req-000000", "status_code": 200},
  {"timestamp": "2024-12-15T01:01:00Z", "level": "INFO", "service": "user-service", "message": "Request processed successfully - latency=51ms", "request_id": "req-000001", "status_code": 200},
  {"timestamp": "2024-12-15T02:02:00Z", "level": "INFO", "service": "inventory", "message": "Request processed successfully - latency=52ms", "request_id": "req-000002", "status_code": 200},
  // ... 64 more INFO entries ...
  {"timestamp": "2024-12-15T03:47:23Z", "level": "FATAL", "service": "payment-gateway", "message": "Connection pool exhausted", "error_code": "PG-5523", "resolution": "Increase max_connections to 500 in config/database.yml", "affected_transactions": 1847},
  // ... 32 more INFO entries ...
]
```

**AFTER:** Headroom compresses to 6 entries (1,155 chars):

```json
[
  {"timestamp": "2024-12-15T00:00:00Z", "level": "INFO", "service": "api-gateway", ...},
  {"timestamp": "2024-12-15T01:01:00Z", "level": "INFO", "service": "user-service", ...},
  {"timestamp": "2024-12-15T02:02:00Z", "level": "INFO", "service": "inventory", ...},
  {"timestamp": "2024-12-15T03:47:23Z", "level": "FATAL", "service": "payment-gateway", "error_code": "PG-5523", "resolution": "Increase max_connections to 500 in config/database.yml", "affected_transactions": 1847},
  {"timestamp": "2024-12-15T02:38:00Z", "level": "INFO", "service": "inventory", ...},
  {"timestamp": "2024-12-15T03:39:00Z", "level": "INFO", "service": "auth", ...}
]
```

**What happened:** First 3 items + the FATAL error + last 2 items. The critical error at position 67 was automatically preserved.

---

**The question we asked Claude:** "What caused the outage? What's the error code? What's the fix?"

| | Baseline | Headroom |
|--|----------|----------|
| Input tokens | 10,144 | 1,260 |
| Correct answers | **4/4** | **4/4** |

Both responses: *"payment-gateway service, error PG-5523, fix: Increase max_connections to 500, 1,847 transactions affected"*

**87.6% fewer tokens. Same answer.**

Run it yourself: `python examples/needle_in_haystack_test.py`

---

## Multi-Tool Agent Test: Real Function Calling

**The setup:** An Agno agent with 4 tools (GitHub Issues, ArXiv Papers, Code Search, Database Logs) investigating a memory leak. Total tool output: 62,323 chars (~15,580 tokens).

```python
from agno.agent import Agent
from agno.models.anthropic import Claude
from headroom.integrations.agno import HeadroomAgnoModel

# Wrap your model - that's it!
base_model = Claude(id="claude-sonnet-4-20250514")
model = HeadroomAgnoModel(wrapped_model=base_model)

agent = Agent(model=model, tools=[search_github, search_arxiv, search_code, query_db])
response = agent.run("Investigate the memory leak and recommend a fix")
```

**Results with Claude Sonnet:**

| | Baseline | Headroom |
|--|----------|----------|
| Tokens sent to API | 15,662 | 6,100 |
| API requests | 2 | 2 |
| Tool calls | 4 | 4 |
| Duration | 26.5s | 27.0s |

**76.3% fewer tokens. Same comprehensive answer.**

Both found: Issue #42 (memory leak), the `cleanup_worker()` fix, OutOfMemoryError logs (7.8GB/8GB, 847 threads), and relevant research papers.

Run it yourself: `python examples/multi_tool_agent_test.py`

---

## How It Works

Headroom doesn't summarize or truncate blindly. It uses **statistical analysis**:

1. **Detects redundancy** - Repeated fields like `"language": "typescript"` across 100 items
2. **Keeps what matters** - First items, last items, query-relevant matches, anomalies
3. **Preserves errors** - Never drops items containing "error", "exception", "failed"
4. **Maintains schema** - Output JSON structure stays identical

The compression is **reversible** via CCR (Compress-Cache-Retrieve). If the LLM needs more data, it can request the original.

---

## Why Headroom?

- **Zero code changes** - works as a transparent proxy
- **47-92% savings** - depends on your workload (tool-heavy = more savings)
- **Reversible compression** - LLM retrieves original data via CCR
- **Content-aware** - code, logs, JSON each handled optimally
- **Provider caching** - automatic prefix optimization for cache hits
- **Framework native** - LangChain, Agno, MCP, agents supported

---

## 30-Second Quickstart

### Option 1: Proxy (Zero Code Changes)

```bash
pip install "headroom-ai[proxy]"
headroom proxy --port 8787
```

Point your tools at the proxy:

```bash
# Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude

# Any OpenAI-compatible client
OPENAI_BASE_URL=http://localhost:8787/v1 cursor
```

### Option 2: LangChain Integration

```bash
pip install "headroom-ai[langchain]"
```

```python
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel

# Wrap your model - that's it!
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

# Use exactly like before
response = llm.invoke("Hello!")
```

See the full [LangChain Integration Guide](docs/langchain.md) for memory, retrievers, agents, and more.

### Option 3: Agno Integration

```bash
pip install "headroom-ai[agno]"
```

```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel

# Wrap your model - that's it!
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
agent = Agent(model=model)

# Use exactly like before
response = agent.run("Hello!")

# Check savings
print(f"Tokens saved: {model.total_tokens_saved}")
```

See the full [Agno Integration Guide](docs/agno.md) for hooks, multi-provider support, and more.
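
Whichever option you pick, the same kind of reduction is applied to tool outputs before they reach the model. For intuition, the selection rule described under "How It Works" (first items, last items, anything that looks like an error) can be sketched in plain Python. This is an illustrative approximation, not Headroom's implementation: `compress_items`, `looks_critical`, the `keep_first`/`keep_last` defaults, and the `ERROR_LEVELS` set are hypothetical names chosen for this example, and the statistical analysis and query-relevance matching that Headroom performs are left out.

```python
import json

# Keywords taken from the "Preserves errors" rule in this README.
ERROR_KEYWORDS = ("error", "exception", "failed")
# Assumption for this sketch: these log levels also count as critical.
ERROR_LEVELS = {"ERROR", "FATAL", "CRITICAL"}


def looks_critical(item: dict) -> bool:
    """Return True if the item mentions an error keyword or has a critical level."""
    text = json.dumps(item).lower()  # searches keys and values alike
    return item.get("level") in ERROR_LEVELS or any(k in text for k in ERROR_KEYWORDS)


def compress_items(items: list, keep_first: int = 3, keep_last: int = 2) -> list:
    """Keep the first and last few items plus every critical one, in original order."""
    n = len(items)
    keep = set(range(min(keep_first, n)))          # head of the list
    keep.update(range(max(n - keep_last, 0), n))   # tail of the list
    keep.update(i for i, item in enumerate(items) if looks_critical(item))
    return [items[i] for i in sorted(keep)]


# Synthetic data shaped like the log test above: 100 entries, one FATAL at index 67.
logs = [
    {
        "level": "INFO",
        "service": "api-gateway",
        "message": f"Request processed successfully - latency={50 + i}ms",
        "request_id": f"req-{i:06d}",
    }
    for i in range(100)
]
logs[67] = {
    "level": "FATAL",
    "service": "payment-gateway",
    "message": "Connection pool exhausted",
    "error_code": "PG-5523",
}

compressed = compress_items(logs)
print(len(logs), "->", len(compressed))  # 100 -> 6
```

Because the kept indices are sorted before the output is rebuilt, the surviving items stay in their original order and each item keeps its full set of fields.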

---

## Framework Integrations

| Framework | Integration | Docs |
|-----------|-------------|------|
| **LangChain** | `HeadroomChatModel`, memory, retrievers, agents | [Guide](docs/langchain.md) |
| **Agno** | `HeadroomAgnoModel`, hooks, multi-provider | [Guide](docs/agno.md) |
| **MCP** | Tool output compression for Claude | [Guide](docs/ccr.md) |
| **Any OpenAI Client** | Proxy server | [Guide](docs/proxy.md) |

---

## Features

| Feature | Description | Docs |
|---------|-------------|------|
| **Memory** | Persistent memory across conversations (zero-latency inline extraction) | [Memory](docs/memory.md) |
| **Universal Compression** | ML-based content detection + structure-preserving compression | [Compression](docs/compression.md) |
| **SmartCrusher** | Compresses JSON tool outputs statistically | [Transforms](docs/transforms.md) |
| **CacheAligner** | Stabilizes prefixes for provider caching | [Transforms](docs/transforms.md) |
| **RollingWindow** | Manages context limits without breaking tools | [Transforms](docs/transforms.md) |
| **CCR** | Reversible compression with automatic retrieval | [CCR Guide](docs/ccr.md) |
| **LangChain** | Memory, retrievers, agents, streaming | [LangChain](docs/langchain.md) |
| **Agno** | Agent framework integration with hooks | [Agno](docs/agno.md) |
| **Text Utilities** | Opt-in compression for search/logs | [Text Compression](docs/text-compression.md) |
| **LLMLingua-2** | ML-based 20x compression (opt-in) | [LLMLingua](docs/llmlingua.md) |
| **Code-Aware** | AST-based code compression (tree-sitter) | [Transforms](docs/transforms.md) |

---

## Verified Performance

These numbers are from actual API calls, not estimates:

| Scenario | Before | After | Savings | Verified |
|----------|--------|-------|---------|----------|
| Code search (100 results) | 17,765 tokens | 1,408 tokens | 92% | Claude Sonnet |
| SRE incident debugging | 65,694 tokens | 5,118 tokens | 92% | GPT-4o |
| Codebase exploration | 78,502 tokens | 41,254 tokens | 47% | GPT-4o |
| GitHub issue triage | 54,174 tokens | 14,761 tokens | 73% | GPT-4o |

**Overhead**: ~1-5ms compression latency

**When savings are highest**: Tool-heavy workloads (search, logs, database queries)

**When savings are lowest**: Conversation-heavy workloads with minimal tool use

---

## Providers

| Provider | Token Counting | Cache Optimization |
|----------|----------------|-------------------|
| OpenAI | tiktoken (exact) | Automatic prefix caching |
| Anthropic | Official API | cache_control blocks |
| Google | Official API | Context caching |
| Cohere | Official API | - |
| Mistral | Official tokenizer | - |

New models auto-supported via naming pattern detection.

---

## Safety Guarantees

- **Never removes human content** - user/assistant messages preserved
- **Never breaks tool ordering** - tool calls and responses stay paired
- **Parse failures are no-ops** - malformed content passes through unchanged
- **Compression is reversible** - LLM retrieves original data via CCR

---

## Installation

```bash
pip install headroom-ai                # SDK only
pip install "headroom-ai[proxy]"       # Proxy server
pip install "headroom-ai[langchain]"   # LangChain integration
pip install "headroom-ai[agno]"        # Agno agent framework
pip install "headroom-ai[code]"        # AST-based code compression
pip install "headroom-ai[llmlingua]"   # ML-based compression
pip install "headroom-ai[all]"         # Everything
```

**Requirements**: Python 3.10+

---

## Documentation

| Guide | Description |
|-------|-------------|
| [Memory Guide](docs/memory.md) | Persistent memory for LLMs |
| [Compression Guide](docs/compression.md) | Universal compression with ML detection |
| [LangChain Integration](docs/langchain.md) | Full LangChain support |
| [Agno Integration](docs/agno.md) | Full Agno agent framework support |
| [SDK Guide](docs/sdk.md) | Fine-grained control |
| [Proxy Guide](docs/proxy.md) | Production deployment |
| [Configuration](docs/configuration.md) | All options |
| [CCR Guide](docs/ccr.md) | Reversible compression |
| [Metrics](docs/metrics.md) | Monitoring |
| [Troubleshooting](docs/troubleshooting.md) | Common issues |

---

## Who's Using Headroom?

> Add your project here! [Open a PR](https://github.com/chopratejas/headroom/pulls) or [start a discussion](https://github.com/chopratejas/headroom/discussions).

---

## Contributing

```bash
git clone https://github.com/chopratejas/headroom.git
cd headroom
pip install -e ".[dev]"
pytest
```

See [CONTRIBUTING.md](CONTRIBUTING.md) for details.

---

## License

Apache License 2.0 - see [LICENSE](LICENSE).

---

Built for the AI developer community