# Headroom

Compress everything your AI agent reads. Same answers, fraction of the tokens.

Every tool call, DB query, file read, and RAG retrieval your agent makes is 70-95% boilerplate.
Headroom compresses it away before it hits the model.

Works with any agent — coding agents (Claude Code, Codex, Cursor, Aider), custom agents
(LangChain, LangGraph, Agno, Strands, OpenClaw), or your own Python and TypeScript code.


---

## Where Headroom Fits

```
Your Agent / App
  (coding agents, customer support bots, RAG pipelines,
   data analysis agents, research agents, any LLM app)
        │
        │  tool calls, logs, DB reads, RAG results, file reads, API responses
        ▼
    Headroom  ← proxy, Python/TypeScript SDK, or framework integration
        │
        ▼
 LLM Provider  (OpenAI, Anthropic, Google, Bedrock, 100+ via LiteLLM)
```

Headroom sits between your application and the LLM provider. It intercepts requests, compresses the context, and forwards an optimized prompt. Use it as a transparent proxy (zero code changes), a Python function (`compress()`), or a framework integration (LangChain, LiteLLM, Agno).

### What gets compressed

Headroom optimizes any data your agent injects into a prompt:

- **Tool outputs** — shell commands, API calls, search results
- **Database queries** — SQL results, key-value lookups
- **RAG retrievals** — document chunks, embedding results
- **File reads** — code, logs, configs, CSVs
- **API responses** — JSON, XML, HTML
- **Conversation history** — long agent sessions with repetitive context

---

## Quick Start

**Python:**

```bash
pip install "headroom-ai[all]"
```

**TypeScript / Node.js:**

```bash
npm install headroom-ai
```

### Any agent — one function

**Python:**

```python
import anthropic
from headroom import compress

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
result = compress(messages, model="claude-sonnet-4-5-20250929")  # messages: your existing chat messages
# max_tokens is required by the Anthropic Messages API
response = client.messages.create(model="claude-sonnet-4-5-20250929", max_tokens=1024, messages=result.messages)
print(f"Saved {result.tokens_saved} tokens ({result.compression_ratio:.0%})")
```

**TypeScript:**

```typescript
import OpenAI from 'openai';
import { compress } from 'headroom-ai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const result = await compress(messages, { model: 'gpt-4o' });
const response = await openai.chat.completions.create({ model: 'gpt-4o', messages: result.messages });
console.log(`Saved ${result.tokensSaved} tokens`);
```

Works with any LLM client — Anthropic, OpenAI, LiteLLM, Bedrock, Vercel AI SDK, or your own code.

### Any agent — proxy (zero code changes)

```bash
headroom proxy --port 8787
```

```bash
# Run mode (default: token)
headroom proxy --mode token   # maximize compression
headroom proxy --mode cache   # preserve Anthropic/OpenAI prefix cache stability
```

```bash
# Point any LLM client at the proxy
ANTHROPIC_BASE_URL=http://localhost:8787 your-app
OPENAI_BASE_URL=http://localhost:8787/v1 your-app
```

Use `token` mode for short/medium sessions where raw compression savings matter most. Use `cache` mode for long-running chats where preserving prior-turn bytes improves provider cache reuse.

Works with any language, any tool, any framework. **[Proxy docs](docs/proxy.md)**

### Coding agents — one command

```bash
headroom wrap claude     # Starts proxy + launches Claude Code
headroom wrap codex      # Starts proxy + launches OpenAI Codex CLI
headroom wrap aider      # Starts proxy + launches Aider
headroom wrap cursor     # Starts proxy + prints Cursor config
headroom wrap openclaw   # Installs + configures OpenClaw plugin
```

Headroom starts a proxy, points your tool at it, and compresses everything automatically.

### Multi-agent — SharedContext

```python
from headroom import SharedContext

ctx = SharedContext()
ctx.put("research", big_agent_output)   # Agent A stores (compressed)
summary = ctx.get("research")           # Agent B reads (~80% smaller)
full = ctx.get("research", full=True)   # Agent B gets original if needed
```

Compress what moves between agents — any framework. **[SharedContext Guide](docs/shared-context.md)**

### MCP Tools (Claude Code, Cursor)

```bash
headroom mcp install && claude
```

Gives your AI tool three MCP tools: `headroom_compress`, `headroom_retrieve`, `headroom_stats`.
**[MCP Guide](docs/mcp.md)**

### Drop into your existing stack

| Your setup | Add Headroom | One-liner |
|------------|-------------|-----------|
| **Any Python app** | `compress()` | `result = compress(messages, model="gpt-4o")` |
| **Any TypeScript app** | `compress()` | `const result = await compress(messages, { model: 'gpt-4o' })` |
| **Vercel AI SDK** | Middleware | `wrapLanguageModel({ model, middleware: headroomMiddleware() })` |
| **OpenAI Node SDK** | Wrap client | `const client = withHeadroom(new OpenAI())` |
| **Anthropic TS SDK** | Wrap client | `const client = withHeadroom(new Anthropic())` |
| **Multi-agent** | SharedContext | `ctx = SharedContext(); ctx.put("key", data)` |
| **LiteLLM** | Callback | `litellm.callbacks = [HeadroomCallback()]` |
| **Any Python proxy** | ASGI Middleware | `app.add_middleware(CompressionMiddleware)` |
| **Agno agents** | Wrap model | `HeadroomAgnoModel(your_model)` |
| **LangChain** | Wrap model | `HeadroomChatModel(your_llm)` |
| **OpenClaw** | One-command wrap | `headroom wrap openclaw` |
| **Claude Code** | Wrap | `headroom wrap claude` |
| **Codex / Aider** | Wrap | `headroom wrap codex` or `headroom wrap aider` |

**[Full Integration Guide](docs/integration-guide.md)** | **[TypeScript SDK](docs/typescript-sdk.md)**

---

## Demo

[Image: Headroom Demo]

---

## Does It Actually Work?

**100 production log entries. One critical error buried at position 67.**

| | Baseline | Headroom |
|--|----------|----------|
| Input tokens | 10,144 | 1,260 |
| Correct answers | **4/4** | **4/4** |

Both responses: *"payment-gateway, error PG-5523, fix: Increase max_connections to 500, 1,847 transactions affected."*

**87.6% fewer tokens. Same answer.**

Run it: `python examples/needle_in_haystack_test.py`

**What Headroom kept:** From 100 log entries, SmartCrusher kept 6: first 3 (boundary), the FATAL error at position 67 (anomaly detection), and last 2 (recency). The error was automatically preserved — not by keyword matching, but by statistical analysis of field variance.

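
The boundary, anomaly, and recency selection described above can be illustrated with a small standalone sketch. This is not SmartCrusher's implementation: `select_entries`, its parameters, and the "rare value in a low-cardinality field" rule are assumptions chosen to show how an outlier can be kept without keyword matching.

```python
from collections import Counter


def select_entries(entries, head=3, tail=2, max_categories=10, rare_fraction=0.05):
    """Return indices to keep: first `head`, last `tail`, plus statistical outliers.

    Hypothetical helper for illustration; assumes flat dicts with hashable values.
    """
    n = len(entries)
    keep = set(range(min(head, n))) | set(range(max(n - tail, 0), n))

    # Count how often each value appears, per field.
    per_field = {}
    for entry in entries:
        for field, value in entry.items():
            per_field.setdefault(field, Counter())[value] += 1

    for field, counts in per_field.items():
        if len(counts) > max_categories:
            continue  # high-cardinality field (ids, timestamps): rarity says nothing
        rare = {v for v, c in counts.items() if c / n < rare_fraction}
        keep.update(i for i, e in enumerate(entries) if e.get(field) in rare)

    return sorted(keep)


logs = [{"id": i, "level": "INFO"} for i in range(100)]
logs[66]["level"] = "FATAL"  # position 67 when counting from 1

kept = select_entries(logs)
print(kept)  # [0, 1, 2, 66, 98, 99]
```

Nothing in the sketch looks for the string "FATAL"; the entry survives only because its `level` value is rare compared with the other 99 entries.
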
### Real Workloads

| Scenario | Before | After | Savings |
|----------|--------|-------|---------|
| Code search (100 results) | 17,765 | 1,408 | **92%** |
| SRE incident debugging | 65,694 | 5,118 | **92%** |
| Codebase exploration | 78,502 | 41,254 | **47%** |
| GitHub issue triage | 54,174 | 14,761 | **73%** |

### Accuracy Benchmarks

Compression preserves accuracy — tested on real OSS benchmarks.

**Standard Benchmarks** — Baseline (direct to API) vs Headroom (through proxy):

| Benchmark | Category | N | Baseline | Headroom | Delta |
|-----------|----------|---|----------|----------|-------|
| [GSM8K](https://huggingface.co/datasets/openai/gsm8k) | Math | 100 | 0.870 | 0.870 | **0.000** |
| [TruthfulQA](https://huggingface.co/datasets/truthfulqa/truthful_qa) | Factual | 100 | 0.530 | 0.560 | **+0.030** |

**Compression Benchmarks** — Accuracy after full compression stack:

| Benchmark | Category | N | Accuracy | Compression | Method |
|-----------|----------|---|----------|-------------|--------|
| [SQuAD v2](https://huggingface.co/datasets/rajpurkar/squad_v2) | QA | 100 | **97%** | 19% | Before/After |
| [BFCL](https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard) | Tool/Function | 100 | **97%** | 32% | LLM-as-Judge |
| Tool Outputs (built-in) | Agent | 8 | **100%** | 20% | Before/After |
| CCR Needle Retention | Lossless | 50 | **100%** | 77% | Exact Match |

Run it yourself:

```bash
# Quick smoke test (8 cases, ~10s)
python -m headroom.evals quick -n 8 --provider openai --model gpt-4o-mini

# Full Tier 1 suite (~$3, ~15 min)
python -m headroom.evals suite --tier 1 -o eval_results/

# CI mode (exit 1 on regression)
python -m headroom.evals suite --tier 1 --ci
```

Full methodology: [Benchmarks](docs/benchmarks.md) | [Evals Framework](headroom/evals/README.md)

---

## Key Capabilities

### Lossless Compression

Headroom never throws data away. It compresses aggressively, stores the originals, and gives the LLM a tool to retrieve full details when needed. When it compresses 500 items to 20, it tells the model *what was omitted* ("87 passed, 2 failed, 1 error") so the model knows when to ask for more.

### Smart Content Detection

Auto-detects what's in your context — JSON arrays, code, logs, plain text — and routes each to the best compressor. JSON goes to SmartCrusher, code goes through AST-aware compression (Python, JS, Go, Rust, Java, C++), text goes to Kompress (ModernBERT-based, with `[ml]` extra).

### Cache Optimization

Stabilizes message prefixes so your provider's KV cache actually works. Claude offers a 90% read discount on cached prefixes — but almost no framework takes advantage of it. Headroom does.

### Failure Learning

```bash
headroom learn                  # Analyze past Claude Code sessions, show recommendations
headroom learn --apply          # Write learnings to CLAUDE.md and MEMORY.md
headroom learn --all --apply    # Learn across all your projects
```

Reads your conversation history, finds every failed tool call, correlates it with what eventually succeeded, and writes specific corrections into your project files. Next session starts smarter. **[Learn docs](docs/learn.md)**
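
The idea of correlating a failed tool call with the call that eventually succeeded can be sketched in a few lines. This is an illustration only, not how `headroom learn` is implemented: the `ToolCall` record and `pair_failures_with_fixes` are hypothetical names, and real session history is richer than this.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical, simplified record of one tool invocation in a session."""
    tool: str
    args: str
    ok: bool


def pair_failures_with_fixes(calls):
    """Pair each failed call with the next successful call to the same tool."""
    pending = {}       # tool name -> failed args still waiting for a fix
    corrections = []   # (tool, failed_args, working_args)
    for call in calls:
        if not call.ok:
            pending.setdefault(call.tool, []).append(call.args)
        elif call.tool in pending:
            for failed in pending.pop(call.tool):
                corrections.append((call.tool, failed, call.args))
    return corrections


history = [
    ToolCall("bash", "pytest tests/", ok=False),
    ToolCall("read", "README.md", ok=True),
    ToolCall("bash", "python -m pytest tests/", ok=True),
]

for tool, failed, fixed in pair_failures_with_fixes(history):
    print(f"{tool}: use `{fixed}` instead of `{failed}`")
```

Each resulting tuple is the kind of specific correction that could be written into a project notes file so the next session avoids the same failure.
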

[Image: headroom learn demo]

### Image Compression

40-90% token reduction via trained ML router. Automatically selects the right resize/quality tradeoff per image.

### All features

| Feature | What it does |
|---------|-------------|
| **Content Router** | Auto-detects content type, routes to optimal compressor |
| **SmartCrusher** | Universal JSON compression — arrays of dicts, strings, numbers, mixed types, nested objects |
| **CodeCompressor** | AST-aware compression for Python, JS, Go, Rust, Java, C++ |
| **Kompress** | ModernBERT token compression (replaces LLMLingua-2) |
| **CCR** | Reversible compression — LLM retrieves originals when needed |
| **Compression Summaries** | Tells the LLM what was omitted ("3 errors, 12 failures") |
| **CacheAligner** | Stabilizes prefixes for provider KV cache hits |
| **IntelligentContext** | Score-based context management with learned importance |
| **Image Compression** | 40-90% token reduction via trained ML router |
| **Memory** | Persistent memory across conversations |
| **Compression Hooks** | Customize compression with pre/post hooks |
| **Read Lifecycle** | Detects stale/superseded Read outputs, replaces with CCR markers |
| **`headroom learn`** | Analyzes past failures, writes project-specific learnings to CLAUDE.md/MEMORY.md |
| **`headroom wrap`** | One-command setup for Claude Code, Codex, Aider, Cursor |
| **SharedContext** | Compressed inter-agent context sharing for multi-agent workflows |
| **MCP Tools** | headroom_compress, headroom_retrieve, headroom_stats for Claude Code/Cursor |

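
The CCR and Compression Summaries rows above describe a store-then-retrieve pattern: keep a small view, record what was omitted, and hold the original under a key the model can ask for. A minimal sketch of that pattern follows. The `CompressedStore` class, its method names, and the "keep the first N items" rule are assumptions for illustration, not Headroom's API.

```python
import hashlib
import json


class CompressedStore:
    """Hypothetical in-memory store illustrating reversible compression."""

    def __init__(self):
        self._originals = {}

    def compress(self, items, keep=3):
        """Return a small summary and stash the full list under a content key."""
        raw = json.dumps(items, sort_keys=True)
        key = hashlib.sha256(raw.encode()).hexdigest()[:12]
        self._originals[key] = items
        return {
            "kept": items[:keep],                   # the part the model sees
            "omitted": max(len(items) - keep, 0),   # tells the model what is missing
            "retrieve_key": key,                    # handle for fetching the original
        }

    def retrieve(self, key):
        """Return the untouched original for a previously issued key."""
        return self._originals[key]


store = CompressedStore()
rows = [{"id": i, "status": "passed"} for i in range(500)]

summary = store.compress(rows, keep=20)
print(len(summary["kept"]), summary["omitted"])  # 20 480

full = store.retrieve(summary["retrieve_key"])
print(full == rows)  # True
```

Because the original is stored rather than discarded, the compressed view can be as aggressive as needed without losing information.
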
---

## Headroom vs Alternatives

Context compression is a new space. Here's how the approaches differ:

| | Approach | Scope | Deploy as | Framework integrations | Data stays local? | Reversible |
|---|---|---|---|---|---|---|
| **Headroom** | Multi-algorithm compression | All context (tool outputs, DB reads, RAG, files, logs, history) | Proxy, Python library, ASGI middleware, or callback | LangChain, LangGraph, Agno, Strands, LiteLLM, MCP | Yes (OSS) | Yes (CCR) |
| **[RTK](https://github.com/rtk-ai/rtk)** | CLI command rewriter | Shell command outputs | CLI wrapper | None | Yes (OSS) | No |
| **[Compresr](https://compresr.ai)** | Cloud compression API | Text sent to their API | API call | None | No | No |
| **[Token Company](https://thetokencompany.ai)** | Cloud compression API | Text sent to their API | API call | None | No | No |

**Use it however you want.** Headroom works as a standalone proxy (`headroom proxy`), a one-function Python library (`compress()`), ASGI middleware, or a LiteLLM callback. Already using LiteLLM, LangChain, or Agno? Drop Headroom in without replacing anything.

**Headroom + RTK work well together.** RTK rewrites CLI commands (`git show` → `git show --short`), Headroom compresses everything else (JSON arrays, code, logs, RAG results, conversation history). Use both.

**Headroom vs cloud APIs.** Compresr and Token Company are hosted services — you send your context to their servers, they compress and return it. Headroom runs locally. Your data never leaves your machine. You also get lossless compression (CCR): the LLM can retrieve the full original when it needs more detail.

---

## How It Works Inside

```
Your prompt
    │
    ▼
1. CacheAligner          Stabilize prefix for KV cache
    │
    ▼
2. ContentRouter         Route each content type:
    │                      → SmartCrusher (JSON)
    │                      → CodeCompressor (code)
    │                      → Kompress (text, with [ml])
    ▼
3. IntelligentContext    Score-based token fitting
    │
    ▼
LLM Provider

Needs full details? LLM calls headroom_retrieve.
Originals are in the Compressed Store — nothing is thrown away.
```

**Overhead**: 15-200ms compression latency (net positive for Sonnet/Opus). Full data: [Latency Benchmarks](docs/LATENCY_BENCHMARKS.md)

---

## Integrations

| Integration | Status | Docs |
|-------------|--------|------|
| `headroom wrap claude/codex/aider/cursor` | **Stable** | [Proxy Docs](docs/proxy.md) |
| `compress()` — one function | **Stable** | [Integration Guide](docs/integration-guide.md) |
| `SharedContext` — multi-agent | **Stable** | [SharedContext Guide](docs/shared-context.md) |
| LiteLLM callback | **Stable** | [Integration Guide](docs/integration-guide.md#litellm) |
| ASGI middleware | **Stable** | [Integration Guide](docs/integration-guide.md#asgi-middleware) |
| Proxy server | **Stable** | [Proxy Docs](docs/proxy.md) |
| Agno | **Stable** | [Agno Guide](docs/agno.md) |
| MCP (Claude Code, Cursor, etc.) | **Stable** | [MCP Guide](docs/mcp.md) |
| Strands | **Stable** | [Strands Guide](docs/strands.md) |
| LangChain | **Stable** | [LangChain Guide](docs/langchain.md) |
| **OpenClaw** | **Stable** | [OpenClaw plugin](#openclaw-plugin) |

---

## OpenClaw Plugin

The [`@headroom-ai/openclaw`](plugins/openclaw) plugin integrates Headroom as a ContextEngine for [OpenClaw](https://github.com/openclaw/openclaw). It compresses tool outputs, code, logs, and structured data inline — 70-90% token savings with zero LLM calls. The plugin can connect to a local or remote Headroom proxy and will auto-start one locally if needed.

### Install

```bash
pip install "headroom-ai[proxy]"
openclaw plugins install --dangerously-force-unsafe-install headroom-ai/openclaw
```

> **Why `--dangerously-force-unsafe-install`?** The plugin auto-starts `headroom proxy` as a subprocess when no running proxy is detected. OpenClaw blocks process-launching plugins by default, so this flag is required to permit that behavior.

Once installed, assign Headroom as the context engine in your OpenClaw config:

```json
{
  "plugins": {
    "entries": {
      "headroom": { "enabled": true }
    },
    "slots": {
      "contextEngine": "headroom"
    }
  }
}
```

The plugin auto-detects and auto-starts the proxy — no manual proxy management needed. See the [plugin README](plugins/openclaw/README.md) for full configuration options, local development setup, and launcher details.

---

## Cloud Providers

```bash
headroom proxy --backend bedrock --region us-east-1      # AWS Bedrock
headroom proxy --backend vertex_ai --region us-central1  # Google Vertex
headroom proxy --backend azure                           # Azure OpenAI
headroom proxy --backend openrouter                      # OpenRouter (400+ models)
```

---

## Installation

Requires Python 3.10+.

```bash
pip install headroom-ai                # Core library
pip install "headroom-ai[all]"         # Everything including evals (recommended)
pip install "headroom-ai[proxy]"       # Proxy server + MCP tools
pip install "headroom-ai[mcp]"         # MCP tools only (no proxy)
pip install "headroom-ai[ml]"          # ML compression (Kompress, requires torch)
pip install "headroom-ai[agno]"        # Agno integration
pip install "headroom-ai[langchain]"   # LangChain (experimental)
pip install "headroom-ai[evals]"       # Evaluation framework only
```

### Container images (GHCR tags)

- Supported platforms: `linux/amd64`, `linux/arm64`
- `:code` tags: image with Code-Aware Compression (AST-based), i.e. `pip install "headroom-ai[proxy,code]"`
- `:slim` tags: image with distroless base

| Tag | Image | Extras | Docker Bake target |
|---------------------|------------------------------------------------------|--------------|-----------------------------|
| `` | `ghcr.io/chopratejas/headroom:` | `proxy` | `runtime` |
| `latest` | `ghcr.io/chopratejas/headroom:latest` | `proxy` | `runtime` |
| `nonroot` | `ghcr.io/chopratejas/headroom:nonroot` | `proxy` | `runtime-nonroot` |
| `code` | `ghcr.io/chopratejas/headroom:code` | `proxy,code` | `runtime-code` |
| `code-nonroot` | `ghcr.io/chopratejas/headroom:code-nonroot` | `proxy,code` | `runtime-code-nonroot` |
| `slim` | `ghcr.io/chopratejas/headroom:slim` | `proxy` | `runtime-slim` |
| `slim-nonroot` | `ghcr.io/chopratejas/headroom:slim-nonroot` | `proxy` | `runtime-slim-nonroot` |
| `code-slim` | `ghcr.io/chopratejas/headroom:code-slim` | `proxy,code` | `runtime-code-slim` |
| `code-slim-nonroot` | `ghcr.io/chopratejas/headroom:code-slim-nonroot` | `proxy,code` | `runtime-code-slim-nonroot` |

### Docker Bake

```bash
# List all available build targets
docker buildx bake --list targets

# Build default image locally (proxy + nonroot)
docker buildx bake runtime-default

# Build one variant and load to local Docker image store
docker buildx bake runtime-code-slim-nonroot \
  --set runtime-code-slim-nonroot.platform=linux/amd64 \
  --set runtime-code-slim-nonroot.tags=headroom:local \
  --load
```

---

## Documentation

| | |
|---|---|
| [Integration Guide](docs/integration-guide.md) | LiteLLM, ASGI, compress(), proxy |
| [Proxy Docs](docs/proxy.md) | Proxy server configuration |
| [Architecture](docs/ARCHITECTURE.md) | How the pipeline works |
| [CCR Guide](docs/ccr.md) | Reversible compression |
| [Benchmarks](docs/benchmarks.md) | Accuracy validation |
| [Latency Benchmarks](docs/LATENCY_BENCHMARKS.md) | Compression overhead & cost-benefit analysis |
| [Limitations](docs/LIMITATIONS.md) | When compression helps, when it doesn't |
| [Evals Framework](headroom/evals/README.md) | Prove compression preserves accuracy |
| [Memory](docs/memory.md) | Persistent memory |
| [Agno](docs/agno.md) | Agno agent framework |
| [MCP](docs/mcp.md) | Context engineering toolkit (compress, retrieve, stats) |
| [SharedContext](docs/shared-context.md) | Compressed inter-agent context sharing |
| [Learn](docs/learn.md) | Offline failure learning for coding agents |
| [Configuration](docs/configuration.md) | All options |

---

## Community

Questions, feedback, or just want to follow along? **[Join us on Discord](https://discord.gg/yRmaUNpsPJ)**

---

## Contributing

```bash
git clone https://github.com/chopratejas/headroom.git && cd headroom
pip install -e ".[dev]" && pytest
```

---

## License

Apache License 2.0 — see [LICENSE](LICENSE).