Spaces:

minhtudragon
/

headroom

Build error

App Files Files Community

headroom / README.md

chopratejas

Fix HeadroomAgnoModel to optimize tool outputs at invoke level

39a55b4 5 months ago

preview code

Raw

History Blame

11.5 kB

	<p align="center">
	<h1 align="center">Headroom</h1>
	<p align="center">
	<strong>The Context Optimization Layer for LLM Applications</strong>
	</p>
	<p align="center">
	Tool outputs are 70-95% redundant boilerplate. Headroom compresses that away.
	</p>
	</p>

	<p align="center">
	<a href="https://github.com/chopratejas/headroom/actions/workflows/ci.yml">
	<img src="https://github.com/chopratejas/headroom/actions/workflows/ci.yml/badge.svg" alt="CI">
	</a>
	<a href="https://pypi.org/project/headroom-ai/">
	<img src="https://img.shields.io/pypi/v/headroom-ai.svg" alt="PyPI">
	</a>
	<a href="https://pypi.org/project/headroom-ai/">
	<img src="https://img.shields.io/pypi/pyversions/headroom-ai.svg" alt="Python">
	</a>
	<a href="https://github.com/chopratejas/headroom/blob/main/LICENSE">
	<img src="https://img.shields.io/badge/license-Apache%202.0-blue.svg" alt="License">
	</a>
	</p>


	---

	## Does It Actually Work? A Real Test

	The setup: 100 production log entries. One critical error buried at position 67.

	<details>
	<summary><b>BEFORE:</b> 100 log entries (18,952 chars) - click to expand</summary>

	```json
	[
	{"timestamp": "2024-12-15T00:00:00Z", "level": "INFO", "service": "api-gateway", "message": "Request processed successfully - latency=50ms", "request_id": "req-000000", "status_code": 200},
	{"timestamp": "2024-12-15T01:01:00Z", "level": "INFO", "service": "user-service", "message": "Request processed successfully - latency=51ms", "request_id": "req-000001", "status_code": 200},
	{"timestamp": "2024-12-15T02:02:00Z", "level": "INFO", "service": "inventory", "message": "Request processed successfully - latency=52ms", "request_id": "req-000002", "status_code": 200},
	// ... 64 more INFO entries ...
	{"timestamp": "2024-12-15T03:47:23Z", "level": "FATAL", "service": "payment-gateway", "message": "Connection pool exhausted", "error_code": "PG-5523", "resolution": "Increase max_connections to 500 in config/database.yml", "affected_transactions": 1847},
	// ... 32 more INFO entries ...
	]
	```
	</details>

	AFTER: Headroom compresses to 6 entries (1,155 chars):

	```json
	[
	{"timestamp": "2024-12-15T00:00:00Z", "level": "INFO", "service": "api-gateway", ...},
	{"timestamp": "2024-12-15T01:01:00Z", "level": "INFO", "service": "user-service", ...},
	{"timestamp": "2024-12-15T02:02:00Z", "level": "INFO", "service": "inventory", ...},
	{"timestamp": "2024-12-15T03:47:23Z", "level": "FATAL", "service": "payment-gateway", "error_code": "PG-5523", "resolution": "Increase max_connections to 500 in config/database.yml", "affected_transactions": 1847},
	{"timestamp": "2024-12-15T02:38:00Z", "level": "INFO", "service": "inventory", ...},
	{"timestamp": "2024-12-15T03:39:00Z", "level": "INFO", "service": "auth", ...}
	]
	```

	What happened: First 3 items + the FATAL error + last 2 items. The critical error at position 67 was automatically preserved.

	---

	The question we asked Claude: "What caused the outage? What's the error code? What's the fix?"

	\| \| Baseline \| Headroom \|
	\|--\|----------\|----------\|
	\| Input tokens \| 10,144 \| 1,260 \|
	\| Correct answers \| 4/4 \| 4/4 \|

	Both responses: "payment-gateway service, error PG-5523, fix: Increase max_connections to 500, 1,847 transactions affected"

	87.6% fewer tokens. Same answer.

	Run it yourself: `python examples/needle_in_haystack_test.py`

	---

	## Multi-Tool Agent Test: Real Function Calling

	The setup: An Agno agent with 4 tools (GitHub Issues, ArXiv Papers, Code Search, Database Logs) investigating a memory leak. Total tool output: 62,323 chars (~15,580 tokens).

	```python
	from agno.agent import Agent
	from agno.models.anthropic import Claude
	from headroom.integrations.agno import HeadroomAgnoModel

	# Wrap your model - that's it!
	base_model = Claude(id="claude-sonnet-4-20250514")
	model = HeadroomAgnoModel(wrapped_model=base_model)

	agent = Agent(model=model, tools=[search_github, search_arxiv, search_code, query_db])
	response = agent.run("Investigate the memory leak and recommend a fix")
	```

	Results with Claude Sonnet:

	\| \| Baseline \| Headroom \|
	\|--\|----------\|----------\|
	\| Tokens sent to API \| 15,662 \| 6,100 \|
	\| API requests \| 2 \| 2 \|
	\| Tool calls \| 4 \| 4 \|
	\| Duration \| 26.5s \| 27.0s \|

	76.3% fewer tokens. Same comprehensive answer.

	Both found: Issue #42 (memory leak), the `cleanup_worker()` fix, OutOfMemoryError logs (7.8GB/8GB, 847 threads), and relevant research papers.

	Run it yourself: `python examples/multi_tool_agent_test.py`

	---

	## How It Works

	Headroom doesn't summarize or truncate blindly. It uses statistical analysis:

	1. Detects redundancy - Repeated fields like `"language": "typescript"` across 100 items
	2. Keeps what matters - First items, last items, query-relevant matches, anomalies
	3. Preserves errors - Never drops items containing "error", "exception", "failed"
	4. Maintains schema - Output JSON structure stays identical

	The compression is reversible via CCR (Compress-Cache-Retrieve). If the LLM needs more data, it can request the original.

	---

	## Why Headroom?

	- Zero code changes - works as a transparent proxy
	- 47-92% savings - depends on your workload (tool-heavy = more savings)
	- Reversible compression - LLM retrieves original data via CCR
	- Content-aware - code, logs, JSON each handled optimally
	- Provider caching - automatic prefix optimization for cache hits
	- Framework native - LangChain, Agno, MCP, agents supported

	---

	## 30-Second Quickstart

	### Option 1: Proxy (Zero Code Changes)

	```bash
	pip install "headroom-ai[proxy]"
	headroom proxy --port 8787
	```

	Point your tools at the proxy:

	```bash
	# Claude Code
	ANTHROPIC_BASE_URL=http://localhost:8787 claude

	# Any OpenAI-compatible client
	OPENAI_BASE_URL=http://localhost:8787/v1 cursor
	```

	### Option 2: LangChain Integration

	```bash
	pip install "headroom-ai[langchain]"
	```

	```python
	from langchain_openai import ChatOpenAI
	from headroom.integrations import HeadroomChatModel

	# Wrap your model - that's it!
	llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

	# Use exactly like before
	response = llm.invoke("Hello!")
	```

	See the full [LangChain Integration Guide](docs/langchain.md) for memory, retrievers, agents, and more.

	### Option 3: Agno Integration

	```bash
	pip install "headroom-ai[agno]"
	```

	```python
	from agno.agent import Agent
	from agno.models.openai import OpenAIChat
	from headroom.integrations.agno import HeadroomAgnoModel

	# Wrap your model - that's it!
	model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
	agent = Agent(model=model)

	# Use exactly like before
	response = agent.run("Hello!")

	# Check savings
	print(f"Tokens saved: {model.total_tokens_saved}")
	```

	See the full [Agno Integration Guide](docs/agno.md) for hooks, multi-provider support, and more.

	---

	## Framework Integrations

	\| Framework \| Integration \| Docs \|
	\|-----------\|-------------\|------\|
	\| LangChain \| `HeadroomChatModel`, memory, retrievers, agents \| [Guide](docs/langchain.md) \|
	\| Agno \| `HeadroomAgnoModel`, hooks, multi-provider \| [Guide](docs/agno.md) \|
	\| MCP \| Tool output compression for Claude \| [Guide](docs/ccr.md) \|
	\| Any OpenAI Client \| Proxy server \| [Guide](docs/proxy.md) \|

	---

	## Features

	\| Feature \| Description \| Docs \|
	\|---------\|-------------\|------\|
	\| Memory \| Persistent memory across conversations (zero-latency inline extraction) \| [Memory](docs/memory.md) \|
	\| Universal Compression \| ML-based content detection + structure-preserving compression \| [Compression](docs/compression.md) \|
	\| SmartCrusher \| Compresses JSON tool outputs statistically \| [Transforms](docs/transforms.md) \|
	\| CacheAligner \| Stabilizes prefixes for provider caching \| [Transforms](docs/transforms.md) \|
	\| RollingWindow \| Manages context limits without breaking tools \| [Transforms](docs/transforms.md) \|
	\| CCR \| Reversible compression with automatic retrieval \| [CCR Guide](docs/ccr.md) \|
	\| LangChain \| Memory, retrievers, agents, streaming \| [LangChain](docs/langchain.md) \|
	\| Agno \| Agent framework integration with hooks \| [Agno](docs/agno.md) \|
	\| Text Utilities \| Opt-in compression for search/logs \| [Text Compression](docs/text-compression.md) \|
	\| LLMLingua-2 \| ML-based 20x compression (opt-in) \| [LLMLingua](docs/llmlingua.md) \|
	\| Code-Aware \| AST-based code compression (tree-sitter) \| [Transforms](docs/transforms.md) \|

	---

	## Verified Performance

	These numbers are from actual API calls, not estimates:

	\| Scenario \| Before \| After \| Savings \| Verified \|
	\|----------\|--------\|-------\|---------\|----------\|
	\| Code search (100 results) \| 17,765 tokens \| 1,408 tokens \| 92% \| Claude Sonnet \|
	\| SRE incident debugging \| 65,694 tokens \| 5,118 tokens \| 92% \| GPT-4o \|
	\| Codebase exploration \| 78,502 tokens \| 41,254 tokens \| 47% \| GPT-4o \|
	\| GitHub issue triage \| 54,174 tokens \| 14,761 tokens \| 73% \| GPT-4o \|

	Overhead: ~1-5ms compression latency

	When savings are highest: Tool-heavy workloads (search, logs, database queries)
	When savings are lowest: Conversation-heavy workloads with minimal tool use

	---

	## Providers

	\| Provider \| Token Counting \| Cache Optimization \|
	\|----------\|----------------\|-------------------\|
	\| OpenAI \| tiktoken (exact) \| Automatic prefix caching \|
	\| Anthropic \| Official API \| cache_control blocks \|
	\| Google \| Official API \| Context caching \|
	\| Cohere \| Official API \| - \|
	\| Mistral \| Official tokenizer \| - \|

	New models auto-supported via naming pattern detection.

	---

	## Safety Guarantees

	- Never removes human content - user/assistant messages preserved
	- Never breaks tool ordering - tool calls and responses stay paired
	- Parse failures are no-ops - malformed content passes through unchanged
	- Compression is reversible - LLM retrieves original data via CCR

	---

	## Installation

	```bash
	pip install headroom-ai # SDK only
	pip install "headroom-ai[proxy]" # Proxy server
	pip install "headroom-ai[langchain]" # LangChain integration
	pip install "headroom-ai[agno]" # Agno agent framework
	pip install "headroom-ai[code]" # AST-based code compression
	pip install "headroom-ai[llmlingua]" # ML-based compression
	pip install "headroom-ai[all]" # Everything
	```

	Requirements: Python 3.10+

	---

	## Documentation

	\| Guide \| Description \|
	\|-------\|-------------\|
	\| [Memory Guide](docs/memory.md) \| Persistent memory for LLMs \|
	\| [Compression Guide](docs/compression.md) \| Universal compression with ML detection \|
	\| [LangChain Integration](docs/langchain.md) \| Full LangChain support \|
	\| [Agno Integration](docs/agno.md) \| Full Agno agent framework support \|
	\| [SDK Guide](docs/sdk.md) \| Fine-grained control \|
	\| [Proxy Guide](docs/proxy.md) \| Production deployment \|
	\| [Configuration](docs/configuration.md) \| All options \|
	\| [CCR Guide](docs/ccr.md) \| Reversible compression \|
	\| [Metrics](docs/metrics.md) \| Monitoring \|
	\| [Troubleshooting](docs/troubleshooting.md) \| Common issues \|

	---

	## Who's Using Headroom?

	> Add your project here! [Open a PR](https://github.com/chopratejas/headroom/pulls) or [start a discussion](https://github.com/chopratejas/headroom/discussions).

	---

	## Contributing

	```bash
	git clone https://github.com/chopratejas/headroom.git
	cd headroom
	pip install -e ".[dev]"
	pytest
	```

	See [CONTRIBUTING.md](CONTRIBUTING.md) for details.

	---

	## License

	Apache License 2.0 - see [LICENSE](LICENSE).

	---

	<p align="center">
	<sub>Built for the AI developer community</sub>
	</p>