# Agno Integration Headroom integrates with [Agno](https://github.com/agno-agi/agno) (formerly Phidata) to provide automatic context optimization for AI agents. This guide covers model wrapping, observability hooks, and multi-provider support. --- ## Installation ```bash pip install "headroom-ai[agno]" ``` This installs Headroom with Agno support. You'll also need Agno itself: ```bash pip install agno ``` --- ## Quick Start ```python from agno.agent import Agent from agno.models.openai import OpenAIChat from headroom.integrations.agno import HeadroomAgnoModel # Wrap your model model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o")) # Create agent as usual agent = Agent(model=model) # Use exactly like before response = agent.run("What's the capital of France?") # Check savings print(f"Tokens saved: {model.total_tokens_saved}") print(model.get_savings_summary()) # {'total_requests': 1, 'total_tokens_saved': 245, 'average_savings_percent': 12.3} ``` --- ## Integration Patterns ### 1. Basic Model Wrapping The simplest integration - wrap any Agno model with `HeadroomAgnoModel`: ```python from agno.models.openai import OpenAIChat from agno.models.anthropic import Claude from agno.models.google import Gemini from headroom.integrations.agno import HeadroomAgnoModel # Works with any Agno model openai_model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o")) claude_model = HeadroomAgnoModel(Claude(id="claude-3-5-sonnet-20241022")) gemini_model = HeadroomAgnoModel(Gemini(id="gemini-2.0-flash")) # Each automatically uses the correct provider for accurate token counting ``` **Why this matters**: Headroom automatically detects the underlying provider and applies the correct tokenizer for accurate optimization metrics. ### 2. Agent with Observability Hooks Use hooks for detailed tracking without modifying your model: ```python from agno.agent import Agent from agno.models.openai import OpenAIChat from headroom.integrations.agno import ( HeadroomAgnoModel, HeadroomPreHook, HeadroomPostHook, ) # Model wrapper for optimization model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o")) # Hooks for observability pre_hook = HeadroomPreHook() post_hook = HeadroomPostHook(token_alert_threshold=10000) agent = Agent( model=model, pre_hooks=[pre_hook], post_hooks=[post_hook], ) # Run agent response = agent.run("Analyze this large dataset...") # Check metrics from model print(f"Tokens saved: {model.total_tokens_saved}") # Check observability from hooks print(f"Post-hook summary: {post_hook.get_summary()}") print(f"Alerts triggered: {post_hook.alerts}") ``` **Why this matters**: Hooks provide observability into agent behavior and can alert when token usage exceeds thresholds. ### 3. Convenience Hook Factory Use `create_headroom_hooks()` to create matched hook pairs: ```python from headroom.integrations.agno import create_headroom_hooks pre_hook, post_hook = create_headroom_hooks( token_alert_threshold=5000, log_level="DEBUG", ) agent = Agent( model=model, pre_hooks=[pre_hook], post_hooks=[post_hook], ) ``` ### 4. Custom Configuration Pass a `HeadroomConfig` for fine-grained control: ```python from headroom import HeadroomConfig, HeadroomMode from headroom.integrations.agno import HeadroomAgnoModel config = HeadroomConfig( default_mode=HeadroomMode.OPTIMIZE, # Add other configuration options as needed ) model = HeadroomAgnoModel( wrapped_model=OpenAIChat(id="gpt-4o"), config=config, ) ``` ### 5. Standalone Message Optimization Optimize messages without wrapping a model: ```python from headroom.integrations.agno import optimize_messages messages = [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Analyze this large JSON: " + large_json}, ] optimized_messages, metrics = optimize_messages(messages, model="gpt-4o") print(f"Tokens saved: {metrics['tokens_saved']}") print(f"Transforms applied: {metrics['transforms_applied']}") ``` ### 6. Async Operations Full async support for high-throughput applications: ```python import asyncio from headroom.integrations.agno import HeadroomAgnoModel async def process_async(): model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o")) # Async response response = await model.aresponse(messages) # Async streaming async for chunk in model.aresponse_stream(messages): print(chunk, end="", flush=True) print(f"\nTokens saved: {model.total_tokens_saved}") asyncio.run(process_async()) ``` --- ## Real-World Examples ### Example 1: Tool-Heavy Agent ```python from agno.agent import Agent from agno.models.openai import OpenAIChat from agno.tools.duckduckgo import DuckDuckGoTools from headroom.integrations.agno import HeadroomAgnoModel # Wrap model for optimization model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o")) # Agent with search tools agent = Agent( model=model, tools=[DuckDuckGoTools()], show_tool_calls=True, ) # Tool outputs get compressed automatically response = agent.run("Research the latest AI developments and summarize") # Impact: Tool outputs (often 10K+ tokens) compressed by 70-90% print(f"Tokens saved: {model.total_tokens_saved}") print(model.get_savings_summary()) ``` ### Example 2: Multi-Model Routing ```python from agno.models.openai import OpenAIChat from agno.models.anthropic import Claude from headroom.integrations.agno import HeadroomAgnoModel # Different models for different tasks fast_model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o-mini")) powerful_model = HeadroomAgnoModel(Claude(id="claude-3-5-sonnet-20241022")) # Use fast model for simple tasks simple_agent = Agent(model=fast_model) # Use powerful model for complex reasoning complex_agent = Agent(model=powerful_model) # Each tracks its own metrics print(f"Fast model saved: {fast_model.total_tokens_saved}") print(f"Powerful model saved: {powerful_model.total_tokens_saved}") ``` ### Example 3: Production Monitoring ```python from agno.agent import Agent from headroom.integrations.agno import ( HeadroomAgnoModel, create_headroom_hooks, ) model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o")) pre_hook, post_hook = create_headroom_hooks( token_alert_threshold=50000, # Alert on large requests log_level="WARNING", ) agent = Agent( model=model, pre_hooks=[pre_hook], post_hooks=[post_hook], ) # Run multiple requests for query in user_queries: response = agent.run(query) # Check for alerts if post_hook.alerts: print(f"WARNING: {len(post_hook.alerts)} requests exceeded threshold") for alert in post_hook.alerts: print(f" - {alert}") # Summary stats summary = post_hook.get_summary() print(f"Total requests: {summary['total_requests']}") print(f"Average tokens: {summary['average_tokens']}") ``` ### Example 4: Reset for New Sessions ```python model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o")) # Session 1 agent.run("First conversation...") print(f"Session 1 savings: {model.get_savings_summary()}") # Reset for new session model.reset() # Session 2 - metrics start fresh agent.run("Second conversation...") print(f"Session 2 savings: {model.get_savings_summary()}") ``` --- ## Supported Providers HeadroomAgnoModel automatically detects the provider from the wrapped model: | Provider | Agno Models | Auto-Detected | |----------|-------------|---------------| | **OpenAI** | `OpenAIChat`, `OpenAILike` | Yes | | **Anthropic** | `Claude`, `AwsBedrock` | Yes | | **Google** | `Gemini`, `VertexAI` | Yes | | **Cohere** | `Cohere`, `CohereChat` | Yes | | **Groq** | `Groq` | Yes (OpenAI-compatible) | | **Mistral** | `Mistral` | Yes (OpenAI-compatible) | | **Together** | `Together` | Yes (OpenAI-compatible) | | **Ollama** | `Ollama` | Yes (OpenAI-compatible) | To disable auto-detection: ```python model = HeadroomAgnoModel( wrapped_model=some_model, auto_detect_provider=False, # Falls back to OpenAI tokenizer ) ``` --- ## Feature Coverage ### What's Optimized HeadroomAgnoModel optimizes messages at the LLM call boundary. This covers: | Feature | Optimized | Notes | |---------|-----------|-------| | **User/Assistant Messages** | ✅ Yes | Full message history compressed | | **Tool Calls** | ✅ Yes | Tool call arguments optimized | | **Tool Results** | ✅ Yes | JSON responses compressed 70-90% via SmartCrusher | | **System Prompts** | ✅ Yes | Included in message optimization | | **Streaming Responses** | ✅ Yes | Both sync and async | | **Multi-turn Conversations** | ✅ Yes | Full history available for optimization | ### Known Limitations The integration operates at the model layer, not the agent layer. Some Agno features operate outside this boundary: | Agno Feature | Status | Explanation | |--------------|--------|-------------| | **Agent Memory** | ⚠️ Partial | Memory content is optimized when it enters messages, but the persistent memory store itself is not compressed. If you're storing large amounts of data in agent memory, consider summarizing before storage. | | **Knowledge Bases** | ⚠️ Partial | KB retrieval happens before messages reach the model. Retrieved context is optimized as part of the message, but we can't influence KB retrieval itself. | | **Agent Teams** | ❌ Not supported | Each agent's model is wrapped independently. No cross-agent optimization or team-level coordination. | | **Tool Definitions** | ⚠️ Not deduplicated | Tool schemas are sent with every request. Future versions may deduplicate repeated tool definitions. | | **Structured Outputs** | ✅ Supported | `response_model` works normally; optimization doesn't affect output parsing. | | **Reasoning Models** | ✅ Supported | Extended thinking works; we don't compress reasoning traces. | ### Best Practices for Maximum Savings 1. **Tool-heavy agents see the biggest wins** — Tool results (JSON, logs, search results) compress 70-90% 2. **Long conversations benefit from RollingWindow** — Configure context limits to avoid hitting provider maximums 3. **Wrap at the model level, not agent level** — This ensures all LLM calls go through optimization 4. **Use hooks for observability** — Track token usage patterns to identify optimization opportunities ### Future Improvements We're tracking these potential enhancements: - **Memory optimization hooks** — Compress data before it enters agent memory - **Knowledge base integration** — Optimize retrieved context at the KB layer - **Tool schema deduplication** — Cache and reference repeated tool definitions - **Team-level optimization** — Shared context compression across agent teams Contributions welcome! See [CONTRIBUTING.md](https://github.com/chopratejas/headroom/blob/main/CONTRIBUTING.md). --- ## Configuration Reference ### HeadroomAgnoModel | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `wrapped_model` | Any | Required | The Agno model to wrap | | `config` | `HeadroomConfig` | `None` | Custom configuration | | `auto_detect_provider` | `bool` | `True` | Auto-detect provider for token counting | **Properties:** - `wrapped_model` - Access the underlying Agno model - `total_tokens_saved` - Running total of tokens saved - `metrics_history` - List of last 100 `OptimizationMetrics` **Methods:** - `response(messages, **kwargs)` - Sync response with optimization - `response_stream(messages, **kwargs)` - Sync streaming response - `aresponse(messages, **kwargs)` - Async response - `aresponse_stream(messages, **kwargs)` - Async streaming - `get_savings_summary()` - Returns dict with stats - `reset()` - Clear all metrics ### HeadroomPreHook | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `config` | `HeadroomConfig` | `None` | Configuration (for future use) | | `model` | `str` | `"gpt-4o"` | Model name for estimation | ### HeadroomPostHook | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `log_level` | `str` | `"INFO"` | Logging level | | `token_alert_threshold` | `int` | `None` | Alert if tokens exceed this | **Properties:** - `total_requests` - Number of requests tracked - `alerts` - List of alert messages **Methods:** - `get_summary()` - Returns dict with request stats - `reset()` - Clear history and alerts ### create_headroom_hooks() | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `config` | `HeadroomConfig` | `None` | Config for pre-hook | | `model` | `str` | `"gpt-4o"` | Model for pre-hook | | `log_level` | `str` | `"INFO"` | Log level for post-hook | | `token_alert_threshold` | `int` | `None` | Alert threshold for post-hook | Returns: `tuple[HeadroomPreHook, HeadroomPostHook]` --- ## Import Reference ```python # Main integration from headroom.integrations.agno import HeadroomAgnoModel # Hooks from headroom.integrations.agno import HeadroomPreHook from headroom.integrations.agno import HeadroomPostHook from headroom.integrations.agno import create_headroom_hooks # Utilities from headroom.integrations.agno import optimize_messages from headroom.integrations.agno import agno_available from headroom.integrations.agno import get_headroom_provider from headroom.integrations.agno import get_model_name_from_agno # Or import everything from parent from headroom.integrations import ( HeadroomAgnoModel, HeadroomPreHook, HeadroomPostHook, create_headroom_hooks, ) ``` --- ## Troubleshooting ### Check if Agno is Available ```python from headroom.integrations.agno import agno_available if agno_available(): from headroom.integrations.agno import HeadroomAgnoModel else: print("Install agno: pip install agno") ``` ### Provider Detection Issues If auto-detection fails, check the detected provider: ```python from headroom.integrations.agno import get_headroom_provider, get_model_name_from_agno model = OpenAIChat(id="gpt-4o") provider = get_headroom_provider(model) model_name = get_model_name_from_agno(model) print(f"Detected provider: {type(provider).__name__}") print(f"Model name: {model_name}") ``` ### Metrics Not Updating Ensure you're checking the correct object: ```python # Model metrics (optimization) print(model.total_tokens_saved) # Actual savings # Hook metrics (observability) print(post_hook.get_summary()) # Request tracking ``` Note: Hooks track request counts, not token savings. Use the model wrapper for optimization metrics.