headroom / docs /transforms.md
chopratejas's picture
Add AST-based code compression and custom model configuration
905c229
|
Raw
History Blame
12.9 kB

Transform Reference

Headroom provides several transforms that work together to optimize LLM context.

SmartCrusher

Statistical compression for JSON tool outputs.

How It Works

SmartCrusher analyzes JSON arrays and selectively keeps important items:

  1. First/Last items - Context for pagination and recency
  2. Error items - 100% preservation of error states
  3. Anomalies - Statistical outliers (> 2 std dev from mean)
  4. Relevant items - Matches to user's query via BM25/embeddings
  5. Change points - Significant transitions in data

Configuration

from headroom import SmartCrusherConfig

config = SmartCrusherConfig(
    min_tokens_to_crush=200,      # Only compress if > 200 tokens
    max_items_after_crush=50,     # Keep at most 50 items
    keep_first=3,                 # Always keep first 3 items
    keep_last=2,                  # Always keep last 2 items
    relevance_threshold=0.3,      # Keep items with relevance > 0.3
    anomaly_std_threshold=2.0,    # Keep items > 2 std dev from mean
    preserve_errors=True,         # Always keep error items
)

Example

from headroom import SmartCrusher

crusher = SmartCrusher(config)

# Before: 1000 search results (45,000 tokens)
tool_output = {"results": [...1000 items...]}

# After: ~50 important items (4,500 tokens) - 90% reduction
compressed = crusher.crush(tool_output, query="user's question")

What Gets Preserved

Category Preserved Why
Errors 100% Critical for debugging
First N 100% Context/pagination
Last N 100% Recency
Anomalies All Unusual values matter
Relevant Top K Match user's query
Others Sampled Statistical representation

CacheAligner

Prefix stabilization for improved cache hit rates.

The Problem

LLM providers cache request prefixes. But dynamic content breaks caching:

"You are helpful. Today is January 7, 2025."  # Changes daily = no cache

The Solution

CacheAligner extracts dynamic content to stabilize the prefix:

from headroom import CacheAligner

aligner = CacheAligner()
result = aligner.align(messages)

# Static prefix (cacheable):
# "You are helpful."

# Dynamic content moved to end:
# [Current date context]

Configuration

from headroom import CacheAlignerConfig

config = CacheAlignerConfig(
    extract_dates=True,           # Move dates to dynamic section
    normalize_whitespace=True,    # Consistent spacing
    stable_prefix_min_tokens=100, # Min prefix size for alignment
)

Cache Hit Improvement

Scenario Before After
Daily date in prompt 0% hits ~95% hits
Dynamic user context ~10% hits ~80% hits
Consistent prompts ~90% hits ~95% hits

RollingWindow

Context management within token limits.

The Problem

Long conversations exceed context limits. Naive truncation breaks tool calls:

[tool_call: search]  # Kept
[tool_result: ...]   # Dropped = orphaned call!

The Solution

RollingWindow drops complete tool units, preserving pairs:

from headroom import RollingWindow

window = RollingWindow(config)
result = window.apply(messages, max_tokens=100000)

# Guarantees:
# 1. Tool calls paired with results
# 2. System prompt preserved
# 3. Recent turns kept
# 4. Oldest tool outputs dropped first

Configuration

from headroom import RollingWindowConfig

config = RollingWindowConfig(
    max_tokens=100000,            # Target token limit
    preserve_system=True,         # Always keep system prompt
    preserve_recent_turns=5,      # Keep last 5 user/assistant turns
    drop_oldest_first=True,       # Remove oldest tool outputs
)

Drop Priority

  1. Oldest tool outputs - First to go
  2. Old assistant messages - Summary preserved
  3. Old user messages - Only if necessary
  4. Never dropped: System prompt, recent turns, active tool pairs

LLMLinguaCompressor (Optional)

ML-based compression using Microsoft's LLMLingua-2 model.

When to Use

Transform Best For Speed Compression
SmartCrusher JSON arrays ~1ms 70-90%
Text Utilities Search/logs ~1ms 50-90%
LLMLinguaCompressor Any text, max compression 50-200ms 80-95%

Installation

pip install "headroom-ai[llmlingua]"  # Adds ~2GB

Configuration

from headroom.transforms import LLMLinguaCompressor, LLMLinguaConfig

config = LLMLinguaConfig(
    device="auto",                    # auto, cuda, cpu, mps
    target_compression_rate=0.3,      # Keep 30% of tokens
    min_tokens_for_compression=100,   # Skip small content
    code_compression_rate=0.4,        # Conservative for code
    json_compression_rate=0.35,       # Moderate for JSON
    text_compression_rate=0.25,       # Aggressive for text
    enable_ccr=True,                  # Store original for retrieval
)

compressor = LLMLinguaCompressor(config)

Content-Aware Rates

LLMLinguaCompressor auto-detects content type:

Content Type Default Rate Behavior
Code 0.4 Conservative - preserves syntax
JSON 0.35 Moderate - keeps structure
Text 0.3 Aggressive - maximum compression

Memory Management

from headroom.transforms import (
    is_llmlingua_model_loaded,
    unload_llmlingua_model,
)

# Check if model is loaded
print(is_llmlingua_model_loaded())  # True/False

# Free ~1GB RAM when done
unload_llmlingua_model()

Proxy Integration

# Enable in proxy
headroom proxy --llmlingua --llmlingua-device cuda --llmlingua-rate 0.3

CodeAwareCompressor (Optional)

AST-based compression for source code using tree-sitter.

When to Use

Transform Best For Speed Compression
SmartCrusher JSON arrays ~1ms 70-90%
CodeAwareCompressor Source code ~10-50ms 40-70%
LLMLinguaCompressor Any text 50-200ms 80-95%

Key Benefits

  • Syntax validity guaranteed — Output always parses correctly
  • Preserves critical structure — Imports, signatures, types, error handlers
  • Multi-language support — Python, JavaScript, TypeScript, Go, Rust, Java, C, C++
  • Lightweight — ~50MB vs ~1GB for LLMLingua

Installation

pip install "headroom-ai[code]"  # Adds tree-sitter-language-pack

Configuration

from headroom.transforms import CodeAwareCompressor, CodeCompressorConfig, DocstringMode

config = CodeCompressorConfig(
    preserve_imports=True,              # Always keep imports
    preserve_signatures=True,           # Always keep function signatures
    preserve_type_annotations=True,     # Keep type hints
    preserve_error_handlers=True,       # Keep try/except blocks
    preserve_decorators=True,           # Keep decorators
    docstring_mode=DocstringMode.FIRST_LINE,  # FULL, FIRST_LINE, REMOVE
    target_compression_rate=0.2,        # Keep 20% of tokens
    max_body_lines=5,                   # Lines to keep per function body
    min_tokens_for_compression=100,     # Skip small content
    language_hint=None,                 # Auto-detect if None
    fallback_to_llmlingua=True,         # Use LLMLingua for unknown langs
)

compressor = CodeAwareCompressor(config)

Example

from headroom.transforms import CodeAwareCompressor

compressor = CodeAwareCompressor()

code = '''
import os
from typing import List

def process_items(items: List[str]) -> List[str]:
    """Process a list of items."""
    results = []
    for item in items:
        if not item:
            continue
        processed = item.strip().lower()
        results.append(processed)
    return results
'''

result = compressor.compress(code, language="python")
print(result.compressed)
# import os
# from typing import List
#
# def process_items(items: List[str]) -> List[str]:
#     """Process a list of items."""
#     results = []
#     for item in items:
#     # ... (5 lines compressed)
#     pass

print(f"Compression: {result.compression_ratio:.0%}")  # ~55%
print(f"Syntax valid: {result.syntax_valid}")  # True

Supported Languages

Tier Languages Support Level
1 Python, JavaScript, TypeScript Full AST analysis
2 Go, Rust, Java, C, C++ Function body compression

Memory Management

from headroom.transforms import is_tree_sitter_available, unload_tree_sitter

# Check if tree-sitter is installed
print(is_tree_sitter_available())  # True/False

# Free memory when done (parsers are lazy-loaded)
unload_tree_sitter()

ContentRouter

Intelligent compression orchestrator that routes content to the optimal compressor.

How It Works

ContentRouter analyzes content and selects the best compression strategy:

  1. Detect content type — JSON, code, logs, search results, plain text
  2. Consider source hints — File paths, tool names for high-confidence routing
  3. Route to compressor — SmartCrusher, CodeAwareCompressor, SearchCompressor, etc.
  4. Log decisions — Transparent routing for debugging

Configuration

from headroom.transforms import ContentRouter, ContentRouterConfig, CompressionStrategy

config = ContentRouterConfig(
    min_section_tokens=100,             # Minimum tokens to compress
    enable_code_aware=True,             # Use CodeAwareCompressor for code
    enable_search_compression=True,     # Use SearchCompressor for grep output
    enable_log_compression=True,        # Use LogCompressor for logs
    default_strategy=CompressionStrategy.TEXT,  # Fallback strategy
)

router = ContentRouter(config)

Example

from headroom.transforms import ContentRouter, generate_source_hint

router = ContentRouter()

# With source hint for high-confidence routing
hint = generate_source_hint(tool_name="grep", file_path="src/auth.py")
result = router.compress(content, source_hint=hint)

print(result.strategy)  # CompressionStrategy.SEARCH or CODE_AWARE
print(result.routing_log)  # List of routing decisions

Compression Strategies

Strategy Used For Compressor
CODE_AWARE Source code CodeAwareCompressor
SMART_CRUSHER JSON arrays SmartCrusher
SEARCH Grep/find output SearchCompressor
LOG Log files LogCompressor
TEXT Plain text TextCompressor
LLMLINGUA Any (max compression) LLMLinguaCompressor
PASSTHROUGH Small content None

Source Hints

Use source hints for accurate routing:

from headroom.transforms import generate_source_hint

# From tool invocation
hint = generate_source_hint(tool_name="Read", file_path="main.py")

# From file extension
hint = generate_source_hint(file_path="components/Button.tsx")

# From explicit tool
hint = generate_source_hint(tool_name="Grep")  # Routes to SEARCH

TransformPipeline

Combine transforms for optimal results.

from headroom import TransformPipeline, SmartCrusher, CacheAligner, RollingWindow

pipeline = TransformPipeline([
    SmartCrusher(),      # First: compress tool outputs
    CacheAligner(),      # Then: stabilize prefix
    RollingWindow(),     # Finally: fit in context
])

result = pipeline.transform(messages)
print(f"Saved {result.tokens_saved} tokens")

With LLMLingua (Optional)

from headroom.transforms import (
    TransformPipeline, SmartCrusher, CacheAligner,
    RollingWindow, LLMLinguaCompressor
)

pipeline = TransformPipeline([
    CacheAligner(),         # 1. Stabilize prefix
    SmartCrusher(),         # 2. Compress JSON arrays
    LLMLinguaCompressor(),  # 3. ML compression on remaining text
    RollingWindow(),        # 4. Final size constraint (always last)
])

Recommended Order

Order Transform Purpose
1 CacheAligner Stabilize prefix for caching
2 SmartCrusher Compress JSON tool outputs
3 LLMLinguaCompressor ML compression (optional)
4 RollingWindow Enforce token limits (always last)

Why this order?

  • CacheAligner first to maximize prefix stability
  • SmartCrusher handles JSON arrays efficiently
  • LLMLingua compresses remaining long text
  • RollingWindow truncates only if still over limit

Safety Guarantees

All transforms follow strict safety rules:

  1. Never remove human content - User/assistant text is sacred
  2. Never break tool ordering - Calls and results stay paired
  3. Parse failures are no-ops - Malformed content passes through
  4. Preserves recency - Last N turns always kept
  5. 100% error preservation - Error items never dropped