# Universal Compression Headroom's Universal Compression module provides intelligent, automatic compression with ML-based content detection and structure preservation. ## Overview Universal Compression combines several techniques: 1. **ML-based Detection** - Automatically detects content type (JSON, code, logs, text) using Magika 2. **Structure Preservation** - Keeps keys, signatures, and templates intact via structure masks 3. **Intelligent Compression** - Compresses content while preserving meaning with LLMLingua 4. **Reversible via CCR** - Stores originals for retrieval when LLM needs full context ## Quick Start ### One-Liner ```python from headroom.compression import compress result = compress(content) print(result.compressed) print(f"Saved {result.savings_percentage:.0f}% tokens") ``` ### With Configuration ```python from headroom.compression import UniversalCompressor, UniversalCompressorConfig config = UniversalCompressorConfig( compression_ratio_target=0.5, # Keep 50% of content use_entropy_preservation=True, # Preserve UUIDs, hashes ) compressor = UniversalCompressor(config=config) result = compressor.compress(content) ``` --- ## How It Works ### Detection Flow ``` ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ Content │───>│ Detect │───>│ Extract │───>│ Compress │ │ Input │ │ Type │ │ Structure │ │ Content │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ │ │ ▼ ▼ ▼ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ Magika │ │ Handler │ │ LLMLingua │ │ (ML) │ │ (JSON, │ │ (optional) │ │ │ │ Code...) │ │ │ └─────────────┘ └─────────────┘ └─────────────┘ ``` ### Structure Masks Structure masks identify what to preserve: | Content Type | What's Preserved | What's Compressed | |--------------|------------------|-------------------| | **JSON** | Keys, brackets, booleans, nulls, short values, UUIDs | Long string values, whitespace | | **Code** | Imports, function signatures, class definitions, types | Function bodies, comments | | **Logs** | Timestamps, log levels, error messages | Repeated patterns, verbose details | | **Text** | High-entropy tokens (IDs, hashes) | Low-information content | --- ## Configuration ### UniversalCompressorConfig ```python from headroom.compression import UniversalCompressorConfig config = UniversalCompressorConfig( # Detection use_magika=True, # Use ML-based detection (requires magika) # Compression use_llmlingua=True, # Use LLMLingua for compression compression_ratio_target=0.3, # Keep 30% of content (70% reduction) min_content_length=100, # Skip content shorter than this # Structure preservation use_entropy_preservation=True, # Preserve high-entropy tokens entropy_threshold=0.85, # Entropy threshold for preservation # CCR ccr_enabled=True, # Store originals for retrieval ) ``` ### Configuration Options | Option | Default | Description | |--------|---------|-------------| | `use_magika` | `True` | Use ML-based content detection | | `use_llmlingua` | `True` | Use LLMLingua for compression | | `compression_ratio_target` | `0.3` | Target ratio (0.3 = keep 30%) | | `min_content_length` | `100` | Minimum chars to compress | | `use_entropy_preservation` | `True` | Preserve high-entropy tokens | | `entropy_threshold` | `0.85` | Entropy threshold (0.0-1.0) | | `ccr_enabled` | `True` | Enable CCR storage | --- ## Content Handlers ### JSON Handler Preserves JSON structure while compressing values: ```python from headroom.compression.handlers.json_handler import JSONStructureHandler handler = JSONStructureHandler( preserve_short_values=True, # Keep values < 20 chars short_value_threshold=20, # Threshold for "short" preserve_high_entropy=True, # Keep UUIDs, hashes entropy_threshold=0.85, # Entropy threshold max_array_items_full=3, # Keep first N array items full max_number_digits=10, # Preserve numbers up to N digits ) ``` **What's Preserved:** - All keys (navigational - LLM sees schema) - Structural syntax (`{`, `}`, `[`, `]`, `:`, `,`) - Booleans and nulls (semantically important) - High-entropy strings (UUIDs, hashes - identifiers) - Short numbers (often IDs) **Example:** ```python # Before { "id": "usr_abc123", "name": "Alice Johnson", "bio": "A long description that goes on and on..." } # After (structure preserved, long values compressed) { "id": "usr_abc123", "name": "Alice Johnson", "bio": "A long...[compressed]..." } ``` ### Code Handler Preserves code structure using AST parsing (tree-sitter) or regex fallback: ```python from headroom.compression.handlers.code_handler import CodeStructureHandler handler = CodeStructureHandler( preserve_comments=False, # Preserve comments as structural use_tree_sitter=True, # Use tree-sitter for parsing default_language="python", # Default when detection fails ) ``` **What's Preserved:** - Import statements - Function/method signatures - Class definitions - Type annotations - Decorators **What's Compressed:** - Function bodies (implementations) - Comments (unless `preserve_comments=True`) **Example:** ```python # Before def process_data(items: List[str]) -> Dict[str, int]: """Process items and count occurrences.""" result = {} for item in items: item = item.strip().lower() if item in result: result[item] += 1 else: result[item] = 1 return result # After (signature preserved, body compressed) def process_data(items: List[str]) -> Dict[str, int]: """Process items and count occurrences.""" result = {} for item in items: ...[compressed]... ``` ### Supported Languages | Language | Parser | Support Level | |----------|--------|---------------| | Python | tree-sitter | Full AST | | JavaScript | tree-sitter | Full AST | | TypeScript | tree-sitter | Full AST | | Go | tree-sitter | Full AST | | Rust | tree-sitter | Full AST | | Java | tree-sitter | Full AST | | C | tree-sitter | Full AST | | C++ | tree-sitter | Full AST | --- ## Compression Result ```python from headroom.compression import compress result = compress(content) # Access result fields print(result.compressed) # Compressed content print(result.original) # Original content print(result.compression_ratio) # e.g., 0.35 (35% of original size) print(result.tokens_before) # Estimated tokens before print(result.tokens_after) # Estimated tokens after print(result.tokens_saved) # tokens_before - tokens_after print(result.savings_percentage) # e.g., 65.0 (65% savings) # Detection info print(result.content_type) # ContentType.JSON, CODE, etc. print(result.detection_confidence) # 0.0-1.0 # Structure info print(result.handler_used) # "json", "code", etc. print(result.preservation_ratio) # Fraction preserved as structure # CCR info print(result.ccr_key) # Key for retrieval (if CCR enabled) ``` --- ## Batch Compression For multiple contents, batch compression is more efficient: ```python from headroom.compression import UniversalCompressor compressor = UniversalCompressor() contents = [ '{"users": [...]}', 'def hello(): pass', 'Plain text content', ] results = compressor.compress_batch(contents) for result in results: print(f"{result.content_type}: {result.savings_percentage:.0f}% saved") ``` --- ## Custom Handlers Register custom handlers for specific content types: ```python from headroom.compression import UniversalCompressor from headroom.compression.detector import ContentType from headroom.compression.handlers.base import BaseStructureHandler, HandlerResult from headroom.compression.masks import StructureMask class LogStructureHandler(BaseStructureHandler): """Custom handler for log content.""" def __init__(self): super().__init__(name="log") def can_handle(self, content: str) -> bool: return "[INFO]" in content or "[ERROR]" in content def _extract_mask(self, content, tokens, **kwargs): # Mark timestamps and log levels as structural mask = [False] * len(content) # ... (custom logic) return HandlerResult( mask=StructureMask(tokens=tokens, mask=mask), handler_name=self.name, confidence=0.9, ) # Register the custom handler compressor = UniversalCompressor() compressor.register_handler(ContentType.TEXT, LogStructureHandler()) ``` --- ## CCR Integration Universal Compression integrates with CCR (Compress-Cache-Retrieve) for reversible compression: ```python from headroom.compression import UniversalCompressor, UniversalCompressorConfig config = UniversalCompressorConfig(ccr_enabled=True) compressor = UniversalCompressor(config=config) result = compressor.compress(large_content) # CCR key for retrieval if result.ccr_key: print(f"Original stored with key: {result.ccr_key}") # LLM can request original via CCR when needed ``` See [CCR Guide](ccr.md) for full CCR documentation. --- ## Performance | Content Type | Compression | Speed | Accuracy | |--------------|-------------|-------|----------| | JSON (large arrays) | 70-90% | ~1ms | Keys preserved | | Code (Python) | 50-70% | ~10ms | Signatures preserved | | Plain text | 60-80% | ~5ms | High-entropy preserved | **Overhead:** ~1-10ms per compression depending on content size and type. --- ## Installation ```bash # Basic compression (fallback to simple compression) pip install headroom-ai # With ML detection (recommended) pip install "headroom-ai[magika]" # With LLMLingua compression pip install "headroom-ai[llmlingua]" # With AST-based code handling pip install "headroom-ai[code]" # Everything pip install "headroom-ai[all]" ``` --- ## Example: Full Pipeline ```python from headroom.compression import UniversalCompressor, UniversalCompressorConfig # Configure for aggressive compression config = UniversalCompressorConfig( compression_ratio_target=0.25, # Keep 25% use_magika=True, use_llmlingua=True, ccr_enabled=True, ) compressor = UniversalCompressor(config=config) # Compress JSON API response json_content = """ { "users": [ {"id": "usr_123", "name": "Alice", "bio": "Software engineer..."}, {"id": "usr_456", "name": "Bob", "bio": "Product manager..."} ], "total": 2, "page": 1 } """ result = compressor.compress(json_content) print(f"Type: {result.content_type}") # ContentType.JSON print(f"Handler: {result.handler_used}") # json print(f"Saved: {result.savings_percentage:.0f}%") # ~60% print(f"Structure: {result.preservation_ratio:.0%} preserved") # ~40% print(f"CCR Key: {result.ccr_key}") # For retrieval ``` --- ## See Also - [Transforms Reference](transforms.md) - Other compression transforms - [CCR Guide](ccr.md) - Reversible compression architecture - [Text Compression](text-compression.md) - Opt-in utilities for search/logs