# Text Compression Utilities

For coding tasks, Headroom provides **standalone text compression utilities** that applications can use explicitly. These are **opt-in** — they're not applied automatically, giving you full control over when and how to compress text content.

> **Design Philosophy**: SmartCrusher compresses JSON automatically because it's structure-preserving and safe. Text compression is lossy and context-dependent, so applications should decide when to use it.

## Available Utilities

| Utility | Input Type | Use Case |
|---------|------------|----------|
| `SearchCompressor` | grep/ripgrep output | Search results with `file:line:content` format |
| `LogCompressor` | Build/test logs | pytest, npm, cargo, make output |
| `TextCompressor` | Generic text | Any plain text with anchor preservation |
| `detect_content_type` | Any content | Detect content type for routing decisions |

## SearchCompressor

Compresses search results (grep, ripgrep, ag) while preserving relevant matches.

```python
from headroom.transforms import SearchCompressor

# Your grep/ripgrep output (could be 1000s of lines)
search_results = """
src/utils.py:42:def process_data(items):
src/utils.py:43:    \"\"\"Process items.\"\"\"
src/models.py:15:class DataProcessor:
src/models.py:89:    def process(self, items):
... hundreds more matches ...
"""

# Explicitly compress when you decide it's appropriate
compressor = SearchCompressor()
result = compressor.compress(search_results, context="find process")

print(f"Compressed {result.original_match_count} matches to {result.compressed_match_count}")
print(result.compressed)
```

### What Gets Preserved

- **Exact query matches**: Lines containing the search term
- **High-relevance matches**: Scored by BM25 similarity to context
- **File diversity**: Ensures results from different files are kept
- **First/last matches**: Context from start and end of results

## LogCompressor

Compresses build and test output while preserving errors, warnings, and summaries.

```python
from headroom.transforms import LogCompressor

# pytest output with 1000s of lines
build_output = """
===== test session starts =====
collected 500 items
tests/test_foo.py::test_1 PASSED
... hundreds of passed tests ...
tests/test_bar.py::test_fail FAILED
AssertionError: expected 5, got 3
===== 1 failed, 499 passed =====
"""

# Compress logs, preserving errors and stack traces
compressor = LogCompressor()
result = compressor.compress(build_output)

# Errors, stack traces, and summary are preserved
print(result.compressed)
print(f"Compression ratio: {result.compression_ratio:.1%}")
```

### What Gets Preserved

- **Errors and failures**: Any line with ERROR, FAILED, Exception, etc.
- **Warnings**: Warning messages that might be important
- **Stack traces**: Full tracebacks for debugging
- **Summaries**: Test/build summary lines
- **Section headers**: Structural markers like `=====`

## TextCompressor

General-purpose text compression with anchor preservation.

```python
from headroom.transforms import TextCompressor

long_text = """
... thousands of lines of documentation ...
"""

compressor = TextCompressor()
result = compressor.compress(long_text, context="authentication")

print(result.compressed)
```

### What Gets Preserved

- **Relevant paragraphs**: Scored by similarity to context
- **Anchors**: Headers, section markers, important keywords
- **Structure**: Document organization is maintained

## Content Type Detection

Automatically detect content type to route to the right compressor.

```python
from headroom.transforms import detect_content_type, ContentType

content = "src/main.py:42:def process():"

detection = detect_content_type(content)
if detection.content_type == ContentType.SEARCH_RESULTS:
    # Route to SearchCompressor
    pass
elif detection.content_type == ContentType.BUILD_OUTPUT:
    # Route to LogCompressor
    pass
elif detection.content_type == ContentType.PLAIN_TEXT:
    # Route to TextCompressor
    pass
```

### Content Types

| Type | Detection Pattern |
|------|-------------------|
| `SEARCH_RESULTS` | `file:line:content` format |
| `BUILD_OUTPUT` | pytest, npm, cargo markers |
| `JSON` | Valid JSON structure |
| `PLAIN_TEXT` | Default fallback |

## Integration Pattern

```python
from headroom.transforms import (
    detect_content_type, ContentType,
    SearchCompressor, LogCompressor, TextCompressor
)

def compress_tool_output(content: str, context: str = "") -> str:
    """Application-level compression with explicit control."""
    detection = detect_content_type(content)

    if detection.content_type == ContentType.SEARCH_RESULTS:
        result = SearchCompressor().compress(content, context)
        return result.compressed
    elif detection.content_type == ContentType.BUILD_OUTPUT:
        result = LogCompressor().compress(content)
        return result.compressed
    elif detection.content_type == ContentType.PLAIN_TEXT:
        result = TextCompressor().compress(content, context)
        return result.compressed
    else:
        # JSON or other - let SmartCrusher handle it automatically
        return content
```

## Configuration

Each compressor accepts configuration options:

```python
from headroom.transforms import SearchCompressor, SearchCompressorConfig

config = SearchCompressorConfig(
    max_results=50,           # Keep up to 50 matches
    preserve_file_diversity=True,  # Ensure different files represented
    relevance_threshold=0.3,  # Minimum relevance score to keep
)

compressor = SearchCompressor(config)
```

## Performance

| Compressor | Typical Input | Output | Speed |
|------------|---------------|--------|-------|
| SearchCompressor | 1000 matches | 30-50 matches | ~2ms |
| LogCompressor | 5000 lines | 100-200 lines | ~3ms |
| TextCompressor | 10000 chars | 2000 chars | ~2ms |

## When to Use

| Scenario | Recommendation |
|----------|----------------|
| JSON tool output | Let SmartCrusher handle automatically |
| grep/ripgrep results | Use SearchCompressor |
| pytest/npm/cargo output | Use LogCompressor |
| Documentation/README | Use TextCompressor |
| Unknown content | Use detect_content_type to route |