headroom / docs /text-compression.md

chopratejas

v0.2.2: Add CCR Response Handler, Context Tracker, and restructure docs

d724f14 5 months ago


	# Text Compression Utilities

	For coding tasks, Headroom provides standalone text compression utilities that applications call explicitly. They are opt-in and never applied automatically, so you control when and how text content is compressed.

	> Design Philosophy: SmartCrusher compresses JSON automatically because it's structure-preserving and safe. Text compression is lossy and context-dependent, so applications should decide when to use it.

	## Available Utilities

	| Utility | Input Type | Use Case |
	|---------|------------|----------|
	| `SearchCompressor` | grep/ripgrep output | Search results with `file:line:content` format |
	| `LogCompressor` | Build/test logs | pytest, npm, cargo, make output |
	| `TextCompressor` | Generic text | Any plain text with anchor preservation |
	| `detect_content_type` | Any content | Detect content type for routing decisions |

	## SearchCompressor

	Compresses search results (grep, ripgrep, ag) while preserving relevant matches.

	```python
	from headroom.transforms import SearchCompressor

	# Your grep/ripgrep output (could be 1000s of lines)
	search_results = """
	src/utils.py:42:def process_data(items):
	src/utils.py:43: \"\"\"Process items.\"\"\"
	src/models.py:15:class DataProcessor:
	src/models.py:89: def process(self, items):
	... hundreds more matches ...
	"""

	# Explicitly compress when you decide it's appropriate
	compressor = SearchCompressor()
	result = compressor.compress(search_results, context="find process")

	print(f"Compressed {result.original_match_count} matches to {result.compressed_match_count}")
	print(result.compressed)
	```

	### What Gets Preserved

	- Exact query matches: Lines containing the search term
	- High-relevance matches: Scored by BM25 similarity to context
	- File diversity: Ensures results from different files are kept
	- First/last matches: Context from start and end of results
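
The preservation rules above can be illustrated with a small standard-library sketch. This is not Headroom's implementation: `select_matches`, `MATCH_RE`, and `max_results` are hypothetical names, and a plain term count stands in for the BM25 scoring mentioned above. It only shows how first/last matches, file diversity, and query relevance might be combined under a fixed budget.

```python
import re

# Hypothetical pattern for grep-style "file:line:content" rows.
MATCH_RE = re.compile(r"^(?P<path>[^:\n]+):(?P<line>\d+):(?P<text>.*)$")


def select_matches(raw: str, query: str, max_results: int = 5) -> list[str]:
    """Keep a subset of grep-style lines (illustrative helper, not Headroom's API)."""
    lines = [ln for ln in raw.splitlines() if MATCH_RE.match(ln)]
    if len(lines) <= max_results:
        return lines

    terms = [t.lower() for t in query.split()]

    def score(ln: str) -> int:
        # Term count is a simple stand-in for a real relevance score such as BM25.
        text = MATCH_RE.match(ln).group("text").lower()
        return sum(text.count(t) for t in terms)

    # Priority order: first and last match, one match per file, then by relevance.
    candidates = [0, len(lines) - 1]
    seen_files: set[str] = set()
    for i, ln in enumerate(lines):
        path = MATCH_RE.match(ln).group("path")
        if path not in seen_files:
            seen_files.add(path)
            candidates.append(i)
    candidates += sorted(range(len(lines)), key=lambda i: -score(lines[i]))

    keep: list[int] = []
    for i in candidates:
        if i not in keep:
            keep.append(i)
        if len(keep) == max_results:
            break

    # Emit the kept lines in their original order.
    return [lines[i] for i in sorted(keep)]
```

Returning the kept lines in input order keeps the output readable as ordinary search results, whatever order the selection rules fired in.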

	## LogCompressor

	Compresses build and test output while preserving errors, warnings, and summaries.

	```python
	from headroom.transforms import LogCompressor

	# pytest output with 1000s of lines
	build_output = """
	===== test session starts =====
	collected 500 items
	tests/test_foo.py::test_1 PASSED
	... hundreds of passed tests ...
	tests/test_bar.py::test_fail FAILED
	AssertionError: expected 5, got 3
	===== 1 failed, 499 passed =====
	"""

	# Compress logs, preserving errors and stack traces
	compressor = LogCompressor()
	result = compressor.compress(build_output)

	# Errors, stack traces, and summary are preserved
	print(result.compressed)
	print(f"Compression ratio: {result.compression_ratio:.1%}")
	```

	### What Gets Preserved

	- Errors and failures: Any line with ERROR, FAILED, Exception, etc.
	- Warnings: Warning messages that might be important
	- Stack traces: Full tracebacks for debugging
	- Summaries: Test/build summary lines
	- Section headers: Structural markers like `=====`
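
As a rough illustration of the rules above, the sketch below keeps only lines that match a small set of patterns and replaces each dropped run with a one-line marker. It uses the standard library only. `filter_log` and `KEEP_PATTERNS` are hypothetical names and the patterns are assumptions, not the ones `LogCompressor` actually uses.

```python
import re

# Assumed patterns for "important" log lines; a real compressor would know more.
KEEP_PATTERNS = [
    re.compile(r"\b(?:ERROR|FAILED|FAILURE)\b"),           # errors and failures
    re.compile(r"\b\w*(?:Error|Exception)\b"),             # exception class names
    re.compile(r"\bwarning\b", re.IGNORECASE),             # warnings
    re.compile(r"^Traceback \(most recent call last\):"),  # start of a stack trace
    re.compile(r"^=+ .* =+$"),                             # section headers and summaries
]


def filter_log(raw: str) -> str:
    """Keep important lines and mark omitted runs (illustrative helper only)."""
    lines = raw.splitlines()
    keep = [i for i, ln in enumerate(lines) if any(p.search(ln) for p in KEEP_PATTERNS)]

    out: list[str] = []
    prev = -1
    for i in keep:
        gap = i - prev - 1
        if gap > 0:
            out.append(f"... {gap} lines omitted ...")
        out.append(lines[i])
        prev = i

    trailing = len(lines) - 1 - prev
    if trailing > 0:
        out.append(f"... {trailing} lines omitted ...")
    return "\n".join(out)
```

The omission markers tell a reader, or a model, that content was dropped and how much, which is usually safer than silently deleting lines.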

	## TextCompressor

	General-purpose text compression with anchor preservation.

	```python
	from headroom.transforms import TextCompressor

	long_text = """
	... thousands of lines of documentation ...
	"""

	compressor = TextCompressor()
	result = compressor.compress(long_text, context="authentication")

	print(result.compressed)
	```

	### What Gets Preserved

	- Relevant paragraphs: Scored by similarity to context
	- Anchors: Headers, section markers, important keywords
	- Structure: Document organization is maintained
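
A minimal sketch of the same idea, assuming Markdown-style `#` headers as the only anchors and word overlap with the context as the similarity score. `compress_text` and `max_paragraphs` are hypothetical. `TextCompressor`'s real scoring and anchor rules are not documented here and may differ.

```python
import re


def compress_text(text: str, context: str, max_paragraphs: int = 2) -> str:
    """Keep headers plus the paragraphs most related to `context` (illustrative only)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    terms = set(re.findall(r"\w+", context.lower()))

    def is_anchor(paragraph: str) -> bool:
        # Assumption: only Markdown headers count as anchors in this sketch.
        return paragraph.startswith("#")

    def score(paragraph: str) -> int:
        # Count how many words of the paragraph appear in the context string.
        return sum(1 for w in re.findall(r"\w+", paragraph.lower()) if w in terms)

    body = [i for i, p in enumerate(paragraphs) if not is_anchor(p)]
    top = set(sorted(body, key=lambda i: -score(paragraphs[i]))[:max_paragraphs])

    # Anchors are always kept; body paragraphs are kept only if they scored highest.
    kept = [p for i, p in enumerate(paragraphs) if is_anchor(p) or i in top]
    return "\n\n".join(kept)
```

Keeping every header even when its body is dropped preserves the document outline, so the reader can still see which sections existed.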

	## Content Type Detection

	Detect the content type of a string so it can be routed to the right compressor.

	```python
	from headroom.transforms import detect_content_type, ContentType

	content = "src/main.py:42:def process():"

	detection = detect_content_type(content)
	if detection.content_type == ContentType.SEARCH_RESULTS:
	    # Route to SearchCompressor
	    pass
	elif detection.content_type == ContentType.BUILD_OUTPUT:
	    # Route to LogCompressor
	    pass
	elif detection.content_type == ContentType.PLAIN_TEXT:
	    # Route to TextCompressor
	    pass
	```

	### Content Types

	| Type | Detection Pattern |
	|------|-------------------|
	| `SEARCH_RESULTS` | `file:line:content` format |
	| `BUILD_OUTPUT` | pytest, npm, cargo markers |
	| `JSON` | Valid JSON structure |
	| `PLAIN_TEXT` | Default fallback |
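
To make the detection patterns in the table concrete, here is a standard-library sketch of how such a classifier could work. `SketchContentType`, `sketch_detect`, the regexes, the 50% threshold, and the order of the checks are all assumptions made for illustration. They are not `detect_content_type`'s actual rules.

```python
import json
import re
from enum import Enum


class SketchContentType(Enum):
    """Stand-in for Headroom's ContentType, used only in this sketch."""
    SEARCH_RESULTS = "search_results"
    BUILD_OUTPUT = "build_output"
    JSON = "json"
    PLAIN_TEXT = "plain_text"


# A line that starts with "path:123:" looks like grep/ripgrep output.
SEARCH_LINE = re.compile(r"^[^\s:][^:\n]*:\d+:")

# A few assumed markers for pytest, npm, cargo, and make output.
BUILD_MARKERS = re.compile(
    r"test session starts|\bPASSED\b|\bFAILED\b|^npm ERR!|^\s*Compiling \S+|^make(?:\[\d+\])?: ",
    re.MULTILINE,
)


def sketch_detect(content: str) -> SketchContentType:
    stripped = content.strip()

    # Only try to parse as JSON when the text looks like an object or array.
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return SketchContentType.JSON
        except json.JSONDecodeError:
            pass

    lines = [ln for ln in stripped.splitlines() if ln.strip()]
    if lines:
        hits = sum(1 for ln in lines if SEARCH_LINE.match(ln))
        if hits / len(lines) >= 0.5:  # assumed threshold
            return SketchContentType.SEARCH_RESULTS

    if BUILD_MARKERS.search(stripped):
        return SketchContentType.BUILD_OUTPUT

    return SketchContentType.PLAIN_TEXT
```

Checking the stricter formats first and falling through to plain text mirrors the "default fallback" row in the table.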

	## Integration Pattern

	```python
	from headroom.transforms import (
	    detect_content_type, ContentType,
	    SearchCompressor, LogCompressor, TextCompressor,
	)

	def compress_tool_output(content: str, context: str = "") -> str:
	    """Application-level compression with explicit control."""
	    detection = detect_content_type(content)

	    if detection.content_type == ContentType.SEARCH_RESULTS:
	        result = SearchCompressor().compress(content, context)
	        return result.compressed
	    elif detection.content_type == ContentType.BUILD_OUTPUT:
	        result = LogCompressor().compress(content)
	        return result.compressed
	    elif detection.content_type == ContentType.PLAIN_TEXT:
	        result = TextCompressor().compress(content, context)
	        return result.compressed
	    else:
	        # JSON or other - let SmartCrusher handle it automatically
	        return content
	```
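
Because text compression is lossy, an application may want a guard around a function like `compress_tool_output`. The sketch below is one possible policy, not part of Headroom: `compress_if_large` and `min_chars` are hypothetical, and the compressor is passed in as a plain callable so the helper does not depend on any particular API.

```python
from typing import Callable


def compress_if_large(
    content: str,
    compress: Callable[[str], str],
    min_chars: int = 2000,  # assumed threshold; tune for your application
) -> str:
    """Compress only large inputs and fall back to the original on any problem."""
    if len(content) < min_chars:
        return content  # small outputs are cheap to send as-is

    try:
        compressed = compress(content)
    except Exception:
        # Compression is an optimization here, so never let it break the caller.
        return content

    # Keep the compressed text only if it is actually shorter.
    return compressed if len(compressed) < len(content) else content
```

With the integration function above, a call might look like `compress_if_large(output, lambda c: compress_tool_output(c, context="find process"))`.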

	## Configuration

	Each compressor accepts configuration options:

	```python
	from headroom.transforms import SearchCompressor, SearchCompressorConfig

	config = SearchCompressorConfig(
	    max_results=50,                # Keep up to 50 matches
	    preserve_file_diversity=True,  # Ensure different files represented
	    relevance_threshold=0.3,       # Minimum relevance score to keep
	)

	compressor = SearchCompressor(config)
	```

	## Performance

	| Compressor | Typical Input | Output | Latency |
	|------------|---------------|--------|---------|
	| SearchCompressor | 1000 matches | 30-50 matches | ~2ms |
	| LogCompressor | 5000 lines | 100-200 lines | ~3ms |
	| TextCompressor | 10000 chars | 2000 chars | ~2ms |
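
Figures like these depend on hardware and input shape, so it can be worth measuring on your own data. The helper below is a generic timing sketch using only the standard library. `time_call` is a hypothetical name, and it times any zero-argument callable rather than anything specific to Headroom.

```python
import statistics
import time
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 20) -> float:
    """Return the median wall-clock time of fn() in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    # The median is less sensitive to one-off spikes than the mean.
    return statistics.median(samples)
```

For example, `time_call(lambda: SearchCompressor().compress(search_results, context="find process"))` would time the call from the SearchCompressor example on your own input.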

	## When to Use

	| Scenario | Recommendation |
	|----------|----------------|
	| JSON tool output | Let SmartCrusher handle automatically |
	| grep/ripgrep results | Use `SearchCompressor` |
	| pytest/npm/cargo output | Use `LogCompressor` |
	| Documentation/README | Use `TextCompressor` |
	| Unknown content | Use `detect_content_type` to route |