# LLMLingua-2 Integration

For maximum compression, Headroom integrates with **LLMLingua-2**, Microsoft's BERT-based token classifier trained via GPT-4 distillation. It achieves **up to 20x compression** (a 95% reduction in tokens) while preserving semantic meaning.

## When to Use LLMLingua-2

| Approach | Best For | Compression | Speed |
|----------|----------|-------------|-------|
| **SmartCrusher** | JSON tool outputs | 70-90% | ~1ms |
| **Text Utilities** | Search/logs | 50-90% | ~1ms |
| **LLMLingua-2** | Any text, max compression | 80-95% | ~50-200ms |

LLMLingua-2 is ideal when you need maximum compression and can tolerate higher latency: roughly 50-200ms per request instead of ~1ms. Typical cases are compressing large tool outputs before storage and offline processing.
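The decision described above can be sketched as a small helper. This is an illustration only: `choose_approach` and its parameters are hypothetical names, not part of Headroom, and the 200ms threshold is simply the upper end of the latency range in the table.

```python
import json


def choose_approach(
    content: str,
    latency_budget_ms: float,
    need_max_compression: bool = False,
) -> str:
    """Suggest a compression approach using the rough figures from the table."""
    # LLMLingua-2 takes roughly 50-200ms, so require the worst case to fit.
    if need_max_compression and latency_budget_ms >= 200:
        return "LLMLingua-2"
    # Otherwise fall back to the ~1ms approaches, picked by content shape.
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return "Text Utilities"
    # Only JSON objects and arrays count as structured tool output here.
    return "SmartCrusher" if isinstance(parsed, (dict, list)) else "Text Utilities"
```

A caller with a tight latency budget gets one of the fast approaches even when it asks for maximum compression, which mirrors the tradeoff in the table.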

## Installation

```bash
# Pulls in ~2GB of dependencies
pip install "headroom-ai[llmlingua]"
```

## Basic Usage

```python
from headroom.transforms import LLMLinguaCompressor

# Create compressor (model loaded lazily on first use)
compressor = LLMLinguaCompressor()

# Compress any text
long_output = "The function processUserData takes a user object and validates..."
result = compressor.compress(long_output)

print(f"Before: {result.original_tokens} tokens")
print(f"After: {result.compressed_tokens} tokens")
print(f"Saved: {result.savings_percentage:.1f}%")
print(result.compressed)
```
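The numbers printed above are related by simple arithmetic. The sketch below assumes that `savings_percentage` is the share of tokens removed, which the source does not state explicitly; `savings_pct` and `compression_ratio` are hypothetical helper names used only for illustration.

```python
def savings_pct(original_tokens: int, compressed_tokens: int) -> float:
    """Percentage of tokens removed (assumed meaning of savings_percentage)."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 100.0 * (original_tokens - compressed_tokens) / original_tokens


def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """How many times smaller the output is, e.g. 20.0 for 20x."""
    if compressed_tokens <= 0:
        raise ValueError("compressed_tokens must be positive")
    return original_tokens / compressed_tokens
```

Under this reading, 2,000 tokens compressed to 100 tokens is a 20x ratio and a 95% saving, which is how the "20x" and "80-95%" figures on this page line up.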

## Content-Aware Compression

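`LLMLinguaCompressor` detects the content type (code, JSON, or plain text) and applies a separate compression rate to each. A rate is the fraction of tokens to keep, so lower values compress more aggressively: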

```python
from headroom.transforms import LLMLinguaCompressor, LLMLinguaConfig

# Conservative for code (keep 40% of tokens)
config = LLMLinguaConfig(
    code_compression_rate=0.4,    # More conservative
    json_compression_rate=0.35,   # Moderate
    text_compression_rate=0.25,   # Aggressive
)

compressor = LLMLinguaCompressor(config)

# Auto-detects content type
code_result = compressor.compress("def calculate(x): return x * 2")
text_result = compressor.compress("This is a verbose explanation...")
```
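To make the idea of content-aware rates concrete, here is a minimal sketch of how a detector could map content to a keep rate. The heuristic is an assumption made for illustration and is not Headroom's actual detection logic; `guess_content_type`, `target_tokens`, and the keyword pattern are all hypothetical.

```python
import json
import re

# Keep rates taken from the configuration shown above.
RATES = {"code": 0.4, "json": 0.35, "text": 0.25}

# Hypothetical heuristic: a line that starts with a common code keyword.
_CODE_HINT = re.compile(
    r"^\s*(def|class|import|from|function|const|let|var|return)\b",
    re.MULTILINE,
)


def guess_content_type(content: str) -> str:
    """Classify content as 'json', 'code', or 'text' with simple heuristics."""
    try:
        if isinstance(json.loads(content), (dict, list)):
            return "json"
    except json.JSONDecodeError:
        pass
    return "code" if _CODE_HINT.search(content) else "text"


def target_tokens(content: str, original_tokens: int) -> int:
    """Approximate token count to keep for this content at its keep rate."""
    return round(original_tokens * RATES[guess_content_type(content)])
```

A keyword heuristic like this misclassifies prose that happens to start with a word such as "from", so a real detector would need to be more careful.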

## Memory Management

The model uses ~1GB RAM. Unload it when done:

```python
from headroom.transforms import (
    LLMLinguaCompressor,
    unload_llmlingua_model,
    is_llmlingua_model_loaded,
)

compressor = LLMLinguaCompressor()
content = "Large tool output to compress..."  # Placeholder input
result = compressor.compress(content)  # Model loaded here

# Check if loaded
print(is_llmlingua_model_loaded())  # True

# Free memory when done
unload_llmlingua_model()  # Frees ~1GB
print(is_llmlingua_model_loaded())  # False

# Next compression will reload automatically
```
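If you want the unload step to run even when compression raises, one option is a small context manager. This is a sketch, not a Headroom feature: `unload_after` is a hypothetical helper that accepts any zero-argument callable, such as `unload_llmlingua_model`.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def unload_after(unload: Callable[[], None]) -> Iterator[None]:
    """Run the body, then call `unload` even if the body raised."""
    try:
        yield
    finally:
        unload()


# Intended usage (requires headroom-ai[llmlingua], so it is left commented out):
#
# from headroom.transforms import LLMLinguaCompressor, unload_llmlingua_model
#
# with unload_after(unload_llmlingua_model):
#     result = LLMLinguaCompressor().compress(content)
```

Passing the unload function in as an argument keeps the helper independent of Headroom, which also makes it easy to test with a stub.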

## Device Configuration

```python
from headroom.transforms import LLMLinguaConfig, LLMLinguaCompressor

# Pick one of the following configurations.

# Auto-detect (default): uses CUDA > MPS > CPU
config = LLMLinguaConfig(device="auto")

# Force CPU (slower but works everywhere)
# config = LLMLinguaConfig(device="cpu")

# Force GPU (faster but needs CUDA)
# config = LLMLinguaConfig(device="cuda")

compressor = LLMLinguaCompressor(config)
```
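The `CUDA > MPS > CPU` priority can be illustrated with a short sketch. It assumes PyTorch is the underlying runtime and is not taken from Headroom's source; `resolve_device` is a hypothetical name.

```python
def resolve_device(requested: str = "auto") -> str:
    """Resolve 'auto' to a concrete device using a CUDA > MPS > CPU priority."""
    if requested != "auto":
        return requested  # Explicit choices are passed through unchanged.
    try:
        import torch
    except ImportError:
        return "cpu"  # Without PyTorch there is nothing to probe.
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps exists only in newer PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

Running `resolve_device()` on your machine shows which device the `"auto"` setting would most likely pick under these assumptions.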

## Use in Pipeline

```python
from headroom.transforms import TransformPipeline, LLMLinguaCompressor, SmartCrusher

# Combine with other transforms
pipeline = TransformPipeline([
    SmartCrusher(),        # First: compress JSON
    LLMLinguaCompressor(), # Then: ML compression on remaining text
])

# `messages` and `tokenizer` come from your application (not defined here)
result = pipeline.apply(messages, tokenizer)
```

## Proxy Integration

Enable LLMLingua in the proxy server for automatic ML compression:

```bash
# Enable LLMLingua in proxy (requires: pip install "headroom-ai[llmlingua,proxy]")
headroom proxy --llmlingua

# With custom settings
headroom proxy --llmlingua --llmlingua-device cuda --llmlingua-rate 0.4

# The proxy shows LLMLingua status at startup:
#   LLMLingua: ENABLED  (device=cuda, rate=0.4)
#
# If llmlingua is installed but not enabled, you'll see a helpful hint:
#   LLMLingua: available (enable with --llmlingua for ML compression)
```

## Configuration Reference

| Option | Default | Description |
|--------|---------|-------------|
| `device` | `"auto"` | Device to run model on: auto, cpu, cuda, mps |
| `code_compression_rate` | `0.4` | Keep 40% of tokens for code |
| `json_compression_rate` | `0.35` | Keep 35% of tokens for JSON |
| `text_compression_rate` | `0.25` | Keep 25% of tokens for text |
| `force_tokens` | `[]` | Tokens to always preserve |
| `drop_consecutive` | `True` | Drop consecutive whitespace |

## Performance Characteristics

| Metric | Value |
|--------|-------|
| Model size | ~500MB |
| Memory usage | ~1GB RAM |
| Cold start | 10-30s (first load) |
| Inference | 50-200ms per request |
| Compression | 80-95% |

## Why Opt-In?

LLMLingua adds significant dependencies and overhead:

| Aspect | Default Proxy | With LLMLingua |
|--------|--------------|----------------|
| Dependencies | ~50MB | ~2GB |
| Cold start | <1s | 10-30s |
| Per-request | ~1-5ms | ~50-200ms |
| Compression | 70-90% | 80-95% |

The default proxy is lightweight and fast. Enable LLMLingua when you need maximum compression and can accept the tradeoffs.
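To judge whether the tradeoff is acceptable for a given workload, a back-of-the-envelope estimate from the table's figures can help. The helper below is a hypothetical sketch: it assumes requests run sequentially and that the model is loaded once.

```python
def added_latency_seconds(
    requests: int,
    per_request_ms: tuple[float, float] = (50.0, 200.0),
    cold_start_s: tuple[float, float] = (10.0, 30.0),
) -> tuple[float, float]:
    """Return (best case, worst case) total time in seconds spent in LLMLingua."""
    low = cold_start_s[0] + requests * per_request_ms[0] / 1000.0
    high = cold_start_s[1] + requests * per_request_ms[1] / 1000.0
    return low, high
```

For 1,000 sequential requests this gives between 60 and 230 seconds in total, compared with about 1 to 5 seconds for the default proxy at ~1-5ms per request.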

## Troubleshooting

### "Model not found"

```bash
# Ensure llmlingua extra is installed
pip install "headroom-ai[llmlingua]"
```

### "CUDA out of memory"

```python
from headroom.transforms import LLMLinguaConfig

# Force CPU mode
config = LLMLinguaConfig(device="cpu")
```

### "Slow compression"

- Use GPU if available: `device="cuda"`
- Batch multiple compressions
- Consider using SmartCrusher for JSON (faster, similar results)