# Hugging Face Spaces & Concurrent Request Handling

## Overview
The service can be deployed on Hugging Face Spaces (T4 medium GPU). A request queue manager caps concurrent GPU work and queues, rejects, or times out requests that exceed the cap.

## Features Implemented

### 1. ✅ Request Queue Manager
**File**: `services/ai-service/src/ai_med_extract/services/request_queue.py`

**Features**:
- **Concurrent Request Limiting**: Max 2 concurrent requests for T4 medium GPU
- **Request Queuing**: Queue of up to 5 requests when all slots are busy
- **Priority System**: High/Normal/Low priority support
- **Automatic Slot Management**: Releases slots when requests complete
- **Queue Status API**: `/api/queue_status` endpoint for monitoring
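
The priority system and bounded queue described above could be modelled roughly as follows. This is a minimal sketch, not the code in `request_queue.py`: the `Priority` enum values, the `PriorityRequestQueue` class, and the "priority first, then arrival order" rule are all assumptions made for illustration.

```python
import heapq
import itertools
from enum import IntEnum


class Priority(IntEnum):
    # Assumed ordering: a lower value is served first.
    HIGH = 0
    NORMAL = 1
    LOW = 2


class PriorityRequestQueue:
    """Hypothetical bounded queue: pops by priority, then by arrival order."""

    def __init__(self, max_queue_size: int = 5):
        self.max_queue_size = max_queue_size
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, request_id: str, priority: Priority = Priority.NORMAL) -> bool:
        if len(self._heap) >= self.max_queue_size:
            return False  # the caller would answer with 503
        heapq.heappush(self._heap, (priority, next(self._counter), request_id))
        return True

    def pop(self) -> str:
        _, _, request_id = heapq.heappop(self._heap)
        return request_id


queue = PriorityRequestQueue(max_queue_size=5)
queue.push("a", Priority.LOW)
queue.push("b", Priority.NORMAL)
queue.push("c", Priority.HIGH)
queue.push("d", Priority.NORMAL)
order = [queue.pop() for _ in range(4)]
print(order)
```

The monotonically increasing counter means two requests with the same priority are never compared by `request_id`, so equal-priority requests leave in the order they arrived.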

**HF Spaces Configuration**:
- Max concurrent: 2 requests (T4 medium GPU limitation)
- Max queue size: 5 requests
- Queue timeout: 5 minutes

**Local/Dev Configuration**:
- Max concurrent: 4 requests
- Max queue size: 20 requests
- Queue timeout: 10 minutes

### 2. ✅ Queue Integration in Routes

**Endpoints Updated**:
- `/generate_patient_summary` (streaming mode)
- `/generate_patient_summary_streaming`
- `/generate_patient_summary_large_data`

**How It Works**:
1. Request arrives → check queue capacity
2. If the queue has room → enqueue the request; otherwise return 503
3. Create a job → wait for a processing slot
4. When a slot becomes available → start background processing
5. When processing completes → release the slot automatically
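
The flow above can be sketched with an `asyncio.Semaphore` guarding the processing slots. This is an illustrative sketch only: `SlotLimiter`, its counters, and the `RuntimeError` used for a full queue are hypothetical names, and the real manager may be structured differently.

```python
import asyncio


class SlotLimiter:
    """Hypothetical stand-in for the queue manager's slot handling."""

    def __init__(self, max_concurrent: int = 2, max_queue_size: int = 5):
        self._slots = asyncio.Semaphore(max_concurrent)
        self.max_queue_size = max_queue_size
        self.waiting = 0
        self.active = 0
        self.peak_active = 0

    async def run(self, job):
        # Steps 1-2: reject when the waiting line is already full.
        if self.waiting >= self.max_queue_size:
            raise RuntimeError("queue full")
        self.waiting += 1
        try:
            await self._slots.acquire()  # step 3: wait for a slot
        finally:
            self.waiting -= 1
        self.active += 1
        self.peak_active = max(self.peak_active, self.active)
        try:
            return await job()  # step 4: do the work
        finally:
            self.active -= 1
            self._slots.release()  # step 5: always release the slot


async def main():
    limiter = SlotLimiter(max_concurrent=2, max_queue_size=5)

    async def fake_job():
        await asyncio.sleep(0.01)  # placeholder for GPU work
        return "done"

    results = await asyncio.gather(*(limiter.run(fake_job) for _ in range(5)))
    return results, limiter.peak_active


results, peak = asyncio.run(main())
print(results, peak)
```

Releasing in a `finally` block is what makes the release "automatic": the slot is returned whether the job succeeds, raises, or is cancelled.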

### 3. ✅ HF Spaces Optimizations

**Automatic Detection**:
- Detects `HF_SPACES` environment variable
- Adjusts limits automatically for T4 medium GPU
- Optimizes memory usage
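
One way the detection and limit selection might look is sketched below, using the HF Spaces and local values listed earlier in this document. The `QueueLimits` dataclass, the helper names, and the choice to also accept `1` or `yes` for `HF_SPACES` are assumptions, not the service's actual code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueLimits:
    max_concurrent: int
    max_queue_size: int
    queue_timeout: int  # seconds


# Values taken from the configuration lists in this document.
HF_SPACES_LIMITS = QueueLimits(max_concurrent=2, max_queue_size=5, queue_timeout=300)
LOCAL_LIMITS = QueueLimits(max_concurrent=4, max_queue_size=20, queue_timeout=600)


def is_hf_spaces(env=os.environ) -> bool:
    # Assumption: HF_SPACES is a truthy string, or SPACE_ID is set by the platform.
    flag = env.get("HF_SPACES", "").strip().lower()
    return flag in {"1", "true", "yes"} or bool(env.get("SPACE_ID"))


def select_limits(env=os.environ) -> QueueLimits:
    return HF_SPACES_LIMITS if is_hf_spaces(env) else LOCAL_LIMITS


print(select_limits({"HF_SPACES": "true"}))
print(select_limits({}))
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.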

**Resource Management**:
- Limits concurrent GPU operations
- Prevents OOM errors
- Manages model loading/unloading

## Usage

### Check Queue Status
```bash
# Replace the host with your Space's URL
curl https://your-space.hf.space/api/queue_status
```

Response:
```json
{
  "active_requests": 1,
  "queue_size": 2,
  "max_concurrent": 2,
  "max_queue_size": 5,
  "total_processed": 10,
  "total_rejected": 0,
  "total_timeout": 0,
  "queue_positions": [
    {
      "request_id": "...",
      "job_id": "...",
      "priority": "NORMAL",
      "wait_time": 5.2
    }
  ]
}
```
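
A client could use this response to guess what will happen to its next request. The sketch below relies only on the fields shown in the sample response; `fetch_status` and `describe_capacity` are hypothetical helper names, and the result is only a hint because the queue can change between the status call and the actual request.

```python
import json
import urllib.request


def fetch_status(base_url: str) -> dict:
    """Fetch the queue status; base_url is your Space's URL (not called here)."""
    with urllib.request.urlopen(f"{base_url}/api/queue_status", timeout=10) as resp:
        return json.load(resp)


def describe_capacity(status: dict) -> str:
    """Classify the likely outcome of a new request: immediate, queued, or full."""
    free_slots = status["max_concurrent"] - status["active_requests"]
    free_queue = status["max_queue_size"] - status["queue_size"]
    if free_slots > 0 and status["queue_size"] == 0:
        return "immediate"
    if free_queue > 0:
        return "queued"
    return "full"


sample = {
    "active_requests": 1,
    "queue_size": 2,
    "max_concurrent": 2,
    "max_queue_size": 5,
}
print(describe_capacity(sample))
```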

### Making Requests

**Normal Request** (non-streaming):
- No queue management (processed immediately)
- Suitable for fast rule-based generation

**Streaming Request**:
- Automatically queued if slots are full
- Returns 503 if queue is full
- Streams progress updates including queue position

## Error Handling

### Queue Full (503 Service Unavailable)
```json
{
  "detail": "Queue full (5/5). Please try again later."
}
```
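
Clients will typically want to retry a 503 with a growing delay rather than immediately. The following is a minimal sketch of exponential backoff with jitter; `post_with_retry`, its parameters, and the delay values are illustrative assumptions, and `send` stands for any callable that performs the request and returns an object with a `status_code` attribute (for example a `requests.Response`).

```python
import random
import time


def post_with_retry(send, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call send() until it returns a non-503 status or attempts run out."""
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 503:
            return response
        if attempt < max_attempts - 1:
            # Exponential backoff plus up to 1 s of jitter to avoid
            # many clients retrying at the same instant.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return response  # still 503 after the final attempt


class FakeResponse:
    """Test double standing in for an HTTP response."""

    def __init__(self, status_code):
        self.status_code = status_code


statuses = iter([503, 503, 200])
delays = []
result = post_with_retry(lambda: FakeResponse(next(statuses)), sleep=delays.append)
print(result.status_code, len(delays))
```

In real use, `send` might be `lambda: requests.post(url, json=payload, stream=True)`; the injected `sleep` exists only so the example runs without waiting.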

### Queue Timeout
- Triggered when a request waits in the queue longer than the queue timeout (5 minutes on HF Spaces, 10 minutes locally)
- The job is marked as an error
- The slot is released automatically
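
A queue timeout of this kind could be implemented by bounding the wait for a slot, as sketched below. `acquire_slot` and the `status`/`error` keys on the job dictionary are hypothetical; the actual job model in the service may use different fields.

```python
import asyncio


async def acquire_slot(slots: asyncio.Semaphore, job: dict, queue_timeout: float) -> bool:
    """Wait for a slot; mark the job as an error if the wait exceeds the timeout."""
    try:
        await asyncio.wait_for(slots.acquire(), timeout=queue_timeout)
    except asyncio.TimeoutError:
        # No slot was obtained, so there is nothing to release here.
        job["status"] = "error"
        job["error"] = f"Timed out after {queue_timeout}s in queue"
        return False
    job["status"] = "processing"
    return True


async def demo():
    slots = asyncio.Semaphore(1)  # a single slot makes the timeout easy to trigger
    first, second = {}, {}
    ok_first = await acquire_slot(slots, first, queue_timeout=0.05)
    ok_second = await acquire_slot(slots, second, queue_timeout=0.05)
    slots.release()  # the first job finishes and frees its slot
    return ok_first, ok_second, first, second


ok_first, ok_second, first_job, second_job = asyncio.run(demo())
print(first_job, second_job)
```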

## Performance

### T4 Medium GPU Limits
- **Concurrent Requests**: 2 (prevents GPU OOM)
- **Queue Size**: 5 (reasonable wait time)
- **Memory**: ~16 GB of GPU memory, shared between requests

### Resource Sharing
- Models are cached and shared between requests
- GPU memory is managed per request
- CPU memory is cleaned up after each request

## Monitoring

### Queue Metrics
- Active requests count
- Queue size
- Total processed/rejected/timeout
- Average wait time

### Job Status
- Queue position shown in job data
- Progress updates include queue status
- SSE stream shows queue position
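
On the client side, the queue position can be read from the SSE stream by decoding each `data:` line. The sketch below assumes JSON payloads, and the field names `status`, `queue_position`, and `progress` are hypothetical; check the actual event format emitted by the service before relying on them.

```python
import json


def iter_sse_events(lines):
    """Yield decoded JSON payloads from the `data:` lines of an SSE stream."""
    buffer = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            buffer.append(line[5:].lstrip())
        elif line == "" and buffer:
            # A blank line terminates one event.
            yield json.loads("\n".join(buffer))
            buffer = []
    if buffer:  # stream ended without a trailing blank line
        yield json.loads("\n".join(buffer))


# Hypothetical sample events; real payloads may differ.
sample = [
    b'data: {"status": "queued", "queue_position": 2}',
    b"",
    b'data: {"status": "processing", "progress": 10}',
    b"",
]
events = list(iter_sse_events(sample))
positions = [e["queue_position"] for e in events if "queue_position" in e]
print(positions)
```

With the `requests` library, the same generator could be fed from `response.iter_lines()` on a response opened with `stream=True`.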

## Best Practices for HF Spaces

1. **Use Streaming**: Always use `stream=true` for long operations
2. **Monitor Queue**: Check `/api/queue_status` before making requests
3. **Handle 503**: Implement retry logic for queue full errors
4. **Timeout Handling**: Set client timeouts above the 5-minute queue timeout, leaving room for processing time
5. **Resource Limits**: Be aware of T4 medium GPU limitations

## Configuration

### Environment Variables
- `HF_SPACES=true` - Enables HF Spaces mode
- `SPACE_ID` - Auto-detected on HF Spaces

### Adjusting Limits
Edit `services/ai-service/src/ai_med_extract/services/request_queue.py`:
```python
# For HF Spaces
RequestQueueManager(
    max_concurrent=2,      # Adjust based on GPU
    max_queue_size=5,      # Adjust based on expected load
    queue_timeout=300      # 5 minutes
)
```

## Testing Concurrent Requests

```python
import concurrent.futures

import requests

URL = "https://your-space.hf.space/generate_patient_summary"
PAYLOAD = {"patientid": "...", "token": "...", "key": "...", "stream": True}


def make_request(i):
    # (connect, read) timeouts; the read timeout exceeds the 5-minute queue timeout.
    with requests.post(URL, json=PAYLOAD, stream=True, timeout=(10, 600)) as response:
        # Drain the stream so the request keeps its slot until processing ends.
        for _ in response.iter_lines():
            pass
        return i, response.status_code


# Test 5 concurrent requests
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    for i, status in executor.map(make_request, range(5)):
        print(f"Request {i}: {status}")
```

Expected behavior:
- 2 requests start immediately
- 3 requests are queued (within the 5-request queue limit, so none should receive a 503)
- Queued requests are processed in order as slots become available