# Hugging Face Spaces & Concurrent Request Handling

## Overview

The system now supports Hugging Face Spaces deployment (T4 medium GPU) with proper concurrent request handling.

## Features Implemented

### 1. ✅ Request Queue Manager

**File**: `services/ai-service/src/ai_med_extract/services/request_queue.py`

**Features**:
- **Concurrent Request Limiting**: Max 2 concurrent requests for T4 medium GPU
- **Request Queuing**: Queue of up to 5 requests when all slots are busy
- **Priority System**: High/Normal/Low priority support
- **Automatic Slot Management**: Releases slots when requests complete
- **Queue Status API**: `/api/queue_status` endpoint for monitoring

**HF Spaces Configuration**:
- Max concurrent: 2 requests (T4 medium GPU limitation)
- Max queue size: 5 requests
- Queue timeout: 5 minutes

**Local/Dev Configuration**:
- Max concurrent: 4 requests
- Max queue size: 20 requests
- Queue timeout: 10 minutes

### 2. ✅ Queue Integration in Routes

**Endpoints Updated**:
- `/generate_patient_summary` (streaming mode)
- `/generate_patient_summary_streaming`
- `/generate_patient_summary_large_data`

**How It Works**:
1. Request arrives → Check queue capacity
2. If capacity available → Enqueue request
3. Create job → Wait for processing slot
4. When slot available → Start background processing
5. When complete → Release slot automatically

### 3. ✅ HF Spaces Optimizations

**Automatic Detection**:
- Detects `HF_SPACES` environment variable
- Adjusts limits automatically for T4 medium GPU
- Optimizes memory usage

**Resource Management**:
- Limits concurrent GPU operations
- Prevents OOM errors
- Manages model loading/unloading

## Usage

### Check Queue Status

```bash
GET /api/queue_status
```

Response:

```json
{
  "active_requests": 1,
  "queue_size": 2,
  "max_concurrent": 2,
  "max_queue_size": 5,
  "total_processed": 10,
  "total_rejected": 0,
  "total_timeout": 0,
  "queue_positions": [
    {
      "request_id": "...",
      "job_id": "...",
      "priority": "NORMAL",
      "wait_time": 5.2
    }
  ]
}
```

### Making Requests

**Normal Request** (non-streaming):
- No queue management (processed immediately)
- Suitable for fast rule-based generation

**Streaming Request**:
- Automatically queued if slots are full
- Returns 503 if queue is full
- Streams progress updates including queue position

## Error Handling

### Queue Full (503 Service Unavailable)

```json
{
  "detail": "Queue full (5/5). Please try again later."
}
```

### Queue Timeout

- If request waits >5 minutes in queue
- Job marked as error
- Slot released automatically

## Performance

### T4 Medium GPU Limits

- **Concurrent Requests**: 2 (prevents GPU OOM)
- **Queue Size**: 5 (reasonable wait time)
- **Memory**: ~16GB GPU, shared between requests

### Resource Sharing

- Models are cached and shared between requests
- GPU memory is managed per request
- CPU memory is cleaned up after each request

## Monitoring

### Queue Metrics

- Active requests count
- Queue size
- Total processed/rejected/timeout
- Average wait time

### Job Status

- Queue position shown in job data
- Progress updates include queue status
- SSE stream shows queue position

## Best Practices for HF Spaces

1. **Use Streaming**: Always use `stream=true` for long operations
2. **Monitor Queue**: Check `/api/queue_status` before making requests
3. **Handle 503**: Implement retry logic for queue full errors
4. **Timeout Handling**: Set appropriate client timeouts (>5 minutes)
5. **Resource Limits**: Be aware of T4 medium GPU limitations

## Configuration

### Environment Variables

- `HF_SPACES=true` - Enables HF Spaces mode
- `SPACE_ID` - Auto-detected on HF Spaces

### Adjusting Limits

Edit `services/ai-service/src/ai_med_extract/services/request_queue.py`:

```python
# For HF Spaces
RequestQueueManager(
    max_concurrent=2,   # Adjust based on GPU
    max_queue_size=5,   # Adjust based on expected load
    queue_timeout=300   # 5 minutes
)
```

## Testing Concurrent Requests

```python
import concurrent.futures

import requests


def make_request(i):
    response = requests.post(
        "https://your-space.hf.space/generate_patient_summary",
        json={"patientid": "...", "token": "...", "key": "...", "stream": True},
        stream=True,
    )
    return i, response.status_code


# Test 5 concurrent requests
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = executor.map(make_request, range(5))
    for i, status in results:
        print(f"Request {i}: {status}")
```

Expected behavior:
- 2 requests start immediately
- 3 requests queued
- Requests process in order as slots become available
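
## Example: Retrying on 503

The "Handle 503" and "Timeout Handling" practices could be combined on the client side roughly as follows. This is a minimal sketch, not part of the service: `post_with_retry`, its parameters, and the backoff values are hypothetical names and numbers chosen for illustration. It takes any zero-argument callable that returns an object with a `status_code` attribute, so it works with `requests` but does not depend on it.

```python
import random
import time


def post_with_retry(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                    sleep=time.sleep):
    """Call send() until it returns a non-503 response or attempts run out.

    Hypothetical helper: `send` is any zero-argument callable returning an
    object with a `status_code` attribute (for example a requests.Response).
    """
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 503:
            return response  # success or a non-retryable error
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter so clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    return response  # still 503 after the last attempt


# Possible usage with requests (URL and payload are placeholders):
#   import requests
#   resp = post_with_retry(lambda: requests.post(
#       "https://your-space.hf.space/generate_patient_summary",
#       json={"patientid": "...", "token": "...", "key": "...", "stream": True},
#       stream=True,
#       timeout=(10, 330),  # connect timeout, read timeout above the 5 minute queue timeout
#   ))
```

Returning the final 503 response instead of raising leaves the decision about what to do next with the caller.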
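
## Example: Selecting Limits by Environment

The automatic detection described under "HF Spaces Optimizations" might look something like the sketch below. The numbers come from the HF Spaces and Local/Dev configurations listed earlier; `QueueLimits` and `select_limits` are hypothetical names, and treating a non-empty `SPACE_ID` as a signal for Spaces mode is an assumption, since the actual detection logic in `request_queue.py` is not shown here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueLimits:
    max_concurrent: int
    max_queue_size: int
    queue_timeout: int  # seconds


# Values taken from the configuration tables in this document.
HF_SPACES_LIMITS = QueueLimits(max_concurrent=2, max_queue_size=5, queue_timeout=300)
LOCAL_LIMITS = QueueLimits(max_concurrent=4, max_queue_size=20, queue_timeout=600)


def select_limits(env=None):
    """Pick queue limits from environment variables (illustrative only)."""
    env = os.environ if env is None else env
    # Assumption: HF_SPACES=true or any non-empty SPACE_ID means "on Spaces".
    on_spaces = env.get("HF_SPACES", "").lower() == "true" or bool(env.get("SPACE_ID"))
    return HF_SPACES_LIMITS if on_spaces else LOCAL_LIMITS


if __name__ == "__main__":
    print(select_limits())
```

Passing the environment in as a mapping keeps the function easy to exercise in tests without touching the real process environment.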
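
## Example: Slot and Queue Sketch

The "How It Works" steps (check capacity, enqueue, wait for a slot, release on completion) can be illustrated with a small thread-based sketch. This is not the code in `request_queue.py`: `SimpleRequestQueue`, `QueueFull`, `QueueTimeout`, and the integer priority scheme (0 = high, 1 = normal, 2 = low) are assumptions made for illustration, and the real `RequestQueueManager` may be structured quite differently.

```python
import heapq
import itertools
import threading
import time


class QueueFull(Exception):
    """Raised when the waiting queue has no room (maps to HTTP 503)."""


class QueueTimeout(Exception):
    """Raised when a request waits longer than queue_timeout for a slot."""


class SimpleRequestQueue:
    def __init__(self, max_concurrent=2, max_queue_size=5, queue_timeout=300.0):
        self.max_concurrent = max_concurrent
        self.max_queue_size = max_queue_size
        self.queue_timeout = queue_timeout
        self._cond = threading.Condition()
        self._active = 0
        self._waiting = []              # heap of (priority, sequence) tickets
        self._seq = itertools.count()   # tie-breaker keeps FIFO order per priority

    def acquire(self, priority=1):
        """Block until a processing slot is free; lower priority value goes first."""
        with self._cond:
            if self._active < self.max_concurrent and not self._waiting:
                self._active += 1
                return
            if len(self._waiting) >= self.max_queue_size:
                raise QueueFull(
                    f"Queue full ({len(self._waiting)}/{self.max_queue_size}). "
                    "Please try again later."
                )
            ticket = (priority, next(self._seq))
            heapq.heappush(self._waiting, ticket)
            deadline = time.monotonic() + self.queue_timeout
            # Wait until a slot is free and this ticket is at the head of the queue.
            while not (self._active < self.max_concurrent and self._waiting[0] == ticket):
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    self._waiting.remove(ticket)
                    heapq.heapify(self._waiting)
                    self._cond.notify_all()
                    raise QueueTimeout("Timed out waiting for a processing slot")
                self._cond.wait(remaining)
            heapq.heappop(self._waiting)
            self._active += 1
            self._cond.notify_all()  # another slot may still be free for the next waiter

    def release(self):
        """Free a slot and wake waiting requests."""
        with self._cond:
            self._active -= 1
            self._cond.notify_all()

    def status(self):
        with self._cond:
            return {
                "active_requests": self._active,
                "queue_size": len(self._waiting),
                "max_concurrent": self.max_concurrent,
                "max_queue_size": self.max_queue_size,
            }
```

A caller would typically wrap the work in `try`/`finally` so that `release()` runs even when processing fails, which matches the "Release slot automatically" step.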