# Hugging Face T4 Space Performance Guide

## Why Response Times Vary

Your HF T4 Space experiences variable response times due to several factors:

### 1. **Cold Start vs Warm Start** ⏱️

| Scenario | Response Time | Reason |
|----------|---------------|--------|
| **First request after idle** | 2-5 minutes | Model loading + container startup |
| **Subsequent requests** | 30-60 seconds | Model already in memory |
| **After 15-30 min idle** | 2-5 minutes | Model may have been unloaded, or the Space restarted or put to sleep |
| **Concurrent requests (3+)** | 5+ minutes | Queue waiting time |

### 2. **Request Queueing** 🚦

Your `RequestQueueManager` configuration:
- **Max concurrent**: 2 requests
- **Queue size**: 10 requests
- **Queue timeout**: 5 minutes

**What happens:**
```
Request 1 & 2: Processing immediately
Request 3-12: Waiting in queue (up to 5 min)
Request 13+: Rejected (queue full)
```
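
The admission rule above can be sketched as a small function. This is only an illustration of the arithmetic, not the actual `RequestQueueManager` code; the function name `classify_requests` is hypothetical.

```python
def classify_requests(n_requests, max_concurrent=2, max_queue_size=10):
    """Split n simultaneous requests into processed, queued, and rejected counts."""
    processing = min(n_requests, max_concurrent)
    queued = min(max(n_requests - max_concurrent, 0), max_queue_size)
    rejected = max(n_requests - max_concurrent - max_queue_size, 0)
    return {"processing": processing, "queued": queued, "rejected": rejected}


# 15 requests arriving at once: 2 run, 10 wait, 3 are turned away.
print(classify_requests(15))
```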

### 3. **Lazy Model Loading** 🐌

Your code uses lazy loading:
```python
self.model_loader = unified_model_manager.get_model(
    self.current_model_name,
    self.current_model_type,
    lazy=True  # Model loads on first use
)
```

**Impact:**
- First request to a model: adds 30 s to 2 min of loading time
- The model may be unloaded after a period of inactivity
- The next request after an unload pays the loading time again
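
To make the pattern concrete, here is a minimal sketch of lazy loading with a stand-in loader. `LazyModel` and `slow_loader` are hypothetical names, and the 0.2 s sleep stands in for the real 30 s to 2 min load; your `unified_model_manager` may work differently.

```python
import time


class LazyModel:
    """Defers an expensive load until the first call to generate()."""

    def __init__(self, loader):
        self._loader = loader
        self._model = None
        self.load_count = 0

    def generate(self, prompt):
        if self._model is None:  # first use, or first use after unload()
            self._model = self._loader()
            self.load_count += 1
        return self._model(prompt)

    def unload(self):
        self._model = None


def slow_loader():
    time.sleep(0.2)  # stands in for the real model load
    return lambda prompt: prompt.upper()


def timed(fn, *args):
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


model = LazyModel(slow_loader)
first = timed(model.generate, "hello")   # pays the load
second = timed(model.generate, "hello")  # model already in memory
model.unload()
third = timed(model.generate, "hello")   # pays the load again
print(f"first={first:.2f}s second={second:.2f}s after-unload={third:.2f}s")
```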

### 4. **HF Spaces Infrastructure** 🏗️

- **Shared resources**: Performance varies with HF load
- **Container restarts**: HF may restart your space periodically
- **Network latency**: Model downloads if not cached
- **Memory pressure**: GPU memory competition between requests

---

## 🛠️ Solutions to Improve Consistency

### **Solution 1: Eager Model Loading** (Recommended)

Preload models at startup instead of lazy loading:

**File**: `services/ai-service/src/ai_med_extract/app.py`

Add this to your app initialization:

```python
def initialize_agents(app, preload_small_models=False):
    """Initialize all agents with eager model loading for HF Spaces"""
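    import os
    # HF_SPACES is assumed to be set by this app's own configuration.
    # Spaces itself exposes SPACE_ID, so it is checked as a fallback.
    is_hf_spaces = (
        os.getenv("HF_SPACES", "false").lower() == "true"
        or bool(os.getenv("SPACE_ID"))
    )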
    
    # For HF Spaces, preload the primary model
    if is_hf_spaces:
        logger.info("🚀 HF Spaces detected - preloading primary model...")
        from .utils.unified_model_manager import unified_model_manager
        
        # Preload the GGUF model (your primary model)
        try:
            primary_model = unified_model_manager.get_model(
                "microsoft/Phi-3-mini-4k-instruct-gguf",
                "gguf",
                lazy=False  # EAGER loading
            )
            logger.info("✅ Primary model preloaded successfully")
        except Exception as e:
            logger.error(f"❌ Failed to preload primary model: {e}")
    
    # ... rest of initialization
```

### **Solution 2: Model Keep-Alive Mechanism**

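Keep the model manager from unloading idle models by periodically refreshing their `last_used` timestamps. This only affects unloading inside your app; it does not stop the Space itself from restarting or sleeping.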

**Create**: `services/ai-service/src/ai_med_extract/utils/model_keepalive.py`

```python
"""
Model Keep-Alive Service
Prevents models from being unloaded due to inactivity
"""
import asyncio
import logging
from datetime import datetime

logger = logging.getLogger(__name__)

class ModelKeepAlive:
    def __init__(self, model_manager, interval_seconds=300):
        """
        Args:
            model_manager: Unified model manager instance
            interval_seconds: Ping interval (default: 5 minutes)
        """
        self.model_manager = model_manager
        self.interval = interval_seconds
        self.running = False
        self.task = None
    
    async def ping_models(self):
        """Send keep-alive pings to loaded models"""
        while self.running:
            try:
                # Get list of loaded models
                loaded_models = self.model_manager.list_loaded_models()
                
                for model_info in loaded_models:
                    model_name = model_info.get('name')
                    logger.debug(f"Keep-alive ping: {model_name}")
                    
                    # Update last_used timestamp
                    model = self.model_manager.get_model(
                        model_name,
                        model_info.get('type'),
                        lazy=True  # Don't reload if already loaded
                    )
                    if model:
                        model.last_used = datetime.now().timestamp()
                
                logger.info(f"✅ Keep-alive ping sent to {len(loaded_models)} models")
                
            except Exception as e:
                logger.error(f"❌ Keep-alive error: {e}")
            
            await asyncio.sleep(self.interval)
    
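    def start(self):
        """Start the keep-alive service.

        Must be called from inside a running event loop (for example a
        FastAPI startup handler); asyncio.create_task raises RuntimeError otherwise.
        """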
        if not self.running:
            self.running = True
            self.task = asyncio.create_task(self.ping_models())
            logger.info(f"🚀 Model keep-alive started (interval: {self.interval}s)")
    
    def stop(self):
        """Stop the keep-alive service"""
        if self.running:
            self.running = False
            if self.task:
                self.task.cancel()
            logger.info("🛑 Model keep-alive stopped")

# Global instance
_keepalive_service = None

def get_keepalive_service(model_manager):
    """Get or create the global keep-alive service"""
    global _keepalive_service
    if _keepalive_service is None:
        _keepalive_service = ModelKeepAlive(model_manager)
    return _keepalive_service
```
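
One way to start and stop the service is an async context manager, which is the shape FastAPI accepts through `FastAPI(lifespan=...)`. The sketch below uses only the standard library and a stub (`StubKeepAlive`, a hypothetical stand-in with the same `start()`/`stop()` surface) so it runs on its own; in the real app you would call `get_keepalive_service(unified_model_manager)` instead.

```python
import asyncio
from contextlib import asynccontextmanager


class StubKeepAlive:
    """Stand-in with the same start()/stop() surface as ModelKeepAlive."""

    def __init__(self):
        self.running = False
        self.task = None
        self.pings = 0

    async def _loop(self):
        while self.running:
            self.pings += 1
            await asyncio.sleep(0.01)

    def start(self):
        self.running = True
        # create_task needs a running event loop, hence the lifespan hook.
        self.task = asyncio.create_task(self._loop())

    def stop(self):
        self.running = False
        if self.task:
            self.task.cancel()


keepalive = StubKeepAlive()


@asynccontextmanager
async def lifespan(app):
    keepalive.start()  # runs inside the server's event loop
    try:
        yield
    finally:
        keepalive.stop()


async def demo():
    # Simulates the server being up for a short while.
    async with lifespan(app=None):
        await asyncio.sleep(0.05)
    return keepalive.pings


pings = asyncio.run(demo())
print(f"pings while the app was up: {pings}")
```

With FastAPI, the same `lifespan` function would be passed as `app = FastAPI(lifespan=lifespan)`.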

### **Solution 3: Increase Concurrent Request Limit**

If you have enough GPU memory, increase concurrent requests:

**File**: `services/ai-service/src/ai_med_extract/services/request_queue.py`

```python
# For T4 Medium with 16GB GPU RAM
RequestQueueManager(
    max_concurrent=3,  # Increased from 2
    max_queue_size=15,  # Increased from 10
    queue_timeout=300   # Keep at 5 minutes
)
```

**⚠️ Warning**: Monitor GPU memory usage. If you get OOM errors, reduce back to 2.
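
Before raising the limit, a rough estimate of GPU memory per concurrent request can help. The sketch below rests on several assumptions that are not taken from your code: Phi-3-mini with 32 layers and hidden size 3072, an fp16 KV cache, about 2.4 GiB of Q4 weights, and one full 4096-token cache per concurrent request. It also ignores scratch buffers and any other loaded models, so treat the result as a lower bound.

```python
GIB = 1024 ** 3


def kv_cache_bytes(context_tokens, n_layers=32, hidden_size=3072, bytes_per_value=2):
    """Bytes for keys plus values across all layers for one sequence."""
    return 2 * n_layers * hidden_size * bytes_per_value * context_tokens


def estimate_gpu_gib(max_concurrent, context_tokens=4096, weights_gib=2.4):
    """Weights are counted once; each concurrent sequence gets its own KV cache."""
    per_request_gib = kv_cache_bytes(context_tokens) / GIB
    return weights_gib + max_concurrent * per_request_gib


for n in (2, 3, 4):
    print(f"max_concurrent={n}: ~{estimate_gpu_gib(n):.1f} GiB")
```

Under these assumptions each request adds about 1.5 GiB, so three concurrent requests come to roughly 6.9 GiB, well inside the T4's 16GB. Measured usage is still the deciding factor.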

### **Solution 4: Add Health Check Endpoint with Model Warmup**

Keep your space alive with periodic health checks:

**File**: `services/ai-service/src/ai_med_extract/api/routes_fastapi.py`

Add this endpoint:

```python
@app.get("/warmup")
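def warmup_models():
    """
    Warmup endpoint to keep models loaded
    Can be called by external monitoring service

    Declared with plain `def` so FastAPI runs it in its threadpool; the
    blocking model load and inference below would otherwise stall the event loop.
    """
    from datetime import datetime
    from ..utils.unified_model_manager import unified_model_manager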
    
    try:
        # Get primary model (will load if not already loaded)
        model = unified_model_manager.get_model(
            "microsoft/Phi-3-mini-4k-instruct-gguf",
            "gguf",
            lazy=False
        )
        
        # Optional: Run a tiny inference to keep GPU warm
        test_prompt = "Test"
        _ = model.generate(test_prompt, max_tokens=10)
        
        return {
            "status": "warm",
            "timestamp": datetime.now().isoformat(),
            "models_loaded": len(unified_model_manager.list_loaded_models())
        }
    except Exception as e:
        return {
            "status": "cold",
            "error": str(e),
            "timestamp": datetime.now().isoformat()
        }
```

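Then use a service such as **UptimeRobot** or **cron-job.org** to ping `/warmup` every 5 minutes. The endpoint returns HTTP 200 even when warmup fails, so have the monitor check for `"status": "warm"` in the response body rather than relying on the status code alone.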
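
If you would rather run the ping yourself, for example from a small always-on machine, a standard-library sketch could look like this. The URL is a placeholder, and the `status` field is the one returned by the `/warmup` endpoint above.

```python
import json
import time
import urllib.request


def is_warm(payload):
    """True when the /warmup response reports a loaded model."""
    return payload.get("status") == "warm"


def ping_warmup(url, timeout=120):
    """Call the warmup endpoint once and return its decoded JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_forever(url, interval_seconds=300):
    """Ping every interval_seconds; network or decoding errors are logged, not fatal."""
    while True:
        try:
            payload = ping_warmup(url)
            print("warm" if is_warm(payload) else f"cold: {payload.get('error')}")
        except (OSError, ValueError) as exc:  # URLError, timeouts, bad JSON
            print(f"ping failed: {exc}")
        time.sleep(interval_seconds)


# Placeholder address; replace with your Space's URL before running:
# run_forever("https://your-space.hf.space/warmup")
```

The generous 120 s timeout allows for a ping that lands on a cold model and triggers a load.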

### **Solution 5: Optimize Queue Settings for HF Spaces**

**File**: `app.py` (root level)

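Add queue configuration. The three new variables are app-level settings, not built-in Hugging Face or PyTorch options, so they only take effect where your code reads them: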

```python
import os

# Set environment variables for Hugging Face Spaces
os.environ.setdefault("FAST_MODE", "true")
os.environ.setdefault("PRELOAD_SMALL_MODELS", "false")
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

# NEW: Queue optimization
os.environ.setdefault("MAX_CONCURRENT_REQUESTS", "3")  # Increase if stable
os.environ.setdefault("EAGER_MODEL_LOADING", "true")   # Disable lazy loading
os.environ.setdefault("MODEL_KEEPALIVE", "true")       # Enable keep-alive
```
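
For these flags to matter, something has to read them. A minimal sketch of such a reader follows; `RuntimeSettings`, `env_bool`, and `load_settings` are hypothetical names, and the defaults (2 concurrent requests, lazy loading, no keep-alive) mirror the current behavior described earlier.

```python
import os
from dataclasses import dataclass


def env_bool(name, default, env=os.environ):
    """Interpret '1', 'true', or 'yes' (any case) as True."""
    return env.get(name, str(default)).strip().lower() in ("1", "true", "yes")


@dataclass(frozen=True)
class RuntimeSettings:
    max_concurrent: int
    eager_model_loading: bool
    model_keepalive: bool


def load_settings(env=os.environ):
    """Build settings from an environment-like mapping."""
    return RuntimeSettings(
        max_concurrent=int(env.get("MAX_CONCURRENT_REQUESTS", "2")),
        eager_model_loading=env_bool("EAGER_MODEL_LOADING", False, env),
        model_keepalive=env_bool("MODEL_KEEPALIVE", False, env),
    )


# A plain dict stands in for os.environ so the example is reproducible.
settings = load_settings({"MAX_CONCURRENT_REQUESTS": "3", "EAGER_MODEL_LOADING": "true"})
print(settings)
```

The resulting values could then be passed to `RequestQueueManager(max_concurrent=settings.max_concurrent, ...)` and to the `lazy=` argument of `get_model`.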

---

## 📈 Expected Performance After Optimization

| Metric | Before | After Optimization |
|--------|--------|-------------------|
| **First request (cold)** | 2-5 min | 30-60 sec (model preloaded during startup, so the Space takes longer to become ready) |
| **Subsequent requests** | 30-60 sec | 30-60 sec (consistent) |
| **After 15 min idle** | 2-5 min | 30-60 sec (keep-alive prevents unload) |
| **Concurrent requests** | Queue wait with 3+ requests | Shorter queue wait (3 slots instead of 2) |

---

## 🔍 Monitoring & Debugging

### Check Model Status

Add this endpoint to monitor model loading:

```python
@app.get("/model-status")
async def model_status():
    """Get current model loading status"""
    from datetime import datetime
    from ..services.request_queue import get_queue_manager
    from ..utils.unified_model_manager import unified_model_manager
    
    loaded_models = unified_model_manager.list_loaded_models()
    
    return {
        "loaded_models": loaded_models,
        "total_loaded": len(loaded_models),
        "queue_status": get_queue_manager().get_queue_status(),
        "timestamp": datetime.now().isoformat()
    }
```

### Check Queue Status

```python
@app.get("/queue-status")
async def queue_status():
    """Get current request queue status"""
    from ..services.request_queue import get_queue_manager
    
    return get_queue_manager().get_queue_status()
```

---

## 🎯 Quick Wins (Implement These First)

1. **Change lazy loading to eager loading** in `patient_summary_agent.py`:
   ```python
   lazy=False  # Instead of lazy=True
   ```

2. **Increase concurrent requests** (if GPU memory allows):
   ```python
   max_concurrent=3  # Instead of 2
   ```

3. **Set up external monitoring** to ping `/warmup` every 5 minutes

4. **Monitor GPU memory** to ensure you're not hitting OOM errors
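
For the GPU memory check, one option is to query `nvidia-smi` from Python. The sketch below separates parsing from the subprocess call so the parser can be tried without a GPU; the sample line is illustrative, not a measurement from your Space.

```python
import subprocess


def parse_memory_line(line):
    """Parse one 'used, total' CSV line (values in MiB) from nvidia-smi."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def gpu_memory_mib():
    """Return a list of (used, total) MiB tuples, one per visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_memory_line(line) for line in out.strip().splitlines()]


# Illustrative sample line; call gpu_memory_mib() on the Space for real numbers.
used, total = parse_memory_line("5230, 15360")
print(f"{used}/{total} MiB ({used / total:.0%} used)")
```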

---

## 🚨 Common Issues

### Issue: "Model failed to load"
- **Cause**: Model not in cache or GPU OOM
- **Solution**: Check that `preload_models.py` ran during the build. If the failure is a GPU OOM, reduce `max_concurrent`.

### Issue: "Request queued for 5 minutes"
- **Cause**: Too many concurrent requests
- **Solution**: Increase `max_concurrent` or optimize model inference time
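
A back-of-the-envelope estimate shows why the timeout is easy to hit. It assumes every slot is busy with a request that has just started and that every request takes the same time; `estimated_wait_seconds` is a hypothetical helper, not part of `RequestQueueManager`.

```python
import math


def estimated_wait_seconds(queue_position, max_concurrent, seconds_per_request):
    """Approximate wait for the request at a 1-based queue position."""
    # Slots free up in waves of max_concurrent, one wave per request duration.
    waves = math.ceil(queue_position / max_concurrent)
    return waves * seconds_per_request


# Last slot in a full queue of 10, with 60 s per request:
print(estimated_wait_seconds(10, max_concurrent=2, seconds_per_request=60))  # 300
print(estimated_wait_seconds(10, max_concurrent=3, seconds_per_request=60))  # 240
```

With 2 slots and 60-second requests, the tenth queued request waits about 300 seconds, which is exactly the 5-minute queue timeout. Raising `max_concurrent` to 3 or halving inference time brings it back under the limit.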

### Issue: "First request very slow"
- **Cause**: Cold start / lazy loading
- **Solution**: Enable eager loading and keep-alive

---

## 📚 Additional Resources

- [HF Spaces Documentation](https://huggingface.co/docs/hub/spaces)
- [T4 GPU Specs](https://www.nvidia.com/en-us/data-center/tesla-t4/)
- [Model Optimization Guide](https://huggingface.co/docs/transformers/perf_train_gpu_one)

---

- **Last Updated**: 2025-11-27
- **Your Current Setup**: T4 Medium (16GB GPU, 16GB RAM)
- **Primary Model**: microsoft/Phi-3-mini-4k-instruct-gguf (Q4 quantized)