# Quick Fix Guide: Reduce Variable Response Times on HF Spaces

## Problem
Your HF T4 Space has inconsistent response times:
- Sometimes: **1 minute** ✅
- Sometimes: **5+ minutes** ❌

## Root Causes
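1. **Lazy model loading** - The model loads on the first request, so that request also pays the load time
2. **Model unloading** - Models unload after inactivity and must reload on the next request
3. **Request queueing** - Only 2 concurrent requests are allowed, so additional requests wait in line
4. **Cold starts** - HF Spaces may sleep after inactivity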

## Quick Fix (5 Minutes)

### Step 1: Update `app.py` (Root Level)

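Add the import and the two calls shown below to your `app.py`. The import goes at the top; the calls go before and after app creation, as the comments indicate: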

```python
# At the top, after imports
from services.ai_service.src.ai_med_extract.utils.hf_spaces_optimizations import (
    configure_hf_spaces_env,
    apply_hf_spaces_optimizations
)

# ADD THIS before creating the app:
configure_hf_spaces_env()

# Existing code that creates the app (the new call goes after line 42)
app = create_app(initialize=False)
initialize_agents(app, preload_small_models=False)

# ADD THIS after the app is created:
apply_hf_spaces_optimizations(app)

# Existing code (assumes `logging` is already imported in app.py)
logging.info("Application initialized successfully")
```

### Step 2: Configuration Applied

The optimizations automatically configure:

**Request Queue Settings:**
- **Max Concurrent Requests**: 6 (increased from 2)
- **Max Queue Size**: 10 requests
- **Queue Timeout**: 20 minutes

**Model Loading:**
- **Eager Loading**: Enabled (models preload at startup)
- **Keep-Alive Service**: Enabled (prevents model unloading)
- **Keep-Alive Interval**: 5 minutes

**Logging:**
- **Detailed Logging**: Enabled for all operations
- **Model Operation Logs**: Track loading, generation start/end
- **Generation Metrics**: Track tokens/second, duration, etc.

These settings are automatically applied when you call `apply_hf_spaces_optimizations(app)`.
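
To make the queue settings concrete, here is a minimal sketch of how a limit of 6 concurrent requests, 10 waiting requests, and a 20-minute wait could be enforced with `asyncio`. The class `BoundedRequestQueue` and the exception `QueueFull` are hypothetical names for illustration; the project's actual `request_queue.py` may work differently.

```python
import asyncio

MAX_CONCURRENT = 6          # value from the settings above
MAX_QUEUE_SIZE = 10         # value from the settings above
QUEUE_TIMEOUT_S = 20 * 60   # 20 minutes, from the settings above


class QueueFull(Exception):
    """Raised when the waiting line is already at its limit."""


class BoundedRequestQueue:
    """Illustrative only; not the project's request_queue.py."""

    def __init__(self, max_concurrent=MAX_CONCURRENT,
                 max_queue=MAX_QUEUE_SIZE, timeout_s=QUEUE_TIMEOUT_S):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_queue = max_queue
        self._timeout_s = timeout_s
        self._waiting = 0

    async def run(self, handler):
        """Run the async callable `handler` once a slot is available."""
        if self._slots.locked():
            # All slots are busy: join the waiting line, or reject if it is full.
            if self._waiting >= self._max_queue:
                raise QueueFull("request queue is full")
            self._waiting += 1
            try:
                # Give up if no slot frees within the queue timeout.
                await asyncio.wait_for(self._slots.acquire(), self._timeout_s)
            finally:
                self._waiting -= 1
        else:
            # A slot is free, so this returns immediately.
            await self._slots.acquire()
        try:
            return await handler()
        finally:
            self._slots.release()
```

Create the queue inside the running event loop (for example in a startup hook) and wrap each request handler with `await queue.run(handler)`. A higher `max_concurrent` shortens waits but raises GPU memory use, which is why the Troubleshooting section suggests lowering it if you see OOM errors.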

### Step 3: Set Up External Monitoring (Optional but Recommended)

Use a free service such as **UptimeRobot** or **cron-job.org** to send a request to your warmup endpoint every 5 minutes:

**URL to ping**: `https://your-space-name.hf.space/warmup`

**Interval**: Every 5 minutes

Regular pings keep the Space from going to sleep after inactivity, so requests do not hit a cold start.
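
If you would rather run the ping yourself (for example from a small always-on machine), the same idea can be sketched with the Python standard library. The function names `ping` and `run_forever` are illustrative, and the URL is the placeholder from this guide; replace it with your Space's address.

```python
import time
import urllib.error
import urllib.request

WARMUP_URL = "https://your-space-name.hf.space/warmup"  # placeholder, replace with yours
INTERVAL_S = 5 * 60  # every 5 minutes, as recommended above


def ping(url, timeout_s=60, opener=urllib.request.urlopen):
    """Return the HTTP status code, or None if the request could not complete."""
    try:
        with opener(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except OSError:
        # Covers connection failures and timeouts.
        return None


def run_forever(url=WARMUP_URL, interval_s=INTERVAL_S):
    """Ping the warmup endpoint on a fixed interval until interrupted."""
    while True:
        status = ping(url)
        print(f"{time.strftime('%H:%M:%S')} {url} -> {status}")
        time.sleep(interval_s)
```

Call `run_forever()` from a script to start the loop. A hosted pinger is usually more reliable, since this script stops whenever the machine running it does.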

### Step 4: Deploy to HF Spaces

```bash
git add .
git commit -m "Add HF Spaces performance optimizations"
git push
```

## Expected Results

| Metric | Before | After |
|--------|--------|-------|
| First request (cold) | 2-5 min | 30-60 sec |
| Subsequent requests | 30-60 sec | 30-60 sec |
| After 15 min idle | 2-5 min | 30-60 sec |
| Consistency | ❌ Variable | ✅ Consistent |
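
To check whether your Space actually matches these figures, you can time a handful of requests before and after the change. The sketch below is a generic timing helper; `time_calls` and `summarize` are illustrative names, and `send_request` stands for whatever call you normally make to your Space.

```python
import statistics
import time


def time_calls(send_request, n=5):
    """Call send_request() n times and return each duration in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        send_request()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Summarize how much the response time varies across calls."""
    return {
        "min_s": min(durations),
        "max_s": max(durations),
        "median_s": statistics.median(durations),
        "spread_s": max(durations) - min(durations),
    }
```

A small `spread_s` relative to `median_s` is what "consistent" means in the table above. Run the measurement once right after an idle period and once while the Space is warm to compare the cold and warm cases.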

## Monitoring Endpoints

After deployment, you can check these endpoints:

1. **Model Status**: `https://your-space-name.hf.space/model-status`
   - Shows which models are loaded

2. **Queue Status**: `https://your-space-name.hf.space/queue-status`
   - Shows request queue state

3. **Keep-Alive Status**: `https://your-space-name.hf.space/keepalive-status`
   - Shows keep-alive service stats

4. **Warmup**: `https://your-space-name.hf.space/warmup`
   - Manually trigger model warmup
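
To check the three status endpoints in one go, a short script like the following may help. It is a sketch that assumes the endpoints answer plain GET requests; whether they return JSON is also an assumption, so the code falls back to raw text. The names `status_urls` and `fetch_status` are illustrative.

```python
import json
import urllib.request

ENDPOINTS = ("model-status", "queue-status", "keepalive-status")


def status_urls(base_url):
    """Build the full URL for each status endpoint listed above."""
    base = base_url.rstrip("/")
    return {name: f"{base}/{name}" for name in ENDPOINTS}


def fetch_status(base_url, timeout_s=30):
    """Fetch each endpoint; return parsed JSON if possible, else the raw text."""
    results = {}
    for name, url in status_urls(base_url).items():
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except OSError as err:
            # Connection errors, timeouts, and HTTP error statuses land here.
            results[name] = f"error: {err}"
            continue
        try:
            results[name] = json.loads(body)
        except json.JSONDecodeError:
            results[name] = body
    return results
```

For example, `fetch_status("http://localhost:7860")` queries a locally running app, and passing your Space's base URL queries the deployed one.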

## Troubleshooting

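### Issue: "Module not found" error
**Solution**: Make sure both new files exist and are committed:
- `services/ai-service/src/ai_med_extract/utils/model_keepalive.py`
- `services/ai-service/src/ai_med_extract/utils/hf_spaces_optimizations.py`

Also check that the directory name matches the import in Step 1. The import uses `services.ai_service` (underscore), while the paths above use `ai-service` (hyphen). A Python `import` statement cannot contain a hyphen, so the import path has to match the directory name that actually exists on disk.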
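
If you want the app to keep starting even when the optimization module cannot be imported, one option is a guarded import. This is a sketch of that pattern, not part of the project; `load_optional` is a hypothetical helper name.

```python
import importlib
import logging


def load_optional(module_path, attr):
    """Return module_path.attr, or None if the module or attribute is missing."""
    try:
        module = importlib.import_module(module_path)
    except ImportError as err:
        # ModuleNotFoundError is a subclass of ImportError, so it is caught here too.
        logging.warning("Optional module unavailable (%s); continuing without it", err)
        return None
    return getattr(module, attr, None)
```

In `app.py` this could be used as `apply = load_optional("services.ai_service.src.ai_med_extract.utils.hf_spaces_optimizations", "apply_hf_spaces_optimizations")`, followed by `if apply: apply(app)`. The trade-off is that a missing module then only shows up as a warning in the logs, so check them after deploying.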

### Issue: GPU OOM (Out of Memory) errors
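**Solution**: Reduce `max_concurrent` back to 2 in `request_queue.py`. Each in-flight request needs its own GPU memory on top of the model weights, so 6 concurrent requests may not fit in the T4's 16 GB.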
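
As a rough way to pick a value between 2 and 6, you can estimate how many requests fit in GPU memory. The helper below is a back-of-the-envelope sketch; the function name and the example numbers are illustrative, and you would need to measure your own model's weight size and per-request memory.

```python
def max_concurrent_for(gpu_gb, weights_gb, per_request_gb, headroom_gb=1.0):
    """Estimate how many requests fit in GPU memory at once (at least 1)."""
    if per_request_gb <= 0:
        raise ValueError("per_request_gb must be positive")
    # Memory left after loading the weights and reserving some headroom.
    free_gb = gpu_gb - weights_gb - headroom_gb
    return max(1, int(free_gb // per_request_gb))
```

For example, on a 16 GB T4 with a hypothetical 8 GB of weights and 1.5 GB per request, `max_concurrent_for(16, 8, 1.5)` gives 4; at 3 GB per request it gives 2.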

### Issue: Keep-alive not working
**Solution**: Check the `/keepalive-status` endpoint to verify that the service is running.
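
To help interpret what you see there, here is a minimal sketch of how a keep-alive service of this kind might work: a background thread touches the model on a fixed interval and keeps counters that a status endpoint could report. The class `KeepAlive`, the `touch_model` callback, and the field names in `status()` are assumptions for illustration; the project's `model_keepalive.py` may differ.

```python
import threading
import time


class KeepAlive:
    """Hypothetical stand-in for the project's keep-alive service."""

    def __init__(self, touch_model, interval_s=5 * 60):
        self._touch_model = touch_model   # e.g. a tiny generation call
        self._interval_s = interval_s     # 5 minutes, as configured in Step 2
        self._stop = threading.Event()
        self._thread = None
        self.stats = {"pings": 0, "failures": 0, "last_ping": None}

    def _loop(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            try:
                self._touch_model()
                self.stats["pings"] += 1
            except Exception:
                self.stats["failures"] += 1
            self.stats["last_ping"] = time.time()

    def start(self):
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()

    def status(self):
        running = self._thread is not None and self._thread.is_alive()
        return {"running": running, **self.stats}
```

In a design like this, a service that is not running, a ping count that stops increasing, or a rising failure count would each point to a different problem: the thread never started, it died, or the model call itself is failing.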

## Advanced: Manual Testing

Test the optimizations locally:

```bash
# Start the app (adjust the module path if your package directory is named differently)
python -m uvicorn services.ai-service.src.ai_med_extract.main:app --reload --port 7860

# In another terminal, test warmup
curl http://localhost:7860/warmup

# Check model status
curl http://localhost:7860/model-status

# Check queue status
curl http://localhost:7860/queue-status
```

## Rollback Plan

If something breaks, you can quickly roll back by reverting the last commit:

```bash
git revert HEAD
git push
```

Or simply remove the `apply_hf_spaces_optimizations(app)` line from `app.py`.

## Need More Help?

Check the full guide: `docs/HF_SPACES_PERFORMANCE_GUIDE.md`

---

**Estimated Time to Implement**: 5-10 minutes  
**Expected Performance Improvement**: 60-80% more consistent response times  
**Risk Level**: Low (all changes are additive, easy to rollback)