# Quick Fix Guide: Reduce Variable Response Times on HF Spaces
## Problem
Your HF T4 Space has inconsistent response times:
- Sometimes: **1 minute**
- Sometimes: **5+ minutes**
## Root Causes
1. **Lazy model loading** - Model loads on first request
2. **Model unloading** - Models unload after inactivity
3. **Request queueing** - Only 2 concurrent requests allowed
4. **Cold starts** - HF Spaces may sleep after inactivity
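
To make the first two causes concrete, here is a minimal sketch of a model holder that loads on first use and unloads after an idle period. `LazyModel`, `idle_timeout`, and the `loader` callable are hypothetical names for illustration; they are not the project's actual classes.

```python
import time


class LazyModel:
    """Hypothetical holder: loads on first use, unloads after idle_timeout seconds."""

    def __init__(self, loader, idle_timeout, clock=time.monotonic):
        self._loader = loader            # callable that performs the slow load
        self._idle_timeout = idle_timeout
        self._clock = clock              # injectable so the behaviour can be tested
        self._model = None
        self._last_used = None
        self.load_count = 0              # how many times the slow path ran

    def _expired(self):
        return (
            self._model is not None
            and self._clock() - self._last_used > self._idle_timeout
        )

    def get(self):
        # Drop the model if it has been idle too long (cause 2).
        if self._expired():
            self._model = None
        # Load on demand; the caller that hits this path pays the load time (cause 1).
        if self._model is None:
            self._model = self._loader()
            self.load_count += 1
        self._last_used = self._clock()
        return self._model
```

In this sketch, any call to `get()` made within `idle_timeout` of the previous one resets the idle clock, so the model is never unloaded. That is the effect a keep-alive service aims for.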
## Quick Fix (5 Minutes)
### Step 1: Update `app.py` (Root Level)
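Add the import and the two function calls shown below to your `app.py`. The import goes at the top; the calls go before and after app creation, as the comments indicate: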
```python
# At the top, after imports
from services.ai_service.src.ai_med_extract.utils.hf_spaces_optimizations import (
    configure_hf_spaces_env,
    apply_hf_spaces_optimizations,
)

# Before creating the app
configure_hf_spaces_env()

# After creating the app (after line 42)
app = create_app(initialize=False)
initialize_agents(app, preload_small_models=False)

# ADD THIS:
apply_hf_spaces_optimizations(app)

logging.info("Application initialized successfully")
```
### Step 2: Configuration Applied
The optimizations automatically configure:
**Request Queue Settings:**
- **Max Concurrent Requests**: 6 (increased from 2)
- **Max Queue Size**: 10 requests
- **Queue Timeout**: 20 minutes
**Model Loading:**
- **Eager Loading**: Enabled (models preload at startup)
- **Keep-Alive Service**: Enabled (prevents model unloading)
- **Keep-Alive Interval**: 5 minutes
**Logging:**
- **Detailed Logging**: Enabled for all operations
- **Model Operation Logs**: Track loading, generation start/end
- **Generation Metrics**: Track tokens/second, duration, etc.
These settings are automatically applied when you call `apply_hf_spaces_optimizations(app)`.
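
The request queue settings above can be pictured with a small `asyncio` sketch. This is only an illustration of the idea (a cap on running requests, a cap on waiting requests, and a wait timeout); `BoundedRequestQueue` and its parameter names are assumptions, not the contents of the project's `request_queue.py`.

```python
import asyncio


class BoundedRequestQueue:
    """Hypothetical sketch: cap running and waiting requests."""

    def __init__(self, max_concurrent=6, max_queue_size=10, queue_timeout=20 * 60):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_queue_size = max_queue_size
        self._queue_timeout = queue_timeout  # seconds
        self._waiting = 0

    async def run(self, handler):
        # Reject immediately when all slots are busy and the waiting line is full.
        if self._slots.locked() and self._waiting >= self._max_queue_size:
            raise RuntimeError("queue full")
        self._waiting += 1
        try:
            # Wait for a free slot, but give up after queue_timeout seconds.
            await asyncio.wait_for(self._slots.acquire(), self._queue_timeout)
        finally:
            self._waiting -= 1
        try:
            return await handler()
        finally:
            # Always free the slot, even if the handler raised.
            self._slots.release()
```

Create the queue inside a running event loop (for example in an application startup hook) and wrap each request handler with `await queue.run(handler)`. A higher `max_concurrent` shortens waiting time but increases peak GPU memory use.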
### Step 3: Set Up External Monitoring (Optional but Recommended)
Use a free service like **UptimeRobot** or **Cron-job.org** to ping your warmup endpoint every 5 minutes:
- **URL to ping**: `https://your-space-name.hf.space/warmup`
- **Interval**: Every 5 minutes

This keeps your Space from going cold between requests.
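
If you would rather run the ping from a machine you control, a minimal standard-library sketch could look like the following. The function names `ping` and `keep_warm` are hypothetical, and the script assumes only that the warmup endpoint answers a plain GET request.

```python
import itertools
import time
import urllib.request


def ping(url, timeout=30, opener=urllib.request.urlopen):
    """Return the HTTP status code, or None if the request failed."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except OSError:
        # URLError, HTTPError (4xx/5xx) and socket timeouts are all OSError subclasses.
        return None


def keep_warm(url, interval_seconds=300, rounds=None,
              sleep=time.sleep, opener=urllib.request.urlopen):
    """Yield one status per ping; runs forever when rounds is None."""
    counter = itertools.count() if rounds is None else range(rounds)
    for _ in counter:
        yield ping(url, opener=opener)
        sleep(interval_seconds)


if __name__ == "__main__":
    for status in keep_warm("https://your-space-name.hf.space/warmup"):
        print("warmup status:", status)
```

A hosted pinger is usually the more reliable choice, because this script only works while the machine running it stays online.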
### Step 4: Deploy to HF Spaces
```bash
git add .
git commit -m "Add HF Spaces performance optimizations"
git push
```
## Expected Results
| Metric | Before | After |
|--------|--------|-------|
| First request (cold) | 2-5 min | 30-60 sec |
| Subsequent requests | 30-60 sec | 30-60 sec |
| After 15 min idle | 2-5 min | 30-60 sec |
| Consistency | ❌ Variable | ✅ Consistent |
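
One way to check the "Consistency" row against your own Space is to record request durations before and after the change and compare their spread. The helper below is a sketch; `summarize_latencies` is a hypothetical name, and the coefficient of variation is just one reasonable measure of consistency.

```python
import statistics


def summarize_latencies(latencies_seconds):
    """Summarize request durations; a lower 'cv' means more consistent timings."""
    mean = statistics.mean(latencies_seconds)
    spread = statistics.pstdev(latencies_seconds)  # population standard deviation
    return {
        "min": min(latencies_seconds),
        "max": max(latencies_seconds),
        "mean": mean,
        "cv": spread / mean,  # coefficient of variation (unitless)
    }
```

Pass it a list of durations you measured yourself, in seconds, once for the old deployment and once for the new one, then compare the two `cv` values.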
## Monitoring Endpoints
After deployment, you can check these endpoints:
1. **Model Status**: `https://your-space.hf.space/model-status`
- Shows which models are loaded
2. **Queue Status**: `https://your-space.hf.space/queue-status`
- Shows request queue state
3. **Keep-Alive Status**: `https://your-space.hf.space/keepalive-status`
- Shows keep-alive service stats
4. **Warmup**: `https://your-space.hf.space/warmup`
- Manually trigger model warmup
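
To check the three read-only status endpoints in one go, a short script such as the one below may help. It is a sketch under two assumptions: the endpoints answer plain GET requests, and their bodies may or may not be JSON (the response format is not specified here, so the code falls back to raw text). `status_urls` and `fetch_status` are hypothetical helper names.

```python
import json
import urllib.request

# /warmup is left out on purpose: calling it triggers a warmup rather than reporting state.
ENDPOINTS = ("model-status", "queue-status", "keepalive-status")


def status_urls(base_url):
    """Map each endpoint name to its full URL."""
    base = base_url.rstrip("/")
    return {name: f"{base}/{name}" for name in ENDPOINTS}


def fetch_status(base_url, timeout=30, opener=urllib.request.urlopen):
    """Fetch every status endpoint; return parsed JSON, raw text, or an error string."""
    report = {}
    for name, url in status_urls(base_url).items():
        try:
            with opener(url, timeout=timeout) as response:
                body = response.read().decode("utf-8", errors="replace")
        except OSError as err:
            report[name] = f"error: {err}"
            continue
        try:
            report[name] = json.loads(body)
        except ValueError:
            report[name] = body  # not JSON; keep the raw text
    return report


if __name__ == "__main__":
    for name, value in fetch_status("https://your-space.hf.space").items():
        print(name, "->", value)
```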
## Troubleshooting
### Issue: "Module not found" error
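**Solution**: Make sure you created the new files:
- `services/ai-service/src/ai_med_extract/utils/model_keepalive.py`
- `services/ai-service/src/ai_med_extract/utils/hf_spaces_optimizations.py`

Also check that the import in Step 1 matches your directory layout. The import uses `services.ai_service` (underscore), while the paths above use `ai-service` (hyphen), and a Python `import` statement cannot contain a hyphen.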
### Issue: GPU OOM (Out of Memory) errors
**Solution**: Reduce `max_concurrent` back to 2 in `request_queue.py`
### Issue: Keep-alive not working
**Solution**: Check the `/keepalive-status` endpoint to verify that the keep-alive service is running
## Advanced: Manual Testing
Test the optimizations locally:
```bash
# Start the app
python -m uvicorn services.ai-service.src.ai_med_extract.main:app --reload --port 7860
# In another terminal, test warmup
curl http://localhost:7860/warmup
# Check model status
curl http://localhost:7860/model-status
# Check queue status
curl http://localhost:7860/queue-status
```
## Rollback Plan
If something breaks, you can quickly roll back:
```bash
git revert HEAD
git push
```
Or simply remove the `apply_hf_spaces_optimizations(app)` line from `app.py`.
## Need More Help?
Check the full guide: `docs/HF_SPACES_PERFORMANCE_GUIDE.md`
---
**Estimated Time to Implement**: 5-10 minutes
**Expected Performance Improvement**: 60-80% more consistent response times (an estimate; see the Expected Results table for the target ranges)
**Risk Level**: Low (all changes are additive and easy to roll back)