Quick Fix Guide: Reduce Variable Response Times on HF Spaces
Problem
Your HF T4 Space has inconsistent response times:
- Sometimes: 1 minute ✅
- Sometimes: 5+ minutes ❌
Root Causes
- Lazy model loading - Model loads on first request
- Model unloading - Models unload after inactivity
- Request queueing - Only 2 concurrent requests allowed
- Cold starts - HF Spaces may sleep after inactivity
Quick Fix (5 Minutes)
Step 1: Update app.py (Root Level)
Add the following to your app.py. The import goes at the top; the two calls go before and after app creation, as the comments indicate:
# At the top, after imports
from services.ai_service.src.ai_med_extract.utils.hf_spaces_optimizations import (
    configure_hf_spaces_env,
    apply_hf_spaces_optimizations,
)
# Before creating the app
configure_hf_spaces_env()
# After creating the app (after line 42)
app = create_app(initialize=False)
initialize_agents(app, preload_small_models=False)
# ADD THIS:
apply_hf_spaces_optimizations(app)
logging.info("Application initialized successfully")
Step 2: Configuration Applied
The optimizations automatically configure:
Request Queue Settings:
- Max Concurrent Requests: 6 (increased from 2)
- Max Queue Size: 10 requests
- Queue Timeout: 20 minutes
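To make the three queue settings concrete, here is a minimal sketch of how a limiter with those values could behave. It is not the project's request_queue.py; the names `BoundedRequestQueue`, `QueueFullError`, and `run` are hypothetical and only illustrate the idea of capping running requests, capping waiting requests, and timing out the wait.

```python
import asyncio


class QueueFullError(RuntimeError):
    """Raised when running plus waiting requests already fill the queue."""


class BoundedRequestQueue:
    """Hypothetical limiter for illustration; not the project's request_queue.py."""

    def __init__(self, max_concurrent=6, max_queue_size=10, queue_timeout_s=20 * 60):
        self._slots = asyncio.Semaphore(max_concurrent)
        # Total requests allowed in the system: those running plus those waiting.
        self._capacity = max_concurrent + max_queue_size
        self._queue_timeout_s = queue_timeout_s
        self._in_flight = 0

    async def run(self, handler):
        """Run an async handler once a slot is free, or reject if the queue is full."""
        if self._in_flight >= self._capacity:
            raise QueueFullError("request queue is full")
        self._in_flight += 1
        try:
            # Wait for a free slot; raises asyncio.TimeoutError after the queue timeout.
            await asyncio.wait_for(self._slots.acquire(), timeout=self._queue_timeout_s)
            try:
                return await handler()
            finally:
                self._slots.release()
        finally:
            self._in_flight -= 1
```

With the defaults above, a 17th simultaneous request would be rejected immediately, while requests 7 through 16 wait up to 20 minutes for a slot. Create the queue inside a running event loop so the sketch also works on older Python versions.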
Model Loading:
- Eager Loading: Enabled (models preload at startup)
- Keep-Alive Service: Enabled (prevents model unloading)
- Keep-Alive Interval: 5 minutes
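For intuition, a keep-alive service of this kind can be sketched as a background thread that touches the model on a fixed interval. This is an assumption about the general technique, not the contents of model_keepalive.py; `KeepAlive` and the `touch` callback are hypothetical names.

```python
import logging
import threading

logger = logging.getLogger("keepalive_sketch")


class KeepAlive:
    """Hypothetical keep-alive loop: calls `touch` every `interval_s` seconds."""

    def __init__(self, touch, interval_s=300.0):
        self._touch = touch              # e.g. a function that runs a tiny inference
        self._interval_s = interval_s    # 300 s matches the 5-minute interval above
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.ticks = 0

    def _run(self):
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop.wait(self._interval_s):
            try:
                self._touch()
                self.ticks += 1
            except Exception:
                # A failed touch is logged but must not end the loop.
                logger.exception("keep-alive touch failed")

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

Using a daemon thread means the loop never blocks process shutdown; `stop()` is still provided for a clean exit.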
Logging:
- Detailed Logging: Enabled for all operations
- Model Operation Logs: Track loading, generation start/end
- Generation Metrics: Track tokens/second, duration, etc.
These settings are automatically applied when you call apply_hf_spaces_optimizations(app).
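The generation metrics listed under Logging boil down to timing each generation and dividing token count by duration. A minimal sketch, assuming you want to log the same figures yourself; `track_generation` and `tokens_per_second` are hypothetical helpers, not functions from the project.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("generation_metrics_sketch")


def tokens_per_second(num_tokens, duration_s):
    """Throughput in tokens/second; 0.0 when the duration is not positive."""
    return num_tokens / duration_s if duration_s > 0 else 0.0


@contextmanager
def track_generation(model_name):
    """Log generation start/end and fill in duration and throughput."""
    stats = {"model": model_name, "tokens": 0}
    start = time.perf_counter()
    logger.info("generation start model=%s", model_name)
    try:
        yield stats  # the caller sets stats["tokens"] after generating
    finally:
        stats["duration_s"] = time.perf_counter() - start
        stats["tokens_per_s"] = tokens_per_second(stats["tokens"], stats["duration_s"])
        logger.info(
            "generation end model=%s tokens=%d duration=%.2fs rate=%.1f tok/s",
            model_name, stats["tokens"], stats["duration_s"], stats["tokens_per_s"],
        )
```

Wrap a generation call in `with track_generation("my-model") as stats:` and set `stats["tokens"]` to the number of generated tokens before the block exits.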
Step 3: Set Up External Monitoring (Optional but Recommended)
Use a free service like UptimeRobot or Cron-job.org to ping your warmup endpoint every 5 minutes:
URL to ping: https://your-space-name.hf.space/warmup
Interval: Every 5 minutes
This prevents your space from going cold.
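If you would rather run the ping yourself than rely on a third-party service, a small script on any always-on machine does the same job. This is a sketch using only the standard library; the URL is the placeholder from above, and `ping` and `ping_forever` are hypothetical names.

```python
import time
import urllib.error
import urllib.request

WARMUP_URL = "https://your-space-name.hf.space/warmup"  # placeholder, replace with yours


def ping(url, timeout_s=60.0, opener=urllib.request.urlopen):
    """Return the HTTP status code, or None if the server could not be reached."""
    try:
        with opener(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return None


def ping_forever(url=WARMUP_URL, interval_s=300.0):
    """Ping the warmup endpoint every `interval_s` seconds until interrupted."""
    while True:
        status = ping(url)
        print(f"{time.strftime('%H:%M:%S')} warmup status: {status}")
        time.sleep(interval_s)
```

Call `ping_forever()` from a terminal session or a service manager. The `opener` parameter exists only so the function can be exercised without network access.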
Step 4: Deploy to HF Spaces
git add .
git commit -m "Add HF Spaces performance optimizations"
git push
Expected Results
| Metric | Before | After |
|---|---|---|
| First request (cold) | 2-5 min | 30-60 sec |
| Subsequent requests | 30-60 sec | 30-60 sec |
| After 15 min idle | 2-5 min | 30-60 sec |
| Consistency | ❌ Variable | ✅ Consistent |
Monitoring Endpoints
After deployment, you can check these endpoints:
Model Status:
https://your-space.hf.space/model-status - Shows which models are loaded
Queue Status:
https://your-space.hf.space/queue-status - Shows request queue state
Keep-Alive Status:
https://your-space.hf.space/keepalive-status - Shows keep-alive service stats
Warmup:
https://your-space.hf.space/warmup - Manually trigger model warmup
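To check the three status endpoints in one go, a short script can fetch each and collect the results. The sketch below assumes the endpoints return JSON, which this guide does not confirm; `collect_status` and `fetch_json` are hypothetical helpers.

```python
import json
import urllib.request

STATUS_PATHS = ("/model-status", "/queue-status", "/keepalive-status")


def fetch_json(url, timeout_s=30.0):
    """GET a URL and parse the body as JSON (assumed response format)."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_status(base_url, fetch=fetch_json):
    """Return {path: parsed response}, or {path: {"error": ...}} on failure."""
    report = {}
    for path in STATUS_PATHS:
        try:
            report[path] = fetch(base_url.rstrip("/") + path)
        except (OSError, ValueError) as exc:
            # OSError covers network failures; ValueError covers non-JSON bodies.
            report[path] = {"error": str(exc)}
    return report
```

For example, `print(json.dumps(collect_status("https://your-space.hf.space"), indent=2))` prints one entry per endpoint, with an error message in place of any endpoint that failed.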
Troubleshooting
Issue: "Module not found" error
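Solution: Make sure both new files exist, and that the import path in app.py matches the actual directory name (Step 1 imports from services.ai_service, while the paths below use services/ai-service):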
services/ai-service/src/ai_med_extract/utils/model_keepalive.py
services/ai-service/src/ai_med_extract/utils/hf_spaces_optimizations.py
Issue: GPU OOM (Out of Memory) errors
Solution: Reduce max_concurrent back to 2 in request_queue.py
Issue: Keep-alive not working
Solution: Check /keepalive-status endpoint to verify service is running
Advanced: Manual Testing
Test the optimizations locally:
# Start the app
python -m uvicorn services.ai-service.src.ai_med_extract.main:app --reload --port 7860
# In another terminal, test warmup
curl http://localhost:7860/warmup
# Check model status
curl http://localhost:7860/model-status
# Check queue status
curl http://localhost:7860/queue-status
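Since the goal is consistency, it helps to time several requests and look at the spread rather than a single number. A minimal sketch, assuming you pass in whatever function issues one request against your local server; `time_calls` and `summarize` are hypothetical helpers.

```python
import statistics
import time


def time_calls(call, repeats=5):
    """Run `call` repeatedly and return the duration of each run in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Summarize latency samples; spread_ratio is slowest divided by fastest."""
    fastest, slowest = min(durations), max(durations)
    return {
        "min_s": fastest,
        "max_s": slowest,
        "median_s": statistics.median(durations),
        "spread_ratio": slowest / fastest if fastest > 0 else float("inf"),
    }
```

A spread ratio near 1 means response times are consistent; the 1-minute versus 5-minute pattern described under Problem would show up as a ratio of about 5.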
Rollback Plan
If something breaks, you can quickly roll back:
git revert HEAD
git push
Or simply remove the apply_hf_spaces_optimizations(app) line from app.py.
Need More Help?
Check the full guide: docs/HF_SPACES_PERFORMANCE_GUIDE.md
Estimated Time to Implement: 5-10 minutes
Expected Performance Improvement: 60-80% more consistent response times
Risk Level: Low (all changes are additive and easy to roll back)