
Quick Fix Guide: Reduce Variable Response Times on HF Spaces

Problem

Your HF T4 Space has inconsistent response times:

  • Sometimes: 1 minute
  • Sometimes: 5+ minutes

Root Causes

  1. Lazy model loading - Model loads on first request
  2. Model unloading - Models unload after inactivity
  3. Request queueing - Only 2 concurrent requests allowed
  4. Cold starts - HF Spaces may sleep after inactivity

Quick Fix (5 Minutes)

Step 1: Update app.py (Root Level)

Add the import after your existing imports in app.py, then add the two function calls around app creation, as marked in the comments:

# At the top, after imports
from services.ai_service.src.ai_med_extract.utils.hf_spaces_optimizations import (
    configure_hf_spaces_env,
    apply_hf_spaces_optimizations
)

# Before creating the app
configure_hf_spaces_env()

# Where the app is created (around line 42 of app.py); these two lines already exist
app = create_app(initialize=False)
initialize_agents(app, preload_small_models=False)

# ADD THIS:
apply_hf_spaces_optimizations(app)

logging.info("Application initialized successfully")

Step 2: Configuration Applied

The optimizations automatically configure:

Request Queue Settings:

  • Max Concurrent Requests: 6 (increased from 2)
  • Max Queue Size: 10 requests
  • Queue Timeout: 20 minutes

Model Loading:

  • Eager Loading: Enabled (models preload at startup)
  • Keep-Alive Service: Enabled (prevents model unloading)
  • Keep-Alive Interval: 5 minutes
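As a rough illustration of what a keep-alive service does, the sketch below runs a background thread that touches the model at a fixed interval so it is never idle long enough to be unloaded. This is an assumption about the general technique, not the contents of model_keepalive.py; `KeepAlive` and `touch_model` are hypothetical names.

```python
import logging
import threading


class KeepAlive:
    """Hypothetical sketch; not the project's model_keepalive.py."""

    def __init__(self, touch_model, interval_seconds=5 * 60):
        self._touch_model = touch_model  # e.g. a tiny dummy inference call
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self.ping_count = 0

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _loop(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            try:
                self._touch_model()
                self.ping_count += 1
            except Exception:
                # A failed ping is logged but does not end the keep-alive loop.
                logging.exception("keep-alive ping failed")
```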

Logging:

  • Detailed Logging: Enabled for all operations
  • Model Operation Logs: Track loading, generation start/end
  • Generation Metrics: Track tokens/second, duration, etc.

These settings are automatically applied when you call apply_hf_spaces_optimizations(app).
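The generation metrics listed under Logging (duration, tokens/second) can be collected with a small timing wrapper. The following is a sketch of that idea only; `track_generation` and the `generation_metrics` logger name are hypothetical and may not match what the optimizations module actually logs.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("generation_metrics")  # hypothetical logger name


@contextmanager
def track_generation(model_name):
    """Time one generation call and log its duration and tokens/second."""
    stats = {"tokens": 0}
    start = time.perf_counter()
    logger.info("generation start: model=%s", model_name)
    try:
        yield stats  # the caller sets stats["tokens"] when generation finishes
    finally:
        duration = time.perf_counter() - start
        rate = stats["tokens"] / duration if duration > 0 else 0.0
        stats["duration_s"] = duration
        stats["tokens_per_s"] = rate
        logger.info(
            "generation end: model=%s tokens=%d duration=%.2fs rate=%.1f tok/s",
            model_name, stats["tokens"], duration, rate,
        )
```

A caller would wrap the model call in `with track_generation("my-model") as stats:` and assign the number of generated tokens to `stats["tokens"]` before the block ends.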

Step 3: Set Up External Monitoring (Optional but Recommended)

Use a free service like UptimeRobot or Cron-job.org to ping your warmup endpoint every 5 minutes:

URL to ping: https://your-space-name.hf.space/warmup

Interval: Every 5 minutes

This prevents your space from going cold.
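If you would rather not rely on a third-party service, the same ping can be sent from any machine that stays online. The sketch below uses only the Python standard library; `ping` and `ping_forever` are hypothetical helper names, and the URL is the placeholder from this guide.

```python
import time
import urllib.request

WARMUP_URL = "https://your-space-name.hf.space/warmup"  # placeholder from this guide


def ping(url, timeout=60, opener=urllib.request.urlopen):
    """Return the HTTP status code, or None if the request failed."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except OSError:
        # Covers connection errors, timeouts, and HTTP error responses.
        return None


def ping_forever(url=WARMUP_URL, interval_seconds=5 * 60):
    """Ping the warmup endpoint every `interval_seconds` until interrupted."""
    while True:
        print(f"{time.strftime('%H:%M:%S')} warmup status: {ping(url)}")
        time.sleep(interval_seconds)
```

Calling `ping_forever()` starts the loop; stop it with Ctrl+C.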

Step 4: Deploy to HF Spaces

git add .
git commit -m "Add HF Spaces performance optimizations"
git push

Expected Results

| Metric | Before | After |
|---|---|---|
| First request (cold) | 2-5 min | 30-60 sec |
| Subsequent requests | 30-60 sec | 30-60 sec |
| After 15 min idle | 2-5 min | 30-60 sec |
| Consistency | ❌ Variable | ✅ Consistent |
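To check consistency on your own Space, record a handful of response times before and after the change and compare their spread. The helper below is a minimal sketch; `summarize_latencies` is a hypothetical name and the guide does not prescribe a specific metric.

```python
import statistics


def summarize_latencies(seconds):
    """Summarize response times (in seconds) for a before/after comparison."""
    ordered = sorted(seconds)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "spread": ordered[-1] - ordered[0],  # smaller spread = more consistent
    }
```

A large gap between `median` and `max` usually points at cold starts or model reloads rather than slow generation.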

Monitoring Endpoints

After deployment, you can check these endpoints:

  1. Model Status: https://your-space.hf.space/model-status
    • Shows which models are loaded
  2. Queue Status: https://your-space.hf.space/queue-status
    • Shows request queue state
  3. Keep-Alive Status: https://your-space.hf.space/keepalive-status
    • Shows keep-alive service stats
  4. Warmup: https://your-space.hf.space/warmup
    • Manually trigger model warmup
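The three read-only status endpoints can be polled in one go with a short script. This sketch assumes the endpoints return JSON, which this guide does not confirm, so it falls back to raw text; `fetch_status` and `fetch_all` are hypothetical helper names, and /warmup is left out because it triggers work rather than reporting state.

```python
import json
import urllib.request

BASE_URL = "https://your-space.hf.space"  # placeholder from this guide
ENDPOINTS = ["/model-status", "/queue-status", "/keepalive-status"]


def fetch_status(base_url, path, opener=urllib.request.urlopen):
    """Fetch one endpoint; return parsed JSON, or the raw text if it is not JSON."""
    with opener(base_url + path, timeout=30) as response:
        body = response.read().decode("utf-8", errors="replace")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body


def fetch_all(base_url=BASE_URL, opener=urllib.request.urlopen):
    """Return a dict mapping each status path to its response."""
    return {path: fetch_status(base_url, path, opener) for path in ENDPOINTS}
```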

Troubleshooting

Issue: "Module not found" error

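Solution: Make sure you created the new files, and that the import path in app.py matches the directory name on disk (Step 1 imports from services.ai_service, while the paths below use ai-service; a Python import statement cannot reference a directory with a hyphen in its name):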

  • services/ai-service/src/ai_med_extract/utils/model_keepalive.py
  • services/ai-service/src/ai_med_extract/utils/hf_spaces_optimizations.py
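A quick way to confirm the files are where the guide expects them is to check the paths from the repository root. This is a small sketch; `missing_files` is a hypothetical helper, and the paths are copied from the list above.

```python
from pathlib import Path

REQUIRED_FILES = [
    "services/ai-service/src/ai_med_extract/utils/model_keepalive.py",
    "services/ai-service/src/ai_med_extract/utils/hf_spaces_optimizations.py",
]


def missing_files(repo_root=".", required=REQUIRED_FILES):
    """Return the required paths that do not exist under `repo_root`."""
    root = Path(repo_root)
    return [p for p in required if not (root / p).is_file()]
```

Running `print(missing_files())` from the repository root prints an empty list when both files are present.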

Issue: GPU OOM (Out of Memory) errors

Solution: Reduce max_concurrent back to 2 in request_queue.py

Issue: Keep-alive not working

Solution: Check the /keepalive-status endpoint to verify that the service is running

Advanced: Manual Testing

Test the optimizations locally:

# Start the app
python -m uvicorn services.ai-service.src.ai_med_extract.main:app --reload --port 7860

# In another terminal, test warmup
curl http://localhost:7860/warmup

# Check model status
curl http://localhost:7860/model-status

# Check queue status
curl http://localhost:7860/queue-status

Rollback Plan

If something breaks, you can quickly roll back:

git revert HEAD
git push

Or simply remove the apply_hf_spaces_optimizations(app) line from app.py.

Need More Help?

Check the full guide: docs/HF_SPACES_PERFORMANCE_GUIDE.md


Estimated Time to Implement: 5-10 minutes
Expected Performance Improvement: 60-80% more consistent response times
Risk Level: Low (all changes are additive, easy to rollback)