# In app.py (root level): at the top, after the existing imports
from services.ai_service.src.ai_med_extract.utils.hf_spaces_optimizations import (
    configure_hf_spaces_env,
    apply_hf_spaces_optimizations
)

# Before creating the app
configure_hf_spaces_env()

# After creating the app (after line 42)
app = create_app(initialize=False)
initialize_agents(app, preload_small_models=False)

# ADD THIS:
apply_hf_spaces_optimizations(app)

logging.info("Application initialized successfully")

Step 2: Configuration Applied

The optimizations automatically configure:

Request Queue Settings:

Max Concurrent Requests: 6 (increased from 2)
Max Queue Size: 10 requests
Queue Timeout: 20 minutes
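
To make the three queue settings concrete, here is a minimal sketch of how a concurrency limit, a queue size cap, and a queue timeout can interact. It is not the code in request_queue.py; the class name BoundedRequestQueue, the QueueFullError exception, and the parameter names are hypothetical and only the default values (6, 10, 20 minutes) come from this guide.

```python
import asyncio


class QueueFullError(RuntimeError):
    """Raised when too many requests are already waiting (hypothetical name)."""


class BoundedRequestQueue:
    """Illustrative limiter, not the project's request_queue.py.

    At most `max_concurrent` jobs run at once, at most `max_queue_size`
    callers wait for a slot, and a waiter gives up after `queue_timeout_s`.
    """

    def __init__(self, max_concurrent=6, max_queue_size=10, queue_timeout_s=20 * 60):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_queue_size = max_queue_size
        self._queue_timeout_s = queue_timeout_s
        self._waiting = 0

    async def run(self, job):
        """Run the coroutine function `job` as soon as a slot is free."""
        if self._slots.locked():
            # No free slot: the caller has to queue, if there is room.
            if self._waiting >= self._max_queue_size:
                raise QueueFullError("too many requests are already waiting")
            self._waiting += 1
            try:
                # Raises a timeout error if no slot frees up in time.
                await asyncio.wait_for(self._slots.acquire(), self._queue_timeout_s)
            finally:
                self._waiting -= 1
        else:
            # A slot is free, so this returns immediately.
            await self._slots.acquire()
        try:
            return await job()
        finally:
            self._slots.release()
```

Raising max_concurrent lets more generations overlap, which is also why it is the first setting to lower again if GPU memory runs out.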

Model Loading:

Eager Loading: Enabled (models preload at startup)
Keep-Alive Service: Enabled (prevents model unloading)
Keep-Alive Interval: 5 minutes
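
The keep-alive idea can be sketched as a background thread that calls a cheap function on a fixed interval. This is an illustration only, not the contents of model_keepalive.py: the KeepAlive class and the `touch` callable (for example, a one-token generation against the loaded model) are assumptions; only the 5-minute interval comes from this guide.

```python
import logging
import threading


class KeepAlive:
    """Illustrative keep-alive loop (hypothetical, not model_keepalive.py)."""

    def __init__(self, touch, interval_s=5 * 60):
        self._touch = touch            # cheap callable that exercises the model
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = None
        self.runs = 0                  # number of successful touches

    def _loop(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            try:
                self._touch()
                self.runs += 1
            except Exception:
                # A failed touch is logged but must not end the loop.
                logging.exception("keep-alive touch failed")

    def start(self):
        if self.is_running():
            return
        self._stop.clear()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()

    def is_running(self):
        return self._thread is not None and self._thread.is_alive()
```

A daemon thread is used so that the loop never blocks interpreter shutdown when the Space restarts.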

Logging:

Detailed Logging: Enabled for all operations
Model Operation Logs: Track loading, generation start/end
Generation Metrics: Track tokens/second, duration, etc.
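
As a rough illustration of the generation metrics, the sketch below times one generation call and derives tokens per second. The function name timed_generation, the logger name, and the assumption that `generate` returns a sequence of tokens are all hypothetical; the actual logging in the project may look different.

```python
import logging
import time

logger = logging.getLogger("generation_metrics")  # hypothetical logger name


def timed_generation(generate, prompt):
    """Call `generate(prompt)` and return (tokens, metrics).

    `generate` is assumed to return a sequence of tokens.
    """
    logger.info("generation start")
    start = time.perf_counter()
    tokens = generate(prompt)
    duration_s = time.perf_counter() - start
    # Guard against a zero duration on very fast calls.
    tokens_per_s = len(tokens) / duration_s if duration_s > 0 else 0.0
    metrics = {
        "tokens": len(tokens),
        "duration_s": duration_s,
        "tokens_per_s": tokens_per_s,
    }
    logger.info("generation end: %s", metrics)
    return tokens, metrics
```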

These settings are automatically applied when you call apply_hf_spaces_optimizations(app).

Step 3: Set Up External Monitoring (Optional but Recommended)

Use a free service such as UptimeRobot or cron-job.org to send a request to your warmup endpoint on a fixed schedule:

URL to ping: https://your-space.hf.space/warmup

Interval: Every 5 minutes

Regular pings keep the Space from going cold between real requests.
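
If you prefer to run the ping yourself (for example from a cron job on another machine), a small script is enough. The sketch below uses only the standard library; the function name ping_warmup and the `opener` parameter are illustrative, and the URL is the same placeholder used above.

```python
import time
import urllib.error
import urllib.request


def ping_warmup(url, timeout_s=120.0, opener=urllib.request.urlopen):
    """Request `url` once and return (status, elapsed_seconds).

    `status` is the HTTP status code, or None if the request failed
    before any HTTP response arrived (DNS error, timeout, refused, ...).
    """
    start = time.perf_counter()
    try:
        with opener(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error status
    except OSError:
        status = None      # network-level failure
    return status, time.perf_counter() - start
```

Calling ping_warmup("https://your-space.hf.space/warmup") every 5 minutes has the same effect as the hosted services. The returned elapsed time also gives a quick way to spot a cold start.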

Step 4: Deploy to HF Spaces

git add .
git commit -m "Add HF Spaces performance optimizations"
git push

Expected Results

Metric	Before	After
First request (cold)	2-5 min	30-60 sec
Subsequent requests	30-60 sec	30-60 sec
After 15 min idle	2-5 min	30-60 sec
Consistency	❌ Variable	✅ Consistent

Monitoring Endpoints

After deployment, you can check these endpoints:

Model Status: https://your-space.hf.space/model-status
- Shows which models are loaded
Queue Status: https://your-space.hf.space/queue-status
- Shows request queue state
Keep-Alive Status: https://your-space.hf.space/keepalive-status
- Shows keep-alive service stats
Warmup: https://your-space.hf.space/warmup
- Manually trigger model warmup
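
The read-only status endpoints can also be checked in one go from a script. This is a minimal sketch: the helper names fetch_text and check_space are made up for illustration, and the response bodies are returned as plain text because their exact format is not specified here.

```python
import urllib.request

# Read-only status endpoints listed above (warmup is left out on purpose,
# since calling it triggers work on the server).
STATUS_PATHS = ("/model-status", "/queue-status", "/keepalive-status")


def fetch_text(url, timeout_s=30.0):
    """Return (status_code, body_text) for a GET request to `url`."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


def check_space(base_url, fetch=fetch_text):
    """Query every status endpoint and map each path to (status, body)."""
    report = {}
    for path in STATUS_PATHS:
        try:
            report[path] = fetch(base_url.rstrip("/") + path)
        except OSError as exc:
            # Covers HTTP errors, timeouts, and connection failures.
            report[path] = (None, str(exc))
    return report
```

For example, check_space("https://your-space.hf.space") returns one entry per endpoint, with None as the status for any endpoint that could not be read.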

Troubleshooting

Issue: "Module not found" error

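Solution: Make sure both new files exist, and that the package directory name on disk matches the import path used in app.py. The import in app.py uses services.ai_service (underscore), while the paths below use services/ai-service (hyphen); a Python import statement only resolves if the directory name matches the name in the import exactly: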

services/ai-service/src/ai_med_extract/utils/model_keepalive.py
services/ai-service/src/ai_med_extract/utils/hf_spaces_optimizations.py

Issue: GPU OOM (Out of Memory) errors

Solution: Reduce max_concurrent back to 2 in request_queue.py

Issue: Keep-alive not working

Solution: Check the /keepalive-status endpoint to verify that the service is running

Advanced: Manual Testing

Test the optimizations locally:

# Start the app
python -m uvicorn services.ai-service.src.ai_med_extract.main:app --reload --port 7860

# In another terminal, test warmup
curl http://localhost:7860/warmup

# Check model status
curl http://localhost:7860/model-status

# Check queue status
curl http://localhost:7860/queue-status

Rollback Plan

If something breaks, you can quickly roll back:

git revert HEAD
git push

Or simply remove the apply_hf_spaces_optimizations(app) line from app.py.

Need More Help?

Check the full guide: docs/HF_SPACES_PERFORMANCE_GUIDE.md

Estimated Time to Implement: 5-10 minutes
Expected Performance Improvement: 60-80% more consistent response times
Risk Level: Low (all changes are additive and easy to roll back)

Quick Fix Guide: Reduce Variable Response Times on HF Spaces

Problem

Root Causes

Quick Fix (5 Minutes)

Step 1: Update app.py (Root Level)