HNTAI / docs /QUICK_FIX_PERFORMANCE.md

sachinchandrankallar

feat: Add Colab patient summary script, AI service utilities for performance, and related documentation.

f091f7a 7 months ago


	# Quick Fix Guide: Reduce Variable Response Times on HF Spaces

	## Problem
	Your HF T4 Space has inconsistent response times:
	- Sometimes: 1 minute ✅
	- Sometimes: 5+ minutes ❌

	## Root Causes
	1. Lazy model loading - Model loads on first request
	2. Model unloading - Models unload after inactivity
	3. Request queueing - Only 2 concurrent requests allowed
	4. Cold starts - HF Spaces may sleep after inactivity

	## Quick Fix (5 Minutes)

	### Step 1: Update `app.py` (Root Level)

	Import the helpers at the top of `app.py`, then call them before and after app creation:

	```python
	# At the top, with the other imports
	from services.ai_service.src.ai_med_extract.utils.hf_spaces_optimizations import (
	    configure_hf_spaces_env,
	    apply_hf_spaces_optimizations,
	)

	# Before creating the app
	configure_hf_spaces_env()

	# After creating the app (after line 42), where these two lines already exist:
	app = create_app(initialize=False)
	initialize_agents(app, preload_small_models=False)

	# ADD THIS, once the app exists and the agents are initialized:
	apply_hf_spaces_optimizations(app)

	logging.info("Application initialized successfully")
	```

	### Step 2: Configuration Applied

	The optimizations automatically configure:

	Request Queue Settings:
	- Max Concurrent Requests: 6 (increased from 2)
	- Max Queue Size: 10 requests
	- Queue Timeout: 20 minutes

	Model Loading:
	- Eager Loading: Enabled (models preload at startup)
	- Keep-Alive Service: Enabled (prevents model unloading)
	- Keep-Alive Interval: 5 minutes
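
A keep-alive service of this kind can be as small as a background thread that touches the model on a timer. The sketch below assumes a hypothetical `KeepAlive` class and a caller-supplied `touch` callable (for example, a tiny dummy generation); it is not the contents of `model_keepalive.py`.

```python
import logging
import threading


class KeepAlive:
    """Hypothetical helper: call `touch` every `interval_s` seconds."""

    def __init__(self, touch, interval_s=5 * 60):
        self._touch = touch
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = None
        self.runs = 0

    def start(self):
        # Daemon thread so it never blocks interpreter shutdown.
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            try:
                self._touch()
                self.runs += 1
            except Exception:
                # A failed touch is logged but must not kill the thread.
                logging.exception("keep-alive touch failed")

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

Using `Event.wait` instead of `time.sleep` lets `stop()` interrupt the wait immediately rather than after a full interval.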

	Logging:
	- Detailed Logging: Enabled for all operations
	- Model Operation Logs: Track loading, generation start/end
	- Generation Metrics: Track tokens/second, duration, etc.
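
As an illustration of the generation metrics listed above, the following sketch times a generation and derives tokens per second. The names `track_generation` and `tokens_per_second` are assumptions made for this example, not functions from the project.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("generation_metrics")


def tokens_per_second(token_count, duration_s):
    """Return throughput, or 0.0 when the duration is not positive."""
    return token_count / duration_s if duration_s > 0 else 0.0


@contextmanager
def track_generation(label):
    """Log generation start/end; the caller fills in stats["tokens"]."""
    stats = {"tokens": 0}
    start = time.perf_counter()
    logger.info("%s: generation started", label)
    try:
        yield stats
    finally:
        stats["duration_s"] = time.perf_counter() - start
        stats["tokens_per_s"] = tokens_per_second(
            stats["tokens"], stats["duration_s"]
        )
        logger.info(
            "%s: generation finished in %.2fs (%.1f tokens/s)",
            label, stats["duration_s"], stats["tokens_per_s"],
        )
```

A caller would wrap the model call in `with track_generation("summary") as stats:` and set `stats["tokens"]` to the number of generated tokens before the block exits.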

	These settings are automatically applied when you call `apply_hf_spaces_optimizations(app)`.

	### Step 3: Set Up External Monitoring (Optional but Recommended)

	Use a free service like UptimeRobot or Cron-job.org to ping your warmup endpoint on a schedule:

	- URL to ping: `https://your-space.hf.space/warmup`
	- Interval: every 5 minutes

	Regular pings keep the Space active, so it does not go cold between real requests.
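
If you would rather run the pinger yourself (from a small always-on machine, for instance), a standard-library script is enough. This is a sketch under assumptions: `ping` and `ping_every` are hypothetical helpers, and the `/warmup` endpoint is assumed to answer a plain GET, as the `curl` examples in this guide suggest.

```python
import time
import urllib.error
import urllib.request


def ping(url, timeout_s=30):
    """Return the HTTP status code, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def ping_every(url, interval_s=5 * 60, max_pings=None, fetch=ping, sleep=time.sleep):
    """Ping `url` repeatedly; `max_pings=None` means run until interrupted."""
    count = 0
    while max_pings is None or count < max_pings:
        status = fetch(url)
        print(f"{url} -> {status}")
        count += 1
        # Only sleep if another ping will follow.
        if max_pings is None or count < max_pings:
            sleep(interval_s)
    return count
```

Calling `ping_every("https://your-space.hf.space/warmup")` pings every 5 minutes until the process is stopped.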

	### Step 4: Deploy to HF Spaces

	```bash
	git add .
	git commit -m "Add HF Spaces performance optimizations"
	git push
	```

	## Expected Results

	| Metric | Before | After |
	|--------|--------|-------|
	| First request (cold) | 2-5 min | 30-60 sec |
	| Subsequent requests | 30-60 sec | 30-60 sec |
	| After 15 min idle | 2-5 min | 30-60 sec |
	| Consistency | ❌ Variable | ✅ Consistent |
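
To check the consistency claim on your own Space, time a handful of requests before and after the change and compare the spread. The helpers below are a hypothetical sketch; `time_call` accepts any callable, so you can pass whatever function sends a real request to your app.

```python
import statistics
import time


def time_call(fn):
    """Return how long fn() takes, in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def summarize(samples_s):
    """Summarize latency samples; a smaller spread means more consistent."""
    return {
        "min_s": min(samples_s),
        "median_s": statistics.median(samples_s),
        "max_s": max(samples_s),
        "spread_s": max(samples_s) - min(samples_s),
    }
```

For example, samples of 30, 45, 60, and 300 seconds give a median of 52.5 seconds but a spread of 270 seconds, which is the kind of variability the Before column describes.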

	## Monitoring Endpoints

	After deployment, you can check these endpoints:

	1. Model Status: `https://your-space.hf.space/model-status`
	   - Shows which models are loaded

	2. Queue Status: `https://your-space.hf.space/queue-status`
	   - Shows request queue state

	3. Keep-Alive Status: `https://your-space.hf.space/keepalive-status`
	   - Shows keep-alive service stats

	4. Warmup: `https://your-space.hf.space/warmup`
	   - Manually trigger model warmup
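
The three status endpoints can also be checked in one go from a script. This sketch only prints whatever each endpoint returns; the response format is not documented here, so nothing is parsed. `endpoint_urls` and `fetch_text` are hypothetical helper names.

```python
import urllib.error
import urllib.request

STATUS_ENDPOINTS = ("model-status", "queue-status", "keepalive-status")


def endpoint_urls(base_url, endpoints=STATUS_ENDPOINTS):
    """Build the full status URLs for a Space base URL."""
    base = base_url.rstrip("/")
    return [f"{base}/{name}" for name in endpoints]


def fetch_text(url, timeout_s=30):
    """Return (status_code, body_text); the status is None on network failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as err:
        return None, str(err)


def report(base_url):
    """Print the status code and raw body of every status endpoint."""
    for url in endpoint_urls(base_url):
        status, body = fetch_text(url)
        print(f"{url} [{status}]\n{body}\n")
```

Running `report("https://your-space.hf.space")` after a deploy gives a quick view of model, queue, and keep-alive state.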

	## Troubleshooting

	### Issue: "Module not found" error
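	Solution: Make sure you created the new files:
	- `services/ai-service/src/ai_med_extract/utils/model_keepalive.py`
	- `services/ai-service/src/ai_med_extract/utils/hf_spaces_optimizations.py`

	Also confirm that the import in `app.py` matches the directory name on disk. Step 1 imports from `services.ai_service`, while the paths above use `ai-service`, and a hyphen is not valid in a Python `import` statement.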

	### Issue: GPU OOM (Out of Memory) errors
	Solution: Reduce `max_concurrent` back to 2 in `request_queue.py`

	### Issue: Keep-alive not working
	Solution: Check `/keepalive-status` endpoint to verify service is running

	## Advanced: Manual Testing

	Test the optimizations locally:

	```bash
	# Start the app
	python -m uvicorn services.ai-service.src.ai_med_extract.main:app --reload --port 7860

	# In another terminal, test warmup
	curl http://localhost:7860/warmup

	# Check model status
	curl http://localhost:7860/model-status

	# Check queue status
	curl http://localhost:7860/queue-status
	```

	## Rollback Plan

	If something breaks, you can quickly roll back:

	```bash
	git revert HEAD
	git push
	```

	Or simply remove the `apply_hf_spaces_optimizations(app)` line from `app.py`.

	## Need More Help?

	Check the full guide: `docs/HF_SPACES_PERFORMANCE_GUIDE.md`

	---

	- Estimated Time to Implement: 5-10 minutes
	- Expected Performance Improvement: 60-80% more consistent response times
	- Risk Level: Low (all changes are additive, easy to roll back)