# Hugging Face T4 Space Performance Guide

## Why Response Times Vary

Response times on your HF T4 Space vary for several reasons:

### 1. **Cold Start vs Warm Start** ⏱️

| Scenario | Response Time | Reason |
|----------|---------------|--------|
| **First request after idle** | 2-5 minutes | Model loading + container startup |
| **Subsequent requests** | 30-60 seconds | Model already in memory |
| **After 15-30 min idle** | 2-5 minutes | HF may unload models |
| **Concurrent requests (3+)** | 5+ minutes | Queue waiting time |

### 2. **Request Queueing** 🚦

Your `RequestQueueManager` configuration:
- **Max concurrent**: 2 requests
- **Queue size**: 10 requests
- **Queue timeout**: 5 minutes

**What happens:**
```
Request 1 & 2: Processing immediately
Request 3-12: Waiting in queue (up to 5 min)
Request 13+: Rejected (queue full)
```

### 3. **Lazy Model Loading** 🐌

Your code uses lazy loading:
```python
self.model_loader = unified_model_manager.get_model(
    self.current_model_name,
    self.current_model_type,
    lazy=True  # Model loads on first use
)
```

**Impact:**
- First request to a model: +30s-2min loading time
- Model may be unloaded after inactivity
- Next request: reload penalty

### 4. **HF Spaces Infrastructure** 🏗️

- **Shared resources**: Performance varies with HF load
- **Container restarts**: HF may restart your Space periodically
- **Network latency**: Models are downloaded again if not cached
- **Memory pressure**: Concurrent requests compete for GPU memory

---

## 🛠️ Solutions to Improve Consistency

### **Solution 1: Eager Model Loading** (Recommended)

Preload models at startup instead of loading them lazily:

**File**: `services/ai-service/src/ai_med_extract/app.py`

Add this to your app initialization:

```python
def initialize_agents(app, preload_small_models=False):
    """Initialize all agents with eager model loading for HF Spaces"""
    import os

    is_hf_spaces = os.getenv("HF_SPACES", "false").lower() == "true"

    # For HF Spaces, preload the primary model
    if is_hf_spaces:
        logger.info("🚀 HF Spaces detected - preloading primary model...")
        from .utils.unified_model_manager import unified_model_manager

        # Preload the GGUF model (your primary model)
        try:
            primary_model = unified_model_manager.get_model(
                "microsoft/Phi-3-mini-4k-instruct-gguf",
                "gguf",
                lazy=False  # EAGER loading
            )
            logger.info("✅ Primary model preloaded successfully")
        except Exception as e:
            logger.error(f"❌ Failed to preload primary model: {e}")

    # ... rest of initialization
```

### **Solution 2: Model Keep-Alive Mechanism**

Prevent models from being unloaded:

**Create**: `services/ai-service/src/ai_med_extract/utils/model_keepalive.py`

```python
"""
Model Keep-Alive Service
Prevents models from being unloaded due to inactivity
"""
import asyncio
import logging
from datetime import datetime

logger = logging.getLogger(__name__)


class ModelKeepAlive:
    def __init__(self, model_manager, interval_seconds=300):
        """
        Args:
            model_manager: Unified model manager instance
            interval_seconds: Ping interval (default: 5 minutes)
        """
        self.model_manager = model_manager
        self.interval = interval_seconds
        self.running = False
        self.task = None

    async def ping_models(self):
        """Send keep-alive pings to loaded models"""
        while self.running:
            try:
                # Get list of loaded models
                loaded_models = self.model_manager.list_loaded_models()

                for model_info in loaded_models:
                    model_name = model_info.get('name')
                    logger.debug(f"Keep-alive ping: {model_name}")

                    # Update last_used timestamp
                    model = self.model_manager.get_model(
                        model_name,
                        model_info.get('type'),
                        lazy=True  # Don't reload if already loaded
                    )
                    if model:
                        model.last_used = datetime.now().timestamp()

                logger.info(f"✅ Keep-alive ping sent to {len(loaded_models)} models")
            except Exception as e:
                logger.error(f"❌ Keep-alive error: {e}")

            await asyncio.sleep(self.interval)

    def start(self):
        """Start the keep-alive service"""
        # asyncio.create_task() needs a running event loop, so call start()
        # from async code (for example, an application startup handler).
        if not self.running:
            self.running = True
            self.task = asyncio.create_task(self.ping_models())
            logger.info(f"🚀 Model keep-alive started (interval: {self.interval}s)")

    def stop(self):
        """Stop the keep-alive service"""
        if self.running:
            self.running = False
            if self.task:
                self.task.cancel()
            logger.info("🛑 Model keep-alive stopped")


# Global instance
_keepalive_service = None


def get_keepalive_service(model_manager):
    """Get or create the global keep-alive service"""
    global _keepalive_service
    if _keepalive_service is None:
        _keepalive_service = ModelKeepAlive(model_manager)
    return _keepalive_service
```

### **Solution 3: Increase Concurrent Request Limit**

If you have enough GPU memory, increase the number of concurrent requests:

**File**: `services/ai-service/src/ai_med_extract/services/request_queue.py`

```python
# For T4 Medium with 16GB GPU RAM
RequestQueueManager(
    max_concurrent=3,    # Increased from 2
    max_queue_size=15,   # Increased from 10
    queue_timeout=300    # Keep at 5 minutes
)
```

**⚠️ Warning**: Monitor GPU memory usage. If you get OOM errors, reduce `max_concurrent` back to 2.

### **Solution 4: Add Health Check Endpoint with Model Warmup**

Keep your Space warm with periodic health checks:

**File**: `services/ai-service/src/ai_med_extract/api/routes_fastapi.py`

Add this endpoint:

```python
from datetime import datetime  # needed for the timestamps below


@app.get("/warmup")
async def warmup_models():
    """
    Warmup endpoint to keep models loaded
    Can be called by external monitoring service
    """
    from ..utils.unified_model_manager import unified_model_manager

    try:
        # Get primary model (will load if not already loaded)
        model = unified_model_manager.get_model(
            "microsoft/Phi-3-mini-4k-instruct-gguf",
            "gguf",
            lazy=False
        )

        # Optional: Run a tiny inference to keep GPU warm
        test_prompt = "Test"
        _ = model.generate(test_prompt, max_tokens=10)

        return {
            "status": "warm",
            "timestamp": datetime.now().isoformat(),
            "models_loaded": len(unified_model_manager.list_loaded_models())
        }
    except Exception as e:
        return {
            "status": "cold",
            "error": str(e),
            "timestamp": datetime.now().isoformat()
        }
```

Then use a service like **UptimeRobot** or **Cron-job.org** to ping `/warmup` every 5 minutes.
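If you would rather script the ping yourself than rely on a hosted monitor, a minimal sketch using only the standard library might look like the following. The URL is a placeholder and `ping_warmup` is a hypothetical helper, not part of your codebase; it only assumes that `/warmup` returns the JSON shape shown above. The generous timeout is an assumption that allows for a cold model load.

```python
import json
import urllib.request
from typing import Callable

# Placeholder URL (assumption): replace with your Space's real address.
WARMUP_URL = "https://your-username-your-space.hf.space/warmup"


def ping_warmup(url: str = WARMUP_URL, timeout: float = 120.0,
                opener: Callable = urllib.request.urlopen) -> str:
    """Call the warmup endpoint once and return the status it reports.

    Returns the endpoint's "status" value ("warm" or "cold"), "unknown" if
    the JSON has no such field, or "unreachable" if the request fails or
    the body is not valid JSON.
    """
    try:
        with opener(url, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError subclasses;
        # JSON and decoding errors are ValueError subclasses.
        return "unreachable"
    if not isinstance(payload, dict):
        return "unknown"
    return payload.get("status", "unknown")
```

Calling `ping_warmup()` from a cron job or scheduled CI workflow every 5 minutes has the same effect as the hosted monitors above. The `opener` parameter exists only so the function can be exercised without network access.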
### **Solution 5: Optimize Queue Settings for HF Spaces**

**File**: `app.py` (root level)

Add queue configuration:

```python
import os

# Set environment variables for Hugging Face Spaces
os.environ.setdefault("FAST_MODE", "true")
os.environ.setdefault("PRELOAD_SMALL_MODELS", "false")
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

# NEW: Queue optimization
os.environ.setdefault("MAX_CONCURRENT_REQUESTS", "3")  # Increase if stable
os.environ.setdefault("EAGER_MODEL_LOADING", "true")   # Disable lazy loading
os.environ.setdefault("MODEL_KEEPALIVE", "true")       # Enable keep-alive
```

---

## 📈 Expected Performance After Optimization

| Metric | Before | After Optimization |
|--------|--------|-------------------|
| **First request (cold)** | 2-5 min | 30-60 sec once startup finishes (model preloaded) |
| **Subsequent requests** | 30-60 sec | 30-60 sec (consistent) |
| **After 15 min idle** | 2-5 min | 30-60 sec (keep-alive prevents unload) |
| **Concurrent requests** | Queue wait | Faster queue processing |

---

## 🔍 Monitoring & Debugging

### Check Model Status

Add this endpoint to monitor model loading:

```python
@app.get("/model-status")
async def model_status():
    """Get current model loading status"""
    from datetime import datetime
    from ..services.request_queue import get_queue_manager
    from ..utils.unified_model_manager import unified_model_manager

    loaded_models = unified_model_manager.list_loaded_models()

    return {
        "loaded_models": loaded_models,
        "total_loaded": len(loaded_models),
        "queue_status": get_queue_manager().get_queue_status(),
        "timestamp": datetime.now().isoformat()
    }
```

### Check Queue Status

```python
@app.get("/queue-status")
async def queue_status():
    """Get current request queue status"""
    from ..services.request_queue import get_queue_manager

    return get_queue_manager().get_queue_status()
```

---

## 🎯 Quick Wins (Implement These First)

1. **Change lazy loading to eager loading** in `patient_summary_agent.py`:
   ```python
   lazy=False  # Instead of lazy=True
   ```

2. **Increase concurrent requests** (if GPU memory allows):
   ```python
   max_concurrent=3  # Instead of 2
   ```

3. **Set up external monitoring** to ping `/warmup` every 5 minutes

4. **Monitor GPU memory** to ensure you're not hitting OOM errors

---

## 🚨 Common Issues

### Issue: "Model failed to load"
- **Cause**: Model not in cache, or GPU out of memory (OOM)
- **Solution**: Check that `preload_models.py` ran during the build

### Issue: "Request queued for 5 minutes"
- **Cause**: Too many concurrent requests
- **Solution**: Increase `max_concurrent` or reduce model inference time

### Issue: "First request very slow"
- **Cause**: Cold start / lazy loading
- **Solution**: Enable eager loading and keep-alive

---

## 📚 Additional Resources

- [HF Spaces Documentation](https://huggingface.co/docs/hub/spaces)
- [T4 GPU Specs](https://www.nvidia.com/en-us/data-center/tesla-t4/)
- [Model Optimization Guide](https://huggingface.co/docs/transformers/perf_train_gpu_one)

---

- **Last Updated**: 2025-11-27
- **Your Current Setup**: T4 Medium (16GB GPU, 16GB RAM)
- **Primary Model**: microsoft/Phi-3-mini-4k-instruct-gguf (Q4 quantized)
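As a back-of-the-envelope check on the "Request queued for 5 minutes" issue, the sketch below estimates how long a queued request waits before it starts. `estimate_queue_wait` is a hypothetical helper, not part of `RequestQueueManager`, and it rests on simplifying assumptions: every slot is busy with a request that has only just started, and every request takes the same fixed time.

```python
import math


def estimate_queue_wait(position: int, max_concurrent: int,
                        seconds_per_request: float) -> float:
    """Rough worst-case seconds before the request at `position` starts.

    `position` is 1-based within the waiting queue. Assumes (illustrative
    only) that all slots are occupied by requests that just began and that
    each request takes exactly `seconds_per_request`.
    """
    if position < 1 or max_concurrent < 1:
        raise ValueError("position and max_concurrent must be >= 1")
    # Queued requests start in waves of `max_concurrent` as slots free up.
    waves = math.ceil(position / max_concurrent)
    return waves * seconds_per_request


# Tenth queued request, 60 s per request (upper end of the warm range):
print(estimate_queue_wait(10, 2, 60))  # 300 -> right at the 5-minute timeout
print(estimate_queue_wait(10, 3, 60))  # 240 -> under the timeout with 3 slots
```

Under these assumptions, the last request in a full 10-slot queue sits right at the 300-second timeout when `max_concurrent=2`, which is why raising it to 3 or shortening inference time helps.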