# GGUF Model Troubleshooting Guide for Hugging Face Spaces

## Problem Description

Your Hugging Face Space is throwing 500 errors when calling the `generate_patient_summary` API with GGUF models, specifically with:

- `"patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"`
- `"patient_summarizer_model_type": "gguf"`

## Root Causes Identified

### 1. **Memory Constraints**

- The Phi-3-mini-4k-instruct model is ~2.4 GB
- Hugging Face Spaces have limited memory (Basic: 16 GB RAM, Pro: 32 GB RAM)
- Model loading plus inference may exceed available memory

### 2. **Model Download Timeouts**

- Large model downloads can time out in the Spaces environment
- Network issues during model fetching
- Insufficient timeout handling

### 3. **Missing Dependencies**

- `llama-cpp-python` requires specific system libraries
- CPU optimization flags may not be set correctly

## Solutions Implemented

### 1. **Enhanced Error Handling**

- Added comprehensive logging throughout the pipeline
- Implemented fallback mechanisms for when GGUF fails
- Better error messages for debugging

### 2. **Timeout Management**

- 5-minute timeout for model loading
- 2-minute timeout for text generation
- Threading-based timeout (more reliable than signals)

### 3. **Memory Optimization**

- Reduced context window from 4096 to 4000 tokens
- Reduced batch size from 128 to 64
- CPU-only mode with optimized thread usage

### 4. **Fallback Pipeline**

- Template-based response when GGUF fails
- Ensures the API always returns a response
- Maintains the API contract even during failures

## Testing Your Fix

### Run the Test Script

```bash
cd HNTAI
python test_gguf.py
```

This will test:

- Model loading
- Basic generation
- Full summary generation
- Fallback pipeline

### Expected Output

```
✓ Model loaded successfully in X.XXs
✓ Generation successful in X.XXs
✓ Full summary generation successful in X.XXs
🎉 All tests passed! GGUF model is working correctly.
```

## Deployment Steps

### 1. **Update Your Space**

```bash
git add .
git commit -m "Fix GGUF model 500 errors with enhanced error handling and fallbacks"
git push
```

### 2. **Monitor Logs**

Check your Hugging Face Space logs for:

- Model loading times
- Memory usage
- Error messages
- Fallback activations

### 3. **Test the API**

```bash
curl -X POST "https://your-space.hf.space/generate_patient_summary" \
  -H "Content-Type: application/json" \
  -d '{
    "patientid": "test123",
    "token": "your_token",
    "key": "your_key",
    "patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
    "patient_summarizer_model_type": "gguf"
  }'
```

## Environment Variables

Set these in your Hugging Face Space:

```bash
# Memory optimization
GGUF_N_THREADS=2
GGUF_N_BATCH=64

# Cache directories
HF_HOME=/tmp/huggingface
XDG_CACHE_HOME=/tmp
TORCH_HOME=/tmp/torch
```

## Alternative Models

If Phi-3-mini-4k-instruct still fails, try smaller models:

### Smaller GGUF Models

```json
{
  "patient_summarizer_model_name": "TheBloke/Phi-3-mini-4k-instruct-GGUF/phi-3-mini-4k-instruct-q2_k.gguf",
  "patient_summarizer_model_type": "gguf"
}
```

### Fallback to Hugging Face Models

```json
{
  "patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct",
  "patient_summarizer_model_type": "text-generation"
}
```

## Monitoring and Debugging

### 1. **Check Space Logs**

- Look for "GGUF"-prefixed log messages
- Monitor memory usage patterns
- Check for timeout errors

### 2. **API Response Codes**

- `200`: Success
- `408`: Generation timeout
- `500`: Model loading failure (will use fallback)

### 3. **Performance Metrics**

- Model loading time: should be < 5 minutes
- Generation time: should be < 2 minutes
- Memory usage: should stay within Space limits

## Common Issues and Solutions

### Issue: "Model download failed"

**Solution**: Check network connectivity and model availability.

### Issue: "Failed to initialize GGUF model"

**Solution**: Verify the `llama-cpp-python` installation and system dependencies.

### Issue: "Generation timed out"

**Solution**: Reduce `max_tokens` or use a smaller model.

### Issue: "Out of memory"

**Solution**: Use a smaller model variant (q2_k instead of q4).

## Support

If issues persist:

1. Run `test_gguf.py` and share the output
2. Check the Hugging Face Space logs
3. Verify model availability on the Hub
4. Consider upgrading to the Pro tier for more resources

## Expected Behavior After Fix

❌ **Before**: 500 errors after 5 minutes

✅ **After**:

- Successful model loading with detailed logging
- Graceful fallback if the model fails
- Proper timeout handling
- Always returns a response (either real or fallback)
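
### Sketch: Threading-Based Timeout with Fallback

The timeout and fallback behavior described in this guide can be illustrated with a minimal Python sketch. It is not the Space's actual code: `run_with_timeout`, `slow_model`, and `template_summary` are hypothetical names introduced here, and the short durations stand in for the 5-minute load and 2-minute generation limits mentioned above. The idea is to run the model call in a daemon thread, wait at most `timeout_s` seconds, and return a template response if the call is still running or has raised.

```python
import threading
import time


def run_with_timeout(fn, timeout_s, fallback):
    """Run fn() in a worker thread; return (value, used_fallback).

    If fn() does not finish within timeout_s seconds, or raises,
    the result of fallback() is returned instead.
    """
    result = {}

    def worker():
        try:
            result["value"] = fn()
        except Exception as exc:  # record the failure for the caller
            result["error"] = exc

    # Daemon thread: an overrunning call will not block process exit.
    # Note that the thread is abandoned, not killed, on timeout.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    if thread.is_alive() or "error" in result:
        return fallback(), True
    return result["value"], False


def slow_model():
    # Hypothetical stand-in for a GGUF generation call that overruns.
    time.sleep(0.2)
    return "real summary"


def template_summary():
    # Hypothetical stand-in for the template-based fallback response.
    return "fallback summary (model unavailable)"


# In the Space this would be something like timeout_s=120 for generation.
text, used_fallback = run_with_timeout(slow_model, 0.05, template_summary)
print(text, used_fallback)
```

One design caveat: a thread-based timeout stops waiting but cannot interrupt the underlying call, so a stuck generation may keep consuming CPU and memory until it returns on its own. If that matters, running the model in a separate process that can be terminated is a possible alternative.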