| # GGUF Model Troubleshooting Guide for Hugging Face Spaces | |
| ## Problem Description | |
| Your Hugging Face Space returns 500 errors when the `generate_patient_summary` API is called with a GGUF model, specifically with: | |
| - `"patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"` | |
| - `"patient_summarizer_model_type": "gguf"` | |
| ## Root Causes Identified | |
| ### 1. **Memory Constraints** | |
| - The Phi-3-mini-4k-instruct q4 GGUF file is ~2.4 GB | |
| - Hugging Face Spaces have limited memory (free CPU basic hardware: 16 GB RAM; CPU upgrade: 32 GB RAM) | |
| - Model loading plus inference may exceed the available memory | |
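To put numbers on the memory concern, the following is a rough back-of-the-envelope sketch. It assumes an fp16 KV cache and the architecture figures of the published Phi-3-mini configuration (32 layers, hidden size 3072); both are assumptions, not values taken from this project, so check them against the model card. The function names are hypothetical.

```python
# Rough memory estimate: quantized weights plus an fp16 KV cache.
# n_layers and hidden_size are ASSUMED from the published Phi-3-mini
# configuration; verify them before relying on the result.
GIB = 1024 ** 3


def kv_cache_bytes(n_ctx, n_layers=32, hidden_size=3072, bytes_per_value=2):
    """Bytes needed to hold keys and values for n_ctx tokens."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * hidden_size * bytes_per_value * n_ctx


def estimate_total_gib(model_file_gib, n_ctx):
    """Model file size plus KV cache, in GiB (a lower bound)."""
    return model_file_gib + kv_cache_bytes(n_ctx) / GIB


print(f"{estimate_total_gib(2.4, 4096):.2f} GiB")  # 3.90 GiB
```

This is a lower bound: compute buffers, the Python process itself, and any other models held in the same Space come on top of it.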
| ### 2. **Model Download Timeouts** | |
| - Large model downloads can time out in the Spaces environment | |
| - Network issues during model fetching | |
| - Insufficient timeout handling | |
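One common way to soften transient download failures is to retry with exponential backoff. The sketch below is illustrative only: `retry` and `split_model_name` are hypothetical helpers, and splitting the model name at the last `/` is an assumption about how the application parses `patient_summarizer_model_name`.

```python
import time


def retry(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or attempts run out; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


def split_model_name(name):
    """Split 'org/repo/file.gguf' into ('org/repo', 'file.gguf') (assumed format)."""
    repo_id, _, filename = name.rpartition("/")
    return repo_id, filename


# Possible usage (requires the huggingface_hub package and network access):
# from huggingface_hub import hf_hub_download
# repo_id, filename = split_model_name(
#     "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf")
# path = retry(lambda: hf_hub_download(repo_id=repo_id, filename=filename))
```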
| ### 3. **Missing Dependencies** | |
| - `llama-cpp-python` requires specific system libraries | |
| - CPU optimization flags may not be set correctly | |
| ## Solutions Implemented | |
| ### 1. **Enhanced Error Handling** | |
| - Added comprehensive logging throughout the pipeline | |
| - Implemented fallback mechanisms when GGUF fails | |
| - Better error messages for debugging | |
| ### 2. **Timeout Management** | |
| - 5-minute timeout for model loading | |
| - 2-minute timeout for text generation | |
| - Thread-based timeouts (more reliable than `signal` alarms, which Python only allows in the main thread) | |
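A thread-based timeout of the kind listed above could look roughly like this. It is a minimal sketch, not the project's actual code; `run_with_timeout` is a hypothetical name. Note that Python cannot kill a running thread, so a timed-out call keeps running in the background until it finishes on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn in a worker thread; raise TimeoutError if it takes too long."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        raise TimeoutError(f"{fn.__name__} exceeded {timeout_s}s") from None
    finally:
        # Do not block on the worker; it may still be busy after a timeout.
        executor.shutdown(wait=False)


# With the limits from this guide, calls might look like:
# llm = run_with_timeout(load_model, 300)          # load_model is hypothetical
# text = run_with_timeout(generate, 120, prompt)   # generate is hypothetical
```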
| ### 3. **Memory Optimization** | |
| - Reduced context window from 4096 to 4000 tokens | |
| - Reduced batch size from 128 to 64 | |
| - CPU-only mode with optimized thread usage | |
| ### 4. **Fallback Pipeline** | |
| - Template-based response when GGUF fails | |
| - Ensures API always returns a response | |
| - Maintains API contract even during failures | |
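The fallback idea can be sketched as a wrapper that catches any generation failure and returns a template response with the same shape. The template wording, the `fallback` flag, and the function names are illustrative assumptions, not the project's actual response schema.

```python
import logging

logger = logging.getLogger("gguf")

# Hypothetical template; the real fallback text is defined by the application.
FALLBACK_TEMPLATE = (
    "Summary unavailable for patient {patient_id}: the model could not "
    "produce a response. {n_records} record(s) were received."
)


def summarize_with_fallback(generate, patient_id, records):
    """Return a model summary, or a template response if generation fails."""
    try:
        return {"summary": generate(records), "fallback": False}
    except Exception:
        # Log the full traceback so the failure is visible in the Space logs.
        logger.exception("GGUF generation failed; using template fallback")
        summary = FALLBACK_TEMPLATE.format(
            patient_id=patient_id, n_records=len(records)
        )
        return {"summary": summary, "fallback": True}
```

Flagging fallback responses explicitly lets API clients tell a real summary apart from a placeholder.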
| ## Testing Your Fix | |
| ### Run the Test Script | |
| ```bash | |
| cd HNTAI | |
| python test_gguf.py | |
| ``` | |
| This will test: | |
| - Model loading | |
| - Basic generation | |
| - Full summary generation | |
| - Fallback pipeline | |
| ### Expected Output | |
| ``` | |
| ✅ Model loaded successfully in X.XXs | |
| ✅ Generation successful in X.XXs | |
| ✅ Full summary generation successful in X.XXs | |
| 🎉 All tests passed! GGUF model is working correctly. | |
| ``` | |
| ## Deployment Steps | |
| ### 1. **Update Your Space** | |
| ```bash | |
| git add . | |
| git commit -m "Fix GGUF model 500 errors with enhanced error handling and fallbacks" | |
| git push | |
| ``` | |
| ### 2. **Monitor Logs** | |
| Check your Hugging Face Space logs for: | |
| - Model loading times | |
| - Memory usage | |
| - Error messages | |
| - Fallback activations | |
| ### 3. **Test the API** | |
| ```bash | |
| curl -X POST "https://your-space.hf.space/generate_patient_summary" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "patientid": "test123", | |
| "token": "your_token", | |
| "key": "your_key", | |
| "patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf", | |
| "patient_summarizer_model_type": "gguf" | |
| }' | |
| ``` | |
| ## Environment Variables | |
| Set these in your Hugging Face Space: | |
| ```bash | |
| # Memory optimization | |
| GGUF_N_THREADS=2 | |
| GGUF_N_BATCH=64 | |
| # Cache directories | |
| HF_HOME=/tmp/huggingface | |
| XDG_CACHE_HOME=/tmp | |
| TORCH_HOME=/tmp/torch | |
| ``` | |
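As an illustration of how these variables might feed the loader, the sketch below reads them with the defaults shown above and builds keyword arguments for `llama_cpp.Llama`. The helper name `gguf_settings` is hypothetical, and the 4000-token context is taken from the Memory Optimization section; how the application actually wires these values is not shown in this guide.

```python
import os


def gguf_settings(env=os.environ):
    """Build llama-cpp-python keyword arguments from environment variables."""
    return {
        "n_ctx": 4000,  # context window stated in this guide
        "n_threads": int(env.get("GGUF_N_THREADS", "2")),
        "n_batch": int(env.get("GGUF_N_BATCH", "64")),
        "n_gpu_layers": 0,  # CPU-only mode
    }


# Possible usage (requires llama-cpp-python and a downloaded model file):
# from llama_cpp import Llama
# llm = Llama(model_path="/path/to/model.gguf", **gguf_settings())
```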
| ## Alternative Models | |
| If Phi-3-mini-4k-instruct still fails, try a smaller model. Confirm that the repository and file exist on the Hub before switching: | |
| ### Smaller GGUF Models | |
| ```json | |
| { | |
| "patient_summarizer_model_name": "TheBloke/Phi-3-mini-4k-instruct-GGUF/phi-3-mini-4k-instruct-q2_k.gguf", | |
| "patient_summarizer_model_type": "gguf" | |
| } | |
| ``` | |
| ### Fallback to HuggingFace Models | |
| ```json | |
| { | |
| "patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct", | |
| "patient_summarizer_model_type": "text-generation" | |
| } | |
| ``` | |
| ## Monitoring and Debugging | |
| ### 1. **Check Space Logs** | |
| - Look for "GGUF" prefixed log messages | |
| - Monitor memory usage patterns | |
| - Check for timeout errors | |
| ### 2. **API Response Codes** | |
| - `200`: Success | |
| - `408`: Generation timeout | |
| - `500`: Model loading failure (will use fallback) | |
| ### 3. **Performance Metrics** | |
| - Model loading time: Should be < 5 minutes | |
| - Generation time: Should be < 2 minutes | |
| - Memory usage: Should stay within Space limits | |
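One way to collect these metrics is a small context manager that logs elapsed time and peak memory with a "GGUF" prefix. This is a sketch under assumptions: `timed` is a hypothetical helper, and `ru_maxrss` is reported in KiB on Linux (the Spaces runtime) but in bytes on macOS.

```python
import logging
import resource
import time
from contextlib import contextmanager

logger = logging.getLogger("gguf")


@contextmanager
def timed(stage, limit_s):
    """Log how long a stage took and the process's peak resident memory."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        # ru_maxrss is in KiB on Linux; convert to MiB.
        peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
        # Warn when the stage exceeds its budget, otherwise log at INFO.
        level = logging.WARNING if elapsed > limit_s else logging.INFO
        logger.log(
            level,
            "GGUF %s took %.2fs (limit %ss), peak RSS %.0f MiB",
            stage, elapsed, limit_s, peak_mib,
        )


# Example with the limits from this guide:
# with timed("model loading", 300):
#     ...  # load the model here
```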
| ## Common Issues and Solutions | |
| ### Issue: "Model download failed" | |
| **Solution**: Check network connectivity and model availability | |
| ### Issue: "Failed to initialize GGUF model" | |
| **Solution**: Verify the `llama-cpp-python` installation and its system dependencies | |
| ### Issue: "Generation timed out" | |
| **Solution**: Reduce `max_tokens` or use a smaller model | |
| ### Issue: "Out of memory" | |
| **Solution**: Use a smaller model variant (q2_k instead of q4) | |
| ## Support | |
| If issues persist: | |
| 1. Run `test_gguf.py` and share output | |
| 2. Check Hugging Face Space logs | |
| 3. Verify model availability in the Hub | |
| 4. Consider upgrading the Space hardware for more resources | |
| ## Expected Behavior After Fix | |
| ❌ **Before**: 500 errors after 5 minutes | |
| ✅ **After**: | |
| - Successful model loading with detailed logging | |
| - Graceful fallback if model fails | |
| - Proper timeout handling | |
| - Always returns a response (either real or fallback) | |