GGUF Model Troubleshooting Guide for Hugging Face Spaces
Problem Description
Your Hugging Face Space returns 500 errors when the generate_patient_summary API is called with a GGUF model, specifically with this configuration:
"patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"
"patient_summarizer_model_type": "gguf"
Root Causes Identified
1. Memory Constraints
- Phi-3-mini-4k-instruct model is ~2.4GB
- Hugging Face Spaces have limited memory (CPU Basic: 16GB RAM, CPU Upgrade: 32GB RAM)
- Model loading + inference may exceed available memory
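To make the memory constraint concrete, the sketch below compares a rough footprint estimate against available RAM. The 1.5x overhead factor, the 25% reserve, and the function names are assumptions chosen for illustration, not measured values; real usage depends on context size and runtime buffers.

```python
import os
from typing import Optional

GIB = 1024 ** 3


def estimate_footprint_bytes(model_file_bytes: int, overhead_factor: float = 1.5) -> int:
    """Rough guess: weights plus KV cache and buffers (the factor is an assumption)."""
    return int(model_file_bytes * overhead_factor)


def total_ram_bytes() -> Optional[int]:
    """Total physical RAM via sysconf (Linux); None where unsupported.

    Note: inside a container this may report host RAM rather than the
    container's own memory limit.
    """
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (ValueError, OSError, AttributeError):
        return None


def fits_in_memory(model_file_bytes: int, ram_bytes: int,
                   overhead_factor: float = 1.5,
                   reserve_fraction: float = 0.25) -> bool:
    """True if the estimated footprint fits after reserving RAM for the rest of the app."""
    needed = estimate_footprint_bytes(model_file_bytes, overhead_factor)
    usable = ram_bytes * (1 - reserve_fraction)
    return needed <= usable
```

With these assumed factors, a ~2.4GB file fits comfortably in 16GB, so an out-of-memory failure would more likely come from other processes, duplicate model loads, or a tighter container limit than from the weights alone.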
2. Model Download Timeouts
- Large model downloads can time out in the Spaces environment
- Network issues during model fetching
- Insufficient timeout handling
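One way to soften transient download failures is a retry wrapper with exponential backoff. The sketch below is a generic illustration, not the project's actual code: `split_model_name` and `download_with_retry` are hypothetical helpers, and splitting the configured name into repo and file at the second slash is an assumption based on the model name shown earlier.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def split_model_name(name: str) -> Tuple[str, str]:
    """Split 'org/repo/file.gguf' into ('org/repo', 'file.gguf')."""
    parts = name.split("/")
    if len(parts) < 3:
        raise ValueError(f"expected 'org/repo/filename', got {name!r}")
    return "/".join(parts[:2]), "/".join(parts[2:])


def download_with_retry(download: Callable[[], T], attempts: int = 3,
                        base_delay: float = 2.0,
                        sleep: Callable[[float], None] = time.sleep) -> T:
    """Call download() up to `attempts` times, doubling the wait after each failure."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return download()
        except Exception as exc:  # network errors, HTTP errors, and so on
            last_error = exc
            if attempt < attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"Model download failed after {attempts} attempts") from last_error
```

In a Space, `download` could be a lambda wrapping `huggingface_hub.hf_hub_download(repo_id=repo, filename=filename)`, with `repo` and `filename` taken from `split_model_name`.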
3. Missing Dependencies
- llama-cpp-python requires specific system libraries
- CPU optimization flags may not be set correctly
Solutions Implemented
1. Enhanced Error Handling
- Added comprehensive logging throughout the pipeline
- Implemented fallback mechanisms when GGUF fails
- Better error messages for debugging
2. Timeout Management
- 5-minute timeout for model loading
- 2-minute timeout for text generation
- Threading-based timeout (more reliable than signals, which can only be handled in the main thread)
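A threading-based timeout of the kind listed above can be sketched as follows. This is a minimal illustration rather than the project's implementation; `run_with_timeout` is a hypothetical helper, and the two constants mirror the 5-minute and 2-minute limits stated above.

```python
import threading

LOAD_TIMEOUT_SECONDS = 300        # 5-minute limit for model loading
GENERATION_TIMEOUT_SECONDS = 120  # 2-minute limit for text generation


def run_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Run func in a daemon thread; raise TimeoutError if it is still running at the deadline."""
    result = {}

    def target():
        try:
            result["value"] = func(*args, **kwargs)
        except Exception as exc:  # hand the worker's error back to the caller
            result["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_seconds)
    if worker.is_alive():
        # The worker is not killed; it keeps running in the background.
        raise TimeoutError(f"operation exceeded {timeout_seconds}s")
    if "error" in result:
        raise result["error"]
    return result["value"]
```

Because Python cannot forcibly stop a thread, a timed-out load or generation keeps consuming CPU and memory until it finishes, so the caller should switch to the fallback rather than retry immediately.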
3. Memory Optimization
- Reduced context window from 4096 to 4000 tokens
- Reduced batch size from 128 to 64
- CPU-only mode with optimized thread usage
4. Fallback Pipeline
- Template-based response when GGUF fails
- Ensures API always returns a response
- Maintains API contract even during failures
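The fallback behavior described above might look roughly like this. The response fields (`summary`, `fallback`, `reason`) and both function names are hypothetical; the real API contract is not shown in this guide.

```python
import logging

logger = logging.getLogger("gguf")


def fallback_summary(patient_id: str, reason: str) -> dict:
    """Template-based response used when the GGUF model cannot produce a summary."""
    return {
        "patient_id": patient_id,
        "summary": "An automated summary could not be generated. Please review the patient record directly.",
        "fallback": True,
        "reason": reason,
    }


def summarize_with_fallback(generate, patient_id: str, prompt: str) -> dict:
    """Call generate(prompt); on any failure, log it and return the template response."""
    try:
        text = generate(prompt)
    except Exception as exc:  # loading failure, timeout, out of memory, ...
        logger.exception("GGUF generation failed; using fallback")
        return fallback_summary(patient_id, f"{type(exc).__name__}: {exc}")
    return {"patient_id": patient_id, "summary": text, "fallback": False, "reason": None}
```

Marking fallback responses explicitly lets clients tell a real model summary from the template without parsing the text.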
Testing Your Fix
Run the Test Script
cd HNTAI
python test_gguf.py
This will test:
- Model loading
- Basic generation
- Full summary generation
- Fallback pipeline
Expected Output
✅ Model loaded successfully in X.XXs
✅ Generation successful in X.XXs
✅ Full summary generation successful in X.XXs
🎉 All tests passed! GGUF model is working correctly.
Deployment Steps
1. Update Your Space
git add .
git commit -m "Fix GGUF model 500 errors with enhanced error handling and fallbacks"
git push
2. Monitor Logs
Check your Hugging Face Space logs for:
- Model loading times
- Memory usage
- Error messages
- Fallback activations
3. Test the API
curl -X POST "https://your-space.hf.space/generate_patient_summary" \
-H "Content-Type: application/json" \
-d '{
"patientid": "test123",
"token": "your_token",
"key": "your_key",
"patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
"patient_summarizer_model_type": "gguf"
}'
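The same request can be built from Python using only the standard library. This is a sketch: `build_summary_request` is a hypothetical helper, the URL and credentials are the placeholders from the curl command, and the 450-second client timeout is an assumption chosen to exceed the combined 5-minute load and 2-minute generation limits.

```python
import json
import urllib.request


def build_summary_request(base_url, patient_id, token, key, model_name, model_type="gguf"):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "patientid": patient_id,
        "token": token,
        "key": key,
        "patient_summarizer_model_name": model_name,
        "patient_summarizer_model_type": model_type,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate_patient_summary",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (placeholders as in the curl example; performs a real network call):
#   req = build_summary_request(
#       "https://your-space.hf.space", "test123", "your_token", "your_key",
#       "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
#   )
#   with urllib.request.urlopen(req, timeout=450) as resp:
#       print(resp.status, resp.read().decode("utf-8"))
```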
Environment Variables
Set these in your Hugging Face Space:
# Memory optimization
GGUF_N_THREADS=2
GGUF_N_BATCH=64
# Cache directories
HF_HOME=/tmp/huggingface
XDG_CACHE_HOME=/tmp
TORCH_HOME=/tmp/torch
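How the application consumes these variables is not shown here; the sketch below is one plausible way to turn them into loader settings. `read_int` and `gguf_load_kwargs` are hypothetical helpers, and the defaults mirror the values in this guide (2 threads, batch size 64, 4000-token context, CPU only).

```python
import os


def read_int(env, name: str, default: int) -> int:
    """Read a positive integer from env; fall back to default if unset or invalid."""
    raw = env.get(name, "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def gguf_load_kwargs(model_path: str, environ=None) -> dict:
    """Build keyword arguments for llama_cpp.Llama from the environment."""
    env = os.environ if environ is None else environ
    return {
        "model_path": model_path,
        "n_ctx": 4000,                                   # reduced context window
        "n_batch": read_int(env, "GGUF_N_BATCH", 64),    # reduced batch size
        "n_threads": read_int(env, "GGUF_N_THREADS", 2),
        "n_gpu_layers": 0,                               # CPU-only mode
    }


# Usage, assuming llama-cpp-python is installed and the file exists:
#   from llama_cpp import Llama
#   llm = Llama(**gguf_load_kwargs("/tmp/model.gguf"))
```

Falling back to defaults on a malformed value keeps a typo in the Space settings from becoming another source of 500 errors.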
Alternative Models
If Phi-3-mini-4k-instruct still fails, try smaller models:
Smaller GGUF Models
{
"patient_summarizer_model_name": "TheBloke/Phi-3-mini-4k-instruct-GGUF/phi-3-mini-4k-instruct-q2_k.gguf",
"patient_summarizer_model_type": "gguf"
}
Fallback to HuggingFace Models
{
"patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct",
"patient_summarizer_model_type": "text-generation"
}
Monitoring and Debugging
1. Check Space Logs
- Look for "GGUF" prefixed log messages
- Monitor memory usage patterns
- Check for timeout errors
2. API Response Codes
- 200: Success
- 408: Generation timeout
- 500: Model loading failure (will use fallback)
3. Performance Metrics
- Model loading time: Should be < 5 minutes
- Generation time: Should be < 2 minutes
- Memory usage: Should stay within Space limits
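To track the timing targets above in the logs, each stage can be wrapped in a small timer. This is an illustrative sketch; the `timed` helper and the log message format are assumptions, chosen so that messages carry the "GGUF" prefix mentioned earlier.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("gguf")


@contextmanager
def timed(stage: str, limit_seconds: float, clock=time.monotonic):
    """Measure a stage and log a warning if it exceeds its time budget."""
    start = clock()
    record = {"stage": stage, "elapsed": None, "over_limit": None}
    try:
        yield record
    finally:
        record["elapsed"] = clock() - start
        record["over_limit"] = record["elapsed"] > limit_seconds
        level = logging.WARNING if record["over_limit"] else logging.INFO
        logger.log(level, "GGUF %s took %.2fs (limit %.0fs)",
                   stage, record["elapsed"], limit_seconds)


# Usage: wrap each stage with its budget from the list above.
#   with timed("model_load", 300):
#       ...load the model...
#   with timed("generation", 120):
#       ...generate the summary...
```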
Common Issues and Solutions
Issue: "Model download failed"
Solution: Check network connectivity and model availability
Issue: "Failed to initialize GGUF model"
Solution: Verify llama-cpp-python installation and system dependencies
Issue: "Generation timed out"
Solution: Reduce max_tokens or use a smaller model
Issue: "Out of memory"
Solution: Use a smaller model variant (q2_k instead of q4)
Support
If issues persist:
- Run test_gguf.py and share the output
- Check Hugging Face Space logs
- Verify model availability in the Hub
- Consider upgrading the Space hardware for more resources
Expected Behavior After Fix
❌ Before: 500 errors after 5 minutes
✅ After:
- Successful model loading with detailed logging
- Graceful fallback if model fails
- Proper timeout handling
- Always returns a response (either real or fallback)