# GGUF Model Troubleshooting Guide for Hugging Face Spaces
## Problem Description
Your Hugging Face Space returns HTTP 500 errors when the `generate_patient_summary` API is called with a GGUF model, specifically with these request fields:
- `"patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"`
- `"patient_summarizer_model_type": "gguf"`
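The model name above combines a Hub repository ID and a file name in one string. As a minimal sketch, assuming the service splits it as `<org>/<repo>/<file>.gguf` (the helper name `split_gguf_model_name` is hypothetical and not taken from the HNTAI code), the two parts could be separated like this:

```python
def split_gguf_model_name(model_name: str) -> tuple:
    """Split '<org>/<repo>/<file>.gguf' into (repo_id, filename).

    Hypothetical helper: the real service may parse the name differently.
    """
    parts = model_name.split("/", 2)
    if len(parts) != 3 or not parts[2].lower().endswith(".gguf"):
        raise ValueError(f"expected '<org>/<repo>/<file>.gguf', got {model_name!r}")
    org, repo, filename = parts
    return f"{org}/{repo}", filename


repo_id, filename = split_gguf_model_name(
    "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"
)
# The pair could then be handed to a downloader such as
# huggingface_hub.hf_hub_download(repo_id=repo_id, filename=filename).
```

A malformed name fails early with a `ValueError` instead of surfacing later as an opaque download error.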
## Root Causes Identified
### 1. **Memory Constraints**
- The Phi-3-mini-4k-instruct q4 GGUF file is ~2.4GB
- Hugging Face Spaces have limited memory (Basic: 16GB RAM, Pro: 32GB RAM)
- Model loading plus inference (weights, context cache, and the rest of the application) may exceed the available memory
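One way to catch this before loading is a rough pre-flight check. The sketch below is an illustration only: the function names and the 1.5x headroom factor are assumptions, and inside a container `os.sysconf` may report the host's memory, not the Space's limit.

```python
import os
from typing import Optional


def available_memory_bytes() -> Optional[int]:
    """Best-effort available physical memory (Linux); None if it cannot be read."""
    try:
        return os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (ValueError, OSError, AttributeError):
        return None


def model_fits(model_bytes: int, available_bytes: int, headroom: float = 1.5) -> bool:
    """Return True if the model file plus an assumed headroom fits in memory.

    The headroom factor is a guess that stands in for the context cache and
    application overhead; tune it from observed usage.
    """
    return model_bytes * headroom <= available_bytes


PHI3_Q4_BYTES = int(2.4e9)  # ~2.4GB, as stated above
free = available_memory_bytes()
if free is not None and not model_fits(PHI3_Q4_BYTES, free):
    print("GGUF: not enough free memory, consider a smaller quantization")
```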
### 2. **Model Download Timeouts**
- Large model downloads can time out in the Spaces environment
- Network issues during model fetching
- Insufficient timeout handling
### 3. **Missing Dependencies**
- `llama-cpp-python` requires specific system libraries
- CPU optimization flags may not be set correctly
## Solutions Implemented
### 1. **Enhanced Error Handling**
- Added comprehensive logging throughout the pipeline
- Implemented fallback mechanisms when GGUF fails
- Better error messages for debugging
### 2. **Timeout Management**
- 5-minute timeout for model loading
- 2-minute timeout for text generation
- Threading-based timeout (more reliable than signals)
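The thread-based timeout described above can be sketched with the standard library. This is not the project's actual implementation; `run_with_timeout` and `GGUFTimeoutError` are hypothetical names. Note that a Python thread cannot be killed, so a timed-out call keeps running in the background until it finishes.

```python
import concurrent.futures

LOAD_TIMEOUT_S = 300      # 5 minutes for model loading
GENERATE_TIMEOUT_S = 120  # 2 minutes for text generation


class GGUFTimeoutError(Exception):
    """Raised when a GGUF operation exceeds its time budget (hypothetical name)."""


def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread and wait at most timeout_s seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise GGUFTimeoutError(f"{fn.__name__} exceeded {timeout_s}s") from None
    finally:
        # Do not block on the worker; it may still be running after a timeout.
        executor.shutdown(wait=False)
```

Unlike `signal.alarm`, this approach also works when the call is made from a non-main thread, which is the usual situation inside a web server request handler.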
### 3. **Memory Optimization**
- Reduced context window from 4096 to 4000 tokens
- Reduced batch size from 128 to 64
- CPU-only mode with optimized thread usage
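As an illustration, these settings map onto constructor arguments of `llama_cpp.Llama`. The helper name `gguf_load_kwargs` and the model path are assumptions, and the thread count of 2 is taken from the `GGUF_N_THREADS` value suggested later in this guide.

```python
def gguf_load_kwargs(model_path, n_ctx=4000, n_batch=64, n_threads=2):
    """Build keyword arguments for llama_cpp.Llama (hypothetical helper)."""
    return {
        "model_path": model_path,
        "n_ctx": n_ctx,          # context window, reduced from 4096
        "n_batch": n_batch,      # prompt batch size, reduced from 128
        "n_threads": n_threads,  # CPU threads used for inference
        "n_gpu_layers": 0,       # CPU-only: offload no layers to a GPU
        "verbose": False,
    }


kwargs = gguf_load_kwargs("/tmp/model.gguf")  # placeholder path
# With llama-cpp-python installed, the model would be loaded as:
#   from llama_cpp import Llama
#   llm = Llama(**kwargs)
```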
### 4. **Fallback Pipeline**
- Template-based response when GGUF fails
- Ensures API always returns a response
- Maintains API contract even during failures
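A minimal sketch of such a fallback wrapper is shown below. The response shape, the template text, and the name `summarize_with_fallback` are assumptions made for illustration; the real pipeline may return different fields.

```python
import logging

logger = logging.getLogger("gguf")

# Hypothetical template; the real fallback text may differ.
FALLBACK_TEMPLATE = (
    "Summary unavailable for patient {patient_id}: "
    "the model could not generate a response."
)


def summarize_with_fallback(generate, patient_id, prompt):
    """Call generate(prompt); on any failure return a template-based response."""
    try:
        return {"summary": generate(prompt), "fallback": False}
    except Exception as exc:  # broad on purpose: the API must always answer
        logger.warning("GGUF generation failed, using fallback: %s", exc)
        return {
            "summary": FALLBACK_TEMPLATE.format(patient_id=patient_id),
            "fallback": True,
        }
```

Including a `fallback` flag in the response lets callers and log monitors tell a real summary from a template one.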
## Testing Your Fix
### Run the Test Script
```bash
cd HNTAI
python test_gguf.py
```
This will test:
- Model loading
- Basic generation
- Full summary generation
- Fallback pipeline
### Expected Output
```
✓ Model loaded successfully in X.XXs
✓ Generation successful in X.XXs
✓ Full summary generation successful in X.XXs
🎉 All tests passed! GGUF model is working correctly.
```
## Deployment Steps
### 1. **Update Your Space**
```bash
git add .
git commit -m "Fix GGUF model 500 errors with enhanced error handling and fallbacks"
git push
```
### 2. **Monitor Logs**
Check your Hugging Face Space logs for:
- Model loading times
- Memory usage
- Error messages
- Fallback activations
### 3. **Test the API**
```bash
curl -X POST "https://your-space.hf.space/generate_patient_summary" \
-H "Content-Type: application/json" \
-d '{
"patientid": "test123",
"token": "your_token",
"key": "your_key",
"patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
"patient_summarizer_model_type": "gguf"
}'
```
## Environment Variables
Set these in your Hugging Face Space:
```bash
# Memory optimization
GGUF_N_THREADS=2
GGUF_N_BATCH=64
# Cache directories
HF_HOME=/tmp/huggingface
XDG_CACHE_HOME=/tmp
TORCH_HOME=/tmp/torch
```
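On the Python side, these variables might be read with defaults and basic validation so that a missing or malformed value does not break startup. The helper `env_int` is a hypothetical name, and the defaults simply mirror the values above.

```python
import os


def env_int(name, default, minimum=1):
    """Read an integer environment variable, falling back to default if unset or invalid."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return max(minimum, value)


N_THREADS = env_int("GGUF_N_THREADS", 2)
N_BATCH = env_int("GGUF_N_BATCH", 64)
```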
## Alternative Models
If Phi-3-mini-4k-instruct still fails, try smaller models:
### Smaller GGUF Models
```json
{
"patient_summarizer_model_name": "TheBloke/Phi-3-mini-4k-instruct-GGUF/phi-3-mini-4k-instruct-q2_k.gguf",
"patient_summarizer_model_type": "gguf"
}
```
### Fallback to HuggingFace Models
```json
{
"patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct",
"patient_summarizer_model_type": "text-generation"
}
```
## Monitoring and Debugging
### 1. **Check Space Logs**
- Look for "GGUF" prefixed log messages
- Monitor memory usage patterns
- Check for timeout errors
### 2. **API Response Codes**
- `200`: Success
- `408`: Generation timeout
- `500`: Model loading failure (will use fallback)
### 3. **Performance Metrics**
- Model loading time: Should be < 5 minutes
- Generation time: Should be < 2 minutes
- Memory usage: Should stay within Space limits
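To compare actual timings against these budgets, a small timing helper can log each stage with a "GGUF" prefix. This is a sketch under assumptions: the `timed` context manager and the logger name are hypothetical, not part of the existing code.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("gguf")


@contextmanager
def timed(label, budget_s):
    """Measure a block of code and log whether it stayed within budget_s seconds."""
    stats = {"label": label, "elapsed_s": None, "within_budget": None}
    start = time.perf_counter()
    try:
        yield stats
    finally:
        elapsed = time.perf_counter() - start
        stats["elapsed_s"] = elapsed
        stats["within_budget"] = elapsed <= budget_s
        level = logging.INFO if stats["within_budget"] else logging.WARNING
        logger.log(level, "GGUF %s took %.2fs (budget %.0fs)", label, elapsed, budget_s)


with timed("model load", 300) as load_stats:
    pass  # placeholder for the actual model loading call
```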
## Common Issues and Solutions
### Issue: "Model download failed"
**Solution**: Check network connectivity and model availability
### Issue: "Failed to initialize GGUF model"
**Solution**: Verify the `llama-cpp-python` installation and its system dependencies
### Issue: "Generation timed out"
**Solution**: Reduce `max_tokens` or use a smaller model
### Issue: "Out of memory"
**Solution**: Use a smaller model variant (`q2_k` instead of `q4`)
## Support
If issues persist:
1. Run `test_gguf.py` and share the output
2. Check the Hugging Face Space logs
3. Verify that the model is available on the Hub
4. Consider upgrading to Pro tier for more resources
## Expected Behavior After Fix
✅ **Before**: 500 errors after 5 minutes
✅ **After**:
- Successful model loading with detailed logging
- Graceful fallback if model fails
- Proper timeout handling
- Always returns a response (either real or fallback)