# GGUF Model Troubleshooting Guide for Hugging Face Spaces

## Problem Description
Your Hugging Face Space returns HTTP 500 errors when the `generate_patient_summary` API is called with a GGUF model, specifically with:
- `"patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"`
- `"patient_summarizer_model_type": "gguf"`
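The model name above appears to combine a Hub repository ID and a file name in one string. A minimal sketch of splitting it, assuming an `owner/repo/filename.gguf` layout; `split_gguf_model_name` is a hypothetical helper for illustration, not a function from the application:

```python
def split_gguf_model_name(name: str) -> tuple[str, str]:
    """Split 'owner/repo/file.gguf' into ('owner/repo', 'file.gguf').

    Assumes the first two path segments form the Hub repository ID and
    everything after them is the file path inside that repository.
    """
    parts = name.split("/", 2)
    if len(parts) != 3 or not parts[2].endswith(".gguf"):
        raise ValueError(f"expected 'owner/repo/file.gguf', got {name!r}")
    owner, repo, filename = parts
    return f"{owner}/{repo}", filename


repo_id, filename = split_gguf_model_name(
    "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"
)
print(repo_id)   # repository ID to look up on the Hub
print(filename)  # file to download from that repository
```

Checking that both parts resolve on the Hub before deploying rules out a malformed name as the cause of a 500.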

## Root Causes Identified

### 1. **Memory Constraints**
- The Phi-3-mini-4k-instruct q4 GGUF file is ~2.4GB
- Hugging Face Spaces have limited memory (CPU Basic: 16GB RAM, CPU Upgrade: 32GB RAM)
- Model weights plus inference buffers (such as the KV cache) may exceed the memory actually available to the process
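A rough estimate shows how loading plus inference adds up. The sketch below assumes Phi-3-mini uses 32 layers with a 3072-wide hidden state, standard multi-head attention (no grouped-query sharing), and a 16-bit KV cache; these shape values are assumptions, and compute buffers and the Python process itself are not counted:

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_embd: int, bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * n_ctx * n_embd * bytes_per_value


GIB = 1024 ** 3
MODEL_FILE_GIB = 2.4   # file size quoted in this guide, treated as GiB
N_LAYERS = 32          # assumed Phi-3-mini layer count
N_EMBD = 3072          # assumed Phi-3-mini hidden size

for n_ctx in (4096, 4000):
    cache_gib = kv_cache_bytes(N_LAYERS, n_ctx, N_EMBD) / GIB
    total_gib = MODEL_FILE_GIB + cache_gib
    print(f"n_ctx={n_ctx}: KV cache ~{cache_gib:.2f} GiB, weights + cache ~{total_gib:.2f} GiB")
```

If the estimate is well under the Space's RAM, look at other processes, other loaded models, or duplicate model instances before blaming the GGUF file alone.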

### 2. **Model Download Timeouts**
- Large model downloads can time out in the Spaces environment
- Network issues during model fetching
- Insufficient timeout handling
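One common way to cope with flaky downloads is to retry with exponential backoff. The sketch below is a generic helper under that assumption, not the guide's actual implementation; `retry_with_backoff` and its parameters are hypothetical names:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to `attempts` times, doubling the wait after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose: network errors vary by library
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error


# Possible usage with huggingface_hub (requires network access):
#   from huggingface_hub import hf_hub_download
#   path = retry_with_backoff(lambda: hf_hub_download(
#       repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
#       filename="Phi-3-mini-4k-instruct-q4.gguf"))
```

Passing the download as a callable keeps the helper independent of any one download library.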

### 3. **Missing Dependencies**
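- `llama-cpp-python` compiles llama.cpp at install time unless a prebuilt wheel is available, so it needs a C/C++ compiler and CMake in the build environment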
- CPU optimization flags may not be set correctly

## Solutions Implemented

### 1. **Enhanced Error Handling**
- Added comprehensive logging throughout the pipeline
- Implemented fallback mechanisms when GGUF fails
- Better error messages for debugging

### 2. **Timeout Management**
- 5-minute timeout for model loading
- 2-minute timeout for text generation
- Threading-based timeout (more reliable than signals, since Python signal handlers such as `signal.alarm` only work in the main thread)
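A threading-based timeout of this kind can be sketched with `concurrent.futures`. This is a minimal illustration, not the guide's actual code; `run_with_timeout` and the two constants are hypothetical names that mirror the limits listed above:

```python
import concurrent.futures

LOAD_TIMEOUT_S = 300        # 5-minute limit for model loading
GENERATION_TIMEOUT_S = 120  # 2-minute limit for text generation


def run_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a worker thread and stop waiting after timeout_s seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"call exceeded {timeout_s}s") from None
    finally:
        # Do not block on the worker; it may still be running.
        executor.shutdown(wait=False)


# Possible usage (load_model and generate are placeholders for your own functions):
#   model = run_with_timeout(load_model, LOAD_TIMEOUT_S, model_path)
#   text = run_with_timeout(generate, GENERATION_TIMEOUT_S, model, prompt)
```

Python cannot forcibly stop a thread, so a timed-out call keeps running in the background and keeps its memory until it finishes. The timeout only frees the request handler to respond.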

### 3. **Memory Optimization**
- Reduced context window from 4096 to 4000 tokens
- Reduced batch size from 128 to 64
- CPU-only mode with optimized thread usage

### 4. **Fallback Pipeline**
- Template-based response when GGUF fails
- Ensures API always returns a response
- Maintains API contract even during failures
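The fallback idea can be sketched as a wrapper that catches any generation failure and returns a templated response in the same shape. The response fields, the template text, and `summarize_with_fallback` are assumptions for illustration, not the application's real contract:

```python
import logging

logger = logging.getLogger("gguf")

# Hypothetical template; the real fallback text is defined by the application.
FALLBACK_TEMPLATE = (
    "Summary unavailable for patient {patient_id}: "
    "the model could not generate a response ({reason})."
)


def summarize_with_fallback(generate, patient_id: str) -> dict:
    """Return the model summary, or a templated one if generation raises."""
    try:
        return {"summary": generate(patient_id), "fallback": False}
    except Exception as exc:
        logger.exception("GGUF generation failed, using template fallback")
        summary = FALLBACK_TEMPLATE.format(
            patient_id=patient_id, reason=type(exc).__name__
        )
        return {"summary": summary, "fallback": True}
```

Including a `fallback` flag in the response lets callers tell a real summary from a templated one without parsing the text.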

## Testing Your Fix

### Run the Test Script
```bash
cd HNTAI
python test_gguf.py
```

This will test:
- Model loading
- Basic generation
- Full summary generation
- Fallback pipeline

### Expected Output
```
✓ Model loaded successfully in X.XXs
✓ Generation successful in X.XXs
✓ Full summary generation successful in X.XXs
🎉 All tests passed! GGUF model is working correctly.
```

## Deployment Steps

### 1. **Update Your Space**
```bash
git add .
git commit -m "Fix GGUF model 500 errors with enhanced error handling and fallbacks"
git push
```

### 2. **Monitor Logs**
Check your Hugging Face Space logs for:
- Model loading times
- Memory usage
- Error messages
- Fallback activations

### 3. **Test the API**
```bash
curl -X POST "https://your-space.hf.space/generate_patient_summary" \
  -H "Content-Type: application/json" \
  -d '{
    "patientid": "test123",
    "token": "your_token",
    "key": "your_key",
    "patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
    "patient_summarizer_model_type": "gguf"
  }'
```

## Environment Variables

Set these in your Hugging Face Space:

```bash
# Memory optimization
GGUF_N_THREADS=2
GGUF_N_BATCH=64

# Cache directories
HF_HOME=/tmp/huggingface
XDG_CACHE_HOME=/tmp
TORCH_HOME=/tmp/torch
```
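How the application reads these variables is not shown in this guide. A minimal sketch, assuming they are parsed as integers and passed to the `llama_cpp.Llama` constructor; `llama_kwargs_from_env` is a hypothetical helper, and the defaults mirror the values above:

```python
import os


def llama_kwargs_from_env(env=os.environ) -> dict:
    """Build keyword arguments for llama_cpp.Llama from the Space's environment."""
    return {
        "n_ctx": 4000,  # reduced context window described in this guide
        "n_threads": int(env.get("GGUF_N_THREADS", "2")),
        "n_batch": int(env.get("GGUF_N_BATCH", "64")),
        "n_gpu_layers": 0,  # CPU-only mode
    }


print(llama_kwargs_from_env())

# Possible usage (requires llama-cpp-python and a downloaded model file):
#   from llama_cpp import Llama
#   llm = Llama(model_path=model_path, **llama_kwargs_from_env())
```

Logging the resolved values at startup makes it easy to confirm in the Space logs that the variables were picked up.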

## Alternative Models

If Phi-3-mini-4k-instruct still fails, try a smaller model. Confirm that the repository and file exist on the Hub before using either configuration below:

### Smaller GGUF Models
```json
{
  "patient_summarizer_model_name": "TheBloke/Phi-3-mini-4k-instruct-GGUF/phi-3-mini-4k-instruct-q2_k.gguf",
  "patient_summarizer_model_type": "gguf"
}
```

### Fall Back to Hugging Face Models
```json
{
  "patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct",
  "patient_summarizer_model_type": "text-generation"
}
```

## Monitoring and Debugging

### 1. **Check Space Logs**
- Look for "GGUF" prefixed log messages
- Monitor memory usage patterns
- Check for timeout errors

### 2. **API Response Codes**
- `200`: Success
- `408`: Generation timeout
- `500`: Model loading failure (will use fallback)

### 3. **Performance Metrics**
- Model loading time: Should be < 5 minutes
- Generation time: Should be < 2 minutes
- Memory usage: Should stay within Space limits
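To make these timings visible in the logs, each step can be wrapped in a small timer. The sketch below is one possible approach; the `timed` helper and the exact log format are assumptions, chosen to match the "GGUF" log prefix mentioned above:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("gguf")


@contextmanager
def timed(step: str, limit_s: float):
    """Log how long a step took; warn when it exceeds limit_s seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        level = logging.WARNING if elapsed > limit_s else logging.INFO
        logger.log(level, "GGUF %s took %.2fs (limit %.0fs)", step, elapsed, limit_s)


# Possible usage (load_model and generate are placeholders for your own functions):
#   with timed("model load", 300):
#       model = load_model(path)
#   with timed("generation", 120):
#       text = generate(model, prompt)
```

Because the log call sits in a `finally` block, the duration is recorded even when the wrapped step raises.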

## Common Issues and Solutions

### Issue: "Model download failed"
**Solution**: Check network connectivity and model availability

### Issue: "Failed to initialize GGUF model"
**Solution**: Verify llama-cpp-python installation and system dependencies

### Issue: "Generation timed out"
**Solution**: Reduce `max_tokens` or use a smaller model

### Issue: "Out of memory"
**Solution**: Use a smaller model variant (for example, q2_k instead of q4)

## Support

If issues persist:
1. Run `test_gguf.py` and share output
2. Check Hugging Face Space logs
3. Verify model availability in the Hub
4. Consider upgrading the Space hardware (for example, to CPU Upgrade) for more resources

## Expected Behavior After Fix

❌ **Before**: 500 errors after 5 minutes
✅ **After**:
- Successful model loading with detailed logging
- Graceful fallback if model fails
- Proper timeout handling
- Always returns a response (either real or fallback)