GGUF Model Troubleshooting Guide for Hugging Face Spaces

Problem Description

Your Hugging Face Space returns 500 errors when the generate_patient_summary API is called with a GGUF model, specifically with these request parameters:

"patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"
"patient_summarizer_model_type": "gguf"

Root Causes Identified

1. Memory Constraints

The Phi-3-mini-4k-instruct q4 GGUF file is ~2.4GB
Hugging Face Spaces have limited memory (CPU Basic hardware: 16GB RAM, CPU Upgrade hardware: 32GB RAM)
Model loading plus inference buffers may exceed the memory actually available to the process
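
For a rough sense of scale, the weights and the KV cache can be added up by hand. The sketch below assumes Phi-3-mini's published architecture (32 layers, hidden size 3072, no grouped-query attention) and a 16-bit KV cache. The function name is illustrative, and the estimate ignores compute buffers and anything else resident in the process.

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, hidden_size: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * n_ctx * hidden_size * bytes_per_value


GIB = 1024 ** 3
weights_bytes = 2.4e9  # q4 file size quoted above
cache = kv_cache_bytes(n_layers=32, n_ctx=4096, hidden_size=3072)
total_gib = (weights_bytes + cache) / GIB
print(f"KV cache: {cache / GIB:.2f} GiB, weights + cache: {total_gib:.2f} GiB")
```

Under these assumptions a single model instance stays below 4 GiB, so an out-of-memory failure would more likely come from other resident models, several worker processes each holding a copy, or temporary buffers during loading.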

2. Model Download Timeouts

Large model downloads can time out in the Spaces environment
Network issues can interrupt model fetching
The download path has insufficient timeout handling
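
One common way to make a flaky download more robust is to retry it with exponential backoff. This is a minimal sketch, not the project's actual code: `retry_with_backoff` and its parameters are hypothetical names, and the `sleep` argument exists only so the wait can be replaced in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(fetch: Callable[[], T], attempts: int = 3,
                       base_delay: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


# Possible usage, assuming huggingface_hub is installed:
#   from huggingface_hub import hf_hub_download
#   path = retry_with_backoff(lambda: hf_hub_download(
#       repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
#       filename="Phi-3-mini-4k-instruct-q4.gguf"))
```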

3. Missing Dependencies

llama-cpp-python compiles llama.cpp at install time and needs a C/C++ compiler and CMake when no prebuilt wheel matches the environment
CPU optimization flags may not be set correctly

Solutions Implemented

1. Enhanced Error Handling

Added comprehensive logging throughout the pipeline
Implemented fallback mechanisms when GGUF fails
Better error messages for debugging

2. Timeout Management

5-minute timeout for model loading
2-minute timeout for text generation
Threading-based timeout (signal-based timeouts only work in the main thread, which request handlers often do not run in)
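
A threading-based timeout can be sketched with the standard library alone. The helper name `run_with_timeout` is hypothetical and the project's real implementation may differ. Note that Python cannot kill the worker thread, so a timed-out call keeps running in the background until it finishes on its own.

```python
import concurrent.futures
import time


def run_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a worker thread; raise TimeoutError if it exceeds timeout_s."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"operation exceeded {timeout_s}s") from None
    finally:
        # Do not block on the worker: after a timeout it is still running.
        executor.shutdown(wait=False)


# Limits from this guide: 300s for model loading, 120s for generation.
# load_model and generate are placeholders for the application's own functions:
#   model = run_with_timeout(load_model, 300)
#   text = run_with_timeout(generate, 120, prompt)
```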

3. Memory Optimization

Reduced context window from 4096 to 4000 tokens
Reduced batch size from 128 to 64
CPU-only mode with optimized thread usage

4. Fallback Pipeline

Template-based response when GGUF fails
Ensures API always returns a response
Maintains API contract even during failures
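
The fallback idea can be illustrated as a wrapper that catches any generation failure and returns a template response of the same shape. The response fields (`summary`, `source`) and function names here are assumptions for illustration; only `patientid` comes from the request format shown in this guide.

```python
import logging

logger = logging.getLogger("gguf")


def fallback_summary(patient_id: str) -> dict:
    """Template response used when the GGUF model cannot produce a summary."""
    return {
        "patientid": patient_id,
        "summary": "Summary is temporarily unavailable. Please retry later.",
        "source": "fallback",
    }


def summarize_with_fallback(patient_id: str, generate) -> dict:
    """Try the model first; on any failure, log it and return the template."""
    try:
        return {
            "patientid": patient_id,
            "summary": generate(patient_id),
            "source": "gguf",
        }
    except Exception:
        logger.exception("GGUF generation failed; using template fallback")
        return fallback_summary(patient_id)
```

Tagging the response with its source lets callers tell a real summary from a template one, which matters if the fallback text should never be shown as a clinical summary.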

Testing Your Fix

Run the Test Script

cd HNTAI
python test_gguf.py

This will test:

Model loading
Basic generation
Full summary generation
Fallback pipeline

Expected Output

✓ Model loaded successfully in X.XXs
✓ Generation successful in X.XXs
✓ Full summary generation successful in X.XXs
🎉 All tests passed! GGUF model is working correctly.

Deployment Steps

1. Update Your Space

git add .
git commit -m "Fix GGUF model 500 errors with enhanced error handling and fallbacks"
git push

2. Monitor Logs

Check your Hugging Face Space logs for:

Model loading times
Memory usage
Error messages
Fallback activations

3. Test the API

curl -X POST "https://your-space.hf.space/generate_patient_summary" \
  -H "Content-Type: application/json" \
  -d '{
    "patientid": "test123",
    "token": "your_token",
    "key": "your_key",
    "patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
    "patient_summarizer_model_type": "gguf"
  }'

Environment Variables

Set these in your Hugging Face Space:

# Memory optimization
GGUF_N_THREADS=2
GGUF_N_BATCH=64

# Cache directories
HF_HOME=/tmp/huggingface
XDG_CACHE_HOME=/tmp
TORCH_HOME=/tmp/torch
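
As an illustration of how the two GGUF variables could feed the loader, the sketch below turns them into keyword arguments for `llama_cpp.Llama`. Whether the application reads them exactly this way is an assumption. The defaults mirror the values in this guide, and `gguf_load_kwargs` is a hypothetical helper name.

```python
import os


def gguf_load_kwargs(env=os.environ) -> dict:
    """Build loader settings from the environment, falling back to this guide's values."""
    return {
        "n_ctx": 4000,                                     # context window from this guide
        "n_batch": int(env.get("GGUF_N_BATCH", "64")),     # prompt batch size
        "n_threads": int(env.get("GGUF_N_THREADS", "2")),  # CPU threads
        "n_gpu_layers": 0,                                 # CPU-only mode
    }


# Possible usage, assuming llama-cpp-python is installed and model_path is known:
#   from llama_cpp import Llama
#   llm = Llama(model_path=model_path, **gguf_load_kwargs())
```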

Alternative Models

If Phi-3-mini-4k-instruct still fails, try smaller models (confirm first that the repository and file exist on the Hub):

Smaller GGUF Models

{
  "patient_summarizer_model_name": "TheBloke/Phi-3-mini-4k-instruct-GGUF/phi-3-mini-4k-instruct-q2_k.gguf",
  "patient_summarizer_model_type": "gguf"
}

Fallback to HuggingFace Models

{
  "patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct",
  "patient_summarizer_model_type": "text-generation"
}

Monitoring and Debugging

1. Check Space Logs

Look for "GGUF" prefixed log messages
Monitor memory usage patterns
Check for timeout errors

2. API Response Codes

200: Success
408: Generation timeout
500: Model loading failure (will use fallback)

3. Performance Metrics

Model loading time: Should be < 5 minutes
Generation time: Should be < 2 minutes
Memory usage: Should stay within Space limits
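
To make these metrics visible in the Space logs, each stage can be wrapped in a small timer that logs with a "GGUF" prefix and warns when the budget is exceeded. This is a sketch under assumed names (`timed`, the `gguf` logger), not the project's logging code.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("gguf")


@contextmanager
def timed(label: str, budget_s: float):
    """Log how long the wrapped block took; warn if it exceeded budget_s."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        level = logging.WARNING if elapsed > budget_s else logging.INFO
        logger.log(level, "GGUF %s took %.2fs (budget %.0fs)",
                   label, elapsed, budget_s)


# Budgets from this guide: 300s for loading, 120s for generation.
# load_model and generate are placeholders for the application's own functions:
#   with timed("model load", 300):
#       model = load_model()
#   with timed("generation", 120):
#       text = generate(prompt)
```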

Common Issues and Solutions

Issue: "Model download failed"

Solution: Check network connectivity and model availability

Issue: "Failed to initialize GGUF model"

Solution: Verify llama-cpp-python installation and system dependencies

Issue: "Generation timed out"

Solution: Reduce max_tokens or use a smaller model

Issue: "Out of memory"

Solution: Use a smaller model variant (q2_k instead of q4)

Support

If issues persist:

Run test_gguf.py and share output
Check Hugging Face Space logs
Verify model availability in the Hub
Consider upgrading the Space hardware (for example, CPU Upgrade) for more resources

Expected Behavior After Fix

❌ Before: 500 errors after 5 minutes

✅ After:

Successful model loading with detailed logging
Graceful fallback if model fails
Proper timeout handling
Always returns a response (either real or fallback)