
GGUF Model Troubleshooting Guide for Hugging Face Spaces

Problem Description

Your Hugging Face Space returns HTTP 500 errors when the generate_patient_summary API is called with a GGUF model, specifically with these request parameters:

  • "patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"
  • "patient_summarizer_model_type": "gguf"

Root Causes Identified

1. Memory Constraints

  • The Phi-3-mini-4k-instruct q4 GGUF file is ~2.4 GB
  • Hugging Face Spaces have limited memory (CPU basic hardware: 16 GB RAM; CPU upgrade hardware: 32 GB RAM)
  • Model loading + inference may exceed available memory
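
A rough pre-flight check can make the memory argument concrete. The sketch below is illustrative only: the function name and the 1.5x overhead factor (headroom for the context cache and the rest of the application) are assumptions, not measured values from this project.

```python
GIB = 1024 ** 3


def fits_in_memory(model_bytes: int, limit_bytes: int,
                   overhead_factor: float = 1.5) -> bool:
    """Return True if the model file, scaled by an assumed overhead
    factor, fits within the given memory limit."""
    return model_bytes * overhead_factor <= limit_bytes


model_size = int(2.4 * GIB)  # approximate size of the q4 GGUF file

print(fits_in_memory(model_size, 16 * GIB))  # generous limit
print(fits_in_memory(model_size, 2 * GIB))   # tight limit
```

If the check fails for the memory actually available to the process, a smaller quantization is the first thing to try.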

2. Model Download Timeouts

  • Large model downloads can time out in the Spaces environment
  • Network issues during model fetching
  • Insufficient timeout handling
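
One way to make downloads more robust is to retry with exponential backoff. The sketch below assumes the model name has the form owner/repo/filename, as in the request parameters above. The helper names (split_model_name, download_with_retries, fetch_gguf) are hypothetical and not taken from the project; hf_hub_download is the real download function from huggingface_hub and is imported lazily so the rest of the example runs without it.

```python
import time


def split_model_name(model_name: str) -> tuple[str, str]:
    """Split 'owner/repo/file.gguf' into ('owner/repo', 'file.gguf')."""
    parts = model_name.split("/")
    if len(parts) < 3:
        raise ValueError(f"expected 'owner/repo/filename', got {model_name!r}")
    return "/".join(parts[:2]), "/".join(parts[2:])


def download_with_retries(download, repo_id, filename,
                          attempts=3, base_delay_s=2.0):
    """Call download(repo_id, filename), retrying with exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return download(repo_id, filename)
        except Exception as exc:  # network errors, missing files, etc.
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(base_delay_s * (2 ** attempt))
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error


def fetch_gguf(model_name: str) -> str:
    """Download the GGUF file from the Hub and return its local path."""
    from huggingface_hub import hf_hub_download  # lazy import

    repo_id, filename = split_model_name(model_name)
    return download_with_retries(
        lambda r, f: hf_hub_download(repo_id=r, filename=f), repo_id, filename
    )
```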

3. Missing Dependencies

  • llama-cpp-python requires specific system libraries
  • CPU optimization flags may not be set correctly

Solutions Implemented

1. Enhanced Error Handling

  • Added comprehensive logging throughout the pipeline
  • Implemented fallback mechanisms when GGUF fails
  • Better error messages for debugging

2. Timeout Management

  • 5-minute timeout for model loading
  • 2-minute timeout for text generation
  • Threading-based timeout (signal-based timeouts only work in the main thread, so they are unreliable inside web server workers)
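
A threading-based timeout of this kind can be sketched as follows. This is a minimal illustration, not the project's actual code: run_with_timeout and OperationTimeout are hypothetical names. Note that Python cannot kill a thread, so after a timeout the worker keeps running in the background; the caller simply stops waiting for it.

```python
import threading
import time


class OperationTimeout(Exception):
    """Raised when the wrapped call does not finish in time."""


def run_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func(*args, **kwargs) in a daemon thread and wait at most timeout_s."""
    result = {}

    def target():
        try:
            result["value"] = func(*args, **kwargs)
        except BaseException as exc:  # hand the error back to the caller
            result["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        raise OperationTimeout(f"operation exceeded {timeout_s} seconds")
    if "error" in result:
        raise result["error"]
    return result["value"]


def slow_load():
    """Stand-in for a slow model load."""
    time.sleep(1.0)
    return "model"


try:
    run_with_timeout(slow_load, 0.05)
except OperationTimeout as exc:
    print(f"timed out: {exc}")
```

With the limits listed above, model loading would be wrapped with timeout_s=300 and text generation with timeout_s=120.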

3. Memory Optimization

  • Reduced context window from 4096 to 4000 tokens
  • Reduced batch size from 128 to 64
  • CPU-only mode with optimized thread usage
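
Expressed as llama-cpp-python constructor arguments, the settings above might look like the sketch below. The helper names are hypothetical and the thread cap of 2 is an assumption borrowed from the environment variable defaults in this guide; model_path, n_ctx, n_batch, n_threads, n_gpu_layers and verbose are parameters of llama_cpp.Llama.

```python
import os


def build_llama_kwargs(model_path: str) -> dict:
    """Collect conservative, CPU-only settings for llama_cpp.Llama."""
    return {
        "model_path": model_path,
        "n_ctx": 4000,        # context window, reduced from 4096
        "n_batch": 64,        # prompt batch size, reduced from 128
        "n_threads": min(2, os.cpu_count() or 1),  # assumed cap of 2 threads
        "n_gpu_layers": 0,    # CPU-only: offload no layers to a GPU
        "verbose": False,
    }


def load_model(model_path: str):
    """Create the model; llama_cpp is imported lazily so that building
    the settings does not require the library to be installed."""
    from llama_cpp import Llama

    return Llama(**build_llama_kwargs(model_path))
```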

4. Fallback Pipeline

  • Template-based response when GGUF fails
  • Ensures API always returns a response
  • Maintains API contract even during failures
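
The fallback idea can be sketched as a wrapper around the generation call. Everything here is illustrative: the function name, the template wording and the response fields ("summary", "fallback") are assumptions, not the project's actual API contract.

```python
import logging

logger = logging.getLogger("gguf")

# Hypothetical template text; the real fallback wording is not shown in this guide.
FALLBACK_TEMPLATE = (
    "Summary unavailable for patient {patient_id}: the model could not "
    "produce a result. Please retry later."
)


def summarize_with_fallback(generate, patient_id: str, prompt: str) -> dict:
    """Try the model first; on any failure, return a template-based response
    so the caller always receives the same response shape."""
    try:
        return {"summary": generate(prompt), "fallback": False}
    except Exception:
        logger.exception("GGUF generation failed; using template fallback")
        return {
            "summary": FALLBACK_TEMPLATE.format(patient_id=patient_id),
            "fallback": True,
        }
```

Including a flag such as "fallback" in the response lets clients and log monitoring distinguish real summaries from template output.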

Testing Your Fix

Run the Test Script

cd HNTAI
python test_gguf.py

This will test:

  • Model loading
  • Basic generation
  • Full summary generation
  • Fallback pipeline

Expected Output

✓ Model loaded successfully in X.XXs
✓ Generation successful in X.XXs
✓ Full summary generation successful in X.XXs
🎉 All tests passed! GGUF model is working correctly.

Deployment Steps

1. Update Your Space

git add .
git commit -m "Fix GGUF model 500 errors with enhanced error handling and fallbacks"
git push

2. Monitor Logs

Check your Hugging Face Space logs for:

  • Model loading times
  • Memory usage
  • Error messages
  • Fallback activations

3. Test the API

curl -X POST "https://your-space.hf.space/generate_patient_summary" \
  -H "Content-Type: application/json" \
  -d '{
    "patientid": "test123",
    "token": "your_token",
    "key": "your_key",
    "patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
    "patient_summarizer_model_type": "gguf"
  }'
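
The same request can be issued from Python using only the standard library. This is a sketch under assumptions: the helper names are hypothetical, the placeholder URL and credentials match the curl example, and the response body is assumed to be JSON. The 600-second client timeout is chosen to exceed the 5-minute load plus 2-minute generation limits described earlier.

```python
import json
import urllib.request

MODEL_NAME = (
    "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"
)


def build_summary_request(base_url: str, patient_id: str,
                          token: str, key: str) -> urllib.request.Request:
    """Build the POST request with the same JSON body as the curl example."""
    payload = {
        "patientid": patient_id,
        "token": token,
        "key": key,
        "patient_summarizer_model_name": MODEL_NAME,
        "patient_summarizer_model_type": "gguf",
    }
    return urllib.request.Request(
        url=f"{base_url}/generate_patient_summary",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 600.0):
    """Send the request and return (status, parsed JSON body).
    urlopen raises urllib.error.HTTPError for 4xx/5xx responses."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status, json.loads(response.read().decode("utf-8"))


request = build_summary_request(
    "https://your-space.hf.space", "test123", "your_token", "your_key"
)
print(request.get_method(), request.full_url)
```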

Environment Variables

Set these in your Hugging Face Space:

# Memory optimization
GGUF_N_THREADS=2
GGUF_N_BATCH=64

# Cache directories
HF_HOME=/tmp/huggingface
XDG_CACHE_HOME=/tmp
TORCH_HOME=/tmp/torch
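
On the application side, the two GGUF settings could be read with safe defaults, as in the sketch below. The helper env_int is hypothetical; falling back to the default on missing, non-numeric or non-positive values is a design assumption so that a typo in the Space settings does not crash startup.

```python
import os


def env_int(name: str, default: int) -> int:
    """Read a positive integer from the environment, else return default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


n_threads = env_int("GGUF_N_THREADS", 2)
n_batch = env_int("GGUF_N_BATCH", 64)
print(f"threads={n_threads} batch={n_batch}")
```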

Alternative Models

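If Phi-3-mini-4k-instruct still fails, try a smaller model or a more aggressive quantization. Confirm that the repository and file name exist on the Hub before deploying: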

Smaller GGUF Models

{
  "patient_summarizer_model_name": "TheBloke/Phi-3-mini-4k-instruct-GGUF/phi-3-mini-4k-instruct-q2_k.gguf",
  "patient_summarizer_model_type": "gguf"
}

Fallback to HuggingFace Models

{
  "patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct",
  "patient_summarizer_model_type": "text-generation"
}

Monitoring and Debugging

1. Check Space Logs

  • Look for "GGUF" prefixed log messages
  • Monitor memory usage patterns
  • Check for timeout errors

2. API Response Codes

  • 200: Success
  • 408: Generation timeout
  • 500: Model loading failure (will use fallback)

3. Performance Metrics

  • Model loading time: Should be < 5 minutes
  • Generation time: Should be < 2 minutes
  • Memory usage: Should stay within Space limits

Common Issues and Solutions

Issue: "Model download failed"

Solution: Check network connectivity and model availability

Issue: "Failed to initialize GGUF model"

Solution: Verify llama-cpp-python installation and system dependencies

Issue: "Generation timed out"

Solution: Reduce max_tokens or use a smaller model

Issue: "Out of memory"

Solution: Use a smaller model variant (q2_k instead of q4)

Support

If issues persist:

  1. Run test_gguf.py and share output
  2. Check Hugging Face Space logs
  3. Verify model availability in the Hub
  4. Consider upgrading the Space hardware for more resources

Expected Behavior After Fix

Before: 500 errors after 5 minutes

✅ After:

  • Successful model loading with detailed logging
  • Graceful fallback if model fails
  • Proper timeout handling
  • Always returns a response (either real or fallback)