HNTAI / GGUF_TROUBLESHOOTING.md

sachinchandrankallar

Revert "Merge branch 'FT-DEV-17/09/2025'"

aba0d25 9 months ago

	# GGUF Model Troubleshooting Guide for Hugging Face Spaces

	## Problem Description
	Your Hugging Face Space returns HTTP 500 errors when the `generate_patient_summary` API is called with a GGUF model, specifically with:
	- `"patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"`
	- `"patient_summarizer_model_type": "gguf"`

	## Root Causes Identified

	### 1. Memory Constraints
	- The Phi-3-mini-4k-instruct q4 GGUF file is ~2.4 GB
	- Hugging Face Spaces have limited memory (CPU Basic: 16 GB RAM, CPU Upgrade: 32 GB RAM)
	- Model weights plus the inference context may exceed the memory actually available to the app
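
A rough pre-flight check can make the memory budget concrete. The sketch below is a minimal illustration, not the project's code: `available_memory_bytes`, `fits_in_memory`, and the 1.5x `overhead_factor` are hypothetical. Note that `os.sysconf` reports the host's free memory, which inside a container may be larger than the limit the Space actually enforces.

```python
import os


def available_memory_bytes():
    """Return free physical memory in bytes, or None if the platform does not report it."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        free_pages = os.sysconf("SC_AVPHYS_PAGES")
    except (ValueError, OSError, AttributeError):
        # Not every platform exposes these sysconf names.
        return None
    return page_size * free_pages


def fits_in_memory(model_bytes, free_bytes, overhead_factor=1.5):
    """Check whether the model plus an assumed overhead fits in free memory.

    overhead_factor is a guess covering the context buffers and the rest of
    the app; tune it against real measurements.
    """
    return model_bytes * overhead_factor <= free_bytes


if __name__ == "__main__":
    free = available_memory_bytes()
    if free is None:
        print("Free memory is not reported on this platform")
    else:
        print("2.4 GB model fits:", fits_in_memory(2.4e9, free))
```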

	### 2. Model Download Timeouts
	- Large model downloads can time out in the Spaces environment
	- Network issues can interrupt model fetching
	- Download timeouts were not handled adequately
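
One common way to soften transient download failures is to retry with exponential backoff. The helper below is a generic sketch under that assumption. `retry` and its parameters are hypothetical names and are not taken from the project.

```python
import time


def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are used up.

    The wait doubles after each failure: base_delay, 2 * base_delay, and so on.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:  # a real caller should narrow this
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error
```

If the app fetches the model with `huggingface_hub.hf_hub_download`, the download call could be wrapped as `retry(lambda: hf_hub_download(repo_id=..., filename=...))`. Whether the project downloads models that way is an assumption.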

	### 3. Missing Dependencies
	- `llama-cpp-python` requires specific system libraries and, when built from source, a C/C++ compiler and CMake
	- CPU optimization flags may not be set correctly

	## Solutions Implemented

	### 1. Enhanced Error Handling
	- Added comprehensive logging throughout the pipeline
	- Implemented fallback mechanisms when GGUF fails
	- Better error messages for debugging

	### 2. Timeout Management
	- 5-minute timeout for model loading
	- 2-minute timeout for text generation
	- Threading-based timeout (more reliable than signals, which Python only handles in the main thread)
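
A threading-based timeout can be sketched as follows. This is an illustration of the general technique, not the project's implementation. `run_with_timeout` and `OperationTimeout` are hypothetical names. The worker thread cannot be killed, so after a timeout it keeps running in the background until it finishes on its own.

```python
import threading


class OperationTimeout(Exception):
    """Raised when the wrapped call does not finish in time."""


def run_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Run func in a daemon thread and wait at most timeout_seconds for it."""
    outcome = {}

    def target():
        try:
            outcome["value"] = func(*args, **kwargs)
        except Exception as exc:
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_seconds)

    if worker.is_alive():
        # The thread is still running; we only stop waiting for it.
        raise OperationTimeout(f"no result after {timeout_seconds} seconds")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

With the limits listed above, model loading would be wrapped with `timeout_seconds=300` and text generation with `timeout_seconds=120`.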

	### 3. Memory Optimization
	- Reduced context window from 4096 to 4000 tokens
	- Reduced batch size from 128 to 64
	- CPU-only mode with optimized thread usage

	### 4. Fallback Pipeline
	- Template-based response when GGUF fails
	- Ensures API always returns a response
	- Maintains API contract even during failures
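
The fallback idea can be sketched as a thin wrapper around the generation call. Everything named here is hypothetical: `summarize_with_fallback`, the `FALLBACK_TEMPLATE` wording, and the shape of the returned dictionary. The project's actual response format is not shown in this guide.

```python
import logging

logger = logging.getLogger("gguf")

# Hypothetical template text; the real fallback wording may differ.
FALLBACK_TEMPLATE = (
    "Summary unavailable for patient {patient_id}: "
    "the model could not generate a response."
)


def summarize_with_fallback(generate, patient_id, prompt):
    """Return the model's summary, or a template response if generation fails."""
    try:
        return {"summary": generate(prompt), "fallback": False}
    except Exception:
        # Log the full traceback so the failure is visible in the Space logs.
        logger.exception("GGUF generation failed; using template fallback")
        return {
            "summary": FALLBACK_TEMPLATE.format(patient_id=patient_id),
            "fallback": True,
        }
```

Returning a `fallback` flag lets callers tell a template response apart from a real summary.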

	## Testing Your Fix

	### Run the Test Script
	```bash
	cd HNTAI
	python test_gguf.py
	```

	This will test:
	- Model loading
	- Basic generation
	- Full summary generation
	- Fallback pipeline

	### Expected Output
	```
	✓ Model loaded successfully in X.XXs
	✓ Generation successful in X.XXs
	✓ Full summary generation successful in X.XXs
	🎉 All tests passed! GGUF model is working correctly.
	```

	## Deployment Steps

	### 1. Update Your Space
	```bash
	git add .
	git commit -m "Fix GGUF model 500 errors with enhanced error handling and fallbacks"
	git push
	```

	### 2. Monitor Logs
	Check your Hugging Face Space logs for:
	- Model loading times
	- Memory usage
	- Error messages
	- Fallback activations

	### 3. Test the API
	```bash
	curl -X POST "https://your-space.hf.space/generate_patient_summary" \
	-H "Content-Type: application/json" \
	-d '{
	"patientid": "test123",
	"token": "your_token",
	"key": "your_key",
	"patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
	"patient_summarizer_model_type": "gguf"
	}'
	```
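
The same request can be built from Python with only the standard library. This is a sketch: `build_summary_request` is a hypothetical helper, and the URL, token, and key are the same placeholders used in the `curl` command.

```python
import json
import urllib.request


def build_summary_request(base_url, payload):
    """Build a POST request for the generate_patient_summary endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate_patient_summary",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_summary_request(
        "https://your-space.hf.space",
        {
            "patientid": "test123",
            "token": "your_token",
            "key": "your_key",
            "patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
            "patient_summarizer_model_type": "gguf",
        },
    )
    # Allow more time than the server-side load and generation limits combined.
    with urllib.request.urlopen(request, timeout=600) as response:
        print(response.status, response.read().decode("utf-8"))
```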

	## Environment Variables

	Set these in your Hugging Face Space:

	```bash
	# Memory optimization
	GGUF_N_THREADS=2
	GGUF_N_BATCH=64

	# Cache directories
	HF_HOME=/tmp/huggingface
	XDG_CACHE_HOME=/tmp
	TORCH_HOME=/tmp/torch
	```
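
How the application reads these variables is not shown in this guide. As a sketch, assuming they are plain integers with the defaults listed above, they could be turned into loader arguments like this. `gguf_kwargs_from_env` is a hypothetical helper. The keys match constructor parameters of `llama_cpp.Llama`.

```python
import os


def gguf_kwargs_from_env(model_path, environ=os.environ):
    """Collect loader settings from the environment, with the guide's defaults."""
    return {
        "model_path": model_path,
        "n_ctx": 4000,  # reduced context window
        "n_batch": int(environ.get("GGUF_N_BATCH", "64")),
        "n_threads": int(environ.get("GGUF_N_THREADS", "2")),
        "n_gpu_layers": 0,  # CPU-only mode
    }
```

The result could then be passed along as `Llama(**gguf_kwargs_from_env(path))`, where `path` is the local path of the downloaded `.gguf` file.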

	## Alternative Models

	If Phi-3-mini-4k-instruct still fails, try a smaller model. Confirm that the repository and file exist on the Hub before switching:

	### Smaller GGUF Models
	```json
	{
	"patient_summarizer_model_name": "TheBloke/Phi-3-mini-4k-instruct-GGUF/phi-3-mini-4k-instruct-q2_k.gguf",
	"patient_summarizer_model_type": "gguf"
	}
	```
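
The model names in this guide appear to combine a Hub repository ID (`owner/repo`) with a file name. Assuming that format, a small helper can split the two parts so each can be checked on the Hub. `split_gguf_model_name` is a hypothetical name, and the project's own parsing may differ.

```python
def split_gguf_model_name(model_name):
    """Split 'owner/repo/file.gguf' into ('owner/repo', 'file.gguf')."""
    parts = model_name.split("/")
    if len(parts) < 3 or not parts[-1].endswith(".gguf"):
        raise ValueError(f"expected 'owner/repo/file.gguf', got {model_name!r}")
    repo_id = "/".join(parts[:2])
    filename = "/".join(parts[2:])  # keeps any subfolder inside the repo
    return repo_id, filename
```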

	### Fallback to HuggingFace Models
	```json
	{
	"patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct",
	"patient_summarizer_model_type": "text-generation"
	}
	```

	## Monitoring and Debugging

	### 1. Check Space Logs
	- Look for "GGUF" prefixed log messages
	- Monitor memory usage patterns
	- Check for timeout errors
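
If the logs are saved to a file, a short filter can pull out the relevant lines. This is a sketch with hypothetical helper names and keywords. It matches "GGUF" anywhere in the line because log lines often begin with a timestamp rather than the message prefix.

```python
def gguf_log_lines(lines, marker="GGUF"):
    """Return the log lines that contain the marker, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if marker in line]


def flag_problems(lines, keywords=("timed out", "timeout", "memory", "failed")):
    """Return the lines that mention any problem keyword (case-insensitive)."""
    return [line for line in lines if any(k in line.lower() for k in keywords)]


if __name__ == "__main__":
    # Made-up sample lines; real log formats will differ.
    sample = [
        "GGUF: model loaded\n",
        "INFO: request received\n",
        "GGUF: Generation timed out\n",
    ]
    for line in flag_problems(gguf_log_lines(sample)):
        print(line)
```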

	### 2. API Response Codes
	- `200`: Success
	- `408`: Generation timeout
	- `500`: Model loading failure (will use fallback)

	### 3. Performance Metrics
	- Model loading time: Should be < 5 minutes
	- Generation time: Should be < 2 minutes
	- Memory usage: Should stay within Space limits
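
These time budgets can be checked in code with a small timing helper. The sketch below is hypothetical: `timed`, `BUDGETS`, and the report format are not part of the project. The budget values are the 5-minute and 2-minute limits from this guide.

```python
import time
from contextlib import contextmanager

# Budgets in seconds, taken from the limits in this guide.
BUDGETS = {"model_load": 300.0, "generation": 120.0}


@contextmanager
def timed(stage, report, clock=time.perf_counter):
    """Record how long a stage took and whether it stayed within its budget."""
    start = clock()
    try:
        yield
    finally:
        elapsed = clock() - start
        report[stage] = {
            "seconds": elapsed,
            "within_budget": elapsed <= BUDGETS[stage],
        }
```

Wrapping the load step in `with timed("model_load", report):` fills `report` with the elapsed time, which can then be written to the logs.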

	## Common Issues and Solutions

	### Issue: "Model download failed"
	Solution: Check network connectivity and model availability

	### Issue: "Failed to initialize GGUF model"
	Solution: Verify the `llama-cpp-python` installation and its system dependencies

	### Issue: "Generation timed out"
	Solution: Reduce `max_tokens` or use a smaller model

	### Issue: "Out of memory"
	Solution: Use a smaller model variant (q2_k instead of q4)

	## Support

	If issues persist:
	1. Run `test_gguf.py` and share output
	2. Check Hugging Face Space logs
	3. Verify model availability in the Hub
	4. Consider upgrading the Space hardware (for example, CPU Upgrade) for more resources

	## Expected Behavior After Fix

	❌ Before: 500 errors after 5 minutes
	✅ After:
	- Successful model loading with detailed logging
	- Graceful fallback if model fails
	- Proper timeout handling
	- Always returns a response (either real or fallback)