Enhance GGUF model loading and generation process with improved progress updates and logging. Updated job status messages to include visual indicators for different stages of model loading and text generation. Streamlined the use of extended streaming for all requests to prevent timeout issues, ensuring a more responsive user experience.
8a71d89 Comprehensive Streaming Fix - 20 Second Timeout Issue
Problem Summary
The streaming was stopping at 20 seconds because:
- Detection Issue: The system wasn't properly detecting GGUF mode
- Generator Issue: The system was using the regular sse_generator instead of the extended one
- Timeout Issue: 20-second HTTP/2 protocol timeout on Hugging Face Spaces
Complete Solution Implemented
1. Universal Extended Streaming
# ALWAYS use extended streaming to prevent 20-second timeout issues
print("Using extended streaming generator for ALL requests to prevent timeout issues")
return StreamingResponse(
    sse_generator_extended(job_id),  # Use extended generator for ALL cases
    media_type="text/event-stream",
    headers={...}
)
2. Enhanced GGUF Detection
# Now checks multiple fields for GGUF detection
is_gguf_mode = (
    data.get('generation_mode') == 'gguf'
    or data.get('patient_summarizer_model_type') == 'gguf'
    # "or ''" guards against a model name that is present but null
    or 'gguf' in (data.get('patient_summarizer_model_name') or '').lower()
)
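The detection rule above can be pulled into a small standalone helper so it is easy to test in isolation. This is only a sketch: the function name `is_gguf_request` is hypothetical, and lowercasing all three fields (so that "GGUF" also matches) is an assumption that goes slightly beyond the exact comparison shown above.

```python
from typing import Any, Mapping


def is_gguf_request(data: Mapping[str, Any]) -> bool:
    """Return True if any of the three request fields points at a GGUF model."""
    # `or ""` covers keys that are missing or explicitly set to None (JSON null).
    mode = str(data.get("generation_mode") or "").lower()
    model_type = str(data.get("patient_summarizer_model_type") or "").lower()
    model_name = str(data.get("patient_summarizer_model_name") or "").lower()
    # Assumption: matching is case-insensitive; the model name only needs to
    # contain "gguf" somewhere (repository name or file extension).
    return mode == "gguf" or model_type == "gguf" or "gguf" in model_name
```

Checking the model name as a fallback means a request that omits the model type but names a `.gguf` file should still be routed to the GGUF path.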
3. Extended Timeout Configuration
# Extended timeout for GGUF operations
max_wait_time = 600 # 10 minutes for GGUF operations
heartbeat_interval = 5 # Every 5 seconds
4. Detailed Progress Updates
Model Loading Progress:
GGUF Model Loading: Downloading model from microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf
✅ GGUF Model Loading: Model downloaded successfully
GGUF Model Loading: Initializing with context=4096, threads=2, gpu_layers=-1
✅ GGUF Model Loading Complete: Model loaded in 19.40s (GPU layers=-1)
Generation Progress:
GGUF Model Loading: Initializing model pipeline...
GGUF Model Loading: Downloading model files...
GGUF Model Ready: Starting text generation...
GGUF Generation: Starting text generation (max_tokens=8192)
✅ GGUF Generation Complete: Generated 1500 words in 45.2s
✅ GGUF Generation Complete: Processing generated summary...
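One way status lines like these could be produced is to wrap the slow step in a helper that records a message before and after it runs. The sketch below is illustrative only: `JOB_MESSAGES`, `report`, and `load_with_progress` are hypothetical names, the in-memory dictionary stands in for whatever job store the application really uses, and the `loader` callable is a placeholder for the actual download and initialization code.

```python
import time
from typing import Callable, Dict, List

# Hypothetical in-memory job store: job_id -> list of status messages.
JOB_MESSAGES: Dict[str, List[str]] = {}


def report(job_id: str, message: str) -> None:
    """Record a status message for the job and echo it to the server log."""
    JOB_MESSAGES.setdefault(job_id, []).append(message)
    print(f"[{job_id}] {message}")


def load_with_progress(job_id: str, repo_file: str,
                       loader: Callable[[str], object]) -> object:
    """Run a model loader and publish a status message around it."""
    report(job_id, f"GGUF Model Loading: Downloading model from {repo_file}")
    start = time.perf_counter()
    model = loader(repo_file)  # placeholder for the real download + init
    elapsed = time.perf_counter() - start
    report(job_id, f"GGUF Model Loading Complete: Model loaded in {elapsed:.2f}s")
    return model
```

A streaming generator can then read the latest entry for a job and forward it to the client as a progress event.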
5. Enhanced SSE Generator
def sse_generator_extended(job_id):
    max_wait_time = 600  # 10 minutes for GGUF operations
    heartbeat_interval = 5  # Every 5 seconds
    # Enhanced logging and progress updates
    ...
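The body of the generator is not shown here, so the following is a minimal sketch of how a heartbeat-emitting SSE generator with these two settings might look, not the actual implementation. The `JOBS` dictionary, its `status`/`message` keys, the JSON payload shape, and the `poll_interval`, `clock`, and `sleep` parameters are all assumptions introduced for illustration; the injectable clock exists only so the loop can be exercised without real waiting.

```python
import json
import time
from typing import Callable, Dict, Iterator

# Hypothetical in-memory job store; the real application's storage is not shown.
JOBS: Dict[str, dict] = {}


def sse_event(payload: dict) -> str:
    """Format one Server-Sent Events frame: a `data:` line plus a blank line."""
    return f"data: {json.dumps(payload)}\n\n"


def sse_generator_sketch(
    job_id: str,
    max_wait_time: float = 600,     # 10 minutes for GGUF operations
    heartbeat_interval: float = 5,  # send something at least every 5 seconds
    poll_interval: float = 0.5,     # assumed polling cadence
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    start = clock()
    last_sent = start
    last_message = None
    while clock() - start < max_wait_time:
        job = JOBS.get(job_id, {})
        message = job.get("message")
        if job.get("status") in ("done", "error"):
            # Terminal state: deliver the final event and stop streaming.
            yield sse_event({"status": job["status"], "message": message})
            return
        now = clock()
        if message is not None and message != last_message:
            # Forward each new progress message exactly once.
            yield sse_event({"status": "progress", "message": message})
            last_message, last_sent = message, now
        elif now - last_sent >= heartbeat_interval:
            # Nothing new to report: keep the connection active.
            yield sse_event({"status": "heartbeat",
                             "message": "GGUF model operation in progress..."})
            last_sent = now
        sleep(poll_interval)
    yield sse_event({"status": "error", "message": f"Timed out after {max_wait_time}s"})
```

Because a frame is written at least every `heartbeat_interval` seconds, the connection never sits idle long enough to hit a 20-second idle limit, while `max_wait_time` still bounds the total duration of the stream.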
Expected Behavior Now
Timeline for 5-Minute GGUF Generation:
0:00 - Request starts
0:01 - "Using extended streaming generator for ALL requests"
0:02 - "GGUF Model Loading: Downloading model from microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"
0:05 - "✅ GGUF Model Loading: Model downloaded successfully"
0:10 - "GGUF Model Loading: Initializing with context=4096, threads=2, gpu_layers=-1"
0:20 - "✅ GGUF Model Loading Complete: Model loaded in 19.40s (GPU layers=-1)"
0:21 - "GGUF Model Ready: Starting text generation..."
0:22 - "GGUF Generation: Starting text generation (max_tokens=8192)"
0:25 - Heartbeat: "GGUF model operation in progress..."
0:30 - Heartbeat: "GGUF model operation in progress..."
...
4:55 - Heartbeat: "GGUF model operation in progress..."
5:00 - "✅ GGUF Generation Complete: Generated 1500 words in 45.2s"
5:01 - "✅ GGUF Generation Complete: Processing generated summary..."
5:02 - Final result delivered
Key Benefits
✅ No More 20-Second Timeout
- Extended 10-minute timeout instead of 20 seconds
- Universal extended streaming for all requests
- Proper detection of GGUF mode
✅ Detailed Progress Updates
- Every step of model loading is tracked
- Generation progress is monitored
- Heartbeat every 5 seconds during long operations
✅ Better User Experience
- Continuous feedback throughout the process
- Clear status messages for each step
- No more silent timeouts
✅ Robust Error Handling
- Proper timeout management
- Clear error messages
- Graceful degradation
Testing
The fix should now work with your exact request format:
{
    "mode": "stream",
    "patientid": 5635,
    "patient_summarizer_model_type": "gguf",
    "patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"
}
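When testing from a client, the response arrives as a stream of text lines in Server-Sent Events framing. The helper below is a rough sketch of how those lines could be decoded; the name `iter_sse_payloads` is hypothetical, and it assumes each event carries a JSON object in its `data:` field, which is not confirmed by anything shown here.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each SSE event in a stream of lines."""
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # Per the SSE format, a single leading space after the colon is dropped.
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line ends the event; multi-line data is joined with "\n".
            yield json.loads("\n".join(data_parts))
            data_parts = []
        # Comment lines (starting with ":") and other fields are ignored here.
```

Feeding this the decoded lines of the streaming response from any HTTP client makes it straightforward to confirm that events keep arriving past the 20-second mark.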
Debug Output
The system now logs:
- "Using extended streaming generator for ALL requests to prevent timeout issues"
- "✅ GGUF mode detected - using extended streaming approach"
- Detailed progress updates for every step
- Heartbeat messages every 5 seconds
This ensures you can monitor the entire process and track progress throughout the GGUF model loading and generation.