Space: salvinjose/HNTAI


sachinchandrankallar committed on Nov 25, 2025

Commit cdea66b (1 parent: dd14d00)

Revert "feat: Establish AI medical extraction service with performance optimizations, unified model management, and detailed Hugging Face Spaces deployment guides."

Files changed (35)

Dockerfile.hf-spaces-minimal +1 -1
__pycache__/app.cpython-311.pyc +0 -0
docs/FIXES/PHI3_COMPATIBILITY_FIX.md +0 -257
docs/archive/COMPREHENSIVE_STREAMING_FIX.md +2 -2
docs/archive/patient_summary_models_review.md +5 -5
docs/hf-spaces/FILES_CREATED.md +4 -4
docs/hf-spaces/INDEX.md +2 -2
models_config.json +4 -21
services/ai-service/DEPLOYMENT_FIX.md +4 -4
services/ai-service/Dockerfile.prod +1 -1
services/ai-service/src/__main__.py +1 -1
services/ai-service/src/ai_med_extract/__pycache__/inference_service.cpython-311.pyc +0 -0
services/ai-service/src/ai_med_extract/__pycache__/phi_scrubber_service.cpython-311.pyc +0 -0
services/ai-service/src/ai_med_extract/agents/__pycache__/patient_summary_agent.cpython-311.pyc +0 -0
services/ai-service/src/ai_med_extract/agents/__pycache__/summarizer.cpython-311.pyc +0 -0
services/ai-service/src/ai_med_extract/agents/patient_summary_agent.py +20 -0
services/ai-service/src/ai_med_extract/api/routes_fastapi.py +31 -91
services/ai-service/src/ai_med_extract/app.py +1 -1
services/ai-service/src/ai_med_extract/config/performance_config.py +2 -2
services/ai-service/src/ai_med_extract/enable_optimizations.py +2 -2
services/ai-service/src/ai_med_extract/inference_service.py +1 -1
services/ai-service/src/ai_med_extract/phi_scrubber_service.py +1 -1
services/ai-service/src/ai_med_extract/services/job_manager.py +1 -1
services/ai-service/src/ai_med_extract/services/request_queue.py +3 -3
services/ai-service/src/ai_med_extract/utils/__pycache__/model_config.cpython-311.pyc +0 -0
services/ai-service/src/ai_med_extract/utils/__pycache__/openvino_summarizer_utils.cpython-311.pyc +0 -0
services/ai-service/src/ai_med_extract/utils/__pycache__/performance_monitor.cpython-311.pyc +0 -0
services/ai-service/src/ai_med_extract/utils/constants.py +20 -20
services/ai-service/src/ai_med_extract/utils/hf_spaces_config.py +1 -1
services/ai-service/src/ai_med_extract/utils/model_config.py +7 -12
services/ai-service/src/ai_med_extract/utils/openvino_summarizer_utils.py +1 -1
services/ai-service/src/ai_med_extract/utils/performance_monitor.py +1 -1
services/ai-service/src/ai_med_extract/utils/unified_model_manager.py +26 -358
temp_test_load.py +0 -6
temp_test_load_128k.py +0 -9

Dockerfile.hf-spaces-minimal CHANGED

@@ -48,5 +48,5 @@ HEALTHCHECK --interval=30s --timeout=10s --start-period=30s --retries=3 \
     CMD curl -f http://localhost:7860/health || exit 1
 # Start application with single worker for minimal memory footprint
-CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1", "--timeout-keep-alive", "1200"]

     CMD curl -f http://localhost:7860/health || exit 1
 # Start application with single worker for minimal memory footprint
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1", "--timeout-keep-alive", "600"]

__pycache__/app.cpython-311.pyc CHANGED

Binary files a/__pycache__/app.cpython-311.pyc and b/__pycache__/app.cpython-311.pyc differ

docs/FIXES/PHI3_COMPATIBILITY_FIX.md DELETED

@@ -1,257 +0,0 @@
-# Fix: Phi-3 Model Compatibility Issues
-## Issues Fixed
-### Issue 1: ✅ cache_dir Model Kwargs Error
-```
-ValueError: The following `model_kwargs` are not used by the model: ['cache_dir']
-```
-### Issue 2: ✅ DynamicCache Compatibility Error
-```
-AttributeError: 'DynamicCache' object has no attribute 'get_max_length'
-```
----
-## Root Causes
-### Issue 1: cache_dir Error
-- `cache_dir` was being passed in `model_kwargs` or `pipeline_kwargs`
-- These parameters can leak into the `generate()` method
-- Models reject `cache_dir` during generation since it's only valid during loading
-### Issue 2: DynamicCache Error
-- Phi-3 models use a long-context cache mechanism with `DynamicCache`
-- Older cached model code (in `transformers_modules`) uses `get_max_length()` method
-- Newer transformers library's `DynamicCache` class doesn't have this method
-- This causes compatibility issues between cached model code and current library
----
-## Solutions Implemented
-### Fix 1: cache_dir via Environment Variable
-**File:** `services/ai-service/src/ai_med_extract/utils/unified_model_manager.py` (Lines 200-209)
-```python
-# Set cache directory via environment variable (safest approach)
-# This ensures it's only used during from_pretrained(), not passed to generate()
-if not IS_T4_MEDIUM:
-    # Local environment
-    cache_dir = os.environ.get('HF_HOME', os.path.join(os.path.expanduser('~'), '.cache', 'huggingface'))
-    os.environ['HF_HOME'] = cache_dir
-else:
-    # T4 environment
-    from .model_config import T4_CACHE_DIR
-    os.environ['HF_HOME'] = T4_CACHE_DIR
-```
-**Why this works:**
-- `HF_HOME` is the official environment variable for transformers cache
-- It's read during `from_pretrained()` but **never** passed to `generate()`
-- Completely eliminates the `cache_dir` error
-**Also updated:** `model_config.py` to remove `cache_dir` from `T4_OPTIMIZATIONS`
-### Fix 2: Disable Cache for Phi-3 Models
-**File:** `services/ai-service/src/ai_med_extract/utils/unified_model_manager.py`
-**Location 1:** Model Loading (Lines 223-227)
-```python
-# CRITICAL FIX: Disable use_cache for Phi-3 models to avoid DynamicCache compatibility issues
-# The cached Phi-3 model code may use get_max_length() which doesn't exist in newer DynamicCache
-if "phi-3" in self.name.lower() or "phi3" in self.name.lower():
-    model_kwargs["use_cache"] = False
-```
-**Location 2:** Generation (Lines 300-307)
-```python
-# Prepare generation kwargs
-gen_kwargs = {}
-# CRITICAL FIX: Disable cache for Phi-3 models to avoid DynamicCache compatibility issues
-if "phi-3" in self.name.lower() or "phi3" in self.name.lower():
-    gen_kwargs["use_cache"] = False
-    logger.info(f"Disabled cache for Phi-3 model {self.name} to avoid compatibility issues")
-```
-**Why this works:**
-- Disabling `use_cache` prevents Phi-3 from using the problematic `DynamicCache` mechanism
-- The model runs slightly slower but avoids the `get_max_length()` error
-- All Phi-3 variants are covered: `Phi-3-small`, `Phi-3-mini`, `Phi-3-mini-128k`, etc.
----
-## Affected Models
-### All Phi-3 Variants
-- ✅ `microsoft/Phi-3-small-8k-instruct`
-- ✅ `microsoft/Phi-3-mini-4k-instruct`
-- ✅ `microsoft/Phi-3-mini-128k-instruct`
-- ✅ `microsoft/Phi-3-medium-4k-instruct`
-- ✅ Any other Phi-3 model
-### All Text-Generation Models
-- ✅ Any model using `text-generation` pipeline
-- ✅ `cache_dir` fix applies universally
----
-## Testing
-### Test Case 1: Phi-3-small with Text Generation
-**Request:**
-```json
-{
-  "mode": "stream",
-  "patientid": 4268,
-  "token": "your-token",
-  "key": "https://api.glitzit.com",
-  "patient_summarizer_model_name": "microsoft/Phi-3-small-8k-instruct",
-  "patient_summarizer_model_type": "text-generation",
-  "custom_prompt": "create a clinical patient summary in markdown"
-}
-```
-**Before Fixes:**
-- ❌ Error 1: `cache_dir` not used by model
-- ❌ Error 2: `DynamicCache` has no attribute `get_max_length`
-**After Fixes:**
-- ✅ Model loads successfully
-- ✅ Generates patient summary without errors
-- ℹ️ Note: May auto-switch to `Phi-3-mini-128k-instruct` on Windows (Triton unavailable)
-### Test Case 2: Default Phi-3 Model
-**Request:**
-```json
-{
-  "mode": "stream",
-  "patientid": 4268,
-  "token": "your-token",
-  "key": "https://api.glitzit.com"
-}
-```
-**Result:**
-- ✅ Uses default Phi-3 GGUF model
-- ✅ No cache issues
----
-## Performance Impact
-### cache_dir Fix
-- **Impact:** None
-- **Reason:** Environment variable approach is just as efficient as parameter passing
-### use_cache=False for Phi-3
-- **Impact:** Slight performance decrease (~5-10% slower)
-- **Reason:** Model can't reuse cached key-values during generation
-- **Trade-off:** Worth it to avoid crashes and ensure compatibility
-- **Alternative:** Update transformers library and clear cache (more complex)
----
-## Alternative Solutions Considered
-### Alternative 1: Clear HuggingFace Cache
-```bash
-rm -rf D:\tmp\huggingface\modules\transformers_modules
-```
-- **Pros:** Would fix DynamicCache issue permanently
-- **Cons:** Requires manual intervention, re-downloads models
-### Alternative 2: Update transformers Library
-```bash
-pip install --upgrade transformers
-```
-- **Pros:** May fix compatibility
-- **Cons:** Could break other models, requires testing
-### Alternative 3: Use Different Model
-```json
-{
-  "patient_summarizer_model_name": "google/flan-t5-large",
-  "patient_summarizer_model_type": "summarization"
-}
-```
-- **Pros:** No Phi-3 compatibility issues
-- **Cons:** Different model quality, not instruction-tuned for medical text
-**Our Choice:** Disable cache for Phi-3 models (minimal impact, maximum compatibility)
----
-## Logs to Monitor
-### Successful Load
-```
-2025-11-24 10:29:38,016 - INFO - Loading model: microsoft/Phi-3-mini-128k-instruct (text-generation)
-2025-11-24 10:29:43,231 - INFO - Model microsoft/Phi-3-mini-128k-instruct loaded in 5.22s
-```
-### Cache Disabled Log
-```
-2025-11-24 10:29:46,808 - INFO - Disabled cache for Phi-3 model microsoft/Phi-3-mini-128k-instruct to avoid compatibility issues
-```
-### Success
-```
-INFO:     127.0.0.1:49677 - "POST /generate_patient_summary?stream=true HTTP/1.1" 200 OK
-```
----
-## Files Modified
-1. **`services/ai-service/src/ai_med_extract/utils/model_config.py`**
-   - Removed `cache_dir` from `T4_OPTIMIZATIONS`
-   - Added `T4_CACHE_DIR` constant
-2. **`services/ai-service/src/ai_med_extract/utils/unified_model_manager.py`**
-   - Lines 200-209: Set cache via `HF_HOME` environment variable
-   - Lines 223-227: Disable cache during Phi-3 model loading
-   - Lines 300-307: Disable cache during Phi-3 generation
----
-## Recommended Models
-### Best for Medical Summaries (No Issues)
-```json
-{
-  "patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
-  "patient_summarizer_model_type": "gguf"
-}
-```
-- ✅ No cache issues (uses llama.cpp backend)
-- ✅ Fast and efficient
-- ✅ Medical domain knowledge
-### Best for Long Context (Fixed Now)
-```json
-{
-  "patient_summarizer_model_name": "microsoft/Phi-3-small-8k-instruct",
-  "patient_summarizer_model_type": "text-generation"
-}
-```
-- ✅ 8k context window
-- ✅ Works with both fixes applied
-- ⚠️ May auto-switch to Phi-3-mini-128k on Windows
----
-## Date
-Fixed: November 24, 2025
-## Status
-✅ **RESOLVED** - Both issues fixed and tested
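
Review note: the deleted document describes gating `use_cache` on the model name, both at load time and at generation time. A minimal sketch of that check follows, assuming hypothetical helper names (`is_phi3`, `phi3_generation_kwargs`) that do not appear in this repository:

```python
def is_phi3(model_name: str) -> bool:
    """Return True when the name looks like a Phi-3 variant (case-insensitive)."""
    lowered = model_name.lower()
    return "phi-3" in lowered or "phi3" in lowered


def phi3_generation_kwargs(model_name: str) -> dict:
    """Build generation kwargs, disabling the KV cache only for Phi-3 names."""
    gen_kwargs = {}
    if is_phi3(model_name):
        # Same trade-off as the deleted note: slower decoding in exchange for
        # avoiding the DynamicCache.get_max_length() incompatibility.
        gen_kwargs["use_cache"] = False
    return gen_kwargs
```

A name-only check like this also matches the GGUF file name (`microsoft/Phi-3-mini-4k-instruct-gguf/...`), which the same document says is unaffected, so a real implementation would probably also need to look at the model type.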

docs/archive/COMPREHENSIVE_STREAMING_FIX.md CHANGED

@@ -31,7 +31,7 @@ is_gguf_mode = (data.get('generation_mode') == 'gguf' or
 ### **3. Extended Timeout Configuration**
 ```python
 # Extended timeout for GGUF operations
-max_wait_time = 1200  # 10 minutes for GGUF operations
 heartbeat_interval = 5  # Every 5 seconds
 ```
@@ -54,7 +54,7 @@ heartbeat_interval = 5  # Every 5 seconds
 ### **5. Enhanced SSE Generator**
 ```python
 def sse_generator_extended(job_id):
-    max_wait_time = 1200  # 10 minutes for GGUF operations
     heartbeat_interval = 5  # Every 5 seconds
     # Enhanced logging and progress updates
 ```

 ### **3. Extended Timeout Configuration**
 ```python
 # Extended timeout for GGUF operations
+max_wait_time = 600  # 10 minutes for GGUF operations
 heartbeat_interval = 5  # Every 5 seconds
 ```
 ### **5. Enhanced SSE Generator**
 ```python
 def sse_generator_extended(job_id):
+    max_wait_time = 600  # 10 minutes for GGUF operations
     heartbeat_interval = 5  # Every 5 seconds
     # Enhanced logging and progress updates
 ```
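
Review note: the hunk shows only the two constants of `sse_generator_extended`, not its body. The sketch below is one way such a loop could combine `max_wait_time` and `heartbeat_interval`. It assumes a hypothetical `poll_result` callback and single-line results, and is not the repository's implementation:

```python
import time
from typing import Callable, Iterator, Optional


def sse_heartbeats(
    poll_result: Callable[[], Optional[str]],
    max_wait_time: float = 600.0,      # seconds; 600 s is 10 minutes
    heartbeat_interval: float = 5.0,   # seconds between keep-alive frames
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield SSE frames until a result arrives or max_wait_time elapses."""
    deadline = clock() + max_wait_time
    while clock() < deadline:
        result = poll_result()
        if result is not None:
            # Assumes the result has no embedded newlines.
            yield f"data: {result}\n\n"
            return
        # A line starting with ':' is an SSE comment: it keeps the
        # connection open without delivering an event to the client.
        yield ": heartbeat\n\n"
        sleep(heartbeat_interval)
    yield "event: error\ndata: timed out\n\n"
```

Passing `clock` and `sleep` as parameters keeps the loop testable without real waiting. Note that `1200` seconds is 20 minutes, so only the reverted value `600` agrees with the "10 minutes" comment.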

docs/archive/patient_summary_models_review.md CHANGED

@@ -160,7 +160,7 @@ elif model_type == "causal-openvino":
 #### Weaknesses
 - ⚠️ **Slight quality loss**: Q4 quantization may reduce quality slightly
-- ⚠️ **Longer timeouts**: Extended timeout needed (1200s on HF Spaces)
 - ⚠️ **File path parsing**: Requires special handling for filename extraction
 #### Implementation Details
@@ -428,7 +428,7 @@ Based on HF Spaces configuration (`hf_spaces_config.py`):
 - ✅ **RAM**: ~3-4GB during inference
 - ✅ **Speed**: Very good on T4 (GGUF optimized)
 - ✅ **HF Spaces Config**: Primary GGUF model (line 33)
-- ✅ **Extended Timeout**: 1200s configured for HF Spaces (routes_fastapi.py line 1075)
 - ✅ **Quantization**: Q4 reduces memory by ~75%
 #### Performance Estimates
@@ -449,7 +449,7 @@ Based on HF Spaces configuration (`hf_spaces_config.py`):
 #### Recommendations
 - **Best Choice** for cost-conscious deployment
 - Use when expecting high concurrent load
-- Extended timeout already configured (1200s)
 - Cache-friendly for repeated requests
 ---
@@ -551,7 +551,7 @@ GGUF (Phi-3-Q4):   ~2.0GB GPU  (16% of usable)
 Based on `routes_fastapi.py`:
 - **Standard models**: 120-180s timeout
-- **GGUF models**: 1200s extended timeout (line 1075)
 - **HF Spaces detection**: Automatic (line 1073-1074)
 ### Optimization Strategies for T4
@@ -619,7 +619,7 @@ Fallback Model: microsoft/Phi-3-mini-4k-instruct-gguf
 Emergency Fallback: google/flan-t5-large
 Max Concurrent: 5-6 requests (BART), 8-10 (GGUF)
 Memory Limit: 80% (12.8GB GPU, 24GB RAM)
-Timeout: 180s (standard), 1200s (GGUF)
 ```
 ### 📊 **Expected Performance**

 #### Weaknesses
 - ⚠️ **Slight quality loss**: Q4 quantization may reduce quality slightly
+- ⚠️ **Longer timeouts**: Extended timeout needed (600s on HF Spaces)
 - ⚠️ **File path parsing**: Requires special handling for filename extraction
 #### Implementation Details
 - ✅ **RAM**: ~3-4GB during inference
 - ✅ **Speed**: Very good on T4 (GGUF optimized)
 - ✅ **HF Spaces Config**: Primary GGUF model (line 33)
+- ✅ **Extended Timeout**: 600s configured for HF Spaces (routes_fastapi.py line 1075)
 - ✅ **Quantization**: Q4 reduces memory by ~75%
 #### Performance Estimates
 #### Recommendations
 - **Best Choice** for cost-conscious deployment
 - Use when expecting high concurrent load
+- Extended timeout already configured (600s)
 - Cache-friendly for repeated requests
 ---
 Based on `routes_fastapi.py`:
 - **Standard models**: 120-180s timeout
+- **GGUF models**: 600s extended timeout (line 1075)
 - **HF Spaces detection**: Automatic (line 1073-1074)
 ### Optimization Strategies for T4
 Emergency Fallback: google/flan-t5-large
 Max Concurrent: 5-6 requests (BART), 8-10 (GGUF)
 Memory Limit: 80% (12.8GB GPU, 24GB RAM)
+Timeout: 180s (standard), 600s (GGUF)
 ```
 ### 📊 **Expected Performance**

docs/hf-spaces/FILES_CREATED.md CHANGED

@@ -125,7 +125,7 @@ python verify_cache.py
 ### 7. `MODEL_CACHING_SUMMARY.md` ⭐ START HERE
 **Purpose**: Overview and answer to your question
-**Size**: ~1200 lines
 **Contents**:
 - Direct answer to your question
 - Performance comparison
@@ -183,7 +183,7 @@ python verify_cache.py
 ### 11. `README_HF_SPACES.md`
 **Purpose**: Main README for HF Spaces deployment
-**Size**: ~1200 lines
 **Contents**:
 - Quick start (3 steps)
 - File structure
@@ -231,11 +231,11 @@ python verify_cache.py
 | `entrypoint.sh` | Script | ⭐ YES | 40 lines | Startup verification |
 | `verify_cache.py` | Tool | Recommended | 200 lines | Verify cache |
 | `health_endpoints.py` | Code | Recommended | +120 lines | Health endpoints |
-| `MODEL_CACHING_SUMMARY.md` | Docs | ⭐ START HERE | 1200 lines | Overview |
 | `HF_SPACES_QUICKSTART.md` | Docs | Recommended | 400 lines | Quick start |
 | `HF_SPACES_DEPLOYMENT.md` | Docs | Reference | 800 lines | Full guide |
 | `DEPLOYMENT_CHECKLIST.md` | Docs | Helpful | 400 lines | Checklist |
-| `README_HF_SPACES.md` | Docs | Reference | 1200 lines | Main README |
 | `COMPARISON_BEFORE_AFTER.md` | Docs | Helpful | 500 lines | Comparison |
 | `FILES_CREATED.md` | Docs | Reference | This file | Index |

 ### 7. `MODEL_CACHING_SUMMARY.md` ⭐ START HERE
 **Purpose**: Overview and answer to your question
+**Size**: ~600 lines
 **Contents**:
 - Direct answer to your question
 - Performance comparison
 ### 11. `README_HF_SPACES.md`
 **Purpose**: Main README for HF Spaces deployment
+**Size**: ~600 lines
 **Contents**:
 - Quick start (3 steps)
 - File structure
 | `entrypoint.sh` | Script | ⭐ YES | 40 lines | Startup verification |
 | `verify_cache.py` | Tool | Recommended | 200 lines | Verify cache |
 | `health_endpoints.py` | Code | Recommended | +120 lines | Health endpoints |
+| `MODEL_CACHING_SUMMARY.md` | Docs | ⭐ START HERE | 600 lines | Overview |
 | `HF_SPACES_QUICKSTART.md` | Docs | Recommended | 400 lines | Quick start |
 | `HF_SPACES_DEPLOYMENT.md` | Docs | Reference | 800 lines | Full guide |
 | `DEPLOYMENT_CHECKLIST.md` | Docs | Helpful | 400 lines | Checklist |
+| `README_HF_SPACES.md` | Docs | Reference | 600 lines | Main README |
 | `COMPARISON_BEFORE_AFTER.md` | Docs | Helpful | 500 lines | Comparison |
 | `FILES_CREATED.md` | Docs | Reference | This file | Index |

docs/hf-spaces/INDEX.md CHANGED

@@ -122,8 +122,8 @@ All documentation for deploying to Hugging Face Spaces with pre-cached models.
 | DEPLOYMENT_CHECKLIST.md | ~400 | Use while deploying | ⭐⭐ |
 | MODEL_UPDATE_SUMMARY.md | ~500 | 10 min | ⭐⭐ |
 | HF_SPACES_DEPLOYMENT.md | ~800 | 30 min | ⭐ |
-| MODEL_CACHING_SUMMARY.md | ~1200 | 15 min | ⭐ |
-| README_HF_SPACES.md | ~1200 | Reference | ⭐ |
 | COMPARISON_BEFORE_AFTER.md | ~500 | Reference | Optional |
 | FILES_CREATED.md | ~500 | Reference | Optional |

 | DEPLOYMENT_CHECKLIST.md | ~400 | Use while deploying | ⭐⭐ |
 | MODEL_UPDATE_SUMMARY.md | ~500 | 10 min | ⭐⭐ |
 | HF_SPACES_DEPLOYMENT.md | ~800 | 30 min | ⭐ |
+| MODEL_CACHING_SUMMARY.md | ~600 | 15 min | ⭐ |
+| README_HF_SPACES.md | ~600 | Reference | ⭐ |
 | COMPARISON_BEFORE_AFTER.md | ~500 | Reference | Optional |
 | FILES_CREATED.md | ~500 | Reference | Optional |

models_config.json CHANGED

@@ -41,31 +41,13 @@
     {
       "name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
       "type": "gguf",
-      "is_active": false,
       "cached": true,
-      "description": "Phi-3 Mini GGUF Q4 quantized - 4k Context",
       "use_case": "Fast patient summary generation with CPU/GPU",
       "repo_id": "microsoft/Phi-3-mini-4k-instruct-gguf",
       "filename": "Phi-3-mini-4k-instruct-q4.gguf"
     },
-    {
-      "name": "microsoft/Phi-3-mini-128k-instruct",
-      "type": "causal-openvino",
-      "is_active": true,
-      "cached": false,
-      "description": "Phi-3 Mini 128k Context - PRIMARY MODEL",
-      "use_case": "Long-context patient summary generation"
-    },
-    {
-      "name": "microsoft/Phi-3-mini-128k-instruct-gguf/Phi-3-mini-128k-instruct-q4.gguf",
-      "type": "gguf",
-      "is_active": false,
-      "cached": false,
-      "description": "Phi-3 Mini 128k Context GGUF Q4",
-      "use_case": "Local testing with 128k context (CPU/GPU)",
-      "repo_id": "microsoft/Phi-3-mini-128k-instruct-gguf",
-      "filename": "Phi-3-mini-128k-instruct-q4.gguf"
-    },
     {
       "name": "google/flan-t5-large",
       "type": "summarization",
@@ -93,4 +75,5 @@
     "Other models can be requested at runtime and will be downloaded automatically",
     "Runtime downloads are cached for subsequent uses"
   ]
-}

     {
       "name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
       "type": "gguf",
+      "is_active": true,
       "cached": true,
+      "description": "Phi-3 Mini GGUF Q4 quantized - PRIMARY MODEL",
       "use_case": "Fast patient summary generation with CPU/GPU",
       "repo_id": "microsoft/Phi-3-mini-4k-instruct-gguf",
       "filename": "Phi-3-mini-4k-instruct-q4.gguf"
     },
     {
       "name": "google/flan-t5-large",
       "type": "summarization",
     "Other models can be requested at runtime and will be downloaded automatically",
     "Runtime downloads are cached for subsequent uses"
   ]
+}
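
Review note: this hunk moves `is_active` back to the 4k GGUF entry and drops both 128k entries. The diff does not show how the application reads the flag. The sketch below is one possible lookup, assuming a hypothetical top-level `"models"` key and assuming that exactly one entry should be active:

```python
import json


def pick_active_model(config_text: str, models_key: str = "models") -> dict:
    """Return the single entry flagged is_active, or raise if the flag is ambiguous."""
    cfg = json.loads(config_text)
    active = [m for m in cfg.get(models_key, []) if m.get("is_active")]
    if len(active) != 1:
        raise ValueError(f"expected exactly one active model, found {len(active)}")
    return active[0]
```

Failing on zero or several active entries would catch an edit like this one that flips one flag without the other.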

services/ai-service/DEPLOYMENT_FIX.md CHANGED

@@ -17,13 +17,13 @@ The deployment was failing with a "Scheduling failure: unable to schedule" error
 **Before:**
 ```dockerfile
 RUN pip install --no-cache-dir -r /app/requirements.txt gunicorn
-CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:7860", "--timeout", "1200", "wsgi:app"]
 ```
 **After:**
 ```dockerfile
 RUN pip install --no-cache-dir -r /app/requirements.txt uvicorn[standard]
-CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--timeout-keep-alive", "1200", "--workers", "4"]
 ```
 ### Why This Works
@@ -66,12 +66,12 @@ If you need more production-grade deployment with multiple workers:
 #### Option A: Gunicorn with Uvicorn Workers (Recommended for Production)
 ```dockerfile
 RUN pip install --no-cache-dir -r /app/requirements.txt gunicorn uvicorn[standard]
-CMD ["gunicorn", "app:app", "--workers", "4", "--worker-class", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:7860", "--timeout", "1200"]
 ```
 #### Option B: Pure Uvicorn (Current, Good for Medium Load)
 ```dockerfile
-CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--timeout-keep-alive", "1200", "--workers", "4"]
 ```
 ### 3. Health Check Configuration

 **Before:**
 ```dockerfile
 RUN pip install --no-cache-dir -r /app/requirements.txt gunicorn
+CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:7860", "--timeout", "600", "wsgi:app"]
 ```
 **After:**
 ```dockerfile
 RUN pip install --no-cache-dir -r /app/requirements.txt uvicorn[standard]
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--timeout-keep-alive", "600", "--workers", "4"]
 ```
 ### Why This Works
 #### Option A: Gunicorn with Uvicorn Workers (Recommended for Production)
 ```dockerfile
 RUN pip install --no-cache-dir -r /app/requirements.txt gunicorn uvicorn[standard]
+CMD ["gunicorn", "app:app", "--workers", "4", "--worker-class", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:7860", "--timeout", "600"]
 ```
 #### Option B: Pure Uvicorn (Current, Good for Medium Load)
 ```dockerfile
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--timeout-keep-alive", "600", "--workers", "4"]
 ```
 ### 3. Health Check Configuration

services/ai-service/Dockerfile.prod CHANGED

@@ -22,4 +22,4 @@ EXPOSE 7860
 ENV PRELOAD_SMALL_MODELS=false
 # Use uvicorn directly for FastAPI (ASGI) instead of gunicorn (WSGI)
-CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--timeout-keep-alive", "1200", "--workers", "4"]

 ENV PRELOAD_SMALL_MODELS=false
 # Use uvicorn directly for FastAPI (ASGI) instead of gunicorn (WSGI)
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--timeout-keep-alive", "600", "--workers", "4"]

services/ai-service/src/__main__.py CHANGED

@@ -12,4 +12,4 @@ initialize_agents(app)
 if __name__ == '__main__':
     import uvicorn
-    uvicorn.run(app, host="0.0.0.0", port=7860, timeout_keep_alive=1200)

 if __name__ == '__main__':
     import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=7860, timeout_keep_alive=600)

services/ai-service/src/ai_med_extract/__pycache__/inference_service.cpython-311.pyc CHANGED

Binary files a/services/ai-service/src/ai_med_extract/__pycache__/inference_service.cpython-311.pyc and b/services/ai-service/src/ai_med_extract/__pycache__/inference_service.cpython-311.pyc differ

services/ai-service/src/ai_med_extract/__pycache__/phi_scrubber_service.cpython-311.pyc CHANGED

Binary files a/services/ai-service/src/ai_med_extract/__pycache__/phi_scrubber_service.cpython-311.pyc and b/services/ai-service/src/ai_med_extract/__pycache__/phi_scrubber_service.cpython-311.pyc differ

services/ai-service/src/ai_med_extract/agents/__pycache__/patient_summary_agent.cpython-311.pyc CHANGED

Binary files a/services/ai-service/src/ai_med_extract/agents/__pycache__/patient_summary_agent.cpython-311.pyc and b/services/ai-service/src/ai_med_extract/agents/__pycache__/patient_summary_agent.cpython-311.pyc differ

services/ai-service/src/ai_med_extract/agents/__pycache__/summarizer.cpython-311.pyc CHANGED

Binary files a/services/ai-service/src/ai_med_extract/agents/__pycache__/summarizer.cpython-311.pyc and b/services/ai-service/src/ai_med_extract/agents/__pycache__/summarizer.cpython-311.pyc differ

services/ai-service/src/ai_med_extract/agents/patient_summary_agent.py CHANGED

@@ -37,6 +37,26 @@ class PatientSummarizerAgent:
         )
     def configure_model(self, model_name: str, model_type: str = None):
         is_hf_spaces = (
             os.getenv('HF_SPACES', 'false').lower() == 'true'
             or os.getenv('HUGGINGFACE_SPACES', 'false').lower() == 'true'

         )
     def configure_model(self, model_name: str, model_type: str = None):
+        """Configure the model dynamically from payload"""
+        from ..utils.model_config import detect_model_type
+        self.current_model_name = model_name
+        self.current_model_type = model_type or detect_model_type(model_name)
+        # Get model loader from unified manager
+        from ..utils.unified_model_manager import unified_model_manager
+        self.model_loader = unified_model_manager.get_model(
+            self.current_model_name,
+            self.current_model_type,
+            lazy=True  # Lazy loading for better performance
+        )
+        logging.info(f"Configured PatientSummarizerAgent with {model_name} ({self.current_model_type})")
+        return self.model_loader
+    def _initialize_model_loader(self):
+        """Initialize the model loader using the unified model manager with enhanced cache handling"""
+        import os
         is_hf_spaces = (
             os.getenv('HF_SPACES', 'false').lower() == 'true'
             or os.getenv('HUGGINGFACE_SPACES', 'false').lower() == 'true'
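
Review note: the context lines repeat the same environment check that `routes_fastapi.py` performs inline. A small shared helper could centralize it. The sketch below is hypothetical (`running_on_hf_spaces` is not a name from this repository) and covers only the two variables visible in this hunk:

```python
import os
from typing import Mapping, Optional


def running_on_hf_spaces(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when either Spaces flag is set to 'true' (case-insensitive)."""
    env = os.environ if env is None else env
    return any(
        env.get(name, "false").lower() == "true"
        for name in ("HF_SPACES", "HUGGINGFACE_SPACES")
    )
```

Accepting an optional mapping keeps the check testable without mutating `os.environ`.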

services/ai-service/src/ai_med_extract/api/routes_fastapi.py CHANGED

@@ -483,78 +483,25 @@ def get_gguf_pipeline(model_name: str, filename: str = None):
             start_time = time.time()
             # Try to load the GGUF model using unified manager
             try:
-                import traceback
-                model = unified_model_manager.get_model(model_name, "gguf", filename, lazy=False)
-                # Check if model was forced to fallback due to T4 compatibility
-                if model.model_type == "fallback":
-                    fallback_reason = model.fallback_reason or f"Model {model_name} is not supported/optimal for T4 Medium"
-                    print(f"[GGUF] ⚠️ Model forced to fallback: {fallback_reason}")
-                    print(f"[GGUF] Using fallback pipeline")
-                    GGUF_MODEL_CACHE[key] = create_fallback_pipeline()
-                    return GGUF_MODEL_CACHE[key]
-                # Ensure model is actually loaded
-                loaded_model = model.load()
-                if loaded_model is None:
-                    # Get detailed error information
-                    error_msg = model._error_message or "Unknown error"
-                    fallback_reason = model.fallback_reason or f"Model {model_name} failed to load"
-                    print(f"[GGUF] ❌ Model load returned None")
-                    print(f"[GGUF] Error message: {error_msg}")
-                    print(f"[GGUF] Fallback reason: {fallback_reason}")
-                    print(f"[GGUF] Model status: {model.status}")
-                    raise RuntimeError(f"Model {model_name} failed to load: {error_msg}")
                 # Wrap in pipeline-like interface for compatibility
                 class GGUFModelWrapper:
                     def __init__(self, model):
                         self.model = model
                     def generate(self, prompt, **kwargs):
-                        from ..utils.unified_model_manager import GenerationConfig, ModelStatus
                         config = GenerationConfig(**kwargs)
-                        # Ensure model is loaded before generating
-                        if self.model.status != ModelStatus.LOADED:
-                            loaded = self.model.load()
-                            if loaded is None:
-                                error_msg = self.model._error_message or "Unknown error"
-                                raise RuntimeError(f"Model {self.model.name} is not loaded and failed to load: {error_msg}")
                         return self.model.generate(prompt, config)
                     def generate_full_summary(self, prompt, **kwargs):
                         return self.generate(prompt, **kwargs)
-                GGUF_MODEL_CACHE[key] = GGUFModelWrapper(loaded_model)
                 load_time = time.time() - start_time
-                print(f"[GGUF] ✅ Model loaded successfully in {load_time:.2f}s: {model_name}")
             except Exception as e:
-                import traceback
                 load_time = time.time() - start_time
-                error_type = type(e).__name__
-                error_msg = str(e)
-                error_traceback = traceback.format_exc()
-                print(f"[GGUF] ❌ Failed to load model {model_name} after {load_time:.2f}s")
-                print(f"[GGUF] Error type: {error_type}")
-                print(f"[GGUF] Error message: {error_msg}")
-                # Try to get additional error info from model if it exists
-                try:
-                    if 'model' in locals():
-                        if hasattr(model, '_error_message') and model._error_message:
-                            print(f"[GGUF] Model error message: {model._error_message}")
-                        if hasattr(model, 'fallback_reason') and model.fallback_reason:
-                            print(f"[GGUF] Fallback reason: {model.fallback_reason}")
-                        if hasattr(model, 'status'):
-                            print(f"[GGUF] Model status: {model.status}")
-                except:
-                    pass
-                # Print full traceback for debugging
-                print(f"[GGUF] Full traceback:\n{error_traceback}")
                 # If model loading fails, use fallback
-                print("[GGUF] 🔄 Using fallback pipeline")
                 GGUF_MODEL_CACHE[key] = create_fallback_pipeline()
         except Exception as e:
             print(f"[GGUF] Critical error in model loading: {e}")
@@ -688,7 +635,7 @@ def generate_rule_based_summary(baseline, delta_text, visits=None, patientid=Non
     # Clinical Overview: summarize baseline
     if baseline:
-        baseline_snip = baseline[:1200].replace("\n", " ")
         lines_assessment.append(f"- Baseline: {baseline_snip}")
     else:
         lines_assessment.append("- No baseline data available.")
@@ -939,7 +886,7 @@ You are a clinical assistant. {custom_prompt}
 PATIENT VISIT DATA:
 {visit_data_text}</s>
 <|user|>
-strictly rely on data,dont halucinate or invent any information.</s>
 <|assistant|>"""
     else:
         base_prompt = process_patient_record_plain_text({
@@ -1022,7 +969,6 @@ async def load_model_with_fallback(model_name, model_type, fallback_type=None):
     from ..utils.unified_model_manager import unified_model_manager as _unified_manager
     from ..utils import model_config as _mc
-    primary_error = None
     try:
         model = _unified_manager.get_model(
             name=model_name,
@@ -1031,12 +977,8 @@ async def load_model_with_fallback(model_name, model_type, fallback_type=None):
         )
         if model.load():
             return model, model_name, model_type, False, None
-        else:
-            # Model failed to load (returned None)
-            primary_error = f"Model {model_name} ({model_type}) failed to load (load() returned None)"
     except Exception as e:
-        primary_error = f"Model {model_name} ({model_type}) failed to load: {type(e).__name__}: {str(e)}"
-        logger.warning(primary_error)
     # Try fallback
     if fallback_type:
@@ -1049,9 +991,7 @@ async def load_model_with_fallback(model_name, model_type, fallback_type=None):
                 filename=None
             )
             if fallback_model.load():
-                fallback_reason = primary_error or f"Primary model {model_name} ({model_type}) failed to load"
-                # Store fallback reason in the model object for later retrieval
-                fallback_model.set_fallback_reason(fallback_reason)
                 return fallback_model, fallback_model_name, fallback_type, True, fallback_reason
         except Exception as e:
             logger.error(f"Fallback model also failed: {e}")
@@ -1144,8 +1084,8 @@ async def async_patient_summary(data, job_id=None):
             try:
                 response = requests.post(
                     ehr_url,
-                    json={"patientid": patientid},
-                    headers=headers,
                     timeout=EHR_TIMEOUT
                 )
                 logging.info(f"EHR API response status: {response.status_code}")
@@ -1408,7 +1348,7 @@ async def async_patient_summary(data, job_id=None):
             try:
                 # Use extended timeout for GGUF operations on HF Spaces
                 is_hf_spaces = os.environ.get('HF_SPACES', 'false').lower() == 'true'
-                timeout_value = timeout_config.get("gguf_extended_timeout" if is_hf_spaces else "gguf_timeout", 1200)
                 if cache_key not in GGUF_PIPELINE_CACHE:
                     if job_id:
@@ -1644,10 +1584,10 @@ async def async_patient_summary(data, job_id=None):
             try:
                 raw_summary = await asyncio.wait_for(
                     generate_with_progress(),
-                    timeout=timeout_config.get("generation_timeout", 1200)
                 )
             except asyncio.TimeoutError:
-                error_msg = f"Text generation timed out after {timeout_config.get('generation_timeout', 1200)} seconds"
                 log_error_with_context(Exception(error_msg), "Text generation timeout", job_id)
                 update_job_with_error(job_id, error_msg, "generation_timeout")
                 raise Exception(error_msg)
@@ -1723,10 +1663,10 @@ async def async_patient_summary(data, job_id=None):
             try:
                 result_sum = await asyncio.wait_for(
                     asyncio.to_thread(model.generate, context, config),
-                    timeout=timeout_config.get("generation_timeout", 1200)
                 )
             except asyncio.TimeoutError:
-                error_msg = f"Summarization timed out after {timeout_config.get('generation_timeout', 1200)} seconds"
                 log_error_with_context(Exception(error_msg), "Summarization timeout", job_id)
                 update_job_with_error(job_id, error_msg, "generation_timeout")
                 raise Exception(error_msg)
@@ -1837,7 +1777,7 @@ async def async_patient_summary(data, job_id=None):
                             temperature=0.1,
                             top_p=0.5,
                         ),
-                        timeout=1200
                     )
                 else:
                     config = create_generation_config(data, min_tokens=100, temperature=0.1, top_p=0.5)
@@ -1887,7 +1827,7 @@ async def async_patient_summary(data, job_id=None):
         if "timeout" in error_str.lower():
             error_category = "TIMEOUT"
             # Enhanced timeout message with recommendations
-            user_message = f"""Summary generation timed out after {timeout_config.get('generation_timeout', 1200)} seconds.
 This may be due to:
 - Large patient dataset requiring more processing time
@@ -2012,7 +1952,7 @@ def process_patient_summary_background(data, job_id):
                             ehr_url,
                             json={"patientid": patientid},
                             headers=headers,
-                            timeout=1200
                         )
                         if response.status_code == 200:
                             sample_data = response.json()
@@ -2477,7 +2417,7 @@ async def home():
                 border-radius: 20px;
                 padding: 40px;
                 box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
-                max-width: 1200px;
                 width: 100%;
                 animation: fadeIn 0.5s ease-in;
             }
@@ -2493,7 +2433,7 @@ async def home():
                 padding: 8px 16px;
                 border-radius: 20px;
                 font-size: 14px;
-                font-weight: 1200;
                 margin-bottom: 20px;
             }
             .status-dot {
@@ -2526,7 +2466,7 @@ async def home():
             }
             .info-title {
                 color: #374151;
-                font-weight: 1200;
                 margin-bottom: 15px;
                 font-size: 18px;
             }
@@ -2551,7 +2491,7 @@ async def home():
                 padding: 4px 8px;
                 border-radius: 4px;
                 font-size: 12px;
-                font-weight: 1200;
                 margin-right: 10px;
                 min-width: 50px;
                 text-align: center;
@@ -2572,7 +2512,7 @@ async def home():
             .link {
                 color: #667eea;
                 text-decoration: none;
-                font-weight: 1200;
             }
             .link:hover {
                 text-decoration: underline;
@@ -2764,7 +2704,7 @@ async def generate_patient_summary_large_data(
             """Wait for slot and then process."""
             try:
                 # Wait for processing slot
-                if queue_manager.wait_for_slot(request_id, timeout=1200):
                     # Update job status to show processing started
                     job_manager.update_job(job_id, JOB_STATUS["STARTED"], progress=5, data={'message': 'Processing slot acquired, starting generation...'})
                     # Start background task with optimized generation
@@ -2793,7 +2733,7 @@ async def generate_patient_summary_large_data(
                 'X-Content-Type-Options': 'nosniff',
                 'Access-Control-Allow-Origin': '*',
                 'Access-Control-Allow-Headers': 'Cache-Control, Connection',
-                'Keep-Alive': 'timeout=31200',
                 # Force HTTP/1.1 to avoid HTTP/2 protocol errors
                 'X-Protocol': 'HTTP/1.1'
             }
@@ -2850,7 +2790,7 @@ async def generate_patient_summary_streaming(
             """Wait for slot and then process."""
             try:
                 # Wait for processing slot
-                if queue_manager.wait_for_slot(request_id, timeout=1200):
                     # Update job status to show processing started
                     job_manager.update_job(job_id, JOB_STATUS["STARTED"], progress=5, data={'message': 'Processing slot acquired, starting generation...'})
                     # Start background task with optimized generation
@@ -2879,7 +2819,7 @@ async def generate_patient_summary_streaming(
                 'X-Content-Type-Options': 'nosniff',
                 'Access-Control-Allow-Origin': '*',
                 'Access-Control-Allow-Headers': 'Cache-Control, Connection',
-                'Keep-Alive': 'timeout=31200',
                 # Force HTTP/1.1 to avoid HTTP/2 protocol errors
                 'X-Protocol': 'HTTP/1.1'
             }
@@ -2958,7 +2898,7 @@ async def generate_patient_summary(
                 """Wait for slot and then process."""
                 try:
                     # Wait for processing slot
-                    if queue_manager.wait_for_slot(request_id, timeout=1200):
                         # Update job status to show processing started
                         job_manager.update_job(job_id, JOB_STATUS["STARTED"], progress=5, data={'message': 'Processing slot acquired, starting generation...'})
                         # Start background task directly (not in separate thread to avoid nesting)
@@ -2988,7 +2928,7 @@ async def generate_patient_summary(
                     'X-Content-Type-Options': 'nosniff',
                     'Access-Control-Allow-Origin': '*',
                     'Access-Control-Allow-Headers': 'Cache-Control, Connection',
-                    'Keep-Alive': 'timeout=31200',
                     # Force HTTP/1.1 to avoid HTTP/2 protocol errors
                     'X-Protocol': 'HTTP/1.1'
                 }

             start_time = time.time()
             # Try to load the GGUF model using unified manager
             try:
+                model = unified_model_manager.get_model(model_name, "gguf", filename)
                 # Wrap in pipeline-like interface for compatibility
                 class GGUFModelWrapper:
                     def __init__(self, model):
                         self.model = model
                     def generate(self, prompt, **kwargs):
+                        from ..utils.unified_model_manager import GenerationConfig
                         config = GenerationConfig(**kwargs)
                         return self.model.generate(prompt, config)
                     def generate_full_summary(self, prompt, **kwargs):
                         return self.generate(prompt, **kwargs)
+                GGUF_MODEL_CACHE[key] = GGUFModelWrapper(model)
                 load_time = time.time() - start_time
+                print(f"[GGUF] Model loaded successfully in {load_time:.2f}s: {model_name}")
             except Exception as e:
                 load_time = time.time() - start_time
+                print(f"[GGUF] Failed to load model {model_name} after {load_time:.2f}s: {e}")
                 # If model loading fails, use fallback
+                print("[GGUF] Using fallback pipeline")
                 GGUF_MODEL_CACHE[key] = create_fallback_pipeline()
         except Exception as e:
             print(f"[GGUF] Critical error in model loading: {e}")
     # Clinical Overview: summarize baseline
     if baseline:
+        baseline_snip = baseline[:600].replace("\n", " ")
         lines_assessment.append(f"- Baseline: {baseline_snip}")
     else:
         lines_assessment.append("- No baseline data available.")
 PATIENT VISIT DATA:
 {visit_data_text}</s>
 <|user|>
+Generate a comprehensive patient summary based on the data above.</s>
 <|assistant|>"""
     else:
         base_prompt = process_patient_record_plain_text({
     from ..utils.unified_model_manager import unified_model_manager as _unified_manager
     from ..utils import model_config as _mc
     try:
         model = _unified_manager.get_model(
             name=model_name,
         )
         if model.load():
             return model, model_name, model_type, False, None
     except Exception as e:
+        logger.warning(f"Model {model_name} ({model_type}) failed to load: {e}")
     # Try fallback
     if fallback_type:
                 filename=None
             )
             if fallback_model.load():
+                fallback_reason = f"Primary model {model_name} ({model_type}) failed to load"
                 return fallback_model, fallback_model_name, fallback_type, True, fallback_reason
         except Exception as e:
             logger.error(f"Fallback model also failed: {e}")
             try:
                 response = requests.post(
                     ehr_url,
+                    json={"patientid": patientid},
+                    headers=headers,
                     timeout=EHR_TIMEOUT
                 )
                 logging.info(f"EHR API response status: {response.status_code}")
             try:
                 # Use extended timeout for GGUF operations on HF Spaces
                 is_hf_spaces = os.environ.get('HF_SPACES', 'false').lower() == 'true'
+                timeout_value = timeout_config.get("gguf_extended_timeout" if is_hf_spaces else "gguf_timeout", 600)
                 if cache_key not in GGUF_PIPELINE_CACHE:
                     if job_id:
             try:
                 raw_summary = await asyncio.wait_for(
                     generate_with_progress(),
+                    timeout=timeout_config.get("generation_timeout", 600)
                 )
             except asyncio.TimeoutError:
+                error_msg = f"Text generation timed out after {timeout_config.get('generation_timeout', 600)} seconds"
                 log_error_with_context(Exception(error_msg), "Text generation timeout", job_id)
                 update_job_with_error(job_id, error_msg, "generation_timeout")
                 raise Exception(error_msg)
             try:
                 result_sum = await asyncio.wait_for(
                     asyncio.to_thread(model.generate, context, config),
+                    timeout=timeout_config.get("generation_timeout", 600)
                 )
             except asyncio.TimeoutError:
+                error_msg = f"Summarization timed out after {timeout_config.get('generation_timeout', 600)} seconds"
                 log_error_with_context(Exception(error_msg), "Summarization timeout", job_id)
                 update_job_with_error(job_id, error_msg, "generation_timeout")
                 raise Exception(error_msg)
                             temperature=0.1,
                             top_p=0.5,
                         ),
+                        timeout=600
                     )
                 else:
                     config = create_generation_config(data, min_tokens=100, temperature=0.1, top_p=0.5)
         if "timeout" in error_str.lower():
             error_category = "TIMEOUT"
             # Enhanced timeout message with recommendations
+            user_message = f"""Summary generation timed out after {timeout_config.get('generation_timeout', 600)} seconds.
 This may be due to:
 - Large patient dataset requiring more processing time
                             ehr_url,
                             json={"patientid": patientid},
                             headers=headers,
+                            timeout=600
                         )
                         if response.status_code == 200:
                             sample_data = response.json()
                 border-radius: 20px;
                 padding: 40px;
                 box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
+                max-width: 600px;
                 width: 100%;
                 animation: fadeIn 0.5s ease-in;
             }
                 padding: 8px 16px;
                 border-radius: 20px;
                 font-size: 14px;
+                font-weight: 600;
                 margin-bottom: 20px;
             }
             .status-dot {
             }
             .info-title {
                 color: #374151;
+                font-weight: 600;
                 margin-bottom: 15px;
                 font-size: 18px;
             }
                 padding: 4px 8px;
                 border-radius: 4px;
                 font-size: 12px;
+                font-weight: 600;
                 margin-right: 10px;
                 min-width: 50px;
                 text-align: center;
             .link {
                 color: #667eea;
                 text-decoration: none;
+                font-weight: 600;
             }
             .link:hover {
                 text-decoration: underline;
             """Wait for slot and then process."""
             try:
                 # Wait for processing slot
+                if queue_manager.wait_for_slot(request_id, timeout=600):
                     # Update job status to show processing started
                     job_manager.update_job(job_id, JOB_STATUS["STARTED"], progress=5, data={'message': 'Processing slot acquired, starting generation...'})
                     # Start background task with optimized generation
                 'X-Content-Type-Options': 'nosniff',
                 'Access-Control-Allow-Origin': '*',
                 'Access-Control-Allow-Headers': 'Cache-Control, Connection',
+                'Keep-Alive': 'timeout=3600',
                 # Force HTTP/1.1 to avoid HTTP/2 protocol errors
                 'X-Protocol': 'HTTP/1.1'
             }
             """Wait for slot and then process."""
             try:
                 # Wait for processing slot
+                if queue_manager.wait_for_slot(request_id, timeout=600):
                     # Update job status to show processing started
                     job_manager.update_job(job_id, JOB_STATUS["STARTED"], progress=5, data={'message': 'Processing slot acquired, starting generation...'})
                     # Start background task with optimized generation
                 'X-Content-Type-Options': 'nosniff',
                 'Access-Control-Allow-Origin': '*',
                 'Access-Control-Allow-Headers': 'Cache-Control, Connection',
+                'Keep-Alive': 'timeout=3600',
                 # Force HTTP/1.1 to avoid HTTP/2 protocol errors
                 'X-Protocol': 'HTTP/1.1'
             }
                 """Wait for slot and then process."""
                 try:
                     # Wait for processing slot
+                    if queue_manager.wait_for_slot(request_id, timeout=600):
                         # Update job status to show processing started
                         job_manager.update_job(job_id, JOB_STATUS["STARTED"], progress=5, data={'message': 'Processing slot acquired, starting generation...'})
                         # Start background task directly (not in separate thread to avoid nesting)
                     'X-Content-Type-Options': 'nosniff',
                     'Access-Control-Allow-Origin': '*',
                     'Access-Control-Allow-Headers': 'Cache-Control, Connection',
+                    'Keep-Alive': 'timeout=3600',
                     # Force HTTP/1.1 to avoid HTTP/2 protocol errors
                     'X-Protocol': 'HTTP/1.1'
                 }
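
Reviewer note: several hunks above read `generation_timeout` twice, once for the `asyncio.wait_for` call and once for the error message. A minimal sketch of resolving the value a single time is shown below, so the reported figure always matches the enforced one. The `slow_generate` coroutine and the `generate_with_timeout` helper are hypothetical stand-ins, not functions from this repository.

```python
import asyncio


async def slow_generate(delay: float) -> str:
    """Hypothetical placeholder for a long-running generation call."""
    await asyncio.sleep(delay)
    return "summary"


async def generate_with_timeout(delay: float, config: dict, default: float = 600) -> str:
    # Look the timeout up once; reuse it for both enforcement and reporting.
    timeout = config.get("generation_timeout", default)
    try:
        return await asyncio.wait_for(slow_generate(delay), timeout=timeout)
    except asyncio.TimeoutError:
        raise RuntimeError(f"Text generation timed out after {timeout} seconds")
```

With this shape, changing the default in one place cannot leave a stale number in the error text.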

services/ai-service/src/ai_med_extract/app.py CHANGED

@@ -764,7 +764,7 @@ def run_dev(host: str = "0.0.0.0", port: int = 7860, debug: bool = False):
     # Initialize agents in dev run (preload small models)
     initialize_agents(app, preload_small_models=True)
     print("Agents initialized, starting uvicorn")
-    uvicorn.run(app, host=host, port=port, reload=debug, timeout_keep_alive=1200)
 if __name__ == "__main__":

     # Initialize agents in dev run (preload small models)
     initialize_agents(app, preload_small_models=True)
     print("Agents initialized, starting uvicorn")
+    uvicorn.run(app, host=host, port=port, reload=debug, timeout_keep_alive=600)
 if __name__ == "__main__":

services/ai-service/src/ai_med_extract/config/performance_config.py CHANGED

@@ -19,7 +19,7 @@ class PerformanceConfig:
     # Caching
     enable_caching: bool = True
-    cache_ttl_seconds: int = 31200
     max_cache_size: int = 1000
     enable_multi_level_cache: bool = True
@@ -65,7 +65,7 @@ class PerformanceConfig:
             # Caching
             enable_caching=os.environ.get('ENABLE_CACHING', 'true').lower() == 'true',
-            cache_ttl_seconds=int(os.environ.get('CACHE_TTL_SECONDS', '31200')),
             max_cache_size=int(os.environ.get('MAX_CACHE_SIZE', '1000')),
             enable_multi_level_cache=os.environ.get('ENABLE_MULTI_LEVEL_CACHE', 'true').lower() == 'true',

     # Caching
     enable_caching: bool = True
+    cache_ttl_seconds: int = 3600
     max_cache_size: int = 1000
     enable_multi_level_cache: bool = True
             # Caching
             enable_caching=os.environ.get('ENABLE_CACHING', 'true').lower() == 'true',
+            cache_ttl_seconds=int(os.environ.get('CACHE_TTL_SECONDS', '3600')),
             max_cache_size=int(os.environ.get('MAX_CACHE_SIZE', '1000')),
             enable_multi_level_cache=os.environ.get('ENABLE_MULTI_LEVEL_CACHE', 'true').lower() == 'true',

services/ai-service/src/ai_med_extract/enable_optimizations.py CHANGED

@@ -24,7 +24,7 @@ def enable_all_optimizations():
         # Caching
         'ENABLE_CACHING': 'true',
-        'CACHE_TTL_SECONDS': '31200',
         'MAX_CACHE_SIZE': '1000',
         'ENABLE_MULTI_LEVEL_CACHE': 'true',
@@ -85,7 +85,7 @@ def get_optimization_status() -> Dict[str, Any]:
         },
         "caching_optimizations": {
             "enabled": os.environ.get('ENABLE_CACHING', 'true'),
-            "ttl_seconds": os.environ.get('CACHE_TTL_SECONDS', '31200'),
             "max_size": os.environ.get('MAX_CACHE_SIZE', '1000'),
         },
         "async_optimizations": {

         # Caching
         'ENABLE_CACHING': 'true',
+        'CACHE_TTL_SECONDS': '3600',
         'MAX_CACHE_SIZE': '1000',
         'ENABLE_MULTI_LEVEL_CACHE': 'true',
         },
         "caching_optimizations": {
             "enabled": os.environ.get('ENABLE_CACHING', 'true'),
+            "ttl_seconds": os.environ.get('CACHE_TTL_SECONDS', '3600'),
             "max_size": os.environ.get('MAX_CACHE_SIZE', '1000'),
         },
         "async_optimizations": {

services/ai-service/src/ai_med_extract/inference_service.py CHANGED

@@ -140,7 +140,7 @@ class InferenceService:
         loop = asyncio.get_event_loop()
         # Optimize chunk size based on text length
-        chunk_size = 8000 if len(text) > 112000 else 12000
         if len(text) > chunk_size:
             chunks = self._split_chunks(text, chunk_size)

         loop = asyncio.get_event_loop()
         # Optimize chunk size based on text length
+        chunk_size = 8000 if len(text) > 16000 else 12000
         if len(text) > chunk_size:
             chunks = self._split_chunks(text, chunk_size)
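
Reviewer note: the restored threshold means any text longer than 16,000 characters is split into 8,000-character chunks, while shorter texts use a 12,000-character chunk size. The sketch below illustrates that rule. `split_chunks` is a hypothetical fixed-width splitter used only for illustration; the real `_split_chunks` is not shown in this diff and may split differently.

```python
def pick_chunk_size(text_length: int) -> int:
    # Same rule as the restored line: long inputs get smaller chunks.
    return 8000 if text_length > 16000 else 12000


def split_chunks(text: str, chunk_size: int) -> list[str]:
    # Hypothetical splitter: fixed-width slices, no sentence awareness.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def plan_chunks(text: str) -> list[str]:
    chunk_size = pick_chunk_size(len(text))
    if len(text) > chunk_size:
        return split_chunks(text, chunk_size)
    return [text]
```

Under the reverted threshold of 112,000, a 20,000-character text would have been cut at 12,000 characters instead of 8,000.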

services/ai-service/src/ai_med_extract/phi_scrubber_service.py CHANGED

@@ -60,7 +60,7 @@ class PHIScrubberService:
             r = redis.from_url(settings.REDIS_URL, decode_responses=True)
             await r.hincrby(key, "events", 1)
             await r.hincrby(key, "found", len(m))
-            await r.expire(key, 7*24*31200)
         except Exception:
             pass
         return {"original_length": len(text), "scrubbed_length": len(scrubbed), "total_phi_found": len(m), "phi_types": phi_types, "scrubbed_text": scrubbed}

             r = redis.from_url(settings.REDIS_URL, decode_responses=True)
             await r.hincrby(key, "events", 1)
             await r.hincrby(key, "found", len(m))
+            await r.expire(key, 7*24*3600)
         except Exception:
             pass
         return {"original_length": len(text), "scrubbed_length": len(scrubbed), "total_phi_found": len(m), "phi_types": phi_types, "scrubbed_text": scrubbed}
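
Reviewer note: the restored expression `7*24*3600` is seven days in seconds (604,800), whereas the reverted `7*24*31200` works out to roughly 60 days. One way to make such durations harder to mistype is to derive them from `datetime.timedelta`. The `ttl_seconds` helper below is a hypothetical illustration, not part of this service.

```python
from datetime import timedelta


def ttl_seconds(**duration) -> int:
    # Accepts the same keyword arguments as timedelta (days, hours, minutes, ...).
    return int(timedelta(**duration).total_seconds())


WEEK_TTL = ttl_seconds(days=7)   # intended Redis key expiry in the hunk above
HOUR_TTL = ttl_seconds(hours=1)  # matches the "1 hour" comments elsewhere in this diff
```

The same helper would also cover the 10-minute timeouts, for example `ttl_seconds(minutes=10)`.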

services/ai-service/src/ai_med_extract/services/job_manager.py CHANGED

@@ -29,7 +29,7 @@ class JobManager:
         """Initialize the job manager with in-memory storage."""
         self._jobs: Dict[str, Dict[str, Any]] = {}
         self._lock = threading.RLock()  # Reentrant lock for nested calls
-        self._cleanup_interval = 31200  # 1 hour
         self._max_job_age = 7200  # 2 hours
     def create_job(self, request_id: Optional[str] = None, initial_data: Optional[Dict] = None) -> str:

         """Initialize the job manager with in-memory storage."""
         self._jobs: Dict[str, Dict[str, Any]] = {}
         self._lock = threading.RLock()  # Reentrant lock for nested calls
+        self._cleanup_interval = 3600  # 1 hour
         self._max_job_age = 7200  # 2 hours
     def create_job(self, request_id: Optional[str] = None, initial_data: Optional[Dict] = None) -> str:
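
Reviewer note: the cleanup logic that consumes `_cleanup_interval` and `_max_job_age` is not part of this hunk. As a rough sketch of how an age-based sweep could use the 7,200-second limit, assuming each job record carries a `created_at` timestamp (an assumed field name):

```python
MAX_JOB_AGE = 7200  # 2 hours, the value shown in the hunk above


def expired_job_ids(jobs: dict, now: float, max_age: int = MAX_JOB_AGE) -> list:
    # 'created_at' is an assumed field name; the real job schema may differ.
    return [
        job_id
        for job_id, job in jobs.items()
        if now - job["created_at"] > max_age
    ]
```

A sweep running every 3,600 seconds would then remove a job at most one interval after it passes the age limit.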

services/ai-service/src/ai_med_extract/services/request_queue.py CHANGED

@@ -229,7 +229,7 @@ class RequestQueueManager:
                 ]
             }
-    def cleanup_old_requests(self, max_age: int = 31200) -> int:
         """
         Clean up old requests from tracking.
@@ -289,7 +289,7 @@ def get_queue_manager() -> RequestQueueManager:
             _queue_manager = RequestQueueManager(
                 max_concurrent=6,
                 max_queue_size=6,
-                queue_timeout=1200
             )
             logger.info("Initialized RequestQueueManager for Hugging Face Spaces (T4 medium)")
         else:
@@ -297,7 +297,7 @@ def get_queue_manager() -> RequestQueueManager:
             _queue_manager = RequestQueueManager(
                 max_concurrent=4,
                 max_queue_size=20,
-                queue_timeout=1200
             )
             logger.info("Initialized RequestQueueManager for local/development")

                 ]
             }
+    def cleanup_old_requests(self, max_age: int = 3600) -> int:
         """
         Clean up old requests from tracking.
             _queue_manager = RequestQueueManager(
                 max_concurrent=6,
                 max_queue_size=6,
+                queue_timeout=600
             )
             logger.info("Initialized RequestQueueManager for Hugging Face Spaces (T4 medium)")
         else:
             _queue_manager = RequestQueueManager(
                 max_concurrent=4,
                 max_queue_size=20,
+                queue_timeout=600
             )
             logger.info("Initialized RequestQueueManager for local/development")

services/ai-service/src/ai_med_extract/utils/__pycache__/model_config.cpython-311.pyc CHANGED

Binary files a/services/ai-service/src/ai_med_extract/utils/__pycache__/model_config.cpython-311.pyc and b/services/ai-service/src/ai_med_extract/utils/__pycache__/model_config.cpython-311.pyc differ

services/ai-service/src/ai_med_extract/utils/__pycache__/openvino_summarizer_utils.cpython-311.pyc CHANGED

Binary files a/services/ai-service/src/ai_med_extract/utils/__pycache__/openvino_summarizer_utils.cpython-311.pyc and b/services/ai-service/src/ai_med_extract/utils/__pycache__/openvino_summarizer_utils.cpython-311.pyc differ

services/ai-service/src/ai_med_extract/utils/__pycache__/performance_monitor.cpython-311.pyc CHANGED

Binary files a/services/ai-service/src/ai_med_extract/utils/__pycache__/performance_monitor.cpython-311.pyc and b/services/ai-service/src/ai_med_extract/utils/__pycache__/performance_monitor.cpython-311.pyc differ

services/ai-service/src/ai_med_extract/utils/constants.py CHANGED

@@ -24,39 +24,39 @@ CHUNK_SIZE_DAYS = 90                # Days per chunk for date-based chunking
 # ========== TIMEOUT CONFIGURATION ==========
 TIMEOUT_CONFIG = {
     "fast": {
-        "ehr_timeout": 1200,
-        "generation_timeout": 1200,
-        "gguf_timeout": 1200,
-        "gguf_extended_timeout": 1200,
         "retry_attempts": 2
     },
     "normal": {
-        "ehr_timeout": 1200,
-        "generation_timeout": 1200,
-        "gguf_timeout": 1200,
-        "gguf_extended_timeout": 1200,
         "retry_attempts": 3
     },
     "extended": {
-        "ehr_timeout": 1200,
-        "generation_timeout": 1200,
-        "gguf_timeout": 1200,
-        "gguf_extended_timeout": 1200,
         "retry_attempts": 3
     },
     "large_data": {
-        "ehr_timeout": 1200,
-        "generation_timeout": 1200,
-        "gguf_timeout": 1200,
-        "gguf_extended_timeout": 1200,
         "retry_attempts": 2
     }
 }
 # ========== SSE STREAMING CONFIGURATION ==========
 SSE_CONFIG = {
-    "max_wait_time": 31200,              # 60 minutes max wait time for normal operations
-    "extended_max_wait_time": 31200,     # 60 minutes extended wait for GGUF/long operations
     "heartbeat_interval": 5,            # Send heartbeat every 5 seconds
     "normal_heartbeat_interval": 10,    # Normal heartbeat interval
     "poll_interval": 1,                 # Check job status every second
@@ -65,7 +65,7 @@ SSE_CONFIG = {
 # ========== CACHE CONFIGURATION ==========
 CACHE_CONFIG = {
-    "ttl_seconds": 31200,  # 1 hour
     "cache_dir": "/tmp/summary_cache",
     "max_cache_size": 100
 }
@@ -89,7 +89,7 @@ MEMORY_CONFIG = {
     "enable_quantization": True,
     "cache_models": True,
     "cleanup_interval": 300,  # 5 minutes
-    "max_memory_mb": 12000,
     "memory_pressure_threshold": 0.8,
     "aggressive_cleanup_threshold": 0.9
 }

 # ========== TIMEOUT CONFIGURATION ==========
 TIMEOUT_CONFIG = {
     "fast": {
+        "ehr_timeout": 600,
+        "generation_timeout": 600,
+        "gguf_timeout": 600,
+        "gguf_extended_timeout": 600,
         "retry_attempts": 2
     },
     "normal": {
+        "ehr_timeout": 600,
+        "generation_timeout": 600,
+        "gguf_timeout": 600,
+        "gguf_extended_timeout": 600,
         "retry_attempts": 3
     },
     "extended": {
+        "ehr_timeout": 600,
+        "generation_timeout": 600,
+        "gguf_timeout": 600,
+        "gguf_extended_timeout": 600,
         "retry_attempts": 3
     },
     "large_data": {
+        "ehr_timeout": 600,
+        "generation_timeout": 600,
+        "gguf_timeout": 600,
+        "gguf_extended_timeout": 600,
         "retry_attempts": 2
     }
 }
 # ========== SSE STREAMING CONFIGURATION ==========
 SSE_CONFIG = {
+    "max_wait_time": 3600,              # 60 minutes max wait time for normal operations
+    "extended_max_wait_time": 3600,     # 60 minutes extended wait for GGUF/long operations
     "heartbeat_interval": 5,            # Send heartbeat every 5 seconds
     "normal_heartbeat_interval": 10,    # Normal heartbeat interval
     "poll_interval": 1,                 # Check job status every second
 # ========== CACHE CONFIGURATION ==========
 CACHE_CONFIG = {
+    "ttl_seconds": 3600,  # 1 hour
     "cache_dir": "/tmp/summary_cache",
     "max_cache_size": 100
 }
     "enable_quantization": True,
     "cache_models": True,
     "cleanup_interval": 300,  # 5 minutes
+    "max_memory_mb": 6000,
     "memory_pressure_threshold": 0.8,
     "aggressive_cleanup_threshold": 0.9
 }
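
Reviewer note: `TIMEOUT_CONFIG` is keyed by profile name, and call sites elsewhere in this diff use `timeout_config.get(key, 600)`. The sketch below shows one way a profile lookup with a safe default could work. The `get_timeout` helper and the choice of `"normal"` as the fallback profile are assumptions for illustration, and the dictionary is abbreviated.

```python
# Abbreviated copy of the restored values; only two profiles and two keys shown.
TIMEOUT_CONFIG = {
    "fast": {"generation_timeout": 600, "retry_attempts": 2},
    "normal": {"generation_timeout": 600, "retry_attempts": 3},
}


def get_timeout(profile: str, key: str, default: int = 600) -> int:
    # Unknown profiles fall back to "normal" (an assumption, not confirmed by the diff).
    profile_cfg = TIMEOUT_CONFIG.get(profile, TIMEOUT_CONFIG["normal"])
    return profile_cfg.get(key, default)
```

Since every profile currently carries the same 600-second values, only `retry_attempts` actually differs between them.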

services/ai-service/src/ai_med_extract/utils/hf_spaces_config.py CHANGED

@@ -65,7 +65,7 @@ TIMEOUT_SETTINGS = {
     "model_loading_timeout": 300,  # 5 minutes for model loading
     "inference_timeout": 120,  # 2 minutes for inference
     "ehr_fetch_timeout": 30,  # 30 seconds for EHR fetch
-    "streaming_timeout": 1200  # 10 minutes for streaming responses
 }
 def get_optimized_model(model_type: str) -> str:

     "model_loading_timeout": 300,  # 5 minutes for model loading
     "inference_timeout": 120,  # 2 minutes for inference
     "ehr_fetch_timeout": 30,  # 30 seconds for EHR fetch
+    "streaming_timeout": 600  # 10 minutes for streaming responses
 }
 def get_optimized_model(model_type: str) -> str:

services/ai-service/src/ai_med_extract/utils/model_config.py CHANGED

@@ -16,14 +16,10 @@ T4_OPTIMIZATIONS = {
     "torch_dtype": "float16",
     "device_map": "auto",
     "trust_remote_code": True,
-    # Note: cache_dir removed from here - it should be passed to pipeline() directly,
-    # not in model_kwargs, to avoid "not used by the model" errors during generation
     "local_files_only": False
 }
-# T4 cache directory (separate from model_kwargs to avoid generation errors)
-T4_CACHE_DIR = "/tmp/hf_cache"
 # Model generation settings optimized for T4
 GENERATION_CONFIG = {
     "use_cache": True,
@@ -43,18 +39,18 @@ GENERATION_CONFIG = {
 # T4-optimized default models (smaller, efficient models)
 DEFAULT_MODELS = {
     "text-generation": {
-        "primary": "microsoft/Phi-3-mini-4k-instruct",  # Robust 4k context model
-        "fallback": "microsoft/Phi-3-mini-4k-instruct",
         "description": "Text generation models for QA and medical data extraction"
     },
     "summarization": {
-        "primary": "microsoft/Phi-3-mini-4k-instruct",  # Use Phi-3 for summarization too (better context)
-        "fallback": "facebook/bart-large-cnn",
         "description": "Text summarization models for medical reports"
     },
     "seq2seq": {
-        "primary": "facebook/bart-large-cnn",  # Better seq2seq default
-        "fallback": "google/flan-t5-base",
         "description": "Seq2Seq models for summarization tasks"
     },
     "ner": {
@@ -260,7 +256,6 @@ def is_model_supported_on_t4(model_name: str, model_type: str) -> bool:
         "patrickvonplaten/longformer2roberta-cnn_dailymail-fp16",
         # Phi-3 models
         "microsoft/Phi-3-mini-4k-instruct",
-        "microsoft/Phi-3-mini-128k-instruct",
         "microsoft/Phi-3-mini-4k-instruct-GGUF",
         "microsoft/Phi-3-mini-4k-instruct-gguf",
         "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov",

     "torch_dtype": "float16",
     "device_map": "auto",
     "trust_remote_code": True,
+    "cache_dir": "/tmp/hf_cache",
     "local_files_only": False
 }
 # Model generation settings optimized for T4
 GENERATION_CONFIG = {
     "use_cache": True,
 # T4-optimized default models (smaller, efficient models)
 DEFAULT_MODELS = {
     "text-generation": {
+        "primary": "microsoft/DialoGPT-small",  # Lightweight conversational model
+        "fallback": "facebook/bart-base",
         "description": "Text generation models for QA and medical data extraction"
     },
     "summarization": {
+        "primary": "sshleifer/distilbart-cnn-6-6",  # Smaller BART variant
+        "fallback": "facebook/bart-base",
         "description": "Text summarization models for medical reports"
     },
     "seq2seq": {
+        "primary": "sshleifer/distilbart-cnn-6-6",  # Same as summarization for consistency
+        "fallback": "facebook/bart-base",
         "description": "Seq2Seq models for summarization tasks"
     },
     "ner": {
         "patrickvonplaten/longformer2roberta-cnn_dailymail-fp16",
         # Phi-3 models
         "microsoft/Phi-3-mini-4k-instruct",
         "microsoft/Phi-3-mini-4k-instruct-GGUF",
         "microsoft/Phi-3-mini-4k-instruct-gguf",
         "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov",

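Review note: the `DEFAULT_MODELS` table above pairs each task with a `primary` and a `fallback` model. The sketch below shows one way such a table might be consulted. `pick_model` and its `is_usable` predicate are hypothetical names introduced only for illustration; the diff does not show how `get_optimized_model` actually resolves a model, so this is an assumption, not the project's implementation.

```python
from typing import Callable, Dict

# Trimmed copy of the table on the new side of the diff (two tasks only).
DEFAULT_MODELS: Dict[str, Dict[str, str]] = {
    "text-generation": {
        "primary": "microsoft/DialoGPT-small",
        "fallback": "facebook/bart-base",
    },
    "summarization": {
        "primary": "sshleifer/distilbart-cnn-6-6",
        "fallback": "facebook/bart-base",
    },
}


def pick_model(model_type: str,
               is_usable: Callable[[str], bool] = lambda name: True) -> str:
    """Hypothetical helper: return the primary model if usable, else the fallback."""
    entry = DEFAULT_MODELS.get(model_type)
    if entry is None:
        raise KeyError(f"no default models configured for {model_type!r}")
    if is_usable(entry["primary"]):
        return entry["primary"]
    return entry["fallback"]
```

Passing the availability check in as a predicate keeps the lookup independent of any hardware or allow-list logic, which could then be swapped without touching the table.
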
services/ai-service/src/ai_med_extract/utils/openvino_summarizer_utils.py CHANGED

@@ -238,7 +238,7 @@ def delta_to_text(delta):
 from concurrent.futures import ThreadPoolExecutor, as_completed
 import threading
-def generate_section(pipeline, prompt, section_name, timeout=1200):
     """Generate one section with timeout protection."""
     try:
         # If your pipeline supports timeout, pass it. Otherwise, wrap in future.

 from concurrent.futures import ThreadPoolExecutor, as_completed
 import threading
+def generate_section(pipeline, prompt, section_name, timeout=600):
     """Generate one section with timeout protection."""
     try:
         # If your pipeline supports timeout, pass it. Otherwise, wrap in future.
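
Review note: the comment in `generate_section` says to wrap the call in a future when the pipeline has no timeout of its own. The body of that function is not shown in this hunk, so the following is only a minimal sketch of that pattern using `concurrent.futures`. `run_with_timeout` and its `default` parameter are hypothetical names, not taken from the repository.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(fn, *args, timeout=600, default=None):
    """Run fn(*args) in a worker thread; return `default` if it exceeds `timeout` seconds."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        # The worker thread is NOT killed; it keeps running in the background.
        return default
    finally:
        # wait=False so a timed-out call does not block the caller here.
        executor.shutdown(wait=False)
```

A thread-based timeout only stops the caller from waiting. The underlying generation continues until it finishes, so a shorter timeout (600 instead of 1200 seconds) frees the request sooner but does not by itself free the model.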

services/ai-service/src/ai_med_extract/utils/performance_monitor.py CHANGED

@@ -76,7 +76,7 @@ class PerformanceMonitor:
 class RobustParsingCache:
     """Intelligent caching system for robust JSON parsing operations."""
-    def __init__(self, cache_dir: str = "/tmp/medical_ai_cache", ttl: int = 31200):
         self.cache_dir = cache_dir
         self.ttl = ttl  # Time to live in seconds
         os.makedirs(cache_dir, exist_ok=True)

 class RobustParsingCache:
     """Intelligent caching system for robust JSON parsing operations."""
+    def __init__(self, cache_dir: str = "/tmp/medical_ai_cache", ttl: int = 3600):
         self.cache_dir = cache_dir
         self.ttl = ttl  # Time to live in seconds
         os.makedirs(cache_dir, exist_ok=True)
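
Review note: this hunk changes the default `ttl` from 31200 seconds (about 8.7 hours) back to 3600 seconds (1 hour). The expiry check itself is not part of the diff, so the sketch below only illustrates how a time-to-live of this kind is typically evaluated. `is_expired` is a hypothetical helper, not a method of `RobustParsingCache`.

```python
import time
from typing import Optional


def is_expired(created_at: float, ttl: int = 3600,
               now: Optional[float] = None) -> bool:
    """Return True when an entry created at `created_at` is older than `ttl` seconds."""
    current = time.time() if now is None else now
    return (current - created_at) > ttl
```

With this check, an entry that is 3601 seconds old expires under the restored default but would still have been served under the reverted value of 31200.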

services/ai-service/src/ai_med_extract/utils/unified_model_manager.py CHANGED

@@ -55,7 +55,6 @@ class ModelInfo:
     load_time: float
     last_used: float
     error_message: Optional[str] = None
-    fallback_reason: Optional[str] = None
 @dataclass
 class GenerationConfig:
@@ -91,22 +90,12 @@ class BaseModel(ABC):
         self._load_time = 0.0
         self._last_used = time.time()
         self._error_message = None
-        self._fallback_reason = None
         self._memory_usage = 0.0
         self._kwargs = kwargs
     @property
     def status(self) -> ModelStatus:
         return self._status
-    @property
-    def fallback_reason(self) -> Optional[str]:
-        """Get the reason why this model is a fallback, if applicable"""
-        return self._fallback_reason
-    def set_fallback_reason(self, reason: str):
-        """Set the fallback reason for this model"""
-        self._fallback_reason = reason
     @abstractmethod
     def _load_implementation(self) -> bool:
@@ -143,11 +132,7 @@ class BaseModel(ABC):
         except Exception as e:
             self._status = ModelStatus.ERROR
             self._error_message = str(e)
-            error_details = f"Load failed: {type(e).__name__}: {str(e)}"
             logger.error(f"Failed to load model {self.name}: {e}")
-            # Store detailed error for fallback tracking
-            if self.model_type == "fallback":
-                self._fallback_reason = error_details
             return None
     def _update_memory_usage(self):
@@ -184,47 +169,9 @@ class TransformersModel(BaseModel):
     def _load_implementation(self) -> bool:
         try:
             from transformers import pipeline
-            import os
             # Get T4-optimized kwargs
             model_kwargs = get_t4_model_kwargs(self.model_type)
-            # Prepare pipeline kwargs to avoid duplicate arguments
-            pipeline_kwargs = self._kwargs.copy()
-            # Move trust_remote_code from model_kwargs to pipeline_kwargs if present
-            # This prevents "multiple values for keyword argument" error
-            if "trust_remote_code" in model_kwargs:
-                pipeline_kwargs["trust_remote_code"] = model_kwargs.pop("trust_remote_code")
-            # Set cache directory via environment variable (safest approach)
-            # This ensures it's only used during from_pretrained(), not passed to generate()
-            if not IS_T4_MEDIUM:
-                # Local environment
-                cache_dir = os.environ.get('HF_HOME', os.path.join(os.path.expanduser('~'), '.cache', 'huggingface'))
-                os.environ['HF_HOME'] = cache_dir
-            else:
-                # T4 environment
-                from .model_config import T4_CACHE_DIR
-                os.environ['HF_HOME'] = T4_CACHE_DIR
-            # Ensure trust_remote_code is True for local runs (required for Phi-3 etc)
-            if not IS_T4_MEDIUM:
-                pipeline_kwargs["trust_remote_code"] = True
-                # Force eager attention implementation to avoid Triton dependency on Windows
-                # This helps with "No module named 'triton'" errors for some models
-                # Add to model_kwargs instead of pipeline_kwargs to prevent it from being passed to generate()
-                model_kwargs["attn_implementation"] = "eager"
-                # Force using latest model revision to avoid cache compatibility issues
-                # This prevents "DynamicCache has no attribute get_max_length" errors
-                pipeline_kwargs["revision"] = "main"
-                # CRITICAL FIX: Disable use_cache for Phi-3 models to avoid DynamicCache compatibility issues
-                # The cached Phi-3 model code may use get_max_length() which doesn't exist in newer DynamicCache
-                # We disable cache during loading to force fresh generation without cache issues
-                if "phi-3" in self.name.lower() or "phi3" in self.name.lower():
-                    model_kwargs["use_cache"] = False
             # Handle different model types for summarization
             if self.model_type.lower() in ["summarization", "seq2seq"]:
@@ -234,7 +181,7 @@ class TransformersModel(BaseModel):
                     model=self.name,
                     device_map="auto" if torch.cuda.is_available() else None,
                     model_kwargs=model_kwargs,
-                    **pipeline_kwargs
                 )
             elif self.model_type.lower() in ["text-generation", "causal-lm"]:
                 # Text generation models
@@ -243,7 +190,7 @@ class TransformersModel(BaseModel):
                     model=self.name,
                     device_map="auto" if torch.cuda.is_available() else None,
                     model_kwargs=model_kwargs,
-                    **pipeline_kwargs
                 )
             elif "bart" in self.name.lower() or "t5" in self.name.lower():
                 # BART and T5 models default to summarization
@@ -252,7 +199,7 @@ class TransformersModel(BaseModel):
                     model=self.name,
                     device_map="auto" if torch.cuda.is_available() else None,
                     model_kwargs=model_kwargs,
-                    **pipeline_kwargs
                 )
             elif "longformer" in self.name.lower():
                 # Longformer models work with summarization pipeline
@@ -261,7 +208,7 @@ class TransformersModel(BaseModel):
                     model=self.name,
                     device_map="auto" if torch.cuda.is_available() else None,
                     model_kwargs=model_kwargs,
-                    **pipeline_kwargs
                 )
             else:
                 # Default to text-generation for unknown types
@@ -270,7 +217,7 @@ class TransformersModel(BaseModel):
                     model=self.name,
                     device_map="auto" if torch.cuda.is_available() else None,
                     model_kwargs=model_kwargs,
-                    **pipeline_kwargs
                 )
             return True
@@ -297,15 +244,6 @@ class TransformersModel(BaseModel):
                 "num_return_sequences": 1
             }
-            # Prepare generation kwargs
-            gen_kwargs = {}
-            # CRITICAL FIX: Disable cache for Phi-3 models to avoid DynamicCache compatibility issues
-            # The cached Phi-3 model code may use get_max_length() which doesn't exist in newer DynamicCache
-            if "phi-3" in self.name.lower() or "phi3" in self.name.lower():
-                gen_kwargs["use_cache"] = False
-                logger.info(f"Disabled cache for Phi-3 model {self.name} to avoid compatibility issues")
             # Handle different pipeline types
             if hasattr(self._model, 'task') and self._model.task == "summarization":
                 # Summarization pipeline
@@ -316,8 +254,7 @@ class TransformersModel(BaseModel):
                     temperature=config.temperature,
                     do_sample=config.temperature > 0.1,
                     num_beams=4,  # Better quality for summarization
-                    early_stopping=True,
-                    **gen_kwargs
                 )
                 return result[0]['summary_text'] if result else ""
             else:
@@ -329,8 +266,7 @@ class TransformersModel(BaseModel):
                     top_p=config.top_p,
                     do_sample=config.temperature > 0.1,
                     pad_token_id=0,
-                    num_return_sequences=1,
-                    **gen_kwargs
                 )
                 generated_text = result[0]['generated_text']
                 # Remove the prompt from the generated text
@@ -362,120 +298,32 @@ class GGUFModel(BaseModel):
     def _load_implementation(self) -> bool:
         try:
             from llama_cpp import Llama
-            import os
-            from pathlib import Path
             # Get T4-optimized kwargs
             model_kwargs = get_t4_model_kwargs("gguf")
             # Set up model path - handle different GGUF formats
             model_path = self.name
-            # If model name doesn't end with .gguf, we need to append the filename
             if not model_path.endswith('.gguf'):
-                # If filename is provided separately, combine repo path with filename
-                if self.filename:
-                    model_path = f"{model_path}/{self.filename}"
                 else:
-                    # Fallback: try to construct path (shouldn't happen if filename extraction worked)
-                    logger.warning(f"GGUF model {self.name} has no filename specified, using name as-is")
-            # Check if model_path is a local file path
-            # If it doesn't exist and looks like a Hugging Face repo path (contains / but not a file path), download it
-            is_local_file = os.path.exists(model_path) or (os.path.isabs(model_path) and os.path.sep in model_path)
-            if not is_local_file:
-                # Not a local file - need to download from Hugging Face
-                try:
-                    from huggingface_hub import hf_hub_download
-                    logger.info(f"Downloading GGUF model from Hugging Face: {self.name}/{self.filename}")
-                    # Extract repo_id and filename
-                    if '/' in model_path and model_path.endswith('.gguf'):
-                        # Path like "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"
-                        parts = model_path.split('/')
-                        repo_id = '/'.join(parts[:-1])
-                        filename = parts[-1]
-                    elif '/' in model_path:
-                        # Path like "microsoft/Phi-3-mini-4k-instruct-gguf" with separate filename
-                        repo_id = model_path
-                        filename = self.filename or self._extract_filename(self.name)
-                    else:
-                        repo_id = self.name
-                        filename = self.filename or self._extract_filename(self.name)
-                    # Download from Hugging Face
-                    logger.info(f"Attempting to download: repo_id={repo_id}, filename={filename}")
-                    model_path = hf_hub_download(
-                        repo_id=repo_id,
-                        filename=filename,
-                        cache_dir=os.environ.get('HF_HOME', os.path.join(os.path.expanduser('~'), '.cache', 'huggingface'))
-                    )
-                    logger.info(f"Downloaded GGUF model to: {model_path}")
-                except Exception as download_error:
-                    import traceback
-                    error_details = f"Download failed: {type(download_error).__name__}: {str(download_error)}"
-                    logger.error(f"Failed to download GGUF model from Hugging Face: {error_details}")
-                    logger.debug(f"Download error traceback:\n{traceback.format_exc()}")
-                    logger.error(f"  Repo ID: {repo_id}, Filename: {filename}")
-                    self._error_message = error_details
-                    return False
-            # Verify the file exists
-            if not os.path.exists(model_path):
-                error_msg = f"GGUF model file does not exist: {model_path}"
-                logger.error(error_msg)
-                self._error_message = error_msg
-                return False
-            # Check file size for Q8_0 models (they're larger and might not fit in T4 memory)
-            try:
-                file_size_mb = os.path.getsize(model_path) / (1024 * 1024)
-                logger.info(f"GGUF model file size: {file_size_mb:.2f} MB")
-                # Q8_0 models are typically 2x larger than Q4, warn if very large
-                if "Q8_0" in self.name or "q8_0" in self.name.lower():
-                    if file_size_mb > 8000:  # > 8GB might be too large for T4
-                        logger.warning(f"Q8_0 model is {file_size_mb:.2f} MB - may be too large for T4 (16GB total)")
-            except Exception as size_error:
-                logger.warning(f"Could not check file size: {size_error}")
-            # Adjust context window for 128k models (but limit to available memory)
-            n_ctx = 8192  # Default T4 context window
-            if "128k" in self.name.lower():
-                # 128k models support larger context, but we'll use a reasonable limit for T4
-                n_ctx = 16384  # Use 16k instead of full 128k to save memory
-                logger.info(f"Detected 128k model, using context window: {n_ctx}")
-            # Adjust GPU layers based on model size (Q8_0 models need fewer GPU layers due to memory)
-            n_gpu_layers = 35 if torch.cuda.is_available() else 0
-            if "Q8_0" in self.name or "q8_0" in self.name.lower():
-                # Reduce GPU layers for larger Q8_0 models to avoid OOM
-                n_gpu_layers = min(20, n_gpu_layers) if torch.cuda.is_available() else 0
-                logger.info(f"Q8_0 model detected, using {n_gpu_layers} GPU layers")
-            logger.info(f"Loading GGUF model: {model_path} with n_ctx={n_ctx}, n_gpu_layers={n_gpu_layers}")
             self._model = Llama(
                 model_path=model_path,
-                n_ctx=n_ctx,
                 n_threads=4,  # CPU threads
-                n_gpu_layers=n_gpu_layers,
                 verbose=False,
                 **model_kwargs
             )
-            logger.info(f"Successfully loaded GGUF model: {self.name}")
             return True
         except Exception as e:
-            import traceback
-            error_details = f"{type(e).__name__}: {str(e)}"
-            error_traceback = traceback.format_exc()
-            logger.error(f"Failed to load GGUF model {self.name}: {error_details}")
-            logger.debug(f"Full traceback:\n{error_traceback}")
-            self._error_message = error_details
-            # Store detailed error for fallback tracking
-            if self.model_type == "fallback":
-                self._fallback_reason = f"GGUF load failed: {error_details}"
             return False
     def generate(self, prompt: str, config: GenerationConfig) -> str:
@@ -509,7 +357,6 @@ class OpenVINOModel(BaseModel):
     def _load_implementation(self) -> bool:
         try:
-            import warnings
             from optimum.intel import OVModelForCausalLM
             from transformers import AutoTokenizer
@@ -526,38 +373,17 @@ class OpenVINOModel(BaseModel):
                 # e.g., "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov" -> "microsoft/Phi-3-mini-4k-instruct"
                 if "Phi-3-mini-4k-instruct" in self.name:
                     tokenizer_path = "microsoft/Phi-3-mini-4k-instruct"
-                elif "Phi-3-mini-128k-instruct" in self.name:
-                    tokenizer_path = "microsoft/Phi-3-mini-128k-instruct"
-            # For causal-openvino type with standard model names, use the model name directly for tokenizer
-            elif self.model_type == "causal-openvino":
-                # For models like "microsoft/Phi-3-mini-128k-instruct", use the same name for tokenizer
-                tokenizer_path = self.name
-            # Suppress TracerWarnings during OpenVINO export (these are harmless but noisy)
-            # The warnings occur when OpenVINO traces the PyTorch model for conversion
-            with warnings.catch_warnings():
-                warnings.filterwarnings("ignore", category=UserWarning, module="torch.jit")
-                warnings.filterwarnings("ignore", message=".*TracerWarning.*")
-                warnings.filterwarnings("ignore", message=".*Converting a tensor to a Python boolean.*")
-                warnings.filterwarnings("ignore", message=".*torch.tensor results are registered as constants.*")
-            # Load the OpenVINO model with trust_remote_code=True
             self._model = OVModelForCausalLM.from_pretrained(
                 model_path,
                 device="GPU" if torch.cuda.is_available() else "CPU",
-                trust_remote_code=True,
-                **model_kwargs,
             )
-            # Load the tokenizer (also may need trust_remote_code)
-            self._tokenizer = AutoTokenizer.from_pretrained(
-                tokenizer_path,
-                trust_remote_code=True,
-            )
             return True
         except Exception as e:
             logger.error(f"Failed to load OpenVINO model {self.name}: {e}")
-            import traceback
-            logger.debug(f"OpenVINO load error traceback:\n{traceback.format_exc()}")
             return False
     def generate(self, prompt: str, config: GenerationConfig) -> str:
@@ -565,47 +391,7 @@ class OpenVINOModel(BaseModel):
             raise ModelError(self.name, "not_loaded", "Model not loaded")
         try:
-            # Detect 128k models and set appropriate context window
-            is_128k_model = "128k" in self.name.lower()
-            # Get tokenizer's model_max_length (defaults to 128k for Phi-3-128k models)
-            tokenizer_max_length = getattr(self._tokenizer, 'model_max_length', None)
-            # For 128k models, use full context window (131072 tokens = 128k)
-            # For other models, use tokenizer's default or a safe limit
-            if is_128k_model:
-                max_context_length = 131072  # Full 128k context window
-                logger.info(f"128k model detected: Using context window of {max_context_length} tokens")
-            elif tokenizer_max_length:
-                max_context_length = tokenizer_max_length
-            else:
-                max_context_length = 4096  # Safe default for 4k models
-            # Tokenize with proper context window handling
-            # For 128k models, explicitly set max_length to allow full context without truncation
-            tokenizer_kwargs = {"return_tensors": "pt"}
-            if is_128k_model:
-                # For 128k models, set max_length to full context window and disable truncation
-                tokenizer_kwargs["max_length"] = max_context_length
-                tokenizer_kwargs["truncation"] = False  # Don't truncate - allow full 128k context
-            else:
-                # For other models, use tokenizer's default max_length with truncation enabled
-                # This prevents errors if prompt exceeds context window
-                if tokenizer_max_length:
-                    tokenizer_kwargs["max_length"] = tokenizer_max_length
-                    tokenizer_kwargs["truncation"] = True
-                # If no max_length set, let tokenizer use its default
-            inputs = self._tokenizer(prompt, **tokenizer_kwargs)
-            # Log token count for debugging
-            input_ids = inputs.get('input_ids', None)
-            if input_ids is not None:
-                prompt_tokens = input_ids.shape[1] if len(input_ids.shape) > 1 else len(input_ids)
-                logger.debug(f"Prompt token count: {prompt_tokens} / {max_context_length}")
-                if prompt_tokens > max_context_length * 0.9:
-                    logger.warning(f"Prompt is using {prompt_tokens}/{max_context_length} tokens ({prompt_tokens/max_context_length*100:.1f}%) - approaching context limit")
             if torch.cuda.is_available():
                 inputs = {k: v.to("cuda") for k, v in inputs.items()}
@@ -641,7 +427,6 @@ class FallbackModel(BaseModel):
     def generate(self, prompt: str, config: GenerationConfig) -> str:
         # Simple rule-based fallback
-        # Accept config parameter for compatibility with other models
         return "Patient summary generation completed. Please review patient data manually for comprehensive assessment."
 class UnifiedModelManager:
@@ -662,10 +447,8 @@ class UnifiedModelManager:
             model_type = detect_model_type(name)
         # Check if model is supported on T4
-        fallback_reason = None
         if not is_model_supported_on_t4(name, model_type):
-            fallback_reason = f"Model {name} ({model_type}) is not supported/optimal for T4 Medium"
-            logger.warning(f"Model {name} may not be optimal for T4. Using fallback. Reason: {fallback_reason}")
             model_type = "fallback"
         cache_key = f"{name}:{model_type}"
@@ -681,29 +464,12 @@ class UnifiedModelManager:
             model_kwargs = get_t4_model_kwargs(model_type)
             model_kwargs.update(kwargs)
-            # Special handling for Phi-3-small - it has hard dependency on Triton
-            # which is not available on Windows. Switch to Phi-3-mini-128k-instruct instead.
-            if "Phi-3-small" in name:
-                if model_type == "openvino" or model_type == "causal-openvino":
-                    # OpenVINO mode - not supported for auto-export
-                    logger.warning(f"Phi-3-small is not currently supported in OpenVINO mode (architecture not supported for export). Switching to 'microsoft/Phi-3-mini-128k-instruct'.")
-                    name = "microsoft/Phi-3-mini-128k-instruct"
-                elif not IS_T4_MEDIUM and (model_type == "text-generation" or model_type == "causal-lm" or model_type == "transformers"):
-                    # Transformers mode on Windows - Triton not available
-                    logger.warning(f"Phi-3-small requires Triton which is not available on Windows. Switching to 'microsoft/Phi-3-mini-128k-instruct'.")
-                    name = "microsoft/Phi-3-mini-128k-instruct"
-                    # Update cache key to reflect the actual model being loaded
-                    cache_key = f"{name}:{model_type}"
             if model_type == "gguf" or filename or name.endswith('.gguf'):
                 model = GGUFModel(name, model_type, filename, **model_kwargs)
-            elif model_type == "openvino" or model_type == "causal-openvino" or "openvino" in name.lower():
                 model = OpenVINOModel(name, model_type, **model_kwargs)
             elif model_type == "fallback":
                 model = FallbackModel(name, model_type, **model_kwargs)
-                # Store fallback reason if we switched to fallback
-                if fallback_reason:
-                    model._fallback_reason = fallback_reason
             else:
                 model = TransformersModel(name, model_type, **model_kwargs)
@@ -711,88 +477,9 @@ class UnifiedModelManager:
         # Load if not lazy
         if not lazy and model.status != ModelStatus.LOADED:
-            load_result = model.load()
-            # If load failed and we're using fallback, capture the reason
-            if load_result is None and model.model_type == "fallback" and not model._fallback_reason:
-                model._fallback_reason = f"Model {name} failed to load"
         return model
-    def get_fallback_reason(self, name: str, model_type: str = None) -> Optional[str]:
-        """Get the fallback reason for a specific model if it's using fallback"""
-        if model_type is None:
-            model_type = detect_model_type(name)
-        cache_key = f"{name}:{model_type}"
-        if cache_key in self._models:
-            model = self._models[cache_key]
-            return model.fallback_reason
-        return None
-    def diagnose_model_loading(self, name: str, model_type: str = None) -> Dict[str, Any]:
-        """Diagnose why a model might not be loading - returns detailed information"""
-        if model_type is None:
-            model_type = detect_model_type(name)
-        diagnosis = {
-            "model_name": name,
-            "model_type": model_type,
-            "is_supported_on_t4": is_model_supported_on_t4(name, model_type),
-            "cache_key": f"{name}:{model_type}",
-            "in_cache": False,
-            "status": None,
-            "error_message": None,
-            "fallback_reason": None,
-            "file_exists": False,
-            "file_path": None,
-            "file_size_mb": None
-        }
-        # Check cache
-        if diagnosis["cache_key"] in self._models:
-            model = self._models[diagnosis["cache_key"]]
-            diagnosis["in_cache"] = True
-            diagnosis["status"] = model.status.value if model.status else None
-            diagnosis["error_message"] = model._error_message
-            diagnosis["fallback_reason"] = model._fallback_reason
-        # Check if it's a GGUF model and verify file
-        if model_type == "gguf" or name.endswith('.gguf'):
-            import os
-            # Try to determine the file path
-            if '/' in name and name.endswith('.gguf'):
-                parts = name.split('/')
-                repo_id = '/'.join(parts[:-1])
-                filename = parts[-1]
-                # Check Hugging Face cache
-                cache_dir = os.environ.get('HF_HOME', os.path.join(os.path.expanduser('~'), '.cache', 'huggingface'))
-                # Try to find the file in cache
-                potential_paths = [
-                    os.path.join(cache_dir, 'hub', f'models--{repo_id.replace("/", "--")}', 'snapshots', '*', filename),
-                    os.path.join(cache_dir, 'hub', repo_id.replace('/', '--'), filename),
-                ]
-                # Check if file exists locally first
-                if os.path.exists(name):
-                    diagnosis["file_exists"] = True
-                    diagnosis["file_path"] = name
-                else:
-                    # Try to find in cache
-                    from glob import glob
-                    for pattern in potential_paths:
-                        matches = glob(pattern)
-                        if matches:
-                            diagnosis["file_exists"] = True
-                            diagnosis["file_path"] = matches[0]
-                            break
-            if diagnosis["file_path"] and os.path.exists(diagnosis["file_path"]):
-                try:
-                    diagnosis["file_size_mb"] = round(os.path.getsize(diagnosis["file_path"]) / (1024 * 1024), 2)
-                except:
-                    pass
-        return diagnosis
     def generate_text(self, name: str, prompt: str, model_type: str = None, **kwargs) -> str:
         """Generate text using specified model"""
@@ -812,7 +499,7 @@ class UnifiedModelManager:
         for key, model in self._models.items():
             # Remove models not used in last hour
-            if current_time - model._last_used > 31200:
                 to_remove.append(key)
         for key in to_remove:
@@ -830,8 +517,7 @@ class UnifiedModelManager:
                 memory_usage=model._memory_usage,
                 load_time=model._load_time,
                 last_used=model._last_used,
-                error_message=model._error_message,
-                fallback_reason=model._fallback_reason
             )
             for model in self._models.values()
         ]
@@ -850,25 +536,7 @@ unified_model_manager = get_unified_model_manager()
 # Legacy compatibility functions
 def create_fallback_pipeline():
     """Create a fallback pipeline for compatibility"""
-    fallback_model = FallbackModel("fallback", "fallback")
-    fallback_model.load()  # Ensure it's loaded
-    # Create a wrapper that matches the expected interface
-    class FallbackPipelineWrapper:
-        def __init__(self, model):
-            self.model = model
-        def generate(self, prompt, **kwargs):
-            """Generate with keyword arguments (for compatibility with GGUF pipeline interface)"""
-            # Convert kwargs to GenerationConfig (already imported at module level)
-            config = GenerationConfig(**kwargs)
-            return self.model.generate(prompt, config)
-        def generate_full_summary(self, prompt, **kwargs):
-            """Generate full summary (for compatibility)"""
-            return self.generate(prompt, **kwargs)
-    return FallbackPipelineWrapper(fallback_model)
 def get_memory_monitor():
     """Get a simple memory monitor for compatibility"""

     load_time: float
     last_used: float
     error_message: Optional[str] = None
 @dataclass
 class GenerationConfig:
         self._load_time = 0.0
         self._last_used = time.time()
         self._error_message = None
         self._memory_usage = 0.0
         self._kwargs = kwargs
     @property
     def status(self) -> ModelStatus:
         return self._status
     @abstractmethod
     def _load_implementation(self) -> bool:
         except Exception as e:
             self._status = ModelStatus.ERROR
             self._error_message = str(e)
             logger.error(f"Failed to load model {self.name}: {e}")
             return None
     def _update_memory_usage(self):
     def _load_implementation(self) -> bool:
         try:
             from transformers import pipeline
             # Get T4-optimized kwargs
             model_kwargs = get_t4_model_kwargs(self.model_type)
             # Handle different model types for summarization
             if self.model_type.lower() in ["summarization", "seq2seq"]:
                     model=self.name,
                     device_map="auto" if torch.cuda.is_available() else None,
                     model_kwargs=model_kwargs,
+                    **self._kwargs
                 )
             elif self.model_type.lower() in ["text-generation", "causal-lm"]:
                 # Text generation models
                     model=self.name,
                     device_map="auto" if torch.cuda.is_available() else None,
                     model_kwargs=model_kwargs,
+                    **self._kwargs
                 )
             elif "bart" in self.name.lower() or "t5" in self.name.lower():
                 # BART and T5 models default to summarization
                     model=self.name,
                     device_map="auto" if torch.cuda.is_available() else None,
                     model_kwargs=model_kwargs,
+                    **self._kwargs
                 )
             elif "longformer" in self.name.lower():
                 # Longformer models work with summarization pipeline
                     model=self.name,
                     device_map="auto" if torch.cuda.is_available() else None,
                     model_kwargs=model_kwargs,
+                    **self._kwargs
                 )
             else:
                 # Default to text-generation for unknown types
                     model=self.name,
                     device_map="auto" if torch.cuda.is_available() else None,
                     model_kwargs=model_kwargs,
+                    **self._kwargs
                 )
             return True
                 "num_return_sequences": 1
             }
             # Handle different pipeline types
             if hasattr(self._model, 'task') and self._model.task == "summarization":
                 # Summarization pipeline
                     temperature=config.temperature,
                     do_sample=config.temperature > 0.1,
                     num_beams=4,  # Better quality for summarization
+                    early_stopping=True
                 )
                 return result[0]['summary_text'] if result else ""
             else:
                     top_p=config.top_p,
                     do_sample=config.temperature > 0.1,
                     pad_token_id=0,
+                    num_return_sequences=1
                 )
                 generated_text = result[0]['generated_text']
                 # Remove the prompt from the generated text
     def _load_implementation(self) -> bool:
         try:
             from llama_cpp import Llama
             # Get T4-optimized kwargs
             model_kwargs = get_t4_model_kwargs("gguf")
             # Set up model path - handle different GGUF formats
             model_path = self.name
             if not model_path.endswith('.gguf'):
+                if '/' in model_path:
+                    # Already a full path like microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf
+                    model_path = f"{model_path}"
                 else:
+                    # Add default filename
+                    model_path = f"{model_path}/{self.filename}"
             self._model = Llama(
                 model_path=model_path,
+                n_ctx=8192,  # T4 context window
                 n_threads=4,  # CPU threads
+                n_gpu_layers=35 if torch.cuda.is_available() else 0,  # GPU layers for Phi-3
                 verbose=False,
                 **model_kwargs
             )
             return True
         except Exception as e:
+            logger.error(f"Failed to load GGUF model {self.name}: {e}")
             return False
     def generate(self, prompt: str, config: GenerationConfig) -> str:
     def _load_implementation(self) -> bool:
         try:
             from optimum.intel import OVModelForCausalLM
             from transformers import AutoTokenizer
                 # e.g., "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov" -> "microsoft/Phi-3-mini-4k-instruct"
                 if "Phi-3-mini-4k-instruct" in self.name:
                     tokenizer_path = "microsoft/Phi-3-mini-4k-instruct"
             self._model = OVModelForCausalLM.from_pretrained(
                 model_path,
                 device="GPU" if torch.cuda.is_available() else "CPU",
+                **model_kwargs
             )
+            self._tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
             return True
         except Exception as e:
             logger.error(f"Failed to load OpenVINO model {self.name}: {e}")
             return False
     def generate(self, prompt: str, config: GenerationConfig) -> str:
             raise ModelError(self.name, "not_loaded", "Model not loaded")
         try:
+            inputs = self._tokenizer(prompt, return_tensors="pt")
             if torch.cuda.is_available():
                 inputs = {k: v.to("cuda") for k, v in inputs.items()}
     def generate(self, prompt: str, config: GenerationConfig) -> str:
         # Simple rule-based fallback
         return "Patient summary generation completed. Please review patient data manually for comprehensive assessment."
 class UnifiedModelManager:
             model_type = detect_model_type(name)
         # Check if model is supported on T4
         if not is_model_supported_on_t4(name, model_type):
+            logger.warning(f"Model {name} may not be optimal for T4. Using fallback.")
             model_type = "fallback"
         cache_key = f"{name}:{model_type}"
             model_kwargs = get_t4_model_kwargs(model_type)
             model_kwargs.update(kwargs)
             if model_type == "gguf" or filename or name.endswith('.gguf'):
                 model = GGUFModel(name, model_type, filename, **model_kwargs)
+            elif model_type == "openvino" or "openvino" in name.lower():
                 model = OpenVINOModel(name, model_type, **model_kwargs)
             elif model_type == "fallback":
                 model = FallbackModel(name, model_type, **model_kwargs)
             else:
                 model = TransformersModel(name, model_type, **model_kwargs)
         # Load if not lazy
         if not lazy and model.status != ModelStatus.LOADED:
+            model.load()
         return model
     def generate_text(self, name: str, prompt: str, model_type: str = None, **kwargs) -> str:
         """Generate text using specified model"""
         for key, model in self._models.items():
             # Remove models not used in last hour
+            if current_time - model._last_used > 3600:
                 to_remove.append(key)
         for key in to_remove:
                 memory_usage=model._memory_usage,
                 load_time=model._load_time,
                 last_used=model._last_used,
+                error_message=model._error_message
             )
             for model in self._models.values()
         ]
 # Legacy compatibility functions
 def create_fallback_pipeline():
     """Create a fallback pipeline for compatibility"""
+    return FallbackModel("fallback", "fallback")
 def get_memory_monitor():
     """Get a simple memory monitor for compatibility"""

temp_test_load.py DELETED

@@ -1,6 +0,0 @@
-import sys, os
-sys.path.append(r'd:/dartdev/glitz/git/HNTAI/services/ai-service/src')
-from ai_med_extract.utils.unified_model_manager import UnifiedModelManager
-manager = UnifiedModelManager()
-model = manager.get_model('microsoft/Phi-3-small-8k-instruct', model_type='causal-openvino', lazy=False)
-print('Model status after load:', model.status)
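
Note on the deleted script above: it calls `get_model(..., lazy=False)` and prints the resulting status, so it exercises the "Load if not lazy" branch shown in the `get_model` hunk of this commit. A minimal sketch of that branch in isolation follows. `StubModel`, the `UNLOADED` state, and the `lazy=True` default are assumptions for illustration; only `ModelStatus.LOADED` and the `if not lazy and model.status != ModelStatus.LOADED` check come from the diff.

```python
from enum import Enum


class ModelStatus(Enum):
    # Stand-in for the project's ModelStatus; UNLOADED is an assumed name.
    UNLOADED = "unloaded"
    LOADED = "loaded"


class StubModel:
    """Hypothetical stand-in for a model class; 'loads' instantly."""

    def __init__(self, name: str):
        self.name = name
        self.status = ModelStatus.UNLOADED

    def load(self) -> None:
        self.status = ModelStatus.LOADED


def get_model(name: str, lazy: bool = True) -> StubModel:
    """Mirror of the 'Load if not lazy' branch shown in the diff."""
    model = StubModel(name)
    if not lazy and model.status != ModelStatus.LOADED:
        model.load()
    return model


eager = get_model("stub-model", lazy=False)
deferred = get_model("stub-model", lazy=True)
print("eager:", eager.status.name)
print("deferred:", deferred.status.name)
```

With stubs like these, the same check can run as an assertion in a test suite without downloading any weights, which may be preferable to a script with a hard-coded local path.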

temp_test_load_128k.py DELETED

@@ -1,9 +0,0 @@
-import sys, os
-sys.path.append(r'd:/dartdev/glitz/git/HNTAI/services/ai-service/src')
-from ai_med_extract.utils.unified_model_manager import UnifiedModelManager
-manager = UnifiedModelManager()
-# Testing the primary model from config
-model_name = 'microsoft/Phi-3-mini-128k-instruct'
-print(f'Testing load for: {model_name}')
-model = manager.get_model(model_name, model_type='causal-openvino', lazy=False)
-print(f'Model status after load: {model.status}')
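
Note on the `create_fallback_pipeline()` hunk earlier in this commit: the revert removes `FallbackPipelineWrapper` and returns a bare `FallbackModel`, whose `generate(prompt, config)` appears to take a config object rather than keyword arguments. The sketch below illustrates what the removed adapter did and how a keyword-style call behaves without it. Both classes here are simplified stand-ins, and the `GenerationConfig` fields and defaults are assumptions (the diff only shows `temperature` and `top_p` being read).

```python
from dataclasses import dataclass


@dataclass
class GenerationConfig:
    # Stand-in with assumed fields and defaults; the real dataclass may differ.
    temperature: float = 0.7
    top_p: float = 0.9


class FallbackModel:
    """Stand-in: generate() takes a config object, as in the diff."""

    def generate(self, prompt: str, config: GenerationConfig) -> str:
        return "fallback summary"


class FallbackPipelineWrapper:
    """Adapter accepting keyword arguments, like the wrapper this commit removes."""

    def __init__(self, model: FallbackModel):
        self.model = model

    def generate(self, prompt: str, **kwargs) -> str:
        # Convert keyword arguments into a config object before delegating.
        return self.model.generate(prompt, GenerationConfig(**kwargs))


wrapped = FallbackPipelineWrapper(FallbackModel())
print(wrapped.generate("summarize", temperature=0.2))

try:
    # Keyword-style call against the bare model, as returned after the revert.
    FallbackModel().generate("summarize", temperature=0.2)
except TypeError as exc:
    print("bare model rejects keyword-style call:", exc)
```

If any caller still invokes the fallback with keyword arguments, it would likely need to build a `GenerationConfig` itself once this revert lands.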