sachinchandrankallar committed on
Commit cdea66b · 1 Parent(s): dd14d00

Revert "feat: Establish AI medical extraction service with performance optimizations, unified model management, and detailed Hugging Face Spaces deployment guides."

Files changed (35)
  1. Dockerfile.hf-spaces-minimal +1 -1
  2. __pycache__/app.cpython-311.pyc +0 -0
  3. docs/FIXES/PHI3_COMPATIBILITY_FIX.md +0 -257
  4. docs/archive/COMPREHENSIVE_STREAMING_FIX.md +2 -2
  5. docs/archive/patient_summary_models_review.md +5 -5
  6. docs/hf-spaces/FILES_CREATED.md +4 -4
  7. docs/hf-spaces/INDEX.md +2 -2
  8. models_config.json +4 -21
  9. services/ai-service/DEPLOYMENT_FIX.md +4 -4
  10. services/ai-service/Dockerfile.prod +1 -1
  11. services/ai-service/src/__main__.py +1 -1
  12. services/ai-service/src/ai_med_extract/__pycache__/inference_service.cpython-311.pyc +0 -0
  13. services/ai-service/src/ai_med_extract/__pycache__/phi_scrubber_service.cpython-311.pyc +0 -0
  14. services/ai-service/src/ai_med_extract/agents/__pycache__/patient_summary_agent.cpython-311.pyc +0 -0
  15. services/ai-service/src/ai_med_extract/agents/__pycache__/summarizer.cpython-311.pyc +0 -0
  16. services/ai-service/src/ai_med_extract/agents/patient_summary_agent.py +20 -0
  17. services/ai-service/src/ai_med_extract/api/routes_fastapi.py +31 -91
  18. services/ai-service/src/ai_med_extract/app.py +1 -1
  19. services/ai-service/src/ai_med_extract/config/performance_config.py +2 -2
  20. services/ai-service/src/ai_med_extract/enable_optimizations.py +2 -2
  21. services/ai-service/src/ai_med_extract/inference_service.py +1 -1
  22. services/ai-service/src/ai_med_extract/phi_scrubber_service.py +1 -1
  23. services/ai-service/src/ai_med_extract/services/job_manager.py +1 -1
  24. services/ai-service/src/ai_med_extract/services/request_queue.py +3 -3
  25. services/ai-service/src/ai_med_extract/utils/__pycache__/model_config.cpython-311.pyc +0 -0
  26. services/ai-service/src/ai_med_extract/utils/__pycache__/openvino_summarizer_utils.cpython-311.pyc +0 -0
  27. services/ai-service/src/ai_med_extract/utils/__pycache__/performance_monitor.cpython-311.pyc +0 -0
  28. services/ai-service/src/ai_med_extract/utils/constants.py +20 -20
  29. services/ai-service/src/ai_med_extract/utils/hf_spaces_config.py +1 -1
  30. services/ai-service/src/ai_med_extract/utils/model_config.py +7 -12
  31. services/ai-service/src/ai_med_extract/utils/openvino_summarizer_utils.py +1 -1
  32. services/ai-service/src/ai_med_extract/utils/performance_monitor.py +1 -1
  33. services/ai-service/src/ai_med_extract/utils/unified_model_manager.py +26 -358
  34. temp_test_load.py +0 -6
  35. temp_test_load_128k.py +0 -9
Dockerfile.hf-spaces-minimal CHANGED
@@ -48,5 +48,5 @@ HEALTHCHECK --interval=30s --timeout=10s --start-period=30s --retries=3 \
  CMD curl -f http://localhost:7860/health || exit 1

  # Start application with single worker for minimal memory footprint
- CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1", "--timeout-keep-alive", "1200"]
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1", "--timeout-keep-alive", "600"]

__pycache__/app.cpython-311.pyc CHANGED
Binary files a/__pycache__/app.cpython-311.pyc and b/__pycache__/app.cpython-311.pyc differ
 
docs/FIXES/PHI3_COMPATIBILITY_FIX.md DELETED
@@ -1,257 +0,0 @@
- # Fix: Phi-3 Model Compatibility Issues
-
- ## Issues Fixed
-
- ### Issue 1: ✅ cache_dir Model Kwargs Error
- ```
- ValueError: The following `model_kwargs` are not used by the model: ['cache_dir']
- ```
-
- ### Issue 2: ✅ DynamicCache Compatibility Error
- ```
- AttributeError: 'DynamicCache' object has no attribute 'get_max_length'
- ```
-
- ---
-
- ## Root Causes
-
- ### Issue 1: cache_dir Error
- - `cache_dir` was being passed in `model_kwargs` or `pipeline_kwargs`
- - These parameters can leak into the `generate()` method
- - Models reject `cache_dir` during generation since it's only valid during loading
-
- ### Issue 2: DynamicCache Error
- - Phi-3 models use a long-context cache mechanism with `DynamicCache`
- - Older cached model code (in `transformers_modules`) uses `get_max_length()` method
- - Newer transformers library's `DynamicCache` class doesn't have this method
- - This causes compatibility issues between cached model code and current library
-
- ---
-
- ## Solutions Implemented
-
- ### Fix 1: cache_dir via Environment Variable
-
- **File:** `services/ai-service/src/ai_med_extract/utils/unified_model_manager.py` (Lines 200-209)
-
- ```python
- # Set cache directory via environment variable (safest approach)
- # This ensures it's only used during from_pretrained(), not passed to generate()
- if not IS_T4_MEDIUM:
-     # Local environment
-     cache_dir = os.environ.get('HF_HOME', os.path.join(os.path.expanduser('~'), '.cache', 'huggingface'))
-     os.environ['HF_HOME'] = cache_dir
- else:
-     # T4 environment
-     from .model_config import T4_CACHE_DIR
-     os.environ['HF_HOME'] = T4_CACHE_DIR
- ```
-
- **Why this works:**
- - `HF_HOME` is the official environment variable for transformers cache
- - It's read during `from_pretrained()` but **never** passed to `generate()`
- - Completely eliminates the `cache_dir` error
-
- **Also updated:** `model_config.py` to remove `cache_dir` from `T4_OPTIMIZATIONS`
-
- ### Fix 2: Disable Cache for Phi-3 Models
-
- **File:** `services/ai-service/src/ai_med_extract/utils/unified_model_manager.py`
-
- **Location 1:** Model Loading (Lines 223-227)
- ```python
- # CRITICAL FIX: Disable use_cache for Phi-3 models to avoid DynamicCache compatibility issues
- # The cached Phi-3 model code may use get_max_length() which doesn't exist in newer DynamicCache
- if "phi-3" in self.name.lower() or "phi3" in self.name.lower():
-     model_kwargs["use_cache"] = False
- ```
-
- **Location 2:** Generation (Lines 300-307)
- ```python
- # Prepare generation kwargs
- gen_kwargs = {}
-
- # CRITICAL FIX: Disable cache for Phi-3 models to avoid DynamicCache compatibility issues
- if "phi-3" in self.name.lower() or "phi3" in self.name.lower():
-     gen_kwargs["use_cache"] = False
-     logger.info(f"Disabled cache for Phi-3 model {self.name} to avoid compatibility issues")
- ```
-
- **Why this works:**
- - Disabling `use_cache` prevents Phi-3 from using the problematic `DynamicCache` mechanism
- - The model runs slightly slower but avoids the `get_max_length()` error
- - All Phi-3 variants are covered: `Phi-3-small`, `Phi-3-mini`, `Phi-3-mini-128k`, etc.
-
- ---
-
- ## Affected Models
-
- ### All Phi-3 Variants
- - ✅ `microsoft/Phi-3-small-8k-instruct`
- - ✅ `microsoft/Phi-3-mini-4k-instruct`
- - ✅ `microsoft/Phi-3-mini-128k-instruct`
- - ✅ `microsoft/Phi-3-medium-4k-instruct`
- - ✅ Any other Phi-3 model
-
- ### All Text-Generation Models
- - ✅ Any model using `text-generation` pipeline
- - ✅ `cache_dir` fix applies universally
-
- ---
-
- ## Testing
-
- ### Test Case 1: Phi-3-small with Text Generation
-
- **Request:**
- ```json
- {
-   "mode": "stream",
-   "patientid": 4268,
-   "token": "your-token",
-   "key": "https://api.glitzit.com",
-   "patient_summarizer_model_name": "microsoft/Phi-3-small-8k-instruct",
-   "patient_summarizer_model_type": "text-generation",
-   "custom_prompt": "create a clinical patient summary in markdown"
- }
- ```
-
- **Before Fixes:**
- - ❌ Error 1: `cache_dir` not used by model
- - ❌ Error 2: `DynamicCache` has no attribute `get_max_length`
-
- **After Fixes:**
- - ✅ Model loads successfully
- - ✅ Generates patient summary without errors
- - ℹ️ Note: May auto-switch to `Phi-3-mini-128k-instruct` on Windows (Triton unavailable)
-
- ### Test Case 2: Default Phi-3 Model
-
- **Request:**
- ```json
- {
-   "mode": "stream",
-   "patientid": 4268,
-   "token": "your-token",
-   "key": "https://api.glitzit.com"
- }
- ```
-
- **Result:**
- - ✅ Uses default Phi-3 GGUF model
- - ✅ No cache issues
-
- ---
-
- ## Performance Impact
-
- ### cache_dir Fix
- - **Impact:** None
- - **Reason:** Environment variable approach is just as efficient as parameter passing
-
- ### use_cache=False for Phi-3
- - **Impact:** Slight performance decrease (~5-10% slower)
- - **Reason:** Model can't reuse cached key-values during generation
- - **Trade-off:** Worth it to avoid crashes and ensure compatibility
- - **Alternative:** Update transformers library and clear cache (more complex)
-
- ---
-
- ## Alternative Solutions Considered
-
- ### Alternative 1: Clear HuggingFace Cache
- ```bash
- rm -rf D:\tmp\huggingface\modules\transformers_modules
- ```
- - **Pros:** Would fix DynamicCache issue permanently
- - **Cons:** Requires manual intervention, re-downloads models
-
- ### Alternative 2: Update transformers Library
- ```bash
- pip install --upgrade transformers
- ```
- - **Pros:** May fix compatibility
- - **Cons:** Could break other models, requires testing
-
- ### Alternative 3: Use Different Model
- ```json
- {
-   "patient_summarizer_model_name": "google/flan-t5-large",
-   "patient_summarizer_model_type": "summarization"
- }
- ```
- - **Pros:** No Phi-3 compatibility issues
- - **Cons:** Different model quality, not instruction-tuned for medical text
-
- **Our Choice:** Disable cache for Phi-3 models (minimal impact, maximum compatibility)
-
- ---
-
- ## Logs to Monitor
-
- ### Successful Load
- ```
- 2025-11-24 10:29:38,016 - INFO - Loading model: microsoft/Phi-3-mini-128k-instruct (text-generation)
- 2025-11-24 10:29:43,231 - INFO - Model microsoft/Phi-3-mini-128k-instruct loaded in 5.22s
- ```
-
- ### Cache Disabled Log
- ```
- 2025-11-24 10:29:46,808 - INFO - Disabled cache for Phi-3 model microsoft/Phi-3-mini-128k-instruct to avoid compatibility issues
- ```
-
- ### Success
- ```
- INFO: 127.0.0.1:49677 - "POST /generate_patient_summary?stream=true HTTP/1.1" 200 OK
- ```
-
- ---
-
- ## Files Modified
-
- 1. **`services/ai-service/src/ai_med_extract/utils/model_config.py`**
- - Removed `cache_dir` from `T4_OPTIMIZATIONS`
- - Added `T4_CACHE_DIR` constant
-
- 2. **`services/ai-service/src/ai_med_extract/utils/unified_model_manager.py`**
- - Lines 200-209: Set cache via `HF_HOME` environment variable
- - Lines 223-227: Disable cache during Phi-3 model loading
- - Lines 300-307: Disable cache during Phi-3 generation
-
- ---
-
- ## Recommended Models
-
- ### Best for Medical Summaries (No Issues)
- ```json
- {
-   "patient_summarizer_model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
-   "patient_summarizer_model_type": "gguf"
- }
- ```
- - ✅ No cache issues (uses llama.cpp backend)
- - ✅ Fast and efficient
- - ✅ Medical domain knowledge
-
- ### Best for Long Context (Fixed Now)
- ```json
- {
-   "patient_summarizer_model_name": "microsoft/Phi-3-small-8k-instruct",
-   "patient_summarizer_model_type": "text-generation"
- }
- ```
- - ✅ 8k context window
- - ✅ Works with both fixes applied
- - ⚠️ May auto-switch to Phi-3-mini-128k on Windows
-
- ---
-
- ## Date
-
- Fixed: November 24, 2025
-
- ## Status
-
- ✅ **RESOLVED** - Both issues fixed and tested
-
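Review note: the two workarounds described in the deleted document can be sketched without loading any model. This is a minimal sketch under stated assumptions, not the code this commit removes. `resolve_hf_home` and `build_generation_kwargs` are hypothetical helper names, and the `T4_CACHE_DIR` value below is a placeholder because the real constant in `model_config.py` is not shown in this diff.

```python
import os
from typing import MutableMapping, Optional

# Placeholder value (assumption); the real constant lives in model_config.py.
T4_CACHE_DIR = "/tmp/huggingface"


def resolve_hf_home(is_t4_medium: bool,
                    env: Optional[MutableMapping[str, str]] = None) -> str:
    """Choose a cache directory and export it as HF_HOME.

    The directory is passed through the environment rather than through
    model kwargs, so it cannot leak into generate().
    """
    env = os.environ if env is None else env
    if is_t4_medium:
        cache_dir = T4_CACHE_DIR
    else:
        default = os.path.join(os.path.expanduser("~"), ".cache", "huggingface")
        cache_dir = env.get("HF_HOME", default)
    env["HF_HOME"] = cache_dir
    return cache_dir


def is_phi3(model_name: str) -> bool:
    """Same substring check the deleted document shows for Phi-3 names."""
    lowered = model_name.lower()
    return "phi-3" in lowered or "phi3" in lowered


def build_generation_kwargs(model_name: str, **overrides) -> dict:
    """Return generation kwargs, forcing use_cache=False for Phi-3 models."""
    gen_kwargs = dict(overrides)
    if is_phi3(model_name):
        gen_kwargs["use_cache"] = False
    return gen_kwargs
```

As far as I know, the Hugging Face libraries read `HF_HOME` when they are imported, so a helper like `resolve_hf_home` would probably need to run before `transformers` is imported. Note also that the substring check matches GGUF file names containing `Phi-3`, which only matters if those names ever reach the transformers code path.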
docs/archive/COMPREHENSIVE_STREAMING_FIX.md CHANGED
@@ -31,7 +31,7 @@ is_gguf_mode = (data.get('generation_mode') == 'gguf' or
  ### **3. Extended Timeout Configuration**
  ```python
  # Extended timeout for GGUF operations
- max_wait_time = 1200 # 10 minutes for GGUF operations
+ max_wait_time = 600 # 10 minutes for GGUF operations
  heartbeat_interval = 5 # Every 5 seconds
  ```

@@ -54,7 +54,7 @@ heartbeat_interval = 5 # Every 5 seconds
  ### **5. Enhanced SSE Generator**
  ```python
  def sse_generator_extended(job_id):
-     max_wait_time = 1200 # 10 minutes for GGUF operations
+     max_wait_time = 600 # 10 minutes for GGUF operations
      heartbeat_interval = 5 # Every 5 seconds
      # Enhanced logging and progress updates
  ```
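Review note: 600 seconds is 10 minutes, so the restored value agrees with its inline comment, whereas the reverted 1200 would have been 20 minutes. How `max_wait_time` and `heartbeat_interval` might interact in a server-sent events generator can be sketched as follows. This is an illustration only: `sse_events`, `poll_result`, and the injectable `clock` and `sleep` parameters are hypothetical and do not come from `sse_generator_extended`.

```python
import time
from typing import Callable, Iterator, Optional


def sse_events(poll_result: Callable[[], Optional[str]],
               max_wait_time: float = 600,
               heartbeat_interval: float = 5,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield SSE frames until a result arrives or max_wait_time elapses."""
    start = clock()
    while clock() - start < max_wait_time:
        result = poll_result()
        if result is not None:
            # Final payload: a single data frame, then stop.
            yield f"data: {result}\n\n"
            return
        # Lines starting with ':' are SSE comments; they keep the
        # connection alive without delivering an event to the client.
        yield ": heartbeat\n\n"
        sleep(heartbeat_interval)
    # The deadline passed without a result.
    yield "event: timeout\ndata: job exceeded max_wait_time\n\n"
```

Injecting `clock` and `sleep` keeps the loop testable without waiting in real time; a production generator would more likely be asynchronous.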
docs/archive/patient_summary_models_review.md CHANGED
@@ -160,7 +160,7 @@ elif model_type == "causal-openvino":

  #### Weaknesses
  - ⚠️ **Slight quality loss**: Q4 quantization may reduce quality slightly
- - ⚠️ **Longer timeouts**: Extended timeout needed (1200s on HF Spaces)
+ - ⚠️ **Longer timeouts**: Extended timeout needed (600s on HF Spaces)
  - ⚠️ **File path parsing**: Requires special handling for filename extraction

  #### Implementation Details
@@ -428,7 +428,7 @@ Based on HF Spaces configuration (`hf_spaces_config.py`):
  - ✅ **RAM**: ~3-4GB during inference
  - ✅ **Speed**: Very good on T4 (GGUF optimized)
  - ✅ **HF Spaces Config**: Primary GGUF model (line 33)
- - ✅ **Extended Timeout**: 1200s configured for HF Spaces (routes_fastapi.py line 1075)
+ - ✅ **Extended Timeout**: 600s configured for HF Spaces (routes_fastapi.py line 1075)
  - ✅ **Quantization**: Q4 reduces memory by ~75%

  #### Performance Estimates
@@ -449,7 +449,7 @@ Based on HF Spaces configuration (`hf_spaces_config.py`):
  #### Recommendations
  - **Best Choice** for cost-conscious deployment
  - Use when expecting high concurrent load
- - Extended timeout already configured (1200s)
+ - Extended timeout already configured (600s)
  - Cache-friendly for repeated requests

  ---
@@ -551,7 +551,7 @@ GGUF (Phi-3-Q4): ~2.0GB GPU (16% of usable)

  Based on `routes_fastapi.py`:
  - **Standard models**: 120-180s timeout
- - **GGUF models**: 1200s extended timeout (line 1075)
+ - **GGUF models**: 600s extended timeout (line 1075)
  - **HF Spaces detection**: Automatic (line 1073-1074)

  ### Optimization Strategies for T4
@@ -619,7 +619,7 @@ Fallback Model: microsoft/Phi-3-mini-4k-instruct-gguf
  Emergency Fallback: google/flan-t5-large
  Max Concurrent: 5-6 requests (BART), 8-10 (GGUF)
  Memory Limit: 80% (12.8GB GPU, 24GB RAM)
- Timeout: 180s (standard), 1200s (GGUF)
+ Timeout: 180s (standard), 600s (GGUF)
  ```

  ### 📊 **Expected Performance**
docs/hf-spaces/FILES_CREATED.md CHANGED
@@ -125,7 +125,7 @@ python verify_cache.py

  ### 7. `MODEL_CACHING_SUMMARY.md` ⭐ START HERE
  **Purpose**: Overview and answer to your question
- **Size**: ~1200 lines
+ **Size**: ~600 lines
  **Contents**:
  - Direct answer to your question
  - Performance comparison
@@ -183,7 +183,7 @@ python verify_cache.py

  ### 11. `README_HF_SPACES.md`
  **Purpose**: Main README for HF Spaces deployment
- **Size**: ~1200 lines
+ **Size**: ~600 lines
  **Contents**:
  - Quick start (3 steps)
  - File structure
@@ -231,11 +231,11 @@ python verify_cache.py
  | `entrypoint.sh` | Script | ⭐ YES | 40 lines | Startup verification |
  | `verify_cache.py` | Tool | Recommended | 200 lines | Verify cache |
  | `health_endpoints.py` | Code | Recommended | +120 lines | Health endpoints |
- | `MODEL_CACHING_SUMMARY.md` | Docs | ⭐ START HERE | 1200 lines | Overview |
+ | `MODEL_CACHING_SUMMARY.md` | Docs | ⭐ START HERE | 600 lines | Overview |
  | `HF_SPACES_QUICKSTART.md` | Docs | Recommended | 400 lines | Quick start |
  | `HF_SPACES_DEPLOYMENT.md` | Docs | Reference | 800 lines | Full guide |
  | `DEPLOYMENT_CHECKLIST.md` | Docs | Helpful | 400 lines | Checklist |
- | `README_HF_SPACES.md` | Docs | Reference | 1200 lines | Main README |
+ | `README_HF_SPACES.md` | Docs | Reference | 600 lines | Main README |
  | `COMPARISON_BEFORE_AFTER.md` | Docs | Helpful | 500 lines | Comparison |
  | `FILES_CREATED.md` | Docs | Reference | This file | Index |

docs/hf-spaces/INDEX.md CHANGED
@@ -122,8 +122,8 @@ All documentation for deploying to Hugging Face Spaces with pre-cached models.
  | DEPLOYMENT_CHECKLIST.md | ~400 | Use while deploying | ⭐⭐ |
  | MODEL_UPDATE_SUMMARY.md | ~500 | 10 min | ⭐⭐ |
  | HF_SPACES_DEPLOYMENT.md | ~800 | 30 min | ⭐ |
- | MODEL_CACHING_SUMMARY.md | ~1200 | 15 min | ⭐ |
- | README_HF_SPACES.md | ~1200 | Reference | ⭐ |
+ | MODEL_CACHING_SUMMARY.md | ~600 | 15 min | ⭐ |
+ | README_HF_SPACES.md | ~600 | Reference | ⭐ |
  | COMPARISON_BEFORE_AFTER.md | ~500 | Reference | Optional |
  | FILES_CREATED.md | ~500 | Reference | Optional |

models_config.json CHANGED
@@ -41,31 +41,13 @@
  {
  "name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
  "type": "gguf",
- "is_active": false,
+ "is_active": true,
  "cached": true,
- "description": "Phi-3 Mini GGUF Q4 quantized - 4k Context",
+ "description": "Phi-3 Mini GGUF Q4 quantized - PRIMARY MODEL",
  "use_case": "Fast patient summary generation with CPU/GPU",
  "repo_id": "microsoft/Phi-3-mini-4k-instruct-gguf",
  "filename": "Phi-3-mini-4k-instruct-q4.gguf"
  },
- {
- "name": "microsoft/Phi-3-mini-128k-instruct",
- "type": "causal-openvino",
- "is_active": true,
- "cached": false,
- "description": "Phi-3 Mini 128k Context - PRIMARY MODEL",
- "use_case": "Long-context patient summary generation"
- },
- {
- "name": "microsoft/Phi-3-mini-128k-instruct-gguf/Phi-3-mini-128k-instruct-q4.gguf",
- "type": "gguf",
- "is_active": false,
- "cached": false,
- "description": "Phi-3 Mini 128k Context GGUF Q4",
- "use_case": "Local testing with 128k context (CPU/GPU)",
- "repo_id": "microsoft/Phi-3-mini-128k-instruct-gguf",
- "filename": "Phi-3-mini-128k-instruct-q4.gguf"
- },
  {
  "name": "google/flan-t5-large",
  "type": "summarization",
@@ -93,4 +75,5 @@
  "Other models can be requested at runtime and will be downloaded automatically",
  "Runtime downloads are cached for subsequent uses"
  ]
- }
+ }
+
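Review note: this hunk moves the `is_active` flag back from the 128k OpenVINO entry to the 4k GGUF entry. A reader of such a config could select the active entry roughly as sketched below. The `name`, `type`, and `is_active` fields come from the diff; the top-level `"models"` key and the `active_model` helper are assumptions, since the enclosing structure of `models_config.json` is not visible here.

```python
import json


def active_model(config: dict) -> dict:
    """Return the single entry flagged is_active, or raise if ambiguous."""
    # "models" is an assumed key name; the diff does not show the parent key.
    models = config.get("models", [])
    active = [m for m in models if m.get("is_active")]
    if len(active) != 1:
        raise ValueError(f"expected exactly one active model, found {len(active)}")
    return active[0]


# Trimmed example mirroring the post-revert state of the hunk above.
sample = json.loads("""
{
  "models": [
    {"name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
     "type": "gguf", "is_active": true},
    {"name": "google/flan-t5-large", "type": "summarization", "is_active": false}
  ]
}
""")

primary = active_model(sample)
print(primary["name"], primary["type"])
```

Raising when zero or several entries are active is a design choice for the sketch; it would surface a half-applied revert early instead of silently picking the first match.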
services/ai-service/DEPLOYMENT_FIX.md CHANGED
@@ -17,13 +17,13 @@ The deployment was failing with a "Scheduling failure: unable to schedule" error
  **Before:**
  ```dockerfile
  RUN pip install --no-cache-dir -r /app/requirements.txt gunicorn
- CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:7860", "--timeout", "1200", "wsgi:app"]
+ CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:7860", "--timeout", "600", "wsgi:app"]
  ```

  **After:**
  ```dockerfile
  RUN pip install --no-cache-dir -r /app/requirements.txt uvicorn[standard]
- CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--timeout-keep-alive", "1200", "--workers", "4"]
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--timeout-keep-alive", "600", "--workers", "4"]
  ```

  ### Why This Works
@@ -66,12 +66,12 @@ If you need more production-grade deployment with multiple workers:
  #### Option A: Gunicorn with Uvicorn Workers (Recommended for Production)
  ```dockerfile
  RUN pip install --no-cache-dir -r /app/requirements.txt gunicorn uvicorn[standard]
- CMD ["gunicorn", "app:app", "--workers", "4", "--worker-class", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:7860", "--timeout", "1200"]
+ CMD ["gunicorn", "app:app", "--workers", "4", "--worker-class", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:7860", "--timeout", "600"]
  ```

  #### Option B: Pure Uvicorn (Current, Good for Medium Load)
  ```dockerfile
- CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--timeout-keep-alive", "1200", "--workers", "4"]
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--timeout-keep-alive", "600", "--workers", "4"]
  ```

  ### 3. Health Check Configuration
services/ai-service/Dockerfile.prod CHANGED
@@ -22,4 +22,4 @@ EXPOSE 7860
  ENV PRELOAD_SMALL_MODELS=false

  # Use uvicorn directly for FastAPI (ASGI) instead of gunicorn (WSGI)
- CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--timeout-keep-alive", "1200", "--workers", "4"]
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--timeout-keep-alive", "600", "--workers", "4"]
services/ai-service/src/__main__.py CHANGED
@@ -12,4 +12,4 @@ initialize_agents(app)

  if __name__ == '__main__':
      import uvicorn
-     uvicorn.run(app, host="0.0.0.0", port=7860, timeout_keep_alive=1200)
+     uvicorn.run(app, host="0.0.0.0", port=7860, timeout_keep_alive=600)
services/ai-service/src/ai_med_extract/__pycache__/inference_service.cpython-311.pyc CHANGED
Binary files a/services/ai-service/src/ai_med_extract/__pycache__/inference_service.cpython-311.pyc and b/services/ai-service/src/ai_med_extract/__pycache__/inference_service.cpython-311.pyc differ
 
services/ai-service/src/ai_med_extract/__pycache__/phi_scrubber_service.cpython-311.pyc CHANGED
Binary files a/services/ai-service/src/ai_med_extract/__pycache__/phi_scrubber_service.cpython-311.pyc and b/services/ai-service/src/ai_med_extract/__pycache__/phi_scrubber_service.cpython-311.pyc differ
 
services/ai-service/src/ai_med_extract/agents/__pycache__/patient_summary_agent.cpython-311.pyc CHANGED
Binary files a/services/ai-service/src/ai_med_extract/agents/__pycache__/patient_summary_agent.cpython-311.pyc and b/services/ai-service/src/ai_med_extract/agents/__pycache__/patient_summary_agent.cpython-311.pyc differ
 
services/ai-service/src/ai_med_extract/agents/__pycache__/summarizer.cpython-311.pyc CHANGED
Binary files a/services/ai-service/src/ai_med_extract/agents/__pycache__/summarizer.cpython-311.pyc and b/services/ai-service/src/ai_med_extract/agents/__pycache__/summarizer.cpython-311.pyc differ
 
services/ai-service/src/ai_med_extract/agents/patient_summary_agent.py CHANGED
@@ -37,6 +37,26 @@ class PatientSummarizerAgent:
          )

      def configure_model(self, model_name: str, model_type: str = None):
+         """Configure the model dynamically from payload"""
+         from ..utils.model_config import detect_model_type
+
+         self.current_model_name = model_name
+         self.current_model_type = model_type or detect_model_type(model_name)
+
+         # Get model loader from unified manager
+         from ..utils.unified_model_manager import unified_model_manager
+         self.model_loader = unified_model_manager.get_model(
+             self.current_model_name,
+             self.current_model_type,
+             lazy=True # Lazy loading for better performance
+         )
+
+         logging.info(f"Configured PatientSummarizerAgent with {model_name} ({self.current_model_type})")
+         return self.model_loader
+
+     def _initialize_model_loader(self):
+         """Initialize the model loader using the unified model manager with enhanced cache handling"""
+         import os
          is_hf_spaces = (
              os.getenv('HF_SPACES', 'false').lower() == 'true'
              or os.getenv('HUGGINGFACE_SPACES', 'false').lower() == 'true'
services/ai-service/src/ai_med_extract/api/routes_fastapi.py CHANGED
@@ -483,78 +483,25 @@ def get_gguf_pipeline(model_name: str, filename: str = None):
  start_time = time.time()
  # Try to load the GGUF model using unified manager
  try:
- import traceback
- model = unified_model_manager.get_model(model_name, "gguf", filename, lazy=False)
-
- # Check if model was forced to fallback due to T4 compatibility
- if model.model_type == "fallback":
- fallback_reason = model.fallback_reason or f"Model {model_name} is not supported/optimal for T4 Medium"
- print(f"[GGUF] ⚠️ Model forced to fallback: {fallback_reason}")
- print(f"[GGUF] Using fallback pipeline")
- GGUF_MODEL_CACHE[key] = create_fallback_pipeline()
- return GGUF_MODEL_CACHE[key]
-
- # Ensure model is actually loaded
- loaded_model = model.load()
- if loaded_model is None:
- # Get detailed error information
- error_msg = model._error_message or "Unknown error"
- fallback_reason = model.fallback_reason or f"Model {model_name} failed to load"
- print(f"[GGUF] ❌ Model load returned None")
- print(f"[GGUF] Error message: {error_msg}")
- print(f"[GGUF] Fallback reason: {fallback_reason}")
- print(f"[GGUF] Model status: {model.status}")
- raise RuntimeError(f"Model {model_name} failed to load: {error_msg}")
-
  # Wrap in pipeline-like interface for compatibility
  class GGUFModelWrapper:
  def __init__(self, model):
  self.model = model
  def generate(self, prompt, **kwargs):
- from ..utils.unified_model_manager import GenerationConfig, ModelStatus
  config = GenerationConfig(**kwargs)
- # Ensure model is loaded before generating
- if self.model.status != ModelStatus.LOADED:
- loaded = self.model.load()
- if loaded is None:
- error_msg = self.model._error_message or "Unknown error"
- raise RuntimeError(f"Model {self.model.name} is not loaded and failed to load: {error_msg}")
  return self.model.generate(prompt, config)
  def generate_full_summary(self, prompt, **kwargs):
  return self.generate(prompt, **kwargs)
-
- GGUF_MODEL_CACHE[key] = GGUFModelWrapper(loaded_model)
  load_time = time.time() - start_time
- print(f"[GGUF] Model loaded successfully in {load_time:.2f}s: {model_name}")
-
  except Exception as e:
- import traceback
  load_time = time.time() - start_time
- error_type = type(e).__name__
- error_msg = str(e)
- error_traceback = traceback.format_exc()
-
- print(f"[GGUF] ❌ Failed to load model {model_name} after {load_time:.2f}s")
- print(f"[GGUF] Error type: {error_type}")
- print(f"[GGUF] Error message: {error_msg}")
-
- # Try to get additional error info from model if it exists
- try:
- if 'model' in locals():
- if hasattr(model, '_error_message') and model._error_message:
- print(f"[GGUF] Model error message: {model._error_message}")
- if hasattr(model, 'fallback_reason') and model.fallback_reason:
- print(f"[GGUF] Fallback reason: {model.fallback_reason}")
- if hasattr(model, 'status'):
- print(f"[GGUF] Model status: {model.status}")
- except:
- pass
-
- # Print full traceback for debugging
- print(f"[GGUF] Full traceback:\n{error_traceback}")
-
  # If model loading fails, use fallback
- print("[GGUF] 🔄 Using fallback pipeline")
  GGUF_MODEL_CACHE[key] = create_fallback_pipeline()
  except Exception as e:
  print(f"[GGUF] Critical error in model loading: {e}")
@@ -688,7 +635,7 @@ def generate_rule_based_summary(baseline, delta_text, visits=None, patientid=Non

  # Clinical Overview: summarize baseline
  if baseline:
- baseline_snip = baseline[:1200].replace("\n", " ")
  lines_assessment.append(f"- Baseline: {baseline_snip}")
  else:
  lines_assessment.append("- No baseline data available.")
@@ -939,7 +886,7 @@ You are a clinical assistant. {custom_prompt}
  PATIENT VISIT DATA:
  {visit_data_text}</s>
  <|user|>
- strictly rely on data,dont halucinate or invent any information.</s>
  <|assistant|>"""
  else:
  base_prompt = process_patient_record_plain_text({
@@ -1022,7 +969,6 @@ async def load_model_with_fallback(model_name, model_type, fallback_type=None):
  from ..utils.unified_model_manager import unified_model_manager as _unified_manager
  from ..utils import model_config as _mc

- primary_error = None
  try:
  model = _unified_manager.get_model(
  name=model_name,
@@ -1031,12 +977,8 @@ async def load_model_with_fallback(model_name, model_type, fallback_type=None):
  )
  if model.load():
  return model, model_name, model_type, False, None
- else:
- # Model failed to load (returned None)
- primary_error = f"Model {model_name} ({model_type}) failed to load (load() returned None)"
  except Exception as e:
- primary_error = f"Model {model_name} ({model_type}) failed to load: {type(e).__name__}: {str(e)}"
- logger.warning(primary_error)

  # Try fallback
  if fallback_type:
@@ -1049,9 +991,7 @@ async def load_model_with_fallback(model_name, model_type, fallback_type=None):
  filename=None
  )
  if fallback_model.load():
- fallback_reason = primary_error or f"Primary model {model_name} ({model_type}) failed to load"
1053
- # Store fallback reason in the model object for later retrieval
1054
- fallback_model.set_fallback_reason(fallback_reason)
1055
  return fallback_model, fallback_model_name, fallback_type, True, fallback_reason
1056
  except Exception as e:
1057
  logger.error(f"Fallback model also failed: {e}")
@@ -1144,8 +1084,8 @@ async def async_patient_summary(data, job_id=None):
1144
  try:
1145
  response = requests.post(
1146
  ehr_url,
1147
- json={"patientid": patientid},
1148
- headers=headers,
1149
  timeout=EHR_TIMEOUT
1150
  )
1151
  logging.info(f"EHR API response status: {response.status_code}")
@@ -1408,7 +1348,7 @@ async def async_patient_summary(data, job_id=None):
1408
  try:
1409
  # Use extended timeout for GGUF operations on HF Spaces
1410
  is_hf_spaces = os.environ.get('HF_SPACES', 'false').lower() == 'true'
1411
- timeout_value = timeout_config.get("gguf_extended_timeout" if is_hf_spaces else "gguf_timeout", 1200)
1412
 
1413
  if cache_key not in GGUF_PIPELINE_CACHE:
1414
  if job_id:
@@ -1644,10 +1584,10 @@ async def async_patient_summary(data, job_id=None):
1644
  try:
1645
  raw_summary = await asyncio.wait_for(
1646
  generate_with_progress(),
1647
- timeout=timeout_config.get("generation_timeout", 1200)
1648
  )
1649
  except asyncio.TimeoutError:
1650
- error_msg = f"Text generation timed out after {timeout_config.get('generation_timeout', 1200)} seconds"
1651
  log_error_with_context(Exception(error_msg), "Text generation timeout", job_id)
1652
  update_job_with_error(job_id, error_msg, "generation_timeout")
1653
  raise Exception(error_msg)
@@ -1723,10 +1663,10 @@ async def async_patient_summary(data, job_id=None):
1723
  try:
1724
  result_sum = await asyncio.wait_for(
1725
  asyncio.to_thread(model.generate, context, config),
1726
- timeout=timeout_config.get("generation_timeout", 1200)
1727
  )
1728
  except asyncio.TimeoutError:
1729
- error_msg = f"Summarization timed out after {timeout_config.get('generation_timeout', 1200)} seconds"
1730
  log_error_with_context(Exception(error_msg), "Summarization timeout", job_id)
1731
  update_job_with_error(job_id, error_msg, "generation_timeout")
1732
  raise Exception(error_msg)
@@ -1837,7 +1777,7 @@ async def async_patient_summary(data, job_id=None):
1837
  temperature=0.1,
1838
  top_p=0.5,
1839
  ),
1840
- timeout=1200
1841
  )
1842
  else:
1843
  config = create_generation_config(data, min_tokens=100, temperature=0.1, top_p=0.5)
@@ -1887,7 +1827,7 @@ async def async_patient_summary(data, job_id=None):
1887
  if "timeout" in error_str.lower():
1888
  error_category = "TIMEOUT"
1889
  # Enhanced timeout message with recommendations
1890
- user_message = f"""Summary generation timed out after {timeout_config.get('generation_timeout', 1200)} seconds.
1891
 
1892
  This may be due to:
1893
  - Large patient dataset requiring more processing time
@@ -2012,7 +1952,7 @@ def process_patient_summary_background(data, job_id):
2012
  ehr_url,
2013
  json={"patientid": patientid},
2014
  headers=headers,
2015
- timeout=1200
2016
  )
2017
  if response.status_code == 200:
2018
  sample_data = response.json()
@@ -2477,7 +2417,7 @@ async def home():
2477
  border-radius: 20px;
2478
  padding: 40px;
2479
  box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
2480
- max-width: 1200px;
2481
  width: 100%;
2482
  animation: fadeIn 0.5s ease-in;
2483
  }
@@ -2493,7 +2433,7 @@ async def home():
2493
  padding: 8px 16px;
2494
  border-radius: 20px;
2495
  font-size: 14px;
2496
- font-weight: 1200;
2497
  margin-bottom: 20px;
2498
  }
2499
  .status-dot {
@@ -2526,7 +2466,7 @@ async def home():
2526
  }
2527
  .info-title {
2528
  color: #374151;
2529
- font-weight: 1200;
2530
  margin-bottom: 15px;
2531
  font-size: 18px;
2532
  }
@@ -2551,7 +2491,7 @@ async def home():
2551
  padding: 4px 8px;
2552
  border-radius: 4px;
2553
  font-size: 12px;
2554
- font-weight: 1200;
2555
  margin-right: 10px;
2556
  min-width: 50px;
2557
  text-align: center;
@@ -2572,7 +2512,7 @@ async def home():
2572
  .link {
2573
  color: #667eea;
2574
  text-decoration: none;
2575
- font-weight: 1200;
2576
  }
2577
  .link:hover {
2578
  text-decoration: underline;
@@ -2764,7 +2704,7 @@ async def generate_patient_summary_large_data(
2764
  """Wait for slot and then process."""
2765
  try:
2766
  # Wait for processing slot
2767
- if queue_manager.wait_for_slot(request_id, timeout=1200):
2768
  # Update job status to show processing started
2769
  job_manager.update_job(job_id, JOB_STATUS["STARTED"], progress=5, data={'message': 'Processing slot acquired, starting generation...'})
2770
  # Start background task with optimized generation
@@ -2793,7 +2733,7 @@ async def generate_patient_summary_large_data(
2793
  'X-Content-Type-Options': 'nosniff',
2794
  'Access-Control-Allow-Origin': '*',
2795
  'Access-Control-Allow-Headers': 'Cache-Control, Connection',
2796
- 'Keep-Alive': 'timeout=31200',
2797
  # Force HTTP/1.1 to avoid HTTP/2 protocol errors
2798
  'X-Protocol': 'HTTP/1.1'
2799
  }
@@ -2850,7 +2790,7 @@ async def generate_patient_summary_streaming(
2850
  """Wait for slot and then process."""
2851
  try:
2852
  # Wait for processing slot
2853
- if queue_manager.wait_for_slot(request_id, timeout=1200):
2854
  # Update job status to show processing started
2855
  job_manager.update_job(job_id, JOB_STATUS["STARTED"], progress=5, data={'message': 'Processing slot acquired, starting generation...'})
2856
  # Start background task with optimized generation
@@ -2879,7 +2819,7 @@ async def generate_patient_summary_streaming(
2879
  'X-Content-Type-Options': 'nosniff',
2880
  'Access-Control-Allow-Origin': '*',
2881
  'Access-Control-Allow-Headers': 'Cache-Control, Connection',
2882
- 'Keep-Alive': 'timeout=31200',
2883
  # Force HTTP/1.1 to avoid HTTP/2 protocol errors
2884
  'X-Protocol': 'HTTP/1.1'
2885
  }
@@ -2958,7 +2898,7 @@ async def generate_patient_summary(
2958
  """Wait for slot and then process."""
2959
  try:
2960
  # Wait for processing slot
2961
- if queue_manager.wait_for_slot(request_id, timeout=1200):
2962
  # Update job status to show processing started
2963
  job_manager.update_job(job_id, JOB_STATUS["STARTED"], progress=5, data={'message': 'Processing slot acquired, starting generation...'})
2964
  # Start background task directly (not in separate thread to avoid nesting)
@@ -2988,7 +2928,7 @@ async def generate_patient_summary(
2988
  'X-Content-Type-Options': 'nosniff',
2989
  'Access-Control-Allow-Origin': '*',
2990
  'Access-Control-Allow-Headers': 'Cache-Control, Connection',
2991
- 'Keep-Alive': 'timeout=31200',
2992
  # Force HTTP/1.1 to avoid HTTP/2 protocol errors
2993
  'X-Protocol': 'HTTP/1.1'
2994
  }
 
483
  start_time = time.time()
484
  # Try to load the GGUF model using unified manager
485
  try:
486
+ model = unified_model_manager.get_model(model_name, "gguf", filename)
487
  # Wrap in pipeline-like interface for compatibility
488
  class GGUFModelWrapper:
489
  def __init__(self, model):
490
  self.model = model
491
  def generate(self, prompt, **kwargs):
492
+ from ..utils.unified_model_manager import GenerationConfig
493
  config = GenerationConfig(**kwargs)
494
  return self.model.generate(prompt, config)
495
  def generate_full_summary(self, prompt, **kwargs):
496
  return self.generate(prompt, **kwargs)
497
+ GGUF_MODEL_CACHE[key] = GGUFModelWrapper(model)
 
498
  load_time = time.time() - start_time
499
+ print(f"[GGUF] Model loaded successfully in {load_time:.2f}s: {model_name}")
 
500
  except Exception as e:
 
501
  load_time = time.time() - start_time
502
+ print(f"[GGUF] Failed to load model {model_name} after {load_time:.2f}s: {e}")
503
  # If model loading fails, use fallback
504
+ print("[GGUF] Using fallback pipeline")
505
  GGUF_MODEL_CACHE[key] = create_fallback_pipeline()
506
  except Exception as e:
507
  print(f"[GGUF] Critical error in model loading: {e}")
 
635
 
636
  # Clinical Overview: summarize baseline
637
  if baseline:
638
+ baseline_snip = baseline[:600].replace("\n", " ")
639
  lines_assessment.append(f"- Baseline: {baseline_snip}")
640
  else:
641
  lines_assessment.append("- No baseline data available.")
 
886
  PATIENT VISIT DATA:
887
  {visit_data_text}</s>
888
  <|user|>
889
+ Generate a comprehensive patient summary based on the data above.</s>
890
  <|assistant|>"""
891
  else:
892
  base_prompt = process_patient_record_plain_text({
 
969
  from ..utils.unified_model_manager import unified_model_manager as _unified_manager
970
  from ..utils import model_config as _mc
971
 
 
972
  try:
973
  model = _unified_manager.get_model(
974
  name=model_name,
 
977
  )
978
  if model.load():
979
  return model, model_name, model_type, False, None
 
 
 
980
  except Exception as e:
981
+ logger.warning(f"Model {model_name} ({model_type}) failed to load: {e}")
 
982
 
983
  # Try fallback
984
  if fallback_type:
 
991
  filename=None
992
  )
993
  if fallback_model.load():
994
+ fallback_reason = f"Primary model {model_name} ({model_type}) failed to load"
 
 
995
  return fallback_model, fallback_model_name, fallback_type, True, fallback_reason
996
  except Exception as e:
997
  logger.error(f"Fallback model also failed: {e}")
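Note on the hunk above: after the revert, `load_model_with_fallback` no longer records the primary error and simply reports a generic fallback reason. The pattern can be sketched in isolation as below. This is a minimal sketch, not the project's code: `load_with_fallback` and the `loader` callable are hypothetical stand-ins for the unified model manager, which is not shown in full in this diff.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger(__name__)


def load_with_fallback(
    primary: str,
    fallback: Optional[str],
    loader: Callable[[str], object],
) -> Tuple[object, str, bool, Optional[str]]:
    """Try the primary model, then the fallback.

    Returns (model, name_used, used_fallback, fallback_reason).
    `loader` is a hypothetical callable that returns a model or None.
    """
    try:
        model = loader(primary)
        if model is not None:
            return model, primary, False, None
    except Exception as exc:
        logger.warning("Model %s failed to load: %s", primary, exc)

    if fallback:
        try:
            model = loader(fallback)
            if model is not None:
                # Generic reason, mirroring the reverted behaviour.
                reason = f"Primary model {primary} failed to load"
                return model, fallback, True, reason
        except Exception as exc:
            logger.error("Fallback model also failed: %s", exc)

    raise RuntimeError(f"Could not load {primary} or fallback {fallback}")
```

The reason string is built only on the fallback path, so callers can treat `None` as "primary model in use".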
 
1084
  try:
1085
  response = requests.post(
1086
  ehr_url,
1087
+ json={"patientid": patientid},
1088
+ headers=headers,
1089
  timeout=EHR_TIMEOUT
1090
  )
1091
  logging.info(f"EHR API response status: {response.status_code}")
 
1348
  try:
1349
  # Use extended timeout for GGUF operations on HF Spaces
1350
  is_hf_spaces = os.environ.get('HF_SPACES', 'false').lower() == 'true'
1351
+ timeout_value = timeout_config.get("gguf_extended_timeout" if is_hf_spaces else "gguf_timeout", 600)
1352
 
1353
  if cache_key not in GGUF_PIPELINE_CACHE:
1354
  if job_id:
 
1584
  try:
1585
  raw_summary = await asyncio.wait_for(
1586
  generate_with_progress(),
1587
+ timeout=timeout_config.get("generation_timeout", 600)
1588
  )
1589
  except asyncio.TimeoutError:
1590
+ error_msg = f"Text generation timed out after {timeout_config.get('generation_timeout', 600)} seconds"
1591
  log_error_with_context(Exception(error_msg), "Text generation timeout", job_id)
1592
  update_job_with_error(job_id, error_msg, "generation_timeout")
1593
  raise Exception(error_msg)
 
1663
  try:
1664
  result_sum = await asyncio.wait_for(
1665
  asyncio.to_thread(model.generate, context, config),
1666
+ timeout=timeout_config.get("generation_timeout", 600)
1667
  )
1668
  except asyncio.TimeoutError:
1669
+ error_msg = f"Summarization timed out after {timeout_config.get('generation_timeout', 600)} seconds"
1670
  log_error_with_context(Exception(error_msg), "Summarization timeout", job_id)
1671
  update_job_with_error(job_id, error_msg, "generation_timeout")
1672
  raise Exception(error_msg)
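Note on the timeout hunks above: both wrap a blocking generation call in `asyncio.wait_for`, with the default dropping from 1200 to 600 seconds. A minimal sketch of that pattern follows, assuming Python 3.9+ for `asyncio.to_thread`; `slow_generate` and `generate_with_timeout` are hypothetical names used only for illustration.

```python
import asyncio
import time


def slow_generate(prompt: str, delay: float) -> str:
    """Hypothetical blocking model call."""
    time.sleep(delay)
    return f"summary of {prompt}"


async def generate_with_timeout(prompt: str, delay: float, timeout: float) -> str:
    """Run the blocking call in a worker thread and bound the wait."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(slow_generate, prompt, delay),
            timeout=timeout,
        )
    except asyncio.TimeoutError as exc:
        # The worker thread is not interrupted; only the await is abandoned.
        raise RuntimeError(
            f"Summarization timed out after {timeout} seconds"
        ) from exc
```

One consequence worth keeping in mind: a timed-out generation keeps occupying its thread until the underlying call returns, so a shorter timeout frees the request but not necessarily the compute.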
 
1777
  temperature=0.1,
1778
  top_p=0.5,
1779
  ),
1780
+ timeout=600
1781
  )
1782
  else:
1783
  config = create_generation_config(data, min_tokens=100, temperature=0.1, top_p=0.5)
 
1827
  if "timeout" in error_str.lower():
1828
  error_category = "TIMEOUT"
1829
  # Enhanced timeout message with recommendations
1830
+ user_message = f"""Summary generation timed out after {timeout_config.get('generation_timeout', 600)} seconds.
1831
 
1832
  This may be due to:
1833
  - Large patient dataset requiring more processing time
 
1952
  ehr_url,
1953
  json={"patientid": patientid},
1954
  headers=headers,
1955
+ timeout=600
1956
  )
1957
  if response.status_code == 200:
1958
  sample_data = response.json()
 
2417
  border-radius: 20px;
2418
  padding: 40px;
2419
  box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
2420
+ max-width: 600px;
2421
  width: 100%;
2422
  animation: fadeIn 0.5s ease-in;
2423
  }
 
2433
  padding: 8px 16px;
2434
  border-radius: 20px;
2435
  font-size: 14px;
2436
+ font-weight: 600;
2437
  margin-bottom: 20px;
2438
  }
2439
  .status-dot {
 
2466
  }
2467
  .info-title {
2468
  color: #374151;
2469
+ font-weight: 600;
2470
  margin-bottom: 15px;
2471
  font-size: 18px;
2472
  }
 
2491
  padding: 4px 8px;
2492
  border-radius: 4px;
2493
  font-size: 12px;
2494
+ font-weight: 600;
2495
  margin-right: 10px;
2496
  min-width: 50px;
2497
  text-align: center;
 
2512
  .link {
2513
  color: #667eea;
2514
  text-decoration: none;
2515
+ font-weight: 600;
2516
  }
2517
  .link:hover {
2518
  text-decoration: underline;
 
2704
  """Wait for slot and then process."""
2705
  try:
2706
  # Wait for processing slot
2707
+ if queue_manager.wait_for_slot(request_id, timeout=600):
2708
  # Update job status to show processing started
2709
  job_manager.update_job(job_id, JOB_STATUS["STARTED"], progress=5, data={'message': 'Processing slot acquired, starting generation...'})
2710
  # Start background task with optimized generation
 
2733
  'X-Content-Type-Options': 'nosniff',
2734
  'Access-Control-Allow-Origin': '*',
2735
  'Access-Control-Allow-Headers': 'Cache-Control, Connection',
2736
+ 'Keep-Alive': 'timeout=3600',
2737
  # Force HTTP/1.1 to avoid HTTP/2 protocol errors
2738
  'X-Protocol': 'HTTP/1.1'
2739
  }
 
2790
  """Wait for slot and then process."""
2791
  try:
2792
  # Wait for processing slot
2793
+ if queue_manager.wait_for_slot(request_id, timeout=600):
2794
  # Update job status to show processing started
2795
  job_manager.update_job(job_id, JOB_STATUS["STARTED"], progress=5, data={'message': 'Processing slot acquired, starting generation...'})
2796
  # Start background task with optimized generation
 
2819
  'X-Content-Type-Options': 'nosniff',
2820
  'Access-Control-Allow-Origin': '*',
2821
  'Access-Control-Allow-Headers': 'Cache-Control, Connection',
2822
+ 'Keep-Alive': 'timeout=3600',
2823
  # Force HTTP/1.1 to avoid HTTP/2 protocol errors
2824
  'X-Protocol': 'HTTP/1.1'
2825
  }
 
2898
  """Wait for slot and then process."""
2899
  try:
2900
  # Wait for processing slot
2901
+ if queue_manager.wait_for_slot(request_id, timeout=600):
2902
  # Update job status to show processing started
2903
  job_manager.update_job(job_id, JOB_STATUS["STARTED"], progress=5, data={'message': 'Processing slot acquired, starting generation...'})
2904
  # Start background task directly (not in separate thread to avoid nesting)
 
2928
  'X-Content-Type-Options': 'nosniff',
2929
  'Access-Control-Allow-Origin': '*',
2930
  'Access-Control-Allow-Headers': 'Cache-Control, Connection',
2931
+ 'Keep-Alive': 'timeout=3600',
2932
  # Force HTTP/1.1 to avoid HTTP/2 protocol errors
2933
  'X-Protocol': 'HTTP/1.1'
2934
  }
services/ai-service/src/ai_med_extract/app.py CHANGED
@@ -764,7 +764,7 @@ def run_dev(host: str = "0.0.0.0", port: int = 7860, debug: bool = False):
764
  # Initialize agents in dev run (preload small models)
765
  initialize_agents(app, preload_small_models=True)
766
  print("Agents initialized, starting uvicorn")
767
- uvicorn.run(app, host=host, port=port, reload=debug, timeout_keep_alive=1200)
768
 
769
 
770
  if __name__ == "__main__":
 
764
  # Initialize agents in dev run (preload small models)
765
  initialize_agents(app, preload_small_models=True)
766
  print("Agents initialized, starting uvicorn")
767
+ uvicorn.run(app, host=host, port=port, reload=debug, timeout_keep_alive=600)
768
 
769
 
770
  if __name__ == "__main__":
services/ai-service/src/ai_med_extract/config/performance_config.py CHANGED
@@ -19,7 +19,7 @@ class PerformanceConfig:
19
 
20
  # Caching
21
  enable_caching: bool = True
22
- cache_ttl_seconds: int = 31200
23
  max_cache_size: int = 1000
24
  enable_multi_level_cache: bool = True
25
 
@@ -65,7 +65,7 @@ class PerformanceConfig:
65
 
66
  # Caching
67
  enable_caching=os.environ.get('ENABLE_CACHING', 'true').lower() == 'true',
68
- cache_ttl_seconds=int(os.environ.get('CACHE_TTL_SECONDS', '31200')),
69
  max_cache_size=int(os.environ.get('MAX_CACHE_SIZE', '1000')),
70
  enable_multi_level_cache=os.environ.get('ENABLE_MULTI_LEVEL_CACHE', 'true').lower() == 'true',
71
 
 
19
 
20
  # Caching
21
  enable_caching: bool = True
22
+ cache_ttl_seconds: int = 3600
23
  max_cache_size: int = 1000
24
  enable_multi_level_cache: bool = True
25
 
 
65
 
66
  # Caching
67
  enable_caching=os.environ.get('ENABLE_CACHING', 'true').lower() == 'true',
68
+ cache_ttl_seconds=int(os.environ.get('CACHE_TTL_SECONDS', '3600')),
69
  max_cache_size=int(os.environ.get('MAX_CACHE_SIZE', '1000')),
70
  enable_multi_level_cache=os.environ.get('ENABLE_MULTI_LEVEL_CACHE', 'true').lower() == 'true',
71
 
services/ai-service/src/ai_med_extract/enable_optimizations.py CHANGED
@@ -24,7 +24,7 @@ def enable_all_optimizations():
24
 
25
  # Caching
26
  'ENABLE_CACHING': 'true',
27
- 'CACHE_TTL_SECONDS': '31200',
28
  'MAX_CACHE_SIZE': '1000',
29
  'ENABLE_MULTI_LEVEL_CACHE': 'true',
30
 
@@ -85,7 +85,7 @@ def get_optimization_status() -> Dict[str, Any]:
85
  },
86
  "caching_optimizations": {
87
  "enabled": os.environ.get('ENABLE_CACHING', 'true'),
88
- "ttl_seconds": os.environ.get('CACHE_TTL_SECONDS', '31200'),
89
  "max_size": os.environ.get('MAX_CACHE_SIZE', '1000'),
90
  },
91
  "async_optimizations": {
 
24
 
25
  # Caching
26
  'ENABLE_CACHING': 'true',
27
+ 'CACHE_TTL_SECONDS': '3600',
28
  'MAX_CACHE_SIZE': '1000',
29
  'ENABLE_MULTI_LEVEL_CACHE': 'true',
30
 
 
85
  },
86
  "caching_optimizations": {
87
  "enabled": os.environ.get('ENABLE_CACHING', 'true'),
88
+ "ttl_seconds": os.environ.get('CACHE_TTL_SECONDS', '3600'),
89
  "max_size": os.environ.get('MAX_CACHE_SIZE', '1000'),
90
  },
91
  "async_optimizations": {
services/ai-service/src/ai_med_extract/inference_service.py CHANGED
@@ -140,7 +140,7 @@ class InferenceService:
140
  loop = asyncio.get_event_loop()
141
 
142
  # Optimize chunk size based on text length
143
- chunk_size = 8000 if len(text) > 112000 else 12000
144
 
145
  if len(text) > chunk_size:
146
  chunks = self._split_chunks(text, chunk_size)
 
140
  loop = asyncio.get_event_loop()
141
 
142
  # Optimize chunk size based on text length
143
+ chunk_size = 8000 if len(text) > 16000 else 12000
144
 
145
  if len(text) > chunk_size:
146
  chunks = self._split_chunks(text, chunk_size)
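Note on the hunk above: the revert restores the 16000-character threshold, so texts longer than that are split into 8000-character chunks and shorter ones use 12000. The body of `_split_chunks` is not part of this diff; the sketch below assumes plain fixed-width slicing, and all three function names are hypothetical.

```python
from typing import List


def choose_chunk_size(text: str) -> int:
    """Smaller chunks for long inputs, as in the restored line."""
    return 8000 if len(text) > 16000 else 12000


def split_chunks(text: str, chunk_size: int) -> List[str]:
    """Assumed behaviour: fixed-width slices with no overlap."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def plan_chunks(text: str) -> List[str]:
    """Split only when the text exceeds the chosen chunk size."""
    size = choose_chunk_size(text)
    return split_chunks(text, size) if len(text) > size else [text]
```

Under these assumptions a 14000-character text yields two chunks (12000 + 2000) and a 20000-character text yields three (8000 + 8000 + 4000).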
services/ai-service/src/ai_med_extract/phi_scrubber_service.py CHANGED
@@ -60,7 +60,7 @@ class PHIScrubberService:
60
  r = redis.from_url(settings.REDIS_URL, decode_responses=True)
61
  await r.hincrby(key, "events", 1)
62
  await r.hincrby(key, "found", len(m))
63
- await r.expire(key, 7*24*31200)
64
  except Exception:
65
  pass
66
  return {"original_length": len(text), "scrubbed_length": len(scrubbed), "total_phi_found": len(m), "phi_types": phi_types, "scrubbed_text": scrubbed}
 
60
  r = redis.from_url(settings.REDIS_URL, decode_responses=True)
61
  await r.hincrby(key, "events", 1)
62
  await r.hincrby(key, "found", len(m))
63
+ await r.expire(key, 7*24*3600)
64
  except Exception:
65
  pass
66
  return {"original_length": len(text), "scrubbed_length": len(scrubbed), "total_phi_found": len(m), "phi_types": phi_types, "scrubbed_text": scrubbed}
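Note on the hunk above: `7*24*3600` is the usual way to write seven days in seconds, whereas the replaced `7*24*31200` came to roughly 60 days. A short check of the arithmetic, with hypothetical constant names:

```python
from datetime import timedelta

SECONDS_PER_HOUR = 3600

# Restored expiry: 7 days expressed in seconds.
restored_ttl = 7 * 24 * SECONDS_PER_HOUR

# Replaced expiry, converted to whole days for comparison.
replaced_ttl = 7 * 24 * 31200
replaced_days = replaced_ttl // (24 * SECONDS_PER_HOUR)

assert restored_ttl == int(timedelta(days=7).total_seconds())
```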
services/ai-service/src/ai_med_extract/services/job_manager.py CHANGED
@@ -29,7 +29,7 @@ class JobManager:
29
  """Initialize the job manager with in-memory storage."""
30
  self._jobs: Dict[str, Dict[str, Any]] = {}
31
  self._lock = threading.RLock() # Reentrant lock for nested calls
32
- self._cleanup_interval = 31200 # 1 hour
33
  self._max_job_age = 7200 # 2 hours
34
 
35
  def create_job(self, request_id: Optional[str] = None, initial_data: Optional[Dict] = None) -> str:
 
29
  """Initialize the job manager with in-memory storage."""
30
  self._jobs: Dict[str, Dict[str, Any]] = {}
31
  self._lock = threading.RLock() # Reentrant lock for nested calls
32
+ self._cleanup_interval = 3600 # 1 hour
33
  self._max_job_age = 7200 # 2 hours
34
 
35
  def create_job(self, request_id: Optional[str] = None, initial_data: Optional[Dict] = None) -> str:
services/ai-service/src/ai_med_extract/services/request_queue.py CHANGED
@@ -229,7 +229,7 @@ class RequestQueueManager:
229
  ]
230
  }
231
 
232
- def cleanup_old_requests(self, max_age: int = 31200) -> int:
233
  """
234
  Clean up old requests from tracking.
235
 
@@ -289,7 +289,7 @@ def get_queue_manager() -> RequestQueueManager:
289
  _queue_manager = RequestQueueManager(
290
  max_concurrent=6,
291
  max_queue_size=6,
292
- queue_timeout=1200
293
  )
294
  logger.info("Initialized RequestQueueManager for Hugging Face Spaces (T4 medium)")
295
  else:
@@ -297,7 +297,7 @@ def get_queue_manager() -> RequestQueueManager:
297
  _queue_manager = RequestQueueManager(
298
  max_concurrent=4,
299
  max_queue_size=20,
300
- queue_timeout=1200
301
  )
302
  logger.info("Initialized RequestQueueManager for local/development")
303
 
 
229
  ]
230
  }
231
 
232
+ def cleanup_old_requests(self, max_age: int = 3600) -> int:
233
  """
234
  Clean up old requests from tracking.
235
 
 
289
  _queue_manager = RequestQueueManager(
290
  max_concurrent=6,
291
  max_queue_size=6,
292
+ queue_timeout=600
293
  )
294
  logger.info("Initialized RequestQueueManager for Hugging Face Spaces (T4 medium)")
295
  else:
 
297
  _queue_manager = RequestQueueManager(
298
  max_concurrent=4,
299
  max_queue_size=20,
300
+ queue_timeout=600
301
  )
302
  logger.info("Initialized RequestQueueManager for local/development")
303
 
services/ai-service/src/ai_med_extract/utils/__pycache__/model_config.cpython-311.pyc CHANGED
Binary files a/services/ai-service/src/ai_med_extract/utils/__pycache__/model_config.cpython-311.pyc and b/services/ai-service/src/ai_med_extract/utils/__pycache__/model_config.cpython-311.pyc differ
 
services/ai-service/src/ai_med_extract/utils/__pycache__/openvino_summarizer_utils.cpython-311.pyc CHANGED
Binary files a/services/ai-service/src/ai_med_extract/utils/__pycache__/openvino_summarizer_utils.cpython-311.pyc and b/services/ai-service/src/ai_med_extract/utils/__pycache__/openvino_summarizer_utils.cpython-311.pyc differ
 
services/ai-service/src/ai_med_extract/utils/__pycache__/performance_monitor.cpython-311.pyc CHANGED
Binary files a/services/ai-service/src/ai_med_extract/utils/__pycache__/performance_monitor.cpython-311.pyc and b/services/ai-service/src/ai_med_extract/utils/__pycache__/performance_monitor.cpython-311.pyc differ
 
services/ai-service/src/ai_med_extract/utils/constants.py CHANGED
@@ -24,39 +24,39 @@ CHUNK_SIZE_DAYS = 90 # Days per chunk for date-based chunking
24
  # ========== TIMEOUT CONFIGURATION ==========
25
  TIMEOUT_CONFIG = {
26
  "fast": {
27
- "ehr_timeout": 1200,
28
- "generation_timeout": 1200,
29
- "gguf_timeout": 1200,
30
- "gguf_extended_timeout": 1200,
31
  "retry_attempts": 2
32
  },
33
  "normal": {
34
- "ehr_timeout": 1200,
35
- "generation_timeout": 1200,
36
- "gguf_timeout": 1200,
37
- "gguf_extended_timeout": 1200,
38
  "retry_attempts": 3
39
  },
40
  "extended": {
41
- "ehr_timeout": 1200,
42
- "generation_timeout": 1200,
43
- "gguf_timeout": 1200,
44
- "gguf_extended_timeout": 1200,
45
  "retry_attempts": 3
46
  },
47
  "large_data": {
48
- "ehr_timeout": 1200,
49
- "generation_timeout": 1200,
50
- "gguf_timeout": 1200,
51
- "gguf_extended_timeout": 1200,
52
  "retry_attempts": 2
53
  }
54
  }
55
 
56
  # ========== SSE STREAMING CONFIGURATION ==========
57
  SSE_CONFIG = {
58
- "max_wait_time": 31200, # 60 minutes max wait time for normal operations
59
- "extended_max_wait_time": 31200, # 60 minutes extended wait for GGUF/long operations
60
  "heartbeat_interval": 5, # Send heartbeat every 5 seconds
61
  "normal_heartbeat_interval": 10, # Normal heartbeat interval
62
  "poll_interval": 1, # Check job status every second
@@ -65,7 +65,7 @@ SSE_CONFIG = {
65
 
66
  # ========== CACHE CONFIGURATION ==========
67
  CACHE_CONFIG = {
68
- "ttl_seconds": 31200, # 1 hour
69
  "cache_dir": "/tmp/summary_cache",
70
  "max_cache_size": 100
71
  }
@@ -89,7 +89,7 @@ MEMORY_CONFIG = {
89
  "enable_quantization": True,
90
  "cache_models": True,
91
  "cleanup_interval": 300, # 5 minutes
92
- "max_memory_mb": 12000,
93
  "memory_pressure_threshold": 0.8,
94
  "aggressive_cleanup_threshold": 0.9
95
  }
 
24
  # ========== TIMEOUT CONFIGURATION ==========
25
  TIMEOUT_CONFIG = {
26
  "fast": {
27
+ "ehr_timeout": 600,
28
+ "generation_timeout": 600,
29
+ "gguf_timeout": 600,
30
+ "gguf_extended_timeout": 600,
31
  "retry_attempts": 2
32
  },
33
  "normal": {
34
+ "ehr_timeout": 600,
35
+ "generation_timeout": 600,
36
+ "gguf_timeout": 600,
37
+ "gguf_extended_timeout": 600,
38
  "retry_attempts": 3
39
  },
40
  "extended": {
41
+ "ehr_timeout": 600,
42
+ "generation_timeout": 600,
43
+ "gguf_timeout": 600,
44
+ "gguf_extended_timeout": 600,
45
  "retry_attempts": 3
46
  },
47
  "large_data": {
48
+ "ehr_timeout": 600,
49
+ "generation_timeout": 600,
50
+ "gguf_timeout": 600,
51
+ "gguf_extended_timeout": 600,
52
  "retry_attempts": 2
53
  }
54
  }
55
 
56
  # ========== SSE STREAMING CONFIGURATION ==========
57
  SSE_CONFIG = {
58
+ "max_wait_time": 3600, # 60 minutes max wait time for normal operations
59
+ "extended_max_wait_time": 3600, # 60 minutes extended wait for GGUF/long operations
60
  "heartbeat_interval": 5, # Send heartbeat every 5 seconds
61
  "normal_heartbeat_interval": 10, # Normal heartbeat interval
62
  "poll_interval": 1, # Check job status every second
 
65
 
66
  # ========== CACHE CONFIGURATION ==========
67
  CACHE_CONFIG = {
68
+ "ttl_seconds": 3600, # 1 hour
69
  "cache_dir": "/tmp/summary_cache",
70
  "max_cache_size": 100
71
  }
 
89
  "enable_quantization": True,
90
  "cache_models": True,
91
  "cleanup_interval": 300, # 5 minutes
92
+ "max_memory_mb": 6000,
93
  "memory_pressure_threshold": 0.8,
94
  "aggressive_cleanup_threshold": 0.9
95
  }
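Note on `TIMEOUT_CONFIG` above: every profile now carries 600 seconds for all four timeouts, and `routes_fastapi.py` picks the GGUF key based on the `HF_SPACES` environment variable. A minimal sketch of that lookup, assuming the profile dictionary is selected by name; `gguf_timeout` is a hypothetical helper and only the `normal` profile is reproduced.

```python
import os
from typing import Mapping, Optional

TIMEOUT_CONFIG = {
    "normal": {
        "ehr_timeout": 600,
        "generation_timeout": 600,
        "gguf_timeout": 600,
        "gguf_extended_timeout": 600,
        "retry_attempts": 3,
    },
}


def gguf_timeout(
    profile: str = "normal",
    env: Optional[Mapping[str, str]] = None,
    default: int = 600,
) -> int:
    """Return the GGUF timeout for a profile, extended on HF Spaces."""
    env = os.environ if env is None else env
    is_hf_spaces = env.get("HF_SPACES", "false").lower() == "true"
    timeout_config = TIMEOUT_CONFIG.get(profile, {})
    key = "gguf_extended_timeout" if is_hf_spaces else "gguf_timeout"
    return timeout_config.get(key, default)
```

Since the extended and regular values are identical after this change, the `HF_SPACES` branch currently has no effect on the result; the distinction only matters if the two keys diverge again.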
services/ai-service/src/ai_med_extract/utils/hf_spaces_config.py CHANGED
@@ -65,7 +65,7 @@ TIMEOUT_SETTINGS = {
65
  "model_loading_timeout": 300, # 5 minutes for model loading
66
  "inference_timeout": 120, # 2 minutes for inference
67
  "ehr_fetch_timeout": 30, # 30 seconds for EHR fetch
68
- "streaming_timeout": 1200 # 10 minutes for streaming responses
69
  }
70
 
71
  def get_optimized_model(model_type: str) -> str:
 
65
  "model_loading_timeout": 300, # 5 minutes for model loading
66
  "inference_timeout": 120, # 2 minutes for inference
67
  "ehr_fetch_timeout": 30, # 30 seconds for EHR fetch
68
+ "streaming_timeout": 600 # 10 minutes for streaming responses
69
  }
70
 
71
  def get_optimized_model(model_type: str) -> str:
services/ai-service/src/ai_med_extract/utils/model_config.py CHANGED
@@ -16,14 +16,10 @@ T4_OPTIMIZATIONS = {
16
  "torch_dtype": "float16",
17
  "device_map": "auto",
18
  "trust_remote_code": True,
19
- # Note: cache_dir removed from here - it should be passed to pipeline() directly,
20
- # not in model_kwargs, to avoid "not used by the model" errors during generation
21
  "local_files_only": False
22
  }
23
 
24
- # T4 cache directory (separate from model_kwargs to avoid generation errors)
25
- T4_CACHE_DIR = "/tmp/hf_cache"
26
-
27
  # Model generation settings optimized for T4
28
  GENERATION_CONFIG = {
29
  "use_cache": True,
@@ -43,18 +39,18 @@ GENERATION_CONFIG = {
43
  # T4-optimized default models (smaller, efficient models)
44
  DEFAULT_MODELS = {
45
  "text-generation": {
46
- "primary": "microsoft/Phi-3-mini-4k-instruct", # Robust 4k context model
47
- "fallback": "microsoft/Phi-3-mini-4k-instruct",
48
  "description": "Text generation models for QA and medical data extraction"
49
  },
50
  "summarization": {
51
- "primary": "microsoft/Phi-3-mini-4k-instruct", # Use Phi-3 for summarization too (better context)
52
- "fallback": "facebook/bart-large-cnn",
53
  "description": "Text summarization models for medical reports"
54
  },
55
  "seq2seq": {
56
- "primary": "facebook/bart-large-cnn", # Better seq2seq default
57
- "fallback": "google/flan-t5-base",
58
  "description": "Seq2Seq models for summarization tasks"
59
  },
60
  "ner": {
@@ -260,7 +256,6 @@ def is_model_supported_on_t4(model_name: str, model_type: str) -> bool:
260
  "patrickvonplaten/longformer2roberta-cnn_dailymail-fp16",
261
  # Phi-3 models
262
  "microsoft/Phi-3-mini-4k-instruct",
263
- "microsoft/Phi-3-mini-128k-instruct",
264
  "microsoft/Phi-3-mini-4k-instruct-GGUF",
265
  "microsoft/Phi-3-mini-4k-instruct-gguf",
266
  "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov",
 
16
  "torch_dtype": "float16",
17
  "device_map": "auto",
18
  "trust_remote_code": True,
19
+ "cache_dir": "/tmp/hf_cache",
 
20
  "local_files_only": False
21
  }
22
 
 
 
 
23
  # Model generation settings optimized for T4
24
  GENERATION_CONFIG = {
25
  "use_cache": True,
 
39
  # T4-optimized default models (smaller, efficient models)
40
  DEFAULT_MODELS = {
41
  "text-generation": {
42
+ "primary": "microsoft/DialoGPT-small", # Lightweight conversational model
43
+ "fallback": "facebook/bart-base",
44
  "description": "Text generation models for QA and medical data extraction"
45
  },
46
  "summarization": {
47
+ "primary": "sshleifer/distilbart-cnn-6-6", # Smaller BART variant
48
+ "fallback": "facebook/bart-base",
49
  "description": "Text summarization models for medical reports"
50
  },
51
  "seq2seq": {
52
+ "primary": "sshleifer/distilbart-cnn-6-6", # Same as summarization for consistency
53
+ "fallback": "facebook/bart-base",
54
  "description": "Seq2Seq models for summarization tasks"
55
  },
56
  "ner": {
 
256
  "patrickvonplaten/longformer2roberta-cnn_dailymail-fp16",
257
  # Phi-3 models
258
  "microsoft/Phi-3-mini-4k-instruct",
 
259
  "microsoft/Phi-3-mini-4k-instruct-GGUF",
260
  "microsoft/Phi-3-mini-4k-instruct-gguf",
261
  "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov",
services/ai-service/src/ai_med_extract/utils/openvino_summarizer_utils.py CHANGED
@@ -238,7 +238,7 @@ def delta_to_text(delta):
238
  from concurrent.futures import ThreadPoolExecutor, as_completed
239
  import threading
240
 
241
- def generate_section(pipeline, prompt, section_name, timeout=1200):
242
  """Generate one section with timeout protection."""
243
  try:
244
  # If your pipeline supports timeout, pass it. Otherwise, wrap in future.
 
238
  from concurrent.futures import ThreadPoolExecutor, as_completed
239
  import threading
240
 
241
+ def generate_section(pipeline, prompt, section_name, timeout=600):
242
  """Generate one section with timeout protection."""
243
  try:
244
  # If your pipeline supports timeout, pass it. Otherwise, wrap in future.
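Review note: the comment in `generate_section` says to wrap the call in a future when the pipeline has no timeout parameter of its own. A minimal sketch of that pattern follows, assuming a hypothetical helper `run_with_timeout` and a stand-in `slow_section` callable; neither name comes from this repository, and the real function body is not shown in this hunk.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_with_timeout(fn, *args, timeout=600, default=None):
    """Run fn(*args) in a worker thread; return default after timeout seconds.

    Hypothetical helper for illustration only. The worker is not killed on
    timeout; it keeps running in the background until fn returns.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        return default
    finally:
        # Do not block here waiting for a timed-out worker to finish.
        executor.shutdown(wait=False)


def slow_section(prompt):
    """Stand-in for a pipeline call (assumption, not the real pipeline)."""
    time.sleep(0.3)
    return f"summary of {prompt}"
```

Because the worker thread cannot be cancelled once it has started, a shorter `timeout` only bounds how long the caller waits, not how long the model keeps computing.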
services/ai-service/src/ai_med_extract/utils/performance_monitor.py CHANGED
@@ -76,7 +76,7 @@ class PerformanceMonitor:
76
  class RobustParsingCache:
77
  """Intelligent caching system for robust JSON parsing operations."""
78
 
79
- def __init__(self, cache_dir: str = "/tmp/medical_ai_cache", ttl: int = 31200):
80
  self.cache_dir = cache_dir
81
  self.ttl = ttl # Time to live in seconds
82
  os.makedirs(cache_dir, exist_ok=True)
 
76
  class RobustParsingCache:
77
  """Intelligent caching system for robust JSON parsing operations."""
78
 
79
+ def __init__(self, cache_dir: str = "/tmp/medical_ai_cache", ttl: int = 3600):
80
  self.cache_dir = cache_dir
81
  self.ttl = ttl # Time to live in seconds
82
  os.makedirs(cache_dir, exist_ok=True)
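Review note: this hunk only changes the default `ttl` from 31200 to 3600 seconds; how the cache applies it is not visible here. One common approach for a file-based cache is to compare each entry's modification time against the TTL. The sketch below assumes that approach, and `is_fresh` is a hypothetical helper, not part of `RobustParsingCache`.

```python
import os
import time


def is_fresh(path, ttl, now=None):
    """Return True if the file at path was modified within the last ttl seconds.

    Hypothetical helper: assumes cache entries are plain files whose mtime
    marks when they were written.
    """
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) <= ttl
```

Under this assumption, an entry written two hours ago would still be served with `ttl=31200` but would be treated as expired with `ttl=3600`.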
services/ai-service/src/ai_med_extract/utils/unified_model_manager.py CHANGED
@@ -55,7 +55,6 @@ class ModelInfo:
55
  load_time: float
56
  last_used: float
57
  error_message: Optional[str] = None
58
- fallback_reason: Optional[str] = None
59
 
60
  @dataclass
61
  class GenerationConfig:
@@ -91,22 +90,12 @@ class BaseModel(ABC):
91
  self._load_time = 0.0
92
  self._last_used = time.time()
93
  self._error_message = None
94
- self._fallback_reason = None
95
  self._memory_usage = 0.0
96
  self._kwargs = kwargs
97
 
98
  @property
99
  def status(self) -> ModelStatus:
100
  return self._status
101
-
102
- @property
103
- def fallback_reason(self) -> Optional[str]:
104
- """Get the reason why this model is a fallback, if applicable"""
105
- return self._fallback_reason
106
-
107
- def set_fallback_reason(self, reason: str):
108
- """Set the fallback reason for this model"""
109
- self._fallback_reason = reason
110
 
111
  @abstractmethod
112
  def _load_implementation(self) -> bool:
@@ -143,11 +132,7 @@ class BaseModel(ABC):
143
  except Exception as e:
144
  self._status = ModelStatus.ERROR
145
  self._error_message = str(e)
146
- error_details = f"Load failed: {type(e).__name__}: {str(e)}"
147
  logger.error(f"Failed to load model {self.name}: {e}")
148
- # Store detailed error for fallback tracking
149
- if self.model_type == "fallback":
150
- self._fallback_reason = error_details
151
  return None
152
 
153
  def _update_memory_usage(self):
@@ -184,47 +169,9 @@ class TransformersModel(BaseModel):
184
  def _load_implementation(self) -> bool:
185
  try:
186
  from transformers import pipeline
187
- import os
188
 
189
  # Get T4-optimized kwargs
190
  model_kwargs = get_t4_model_kwargs(self.model_type)
191
-
192
- # Prepare pipeline kwargs to avoid duplicate arguments
193
- pipeline_kwargs = self._kwargs.copy()
194
-
195
- # Move trust_remote_code from model_kwargs to pipeline_kwargs if present
196
- # This prevents "multiple values for keyword argument" error
197
- if "trust_remote_code" in model_kwargs:
198
- pipeline_kwargs["trust_remote_code"] = model_kwargs.pop("trust_remote_code")
199
-
200
- # Set cache directory via environment variable (safest approach)
201
- # This ensures it's only used during from_pretrained(), not passed to generate()
202
- if not IS_T4_MEDIUM:
203
- # Local environment
204
- cache_dir = os.environ.get('HF_HOME', os.path.join(os.path.expanduser('~'), '.cache', 'huggingface'))
205
- os.environ['HF_HOME'] = cache_dir
206
- else:
207
- # T4 environment
208
- from .model_config import T4_CACHE_DIR
209
- os.environ['HF_HOME'] = T4_CACHE_DIR
210
-
211
- # Ensure trust_remote_code is True for local runs (required for Phi-3 etc)
212
- if not IS_T4_MEDIUM:
213
- pipeline_kwargs["trust_remote_code"] = True
214
- # Force eager attention implementation to avoid Triton dependency on Windows
215
- # This helps with "No module named 'triton'" errors for some models
216
- # Add to model_kwargs instead of pipeline_kwargs to prevent it from being passed to generate()
217
- model_kwargs["attn_implementation"] = "eager"
218
-
219
- # Force using latest model revision to avoid cache compatibility issues
220
- # This prevents "DynamicCache has no attribute get_max_length" errors
221
- pipeline_kwargs["revision"] = "main"
222
-
223
- # CRITICAL FIX: Disable use_cache for Phi-3 models to avoid DynamicCache compatibility issues
224
- # The cached Phi-3 model code may use get_max_length() which doesn't exist in newer DynamicCache
225
- # We disable cache during loading to force fresh generation without cache issues
226
- if "phi-3" in self.name.lower() or "phi3" in self.name.lower():
227
- model_kwargs["use_cache"] = False
228
 
229
  # Handle different model types for summarization
230
  if self.model_type.lower() in ["summarization", "seq2seq"]:
@@ -234,7 +181,7 @@ class TransformersModel(BaseModel):
234
  model=self.name,
235
  device_map="auto" if torch.cuda.is_available() else None,
236
  model_kwargs=model_kwargs,
237
- **pipeline_kwargs
238
  )
239
  elif self.model_type.lower() in ["text-generation", "causal-lm"]:
240
  # Text generation models
@@ -243,7 +190,7 @@ class TransformersModel(BaseModel):
243
  model=self.name,
244
  device_map="auto" if torch.cuda.is_available() else None,
245
  model_kwargs=model_kwargs,
246
- **pipeline_kwargs
247
  )
248
  elif "bart" in self.name.lower() or "t5" in self.name.lower():
249
  # BART and T5 models default to summarization
@@ -252,7 +199,7 @@ class TransformersModel(BaseModel):
252
  model=self.name,
253
  device_map="auto" if torch.cuda.is_available() else None,
254
  model_kwargs=model_kwargs,
255
- **pipeline_kwargs
256
  )
257
  elif "longformer" in self.name.lower():
258
  # Longformer models work with summarization pipeline
@@ -261,7 +208,7 @@ class TransformersModel(BaseModel):
261
  model=self.name,
262
  device_map="auto" if torch.cuda.is_available() else None,
263
  model_kwargs=model_kwargs,
264
- **pipeline_kwargs
265
  )
266
  else:
267
  # Default to text-generation for unknown types
@@ -270,7 +217,7 @@ class TransformersModel(BaseModel):
270
  model=self.name,
271
  device_map="auto" if torch.cuda.is_available() else None,
272
  model_kwargs=model_kwargs,
273
- **pipeline_kwargs
274
  )
275
 
276
  return True
@@ -297,15 +244,6 @@ class TransformersModel(BaseModel):
297
  "num_return_sequences": 1
298
  }
299
 
300
- # Prepare generation kwargs
301
- gen_kwargs = {}
302
-
303
- # CRITICAL FIX: Disable cache for Phi-3 models to avoid DynamicCache compatibility issues
304
- # The cached Phi-3 model code may use get_max_length() which doesn't exist in newer DynamicCache
305
- if "phi-3" in self.name.lower() or "phi3" in self.name.lower():
306
- gen_kwargs["use_cache"] = False
307
- logger.info(f"Disabled cache for Phi-3 model {self.name} to avoid compatibility issues")
308
-
309
  # Handle different pipeline types
310
  if hasattr(self._model, 'task') and self._model.task == "summarization":
311
  # Summarization pipeline
@@ -316,8 +254,7 @@ class TransformersModel(BaseModel):
316
  temperature=config.temperature,
317
  do_sample=config.temperature > 0.1,
318
  num_beams=4, # Better quality for summarization
319
- early_stopping=True,
320
- **gen_kwargs
321
  )
322
  return result[0]['summary_text'] if result else ""
323
  else:
@@ -329,8 +266,7 @@ class TransformersModel(BaseModel):
329
  top_p=config.top_p,
330
  do_sample=config.temperature > 0.1,
331
  pad_token_id=0,
332
- num_return_sequences=1,
333
- **gen_kwargs
334
  )
335
  generated_text = result[0]['generated_text']
336
  # Remove the prompt from the generated text
@@ -362,120 +298,32 @@ class GGUFModel(BaseModel):
362
  def _load_implementation(self) -> bool:
363
  try:
364
  from llama_cpp import Llama
365
- import os
366
- from pathlib import Path
367
 
368
  # Get T4-optimized kwargs
369
  model_kwargs = get_t4_model_kwargs("gguf")
370
 
371
  # Set up model path - handle different GGUF formats
372
  model_path = self.name
373
- # If model name doesn't end with .gguf, we need to append the filename
374
  if not model_path.endswith('.gguf'):
375
- # If filename is provided separately, combine repo path with filename
376
- if self.filename:
377
- model_path = f"{model_path}/{self.filename}"
378
  else:
379
- # Fallback: try to construct path (shouldn't happen if filename extraction worked)
380
- logger.warning(f"GGUF model {self.name} has no filename specified, using name as-is")
381
-
382
- # Check if model_path is a local file path
383
- # If it doesn't exist and looks like a Hugging Face repo path (contains / but not a file path), download it
384
- is_local_file = os.path.exists(model_path) or (os.path.isabs(model_path) and os.path.sep in model_path)
385
-
386
- if not is_local_file:
387
- # Not a local file - need to download from Hugging Face
388
- try:
389
- from huggingface_hub import hf_hub_download
390
- logger.info(f"Downloading GGUF model from Hugging Face: {self.name}/{self.filename}")
391
-
392
- # Extract repo_id and filename
393
- if '/' in model_path and model_path.endswith('.gguf'):
394
- # Path like "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"
395
- parts = model_path.split('/')
396
- repo_id = '/'.join(parts[:-1])
397
- filename = parts[-1]
398
- elif '/' in model_path:
399
- # Path like "microsoft/Phi-3-mini-4k-instruct-gguf" with separate filename
400
- repo_id = model_path
401
- filename = self.filename or self._extract_filename(self.name)
402
- else:
403
- repo_id = self.name
404
- filename = self.filename or self._extract_filename(self.name)
405
-
406
- # Download from Hugging Face
407
- logger.info(f"Attempting to download: repo_id={repo_id}, filename={filename}")
408
- model_path = hf_hub_download(
409
- repo_id=repo_id,
410
- filename=filename,
411
- cache_dir=os.environ.get('HF_HOME', os.path.join(os.path.expanduser('~'), '.cache', 'huggingface'))
412
- )
413
- logger.info(f"Downloaded GGUF model to: {model_path}")
414
- except Exception as download_error:
415
- import traceback
416
- error_details = f"Download failed: {type(download_error).__name__}: {str(download_error)}"
417
- logger.error(f"Failed to download GGUF model from Hugging Face: {error_details}")
418
- logger.debug(f"Download error traceback:\n{traceback.format_exc()}")
419
- logger.error(f" Repo ID: {repo_id}, Filename: {filename}")
420
- self._error_message = error_details
421
- return False
422
-
423
- # Verify the file exists
424
- if not os.path.exists(model_path):
425
- error_msg = f"GGUF model file does not exist: {model_path}"
426
- logger.error(error_msg)
427
- self._error_message = error_msg
428
- return False
429
-
430
- # Check file size for Q8_0 models (they're larger and might not fit in T4 memory)
431
- try:
432
- file_size_mb = os.path.getsize(model_path) / (1024 * 1024)
433
- logger.info(f"GGUF model file size: {file_size_mb:.2f} MB")
434
-
435
- # Q8_0 models are typically 2x larger than Q4, warn if very large
436
- if "Q8_0" in self.name or "q8_0" in self.name.lower():
437
- if file_size_mb > 8000: # > 8GB might be too large for T4
438
- logger.warning(f"Q8_0 model is {file_size_mb:.2f} MB - may be too large for T4 (16GB total)")
439
- except Exception as size_error:
440
- logger.warning(f"Could not check file size: {size_error}")
441
-
442
- # Adjust context window for 128k models (but limit to available memory)
443
- n_ctx = 8192 # Default T4 context window
444
- if "128k" in self.name.lower():
445
- # 128k models support larger context, but we'll use a reasonable limit for T4
446
- n_ctx = 16384 # Use 16k instead of full 128k to save memory
447
- logger.info(f"Detected 128k model, using context window: {n_ctx}")
448
-
449
- # Adjust GPU layers based on model size (Q8_0 models need fewer GPU layers due to memory)
450
- n_gpu_layers = 35 if torch.cuda.is_available() else 0
451
- if "Q8_0" in self.name or "q8_0" in self.name.lower():
452
- # Reduce GPU layers for larger Q8_0 models to avoid OOM
453
- n_gpu_layers = min(20, n_gpu_layers) if torch.cuda.is_available() else 0
454
- logger.info(f"Q8_0 model detected, using {n_gpu_layers} GPU layers")
455
-
456
- logger.info(f"Loading GGUF model: {model_path} with n_ctx={n_ctx}, n_gpu_layers={n_gpu_layers}")
457
-
458
  self._model = Llama(
459
  model_path=model_path,
460
- n_ctx=n_ctx,
461
  n_threads=4, # CPU threads
462
- n_gpu_layers=n_gpu_layers,
463
  verbose=False,
464
  **model_kwargs
465
  )
466
 
467
- logger.info(f"Successfully loaded GGUF model: {self.name}")
468
  return True
469
  except Exception as e:
470
- import traceback
471
- error_details = f"{type(e).__name__}: {str(e)}"
472
- error_traceback = traceback.format_exc()
473
- logger.error(f"Failed to load GGUF model {self.name}: {error_details}")
474
- logger.debug(f"Full traceback:\n{error_traceback}")
475
- self._error_message = error_details
476
- # Store detailed error for fallback tracking
477
- if self.model_type == "fallback":
478
- self._fallback_reason = f"GGUF load failed: {error_details}"
479
  return False
480
 
481
  def generate(self, prompt: str, config: GenerationConfig) -> str:
@@ -509,7 +357,6 @@ class OpenVINOModel(BaseModel):
509
 
510
  def _load_implementation(self) -> bool:
511
  try:
512
- import warnings
513
  from optimum.intel import OVModelForCausalLM
514
  from transformers import AutoTokenizer
515
 
@@ -526,38 +373,17 @@ class OpenVINOModel(BaseModel):
526
  # e.g., "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov" -> "microsoft/Phi-3-mini-4k-instruct"
527
  if "Phi-3-mini-4k-instruct" in self.name:
528
  tokenizer_path = "microsoft/Phi-3-mini-4k-instruct"
529
- elif "Phi-3-mini-128k-instruct" in self.name:
530
- tokenizer_path = "microsoft/Phi-3-mini-128k-instruct"
531
- # For causal-openvino type with standard model names, use the model name directly for tokenizer
532
- elif self.model_type == "causal-openvino":
533
- # For models like "microsoft/Phi-3-mini-128k-instruct", use the same name for tokenizer
534
- tokenizer_path = self.name
535
-
536
- # Suppress TracerWarnings during OpenVINO export (these are harmless but noisy)
537
- # The warnings occur when OpenVINO traces the PyTorch model for conversion
538
- with warnings.catch_warnings():
539
- warnings.filterwarnings("ignore", category=UserWarning, module="torch.jit")
540
- warnings.filterwarnings("ignore", message=".*TracerWarning.*")
541
- warnings.filterwarnings("ignore", message=".*Converting a tensor to a Python boolean.*")
542
- warnings.filterwarnings("ignore", message=".*torch.tensor results are registered as constants.*")
543
- # Load the OpenVINO model with trust_remote_code=True
544
  self._model = OVModelForCausalLM.from_pretrained(
545
  model_path,
546
  device="GPU" if torch.cuda.is_available() else "CPU",
547
- trust_remote_code=True,
548
- **model_kwargs,
549
  )
550
 
551
- # Load the tokenizer (also may need trust_remote_code)
552
- self._tokenizer = AutoTokenizer.from_pretrained(
553
- tokenizer_path,
554
- trust_remote_code=True,
555
- )
556
  return True
557
  except Exception as e:
558
  logger.error(f"Failed to load OpenVINO model {self.name}: {e}")
559
- import traceback
560
- logger.debug(f"OpenVINO load error traceback:\n{traceback.format_exc()}")
561
  return False
562
 
563
  def generate(self, prompt: str, config: GenerationConfig) -> str:
@@ -565,47 +391,7 @@ class OpenVINOModel(BaseModel):
565
  raise ModelError(self.name, "not_loaded", "Model not loaded")
566
 
567
  try:
568
- # Detect 128k models and set appropriate context window
569
- is_128k_model = "128k" in self.name.lower()
570
-
571
- # Get tokenizer's model_max_length (defaults to 128k for Phi-3-128k models)
572
- tokenizer_max_length = getattr(self._tokenizer, 'model_max_length', None)
573
-
574
- # For 128k models, use full context window (131072 tokens = 128k)
575
- # For other models, use tokenizer's default or a safe limit
576
- if is_128k_model:
577
- max_context_length = 131072 # Full 128k context window
578
- logger.info(f"128k model detected: Using context window of {max_context_length} tokens")
579
- elif tokenizer_max_length:
580
- max_context_length = tokenizer_max_length
581
- else:
582
- max_context_length = 4096 # Safe default for 4k models
583
-
584
- # Tokenize with proper context window handling
585
- # For 128k models, explicitly set max_length to allow full context without truncation
586
- tokenizer_kwargs = {"return_tensors": "pt"}
587
- if is_128k_model:
588
- # For 128k models, set max_length to full context window and disable truncation
589
- tokenizer_kwargs["max_length"] = max_context_length
590
- tokenizer_kwargs["truncation"] = False # Don't truncate - allow full 128k context
591
- else:
592
- # For other models, use tokenizer's default max_length with truncation enabled
593
- # This prevents errors if prompt exceeds context window
594
- if tokenizer_max_length:
595
- tokenizer_kwargs["max_length"] = tokenizer_max_length
596
- tokenizer_kwargs["truncation"] = True
597
- # If no max_length set, let tokenizer use its default
598
-
599
- inputs = self._tokenizer(prompt, **tokenizer_kwargs)
600
-
601
- # Log token count for debugging
602
- input_ids = inputs.get('input_ids', None)
603
- if input_ids is not None:
604
- prompt_tokens = input_ids.shape[1] if len(input_ids.shape) > 1 else len(input_ids)
605
- logger.debug(f"Prompt token count: {prompt_tokens} / {max_context_length}")
606
- if prompt_tokens > max_context_length * 0.9:
607
- logger.warning(f"Prompt is using {prompt_tokens}/{max_context_length} tokens ({prompt_tokens/max_context_length*100:.1f}%) - approaching context limit")
608
-
609
  if torch.cuda.is_available():
610
  inputs = {k: v.to("cuda") for k, v in inputs.items()}
611
 
@@ -641,7 +427,6 @@ class FallbackModel(BaseModel):
641
 
642
  def generate(self, prompt: str, config: GenerationConfig) -> str:
643
  # Simple rule-based fallback
644
- # Accept config parameter for compatibility with other models
645
  return "Patient summary generation completed. Please review patient data manually for comprehensive assessment."
646
 
647
  class UnifiedModelManager:
@@ -662,10 +447,8 @@ class UnifiedModelManager:
662
  model_type = detect_model_type(name)
663
 
664
  # Check if model is supported on T4
665
- fallback_reason = None
666
  if not is_model_supported_on_t4(name, model_type):
667
- fallback_reason = f"Model {name} ({model_type}) is not supported/optimal for T4 Medium"
668
- logger.warning(f"Model {name} may not be optimal for T4. Using fallback. Reason: {fallback_reason}")
669
  model_type = "fallback"
670
 
671
  cache_key = f"{name}:{model_type}"
@@ -681,29 +464,12 @@ class UnifiedModelManager:
681
  model_kwargs = get_t4_model_kwargs(model_type)
682
  model_kwargs.update(kwargs)
683
 
684
- # Special handling for Phi-3-small - it has hard dependency on Triton
685
- # which is not available on Windows. Switch to Phi-3-mini-128k-instruct instead.
686
- if "Phi-3-small" in name:
687
- if model_type == "openvino" or model_type == "causal-openvino":
688
- # OpenVINO mode - not supported for auto-export
689
- logger.warning(f"Phi-3-small is not currently supported in OpenVINO mode (architecture not supported for export). Switching to 'microsoft/Phi-3-mini-128k-instruct'.")
690
- name = "microsoft/Phi-3-mini-128k-instruct"
691
- elif not IS_T4_MEDIUM and (model_type == "text-generation" or model_type == "causal-lm" or model_type == "transformers"):
692
- # Transformers mode on Windows - Triton not available
693
- logger.warning(f"Phi-3-small requires Triton which is not available on Windows. Switching to 'microsoft/Phi-3-mini-128k-instruct'.")
694
- name = "microsoft/Phi-3-mini-128k-instruct"
695
- # Update cache key to reflect the actual model being loaded
696
- cache_key = f"{name}:{model_type}"
697
-
698
  if model_type == "gguf" or filename or name.endswith('.gguf'):
699
  model = GGUFModel(name, model_type, filename, **model_kwargs)
700
- elif model_type == "openvino" or model_type == "causal-openvino" or "openvino" in name.lower():
701
  model = OpenVINOModel(name, model_type, **model_kwargs)
702
  elif model_type == "fallback":
703
  model = FallbackModel(name, model_type, **model_kwargs)
704
- # Store fallback reason if we switched to fallback
705
- if fallback_reason:
706
- model._fallback_reason = fallback_reason
707
  else:
708
  model = TransformersModel(name, model_type, **model_kwargs)
709
 
@@ -711,88 +477,9 @@ class UnifiedModelManager:
711
 
712
  # Load if not lazy
713
  if not lazy and model.status != ModelStatus.LOADED:
714
- load_result = model.load()
715
- # If load failed and we're using fallback, capture the reason
716
- if load_result is None and model.model_type == "fallback" and not model._fallback_reason:
717
- model._fallback_reason = f"Model {name} failed to load"
718
 
719
  return model
720
-
721
- def get_fallback_reason(self, name: str, model_type: str = None) -> Optional[str]:
722
- """Get the fallback reason for a specific model if it's using fallback"""
723
- if model_type is None:
724
- model_type = detect_model_type(name)
725
-
726
- cache_key = f"{name}:{model_type}"
727
- if cache_key in self._models:
728
- model = self._models[cache_key]
729
- return model.fallback_reason
730
-
731
- return None
732
-
733
- def diagnose_model_loading(self, name: str, model_type: str = None) -> Dict[str, Any]:
734
- """Diagnose why a model might not be loading - returns detailed information"""
735
- if model_type is None:
736
- model_type = detect_model_type(name)
737
-
738
- diagnosis = {
739
- "model_name": name,
740
- "model_type": model_type,
741
- "is_supported_on_t4": is_model_supported_on_t4(name, model_type),
742
- "cache_key": f"{name}:{model_type}",
743
- "in_cache": False,
744
- "status": None,
745
- "error_message": None,
746
- "fallback_reason": None,
747
- "file_exists": False,
748
- "file_path": None,
749
- "file_size_mb": None
750
- }
751
-
752
- # Check cache
753
- if diagnosis["cache_key"] in self._models:
754
- model = self._models[diagnosis["cache_key"]]
755
- diagnosis["in_cache"] = True
756
- diagnosis["status"] = model.status.value if model.status else None
757
- diagnosis["error_message"] = model._error_message
758
- diagnosis["fallback_reason"] = model._fallback_reason
759
-
760
- # Check if it's a GGUF model and verify file
761
- if model_type == "gguf" or name.endswith('.gguf'):
762
- import os
763
- # Try to determine the file path
764
- if '/' in name and name.endswith('.gguf'):
765
- parts = name.split('/')
766
- repo_id = '/'.join(parts[:-1])
767
- filename = parts[-1]
768
- # Check Hugging Face cache
769
- cache_dir = os.environ.get('HF_HOME', os.path.join(os.path.expanduser('~'), '.cache', 'huggingface'))
770
- # Try to find the file in cache
771
- potential_paths = [
772
- os.path.join(cache_dir, 'hub', f'models--{repo_id.replace("/", "--")}', 'snapshots', '*', filename),
773
- os.path.join(cache_dir, 'hub', repo_id.replace('/', '--'), filename),
774
- ]
775
- # Check if file exists locally first
776
- if os.path.exists(name):
777
- diagnosis["file_exists"] = True
778
- diagnosis["file_path"] = name
779
- else:
780
- # Try to find in cache
781
- from glob import glob
782
- for pattern in potential_paths:
783
- matches = glob(pattern)
784
- if matches:
785
- diagnosis["file_exists"] = True
786
- diagnosis["file_path"] = matches[0]
787
- break
788
-
789
- if diagnosis["file_path"] and os.path.exists(diagnosis["file_path"]):
790
- try:
791
- diagnosis["file_size_mb"] = round(os.path.getsize(diagnosis["file_path"]) / (1024 * 1024), 2)
792
- except:
793
- pass
794
-
795
- return diagnosis
796
 
797
  def generate_text(self, name: str, prompt: str, model_type: str = None, **kwargs) -> str:
798
  """Generate text using specified model"""
@@ -812,7 +499,7 @@ class UnifiedModelManager:
812
 
813
  for key, model in self._models.items():
814
  # Remove models not used in last hour
815
- if current_time - model._last_used > 31200:
816
  to_remove.append(key)
817
 
818
  for key in to_remove:
@@ -830,8 +517,7 @@ class UnifiedModelManager:
830
  memory_usage=model._memory_usage,
831
  load_time=model._load_time,
832
  last_used=model._last_used,
833
- error_message=model._error_message,
834
- fallback_reason=model._fallback_reason
835
  )
836
  for model in self._models.values()
837
  ]
@@ -850,25 +536,7 @@ unified_model_manager = get_unified_model_manager()
850
  # Legacy compatibility functions
851
  def create_fallback_pipeline():
852
  """Create a fallback pipeline for compatibility"""
853
- fallback_model = FallbackModel("fallback", "fallback")
854
- fallback_model.load() # Ensure it's loaded
855
-
856
- # Create a wrapper that matches the expected interface
857
- class FallbackPipelineWrapper:
858
- def __init__(self, model):
859
- self.model = model
860
-
861
- def generate(self, prompt, **kwargs):
862
- """Generate with keyword arguments (for compatibility with GGUF pipeline interface)"""
863
- # Convert kwargs to GenerationConfig (already imported at module level)
864
- config = GenerationConfig(**kwargs)
865
- return self.model.generate(prompt, config)
866
-
867
- def generate_full_summary(self, prompt, **kwargs):
868
- """Generate full summary (for compatibility)"""
869
- return self.generate(prompt, **kwargs)
870
-
871
- return FallbackPipelineWrapper(fallback_model)
872
 
873
  def get_memory_monitor():
874
  """Get a simple memory monitor for compatibility"""
 
55
  load_time: float
56
  last_used: float
57
  error_message: Optional[str] = None
 
58
 
59
  @dataclass
60
  class GenerationConfig:
 
90
  self._load_time = 0.0
91
  self._last_used = time.time()
92
  self._error_message = None
 
93
  self._memory_usage = 0.0
94
  self._kwargs = kwargs
95
 
96
  @property
97
  def status(self) -> ModelStatus:
98
  return self._status
99
 
100
  @abstractmethod
101
  def _load_implementation(self) -> bool:
 
132
  except Exception as e:
133
  self._status = ModelStatus.ERROR
134
  self._error_message = str(e)
 
135
  logger.error(f"Failed to load model {self.name}: {e}")
 
 
 
136
  return None
137
 
138
  def _update_memory_usage(self):
 
169
  def _load_implementation(self) -> bool:
170
  try:
171
  from transformers import pipeline
 
172
 
173
  # Get T4-optimized kwargs
174
  model_kwargs = get_t4_model_kwargs(self.model_type)
175
 
176
  # Handle different model types for summarization
177
  if self.model_type.lower() in ["summarization", "seq2seq"]:
 
181
  model=self.name,
182
  device_map="auto" if torch.cuda.is_available() else None,
183
  model_kwargs=model_kwargs,
184
+ **self._kwargs
185
  )
186
  elif self.model_type.lower() in ["text-generation", "causal-lm"]:
187
  # Text generation models
 
190
  model=self.name,
191
  device_map="auto" if torch.cuda.is_available() else None,
192
  model_kwargs=model_kwargs,
193
+ **self._kwargs
194
  )
195
  elif "bart" in self.name.lower() or "t5" in self.name.lower():
196
  # BART and T5 models default to summarization
 
199
  model=self.name,
200
  device_map="auto" if torch.cuda.is_available() else None,
201
  model_kwargs=model_kwargs,
202
+ **self._kwargs
203
  )
204
  elif "longformer" in self.name.lower():
205
  # Longformer models work with summarization pipeline
 
208
  model=self.name,
209
  device_map="auto" if torch.cuda.is_available() else None,
210
  model_kwargs=model_kwargs,
211
+ **self._kwargs
212
  )
213
  else:
214
  # Default to text-generation for unknown types
 
217
  model=self.name,
218
  device_map="auto" if torch.cuda.is_available() else None,
219
  model_kwargs=model_kwargs,
220
+ **self._kwargs
221
  )
222
 
223
  return True
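Review note: the block removed by this revert mentioned a "multiple values for keyword argument" error. Whether the restored `**self._kwargs` calls hit it depends on what `self._kwargs` contains, which this diff does not show. The sketch below only illustrates the general Python behaviour; `load_pipeline` and `drop_explicit` are hypothetical stand-ins, not the `transformers` API.

```python
def load_pipeline(task, model, device_map=None, model_kwargs=None, **kwargs):
    """Stand-in with a signature loosely shaped like a pipeline factory (hypothetical)."""
    return {
        "task": task,
        "model": model,
        "device_map": device_map,
        "model_kwargs": model_kwargs or {},
        "extra": kwargs,
    }


def drop_explicit(extra, explicit):
    """Return a copy of extra without keys that are also passed explicitly."""
    return {k: v for k, v in extra.items() if k not in explicit}


# Assumed contents of the forwarded kwargs, for illustration only.
extra = {"device_map": "auto", "trust_remote_code": True}

try:
    load_pipeline("summarization", "some/model", device_map=None, **extra)
    collided = False
except TypeError:
    # Python rejects the call because device_map is supplied twice.
    collided = True

# Filtering the forwarded kwargs first avoids the collision.
safe = load_pipeline(
    "summarization",
    "some/model",
    device_map=None,
    **drop_explicit(extra, {"device_map"}),
)
```

If the forwarded kwargs can never contain a key that is also passed explicitly, the plain `**self._kwargs` form is fine; the filter is only needed when the two sets can overlap.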
 
244
  "num_return_sequences": 1
245
  }
246
 
247
  # Handle different pipeline types
248
  if hasattr(self._model, 'task') and self._model.task == "summarization":
249
  # Summarization pipeline
 
254
  temperature=config.temperature,
255
  do_sample=config.temperature > 0.1,
256
  num_beams=4, # Better quality for summarization
257
+ early_stopping=True
 
258
  )
259
  return result[0]['summary_text'] if result else ""
260
  else:
 
266
  top_p=config.top_p,
267
  do_sample=config.temperature > 0.1,
268
  pad_token_id=0,
269
+ num_return_sequences=1
 
270
  )
271
  generated_text = result[0]['generated_text']
272
  # Remove the prompt from the generated text
 
298
  def _load_implementation(self) -> bool:
299
  try:
300
  from llama_cpp import Llama
 
 
301
 
302
  # Get T4-optimized kwargs
303
  model_kwargs = get_t4_model_kwargs("gguf")
304
 
305
  # Set up model path - handle different GGUF formats
306
  model_path = self.name
 
307
  if not model_path.endswith('.gguf'):
308
+ if '/' in model_path:
309
+ # Already a full path like microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf
310
+ model_path = f"{model_path}"
311
  else:
312
+ # Add default filename
313
+ model_path = f"{model_path}/{self.filename}"
314
+
315
  self._model = Llama(
316
  model_path=model_path,
317
+ n_ctx=8192, # T4 context window
318
  n_threads=4, # CPU threads
319
+ n_gpu_layers=35 if torch.cuda.is_available() else 0, # GPU layers for Phi-3
320
  verbose=False,
321
  **model_kwargs
322
  )
323
 
 
324
  return True
325
  except Exception as e:
326
+ logger.error(f"Failed to load GGUF model {self.name}: {e}")
327
  return False
328
 
329
  def generate(self, prompt: str, config: GenerationConfig) -> str:
 
357
 
358
  def _load_implementation(self) -> bool:
359
  try:
 
360
  from optimum.intel import OVModelForCausalLM
361
  from transformers import AutoTokenizer
362
 
 
373
  # e.g., "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov" -> "microsoft/Phi-3-mini-4k-instruct"
374
  if "Phi-3-mini-4k-instruct" in self.name:
375
  tokenizer_path = "microsoft/Phi-3-mini-4k-instruct"
376
+
377
  self._model = OVModelForCausalLM.from_pretrained(
378
  model_path,
379
  device="GPU" if torch.cuda.is_available() else "CPU",
380
+ **model_kwargs
 
381
  )
382
 
383
+ self._tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
 
 
 
 
384
  return True
385
  except Exception as e:
386
  logger.error(f"Failed to load OpenVINO model {self.name}: {e}")
 
 
387
  return False
388
 
389
  def generate(self, prompt: str, config: GenerationConfig) -> str:
 
391
  raise ModelError(self.name, "not_loaded", "Model not loaded")
392
 
393
  try:
394
+ inputs = self._tokenizer(prompt, return_tensors="pt")
395
  if torch.cuda.is_available():
396
  inputs = {k: v.to("cuda") for k, v in inputs.items()}
397
 
 
427
 
428
  def generate(self, prompt: str, config: GenerationConfig) -> str:
429
  # Simple rule-based fallback
 
430
  return "Patient summary generation completed. Please review patient data manually for comprehensive assessment."
431
 
432
  class UnifiedModelManager:
 
447
  model_type = detect_model_type(name)
448
 
449
  # Check if model is supported on T4
 
450
  if not is_model_supported_on_t4(name, model_type):
451
+ logger.warning(f"Model {name} may not be optimal for T4. Using fallback.")
 
452
  model_type = "fallback"
453
 
454
  cache_key = f"{name}:{model_type}"
 
464
  model_kwargs = get_t4_model_kwargs(model_type)
465
  model_kwargs.update(kwargs)
466
 
467
  if model_type == "gguf" or filename or name.endswith('.gguf'):
468
  model = GGUFModel(name, model_type, filename, **model_kwargs)
469
+ elif model_type == "openvino" or "openvino" in name.lower():
470
  model = OpenVINOModel(name, model_type, **model_kwargs)
471
  elif model_type == "fallback":
472
  model = FallbackModel(name, model_type, **model_kwargs)
 
 
 
473
  else:
474
  model = TransformersModel(name, model_type, **model_kwargs)
475
 
 
477
 
478
  # Load if not lazy
479
  if not lazy and model.status != ModelStatus.LOADED:
480
+ model.load()
481
 
482
  return model
483
 
484
  def generate_text(self, name: str, prompt: str, model_type: str = None, **kwargs) -> str:
485
  """Generate text using specified model"""
 
499
 
500
  for key, model in self._models.items():
501
  # Remove models not used in last hour
502
+ if current_time - model._last_used > 3600:
503
  to_remove.append(key)
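The loop above collects the cache keys of models that have been idle for more than 3600 seconds. A minimal sketch of the same check over a plain dict of last-used timestamps, assuming `_last_used` holds `time.time()` values (the function name `find_idle_keys` and the dict layout are hypothetical, not taken from the diff):

```python
import time
from typing import Dict, List, Optional


def find_idle_keys(last_used: Dict[str, float],
                   max_idle_seconds: float = 3600.0,
                   now: Optional[float] = None) -> List[str]:
    """Return keys whose last-use timestamp is older than max_idle_seconds."""
    current_time = time.time() if now is None else now
    # Collect first and delete afterwards: removing keys from a dict while
    # iterating over it raises RuntimeError.
    return [key for key, used_at in last_used.items()
            if current_time - used_at > max_idle_seconds]
```

Passing `now` explicitly keeps the check deterministic in tests. The comparison is strict, so a model idle for exactly one hour is kept.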
504
 
505
  for key in to_remove:
 
517
  memory_usage=model._memory_usage,
518
  load_time=model._load_time,
519
  last_used=model._last_used,
520
+ error_message=model._error_message
 
521
  )
522
  for model in self._models.values()
523
  ]
 
536
  # Legacy compatibility functions
537
  def create_fallback_pipeline():
538
  """Create a fallback pipeline for compatibility"""
539
+ return FallbackModel("fallback", "fallback")
540
 
541
  def get_memory_monitor():
542
  """Get a simple memory monitor for compatibility"""
temp_test_load.py DELETED
@@ -1,6 +0,0 @@
1
- import sys, os
2
- sys.path.append(r'd:/dartdev/glitz/git/HNTAI/services/ai-service/src')
3
- from ai_med_extract.utils.unified_model_manager import UnifiedModelManager
4
- manager = UnifiedModelManager()
5
- model = manager.get_model('microsoft/Phi-3-small-8k-instruct', model_type='causal-openvino', lazy=False)
6
- print('Model status after load:', model.status)
temp_test_load_128k.py DELETED
@@ -1,9 +0,0 @@
1
- import sys, os
2
- sys.path.append(r'd:/dartdev/glitz/git/HNTAI/services/ai-service/src')
3
- from ai_med_extract.utils.unified_model_manager import UnifiedModelManager
4
- manager = UnifiedModelManager()
5
- # Testing the primary model from config
6
- model_name = 'microsoft/Phi-3-mini-128k-instruct'
7
- print(f'Testing load for: {model_name}')
8
- model = manager.get_model(model_name, model_type='causal-openvino', lazy=False)
9
- print(f'Model status after load: {model.status}')
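Both deleted scratch scripts append an absolute Windows path to `sys.path`, so they only run on the machine they were written on. If a similar smoke test is wanted later, one option is to resolve the source directory relative to the script itself. A minimal sketch, assuming the script sits at the repository root next to `services/` as the two deleted files did (the helper name `add_service_src_to_path` is hypothetical):

```python
import sys
from pathlib import Path


def add_service_src_to_path(script_path: str) -> Path:
    """Put <script dir>/services/ai-service/src on sys.path and return it.

    Assumed layout: the calling script lives at the repository root.
    """
    src = Path(script_path).resolve().parent / "services" / "ai-service" / "src"
    # Skip the insert if the entry is already present.
    if str(src) not in sys.path:
        sys.path.insert(0, str(src))
    return src
```

A script would call `add_service_src_to_path(__file__)` before importing from `ai_med_extract`.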