
Patient Summary Generation - Model Review & Rating

Executive Summary

This document reviews and rates 6 models for the patient summary generation flow based on:

Compatibility with current implementation
Performance (speed, memory usage)
Quality (output quality for clinical summaries)
Token Limits (input/output capacity)
Reliability (error handling, fallbacks)

Flow Overview

The patient summary generation follows this flow:

1. Data Processing: EHR data → robust parsing → delta calculation → baseline extraction
2. Prompt Building: Creates structured prompts with patient data, visits, demographics
3. Model Loading: Unified model manager handles loading with caching
4. Generation: Model-specific generation with token limits (default: 8192 input, 1024-8192 output)
5. Post-processing: Markdown formatting → ensure 4 sections → output

Model Ratings

1. facebook/bart-large-cnn

Type: summarization
Rating: ⭐⭐⭐⭐⭐ (9/10)

Strengths

✅ Well-supported: Primary default model for summarization
✅ Optimized pipeline: Uses HuggingFace summarization pipeline (lines 1337-1414)
✅ Good quality: BART architecture excels at abstractive summarization
✅ Memory efficient: ~406M parameters, reasonable for production
✅ Fast inference: Optimized for summarization tasks

Weaknesses

⚠️ Context length: Limited to ~1024 tokens input (standard BART)
⚠️ May truncate: Long patient histories might be cut off

Implementation Details

# Handled in: async_patient_summary, lines 1337-1414
model_type = "summarization"
pipeline = unified_model_manager.get_model(model_name, "summarization")
config = GenerationConfig(max_tokens=1024, min_tokens=100, temperature=0.1, top_p=0.5)
raw_summary = await asyncio.to_thread(pipeline.generate, context, config)

Recommendations

✅ Use for: Standard patient summaries with moderate history
✅ Best for: Balanced quality/speed requirements
✅ Production ready: Yes

2. patrickvonplaten/longformer2roberta-cnn_dailymail-fp16

Type: seq2seq
Rating: ⭐⭐⭐⭐ (7/10)

Strengths

✅ Long context: Longformer architecture handles 4096+ tokens
✅ Good for long histories: Better suited for extensive patient records
✅ Half-precision: FP16 reduces memory footprint
✅ Summarization training: Fine-tuned on CNN/DailyMail, a news summarization dataset (general-domain, not medical)

Weaknesses

⚠️ Type handling: Currently mapped to summarization pipeline (line 1425)
⚠️ May fall back: Code falls back to bart-large-cnn if loading fails (lines 1430-1442)
⚠️ Less tested: Not a primary default model

Implementation Details

# Handled in: async_patient_summary, lines 1416-1503
model_type = "seq2seq"
# Mapped to summarization pipeline internally
model = unified_model_manager.get_model(model_name, "seq2seq")
# Falls back to default summarization model if load fails

Recommendations

✅ Use for: Patients with extensive visit history (>50 visits)
⚠️ Production: Needs testing for reliability
💡 Improvement: Better error handling and specific seq2seq optimization
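
The fallback behaviour described above could be made explicit along the following lines. This is a minimal sketch, not the code in routes_fastapi.py: `load_with_fallback` is a hypothetical helper, and `loader` is an assumed stand-in for a callable such as `unified_model_manager.get_model`. Returning the name of the model that actually loaded lets the caller report when a fallback was used.

```python
import logging
from typing import Any, Callable, Tuple

logger = logging.getLogger(__name__)


def load_with_fallback(
    loader: Callable[[str, str], Any],
    model_name: str,
    model_type: str,
    fallback_name: str,
    fallback_type: str = "summarization",
) -> Tuple[Any, str]:
    """Try the requested model first; on a load error, log it and load the fallback.

    Returns (model, name_of_model_actually_loaded). `loader` is a hypothetical
    stand-in for the project's model manager, not its real signature.
    """
    try:
        return loader(model_name, model_type), model_name
    except Exception as exc:  # loader errors vary by backend, so catch broadly
        logger.warning(
            "Loading %s failed (%s); falling back to %s",
            model_name, exc, fallback_name,
        )
        # A failure here propagates, so the caller can use its rule-based summary.
        return loader(fallback_name, fallback_type), fallback_name
```

A caller might compare the returned name with the requested one and include the difference in the job status, so a silent downgrade from Longformer to BART is visible in the output.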

3. microsoft/Phi-3-mini-4k-instruct

Type: causal-openvino
Rating: ⭐⭐⭐⭐⭐ (9/10)

Strengths

✅ Instruction-tuned: Phi-3 follows instructions well (good for structured summaries)
✅ 4k context: 4096 tokens allow longer patient histories
✅ Fast inference: Optimized model size (3.8B parameters)
✅ OpenVINO support: Can use optimized inference
✅ Handled as text-generation: Uses unified manager (lines 1177-1335)

Weaknesses

⚠️ Requires good prompts: Instruction format must be clear
⚠️ Potential OpenVINO path: May route to OpenVINO pipeline (lines 1229-1235)

Implementation Details

# Handled in: async_patient_summary, lines 1177-1335
model_type = "causal-openvino"  # "text-generation" is handled the same way
# Mapped to text-generation internally
actual_model_type = "text-generation" if model_type in {"text-generation", "causal-openvino"} else model_type
model = unified_model_manager.get_model(model_name, "text-generation")
config = GenerationConfig(max_tokens=1024, temperature=0.1, top_p=0.5)
raw_summary = await asyncio.to_thread(model.generate, prompt, config)

Recommendations

✅ Use for: Structured summaries with clear instructions
✅ Best for: Modern instruction-following requirements
✅ Production ready: Yes

4. OpenVINO/Phi-3-mini-4k-instruct-fp16-ov

Type: causal-openvino
Rating: ⭐⭐⭐⭐ (8/10)

Strengths

✅ Optimized inference: OpenVINO optimization for CPU/Intel hardware
✅ FP16 precision: Half-precision reduces memory
✅ Same capabilities: Same as Phi-3-mini-4k-instruct model-wise
✅ Fallback option: Listed as fallback for causal-openvino (model_config.py line 56)

Weaknesses

⚠️ Hardware specific: Optimized for Intel hardware
⚠️ Different loading path: Uses get_openvino_pipeline (line 1233)
⚠️ May be slower on non-Intel hardware: On NVIDIA GPUs the standard model is likely the better choice

Implementation Details

# Handled in: async_patient_summary, lines 1229-1235
elif model_type == "causal-openvino":
    from ..utils.model_loader_spaces import get_openvino_pipeline
    pipeline = await asyncio.to_thread(get_openvino_pipeline, model_name)

Recommendations

✅ Use for: Intel CPU servers, optimized inference
⚠️ Production: Test on target hardware first
💡 Conditional: Use based on hardware detection
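
The hardware-based selection suggested above might look like the sketch below. The function name `choose_phi3_variant` and its inputs are hypothetical; the caller is assumed to supply the hardware facts (for example, `torch.cuda.is_available()` for the GPU flag). The last branch, choosing the GGUF build on a non-Intel CPU, is an assumption based on this review's memory figures rather than on the codebase.

```python
def choose_phi3_variant(has_cuda_gpu: bool, cpu_vendor: str) -> str:
    """Pick a Phi-3 variant from simple hardware facts (illustrative only).

    has_cuda_gpu: True when an NVIDIA GPU is usable.
    cpu_vendor:   vendor string such as "GenuineIntel" or "AuthenticAMD".
    """
    if has_cuda_gpu:
        # On NVIDIA hardware the standard model avoids the OpenVINO path.
        return "microsoft/Phi-3-mini-4k-instruct"
    if "intel" in cpu_vendor.lower():
        # OpenVINO targets Intel hardware, so use it only there.
        return "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov"
    # Assumption: on other CPUs the quantized GGUF build is the lightest option.
    return "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"


print(choose_phi3_variant(True, "GenuineIntel"))   # standard Phi-3 on a GPU host
print(choose_phi3_variant(False, "GenuineIntel"))  # OpenVINO build on an Intel CPU host
```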

5. microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf

Type: gguf
Rating: ⭐⭐⭐⭐ (8/10)

Strengths

✅ Quantized: Q4 quantization = very low memory (~2.5GB vs ~7.6GB)
✅ Fast inference: Optimized GGUF format
✅ 4k context: Maintains 4096 token context
✅ Well-handled: Dedicated GGUF pipeline (lines 1015-1175)
✅ Caching: Pipeline caching for performance (GGUF_PIPELINE_CACHE)

Weaknesses

⚠️ Slight quality loss: Q4 quantization may reduce quality slightly
⚠️ Longer timeouts: Extended timeout needed (600s on HF Spaces)
⚠️ File path parsing: Requires special handling for filename extraction

Implementation Details

# Handled in: async_patient_summary, lines 1015-1175
if model_name.endswith('.gguf'):
    parts = model_name.rsplit('/', 1)
    repo_id = parts[0]
    filename = parts[1]
else:
    repo_id = model_name
    filename = None

pipeline = await asyncio.to_thread(get_cached_gguf_pipeline, repo_id, filename)
full_prompt = f"""<|system|>...<|user|>...<|assistant|>"""
raw_summary = await asyncio.to_thread(
    pipeline.generate,
    full_prompt,
    max_tokens=1024,
    temperature=0.1,
    top_p=0.5
)
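
The filename extraction can be isolated in a small helper so that the edge cases are easy to test. The sketch below assumes a hypothetical function name, `split_gguf_model_name`; it follows the same `rsplit` logic as the excerpt above and adds a guard for names that end in `.gguf` but contain no `/`, which the excerpt would fail on when reading `parts[1]`.

```python
from typing import Optional, Tuple


def split_gguf_model_name(model_name: str) -> Tuple[str, Optional[str]]:
    """Split 'org/repo/file.gguf' into (repo_id, filename).

    Names that do not end in '.gguf', or that have no '/' to split on,
    are returned unchanged with filename=None.
    """
    if model_name.endswith(".gguf") and "/" in model_name:
        repo_id, filename = model_name.rsplit("/", 1)
        return repo_id, filename
    return model_name, None


repo_id, filename = split_gguf_model_name(
    "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"
)
print(repo_id)   # microsoft/Phi-3-mini-4k-instruct-gguf
print(filename)  # Phi-3-mini-4k-instruct-q4.gguf
```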

Recommendations

✅ Use for: Memory-constrained environments, local deployment
✅ Best for: HuggingFace Spaces deployment
✅ Production ready: Yes, with extended timeout

6. google/flan-t5-large

Type: summarization
Rating: ⭐⭐⭐ (6/10)

Strengths

✅ Fallback option: Listed as fallback for summarization (model_config.py line 31)
✅ T5 architecture: Encoder-decoder, good for summarization
✅ Well-supported: Standard HuggingFace summarization pipeline

Weaknesses

⚠️ Not summarization-specific: FLAN-T5 is a general instruction-tuned T5 variant, not fine-tuned specifically for summarization like bart-large-cnn
⚠️ Context limits: ~512 input tokens (less than BART)
⚠️ Quality: Generally lower quality than BART-large
⚠️ Not primary: Only used as fallback

Implementation Details

# Handled in: async_patient_summary, lines 1337-1414 (same as BART)
# Falls back from primary summarization models
fallback_model_name = model_config.get_default_model('summarization')
# Gets flan-t5-large as fallback

Recommendations

⚠️ Use for: Fallback only when BART fails
❌ Not recommended: For primary production use
💡 Consider: Replacing with better fallback option

Overall Flow Assessment

Current Implementation Strengths

✅ Unified Model Manager: Centralized loading and caching
✅ Model Type Handling: Supports all required types (summarization, seq2seq, gguf, causal-openvino)
✅ Robust Error Handling: Fallbacks to rule-based summary
✅ Token Management: Configurable max tokens (default 1024, supports up to 8192)
✅ Progress Tracking: Job progress updates for all model types
✅ Memory Management: Cleanup after generation
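
The type handling described in this review can be summarised as a single mapping function. This is a simplified sketch under stated assumptions: `resolve_pipeline_type` is a hypothetical name, and the real routing in routes_fastapi.py has extra branches (for example, the dedicated `get_openvino_pipeline` path) that are not modelled here.

```python
def resolve_pipeline_type(model_name: str, model_type: str) -> str:
    """Map a requested model type to the pipeline family that serves it.

    Mirrors the mappings described in this review:
    - names ending in '.gguf' use the GGUF pipeline
    - 'causal-openvino' is treated as 'text-generation'
    - 'seq2seq' is currently served by the summarization pipeline
    """
    if model_type == "gguf" or model_name.endswith(".gguf"):
        return "gguf"
    if model_type in {"text-generation", "causal-openvino"}:
        return "text-generation"
    if model_type in {"summarization", "seq2seq"}:
        return "summarization"
    raise ValueError(f"Unsupported model type: {model_type!r}")


print(resolve_pipeline_type("facebook/bart-large-cnn", "summarization"))
print(resolve_pipeline_type("microsoft/Phi-3-mini-4k-instruct", "causal-openvino"))
```

Raising on unknown types, rather than defaulting to one pipeline, keeps a misconfigured model from being silently routed to the wrong loader.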

Areas for Improvement

⚠️ Input Token Limits: Some models have hard limits that may truncate long histories
⚠️ Prompt Optimization: Different models may need model-specific prompt formats
⚠️ Seq2Seq Mapping: Currently maps seq2seq → summarization (may not be optimal)
⚠️ Timeout Handling: GGUF models need extended timeouts

Model Comparison Matrix

| Model | Type | Context (tokens) | Quality | Speed | Memory | Rating |
|---|---|---|---|---|---|---|
| bart-large-cnn | summarization | 1024 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 9/10 |
| longformer2roberta | seq2seq | 4096+ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | 7/10 |
| Phi-3-mini-4k | causal-openvino | 4096 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | 9/10 |
| Phi-3-OpenVINO | causal-openvino | 4096 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | 8/10 |
| Phi-3-GGUF | gguf | 4096 | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 8/10 |
| flan-t5-large | summarization | 512 | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | 6/10 |

Recommendations by Use Case

Best Overall Quality:

facebook/bart-large-cnn (if context fits)
microsoft/Phi-3-mini-4k-instruct (for longer contexts)

Best for Long Patient Histories:

patrickvonplaten/longformer2roberta-cnn_dailymail-fp16
microsoft/Phi-3-mini-4k-instruct (both handle 4k+ tokens)

Best for Memory-Constrained Environments:

microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf
facebook/bart-large-cnn (moderate memory)

Best for Intel CPU/OpenVINO:

OpenVINO/Phi-3-mini-4k-instruct-fp16-ov

Production Primary Recommendations:

facebook/bart-large-cnn (balanced, reliable)
microsoft/Phi-3-mini-4k-instruct (modern, instruction-following)
microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf (for Spaces)

Code Integration Assessment

All models are properly integrated into the flow with:

✅ Proper type detection and routing
✅ Unified model manager loading
✅ Error handling and fallbacks
✅ Progress tracking
✅ Memory cleanup

Overall Flow Rating: ⭐⭐⭐⭐⭐ (9/10)

The implementation is robust and handles all model types well. Minor improvements could be made in:

Model-specific prompt optimization
Better handling of very long contexts
Seq2seq-specific optimizations

T4 Medium HuggingFace Spaces Deployment Review

T4 Medium Specifications

GPU Memory: 16GB VRAM (NVIDIA T4)
CPU: 8 vCPUs
RAM: 30GB
Storage: 150GB persistent
Cost: $0.60/hour (~$438/month if running 24/7, at 730 hours/month)

Memory Constraints Analysis

Safe Memory Usage Targets

Based on HF Spaces configuration (hf_spaces_config.py):

Max GPU Usage: 80% of 16GB = ~12.8GB usable
Max RAM Usage: 80% of 30GB = ~24GB usable
Model Caching: Enabled (cleanup every 5 minutes)

T4 Medium Deployment Ratings

1. facebook/bart-large-cnn

T4 Rating: ⭐⭐⭐⭐⭐ (10/10) - BEST CHOICE FOR T4

T4-Specific Assessment

✅ VRAM: ~1.5GB (excellent fit)
✅ RAM: ~2-3GB during inference
✅ Speed: Fast on T4 (optimized for summarization)
✅ HF Spaces Config: Primary model (hf_spaces_config.py line 13)
✅ No Quantization Needed: Fits comfortably in 16GB
✅ Cache-Friendly: Small size allows multiple instances

Performance Estimates

Load Time: ~5-10 seconds
Inference: ~2-5 seconds per summary
Memory Footprint: Very low, allows concurrent requests

T4 Deployment Verdict

✅ Highly Recommended: Best balance for T4
✅ Production Ready: Proven on HF Spaces
✅ Cost Effective: Low resource usage = better uptime

2. patrickvonplaten/longformer2roberta-cnn_dailymail-fp16

T4 Rating: ⭐⭐⭐ (6/10) - USE WITH CAUTION

T4-Specific Assessment

⚠️ VRAM: ~2-3GB (acceptable but tight with batching)
⚠️ RAM: ~4-6GB during inference (higher overhead)
⚠️ Speed: Moderate on T4 (Longformer attention is memory-intensive)
⚠️ HF Spaces Config: Falls back to bart-large-cnn (line 18)
⚠️ FP16 Helpful: Half-precision helps but still resource-intensive

Performance Estimates

Load Time: ~15-25 seconds
Inference: ~8-15 seconds per summary
Memory Footprint: High, limits concurrent requests

T4 Deployment Verdict

⚠️ Conditional Use: Only for very long histories (>4096 tokens)
⚠️ Not Primary: Should be fallback, not default
⚠️ Monitor Memory: Risk of OOM with multiple concurrent requests

Recommendations

Use only when patient history exceeds 4096 tokens
Limit concurrent requests when using this model
Consider chunking strategy for very long histories
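
A chunking strategy of the kind suggested above could start from a token-budget splitter like the one below. This is a sketch under stated assumptions: `chunk_by_tokens` is a hypothetical helper, and the default tokenizer is plain whitespace splitting, which only approximates a model tokenizer's counts. In practice the model's own tokenizer would be passed in as `tokenize`.

```python
from typing import Callable, List


def chunk_by_tokens(
    text: str,
    max_tokens: int,
    overlap: int = 0,
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Split text into chunks of at most max_tokens tokens.

    Consecutive chunks share `overlap` tokens so that context spanning a
    chunk boundary is not lost. Tokens are re-joined with single spaces.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    tokens = tokenize(text)
    step = max_tokens - overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


print(chunk_by_tokens("a b c d e", max_tokens=3, overlap=1))
# ['a b c', 'c d e']
```

Each chunk can then be summarized separately and the partial summaries combined in a second pass; whether that preserves clinical detail well enough would need testing on real histories.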

3. microsoft/Phi-3-mini-4k-instruct

T4 Rating: ⭐⭐⭐⭐ (8/10) - GOOD CHOICE

T4-Specific Assessment

✅ VRAM: ~7.6GB in FP16 (3.8B parameters × 2 bytes; fits within the ~12.8GB budget)
⚠️ RAM: ~4-5GB during inference
✅ Speed: Good on T4 (3.8B parameters, optimized)
✅ HF Spaces Config: Allowed in SPACES_OPTIMIZED_MODELS (line 111)
⚠️ Text-Generation Mode: Uses more VRAM than summarization models

Performance Estimates

Load Time: ~10-20 seconds
Inference: ~5-10 seconds per summary
Memory Footprint: Moderate

T4 Deployment Verdict

✅ Recommended: Good for instruction-following tasks
✅ Production Ready: Handles 4k context well
⚠️ Monitor: Higher memory than BART

Recommendations

Enable quantization if memory becomes tight
Use for summaries requiring structured output
Consider the GGUF version if memory is a concern

4. OpenVINO/Phi-3-mini-4k-instruct-fp16-ov

T4 Rating: ⭐⭐⭐ (5/10) - NOT RECOMMENDED FOR T4 GPU

T4-Specific Assessment

❌ OpenVINO on GPU: Code shows fallback to BART (line 38)
⚠️ Optimized for Intel CPU: T4 is NVIDIA, not Intel
⚠️ GPU Compatibility: May not leverage T4 effectively
⚠️ HF Spaces Config: Falls back due to "GPU issues" (line 38)
⚠️ Memory: ~7.6GB (FP16 weights), and the OpenVINO optimization may not apply

Performance Estimates

Load Time: ~15-30 seconds (includes conversion)
Inference: Variable (depends on GPU compatibility)
Memory Footprint: Moderate

T4 Deployment Verdict

❌ Not Recommended: OpenVINO optimized for Intel CPU
❌ Use Standard Phi-3: Better to use non-OpenVINO version
💡 Alternative: Use regular Phi-3-mini-4k-instruct instead

Recommendations

Avoid on T4: OpenVINO is CPU/Intel-focused
Use microsoft/Phi-3-mini-4k-instruct instead
Only consider if running on Intel CPU (not T4)

5. microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf

T4 Rating: ⭐⭐⭐⭐⭐ (9/10) - EXCELLENT FOR T4

T4-Specific Assessment

✅ VRAM: ~2GB (excellent - Q4 quantization)
✅ RAM: ~3-4GB during inference
✅ Speed: Very good on T4 (GGUF optimized)
✅ HF Spaces Config: Primary GGUF model (line 33)
✅ Extended Timeout: 600s configured for HF Spaces (routes_fastapi.py line 1075)
✅ Quantization: Q4 reduces memory by ~75%

Performance Estimates

Load Time: ~20-40 seconds (GGUF loading overhead)
Inference: ~4-8 seconds per summary
Memory Footprint: Very low, allows many concurrent requests

T4 Deployment Verdict

✅ Highly Recommended: Best for memory efficiency
✅ Production Ready: Proven on HF Spaces
✅ Scalable: Low memory allows high concurrency

Advantages Over Standard Phi-3

75% less VRAM usage
Better for concurrent requests
Quality loss minimal (Q4 quantization)

Recommendations

Best Choice for cost-conscious deployment
Use when expecting high concurrent load
Extended timeout already configured (600s)
Cache-friendly for repeated requests

6. google/flan-t5-large

T4 Rating: ⭐⭐⭐ (6/10) - FALLBACK ONLY

T4-Specific Assessment

✅ VRAM: ~2GB (fits well)
✅ RAM: ~2-3GB during inference
✅ Speed: Moderate on T4
✅ HF Spaces Config: Fallback model (line 14)
⚠️ Older Architecture: Less efficient than BART

Performance Estimates

Load Time: ~8-15 seconds
Inference: ~4-8 seconds per summary
Memory Footprint: Low

T4 Deployment Verdict

⚠️ Fallback Only: Use when BART fails
⚠️ Not Primary: Lower quality than alternatives
✅ Safe Fallback: Reliable if needed

Recommendations

Keep as fallback option
Monitor quality vs BART
Consider removing if BART is stable

T4 Medium Deployment Summary

Recommended Models (Priority Order)

🥇 Primary Recommendation: facebook/bart-large-cnn

Why: Best balance of quality, speed, and memory efficiency
VRAM: ~1.5GB (plenty of headroom)
Use Case: Default for all standard patient summaries
Production Status: ✅ Ready

🥈 Secondary Recommendation: microsoft/Phi-3-mini-4k-instruct-gguf

Why: Lowest memory footprint, good quality
VRAM: ~2GB (excellent for high concurrency)
Use Case: High-traffic scenarios, memory-constrained periods
Production Status: ✅ Ready (extended timeout configured)

🥉 Tertiary Recommendation: microsoft/Phi-3-mini-4k-instruct

Why: Better instruction-following, 4k context
VRAM: ~7.6GB in FP16 (fits within the 12.8GB budget)
Use Case: When structured output is critical
Production Status: ✅ Ready

Conditional Use Models

⚠️ patrickvonplaten/longformer2roberta-cnn_dailymail-fp16

Use When: Patient history exceeds 4096 tokens
Limitations: Higher memory, limit concurrency
Production Status: ⚠️ Monitor closely

Not Recommended for T4

❌ OpenVINO/Phi-3-mini-4k-instruct-fp16-ov

Why: Optimized for Intel CPU, not NVIDIA T4 GPU
Alternative: Use standard Phi-3-mini-4k-instruct

T4 Medium Resource Management

Memory Allocation Strategy

# Based on hf_spaces_config.py
MAX_GPU_MEMORY = 16GB * 0.8 = 12.8GB usable
MAX_RAM = 30GB * 0.8 = 24GB usable

# Recommended model allocation:
BART (primary):        ~1.5GB GPU  (12% of usable)
Phi-3 standard (FP16): ~7.6GB GPU  (59% of usable)
Phi-3 GGUF (Q4):       ~2.0GB GPU  (16% of usable)

# Headroom for:
- System overhead: ~1-2GB
- Concurrent requests: ~2-3GB per model instance
- Batch processing: ~1-2GB

Concurrent Request Limits

| Model | Max Concurrent | VRAM (base + per request) |
|---|---|---|
| bart-large-cnn | 5-8 requests | ~1.5GB base + 0.3GB/req |
| Phi-3-GGUF | 6-10 requests | ~2GB base + 0.2GB/req |
| Phi-3-standard | 4-6 requests | ~7.6GB base (FP16) + 0.4GB/req |
| Longformer | 2-3 requests | ~3GB base + 0.8GB/req |
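
For comparison, the VRAM-only ceiling implied by these base and per-request figures can be computed directly. The sketch below is arithmetic only and assumes a single model is loaded at a time; the function name is hypothetical, and sizes are in MB (taking 1GB = 1000MB) to keep the calculation in integers.

```python
def vram_concurrency_ceiling(
    total_mb: int, max_usage_pct: int, base_mb: int, per_request_mb: int
) -> int:
    """Largest request count whose VRAM fits: base + n * per_request <= budget."""
    budget_mb = total_mb * max_usage_pct // 100
    if base_mb > budget_mb:
        return 0  # the model alone does not fit in the budget
    return (budget_mb - base_mb) // per_request_mb


T4_VRAM_MB = 16_000  # 16GB T4, with the 80% usage cap applied below

for name, base_mb, per_request_mb in [
    ("bart-large-cnn", 1500, 300),
    ("Phi-3-GGUF", 2000, 200),
    ("Longformer", 3000, 800),
]:
    print(name, vram_concurrency_ceiling(T4_VRAM_MB, 80, base_mb, per_request_mb))
# bart-large-cnn 37
# Phi-3-GGUF 54
# Longformer 12
```

These ceilings are well above the limits in the table, which suggests the table's limits reflect compute and latency targets rather than VRAM alone.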

Timeout Configuration (T4-Specific)

Based on routes_fastapi.py:

Standard models: 120-180s timeout
GGUF models: 600s extended timeout (line 1075)
HF Spaces detection: Automatic (lines 1073-1074)
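
The timeout selection above can be sketched as follows. This is not the code in routes_fastapi.py: the helper names are hypothetical, detection is assumed to rely on the `HUGGINGFACE_SPACES` environment variable named in the deployment checklist, and GGUF models outside Spaces are assumed to use the standard timeout.

```python
import asyncio
import os
from typing import Any, Callable

STANDARD_TIMEOUT_S = 180     # upper end of the 120-180s range quoted above
GGUF_SPACES_TIMEOUT_S = 600  # extended timeout for GGUF on HF Spaces


def on_hf_spaces() -> bool:
    """Assumed detection: the HUGGINGFACE_SPACES environment variable is 'true'."""
    return os.environ.get("HUGGINGFACE_SPACES", "").lower() == "true"


def select_timeout(model_type: str, spaces: bool) -> int:
    """Return the generation timeout in seconds for a model type."""
    if model_type == "gguf" and spaces:
        return GGUF_SPACES_TIMEOUT_S
    return STANDARD_TIMEOUT_S


async def generate_with_timeout(generate: Callable[[], Any], model_type: str) -> Any:
    """Run a blocking generate() call in a worker thread under the selected timeout.

    asyncio.wait_for raises a timeout error once the limit passes. The worker
    thread itself cannot be interrupted and keeps running until generate() returns.
    """
    timeout = select_timeout(model_type, on_hf_spaces())
    return await asyncio.wait_for(asyncio.to_thread(generate), timeout=timeout)


print(asyncio.run(generate_with_timeout(lambda: "summary text", "summarization")))
```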

Optimization Strategies for T4

Model Caching: Enabled (cleanup every 5 minutes)
Quantization: Consider enabling for Phi-3 standard if needed
Batch Size: Keep small (1-2) to manage memory
Memory Cleanup: Automatic after each request
GPU Memory Management: 80% max usage enforced

Cost Analysis (T4 Medium)

Monthly Cost Estimate

Base Cost: $0.60/hour × 730 hours = $438/month (24/7)
Storage: Included in Medium tier (150GB)
No Additional Costs: No egress fees for model loading

Cost Optimization Tips

Use GGUF Models: Lower memory = better efficiency = potentially fewer instances
Model Caching: Reduces load times = faster response = better UX
Timeout Management: Prevents hanging requests = better resource utilization
Concurrent Requests: Maximize utilization per dollar

T4 Medium Deployment Checklist

✅ Pre-Deployment

Set HUGGINGFACE_SPACES=true environment variable
Configure timeout settings (GGUF extended timeout)
Enable model caching
Set memory limits (80% max usage)
Test model loading on T4

✅ Model Selection

Primary: facebook/bart-large-cnn (default)
Secondary: microsoft/Phi-3-mini-4k-instruct-gguf (for high load)
Fallback: google/flan-t5-large (if BART fails)
Avoid: OpenVINO model (not optimized for T4)

✅ Monitoring

GPU memory usage (target <80%)
RAM usage (target <24GB)
Inference latency (target <10s)
Concurrent request handling
Error rates and fallbacks

✅ Production Hardening

Implement request queuing for high load
Set up automatic fallback to GGUF on memory pressure
Configure alerting for memory spikes
Test concurrent request scenarios
Validate extended timeout for GGUF models
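
Request queuing and the memory-pressure fallback could be prototyped as below. This is a minimal sketch with hypothetical names (`RequestGate`, `pick_model`): an `asyncio.Semaphore` caps the number of in-flight generations, and a threshold check switches to the low-memory model. How GPU usage is measured is left to the caller and is not part of the sketch.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RequestGate:
    """Caps in-flight generations; additional requests wait for a free slot."""

    def __init__(self, max_concurrent: int) -> None:
        self._semaphore = asyncio.Semaphore(max_concurrent)

    async def run(self, job: Callable[[], Awaitable[T]]) -> T:
        async with self._semaphore:  # released even if the job raises
            return await job()


def pick_model(
    gpu_used_fraction: float,
    primary: str,
    low_memory: str,
    pressure_threshold: float = 0.8,
) -> str:
    """Use the low-memory model once GPU usage reaches the threshold."""
    return low_memory if gpu_used_fraction >= pressure_threshold else primary


async def demo() -> None:
    gate = RequestGate(max_concurrent=2)  # created inside the running event loop

    async def fake_generation() -> str:
        await asyncio.sleep(0.01)  # stands in for model inference
        return "summary"

    results = await asyncio.gather(*(gate.run(fake_generation) for _ in range(5)))
    print(len(results), "summaries generated")


asyncio.run(demo())
```

The default threshold of 0.8 matches the 80% memory cap used elsewhere in this review; the model names passed to `pick_model` would be the BART primary and the GGUF build.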

Final T4 Medium Recommendations

🎯 Optimal Configuration

Primary Model: facebook/bart-large-cnn
Fallback Model: microsoft/Phi-3-mini-4k-instruct-gguf
Emergency Fallback: google/flan-t5-large
Max Concurrent: 5-6 requests (BART), 8-10 (GGUF)
Memory Limit: 80% (12.8GB GPU, 24GB RAM)
Timeout: 180s (standard), 600s (GGUF)

📊 Expected Performance

Average Latency: 5-8 seconds per summary
Throughput: 60-100 summaries/hour (single instance)
Memory Efficiency: Excellent (plenty of headroom)
Cost Efficiency: Good ($0.60/hour = reasonable for T4)
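
The cost figures can be turned into a per-summary estimate with simple arithmetic. The sketch below uses only the numbers quoted in this review ($0.60/hour, 730 hours/month, 60-100 summaries/hour); the function names are hypothetical, and the result assumes the instance is fully utilised at the stated throughput.

```python
def monthly_cost(hourly_rate_usd: float, hours_per_month: int = 730) -> float:
    """Cost of running one instance around the clock for a month."""
    return hourly_rate_usd * hours_per_month


def cost_per_summary(hourly_rate_usd: float, summaries_per_hour: float) -> float:
    """Instance cost divided by throughput."""
    if summaries_per_hour <= 0:
        raise ValueError("summaries_per_hour must be positive")
    return hourly_rate_usd / summaries_per_hour


print(f"${monthly_cost(0.60):.2f} per month")                 # $438.00 per month
print(f"${cost_per_summary(0.60, 60):.3f} at 60 summaries/hour")    # $0.010 at 60 summaries/hour
print(f"${cost_per_summary(0.60, 100):.3f} at 100 summaries/hour")  # $0.006 at 100 summaries/hour
```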

✅ Production Ready Status

bart-large-cnn: ✅ Fully ready
Phi-3-GGUF: ✅ Fully ready
Phi-3-standard: ✅ Ready (monitor memory)
Longformer: ⚠️ Conditional use only
OpenVINO: ❌ Not recommended

Overall T4 Deployment Rating: ⭐⭐⭐⭐⭐ (9/10)

The T4 Medium space is well-suited for patient summary generation with the recommended models. Excellent memory headroom allows for reliable operation with multiple concurrent requests.