Patient Summary Generation - Model Review & Rating

Executive Summary

This document reviews and rates 6 models for the patient summary generation flow based on:

  • Compatibility with current implementation
  • Performance (speed, memory usage)
  • Quality (output quality for clinical summaries)
  • Token Limits (input/output capacity)
  • Reliability (error handling, fallbacks)

Flow Overview

The patient summary generation follows this flow:

  1. Data Processing: EHR data → robust parsing → delta calculation → baseline extraction
  2. Prompt Building: Creates structured prompts with patient data, visits, demographics
  3. Model Loading: Unified model manager handles loading with caching
  4. Generation: Model-specific generation with token limits (default: 8192 input, 1024-8192 output)
  5. Post-processing: Markdown formatting → ensure 4 sections → output

Model Ratings

1. facebook/bart-large-cnn

Type: summarization
Rating: ⭐⭐⭐⭐⭐ (9/10)

Strengths

  • ✅ Well-supported: Primary default model for summarization
  • ✅ Optimized pipeline: Uses HuggingFace summarization pipeline (lines 1337-1414)
  • ✅ Good quality: BART architecture excels at abstractive summarization
  • ✅ Memory efficient: ~406M parameters, reasonable for production
  • ✅ Fast inference: Optimized for summarization tasks

Weaknesses

  • ⚠️ Context length: Limited to ~1024 tokens input (standard BART)
  • ⚠️ May truncate: Long patient histories might be cut off

Implementation Details

# Handled in: async_patient_summary, lines 1337-1414
model_type = "summarization"
pipeline = unified_model_manager.get_model(model_name, "summarization")
config = GenerationConfig(max_tokens=1024, min_tokens=100, temperature=0.1, top_p=0.5)
raw_summary = await asyncio.to_thread(pipeline.generate, context, config)

Recommendations

  • ✅ Use for: Standard patient summaries with moderate history
  • ✅ Best for: Balanced quality/speed requirements
  • ✅ Production ready: Yes

2. patrickvonplaten/longformer2roberta-cnn_dailymail-fp16

Type: seq2seq
Rating: ⭐⭐⭐⭐ (7/10)

Strengths

  • ✅ Long context: Longformer architecture handles 4096+ tokens
  • ✅ Good for long histories: Better suited for extensive patient records
  • ✅ Half-precision: FP16 reduces memory footprint
  • ✅ Summarization training: Fine-tuned on CNN/DailyMail, a news summarization dataset (not medical text)

Weaknesses

  • ⚠️ Type handling: Currently mapped to summarization pipeline (line 1425)
  • ⚠️ Fallback on load failure: Code falls back to bart-large-cnn if loading fails (lines 1430-1442)
  • ⚠️ Less tested: Not a primary default model

Implementation Details

# Handled in: async_patient_summary, lines 1416-1503
model_type = "seq2seq"
# Mapped to summarization pipeline internally
model = unified_model_manager.get_model(model_name, "seq2seq")
# Falls back to default summarization model if load fails

Recommendations

  • ✅ Use for: Patients with extensive visit history (>50 visits)
  • ⚠️ Production: Needs testing for reliability
  • 💡 Improvement: Better error handling and specific seq2seq optimization

3. microsoft/Phi-3-mini-4k-instruct

Type: causal-openvino
Rating: ⭐⭐⭐⭐⭐ (9/10)

Strengths

  • ✅ Instruction-tuned: Phi-3 follows instructions well (good for structured summaries)
  • ✅ 4k context: 4096 tokens allows longer patient histories
  • ✅ Fast inference: Optimized model size (3.8B parameters)
  • ✅ OpenVINO support: Can use optimized inference
  • ✅ Handled as text-generation: Uses unified manager (lines 1177-1335)

Weaknesses

  • ⚠️ Requires good prompts: Instruction format must be clear
  • ⚠️ Potential OpenVINO path: May route to OpenVINO pipeline (lines 1229-1235)

Implementation Details

# Handled in: async_patient_summary, lines 1177-1335
model_type = "causal-openvino"  # "text-generation" follows the same path
# Mapped to text-generation internally
actual_model_type = "text-generation" if model_type in {"text-generation", "causal-openvino"} else model_type
model = unified_model_manager.get_model(model_name, "text-generation")
config = GenerationConfig(max_tokens=1024, temperature=0.1, top_p=0.5)
raw_summary = await asyncio.to_thread(model.generate, prompt, config)

Recommendations

  • ✅ Use for: Structured summaries with clear instructions
  • ✅ Best for: Modern instruction-following requirements
  • ✅ Production ready: Yes

4. OpenVINO/Phi-3-mini-4k-instruct-fp16-ov

Type: causal-openvino
Rating: ⭐⭐⭐⭐ (8/10)

Strengths

  • ✅ Optimized inference: OpenVINO optimization for CPU/Intel hardware
  • ✅ FP16 precision: Half-precision reduces memory
  • ✅ Same capabilities: Same as Phi-3-mini-4k-instruct model-wise
  • ✅ Fallback option: Listed as fallback for causal-openvino (model_config.py line 56)

Weaknesses

  • ⚠️ Hardware specific: Optimized for Intel hardware
  • ⚠️ Different loading path: Uses get_openvino_pipeline (line 1233)
  • ⚠️ May be slower on non-Intel: GPU may prefer standard model

Implementation Details

# Handled in: async_patient_summary, lines 1229-1235
elif model_type == "causal-openvino":
    from ..utils.model_loader_spaces import get_openvino_pipeline
    pipeline = await asyncio.to_thread(get_openvino_pipeline, model_name)

Recommendations

  • ✅ Use for: Intel CPU servers, optimized inference
  • ⚠️ Production: Test on target hardware first
  • 💡 Conditional: Use based on hardware detection
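
The hardware-conditional choice could look roughly like the sketch below. It is an illustration, not the project's code: `choose_phi3_variant` and its two inputs are hypothetical, and a real deployment would detect the GPU through its own framework rather than a boolean passed in by hand.

```python
PHI3_STANDARD = "microsoft/Phi-3-mini-4k-instruct"
PHI3_OPENVINO = "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov"


def choose_phi3_variant(has_nvidia_gpu: bool, cpu_description: str) -> str:
    """Pick the OpenVINO build only on Intel CPUs without an NVIDIA GPU."""
    is_intel_cpu = "intel" in cpu_description.lower()
    if is_intel_cpu and not has_nvidia_gpu:
        return PHI3_OPENVINO
    # Unknown or non-Intel hardware, or any NVIDIA GPU: use the standard model.
    return PHI3_STANDARD
```

An empty or unrecognized CPU description falls through to the standard model, which matches the review's advice to test the OpenVINO build on the target hardware before relying on it.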

5. microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf

Type: gguf
Rating: ⭐⭐⭐⭐ (8/10)

Strengths

  • ✅ Quantized: Q4 quantization = very low memory (~2.5GB vs ~7.6GB)
  • ✅ Fast inference: Optimized GGUF format
  • ✅ 4k context: Maintains 4096 token context
  • ✅ Well-handled: Dedicated GGUF pipeline (lines 1015-1175)
  • ✅ Caching: Pipeline caching for performance (GGUF_PIPELINE_CACHE)

Weaknesses

  • ⚠️ Slight quality loss: Q4 quantization may reduce quality slightly
  • ⚠️ Longer timeouts: Extended timeout needed (600s on HF Spaces)
  • ⚠️ File path parsing: Requires special handling for filename extraction

Implementation Details

# Handled in: async_patient_summary, lines 1015-1175
if model_name.endswith('.gguf'):
    parts = model_name.rsplit('/', 1)
    repo_id = parts[0]
    filename = parts[1]
else:
    repo_id = model_name
    filename = None

pipeline = await asyncio.to_thread(get_cached_gguf_pipeline, repo_id, filename)
full_prompt = f"""<|system|>...<|user|>...<|assistant|>"""
raw_summary = await asyncio.to_thread(
    pipeline.generate,
    full_prompt,
    max_tokens=1024,
    temperature=0.1,
    top_p=0.5
)
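
The filename extraction at the top of the snippet above can be isolated into a small helper for testing. This is a minimal sketch: `split_gguf_model_name` is a hypothetical name, and the extra `"/" in model_name` guard is an addition that avoids an index error on a bare `.gguf` filename.

```python
from typing import Optional, Tuple


def split_gguf_model_name(model_name: str) -> Tuple[str, Optional[str]]:
    """Split 'org/repo/file.gguf' into (repo_id, filename)."""
    if model_name.endswith(".gguf") and "/" in model_name:
        repo_id, filename = model_name.rsplit("/", 1)
        return repo_id, filename
    # No explicit file: treat the whole name as the repository id.
    return model_name, None


repo_id, filename = split_gguf_model_name(
    "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"
)
# repo_id  -> "microsoft/Phi-3-mini-4k-instruct-gguf"
# filename -> "Phi-3-mini-4k-instruct-q4.gguf"
```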

Recommendations

  • ✅ Use for: Memory-constrained environments, local deployment
  • ✅ Best for: HuggingFace Spaces deployment
  • ✅ Production ready: Yes, with extended timeout

6. google/flan-t5-large

Type: summarization
Rating: ⭐⭐⭐ (6/10)

Strengths

  • ✅ Fallback option: Listed as fallback for summarization (model_config.py line 31)
  • ✅ T5 architecture: Encoder-decoder, good for summarization
  • ✅ Well-supported: Standard HuggingFace summarization pipeline

Weaknesses

  • ⚠️ Older model: T5 architecture is less modern than BART/Longformer
  • ⚠️ Context limits: ~512 input tokens (less than BART)
  • ⚠️ Quality: Generally lower quality than BART-large
  • ⚠️ Not primary: Only used as fallback

Implementation Details

# Handled in: async_patient_summary, lines 1337-1414 (same as BART)
# Falls back from primary summarization models
fallback_model_name = model_config.get_default_model('summarization')
# Gets flan-t5-large as fallback

Recommendations

  • ⚠️ Use for: Fallback only when BART fails
  • ❌ Not recommended: For primary production use
  • 💡 Consider: Replacing with a better fallback option

Overall Flow Assessment

Current Implementation Strengths

  1. ✅ Unified Model Manager: Centralized loading and caching
  2. ✅ Model Type Handling: Supports all required types (summarization, seq2seq, gguf, causal-openvino)
  3. ✅ Robust Error Handling: Falls back to a rule-based summary
  4. ✅ Token Management: Configurable max tokens (default 1024, supports up to 8192)
  5. ✅ Progress Tracking: Job progress updates for all model types
  6. ✅ Memory Management: Cleanup after generation
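
The fallback behaviour in item 3 can be pictured as a chain of generators tried in order, ending in a rule-based summary. The sketch below assumes that shape; `summarize_with_fallbacks` and `rule_based_summary` are hypothetical names, and the truncation used as the rule-based summary is only a placeholder.

```python
from typing import Callable, Sequence


def rule_based_summary(context: str) -> str:
    """Placeholder last resort: the first 200 characters of the input."""
    return context[:200]


def summarize_with_fallbacks(
    context: str, generators: Sequence[Callable[[str], str]]
) -> str:
    """Try each generator in order; use the rule-based summary if all fail."""
    for generate in generators:
        try:
            summary = generate(context)
            if summary.strip():
                return summary
        except Exception:
            # A failed load or generation moves on to the next candidate.
            continue
    return rule_based_summary(context)
```

Treating an empty string as a failure keeps a model that loads but produces nothing from short-circuiting the chain.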

Areas for Improvement

  1. ⚠️ Input Token Limits: Some models have hard limits that may truncate long histories
  2. ⚠️ Prompt Optimization: Different models may need model-specific prompt formats
  3. ⚠️ Seq2Seq Mapping: Currently maps seq2seq → summarization (may not be optimal)
  4. ⚠️ Timeout Handling: GGUF models need extended timeouts

Model Comparison Matrix

| Model | Type | Context | Quality | Speed | Memory | Rating |
|---|---|---|---|---|---|---|
| bart-large-cnn | summarization | 1024 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 9/10 |
| longformer2roberta | seq2seq | 4096+ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | 7/10 |
| Phi-3-mini-4k | causal-openvino | 4096 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | 9/10 |
| Phi-3-OpenVINO | causal-openvino | 4096 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | 8/10 |
| Phi-3-GGUF | gguf | 4096 | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 8/10 |
| flan-t5-large | summarization | 512 | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | 6/10 |

Recommendations by Use Case

Best Overall Quality:

  1. facebook/bart-large-cnn (if context fits)
  2. microsoft/Phi-3-mini-4k-instruct (for longer contexts)

Best for Long Patient Histories:

  1. patrickvonplaten/longformer2roberta-cnn_dailymail-fp16
  2. microsoft/Phi-3-mini-4k-instruct (both handle 4k+ tokens)

Best for Memory-Constrained Environments:

  1. microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf
  2. facebook/bart-large-cnn (moderate memory)

Best for Intel CPU/OpenVINO:

  1. OpenVINO/Phi-3-mini-4k-instruct-fp16-ov

Production Primary Recommendations:

  1. facebook/bart-large-cnn (balanced, reliable)
  2. microsoft/Phi-3-mini-4k-instruct (modern, instruction-following)
  3. microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf (for Spaces)
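
The use-case guidance above can be condensed into a small selection helper. This is a sketch under assumptions: `recommend_model` and its flags are hypothetical, the 1024-token threshold is BART's input limit as stated in this review, and the order of the checks is one reasonable reading of the lists rather than the project's routing logic.

```python
BART = "facebook/bart-large-cnn"
PHI3 = "microsoft/Phi-3-mini-4k-instruct"
PHI3_GGUF = "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"
PHI3_OV = "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov"


def recommend_model(
    input_tokens: int,
    memory_constrained: bool = False,
    intel_cpu_only: bool = False,
) -> str:
    """Map the use cases above to a model name."""
    if intel_cpu_only:
        return PHI3_OV
    if memory_constrained:
        return PHI3_GGUF
    if input_tokens <= 1024:
        return BART  # fits BART's ~1024-token input window
    return PHI3  # longer contexts, up to ~4096 tokens
```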

Code Integration Assessment

All models are properly integrated into the flow with:

  • ✅ Proper type detection and routing
  • ✅ Unified model manager loading
  • ✅ Error handling and fallbacks
  • ✅ Progress tracking
  • ✅ Memory cleanup

Overall Flow Rating: ⭐⭐⭐⭐⭐ (9/10)

The implementation is robust and handles all model types well. Minor improvements could be made in:

  • Model-specific prompt optimization
  • Better handling of very long contexts
  • Seq2seq-specific optimizations

T4 Medium HuggingFace Spaces Deployment Review

T4 Medium Specifications

  • GPU Memory: 16GB VRAM (NVIDIA T4)
  • CPU: 8 vCPUs
  • RAM: 30GB
  • Storage: 150GB persistent
  • Cost: $0.60/hour (~$438/month if running 24/7, at 730 hours/month)

Memory Constraints Analysis

Safe Memory Usage Targets

Based on HF Spaces configuration (hf_spaces_config.py):

  • Max GPU Usage: 80% of 16GB = ~12.8GB usable
  • Max RAM Usage: 80% of 30GB = ~24GB usable
  • Model Caching: Enabled (cleanup every 5 minutes)

T4 Medium Deployment Ratings

1. facebook/bart-large-cnn

T4 Rating: ⭐⭐⭐⭐⭐ (10/10) - BEST CHOICE FOR T4

T4-Specific Assessment

  • ✅ VRAM: ~1.5GB (excellent fit)
  • ✅ RAM: ~2-3GB during inference
  • ✅ Speed: Fast on T4 (optimized for summarization)
  • ✅ HF Spaces Config: Primary model (hf_spaces_config.py line 13)
  • ✅ No Quantization Needed: Fits comfortably in 16GB
  • ✅ Cache-Friendly: Small size allows multiple instances

Performance Estimates

  • Load Time: ~5-10 seconds
  • Inference: ~2-5 seconds per summary
  • Memory Footprint: Very low, allows concurrent requests

T4 Deployment Verdict

  • ✅ Highly Recommended: Best balance for T4
  • ✅ Production Ready: Proven on HF Spaces
  • ✅ Cost Effective: Low resource usage = better uptime

2. patrickvonplaten/longformer2roberta-cnn_dailymail-fp16

T4 Rating: ⭐⭐⭐ (6/10) - USE WITH CAUTION

T4-Specific Assessment

  • ⚠️ VRAM: ~2-3GB (acceptable but tight with batching)
  • ⚠️ RAM: ~4-6GB during inference (higher overhead)
  • ⚠️ Speed: Moderate on T4 (Longformer attention is memory-intensive)
  • ⚠️ HF Spaces Config: Falls back to bart-large-cnn (line 18)
  • ⚠️ FP16 Helpful: Half-precision helps but still resource-intensive

Performance Estimates

  • Load Time: ~15-25 seconds
  • Inference: ~8-15 seconds per summary
  • Memory Footprint: High, limits concurrent requests

T4 Deployment Verdict

  • ⚠️ Conditional Use: Only for very long histories (>4096 tokens)
  • ⚠️ Not Primary: Should be fallback, not default
  • ⚠️ Monitor Memory: Risk of OOM with multiple concurrent requests

Recommendations

  • Use only when patient history exceeds 4096 tokens
  • Limit concurrent requests when using this model
  • Consider chunking strategy for very long histories
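
One possible chunking strategy is a sliding window with a small overlap, so that context at chunk boundaries is not lost. The sketch below is an assumption about how that could work, not the project's approach: `chunk_tokens` is a hypothetical helper, and a plain list of strings stands in for the output of the model's real tokenizer.

```python
from typing import List


def chunk_tokens(
    tokens: List[str], max_len: int, overlap: int = 0
) -> List[List[str]]:
    """Split tokens into windows of at most max_len sharing `overlap` tokens."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("require max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end
    return chunks
```

Each chunk would be summarized separately and the partial summaries combined in a second pass; how to combine them is left open here.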

3. microsoft/Phi-3-mini-4k-instruct

T4 Rating: ⭐⭐⭐⭐ (8/10) - GOOD CHOICE

T4-Specific Assessment

  • ✅ VRAM: ~2.5GB (fits comfortably)
  • ⚠️ RAM: ~4-5GB during inference
  • ✅ Speed: Good on T4 (3.8B parameters, optimized)
  • ✅ HF Spaces Config: Allowed in SPACES_OPTIMIZED_MODELS (line 111)
  • ⚠️ Text-Generation Mode: Uses more VRAM than summarization models

Performance Estimates

  • Load Time: ~10-20 seconds
  • Inference: ~5-10 seconds per summary
  • Memory Footprint: Moderate

T4 Deployment Verdict

  • ✅ Recommended: Good for instruction-following tasks
  • ✅ Production Ready: Handles 4k context well
  • ⚠️ Monitor: Slightly higher memory than BART

Recommendations

  • Enable quantization if memory becomes tight
  • Use for summaries requiring structured output
  • Consider the GGUF version if memory is a concern

4. OpenVINO/Phi-3-mini-4k-instruct-fp16-ov

T4 Rating: ⭐⭐⭐ (5/10) - NOT RECOMMENDED FOR T4 GPU

T4-Specific Assessment

  • ❌ OpenVINO on GPU: Code shows fallback to BART (line 38)
  • ⚠️ Optimized for Intel CPU: T4 is NVIDIA, not Intel
  • ⚠️ GPU Compatibility: May not leverage T4 effectively
  • ⚠️ HF Spaces Config: Falls back due to "GPU issues" (line 38)
  • ⚠️ Memory: ~2.5GB but optimization may not apply

Performance Estimates

  • Load Time: ~15-30 seconds (includes conversion)
  • Inference: Variable (depends on GPU compatibility)
  • Memory Footprint: Moderate

T4 Deployment Verdict

  • ❌ Not Recommended: OpenVINO optimized for Intel CPU
  • ❌ Use Standard Phi-3: Better to use non-OpenVINO version
  • 💡 Alternative: Use regular Phi-3-mini-4k-instruct instead

Recommendations

  • Avoid on T4: OpenVINO is CPU/Intel-focused
  • Use microsoft/Phi-3-mini-4k-instruct instead
  • Only consider if running on Intel CPU (not T4)

5. microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf

T4 Rating: ⭐⭐⭐⭐⭐ (9/10) - EXCELLENT FOR T4

T4-Specific Assessment

  • ✅ VRAM: ~2GB (excellent - Q4 quantization)
  • ✅ RAM: ~3-4GB during inference
  • ✅ Speed: Very good on T4 (GGUF optimized)
  • ✅ HF Spaces Config: Primary GGUF model (line 33)
  • ✅ Extended Timeout: 600s configured for HF Spaces (routes_fastapi.py line 1075)
  • ✅ Quantization: Q4 reduces memory by ~75%

Performance Estimates

  • Load Time: ~20-40 seconds (GGUF loading overhead)
  • Inference: ~4-8 seconds per summary
  • Memory Footprint: Very low, allows many concurrent requests

T4 Deployment Verdict

  • ✅ Highly Recommended: Best for memory efficiency
  • ✅ Production Ready: Proven on HF Spaces
  • ✅ Scalable: Low memory allows high concurrency

Advantages Over Standard Phi-3

  • 75% less VRAM usage
  • Better for concurrent requests
  • Quality loss minimal (Q4 quantization)

Recommendations

  • Best Choice for cost-conscious deployment
  • Use when expecting high concurrent load
  • Extended timeout already configured (600s)
  • Cache-friendly for repeated requests

6. google/flan-t5-large

T4 Rating: ⭐⭐⭐ (6/10) - FALLBACK ONLY

T4-Specific Assessment

  • ✅ VRAM: ~2GB (fits well)
  • ✅ RAM: ~2-3GB during inference
  • ✅ Speed: Moderate on T4
  • ✅ HF Spaces Config: Fallback model (line 14)
  • ⚠️ Older Architecture: Less efficient than BART

Performance Estimates

  • Load Time: ~8-15 seconds
  • Inference: ~4-8 seconds per summary
  • Memory Footprint: Low

T4 Deployment Verdict

  • ⚠️ Fallback Only: Use when BART fails
  • ⚠️ Not Primary: Lower quality than alternatives
  • ✅ Safe Fallback: Reliable if needed

Recommendations

  • Keep as fallback option
  • Monitor quality vs BART
  • Consider removing if BART is stable

T4 Medium Deployment Summary

Recommended Models (Priority Order)

🥇 Primary Recommendation: facebook/bart-large-cnn

  • Why: Best balance of quality, speed, and memory efficiency
  • VRAM: ~1.5GB (plenty of headroom)
  • Use Case: Default for all standard patient summaries
  • Production Status: ✅ Ready

🥈 Secondary Recommendation: microsoft/Phi-3-mini-4k-instruct-gguf

  • Why: Lowest memory footprint, good quality
  • VRAM: ~2GB (excellent for high concurrency)
  • Use Case: High-traffic scenarios, memory-constrained periods
  • Production Status: ✅ Ready (extended timeout configured)

🥉 Tertiary Recommendation: microsoft/Phi-3-mini-4k-instruct

  • Why: Better instruction-following, 4k context
  • VRAM: ~2.5GB (good fit)
  • Use Case: When structured output is critical
  • Production Status: ✅ Ready

Conditional Use Models

⚠️ patrickvonplaten/longformer2roberta-cnn_dailymail-fp16

  • Use When: Patient history exceeds 4096 tokens
  • Limitations: Higher memory, limit concurrency
  • Production Status: ⚠️ Monitor closely

Not Recommended for T4

❌ OpenVINO/Phi-3-mini-4k-instruct-fp16-ov

  • Why: Optimized for Intel CPU, not NVIDIA T4 GPU
  • Alternative: Use standard Phi-3-mini-4k-instruct

T4 Medium Resource Management

Memory Allocation Strategy

# Based on hf_spaces_config.py
MAX_GPU_MEMORY_GB = 16 * 0.8  # 12.8GB usable
MAX_RAM_GB = 30 * 0.8         # 24GB usable

# Recommended model allocation (share of usable GPU memory):
#   Primary (BART):    ~1.5GB GPU (~12% of usable)
#   Secondary (Phi-3): ~2.5GB GPU (~20% of usable)
#   GGUF (Phi-3-Q4):   ~2.0GB GPU (~16% of usable)

# Headroom for:
#   System overhead:     ~1-2GB
#   Concurrent requests: ~2-3GB per model instance
#   Batch processing:    ~1-2GB

Concurrent Request Limits

| Model | Max Concurrent | VRAM (base + per request) |
|---|---|---|
| bart-large-cnn | 5-8 requests | ~1.5GB base + 0.3GB/req |
| Phi-3-GGUF | 6-10 requests | ~2GB base + 0.2GB/req |
| Phi-3-standard | 4-6 requests | ~2.5GB base + 0.4GB/req |
| Longformer | 2-3 requests | ~3GB base + 0.8GB/req |
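
As a quick check of these limits against the 12.8GB usable VRAM, the arithmetic below estimates each model's footprint at the top of its concurrency range. It is a sketch that assumes the simple linear model implied by the table (base plus a fixed amount per request); `vram_at_limit` is a hypothetical helper.

```python
USABLE_VRAM_GB = 16 * 0.8  # 80% cap on the T4's 16GB

# (base GB, GB per request, upper concurrency limit) from the table above
MODELS = {
    "bart-large-cnn": (1.5, 0.3, 8),
    "Phi-3-GGUF": (2.0, 0.2, 10),
    "Phi-3-standard": (2.5, 0.4, 6),
    "Longformer": (3.0, 0.8, 3),
}


def vram_at_limit(base_gb: float, per_request_gb: float, requests: int) -> float:
    """Estimated VRAM when `requests` run concurrently on one loaded model."""
    return base_gb + per_request_gb * requests


for name, (base, per_req, limit) in MODELS.items():
    used = vram_at_limit(base, per_req, limit)
    print(f"{name}: {used:.1f}GB of {USABLE_VRAM_GB:.1f}GB usable")
```

Under this linear model every entry stays well below the usable budget even at its maximum concurrency, which is consistent with the headroom described in this review.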

Timeout Configuration (T4-Specific)

Based on routes_fastapi.py:

  • Standard models: 120-180s timeout
  • GGUF models: 600s extended timeout (line 1075)
  • HF Spaces detection: Automatic (lines 1073-1074)
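
The timeout selection described above might be structured as follows. This is a sketch, not the code in routes_fastapi.py: the function names are hypothetical, 180s is taken as the upper end of the standard range, and Spaces detection is assumed to read the `HUGGINGFACE_SPACES` variable named in the deployment checklist.

```python
import asyncio
import os

STANDARD_TIMEOUT_S = 180  # upper end of the 120-180s range
GGUF_SPACES_TIMEOUT_S = 600


def select_timeout(model_type: str, on_spaces: bool) -> int:
    """Return the generation timeout in seconds for a model type."""
    if model_type == "gguf" and on_spaces:
        return GGUF_SPACES_TIMEOUT_S
    return STANDARD_TIMEOUT_S


def running_on_spaces() -> bool:
    """Assumed detection: the HUGGINGFACE_SPACES flag is set to 'true'."""
    return os.environ.get("HUGGINGFACE_SPACES", "").lower() == "true"


async def generate_with_timeout(generate, prompt: str, model_type: str) -> str:
    """Run a blocking generate call in a thread, bounded by the timeout."""
    timeout = select_timeout(model_type, running_on_spaces())
    return await asyncio.wait_for(asyncio.to_thread(generate, prompt), timeout)
```

Note that `asyncio.wait_for` stops waiting when the timeout expires but cannot interrupt the worker thread, so a timed-out generation may keep running in the background until it finishes.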

Optimization Strategies for T4

  1. Model Caching: Enabled (cleanup every 5 minutes)
  2. Quantization: Consider enabling for Phi-3 standard if needed
  3. Batch Size: Keep small (1-2) to manage memory
  4. Memory Cleanup: Automatic after each request
  5. GPU Memory Management: 80% max usage enforced

Cost Analysis (T4 Medium)

Monthly Cost Estimate

  • Base Cost: $0.60/hour × 730 hours = $438/month (24/7)
  • Storage: Included in Medium tier (150GB)
  • No Additional Costs: No egress fees for model loading
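
The monthly figure, and a rough cost per summary, follow from simple arithmetic. In the sketch below the hourly rate and the 730-hour month come from the estimate above, and the 60-100 summaries/hour range is taken from this review's expected-performance figures; `cost_per_summary` is a hypothetical helper.

```python
HOURLY_RATE_USD = 0.60
HOURS_PER_MONTH = 730  # average month, as used in the estimate above

monthly_cost = HOURLY_RATE_USD * HOURS_PER_MONTH


def cost_per_summary(summaries_per_hour: float) -> float:
    """Compute cost per summary at a sustained hourly throughput."""
    return HOURLY_RATE_USD / summaries_per_hour


print(f"Monthly: ${monthly_cost:.2f}")
print(f"Per summary at 60/hour: ${cost_per_summary(60):.4f}")
print(f"Per summary at 100/hour: ${cost_per_summary(100):.4f}")
```

At those throughputs a summary costs between roughly $0.006 and $0.01, assuming the instance is kept busy for the whole hour.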

Cost Optimization Tips

  1. Use GGUF Models: Lower memory = better efficiency = potentially fewer instances
  2. Model Caching: Reduces load times = faster response = better UX
  3. Timeout Management: Prevents hanging requests = better resource utilization
  4. Concurrent Requests: Maximize utilization per dollar

T4 Medium Deployment Checklist

✅ Pre-Deployment

  • Set HUGGINGFACE_SPACES=true environment variable
  • Configure timeout settings (GGUF extended timeout)
  • Enable model caching
  • Set memory limits (80% max usage)
  • Test model loading on T4

✅ Model Selection

  • Primary: facebook/bart-large-cnn (default)
  • Secondary: microsoft/Phi-3-mini-4k-instruct-gguf (for high load)
  • Fallback: google/flan-t5-large (if BART fails)
  • Avoid: OpenVINO model (not optimized for T4)

✅ Monitoring

  • GPU memory usage (target <80%)
  • RAM usage (target <24GB)
  • Inference latency (target <10s)
  • Concurrent request handling
  • Error rates and fallbacks
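
The numeric targets in this list lend themselves to a simple threshold check. The sketch below is illustrative only: `check_targets` and the metric key names are hypothetical, and collecting the metrics themselves is outside its scope.

```python
from typing import Dict, List

# Targets from the monitoring list above
GPU_USAGE_MAX = 0.80  # fraction of VRAM
RAM_MAX_GB = 24.0
LATENCY_MAX_S = 10.0


def check_targets(metrics: Dict[str, float]) -> List[str]:
    """Return a description of each metric that misses its target."""
    violations = []
    if metrics["gpu_usage"] >= GPU_USAGE_MAX:
        violations.append("GPU memory at or above 80%")
    if metrics["ram_gb"] >= RAM_MAX_GB:
        violations.append("RAM at or above 24GB")
    if metrics["latency_s"] >= LATENCY_MAX_S:
        violations.append("inference latency at or above 10s")
    return violations
```

An empty result means all three targets are met; any returned string could feed the alerting mentioned under production hardening.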

✅ Production Hardening

  • Implement request queuing for high load
  • Set up automatic fallback to GGUF on memory pressure
  • Configure alerting for memory spikes
  • Test concurrent request scenarios
  • Validate extended timeout for GGUF models
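
Request queuing under high load can be approximated with a semaphore that caps how many generations run at once while the rest wait their turn. The sketch below assumes an asyncio-based service; `BoundedRunner` and `fake_summary` are hypothetical, and the limit of 5 follows the BART concurrency figure in this review.

```python
import asyncio


class BoundedRunner:
    """Queue requests so that at most `limit` run at the same time."""

    def __init__(self, limit: int) -> None:
        self._semaphore = asyncio.Semaphore(limit)
        self.active = 0
        self.peak = 0  # highest concurrency observed, for monitoring

    async def run(self, job, *args):
        async with self._semaphore:
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await job(*args)
            finally:
                self.active -= 1


async def fake_summary(patient_id: int) -> str:
    """Stand-in for model generation."""
    await asyncio.sleep(0.01)
    return f"summary-{patient_id}"


async def main() -> None:
    runner = BoundedRunner(limit=5)
    jobs = (runner.run(fake_summary, i) for i in range(20))
    results = await asyncio.gather(*jobs)
    print(len(results), "summaries, peak concurrency", runner.peak)


if __name__ == "__main__":
    asyncio.run(main())
```

The runner is created inside the running event loop, which keeps the semaphore bound to the correct loop on older Python versions.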

Final T4 Medium Recommendations

🎯 Optimal Configuration

Primary Model: facebook/bart-large-cnn
Fallback Model: microsoft/Phi-3-mini-4k-instruct-gguf
Emergency Fallback: google/flan-t5-large
Max Concurrent: 5-6 requests (BART), 8-10 (GGUF)
Memory Limit: 80% (12.8GB GPU, 24GB RAM)
Timeout: 180s (standard), 600s (GGUF)

📊 Expected Performance

  • Average Latency: 5-8 seconds per summary
  • Throughput: 60-100 summaries/hour (single instance)
  • Memory Efficiency: Excellent (plenty of headroom)
  • Cost Efficiency: Good ($0.60/hour = reasonable for T4)

✅ Production Ready Status

  • bart-large-cnn: ✅ Fully ready
  • Phi-3-GGUF: ✅ Fully ready
  • Phi-3-standard: ✅ Ready (monitor memory)
  • Longformer: ⚠️ Conditional use only
  • OpenVINO: ❌ Not recommended

Overall T4 Deployment Rating: ⭐⭐⭐⭐⭐ (9/10)

The T4 Medium space is well-suited for patient summary generation with the recommended models. Excellent memory headroom allows for reliable operation with multiple concurrent requests.