HNTAI - Comprehensive Technical Architecture Documentation
Version: 1.0
Last Updated: December 5, 2025
Project: Medical Data Extraction & AI Processing Platform
Table of Contents
- Executive Summary
- System Overview
- Architecture Design
- Technology Stack
- Core Components
- AI/ML Architecture
- API Architecture
- Data Flow & Processing
- Database Design
- Security Architecture
- Deployment Architecture
- Performance Optimization
- Monitoring & Observability
- Development Workflow
- Integration Patterns
- Scalability Considerations
- Future Roadmap
1. Executive Summary
HNTAI (Healthcare AI Text Analysis & Interpretation) is a production-ready, enterprise-grade medical AI platform designed for medical data extraction, processing, and analysis. The system provides HIPAA-compliant document processing, PHI scrubbing, and AI-powered patient summary generation with support for multiple AI model backends.
Key Capabilities
- Multi-format Document Processing: PDF, DOCX, images, and audio transcription
- HIPAA Compliance: Automated PHI scrubbing with comprehensive audit logging
- Multi-Model AI Support: Transformers, OpenVINO, and GGUF models with automatic optimization
- Scalable Architecture: Kubernetes-ready with horizontal scaling capabilities
- Production-Ready: Health checks, metrics, structured logging, and error handling
Target Deployment Environments
- Hugging Face Spaces (T4 Medium GPU)
- Kubernetes Clusters (On-premise or cloud)
- Docker Containers (Standalone or orchestrated)
- Local Development (CPU or GPU)
2. System Overview
2.1 Purpose & Scope
HNTAI serves as a comprehensive medical AI platform that bridges the gap between raw medical documents and actionable clinical insights. The system is designed to:
- Extract structured medical data from unstructured documents
- Anonymize protected health information (PHI) for compliance
- Summarize patient records into comprehensive clinical assessments
- Process multi-modal medical data (text, images, audio)
2.2 Design Principles
- Simplicity: Clean, maintainable codebase with essential features
- Flexibility: Support for multiple AI model types and backends
- Security: HIPAA-compliant with comprehensive audit trails
- Performance: Optimized for T4 GPU with intelligent caching
- Reliability: Robust error handling and automatic fallback mechanisms
2.3 High-Level Architecture
graph TB
subgraph "Client Layer"
A[Web Client]
B[Mobile Client]
C[API Client]
end
subgraph "API Gateway"
D[FastAPI Application]
E[Health Endpoints]
F[Metrics Endpoint]
end
subgraph "Service Layer"
G[Document Processing Service]
H[PHI Scrubbing Service]
I[Patient Summary Service]
J[Model Management Service]
end
subgraph "AI/ML Layer"
K[Unified Model Manager]
L[Transformers Models]
M[GGUF Models]
N[OpenVINO Models]
O[Whisper Audio Models]
end
subgraph "Data Layer"
P[PostgreSQL - Audit Logs]
Q[File Storage]
R[Model Cache]
end
A --> D
B --> D
C --> D
D --> E
D --> F
D --> G
D --> H
D --> I
D --> J
G --> K
H --> K
I --> K
J --> K
K --> L
K --> M
K --> N
K --> O
D --> P
G --> Q
K --> R
3. Architecture Design
3.1 Architectural Style
HNTAI follows a Layered Monolithic Architecture with clear separation of concerns:
- Presentation Layer: FastAPI routes and endpoints
- Service Layer: Business logic and orchestration
- Agent Layer: Specialized AI agents for specific tasks
- Utility Layer: Shared utilities and helpers
- Data Layer: Database and file storage
3.2 Component Architecture
graph LR
subgraph "FastAPI Application"
A[routes_fastapi.py]
B[app.py]
C[main.py]
end
subgraph "Agents"
D[patient_summary_agent.py]
E[phi_scrubber.py]
F[text_extractor.py]
G[medical_data_extractor.py]
end
subgraph "Services"
H[job_manager.py]
I[request_queue.py]
J[error_handler.py]
K[sse_generator.py]
end
subgraph "Utils"
L[unified_model_manager.py]
M[model_config.py]
N[robust_json_parser.py]
O[memory_manager.py]
end
A --> D
A --> E
A --> F
A --> G
A --> H
A --> I
D --> L
E --> L
F --> L
G --> L
L --> M
L --> O
3.3 Directory Structure
HNTAI/
├── services/
│   └── ai-service/
│       └── src/
│           └── ai_med_extract/
│               ├── agents/                  # AI agents for specific tasks
│               │   ├── patient_summary_agent.py
│               │   ├── phi_scrubber.py
│               │   ├── text_extractor.py
│               │   └── medical_data_extractor.py
│               ├── api/                     # FastAPI routes
│               │   └── routes_fastapi.py
│               ├── services/                # Business logic services
│               │   ├── job_manager.py
│               │   ├── request_queue.py
│               │   ├── error_handler.py
│               │   └── sse_generator.py
│               ├── utils/                   # Utilities and helpers
│               │   ├── unified_model_manager.py
│               │   ├── model_config.py
│               │   ├── robust_json_parser.py
│               │   ├── memory_manager.py
│               │   ├── openvino_summarizer_utils.py
│               │   └── patient_summary_utils.py
│               ├── app.py                   # FastAPI app factory
│               ├── main.py                  # Entry point
│               ├── health_endpoints.py      # Health checks
│               └── database_audit.py        # HIPAA audit logging
├── docs/                                    # Documentation
├── infra/                                   # Infrastructure configs
│   └── k8s/                                 # Kubernetes manifests
├── app.py                                   # HF Spaces entry point
├── Dockerfile                               # Multi-stage Docker build
├── Dockerfile.hf-spaces                     # HF Spaces optimized
├── .huggingface.yaml                        # HF Spaces config
├── models_config.json                       # Model configuration
├── requirements.txt                         # Python dependencies
└── README.md                                # Project documentation
4. Technology Stack
4.1 Core Technologies
| Category | Technology | Version | Purpose |
|---|---|---|---|
| Runtime | Python | 3.10+ | Primary language |
| Web Framework | FastAPI | Latest | REST API framework |
| ASGI Server | Uvicorn | Latest | Production server |
| AI/ML Framework | PyTorch | 2.x | Deep learning |
| Transformers | Hugging Face Transformers | Latest | Model loading |
| GGUF Support | llama-cpp-python | Latest | Quantized models |
| OpenVINO | optimum-intel | Latest | Intel optimization |
| Audio Processing | Whisper | Latest | Speech-to-text |
4.2 Supporting Technologies
| Category | Technology | Purpose |
|---|---|---|
| Database | PostgreSQL 13+ | Audit logs (optional) |
| Caching | In-memory LRU | Model caching |
| Document Processing | PyPDF2, python-docx | PDF/DOCX parsing |
| OCR | Tesseract | Image text extraction |
| Audio | FFmpeg | Audio processing |
| Containerization | Docker | Deployment |
| Orchestration | Kubernetes | Scaling |
| Monitoring | Prometheus | Metrics |
4.3 Development Tools
- Code Quality: Black, isort, flake8, mypy
- Testing: pytest
- Version Control: Git
- CI/CD: GitHub Actions (planned, not yet in place)
- Documentation: Markdown, Mermaid diagrams
5. Core Components
5.1 FastAPI Application (app.py)
Purpose: Application factory and initialization
Key Responsibilities:
- Create and configure FastAPI application
- Initialize agents and services
- Register routes and middleware
- Configure CORS and security
Key Functions:
def create_app(initialize: bool = True) -> FastAPI
def initialize_agents(app: FastAPI, preload_small_models: bool = False)
def run_dev() # Development server
5.2 API Routes (routes_fastapi.py)
Purpose: RESTful API endpoints
Endpoint Categories:
Health & Monitoring
- GET /health/live - Liveness probe
- GET /health/ready - Readiness probe
- GET /metrics - Prometheus metrics
Document Processing
- POST /upload - Upload and process documents
- POST /transcribe - Audio transcription
- GET /get_updated_medical_data - Retrieve processed data
- PUT /update_medical_data - Update medical records
AI Processing
- POST /generate_patient_summary - Generate patient summaries
- POST /api/generate_summary - Text summarization
- POST /api/patient_summary_openvino - OpenVINO summaries
- POST /extract_medical_data - Extract structured data
Model Management
- POST /api/load_model - Load specific models
- GET /api/model_info - Model information
- POST /api/switch_model - Switch models
5.3 Agents
5.3.1 Patient Summary Agent (patient_summary_agent.py)
Purpose: Generate comprehensive patient summaries
Key Features:
- Dynamic model configuration
- Multi-section summary generation
- Chronological narrative building
- Clinical guideline evaluation
- Fallback text-based summarization
Core Methods:
def configure_model(model_name: str, model_type: str)
def generate_clinical_summary(patient_data: Union[List[str], Dict])
def generate_patient_summary(patient_data: Union[List[str], Dict])
def build_chronological_narrative(patient_data: dict)
def format_clinical_output(raw_summary: str, patient_data: dict)
5.3.2 PHI Scrubber (phi_scrubber.py)
Purpose: Remove protected health information
Scrubbing Capabilities:
- Patient names
- Medical record numbers (MRN)
- Dates of birth
- Phone numbers
- Email addresses
- Social Security Numbers
- Addresses
Compliance: HIPAA-compliant with audit logging
5.3.3 Text Extractor (text_extractor.py)
Purpose: Extract text from various document formats
Supported Formats:
- PDF documents
- DOCX files
- Images (via OCR)
- Plain text
5.3.4 Medical Data Extractor (medical_data_extractor.py)
Purpose: Extract structured medical data from text
Extraction Targets:
- Diagnoses
- Medications
- Procedures
- Lab results
- Vital signs
- Allergies
5.4 Services
5.4.1 Job Manager (job_manager.py)
Purpose: Manage long-running jobs
Features:
- Job lifecycle management
- Progress tracking
- Status updates
- Result caching
- Cleanup of completed jobs
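The lifecycle features listed above can be sketched with a small in-memory store. This is a minimal illustration, not the contents of job_manager.py: the class name `InMemoryJobManager`, the `Job` fields, and the status strings are all assumptions made for the example.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Job:
    job_id: str
    status: str = "pending"      # assumed lifecycle: pending -> running -> completed | failed
    progress: float = 0.0        # fraction complete, 0.0 to 1.0
    result: Optional[Any] = None
    error: Optional[str] = None
    updated_at: float = field(default_factory=time.time)


class InMemoryJobManager:
    """Hypothetical stand-in for the job manager; names are illustrative only."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def create(self) -> Job:
        job = Job(job_id=uuid.uuid4().hex)
        self._jobs[job.job_id] = job
        return job

    def update(self, job_id: str, **changes: Any) -> Job:
        # Apply status/progress/result changes and refresh the timestamp.
        job = self._jobs[job_id]
        for key, value in changes.items():
            setattr(job, key, value)
        job.updated_at = time.time()
        return job

    def get(self, job_id: str) -> Optional[Job]:
        return self._jobs.get(job_id)

    def cleanup(self, max_age_seconds: float) -> int:
        """Drop finished jobs older than max_age_seconds; return how many were removed."""
        cutoff = time.time() - max_age_seconds
        stale = [
            jid for jid, job in self._jobs.items()
            if job.status in ("completed", "failed") and job.updated_at <= cutoff
        ]
        for jid in stale:
            del self._jobs[jid]
        return len(stale)
```

Keeping the result on the job record is what makes result caching and status polling cheap; a multi-replica deployment would need a shared store instead of a process-local dictionary.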
5.4.2 Request Queue (request_queue.py)
Purpose: Queue and prioritize requests
Features:
- Request queuing
- Priority handling
- Concurrency control
- Timeout management
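One way to combine the four features above is an asyncio priority queue drained by a fixed number of workers. The sketch below is an assumption about how such a queue could look, not the code in request_queue.py; `PriorityRequestQueue`, `submit`, and `drain` are hypothetical names, and lower priority numbers are assumed to run first.

```python
import asyncio
import itertools
from typing import Any, Awaitable, Callable


class PriorityRequestQueue:
    """Hypothetical sketch; create it from inside a running event loop."""

    def __init__(self, max_concurrent: int = 1) -> None:
        self._queue = asyncio.PriorityQueue()
        self._order = itertools.count()        # tie-breaker: equal priorities run FIFO
        self._max_concurrent = max_concurrent  # concurrency control: number of workers

    async def submit(
        self, priority: int, work: Callable[[], Awaitable[Any]], timeout: float
    ) -> "asyncio.Future":
        # The returned future resolves to the result, or carries the error/timeout.
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((priority, next(self._order), work, timeout, future))
        return future

    async def _worker(self) -> None:
        while True:
            _, _, work, timeout, future = await self._queue.get()
            try:
                # wait_for cancels the work and raises a timeout error if it runs too long.
                future.set_result(await asyncio.wait_for(work(), timeout))
            except Exception as exc:
                future.set_exception(exc)
            finally:
                self._queue.task_done()

    async def drain(self) -> None:
        """Run workers until every queued request has finished, then stop them."""
        workers = [
            asyncio.create_task(self._worker()) for _ in range(self._max_concurrent)
        ]
        await self._queue.join()
        for worker in workers:
            worker.cancel()
        await asyncio.gather(*workers, return_exceptions=True)
```

With a single GPU and batch size 1, `max_concurrent=1` serializes inference while still letting urgent requests jump ahead of routine ones.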
5.4.3 Error Handler (error_handler.py)
Purpose: Centralized error handling
Features:
- Error categorization
- Contextual logging
- Job error updates
- Graceful degradation
5.4.4 SSE Generator (sse_generator.py)
Purpose: Server-Sent Events for real-time updates
Features:
- Progress streaming
- Status updates
- Error notifications
- Completion events
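For reference, a Server-Sent Events frame is plain text: an optional `event:` line, one or more `data:` lines, and a blank line as terminator. The sketch below shows how job updates might be turned into such frames; the event names (`progress`, `complete`, `error`) and the shape of the update dictionaries are assumptions, not taken from sse_generator.py.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def format_sse(event: str, data: Dict[str, Any]) -> str:
    """Serialize one event in the text/event-stream wire format."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def job_event_stream(updates: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Yield one SSE frame per job update and stop after a terminal status."""
    for update in updates:
        status = update.get("status")
        if status == "failed":
            yield format_sse("error", update)
            return
        if status == "completed":
            yield format_sse("complete", update)
            return
        yield format_sse("progress", update)
```

A generator like this could be handed to a streaming HTTP response with the `text/event-stream` media type.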
6. AI/ML Architecture
6.1 Unified Model Manager
File: unified_model_manager.py
Purpose: Single interface for all AI model types
Architecture:
classDiagram
class BaseModel {
<<abstract>>
+name: str
+model_type: str
+status: ModelStatus
+load()
+generate(prompt, config)*
+unload()
}
class TransformersModel {
+_model: Pipeline
+_load_implementation()
+generate(prompt, config)
}
class GGUFModel {
+_model: Llama
+filename: str
+_extract_filename()
+_load_implementation()
+generate(prompt, config)
}
class OpenVINOModel {
+_model: OVModelForCausalLM
+_tokenizer: AutoTokenizer
+_load_implementation()
+generate(prompt, config)
}
class FallbackModel {
+_load_implementation()
+generate(prompt, config)
}
class UnifiedModelManager {
+max_models: int
+max_memory_mb: int
+get_model(name, type)
+generate_text(name, prompt)
+cleanup()
}
BaseModel <|-- TransformersModel
BaseModel <|-- GGUFModel
BaseModel <|-- OpenVINOModel
BaseModel <|-- FallbackModel
UnifiedModelManager --> BaseModel
6.2 Model Types
6.2.1 Transformers Models
Backend: Hugging Face Transformers
Device: GPU (CUDA) or CPU
Use Cases: General text generation, summarization
Supported Models:
- microsoft/Phi-3-mini-4k-instruct
- facebook/bart-large-cnn (deprecated)
- google/flan-t5-large
Configuration:
{
"model_name": "microsoft/Phi-3-mini-4k-instruct",
"model_type": "text-generation",
"device_map": "auto",
"torch_dtype": "float16"
}
6.2.2 GGUF Models
Backend: llama-cpp-python
Device: CPU or GPU (via Metal/CUDA)
Use Cases: Efficient inference with quantized models
Supported Models:
- microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf (PRIMARY)
Configuration:
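{
  "model_path": "path/to/model.gguf",
  "n_ctx": 8192,
  "n_threads": 4,
  "n_gpu_layers": 35
}
Here n_gpu_layers is the number of model layers offloaded to the GPU; this is what enables GPU acceleration for GGUF models.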
6.2.3 OpenVINO Models
Backend: Intel OpenVINO
Device: CPU (Intel optimized) or GPU
Use Cases: Production deployment on Intel hardware
Supported Models:
OpenVINO/Phi-3-mini-4k-instruct-fp16-ov
Configuration:
{
  "model_path": "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov",
  "device": "GPU"
}
The device is set to "GPU" when a GPU is available and to "CPU" otherwise.
6.3 Model Selection Strategy
flowchart TD
A[Request with model_name] --> B{Model specified?}
B -->|Yes| C{Model type?}
B -->|No| D[Use default: Phi-3 GGUF]
C -->|GGUF| E[Load GGUF Model]
C -->|OpenVINO| F[Load OpenVINO Model]
C -->|Transformers| G[Load Transformers Model]
C -->|Unknown| H[Auto-detect type]
E --> I{Load successful?}
F --> I
G --> I
H --> I
D --> I
I -->|Yes| J[Generate with model]
I -->|No| K[Try fallback model]
K --> L{Fallback successful?}
L -->|Yes| J
L -->|No| M[Use text-based fallback]
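The decision flow above amounts to trying candidates in order until one loads. The following sketch illustrates that chain under stated assumptions: `resolve_generator`, the loader registry, and the fallback model label `phi-3-transformers` are hypothetical, the auto-detect branch for unknown types is omitted, and the text-based fallback here simply truncates the input.

```python
from typing import Callable, Dict, Optional, Tuple

Generate = Callable[[str], str]
Loader = Callable[[str], Generate]   # model name -> generate(prompt) function


def text_fallback(prompt: str) -> str:
    """Last resort when no model loads: return a truncated excerpt of the input."""
    return prompt[:200]


def resolve_generator(
    loaders: Dict[str, Loader],
    model_name: Optional[str] = None,
    model_type: Optional[str] = None,
    default: Tuple[str, str] = ("phi-3-gguf", "gguf"),
    fallback: Tuple[str, str] = ("phi-3-transformers", "transformers"),
) -> Tuple[str, Generate]:
    """Return (label, generate_fn), walking requested/default -> fallback -> text-based."""
    requested = (model_name, model_type) if model_name and model_type else default
    for name, kind in (requested, fallback):
        loader = loaders.get(kind)
        if loader is None:
            continue              # no backend registered for this model type
        try:
            return name, loader(name)
        except Exception:
            continue              # load failed; move on to the next candidate
    return "text-fallback", text_fallback
```

Returning the label alongside the generator lets the caller log which path actually served the request.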
6.4 Model Configuration
File: models_config.json
{
"patient_summary_models": [
{
"name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
"type": "gguf",
"is_active": true,
"cached": true,
"description": "Phi-3 Mini GGUF Q4 quantized - PRIMARY MODEL",
"use_case": "Fast patient summary generation with CPU/GPU",
"repo_id": "microsoft/Phi-3-mini-4k-instruct-gguf",
"filename": "Phi-3-mini-4k-instruct-q4.gguf"
}
],
"runtime_behavior": {
"allow_runtime_downloads": true,
"cache_runtime_downloads": true,
"fallback_to_cached": true
}
}
6.5 Token Management
Token Limit Handling:
- Automatic token counting (heuristic: ~4 chars/token)
- Pre-generation validation
- Token limit error detection
- Graceful degradation
Token Limits by Model:
- Phi-3 mini 4k models: 4096 tokens (context window, shared by the prompt and the generated output)
- BART models: 1024 tokens
- T5 models: 512 tokens
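A rough version of the counting heuristic and pre-generation check described above could look like this. The family labels used as dictionary keys and the function names are assumptions; a real tokenizer will give different counts, so the estimate should be treated as a guard rather than an exact measure.

```python
import math

# Context windows from the list above, keyed by a hypothetical model-family label.
CONTEXT_WINDOWS = {"phi-3": 4096, "bart": 1024, "t5": 512}
CHARS_PER_TOKEN = 4  # heuristic: roughly four characters per token


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the character count, rounding up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_context(text: str, family: str, reserved_output_tokens: int = 0) -> bool:
    """True if the estimated prompt plus the reserved output fits the model's window."""
    return estimate_tokens(text) + reserved_output_tokens <= CONTEXT_WINDOWS[family]
```

When `fits_context` returns False, the caller can shorten the input or split it into chunks before generation instead of waiting for the backend to raise a token limit error.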
6.6 Generation Configuration
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    max_tokens: int = 8192     # Maximum output tokens
    min_tokens: int = 50       # Minimum output tokens
    temperature: float = 0.3   # Low temperature for more consistent medical output
    top_p: float = 0.9         # Nucleus sampling
    timeout: float = 180.0     # Generation timeout in seconds (T4)
    stream: bool = False       # Streaming support
6.7 T4 GPU Optimizations
Hardware Target: NVIDIA T4 Medium (16GB GPU, 16GB RAM)
Optimizations:
Memory Management:
- Max 2 models in memory
- Automatic model unloading
- GPU memory clearing
- Garbage collection
Model Loading:
- Lazy loading (on-demand)
- Intelligent caching
- LRU eviction policy
Inference:
- FP16 precision
- Batch size: 1
- Context window: 8192 tokens
- GPU layer offloading (GGUF)
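The combination of lazy loading, a two-model limit, and LRU eviction can be illustrated with an ordered dictionary. This is a sketch only: `LRUModelCache` and its `on_evict` hook are hypothetical names, and the actual unified model manager may track memory rather than a simple model count.

```python
from collections import OrderedDict
from typing import Any, Callable, List, Optional


class LRUModelCache:
    """Hypothetical sketch: keep at most max_models loaded, evicting the least recently used."""

    def __init__(
        self,
        max_models: int = 2,
        on_evict: Optional[Callable[[str, Any], None]] = None,
    ) -> None:
        self.max_models = max_models
        self._on_evict = on_evict
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str, loader: Callable[[str], Any]) -> Any:
        if name in self._models:
            self._models.move_to_end(name)      # mark as most recently used
            return self._models[name]
        model = loader(name)                    # lazy load on first request
        self._models[name] = model
        while len(self._models) > self.max_models:
            old_name, old_model = self._models.popitem(last=False)
            if self._on_evict is not None:
                self._on_evict(old_name, old_model)   # e.g. unload and free memory
        return model

    def loaded(self) -> List[str]:
        """Names of loaded models, least recently used first."""
        return list(self._models)
```

In a GPU deployment the `on_evict` hook is where unloading, garbage collection, and GPU memory clearing would be triggered.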
7. API Architecture
7.1 RESTful Design
Principles:
- Resource-oriented URLs
- HTTP methods for CRUD operations
- JSON request/response format
- Stateless communication
- Proper HTTP status codes
7.2 Request/Response Flow
sequenceDiagram
participant C as Client
participant A as API Gateway
participant S as Service Layer
participant M as Model Manager
participant D as Database
C->>A: POST /generate_patient_summary
A->>A: Validate request
A->>S: Create job
S->>D: Log job creation
A-->>C: 202 Accepted (job_id)
S->>M: Load model
M->>M: Check cache
M->>M: Load if needed
M-->>S: Model ready
S->>M: Generate summary
M->>M: Process prompt
M-->>S: Generated text
S->>D: Log completion
S->>A: Update job status
A-->>C: SSE: Progress updates
C->>A: GET /job/{job_id}
A->>S: Get job status
S->>D: Retrieve job
S-->>A: Job result
A-->>C: 200 OK (result)
7.3 Authentication & Authorization
Current State: Basic API key authentication (optional)
Planned Enhancements:
- JWT-based authentication
- Role-based access control (RBAC)
- OAuth2 integration
- API rate limiting
7.4 Error Handling
Error Response Format:
{
"error": {
"code": "MODEL_LOAD_FAILED",
"message": "Failed to load model: microsoft/Phi-3-mini-4k-instruct",
"details": {
"model_name": "microsoft/Phi-3-mini-4k-instruct",
"error_type": "initialization_error",
"timestamp": "2025-12-05T17:23:52Z"
}
}
}
HTTP Status Codes:
- 200 OK - Successful request
- 202 Accepted - Job created
- 400 Bad Request - Invalid input
- 404 Not Found - Resource not found
- 500 Internal Server Error - Server error
- 503 Service Unavailable - Service degraded
7.5 Rate Limiting
Strategy: Token bucket algorithm
Limits:
- 100 requests/minute per IP
- 1000 requests/hour per API key
- Burst allowance: 20 requests
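A token bucket refills at a steady rate up to a fixed capacity, and each request spends one token; the capacity is what permits short bursts. The sketch below shows the algorithm with the per-IP figures above (100 requests per minute, burst of 20) as example parameters. The class name and the injectable `clock` argument are assumptions made so the example can be tested deterministically.

```python
import time
from typing import Callable


class TokenBucket:
    """Minimal token bucket; one instance per client key (IP address or API key)."""

    def __init__(
        self,
        rate_per_second: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_second
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Example parameters matching the per-IP limit: 100 requests/minute, burst of 20.
per_ip_bucket = TokenBucket(rate_per_second=100 / 60, capacity=20)
```

A request for which `allow()` returns False would typically be rejected with HTTP 429 Too Many Requests.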
8. Data Flow & Processing
8.1 Document Processing Pipeline
flowchart LR
A[Upload Document] --> B{File Type?}
B -->|PDF| C[PDF Parser]
B -->|DOCX| D[DOCX Parser]
B -->|Image| E[OCR Engine]
B -->|Audio| F[Whisper Transcription]
C --> G[Text Extraction]
D --> G
E --> G
F --> G
G --> H[PHI Scrubbing]
H --> I[Medical Data Extraction]
I --> J[Store Processed Data]
J --> K[Return Results]
8.2 Patient Summary Generation Flow
flowchart TD
A[Patient Data Input] --> B[Parse EHR Data]
B --> C[Convert to Plain Text]
C --> D{Data Size Check}
D -->|Small| E[Single-pass Generation]
D -->|Large| F[Chunking Strategy]
F --> G[Chunk by Date/Size]
G --> H[Process Chunks in Parallel]
H --> I[Combine Chunk Summaries]
E --> J[Generate with Model]
I --> J
J --> K[Format Clinical Output]
K --> L[Evaluate Against Guidelines]
L --> M[Return Summary]
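The large-input branch of this flow is a map-reduce: split the record into chunks, summarize each chunk, then summarize the combined partial summaries. A minimal sketch follows, assuming the visit notes are already in date order and that `summarize` is any callable wrapping the model; the function names and the default `max_chars` are illustrative, not taken from the codebase.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def chunk_visits(visits: List[str], max_chars: int) -> List[str]:
    """Greedily pack notes into chunks of up to max_chars characters.

    A single note longer than max_chars becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current = ""
    for note in visits:
        candidate = f"{current}\n{note}" if current else note
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = note
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def summarize_record(
    visits: List[str],
    summarize: Callable[[str], str],
    max_chars: int = 6000,
    workers: int = 2,
) -> str:
    """Single pass for small inputs; chunk, summarize in parallel, then combine for large ones."""
    chunks = chunk_visits(visits, max_chars)
    if len(chunks) <= 1:
        return summarize(chunks[0]) if chunks else ""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize, chunks))   # map preserves chunk order
    return summarize("\n".join(partials))
```

Parallel chunk processing only helps if the backend can serve concurrent requests; with one model on one GPU the chunks will effectively run one after another.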
8.3 Data Transformation
Input Formats:
- Raw EHR JSON
- HL7 FHIR resources
- Plain text documents
- Scanned images
- Audio recordings
Output Formats:
- Structured JSON
- Clinical summary (Markdown)
- FHIR-compliant resources
- Audit logs
8.4 Caching Strategy
Multi-Level Caching:
- Model Cache: Loaded models in memory
- Result Cache: Generated summaries (LRU)
- File Cache: Processed documents
- Hugging Face Cache: Downloaded models
Cache Invalidation:
- Time-based expiration
- Manual invalidation
- Memory pressure-based eviction
9. Database Design
9.1 Database Schema
Primary Database: PostgreSQL (optional, for audit logs)
Audit Logs Table
CREATE TABLE audit_logs (
id SERIAL PRIMARY KEY,
timestamp TIMESTAMP NOT NULL DEFAULT NOW(),
user_id VARCHAR(255),
action VARCHAR(100) NOT NULL,
resource_type VARCHAR(100),
resource_id VARCHAR(255),
phi_accessed BOOLEAN DEFAULT FALSE,
ip_address INET,
user_agent TEXT,
request_data JSONB,
response_status INTEGER,
error_message TEXT,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_audit_timestamp ON audit_logs(timestamp);
CREATE INDEX idx_audit_user ON audit_logs(user_id);
CREATE INDEX idx_audit_action ON audit_logs(action);
CREATE INDEX idx_audit_phi ON audit_logs(phi_accessed);
9.2 Data Models
Patient Data Model (In-memory):
{
"patient_id": "string",
"demographics": {
"name": "string",
"dob": "date",
"gender": "string",
"mrn": "string"
},
"visits": [
{
"visit_id": "string",
"date": "datetime",
"chief_complaint": "string",
"diagnoses": ["string"],
"medications": ["string"],
"procedures": ["string"],
"vitals": {},
"labs": []
}
]
}
9.3 File Storage
Storage Strategy: Local filesystem or cloud storage
Directory Structure:
/data/
├── uploads/      # Uploaded documents
├── processed/    # Processed documents
├── cache/        # Temporary cache
└── models/       # Model files
10. Security Architecture
10.1 HIPAA Compliance
Requirements Met:
- Access Controls: API key authentication (currently optional); JWT and role-based access control are planned
- Audit Logging: Comprehensive activity logs
- Data Encryption: In-transit and at-rest
- PHI Scrubbing: Automated anonymization
- Secure Communication: HTTPS/TLS
10.2 PHI Scrubbing
Scrubbing Patterns:
PATTERNS = {
    "name": r'\b[A-Z][a-z]+ [A-Z][a-z]+\b',
    "mrn": r'\bMRN[:\s]*\d{6,10}\b',
    "dob": r'\b\d{1,2}/\d{1,2}/\d{2,4}\b',
    "phone": r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    # [A-Za-z] rather than [A-Z|a-z]: inside a character class, | is a literal pipe
    "email": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    "ssn": r'\b\d{3}-\d{2}-\d{4}\b'
}
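To show how patterns like these might be applied, here is a minimal sketch that replaces each match with a bracketed placeholder and counts the replacements for audit purposes. It uses a subset of the patterns (the broad name and date patterns are left out to keep the example predictable); the function name `scrub_phi`, the placeholder format, and the returned counts are assumptions rather than the behaviour of phi_scrubber.py.

```python
import re
from typing import Dict, Tuple

# Subset of the scrubbing patterns, compiled once at import time.
PHI_PATTERNS = {
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}


def scrub_phi(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace each match with a placeholder such as [PHONE]; return text and per-type counts."""
    counts: Dict[str, int] = {}
    for label, pattern in PHI_PATTERNS.items():
        text, replaced = pattern.subn(f"[{label.upper()}]", text)
        if replaced:
            counts[label] = replaced
    return text, counts
```

The counts can be written to the audit log without storing the PHI itself. Regex scrubbing of this kind will miss identifiers that do not fit the expected formats, so it is best treated as one layer of protection.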
10.3 Container Security
Security Measures:
- Non-root user execution
- Read-only root filesystem
- Resource limits (CPU, memory)
- Network policies
- Secrets management
- Minimal base images
10.4 API Security
Security Headers:
{
"X-Content-Type-Options": "nosniff",
"X-Frame-Options": "DENY",
"X-XSS-Protection": "1; mode=block",
"Strict-Transport-Security": "max-age=31536000"
}
11. Deployment Architecture
11.1 Deployment Options
11.1.1 Hugging Face Spaces
Configuration: .huggingface.yaml
runtime: docker
sdk: docker
python_version: "3.10"
build:
  dockerfile: Dockerfile.hf-spaces
  cache: true
hardware:
  gpu: t4-medium  # 16GB GPU RAM, 16GB System RAM
env:
  - SPACE_ID=$SPACE_ID
  - HF_HOME=/app/.cache/huggingface
  - TORCH_HOME=/app/.cache/torch
  - MODEL_CACHE_DIR=/app/models
  - PRELOAD_GGUF=true
  - HF_SPACES=true
Optimizations:
- Pre-cached models in Docker image
- Lazy model loading
- Memory-efficient inference
- Automatic GPU detection
11.1.2 Kubernetes
Deployment Manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hntai-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hntai
  template:
    metadata:
      labels:
        app: hntai
    spec:
      containers:
        - name: hntai
          image: hntai:latest
          ports:
            - containerPort: 7860
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
          livenessProbe:
            httpGet:
              path: /health/live
              port: 7860
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 7860
            initialDelaySeconds: 10
            periodSeconds: 5
11.1.3 Docker
Multi-Stage Dockerfile:
# Stage 1: Builder
FROM python:3.10-slim AS builder
RUN apt-get update && apt-get install -y build-essential
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt
# Stage 2: Runtime
FROM python:3.10-slim AS runtime
COPY --from=builder /install /usr/local
WORKDIR /app
COPY . .
ENV PYTHONUNBUFFERED=1
EXPOSE 7860
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
11.2 Scaling Strategy
Horizontal Scaling:
- Multiple replicas behind load balancer
- Stateless design for easy scaling
- Shared model cache (optional)
Vertical Scaling:
- Increase CPU/memory per instance
- GPU acceleration for inference
- Larger model support
11.3 High Availability
Components:
- Load Balancer: Distribute traffic
- Health Checks: Automatic failover
- Auto-scaling: Based on CPU/memory
- Graceful Shutdown: Drain connections
12. Performance Optimization
12.1 Model Optimization
Techniques:
- Quantization: GGUF Q4 models (4-bit)
- Precision: FP16 for GPU inference
- Batching: Batch size optimization
- Caching: Model and result caching
- Lazy Loading: On-demand model loading
12.2 Memory Management
Strategies:
- Automatic garbage collection
- GPU memory clearing
- Model unloading (LRU)
- Memory pressure monitoring
Memory Limits:
- T4 Medium: 16GB GPU, 16GB RAM
- Max 2 models in memory
- Automatic eviction at 80% usage
12.3 Inference Optimization
T4-Specific Optimizations:
{
"max_models": 2,
"max_memory_mb": 14000,
"n_ctx": 8192,
"n_threads": 4,
"n_gpu_layers": 35,
"torch_dtype": "float16",
"device_map": "auto"
}
12.4 Caching Strategy
Cache Hierarchy:
- L1 - Model Cache: In-memory loaded models
- L2 - Result Cache: Generated summaries (LRU, 100 items)
- L3 - File Cache: Processed documents (disk)
- L4 - HF Cache: Downloaded models (disk)
12.5 Performance Metrics
Target Metrics:
- Model load time: < 10 seconds
- Summary generation: < 60 seconds (small), < 180 seconds (large)
- API response time: < 100ms (excluding generation)
- Memory usage: < 80% of available
- GPU utilization: > 70% during inference
13. Monitoring & Observability
13.1 Health Checks
Liveness Probe (/health/live):
{
"status": "alive",
"timestamp": "2025-12-05T17:23:52Z"
}
Readiness Probe (/health/ready):
{
"status": "ready",
"checks": {
"database": "ok",
"model_manager": "ok",
"file_storage": "ok"
},
"timestamp": "2025-12-05T17:23:52Z"
}
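A readiness response of this shape can be assembled by running each dependency check and reporting "ready" only when all of them pass. The sketch below is illustrative: `readiness_report`, the check callables, and the "failed" and "not_ready" values are assumptions, since the source only shows the healthy case.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict


def readiness_report(checks: Dict[str, Callable[[], bool]]) -> Dict[str, Any]:
    """Run each check; a False result or an exception marks that dependency as failed."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception:
            results[name] = "failed"
    ready = all(value == "ok" for value in results.values())
    return {
        "status": "ready" if ready else "not_ready",
        "checks": results,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

An endpoint built on this would normally return HTTP 200 when the status is "ready" and 503 otherwise, so that an orchestrator's readiness probe stops routing traffic to an instance whose dependencies are down.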
13.2 Metrics
Prometheus Metrics (/metrics):
# Model metrics
model_load_time_seconds{model_name="phi-3-gguf"} 8.5
model_inference_time_seconds{model_name="phi-3-gguf"} 45.2
model_memory_usage_bytes{model_name="phi-3-gguf"} 4294967296
# API metrics
http_requests_total{method="POST",endpoint="/generate_patient_summary"} 1234
http_request_duration_seconds{method="POST",endpoint="/generate_patient_summary"} 52.3
# System metrics
memory_usage_percent 65.2
gpu_memory_usage_percent 72.1
cpu_usage_percent 45.8
13.3 Logging
Structured Logging:
{
"timestamp": "2025-12-05T17:23:52Z",
"level": "INFO",
"logger": "ai_med_extract.agents.patient_summary_agent",
"message": "Generated patient summary",
"context": {
"job_id": "abc123",
"model_name": "phi-3-gguf",
"duration_seconds": 45.2,
"token_count": 2048
}
}
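Log entries in this format can be produced with the standard logging module and a custom formatter. The following is a sketch under the assumption that per-request fields are passed through `extra={"context": {...}}`; the class name `JsonFormatter` and that convention are illustrative, not confirmed by the source.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields shown above."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc)
            .strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Assumed convention: callers attach a dict via extra={"context": {...}}.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def configure_json_logging(level: int = logging.INFO) -> None:
    """Send all log output to stderr in JSON form."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=level, handlers=[handler])
```

Usage would look like `logger.info("Generated patient summary", extra={"context": {"job_id": "abc123"}})`. Context values should never contain unscrubbed PHI, since application logs are usually retained outside the audit database.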
Log Levels:
- DEBUG: Detailed diagnostic information
- INFO: General informational messages
- WARNING: Warning messages
- ERROR: Error messages
- CRITICAL: Critical failures
13.4 Audit Logging
HIPAA Audit Trail:
{
"timestamp": "2025-12-05T17:23:52Z",
"user_id": "user123",
"action": "PHI_ACCESS",
"resource_type": "patient_summary",
"resource_id": "patient456",
"phi_accessed": true,
"ip_address": "192.168.1.100",
"user_agent": "Mozilla/5.0...",
"request_data": {...},
"response_status": 200
}
14. Development Workflow
14.1 Local Development
Setup:
# Clone repository
git clone <repository-url>
cd HNTAI
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set environment variables
export DATABASE_URL="postgresql://user:pass@localhost:5432/hntai"
export SECRET_KEY="your-secret-key"
export HF_HOME="/tmp/huggingface"
# Run development server
cd services/ai-service/src
python -m ai_med_extract.app run_dev
14.2 Testing
Test Structure:
tests/
├── unit/
│   ├── test_agents.py
│   ├── test_model_manager.py
│   └── test_utils.py
├── integration/
│   ├── test_api.py
│   └── test_workflows.py
└── conftest.py
Running Tests:
# Unit tests
python -m pytest tests/unit/
# Integration tests
python -m pytest tests/integration/
# Coverage report
python -m pytest --cov=ai_med_extract tests/
14.3 Code Quality
Tools:
# Format code
black .
isort .
# Lint code
flake8 .
# Type checking
mypy services/ai-service/src/ai_med_extract/
14.4 Git Workflow
Branching Strategy:
- main: Production-ready code
- develop: Integration branch
- feature/*: Feature branches
- bugfix/*: Bug fix branches
- hotfix/*: Production hotfixes
Commit Convention:
<type>(<scope>): <subject>
<body>
<footer>
Types: feat, fix, docs, style, refactor, test, chore
15. Integration Patterns
15.1 External System Integration
Integration Points:
- EHR Systems: HL7, FHIR APIs
- Document Management: File uploads, cloud storage
- Authentication: OAuth2, SAML
- Monitoring: Prometheus, Grafana
- Logging: ELK Stack, CloudWatch
15.2 API Integration
Client Libraries (Planned):
- Python SDK
- JavaScript SDK
- REST API documentation (OpenAPI/Swagger)
Example Integration:
import time

import requests

# Upload document (the with-block closes the file handle after the request)
with open("document.pdf", "rb") as f:
    response = requests.post(
        "https://api.hntai.com/upload",
        files={"file": f},
        headers={"Authorization": "Bearer <token>"}
    )

# Generate patient summary
response = requests.post(
    "https://api.hntai.com/generate_patient_summary",
    json={
        "patient_data": {...},  # placeholder: supply the patient record here
        "model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
        "model_type": "gguf"
    },
    headers={"Authorization": "Bearer <token>"}
)
job_id = response.json()["job_id"]

# Poll for results every 5 seconds until the job completes
while True:
    response = requests.get(
        f"https://api.hntai.com/job/{job_id}",
        headers={"Authorization": "Bearer <token>"}
    )
    if response.json()["status"] == "completed":
        break
    time.sleep(5)
15.3 Webhook Support
Planned Feature: Webhook notifications for job completion
{
"event": "job.completed",
"job_id": "abc123",
"timestamp": "2025-12-05T17:23:52Z",
"data": {
"status": "completed",
"result": {...}
}
}
16. Scalability Considerations
16.1 Horizontal Scaling
Strategies:
- Stateless Design: No session state in application
- Load Balancing: Distribute requests across instances
- Shared Cache: Redis for distributed caching
- Message Queue: RabbitMQ/Kafka for async processing
16.2 Vertical Scaling
Resource Scaling:
- CPU: 2-8 cores per instance
- Memory: 8-32 GB per instance
- GPU: T4, V100, A100 for inference
16.3 Database Scaling
Strategies:
- Read Replicas: For audit log queries
- Partitioning: Time-based partitioning for logs
- Indexing: Optimize query performance
- Archiving: Move old logs to cold storage
16.4 Model Serving
Scaling Options:
- Model Replication: Same model on multiple instances
- Model Sharding: Different models on different instances
- Model Versioning: A/B testing with multiple versions
- Dedicated Inference: Separate inference service
17. Future Roadmap
17.1 Short-Term (3-6 months)
Enhanced Model Support:
- Support for Llama 3, Mistral models
- Fine-tuned medical models
- Multi-modal models (text + images)
Improved Performance:
- Model quantization (INT8, INT4)
- Batch inference support
- Streaming responses
Additional Features:
- Real-time collaboration
- Version control for summaries
- Template-based summaries
17.2 Medium-Term (6-12 months)
Advanced AI Capabilities:
- Multi-agent orchestration
- Retrieval-Augmented Generation (RAG)
- Knowledge graph integration
Enterprise Features:
- Multi-tenancy support
- Advanced RBAC
- SSO integration
- Compliance reporting
Platform Enhancements:
- Web UI for management
- Mobile app support
- Plugin architecture
17.3 Long-Term (12+ months)
AI/ML Advancements:
- Custom model training pipeline
- Federated learning support
- Explainable AI (XAI)
Ecosystem Integration:
- FHIR server integration
- HL7 v3 support
- DICOM image analysis
Global Expansion:
- Multi-language support
- Regional compliance (GDPR, etc.)
- Edge deployment
Appendix A: Configuration Reference
Environment Variables
| Variable | Description | Default | Required |
|---|---|---|---|
| DATABASE_URL | PostgreSQL connection string | - | No |
| SECRET_KEY | Application secret key | - | Yes |
| JWT_SECRET_KEY | JWT signing key | - | Yes |
| HF_HOME | Hugging Face cache directory | /tmp/huggingface | No |
| TORCH_HOME | PyTorch cache directory | /tmp/torch | No |
| WHISPER_CACHE | Whisper model cache | /tmp/whisper | No |
| HF_SPACES | Hugging Face Spaces mode | false | No |
| PRELOAD_GGUF | Preload GGUF models | false | No |
| MAX_NEW_TOKENS | Max output tokens | 8192 | No |
| MAX_INPUT_TOKENS | Max input tokens | 2048 | No |
Appendix B: API Reference
Complete Endpoint List
| Method | Endpoint | Description |
|---|---|---|
| GET | / | Root endpoint |
| GET | /health/live | Liveness probe |
| GET | /health/ready | Readiness probe |
| GET | /metrics | Prometheus metrics |
| POST | /upload | Upload document |
| POST | /transcribe | Transcribe audio |
| POST | /generate_patient_summary | Generate patient summary |
| POST | /api/generate_summary | Generate text summary |
| POST | /api/patient_summary_openvino | OpenVINO summary |
| POST | /extract_medical_data | Extract medical data |
| GET | /get_updated_medical_data | Get processed data |
| PUT | /update_medical_data | Update medical data |
| POST | /api/load_model | Load model |
| GET | /api/model_info | Get model info |
| POST | /api/switch_model | Switch model |
Appendix C: Troubleshooting Guide
Common Issues
Model Loading Failures
Symptom: Model fails to load
Causes:
- Insufficient memory
- Missing dependencies
- Network issues (download)
Solutions:
- Check memory availability
- Verify dependencies installed
- Check network connectivity
- Use fallback model
Token Limit Errors
Symptom: "Input exceeds token limit"
Causes:
- Input too long
- Model context window exceeded
Solutions:
- Reduce input size
- Use chunking strategy
- Switch to larger context model
Performance Issues
Symptom: Slow inference
Causes:
- CPU-only inference
- Large model size
- Memory pressure
Solutions:
- Enable GPU acceleration
- Use quantized models (GGUF)
- Reduce batch size
- Clear model cache
Appendix D: Glossary
| Term | Definition |
|---|---|
| PHI | Protected Health Information |
| HIPAA | Health Insurance Portability and Accountability Act |
| EHR | Electronic Health Record |
| FHIR | Fast Healthcare Interoperability Resources |
| HL7 | Health Level 7 (healthcare data standard) |
| GGUF | GPT-Generated Unified Format (quantized models) |
| OpenVINO | Open Visual Inference and Neural Network Optimization |
| T4 | NVIDIA Tesla T4 GPU |
| LRU | Least Recently Used (cache eviction) |
| SSE | Server-Sent Events |
| ASGI | Asynchronous Server Gateway Interface |
Document Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-12-05 | System | Initial comprehensive documentation |
End of Technical Architecture Documentation