# HNTAI - Comprehensive Technical Architecture Documentation

**Version:** 1.0

**Last Updated:** December 5, 2025

**Project:** Medical Data Extraction & AI Processing Platform

---

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [System Overview](#system-overview)
3. [Architecture Design](#architecture-design)
4. [Technology Stack](#technology-stack)
5. [Core Components](#core-components)
6. [AI/ML Architecture](#aiml-architecture)
7. [API Architecture](#api-architecture)
8. [Data Flow & Processing](#data-flow--processing)
9. [Database Design](#database-design)
10. [Security Architecture](#security-architecture)
11. [Deployment Architecture](#deployment-architecture)
12. [Performance Optimization](#performance-optimization)
13. [Monitoring & Observability](#monitoring--observability)
14. [Development Workflow](#development-workflow)
15. [Integration Patterns](#integration-patterns)
16. [Scalability Considerations](#scalability-considerations)
17. [Future Roadmap](#future-roadmap)

---

## 1. Executive Summary

HNTAI (Healthcare AI Text Analysis & Interpretation) is a production-ready, enterprise-grade medical AI platform designed for medical data extraction, processing, and analysis. The system provides HIPAA-compliant document processing, PHI scrubbing, and AI-powered patient summary generation with support for multiple AI model backends.

### Key Capabilities

- **Multi-format Document Processing**: PDF, DOCX, images, and audio transcription
- **HIPAA Compliance**: Automated PHI scrubbing with comprehensive audit logging
- **Multi-Model AI Support**: Transformers, OpenVINO, and GGUF models with automatic optimization
- **Scalable Architecture**: Kubernetes-ready with horizontal scaling capabilities
- **Production-Ready**: Health checks, metrics, structured logging, and error handling

### Target Deployment Environments

- **Hugging Face Spaces** (T4 Medium GPU)
- **Kubernetes Clusters** (On-premise or cloud)
- **Docker Containers** (Standalone or orchestrated)
- **Local Development** (CPU or GPU)

---

## 2. System Overview

### 2.1 Purpose & Scope

HNTAI is a medical AI platform that turns raw medical documents into actionable clinical insights. The system is designed to:

1. **Extract** structured medical data from unstructured documents
2. **Anonymize** protected health information (PHI) for compliance
3. **Summarize** patient records into comprehensive clinical assessments
4. **Process** multi-modal medical data (text, images, audio)

### 2.2 Design Principles

- **Simplicity**: Clean, maintainable codebase with essential features
- **Flexibility**: Support for multiple AI model types and backends
- **Security**: HIPAA-compliant with comprehensive audit trails
- **Performance**: Optimized for T4 GPU with intelligent caching
- **Reliability**: Robust error handling and automatic fallback mechanisms

### 2.3 High-Level Architecture

```mermaid
graph TB
    subgraph "Client Layer"
        A[Web Client]
        B[Mobile Client]
        C[API Client]
    end

    subgraph "API Gateway"
        D[FastAPI Application]
        E[Health Endpoints]
        F[Metrics Endpoint]
    end

    subgraph "Service Layer"
        G[Document Processing Service]
        H[PHI Scrubbing Service]
        I[Patient Summary Service]
        J[Model Management Service]
    end

    subgraph "AI/ML Layer"
        K[Unified Model Manager]
        L[Transformers Models]
        M[GGUF Models]
        N[OpenVINO Models]
        O[Whisper Audio Models]
    end

    subgraph "Data Layer"
        P[PostgreSQL - Audit Logs]
        Q[File Storage]
        R[Model Cache]
    end

    A --> D
    B --> D
    C --> D
    D --> E
    D --> F
    D --> G
    D --> H
    D --> I
    D --> J
    G --> K
    H --> K
    I --> K
    J --> K
    K --> L
    K --> M
    K --> N
    K --> O
    D --> P
    G --> Q
    K --> R
```

---

## 3. Architecture Design

### 3.1 Architectural Style

HNTAI follows a **Layered Monolithic Architecture** with clear separation of concerns:

1. **Presentation Layer**: FastAPI routes and endpoints
2. **Service Layer**: Business logic and orchestration
3. **Agent Layer**: Specialized AI agents for specific tasks
4. **Utility Layer**: Shared utilities and helpers
5. **Data Layer**: Database and file storage

### 3.2 Component Architecture

```mermaid
graph LR
    subgraph "FastAPI Application"
        A[routes_fastapi.py]
        B[app.py]
        C[main.py]
    end

    subgraph "Agents"
        D[patient_summary_agent.py]
        E[phi_scrubber.py]
        F[text_extractor.py]
        G[medical_data_extractor.py]
    end

    subgraph "Services"
        H[job_manager.py]
        I[request_queue.py]
        J[error_handler.py]
        K[sse_generator.py]
    end

    subgraph "Utils"
        L[unified_model_manager.py]
        M[model_config.py]
        N[robust_json_parser.py]
        O[memory_manager.py]
    end

    A --> D
    A --> E
    A --> F
    A --> G
    A --> H
    A --> I
    D --> L
    E --> L
    F --> L
    G --> L
    L --> M
    L --> O
```

### 3.3 Directory Structure

```
HNTAI/
├── services/
│   └── ai-service/
│       └── src/
│           └── ai_med_extract/
│               ├── agents/                  # AI agents for specific tasks
│               │   ├── patient_summary_agent.py
│               │   ├── phi_scrubber.py
│               │   ├── text_extractor.py
│               │   └── medical_data_extractor.py
│               ├── api/                     # FastAPI routes
│               │   └── routes_fastapi.py
│               ├── services/                # Business logic services
│               │   ├── job_manager.py
│               │   ├── request_queue.py
│               │   ├── error_handler.py
│               │   └── sse_generator.py
│               ├── utils/                   # Utilities and helpers
│               │   ├── unified_model_manager.py
│               │   ├── model_config.py
│               │   ├── robust_json_parser.py
│               │   ├── memory_manager.py
│               │   ├── openvino_summarizer_utils.py
│               │   └── patient_summary_utils.py
│               ├── app.py                   # FastAPI app factory
│               ├── main.py                  # Entry point
│               ├── health_endpoints.py      # Health checks
│               └── database_audit.py        # HIPAA audit logging
├── docs/                                    # Documentation
├── infra/                                   # Infrastructure configs
│   └── k8s/                                 # Kubernetes manifests
├── app.py                                   # HF Spaces entry point
├── Dockerfile                               # Multi-stage Docker build
├── Dockerfile.hf-spaces                     # HF Spaces optimized
├── .huggingface.yaml                        # HF Spaces config
├── models_config.json                       # Model configuration
├── requirements.txt                         # Python dependencies
└── README.md                                # Project documentation
```

---

## 4. Technology Stack

### 4.1 Core Technologies

| Category | Technology | Version | Purpose |
|----------|-----------|---------|---------|
| **Runtime** | Python | 3.10+ | Primary language |
| **Web Framework** | FastAPI | Latest | REST API framework |
| **ASGI Server** | Uvicorn | Latest | Production server |
| **AI/ML Framework** | PyTorch | 2.x | Deep learning |
| **Transformers** | Hugging Face Transformers | Latest | Model loading |
| **GGUF Support** | llama-cpp-python | Latest | Quantized models |
| **OpenVINO** | optimum-intel | Latest | Intel optimization |
| **Audio Processing** | Whisper | Latest | Speech-to-text |

### 4.2 Supporting Technologies

| Category | Technology | Purpose |
|----------|-----------|---------|
| **Database** | PostgreSQL 13+ | Audit logs (optional) |
| **Caching** | In-memory LRU | Model caching |
| **Document Processing** | PyPDF2, python-docx | PDF/DOCX parsing |
| **OCR** | Tesseract | Image text extraction |
| **Audio** | FFmpeg | Audio processing |
| **Containerization** | Docker | Deployment |
| **Orchestration** | Kubernetes | Scaling |
| **Monitoring** | Prometheus | Metrics |

### 4.3 Development Tools

- **Code Quality**: Black, isort, flake8, mypy
- **Testing**: pytest
- **Version Control**: Git
- **CI/CD**: GitHub Actions (potential)
- **Documentation**: Markdown, Mermaid diagrams

---

## 5. Core Components

### 5.1 FastAPI Application (`app.py`)

**Purpose**: Application factory and initialization

**Key Responsibilities**:

- Create and configure FastAPI application
- Initialize agents and services
- Register routes and middleware
- Configure CORS and security

**Key Functions**:

```python
from fastapi import FastAPI

# Signatures only; bodies omitted
def create_app(initialize: bool = True) -> FastAPI: ...
def initialize_agents(app: FastAPI, preload_small_models: bool = False): ...
def run_dev(): ...  # Development server
```

### 5.2 API Routes (`routes_fastapi.py`)

**Purpose**: RESTful API endpoints

**Endpoint Categories**:

#### Health & Monitoring

- `GET /health/live` - Liveness probe
- `GET /health/ready` - Readiness probe
- `GET /metrics` - Prometheus metrics

#### Document Processing

- `POST /upload` - Upload and process documents
- `POST /transcribe` - Audio transcription
- `GET /get_updated_medical_data` - Retrieve processed data
- `PUT /update_medical_data` - Update medical records

#### AI Processing

- `POST /generate_patient_summary` - Generate patient summaries
- `POST /api/generate_summary` - Text summarization
- `POST /api/patient_summary_openvino` - OpenVINO summaries
- `POST /extract_medical_data` - Extract structured data

#### Model Management

- `POST /api/load_model` - Load specific models
- `GET /api/model_info` - Model information
- `POST /api/switch_model` - Switch models

### 5.3 Agents

#### 5.3.1 Patient Summary Agent (`patient_summary_agent.py`)

**Purpose**: Generate comprehensive patient summaries

**Key Features**:

- Dynamic model configuration
- Multi-section summary generation
- Chronological narrative building
- Clinical guideline evaluation
- Fallback text-based summarization

**Core Methods**:

```python
from typing import Dict, List, Union

# Signatures only; bodies omitted
def configure_model(model_name: str, model_type: str): ...
def generate_clinical_summary(patient_data: Union[List[str], Dict]): ...
def generate_patient_summary(patient_data: Union[List[str], Dict]): ...
def build_chronological_narrative(patient_data: dict): ...
def format_clinical_output(raw_summary: str, patient_data: dict): ...
```

#### 5.3.2 PHI Scrubber (`phi_scrubber.py`)

**Purpose**: Remove protected health information

**Scrubbing Capabilities**:

- Patient names
- Medical record numbers (MRN)
- Dates of birth
- Phone numbers
- Email addresses
- Social Security Numbers
- Addresses

**Compliance**: HIPAA-compliant with audit logging

#### 5.3.3 Text Extractor (`text_extractor.py`)

**Purpose**: Extract text from various document formats

**Supported Formats**:

- PDF documents
- DOCX files
- Images (via OCR)
- Plain text

#### 5.3.4 Medical Data Extractor (`medical_data_extractor.py`)

**Purpose**: Extract structured medical data from text

**Extraction Targets**:

- Diagnoses
- Medications
- Procedures
- Lab results
- Vital signs
- Allergies

### 5.4 Services

#### 5.4.1 Job Manager (`job_manager.py`)

**Purpose**: Manage long-running jobs

**Features**:

- Job lifecycle management
- Progress tracking
- Status updates
- Result caching
- Cleanup of completed jobs

#### 5.4.2 Request Queue (`request_queue.py`)

**Purpose**: Queue and prioritize requests

**Features**:

- Request queuing
- Priority handling
- Concurrency control
- Timeout management

#### 5.4.3 Error Handler (`error_handler.py`)

**Purpose**: Centralized error handling

**Features**:

- Error categorization
- Contextual logging
- Job error updates
- Graceful degradation

#### 5.4.4 SSE Generator (`sse_generator.py`)

**Purpose**: Server-Sent Events for real-time updates

**Features**:

- Progress streaming
- Status updates
- Error notifications
- Completion events

---

## 6. AI/ML Architecture

### 6.1 Unified Model Manager

**File**: `unified_model_manager.py`

**Purpose**: Single interface for all AI model types

**Architecture**:

```mermaid
classDiagram
    class BaseModel {
        <<abstract>>
        +name: str
        +model_type: str
        +status: ModelStatus
        +load()
        +generate(prompt, config)*
        +unload()
    }

    class TransformersModel {
        +_model: Pipeline
        +_load_implementation()
        +generate(prompt, config)
    }

    class GGUFModel {
        +_model: Llama
        +filename: str
        +_extract_filename()
        +_load_implementation()
        +generate(prompt, config)
    }

    class OpenVINOModel {
        +_model: OVModelForCausalLM
        +_tokenizer: AutoTokenizer
        +_load_implementation()
        +generate(prompt, config)
    }

    class FallbackModel {
        +_load_implementation()
        +generate(prompt, config)
    }

    class UnifiedModelManager {
        +max_models: int
        +max_memory_mb: int
        +get_model(name, type)
        +generate_text(name, prompt)
        +cleanup()
    }

    BaseModel <|-- TransformersModel
    BaseModel <|-- GGUFModel
    BaseModel <|-- OpenVINOModel
    BaseModel <|-- FallbackModel
    UnifiedModelManager --> BaseModel
```

### 6.2 Model Types

#### 6.2.1 Transformers Models

**Backend**: Hugging Face Transformers

**Device**: GPU (CUDA) or CPU

**Use Cases**: General text generation, summarization

**Supported Models**:

- `microsoft/Phi-3-mini-4k-instruct`
- `facebook/bart-large-cnn` (deprecated)
- `google/flan-t5-large`

**Configuration**:

```python
{
    "model_name": "microsoft/Phi-3-mini-4k-instruct",
    "model_type": "text-generation",
    "device_map": "auto",
    "torch_dtype": "float16"
}
```

#### 6.2.2 GGUF Models

**Backend**: llama-cpp-python

**Device**: CPU or GPU (via Metal/CUDA)

**Use Cases**: Efficient inference with quantized models

**Supported Models**:

- `microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf` (PRIMARY)

**Configuration**:

```python
{
    "model_path": "path/to/model.gguf",
    "n_ctx": 8192,
    "n_threads": 4,
    "n_gpu_layers": 35  # GPU acceleration
}
```

#### 6.2.3 OpenVINO Models

**Backend**: Intel OpenVINO

**Device**: CPU (Intel optimized) or GPU

**Use Cases**: Production deployment
on Intel hardware

**Supported Models**:

- `OpenVINO/Phi-3-mini-4k-instruct-fp16-ov`

**Configuration**:

```python
{
    "model_path": "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov",
    "device": "GPU"  # falls back to "CPU" when no GPU is available
}
```

### 6.3 Model Selection Strategy

```mermaid
flowchart TD
    A[Request with model_name] --> B{Model specified?}
    B -->|Yes| C{Model type?}
    B -->|No| D[Use default: Phi-3 GGUF]
    C -->|GGUF| E[Load GGUF Model]
    C -->|OpenVINO| F[Load OpenVINO Model]
    C -->|Transformers| G[Load Transformers Model]
    C -->|Unknown| H[Auto-detect type]
    E --> I{Load successful?}
    F --> I
    G --> I
    H --> I
    D --> I
    I -->|Yes| J[Generate with model]
    I -->|No| K[Try fallback model]
    K --> L{Fallback successful?}
    L -->|Yes| J
    L -->|No| M[Use text-based fallback]
```

### 6.4 Model Configuration

**File**: `models_config.json`

```json
{
  "patient_summary_models": [
    {
      "name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
      "type": "gguf",
      "is_active": true,
      "cached": true,
      "description": "Phi-3 Mini GGUF Q4 quantized - PRIMARY MODEL",
      "use_case": "Fast patient summary generation with CPU/GPU",
      "repo_id": "microsoft/Phi-3-mini-4k-instruct-gguf",
      "filename": "Phi-3-mini-4k-instruct-q4.gguf"
    }
  ],
  "runtime_behavior": {
    "allow_runtime_downloads": true,
    "cache_runtime_downloads": true,
    "fallback_to_cached": true
  }
}
```

### 6.5 Token Management

**Token Limit Handling**:

- Automatic token counting (heuristic: ~4 chars/token)
- Pre-generation validation
- Token limit error detection
- Graceful degradation

**Token Limits by Model**:

- Phi-3 models: 4096 tokens (context window, shared by prompt and output)
- BART models: 1024 tokens
- T5 models: 512 tokens

### 6.6 Generation Configuration

```python
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    max_tokens: int = 8192      # Upper bound on output tokens (effectively capped by the model's context window)
    min_tokens: int = 50        # Minimum output tokens
    temperature: float = 0.3    # Low temperature for consistent medical output
    top_p: float = 0.9          # Nucleus sampling
    timeout: float = 180.0      # Generation timeout in seconds (tuned for T4)
    stream: bool = False        # Streaming support
```

### 6.7 T4 GPU Optimizations

**Hardware Target**: NVIDIA T4 Medium (16GB GPU, 16GB RAM)

**Optimizations**:

1. **Memory Management**:
   - Max 2 models in memory
   - Automatic model unloading
   - GPU memory clearing
   - Garbage collection
2. **Model Loading**:
   - Lazy loading (on-demand)
   - Intelligent caching
   - LRU eviction policy
3. **Inference**:
   - FP16 precision
   - Batch size: 1
   - Configured context (`n_ctx`): 8192 tokens
   - GPU layer offloading (GGUF)

---

## 7. API Architecture

### 7.1 RESTful Design

**Principles**:

- Resource-oriented URLs
- HTTP methods for CRUD operations
- JSON request/response format
- Stateless communication
- Proper HTTP status codes

### 7.2 Request/Response Flow

```mermaid
sequenceDiagram
    participant C as Client
    participant A as API Gateway
    participant S as Service Layer
    participant M as Model Manager
    participant D as Database

    C->>A: POST /generate_patient_summary
    A->>A: Validate request
    A->>S: Create job
    S->>D: Log job creation
    A-->>C: 202 Accepted (job_id)
    S->>M: Load model
    M->>M: Check cache
    M->>M: Load if needed
    M-->>S: Model ready
    S->>M: Generate summary
    M->>M: Process prompt
    M-->>S: Generated text
    S->>D: Log completion
    S->>A: Update job status
    A-->>C: SSE: Progress updates
    C->>A: GET /job/{job_id}
    A->>S: Get job status
    S->>D: Retrieve job
    S-->>A: Job result
    A-->>C: 200 OK (result)
```

### 7.3 Authentication & Authorization

**Current State**: Basic API key authentication (optional)

**Planned Enhancements**:

- JWT-based authentication
- Role-based access control (RBAC)
- OAuth2 integration
- API rate limiting

### 7.4 Error Handling

**Error Response Format**:

```json
{
  "error": {
    "code": "MODEL_LOAD_FAILED",
    "message": "Failed to load model: microsoft/Phi-3-mini-4k-instruct",
    "details": {
      "model_name": "microsoft/Phi-3-mini-4k-instruct",
      "error_type": "initialization_error",
      "timestamp": "2025-12-05T17:23:52Z"
    }
  }
}
```

**HTTP Status Codes**:

- `200 OK` - Successful request
- `202 Accepted` - Job created
- `400 Bad Request` - Invalid input
- `404 Not Found` - Resource not
found
- `500 Internal Server Error` - Server error
- `503 Service Unavailable` - Service degraded

### 7.5 Rate Limiting

**Strategy**: Token bucket algorithm

**Limits**:

- 100 requests/minute per IP
- 1000 requests/hour per API key
- Burst allowance: 20 requests

---

## 8. Data Flow & Processing

### 8.1 Document Processing Pipeline

```mermaid
flowchart LR
    A[Upload Document] --> B{File Type?}
    B -->|PDF| C[PDF Parser]
    B -->|DOCX| D[DOCX Parser]
    B -->|Image| E[OCR Engine]
    B -->|Audio| F[Whisper Transcription]
    C --> G[Text Extraction]
    D --> G
    E --> G
    F --> G
    G --> H[PHI Scrubbing]
    H --> I[Medical Data Extraction]
    I --> J[Store Processed Data]
    J --> K[Return Results]
```

### 8.2 Patient Summary Generation Flow

```mermaid
flowchart TD
    A[Patient Data Input] --> B[Parse EHR Data]
    B --> C[Convert to Plain Text]
    C --> D{Data Size Check}
    D -->|Small| E[Single-pass Generation]
    D -->|Large| F[Chunking Strategy]
    F --> G[Chunk by Date/Size]
    G --> H[Process Chunks in Parallel]
    H --> I[Combine Chunk Summaries]
    E --> J[Generate with Model]
    I --> J
    J --> K[Format Clinical Output]
    K --> L[Evaluate Against Guidelines]
    L --> M[Return Summary]
```

### 8.3 Data Transformation

**Input Formats**:

- Raw EHR JSON
- HL7 FHIR resources
- Plain text documents
- Scanned images
- Audio recordings

**Output Formats**:

- Structured JSON
- Clinical summary (Markdown)
- FHIR-compliant resources
- Audit logs

### 8.4 Caching Strategy

**Multi-Level Caching**:

1. **Model Cache**: Loaded models in memory
2. **Result Cache**: Generated summaries (LRU)
3. **File Cache**: Processed documents
4. **Hugging Face Cache**: Downloaded models

**Cache Invalidation**:

- Time-based expiration
- Manual invalidation
- Memory pressure-based eviction

---

## 9. Database Design

### 9.1 Database Schema

**Primary Database**: PostgreSQL (optional, for audit logs)

#### Audit Logs Table

```sql
CREATE TABLE audit_logs (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMP NOT NULL DEFAULT NOW(),
    user_id VARCHAR(255),
    action VARCHAR(100) NOT NULL,
    resource_type VARCHAR(100),
    resource_id VARCHAR(255),
    phi_accessed BOOLEAN DEFAULT FALSE,
    ip_address INET,
    user_agent TEXT,
    request_data JSONB,
    response_status INTEGER,
    error_message TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_audit_timestamp ON audit_logs(timestamp);
CREATE INDEX idx_audit_user ON audit_logs(user_id);
CREATE INDEX idx_audit_action ON audit_logs(action);
CREATE INDEX idx_audit_phi ON audit_logs(phi_accessed);
```

### 9.2 Data Models

**Patient Data Model** (In-memory):

```python
{
    "patient_id": "string",
    "demographics": {
        "name": "string",
        "dob": "date",
        "gender": "string",
        "mrn": "string"
    },
    "visits": [
        {
            "visit_id": "string",
            "date": "datetime",
            "chief_complaint": "string",
            "diagnoses": ["string"],
            "medications": ["string"],
            "procedures": ["string"],
            "vitals": {},
            "labs": []
        }
    ]
}
```

### 9.3 File Storage

**Storage Strategy**: Local filesystem or cloud storage

**Directory Structure**:

```
/data/
├── uploads/      # Uploaded documents
├── processed/    # Processed documents
├── cache/        # Temporary cache
└── models/       # Model files
```

---

## 10. Security Architecture

### 10.1 HIPAA Compliance

**Requirements Met**:

1. **Access Controls**: Authentication and authorization
2. **Audit Logging**: Comprehensive activity logs
3. **Data Encryption**: In-transit and at-rest
4. **PHI Scrubbing**: Automated anonymization
5. **Secure Communication**: HTTPS/TLS

### 10.2 PHI Scrubbing

**Scrubbing Patterns**:

```python
PATTERNS = {
    "name": r'\b[A-Z][a-z]+ [A-Z][a-z]+\b',
    "mrn": r'\bMRN[:\s]*\d{6,10}\b',
    "dob": r'\b\d{1,2}/\d{1,2}/\d{2,4}\b',
    "phone": r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    # TLD class is [A-Za-z]; a "|" inside a character class would match a literal pipe
    "email": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    "ssn": r'\b\d{3}-\d{2}-\d{4}\b'
}
```

### 10.3 Container Security

**Security Measures**:

- Non-root user execution
- Read-only root filesystem
- Resource limits (CPU, memory)
- Network policies
- Secrets management
- Minimal base images

### 10.4 API Security

**Security Headers**:

```python
{
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "X-XSS-Protection": "1; mode=block",
    "Strict-Transport-Security": "max-age=31536000"
}
```

---

## 11. Deployment Architecture

### 11.1 Deployment Options

#### 11.1.1 Hugging Face Spaces

**Configuration**: `.huggingface.yaml`

```yaml
runtime: docker
sdk: docker
python_version: "3.10"

build:
  dockerfile: Dockerfile.hf-spaces
  cache: true

hardware:
  gpu: t4-medium  # 16GB GPU RAM, 16GB System RAM

env:
  - SPACE_ID=$SPACE_ID
  - HF_HOME=/app/.cache/huggingface
  - TORCH_HOME=/app/.cache/torch
  - MODEL_CACHE_DIR=/app/models
  - PRELOAD_GGUF=true
  - HF_SPACES=true
```

**Optimizations**:

- Pre-cached models in Docker image
- Lazy model loading
- Memory-efficient inference
- Automatic GPU detection

#### 11.1.2 Kubernetes

**Deployment Manifest**:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hntai-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hntai
  template:
    metadata:
      labels:
        app: hntai
    spec:
      containers:
        - name: hntai
          image: hntai:latest
          ports:
            - containerPort: 7860
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
          livenessProbe:
            httpGet:
              path: /health/live
              port: 7860
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 7860
            initialDelaySeconds: 10
            periodSeconds: 5
```

#### 11.1.3 Docker

**Multi-Stage Dockerfile**:

```dockerfile
# Stage 1: Builder
FROM python:3.10-slim AS builder
RUN apt-get update && apt-get install -y build-essential
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt

# Stage 2: Runtime
FROM python:3.10-slim AS runtime
COPY --from=builder /install /usr/local
WORKDIR /app
COPY . .
ENV PYTHONUNBUFFERED=1
EXPOSE 7860
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
```

### 11.2 Scaling Strategy

**Horizontal Scaling**:

- Multiple replicas behind load balancer
- Stateless design for easy scaling
- Shared model cache (optional)

**Vertical Scaling**:

- Increase CPU/memory per instance
- GPU acceleration for inference
- Larger model support

### 11.3 High Availability

**Components**:

1. **Load Balancer**: Distribute traffic
2. **Health Checks**: Automatic failover
3. **Auto-scaling**: Based on CPU/memory
4. **Graceful Shutdown**: Drain connections

---

## 12. Performance Optimization

### 12.1 Model Optimization

**Techniques**:

1. **Quantization**: GGUF Q4 models (4-bit)
2. **Precision**: FP16 for GPU inference
3. **Batching**: Batch size optimization
4. **Caching**: Model and result caching
5. **Lazy Loading**: On-demand model loading

### 12.2 Memory Management

**Strategies**:

- Automatic garbage collection
- GPU memory clearing
- Model unloading (LRU)
- Memory pressure monitoring

**Memory Limits**:

- T4 Medium: 16GB GPU, 16GB RAM
- Max 2 models in memory
- Automatic eviction at 80% usage

### 12.3 Inference Optimization

**T4-Specific Optimizations**:

```python
{
    "max_models": 2,
    "max_memory_mb": 14000,
    "n_ctx": 8192,
    "n_threads": 4,
    "n_gpu_layers": 35,
    "torch_dtype": "float16",
    "device_map": "auto"
}
```

### 12.4 Caching Strategy

**Cache Hierarchy**:

1. **L1 - Model Cache**: In-memory loaded models
2. **L2 - Result Cache**: Generated summaries (LRU, 100 items)
3. **L3 - File Cache**: Processed documents (disk)
4. **L4 - HF Cache**: Downloaded models (disk)

### 12.5 Performance Metrics

**Target Metrics**:

- Model load time: < 10 seconds
- Summary generation: < 60 seconds (small), < 180 seconds (large)
- API response time: < 100ms (excluding generation)
- Memory usage: < 80% of available
- GPU utilization: > 70% during inference

---

## 13. Monitoring & Observability

### 13.1 Health Checks

**Liveness Probe** (`/health/live`):

```python
{
    "status": "alive",
    "timestamp": "2025-12-05T17:23:52Z"
}
```

**Readiness Probe** (`/health/ready`):

```python
{
    "status": "ready",
    "checks": {
        "database": "ok",
        "model_manager": "ok",
        "file_storage": "ok"
    },
    "timestamp": "2025-12-05T17:23:52Z"
}
```

### 13.2 Metrics

**Prometheus Metrics** (`/metrics`):

```
# Model metrics
model_load_time_seconds{model_name="phi-3-gguf"} 8.5
model_inference_time_seconds{model_name="phi-3-gguf"} 45.2
model_memory_usage_bytes{model_name="phi-3-gguf"} 4294967296

# API metrics
http_requests_total{method="POST",endpoint="/generate_patient_summary"} 1234
http_request_duration_seconds{method="POST",endpoint="/generate_patient_summary"} 52.3

# System metrics
memory_usage_percent 65.2
gpu_memory_usage_percent 72.1
cpu_usage_percent 45.8
```

### 13.3 Logging

**Structured Logging**:

```python
{
    "timestamp": "2025-12-05T17:23:52Z",
    "level": "INFO",
    "logger": "ai_med_extract.agents.patient_summary_agent",
    "message": "Generated patient summary",
    "context": {
        "job_id": "abc123",
        "model_name": "phi-3-gguf",
        "duration_seconds": 45.2,
        "token_count": 2048
    }
}
```

**Log Levels**:

- `DEBUG`: Detailed diagnostic information
- `INFO`: General informational messages
- `WARNING`: Warning messages
- `ERROR`: Error messages
- `CRITICAL`: Critical failures

### 13.4 Audit Logging

**HIPAA Audit Trail**:

```python
{
    "timestamp": "2025-12-05T17:23:52Z",
    "user_id": "user123",
    "action": "PHI_ACCESS",
    "resource_type": "patient_summary",
    "resource_id": "patient456",
    "phi_accessed": true,
    "ip_address": "192.168.1.100",
    "user_agent": "Mozilla/5.0...",
    "request_data": {...},
    "response_status": 200
}
```

---

## 14. Development Workflow

### 14.1 Local Development

**Setup**:

```bash
# Clone repository
git clone <repository-url>
cd HNTAI

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export DATABASE_URL="postgresql://user:pass@localhost:5432/hntai"
export SECRET_KEY="your-secret-key"
export HF_HOME="/tmp/huggingface"

# Run development server
cd services/ai-service/src
python -m ai_med_extract.app run_dev
```

### 14.2 Testing

**Test Structure**:

```
tests/
├── unit/
│   ├── test_agents.py
│   ├── test_model_manager.py
│   └── test_utils.py
├── integration/
│   ├── test_api.py
│   └── test_workflows.py
└── conftest.py
```

**Running Tests**:

```bash
# Unit tests
python -m pytest tests/unit/

# Integration tests
python -m pytest tests/integration/

# Coverage report
python -m pytest --cov=ai_med_extract tests/
```

### 14.3 Code Quality

**Tools**:

```bash
# Format code
black .
isort .

# Lint code
flake8 .

# Type checking
mypy services/ai-service/src/ai_med_extract/
```

### 14.4 Git Workflow

**Branching Strategy**:

- `main`: Production-ready code
- `develop`: Integration branch
- `feature/*`: Feature branches
- `bugfix/*`: Bug fix branches
- `hotfix/*`: Production hotfixes

**Commit Convention**: ``` ():