
HNTAI - Comprehensive Technical Architecture Documentation

Version: 1.0
Last Updated: December 5, 2025
Project: Medical Data Extraction & AI Processing Platform


Table of Contents

  1. Executive Summary
  2. System Overview
  3. Architecture Design
  4. Technology Stack
  5. Core Components
  6. AI/ML Architecture
  7. API Architecture
  8. Data Flow & Processing
  9. Database Design
  10. Security Architecture
  11. Deployment Architecture
  12. Performance Optimization
  13. Monitoring & Observability
  14. Development Workflow
  15. Integration Patterns
  16. Scalability Considerations
  17. Future Roadmap

1. Executive Summary

HNTAI (Healthcare AI Text Analysis & Interpretation) is a medical AI platform for extracting, processing, and analyzing medical data. It provides HIPAA-compliant document processing, PHI scrubbing, and AI-powered patient summary generation, with support for multiple AI model backends (Transformers, OpenVINO, and GGUF).

Key Capabilities

  • Multi-format Document Processing: PDF, DOCX, images, and audio transcription
  • HIPAA Compliance: Automated PHI scrubbing with comprehensive audit logging
  • Multi-Model AI Support: Transformers, OpenVINO, and GGUF models with automatic optimization
  • Scalable Architecture: Kubernetes-ready with horizontal scaling capabilities
  • Production-Ready: Health checks, metrics, structured logging, and error handling

Target Deployment Environments

  • Hugging Face Spaces (T4 Medium GPU)
  • Kubernetes Clusters (On-premise or cloud)
  • Docker Containers (Standalone or orchestrated)
  • Local Development (CPU or GPU)

2. System Overview

2.1 Purpose & Scope

HNTAI serves as a comprehensive medical AI platform that bridges the gap between raw medical documents and actionable clinical insights. The system is designed to:

  1. Extract structured medical data from unstructured documents
  2. Anonymize protected health information (PHI) for compliance
  3. Summarize patient records into comprehensive clinical assessments
  4. Process multi-modal medical data (text, images, audio)

2.2 Design Principles

  • Simplicity: Clean, maintainable codebase with essential features
  • Flexibility: Support for multiple AI model types and backends
  • Security: HIPAA-compliant with comprehensive audit trails
  • Performance: Optimized for T4 GPU with intelligent caching
  • Reliability: Robust error handling and automatic fallback mechanisms

2.3 High-Level Architecture

graph TB
    subgraph "Client Layer"
        A[Web Client]
        B[Mobile Client]
        C[API Client]
    end
    
    subgraph "API Gateway"
        D[FastAPI Application]
        E[Health Endpoints]
        F[Metrics Endpoint]
    end
    
    subgraph "Service Layer"
        G[Document Processing Service]
        H[PHI Scrubbing Service]
        I[Patient Summary Service]
        J[Model Management Service]
    end
    
    subgraph "AI/ML Layer"
        K[Unified Model Manager]
        L[Transformers Models]
        M[GGUF Models]
        N[OpenVINO Models]
        O[Whisper Audio Models]
    end
    
    subgraph "Data Layer"
        P[PostgreSQL - Audit Logs]
        Q[File Storage]
        R[Model Cache]
    end
    
    A --> D
    B --> D
    C --> D
    D --> E
    D --> F
    D --> G
    D --> H
    D --> I
    D --> J
    G --> K
    H --> K
    I --> K
    J --> K
    K --> L
    K --> M
    K --> N
    K --> O
    D --> P
    G --> Q
    K --> R

3. Architecture Design

3.1 Architectural Style

HNTAI follows a Layered Monolithic Architecture with clear separation of concerns:

  1. Presentation Layer: FastAPI routes and endpoints
  2. Service Layer: Business logic and orchestration
  3. Agent Layer: Specialized AI agents for specific tasks
  4. Utility Layer: Shared utilities and helpers
  5. Data Layer: Database and file storage

3.2 Component Architecture

graph LR
    subgraph "FastAPI Application"
        A[routes_fastapi.py]
        B[app.py]
        C[main.py]
    end
    
    subgraph "Agents"
        D[patient_summary_agent.py]
        E[phi_scrubber.py]
        F[text_extractor.py]
        G[medical_data_extractor.py]
    end
    
    subgraph "Services"
        H[job_manager.py]
        I[request_queue.py]
        J[error_handler.py]
        K[sse_generator.py]
    end
    
    subgraph "Utils"
        L[unified_model_manager.py]
        M[model_config.py]
        N[robust_json_parser.py]
        O[memory_manager.py]
    end
    
    A --> D
    A --> E
    A --> F
    A --> G
    A --> H
    A --> I
    D --> L
    E --> L
    F --> L
    G --> L
    L --> M
    L --> O

3.3 Directory Structure

HNTAI/
├── services/
│   └── ai-service/
│       └── src/
│           └── ai_med_extract/
│               ├── agents/              # AI agents for specific tasks
│               │   ├── patient_summary_agent.py
│               │   ├── phi_scrubber.py
│               │   ├── text_extractor.py
│               │   └── medical_data_extractor.py
│               ├── api/                 # FastAPI routes
│               │   └── routes_fastapi.py
│               ├── services/            # Business logic services
│               │   ├── job_manager.py
│               │   ├── request_queue.py
│               │   ├── error_handler.py
│               │   └── sse_generator.py
│               ├── utils/               # Utilities and helpers
│               │   ├── unified_model_manager.py
│               │   ├── model_config.py
│               │   ├── robust_json_parser.py
│               │   ├── memory_manager.py
│               │   ├── openvino_summarizer_utils.py
│               │   └── patient_summary_utils.py
│               ├── app.py               # FastAPI app factory
│               ├── main.py              # Entry point
│               ├── health_endpoints.py  # Health checks
│               └── database_audit.py    # HIPAA audit logging
├── docs/                                # Documentation
├── infra/                               # Infrastructure configs
│   └── k8s/                             # Kubernetes manifests
├── app.py                               # HF Spaces entry point
├── Dockerfile                           # Multi-stage Docker build
├── Dockerfile.hf-spaces                 # HF Spaces optimized
├── .huggingface.yaml                    # HF Spaces config
├── models_config.json                   # Model configuration
├── requirements.txt                     # Python dependencies
└── README.md                            # Project documentation

4. Technology Stack

4.1 Core Technologies

| Category | Technology | Version | Purpose |
|---|---|---|---|
| Runtime | Python | 3.10+ | Primary language |
| Web Framework | FastAPI | Latest | REST API framework |
| ASGI Server | Uvicorn | Latest | Production server |
| AI/ML Framework | PyTorch | 2.x | Deep learning |
| Transformers | Hugging Face Transformers | Latest | Model loading |
| GGUF Support | llama-cpp-python | Latest | Quantized models |
| OpenVINO | optimum-intel | Latest | Intel optimization |
| Audio Processing | Whisper | Latest | Speech-to-text |

4.2 Supporting Technologies

| Category | Technology | Purpose |
|---|---|---|
| Database | PostgreSQL 13+ | Audit logs (optional) |
| Caching | In-memory LRU | Model caching |
| Document Processing | PyPDF2, python-docx | PDF/DOCX parsing |
| OCR | Tesseract | Image text extraction |
| Audio | FFmpeg | Audio processing |
| Containerization | Docker | Deployment |
| Orchestration | Kubernetes | Scaling |
| Monitoring | Prometheus | Metrics |

4.3 Development Tools

  • Code Quality: Black, isort, flake8, mypy
  • Testing: pytest
  • Version Control: Git
  • CI/CD: GitHub Actions (planned, not yet configured)
  • Documentation: Markdown, Mermaid diagrams

5. Core Components

5.1 FastAPI Application (app.py)

Purpose: Application factory and initialization

Key Responsibilities:

  • Create and configure FastAPI application
  • Initialize agents and services
  • Register routes and middleware
  • Configure CORS and security

Key Functions:

def create_app(initialize: bool = True) -> FastAPI
def initialize_agents(app: FastAPI, preload_small_models: bool = False)
def run_dev()  # Development server

5.2 API Routes (routes_fastapi.py)

Purpose: RESTful API endpoints

Endpoint Categories:

Health & Monitoring

  • GET /health/live - Liveness probe
  • GET /health/ready - Readiness probe
  • GET /metrics - Prometheus metrics

Document Processing

  • POST /upload - Upload and process documents
  • POST /transcribe - Audio transcription
  • GET /get_updated_medical_data - Retrieve processed data
  • PUT /update_medical_data - Update medical records

AI Processing

  • POST /generate_patient_summary - Generate patient summaries
  • POST /api/generate_summary - Text summarization
  • POST /api/patient_summary_openvino - OpenVINO summaries
  • POST /extract_medical_data - Extract structured data

Model Management

  • POST /api/load_model - Load specific models
  • GET /api/model_info - Model information
  • POST /api/switch_model - Switch models

5.3 Agents

5.3.1 Patient Summary Agent (patient_summary_agent.py)

Purpose: Generate comprehensive patient summaries

Key Features:

  • Dynamic model configuration
  • Multi-section summary generation
  • Chronological narrative building
  • Clinical guideline evaluation
  • Fallback text-based summarization

Core Methods:

def configure_model(model_name: str, model_type: str)
def generate_clinical_summary(patient_data: Union[List[str], Dict])
def generate_patient_summary(patient_data: Union[List[str], Dict])
def build_chronological_narrative(patient_data: dict)
def format_clinical_output(raw_summary: str, patient_data: dict)

5.3.2 PHI Scrubber (phi_scrubber.py)

Purpose: Remove protected health information

Scrubbing Capabilities:

  • Patient names
  • Medical record numbers (MRN)
  • Dates of birth
  • Phone numbers
  • Email addresses
  • Social Security Numbers
  • Addresses

Compliance: HIPAA-compliant with audit logging

5.3.3 Text Extractor (text_extractor.py)

Purpose: Extract text from various document formats

Supported Formats:

  • PDF documents
  • DOCX files
  • Images (via OCR)
  • Plain text

5.3.4 Medical Data Extractor (medical_data_extractor.py)

Purpose: Extract structured medical data from text

Extraction Targets:

  • Diagnoses
  • Medications
  • Procedures
  • Lab results
  • Vital signs
  • Allergies

5.4 Services

5.4.1 Job Manager (job_manager.py)

Purpose: Manage long-running jobs

Features:

  • Job lifecycle management
  • Progress tracking
  • Status updates
  • Result caching
  • Cleanup of completed jobs
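The lifecycle features listed above can be pictured with a small in-memory sketch. This is only an illustration: `InMemoryJobManager`, `Job`, `JobStatus`, and the one-hour retention default are hypothetical names and values, not the actual contents of job_manager.py.

```python
import time
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional


class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class Job:
    job_id: str
    status: JobStatus = JobStatus.PENDING
    progress: float = 0.0                 # fraction in the range 0.0 .. 1.0
    result: Optional[Any] = None
    error: Optional[str] = None
    finished_at: Optional[float] = None   # monotonic time when the job ended


class InMemoryJobManager:
    """Hypothetical stand-in for job_manager.py; not the project's real API."""

    def __init__(self, retention_seconds: float = 3600.0) -> None:
        self._jobs: Dict[str, Job] = {}
        self._retention = retention_seconds

    def create(self) -> Job:
        job = Job(job_id=uuid.uuid4().hex)
        self._jobs[job.job_id] = job
        return job

    def update_progress(self, job_id: str, progress: float) -> None:
        job = self._jobs[job_id]
        job.status = JobStatus.RUNNING
        job.progress = max(0.0, min(1.0, progress))   # clamp to a valid fraction

    def complete(self, job_id: str, result: Any) -> None:
        job = self._jobs[job_id]
        job.status, job.progress, job.result = JobStatus.COMPLETED, 1.0, result
        job.finished_at = time.monotonic()

    def fail(self, job_id: str, error: str) -> None:
        job = self._jobs[job_id]
        job.status, job.error = JobStatus.FAILED, error
        job.finished_at = time.monotonic()

    def get(self, job_id: str) -> Optional[Job]:
        return self._jobs.get(job_id)

    def cleanup(self, now: Optional[float] = None) -> int:
        """Drop finished jobs older than the retention window; return how many."""
        now = time.monotonic() if now is None else now
        expired = [
            job_id for job_id, job in self._jobs.items()
            if job.finished_at is not None and now - job.finished_at > self._retention
        ]
        for job_id in expired:
            del self._jobs[job_id]
        return len(expired)
```

Keeping the result on the job object is what makes "result caching" cheap here; a multi-replica deployment would need a shared store instead of a process-local dictionary.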

5.4.2 Request Queue (request_queue.py)

Purpose: Queue and prioritize requests

Features:

  • Request queuing
  • Priority handling
  • Concurrency control
  • Timeout management
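One way to combine queuing, priority handling, and timeouts is a heap keyed on priority with an insertion counter as tie-breaker. The sketch below assumes that a lower number means higher priority and that expired requests are silently dropped; `PriorityRequestQueue` and its parameters are hypothetical, not the interface of request_queue.py.

```python
import heapq
import itertools
import time
from typing import Any, List, Optional, Tuple


class PriorityRequestQueue:
    """Hypothetical sketch: lowest priority number first, FIFO within a priority."""

    def __init__(self, max_size: int = 100) -> None:
        self._heap: List[Tuple[int, int, float, Any]] = []
        self._counter = itertools.count()   # tie-breaker that preserves insertion order
        self._max_size = max_size

    def put(self, request: Any, priority: int = 5, timeout: float = 180.0,
            now: Optional[float] = None) -> bool:
        """Enqueue a request; return False when the queue is already full."""
        if len(self._heap) >= self._max_size:
            return False
        now = time.monotonic() if now is None else now
        entry = (priority, next(self._counter), now + timeout, request)
        heapq.heappush(self._heap, entry)
        return True

    def get(self, now: Optional[float] = None) -> Optional[Any]:
        """Pop the next request that has not timed out, or None if none remain."""
        now = time.monotonic() if now is None else now
        while self._heap:
            _, _, deadline, request = heapq.heappop(self._heap)
            if deadline >= now:
                return request
        return None
```

Concurrency control is left out on purpose; in an async service it would typically be a semaphore around the worker that calls `get()`.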

5.4.3 Error Handler (error_handler.py)

Purpose: Centralized error handling

Features:

  • Error categorization
  • Contextual logging
  • Job error updates
  • Graceful degradation

5.4.4 SSE Generator (sse_generator.py)

Purpose: Server-Sent Events for real-time updates

Features:

  • Progress streaming
  • Status updates
  • Error notifications
  • Completion events
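Server-Sent Events are plain text frames made of `event:` and `data:` lines terminated by a blank line. The following sketch shows how progress and completion events could be serialized; the event names and payload fields are assumptions for illustration, not the schema emitted by sse_generator.py.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def format_sse(event: str, data: Dict[str, Any]) -> str:
    """Serialize one event in the text/event-stream wire format."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def progress_stream(job_id: str, steps: Iterable[str]) -> Iterator[str]:
    """Yield one progress event per step, then a final completion event."""
    steps = list(steps)
    for index, step in enumerate(steps, start=1):
        percent = round(100 * index / len(steps))
        yield format_sse("progress", {"job_id": job_id, "step": step, "percent": percent})
    yield format_sse("complete", {"job_id": job_id})
```

A generator like this can be handed to a streaming HTTP response with the media type `text/event-stream`.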

6. AI/ML Architecture

6.1 Unified Model Manager

File: unified_model_manager.py

Purpose: Single interface for all AI model types

Architecture:

classDiagram
    class BaseModel {
        <<abstract>>
        +name: str
        +model_type: str
        +status: ModelStatus
        +load()
        +generate(prompt, config)*
        +unload()
    }
    
    class TransformersModel {
        +_model: Pipeline
        +_load_implementation()
        +generate(prompt, config)
    }
    
    class GGUFModel {
        +_model: Llama
        +filename: str
        +_extract_filename()
        +_load_implementation()
        +generate(prompt, config)
    }
    
    class OpenVINOModel {
        +_model: OVModelForCausalLM
        +_tokenizer: AutoTokenizer
        +_load_implementation()
        +generate(prompt, config)
    }
    
    class FallbackModel {
        +_load_implementation()
        +generate(prompt, config)
    }
    
    class UnifiedModelManager {
        +max_models: int
        +max_memory_mb: int
        +get_model(name, type)
        +generate_text(name, prompt)
        +cleanup()
    }
    
    BaseModel <|-- TransformersModel
    BaseModel <|-- GGUFModel
    BaseModel <|-- OpenVINOModel
    BaseModel <|-- FallbackModel
    UnifiedModelManager --> BaseModel
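The class diagram above maps naturally onto an abstract base class plus a manager that keeps a bounded set of loaded models. The sketch below is a simplified reading of the diagram, not the code in unified_model_manager.py: it registers only a fallback backend, omits `max_memory_mb`, and the `_factories` registry and `loaded` flag are hypothetical.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict
from typing import Callable, Dict


class BaseModel(ABC):
    def __init__(self, name: str, model_type: str) -> None:
        self.name = name
        self.model_type = model_type
        self.loaded = False

    def load(self) -> None:
        self._load_implementation()
        self.loaded = True

    def unload(self) -> None:
        self.loaded = False

    @abstractmethod
    def _load_implementation(self) -> None: ...

    @abstractmethod
    def generate(self, prompt: str) -> str: ...


class FallbackModel(BaseModel):
    """Text-only fallback: returns a truncated copy of the prompt (illustrative)."""

    def _load_implementation(self) -> None:
        pass  # nothing to load

    def generate(self, prompt: str) -> str:
        return prompt[:200]


class UnifiedModelManager:
    def __init__(self, max_models: int = 2) -> None:
        self.max_models = max_models
        self._loaded: "OrderedDict[str, BaseModel]" = OrderedDict()
        # Hypothetical registry; real backends (GGUF, OpenVINO, ...) would be added here.
        self._factories: Dict[str, Callable[[str], BaseModel]] = {
            "fallback": lambda name: FallbackModel(name, "fallback"),
        }

    def get_model(self, name: str, model_type: str) -> BaseModel:
        if name in self._loaded:
            self._loaded.move_to_end(name)           # mark as most recently used
            return self._loaded[name]
        model = self._factories[model_type](name)
        model.load()
        self._loaded[name] = model
        while len(self._loaded) > self.max_models:   # evict least recently used
            _, evicted = self._loaded.popitem(last=False)
            evicted.unload()
        return model

    def generate_text(self, name: str, prompt: str, model_type: str = "fallback") -> str:
        return self.get_model(name, model_type).generate(prompt)
```

With `max_models=2`, requesting a third model unloads whichever of the first two was used least recently.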

6.2 Model Types

6.2.1 Transformers Models

Backend: Hugging Face Transformers
Device: GPU (CUDA) or CPU
Use Cases: General text generation, summarization

Supported Models:

  • microsoft/Phi-3-mini-4k-instruct
  • facebook/bart-large-cnn (deprecated)
  • google/flan-t5-large

Configuration:

{
    "model_name": "microsoft/Phi-3-mini-4k-instruct",
    "model_type": "text-generation",
    "device_map": "auto",
    "torch_dtype": "float16"
}

6.2.2 GGUF Models

Backend: llama-cpp-python
Device: CPU or GPU (via Metal/CUDA)
Use Cases: Efficient inference with quantized models

Supported Models:

  • microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf (PRIMARY)

Configuration:

{
    "model_path": "path/to/model.gguf",
    "n_ctx": 8192,
    "n_threads": 4,
    "n_gpu_layers": 35  # GPU acceleration
}

6.2.3 OpenVINO Models

Backend: Intel OpenVINO
Device: CPU (Intel optimized) or GPU
Use Cases: Production deployment on Intel hardware

Supported Models:

  • OpenVINO/Phi-3-mini-4k-instruct-fp16-ov

Configuration:

{
    "model_path": "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov",
    "device": "GPU"  # use "CPU" when no supported GPU is available
}

6.3 Model Selection Strategy

flowchart TD
    A[Request with model_name] --> B{Model specified?}
    B -->|Yes| C{Model type?}
    B -->|No| D[Use default: Phi-3 GGUF]
    
    C -->|GGUF| E[Load GGUF Model]
    C -->|OpenVINO| F[Load OpenVINO Model]
    C -->|Transformers| G[Load Transformers Model]
    C -->|Unknown| H[Auto-detect type]
    
    E --> I{Load successful?}
    F --> I
    G --> I
    H --> I
    D --> I
    
    I -->|Yes| J[Generate with model]
    I -->|No| K[Try fallback model]
    
    K --> L{Fallback successful?}
    L -->|Yes| J
    L -->|No| M[Use text-based fallback]
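The flowchart's fallback order can be written as a short loop over candidate models. In this sketch the `load` callable, the name-based `detect_model_type` heuristics, and the 500-character text fallback are all assumptions made for illustration; only the default model name comes from this document.

```python
from typing import Callable, Optional, Sequence, Tuple

Generate = Callable[[str], str]

DEFAULT_MODEL = "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"


def detect_model_type(model_name: str) -> str:
    """Guess the backend from the model name (hypothetical heuristics)."""
    lowered = model_name.lower()
    if "gguf" in lowered:
        return "gguf"
    if lowered.startswith("openvino/") or lowered.endswith("-ov"):
        return "openvino"
    return "transformers"


def text_fallback(prompt: str) -> str:
    """Last resort when no model loads: return a truncated copy of the input."""
    return prompt[:500]


def select_generator(
    load: Callable[[str, str], Generate],
    model_name: Optional[str] = None,
    model_type: Optional[str] = None,
    fallback_models: Sequence[str] = (DEFAULT_MODEL,),
) -> Tuple[str, Generate]:
    """Return (label, generate_fn), trying the request, then fallbacks, then text."""
    # dict.fromkeys removes duplicates while keeping the order of candidates.
    candidates = list(dict.fromkeys([model_name or DEFAULT_MODEL, *fallback_models]))
    for index, name in enumerate(candidates):
        kind = model_type if (index == 0 and model_type) else detect_model_type(name)
        try:
            return name, load(name, kind)
        except Exception:   # any load failure moves on to the next candidate
            continue
    return "text-fallback", text_fallback
```

Returning the label alongside the generator lets the caller log which path was actually taken.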

6.4 Model Configuration

File: models_config.json

{
  "patient_summary_models": [
    {
      "name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
      "type": "gguf",
      "is_active": true,
      "cached": true,
      "description": "Phi-3 Mini GGUF Q4 quantized - PRIMARY MODEL",
      "use_case": "Fast patient summary generation with CPU/GPU",
      "repo_id": "microsoft/Phi-3-mini-4k-instruct-gguf",
      "filename": "Phi-3-mini-4k-instruct-q4.gguf"
    }
  ],
  "runtime_behavior": {
    "allow_runtime_downloads": true,
    "cache_runtime_downloads": true,
    "fallback_to_cached": true
  }
}

6.5 Token Management

Token Limit Handling:

  • Automatic token counting (heuristic: ~4 chars/token)
  • Pre-generation validation
  • Token limit error detection
  • Graceful degradation

Token Limits by Model:

  • Phi-3 models: 4096 tokens (context window of the 4k-instruct variants; prompt and generated tokens share it)
  • BART models: 1024 tokens
  • T5 models: 512 tokens
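A minimal sketch of the heuristic and the pre-generation check described above follows. The limits are copied from the list; the function names, the `TokenLimitError` type, and the 256-token output reserve are assumptions, not the project's actual implementation.

```python
import math

CHARS_PER_TOKEN = 4   # heuristic from the list above, not a real tokenizer

CONTEXT_LIMITS = {"phi-3": 4096, "bart": 1024, "t5": 512}


class TokenLimitError(ValueError):
    """Raised when the estimated prompt does not fit the model's context."""


def estimate_tokens(text: str) -> int:
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def validate_prompt(text: str, model_family: str, reserved_output_tokens: int = 256) -> int:
    """Return the estimated number of spare tokens, or raise TokenLimitError."""
    limit = CONTEXT_LIMITS[model_family]
    needed = estimate_tokens(text) + reserved_output_tokens
    if needed > limit:
        raise TokenLimitError(f"Input exceeds token limit: ~{needed} > {limit}")
    return limit - needed


def truncate_to_fit(text: str, model_family: str, reserved_output_tokens: int = 256) -> str:
    """Graceful degradation: cut the input to the estimated character budget."""
    budget_tokens = CONTEXT_LIMITS[model_family] - reserved_output_tokens
    return text[: max(budget_tokens, 0) * CHARS_PER_TOKEN]
```

Because the estimate is character-based, it can be off for text with many digits or rare terms; a tokenizer-based count is more accurate when the model's tokenizer is available.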

6.6 Generation Configuration

from dataclasses import dataclass

@dataclass
class GenerationConfig:
    max_tokens: int = 8192      # Maximum output tokens
    min_tokens: int = 50        # Minimum output tokens
    temperature: float = 0.3    # Low temperature for more consistent medical output
    top_p: float = 0.9          # Nucleus sampling threshold
    timeout: float = 180.0      # Generation timeout in seconds (tuned for T4)
    stream: bool = False        # Enable streaming responses

6.7 T4 GPU Optimizations

Hardware Target: NVIDIA T4 Medium (16GB GPU, 16GB RAM)

Optimizations:

  1. Memory Management:

    • Max 2 models in memory
    • Automatic model unloading
    • GPU memory clearing
    • Garbage collection
  2. Model Loading:

    • Lazy loading (on-demand)
    • Intelligent caching
    • LRU eviction policy
  3. Inference:

    • FP16 precision
    • Batch size: 1
    • Context window: 8192 tokens
    • GPU layer offloading (GGUF)

7. API Architecture

7.1 RESTful Design

Principles:

  • Resource-oriented URLs
  • HTTP methods for CRUD operations
  • JSON request/response format
  • Stateless communication
  • Proper HTTP status codes

7.2 Request/Response Flow

sequenceDiagram
    participant C as Client
    participant A as API Gateway
    participant S as Service Layer
    participant M as Model Manager
    participant D as Database
    
    C->>A: POST /generate_patient_summary
    A->>A: Validate request
    A->>S: Create job
    S->>D: Log job creation
    A-->>C: 202 Accepted (job_id)
    
    S->>M: Load model
    M->>M: Check cache
    M->>M: Load if needed
    M-->>S: Model ready
    
    S->>M: Generate summary
    M->>M: Process prompt
    M-->>S: Generated text
    
    S->>D: Log completion
    S->>A: Update job status
    A-->>C: SSE: Progress updates
    
    C->>A: GET /job/{job_id}
    A->>S: Get job status
    S->>D: Retrieve job
    S-->>A: Job result
    A-->>C: 200 OK (result)

7.3 Authentication & Authorization

Current State: Basic API key authentication (optional)

Planned Enhancements:

  • JWT-based authentication
  • Role-based access control (RBAC)
  • OAuth2 integration
  • API rate limiting

7.4 Error Handling

Error Response Format:

{
  "error": {
    "code": "MODEL_LOAD_FAILED",
    "message": "Failed to load model: microsoft/Phi-3-mini-4k-instruct",
    "details": {
      "model_name": "microsoft/Phi-3-mini-4k-instruct",
      "error_type": "initialization_error",
      "timestamp": "2025-12-05T17:23:52Z"
    }
  }
}

HTTP Status Codes:

  • 200 OK - Successful request
  • 202 Accepted - Job created
  • 400 Bad Request - Invalid input
  • 404 Not Found - Resource not found
  • 500 Internal Server Error - Server error
  • 503 Service Unavailable - Service degraded

7.5 Rate Limiting

Strategy: Token bucket algorithm

Limits:

  • 100 requests/minute per IP
  • 1000 requests/hour per API key
  • Burst allowance: 20 requests
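A token bucket refills at a steady rate and allows short bursts up to its capacity. The sketch below reads the per-IP limit as a refill rate of 100 tokens per minute with a capacity of 20 for the burst allowance; that mapping and the `TokenBucket` class itself are assumptions, since the document does not show the implementation.

```python
import time
from typing import Optional


class TokenBucket:
    def __init__(self, rate_per_second: float, capacity: int,
                 now: Optional[float] = None) -> None:
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = float(capacity)            # start full so bursts are allowed
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Assumed mapping: 100 requests/minute per IP with a burst allowance of 20.
per_ip_bucket = TokenBucket(rate_per_second=100 / 60, capacity=20)
```

In a service, one bucket would be kept per client key (IP or API key), and a rejected call would map to an HTTP 429 response.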

8. Data Flow & Processing

8.1 Document Processing Pipeline

flowchart LR
    A[Upload Document] --> B{File Type?}
    B -->|PDF| C[PDF Parser]
    B -->|DOCX| D[DOCX Parser]
    B -->|Image| E[OCR Engine]
    B -->|Audio| F[Whisper Transcription]
    
    C --> G[Text Extraction]
    D --> G
    E --> G
    F --> G
    
    G --> H[PHI Scrubbing]
    H --> I[Medical Data Extraction]
    I --> J[Store Processed Data]
    J --> K[Return Results]
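The pipeline above is a dispatch on file type followed by two fixed stages. The sketch below passes the extractors, the scrubber, and the medical data extractor in as callables so it stays independent of PyPDF2, Tesseract, or Whisper; the extension table and the `process_document` signature are assumptions for illustration.

```python
from pathlib import Path
from typing import Any, Callable, Dict

Extractor = Callable[[bytes], str]

# Assumed mapping from file extension to the parser branch in the flowchart.
EXTENSION_TO_KIND = {
    ".pdf": "pdf", ".docx": "docx",
    ".png": "image", ".jpg": "image", ".jpeg": "image", ".tiff": "image",
    ".wav": "audio", ".mp3": "audio", ".m4a": "audio",
}


def process_document(
    filename: str,
    content: bytes,
    extractors: Dict[str, Extractor],
    scrub: Callable[[str], str],
    extract_medical_data: Callable[[str], Any],
) -> Dict[str, Any]:
    """Run text extraction, PHI scrubbing, then structured extraction."""
    kind = EXTENSION_TO_KIND.get(Path(filename).suffix.lower())
    extractor = extractors.get(kind) if kind else None
    if extractor is None:
        raise ValueError(f"Unsupported file type: {filename}")
    text = extractor(content)
    clean_text = scrub(text)   # scrub before any downstream processing sees the text
    return {"kind": kind, "text": clean_text, "data": extract_medical_data(clean_text)}
```

Scrubbing before extraction, as the flowchart orders it, means the structured output never contains raw identifiers.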

8.2 Patient Summary Generation Flow

flowchart TD
    A[Patient Data Input] --> B[Parse EHR Data]
    B --> C[Convert to Plain Text]
    C --> D{Data Size Check}
    
    D -->|Small| E[Single-pass Generation]
    D -->|Large| F[Chunking Strategy]
    
    F --> G[Chunk by Date/Size]
    G --> H[Process Chunks in Parallel]
    H --> I[Combine Chunk Summaries]
    
    E --> J[Generate with Model]
    I --> J
    
    J --> K[Format Clinical Output]
    K --> L[Evaluate Against Guidelines]
    L --> M[Return Summary]
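The large-input branch of this flow is a map-reduce: summarize each chunk, then summarize the combined partial summaries. The sketch below chunks on paragraph boundaries by size only (the flowchart also mentions chunking by date, which is omitted); `generate` stands in for the model call, and the 8000-character threshold and worker count are assumed values.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Split on blank lines, packing paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def summarize(text: str, generate: Callable[[str], str],
              max_chars: int = 8000, max_workers: int = 2) -> str:
    if len(text) <= max_chars:
        return generate(text)                          # single-pass path
    chunks = chunk_text(text, max_chars)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(generate, chunks))    # map keeps chunk order
    return generate("\n\n".join(partials))             # combine step
```

Threads only help if the model backend releases the GIL or runs out of process; with a single GPU-bound model the chunks may effectively run one at a time.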

8.3 Data Transformation

Input Formats:

  • Raw EHR JSON
  • HL7 FHIR resources
  • Plain text documents
  • Scanned images
  • Audio recordings

Output Formats:

  • Structured JSON
  • Clinical summary (Markdown)
  • FHIR-compliant resources
  • Audit logs

8.4 Caching Strategy

Multi-Level Caching:

  1. Model Cache: Loaded models in memory
  2. Result Cache: Generated summaries (LRU)
  3. File Cache: Processed documents
  4. Hugging Face Cache: Downloaded models

Cache Invalidation:

  • Time-based expiration
  • Manual invalidation
  • Memory pressure-based eviction
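For the result cache, time-based expiration and LRU eviction can live in one small structure. The following is a sketch under assumed defaults (100 items, one-hour TTL); `TTLLRUCache` is a hypothetical name, and memory-pressure eviction is not modeled.

```python
import time
from collections import OrderedDict
from typing import Any, Hashable, Optional, Tuple


class TTLLRUCache:
    def __init__(self, max_items: int = 100, ttl_seconds: float = 3600.0) -> None:
        self.max_items = max_items
        self.ttl = ttl_seconds
        self._items: "OrderedDict[Hashable, Tuple[float, Any]]" = OrderedDict()

    def get(self, key: Hashable, now: Optional[float] = None) -> Optional[Any]:
        now = time.monotonic() if now is None else now
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if expires_at <= now:             # time-based expiration
            del self._items[key]
            return None
        self._items.move_to_end(key)      # mark as most recently used
        return value

    def put(self, key: Hashable, value: Any, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._items[key] = (now + self.ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)   # evict least recently used

    def invalidate(self, key: Hashable) -> None:
        """Manual invalidation; a missing key is ignored."""
        self._items.pop(key, None)
```

A natural cache key for summaries is a hash of the scrubbed input plus the model name, so a model switch does not return stale output.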

9. Database Design

9.1 Database Schema

Primary Database: PostgreSQL (optional, for audit logs)

Audit Logs Table

CREATE TABLE audit_logs (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMP NOT NULL DEFAULT NOW(),
    user_id VARCHAR(255),
    action VARCHAR(100) NOT NULL,
    resource_type VARCHAR(100),
    resource_id VARCHAR(255),
    phi_accessed BOOLEAN DEFAULT FALSE,
    ip_address INET,
    user_agent TEXT,
    request_data JSONB,
    response_status INTEGER,
    error_message TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_audit_timestamp ON audit_logs(timestamp);
CREATE INDEX idx_audit_user ON audit_logs(user_id);
CREATE INDEX idx_audit_action ON audit_logs(action);
CREATE INDEX idx_audit_phi ON audit_logs(phi_accessed);
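Writing an audit row comes down to one parameterized INSERT. To keep the sketch runnable without a PostgreSQL server it uses SQLite with a reduced, SQLite-compatible version of the table (no INET or JSONB columns); the `log_audit_event` helper is hypothetical and is not taken from database_audit.py.

```python
import json
import sqlite3
from typing import Any, Dict, Optional

# Reduced stand-in for the PostgreSQL schema above, for illustration only.
SCHEMA = """
CREATE TABLE audit_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    user_id TEXT,
    action TEXT NOT NULL,
    resource_type TEXT,
    resource_id TEXT,
    phi_accessed INTEGER DEFAULT 0,
    request_data TEXT,
    response_status INTEGER
)
"""


def log_audit_event(
    conn: sqlite3.Connection,
    action: str,
    user_id: Optional[str] = None,
    resource_type: Optional[str] = None,
    resource_id: Optional[str] = None,
    phi_accessed: bool = False,
    request_data: Optional[Dict[str, Any]] = None,
    response_status: Optional[int] = None,
) -> int:
    """Insert one audit row with bound parameters and return its id."""
    cursor = conn.execute(
        "INSERT INTO audit_logs (user_id, action, resource_type, resource_id,"
        " phi_accessed, request_data, response_status)"
        " VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            user_id, action, resource_type, resource_id, int(phi_accessed),
            json.dumps(request_data) if request_data is not None else None,
            response_status,
        ),
    )
    conn.commit()
    return cursor.lastrowid
```

Bound parameters keep user-supplied values out of the SQL text; request payloads should be scrubbed before they are stored, since the audit table itself must not become a PHI store.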

9.2 Data Models

Patient Data Model (In-memory):

{
    "patient_id": "string",
    "demographics": {
        "name": "string",
        "dob": "date",
        "gender": "string",
        "mrn": "string"
    },
    "visits": [
        {
            "visit_id": "string",
            "date": "datetime",
            "chief_complaint": "string",
            "diagnoses": ["string"],
            "medications": ["string"],
            "procedures": ["string"],
            "vitals": {},
            "labs": []
        }
    ]
}

9.3 File Storage

Storage Strategy: Local filesystem or cloud storage

Directory Structure:

/data/
├── uploads/           # Uploaded documents
├── processed/         # Processed documents
├── cache/             # Temporary cache
└── models/            # Model files

10. Security Architecture

10.1 HIPAA Compliance

Requirements Met:

  1. Access Controls: Authentication and authorization
  2. Audit Logging: Comprehensive activity logs
  3. Data Encryption: In-transit and at-rest
  4. PHI Scrubbing: Automated anonymization
  5. Secure Communication: HTTPS/TLS

10.2 PHI Scrubbing

Scrubbing Patterns:

PATTERNS = {
    "name": r'\b[A-Z][a-z]+ [A-Z][a-z]+\b',  # any two capitalized words; expect false positives
    "mrn": r'\bMRN[:\s]*\d{6,10}\b',
    "dob": r'\b\d{1,2}/\d{1,2}/\d{2,4}\b',
    "phone": r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    "email": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',  # no '|' inside the character class
    "ssn": r'\b\d{3}-\d{2}-\d{4}\b'
}
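Applying such patterns is a sequence of substitutions that replace each match with a typed placeholder. The sketch below uses a subset of the patterns (the name pattern is left out because it matches any two capitalized words) and writes the email top-level domain class as `[A-Za-z]`; the placeholder format and the `scrub_phi` function are assumptions, not the behavior of phi_scrubber.py.

```python
import re
from typing import Dict, Tuple

# Subset of the patterns above, compiled once; labels become the placeholders.
PHI_PATTERNS = {
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DOB": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}


def scrub_phi(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace each match with [LABEL] and report how many were replaced."""
    counts: Dict[str, int] = {}
    for label, pattern in PHI_PATTERNS.items():
        text, replaced = pattern.subn(f"[{label}]", text)
        if replaced:
            counts[label] = replaced
    return text, counts
```

Returning the counts rather than the matched values gives the audit log something to record without copying the PHI itself.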

10.3 Container Security

Security Measures:

  • Non-root user execution
  • Read-only root filesystem
  • Resource limits (CPU, memory)
  • Network policies
  • Secrets management
  • Minimal base images

10.4 API Security

Security Headers:

{
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "X-XSS-Protection": "1; mode=block",
    "Strict-Transport-Security": "max-age=31536000"
}
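These headers can be attached in one place with a small ASGI middleware, which FastAPI accepts because it is itself an ASGI application. The sketch below is framework-free and only assumes the standard ASGI message shapes; `SecurityHeadersMiddleware` and the `response_headers_for` helper are hypothetical names, and the document does not say how the project actually sets these headers.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

SECURITY_HEADERS = {
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "X-XSS-Protection": "1; mode=block",
    "Strict-Transport-Security": "max-age=31536000",
}

Message = Dict[str, Any]
Send = Callable[[Message], Awaitable[None]]


class SecurityHeadersMiddleware:
    """Plain ASGI middleware that appends the headers above to HTTP responses."""

    def __init__(self, app: Callable[..., Awaitable[None]]) -> None:
        self.app = app

    async def __call__(self, scope: Dict[str, Any], receive: Callable, send: Send) -> None:
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_headers(message: Message) -> None:
            if message["type"] == "http.response.start":
                headers: List[Tuple[bytes, bytes]] = list(message.get("headers", []))
                present = {name.lower() for name, _ in headers}
                for name, value in SECURITY_HEADERS.items():
                    key = name.lower().encode("latin-1")
                    if key not in present:   # keep headers the app already set
                        headers.append((key, value.encode("latin-1")))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_headers)


def response_headers_for(app: Callable[..., Awaitable[None]]) -> List[Tuple[bytes, bytes]]:
    """Drive an ASGI app once with a stub request and return its response headers."""
    sent: List[Message] = []

    async def receive() -> Message:
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message: Message) -> None:
        sent.append(message)

    asyncio.run(app({"type": "http"}, receive, send))
    return sent[0]["headers"]
```

With FastAPI this would be registered as `app.add_middleware(SecurityHeadersMiddleware)`; the helper at the end exists only to exercise the middleware without a server.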

11. Deployment Architecture

11.1 Deployment Options

11.1.1 Hugging Face Spaces

Configuration: .huggingface.yaml

runtime: docker
sdk: docker
python_version: "3.10"

build:
  dockerfile: Dockerfile.hf-spaces
  cache: true

hardware:
  gpu: t4-medium  # 16GB GPU RAM, 16GB System RAM

env:
  - SPACE_ID=$SPACE_ID
  - HF_HOME=/app/.cache/huggingface
  - TORCH_HOME=/app/.cache/torch
  - MODEL_CACHE_DIR=/app/models
  - PRELOAD_GGUF=true
  - HF_SPACES=true

Optimizations:

  • Pre-cached models in Docker image
  • Lazy model loading
  • Memory-efficient inference
  • Automatic GPU detection

11.1.2 Kubernetes

Deployment Manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hntai-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hntai
  template:
    metadata:
      labels:
        app: hntai
    spec:
      containers:
      - name: hntai
        image: hntai:latest
        ports:
        - containerPort: 7860
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
        livenessProbe:
          httpGet:
            path: /health/live
            port: 7860
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 7860
          initialDelaySeconds: 10
          periodSeconds: 5

11.1.3 Docker

Multi-Stage Dockerfile:

# Stage 1: Builder (compilers stay in this stage only)
FROM python:3.10-slim AS builder
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: Runtime (copies only the installed packages)
FROM python:3.10-slim AS runtime
COPY --from=builder /install /usr/local
WORKDIR /app
COPY . .
ENV PYTHONUNBUFFERED=1
EXPOSE 7860
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

11.2 Scaling Strategy

Horizontal Scaling:

  • Multiple replicas behind load balancer
  • Stateless design for easy scaling
  • Shared model cache (optional)

Vertical Scaling:

  • Increase CPU/memory per instance
  • GPU acceleration for inference
  • Larger model support

11.3 High Availability

Components:

  1. Load Balancer: Distribute traffic
  2. Health Checks: Automatic failover
  3. Auto-scaling: Based on CPU/memory
  4. Graceful Shutdown: Drain connections

12. Performance Optimization

12.1 Model Optimization

Techniques:

  1. Quantization: GGUF Q4 models (4-bit)
  2. Precision: FP16 for GPU inference
  3. Batching: Batch size optimization
  4. Caching: Model and result caching
  5. Lazy Loading: On-demand model loading

12.2 Memory Management

Strategies:

  • Automatic garbage collection
  • GPU memory clearing
  • Model unloading (LRU)
  • Memory pressure monitoring

Memory Limits:

  • T4 Medium: 16GB GPU, 16GB RAM
  • Max 2 models in memory
  • Automatic eviction at 80% usage

12.3 Inference Optimization

T4-Specific Optimizations:

{
    "max_models": 2,
    "max_memory_mb": 14000,
    "n_ctx": 8192,
    "n_threads": 4,
    "n_gpu_layers": 35,
    "torch_dtype": "float16",
    "device_map": "auto"
}

12.4 Caching Strategy

Cache Hierarchy:

  1. L1 - Model Cache: In-memory loaded models
  2. L2 - Result Cache: Generated summaries (LRU, 100 items)
  3. L3 - File Cache: Processed documents (disk)
  4. L4 - HF Cache: Downloaded models (disk)

12.5 Performance Metrics

Target Metrics:

  • Model load time: < 10 seconds
  • Summary generation: < 60 seconds (small), < 180 seconds (large)
  • API response time: < 100ms (excluding generation)
  • Memory usage: < 80% of available
  • GPU utilization: > 70% during inference

13. Monitoring & Observability

13.1 Health Checks

Liveness Probe (/health/live):

{
    "status": "alive",
    "timestamp": "2025-12-05T17:23:52Z"
}

Readiness Probe (/health/ready):

{
    "status": "ready",
    "checks": {
        "database": "ok",
        "model_manager": "ok",
        "file_storage": "ok"
    },
    "timestamp": "2025-12-05T17:23:52Z"
}
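A readiness response of this shape can be produced by running a set of named checks and aggregating them. In the sketch below the check callables, the `not_ready` status string, and the mapping to HTTP 200 or 503 are assumptions; health_endpoints.py may behave differently.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Tuple


def readiness_report(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, Any]]:
    """Run each check and return (http_status, response_body)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:   # a crashing check counts as not ready
            results[name] = f"error: {type(exc).__name__}"
    ready = all(value == "ok" for value in results.values())
    body = {
        "status": "ready" if ready else "not_ready",
        "checks": results,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return (200 if ready else 503), body
```

Returning 503 when any check fails is what lets a Kubernetes readiness probe take the pod out of rotation without restarting it.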

13.2 Metrics

Prometheus Metrics (/metrics):

# Model metrics
model_load_time_seconds{model_name="phi-3-gguf"} 8.5
model_inference_time_seconds{model_name="phi-3-gguf"} 45.2
model_memory_usage_bytes{model_name="phi-3-gguf"} 4294967296

# API metrics
http_requests_total{method="POST",endpoint="/generate_patient_summary"} 1234
http_request_duration_seconds{method="POST",endpoint="/generate_patient_summary"} 52.3

# System metrics
memory_usage_percent 65.2
gpu_memory_usage_percent 72.1
cpu_usage_percent 45.8

13.3 Logging

Structured Logging:

{
    "timestamp": "2025-12-05T17:23:52Z",
    "level": "INFO",
    "logger": "ai_med_extract.agents.patient_summary_agent",
    "message": "Generated patient summary",
    "context": {
        "job_id": "abc123",
        "model_name": "phi-3-gguf",
        "duration_seconds": 45.2,
        "token_count": 2048
    }
}
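Log entries in this shape can be produced with the standard `logging` module and a custom formatter. This is a sketch that assumes callers pass the context dictionary through `extra={"context": {...}}`; the `JsonFormatter` class is hypothetical and not taken from the codebase.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields shown above."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Values passed via `extra` become attributes on the record.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def configure_json_logging(level: int = logging.INFO) -> None:
    """Send root logger output to stderr as JSON lines."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=level, handlers=[handler])
```

A call such as `logger.info("Generated patient summary", extra={"context": {"job_id": "abc123"}})` would then emit a single JSON line; the context should carry identifiers and timings only, never patient text.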

Log Levels:

  • DEBUG: Detailed diagnostic information
  • INFO: General informational messages
  • WARNING: Warning messages
  • ERROR: Error messages
  • CRITICAL: Critical failures

13.4 Audit Logging

HIPAA Audit Trail:

{
    "timestamp": "2025-12-05T17:23:52Z",
    "user_id": "user123",
    "action": "PHI_ACCESS",
    "resource_type": "patient_summary",
    "resource_id": "patient456",
    "phi_accessed": true,
    "ip_address": "192.168.1.100",
    "user_agent": "Mozilla/5.0...",
    "request_data": {...},
    "response_status": 200
}

14. Development Workflow

14.1 Local Development

Setup:

# Clone repository
git clone <repository-url>
cd HNTAI

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export DATABASE_URL="postgresql://user:pass@localhost:5432/hntai"
export SECRET_KEY="your-secret-key"
export HF_HOME="/tmp/huggingface"

# Run development server
cd services/ai-service/src
python -m ai_med_extract.app run_dev

14.2 Testing

Test Structure:

tests/
├── unit/
│   ├── test_agents.py
│   ├── test_model_manager.py
│   └── test_utils.py
├── integration/
│   ├── test_api.py
│   └── test_workflows.py
└── conftest.py

Running Tests:

# Unit tests
python -m pytest tests/unit/

# Integration tests
python -m pytest tests/integration/

# Coverage report
python -m pytest --cov=ai_med_extract tests/

14.3 Code Quality

Tools:

# Format code
black .
isort .

# Lint code
flake8 .

# Type checking
mypy services/ai-service/src/ai_med_extract/

14.4 Git Workflow

Branching Strategy:

  • main: Production-ready code
  • develop: Integration branch
  • feature/*: Feature branches
  • bugfix/*: Bug fix branches
  • hotfix/*: Production hotfixes

Commit Convention:

<type>(<scope>): <subject>

<body>

<footer>

Types: feat, fix, docs, style, refactor, test, chore


15. Integration Patterns

15.1 External System Integration

Integration Points:

  1. EHR Systems: HL7, FHIR APIs
  2. Document Management: File uploads, cloud storage
  3. Authentication: OAuth2, SAML
  4. Monitoring: Prometheus, Grafana
  5. Logging: ELK Stack, CloudWatch

15.2 API Integration

Client Libraries (Planned):

  • Python SDK
  • JavaScript SDK
  • REST API documentation (OpenAPI/Swagger)

Example Integration:

import time

import requests

BASE_URL = "https://api.hntai.com"
HEADERS = {"Authorization": "Bearer <token>"}

# Upload document (the context manager closes the file handle)
with open("document.pdf", "rb") as f:
    response = requests.post(f"{BASE_URL}/upload", files={"file": f}, headers=HEADERS)
response.raise_for_status()

# Generate patient summary
response = requests.post(
    f"{BASE_URL}/generate_patient_summary",
    json={
        "patient_data": {...},
        "model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
        "model_type": "gguf"
    },
    headers=HEADERS,
)
response.raise_for_status()
job_id = response.json()["job_id"]

# Poll for results; stop on completion, failure, or after an assumed 5-minute timeout
deadline = time.monotonic() + 300
while time.monotonic() < deadline:
    response = requests.get(f"{BASE_URL}/job/{job_id}", headers=HEADERS)
    response.raise_for_status()
    status = response.json()["status"]
    if status in ("completed", "failed"):  # "failed" is an assumed status name
        break
    time.sleep(5)

15.3 Webhook Support

Planned Feature: Webhook notifications for job completion

{
    "event": "job.completed",
    "job_id": "abc123",
    "timestamp": "2025-12-05T17:23:52Z",
    "data": {
        "status": "completed",
        "result": {...}
    }
}

16. Scalability Considerations

16.1 Horizontal Scaling

Strategies:

  1. Stateless Design: No session state in application
  2. Load Balancing: Distribute requests across instances
  3. Shared Cache: Redis for distributed caching
  4. Message Queue: RabbitMQ/Kafka for async processing

16.2 Vertical Scaling

Resource Scaling:

  • CPU: 2-8 cores per instance
  • Memory: 8-32 GB per instance
  • GPU: T4, V100, A100 for inference

16.3 Database Scaling

Strategies:

  1. Read Replicas: For audit log queries
  2. Partitioning: Time-based partitioning for logs
  3. Indexing: Optimize query performance
  4. Archiving: Move old logs to cold storage

16.4 Model Serving

Scaling Options:

  1. Model Replication: Same model on multiple instances
  2. Model Sharding: Different models on different instances
  3. Model Versioning: A/B testing with multiple versions
  4. Dedicated Inference: Separate inference service

17. Future Roadmap

17.1 Short-Term (3-6 months)

  1. Enhanced Model Support:

    • Support for Llama 3, Mistral models
    • Fine-tuned medical models
    • Multi-modal models (text + images)
  2. Improved Performance:

    • Model quantization (INT8, INT4)
    • Batch inference support
    • Streaming responses
  3. Additional Features:

    • Real-time collaboration
    • Version control for summaries
    • Template-based summaries

17.2 Medium-Term (6-12 months)

  1. Advanced AI Capabilities:

    • Multi-agent orchestration
    • Retrieval-Augmented Generation (RAG)
    • Knowledge graph integration
  2. Enterprise Features:

    • Multi-tenancy support
    • Advanced RBAC
    • SSO integration
    • Compliance reporting
  3. Platform Enhancements:

    • Web UI for management
    • Mobile app support
    • Plugin architecture

17.3 Long-Term (12+ months)

  1. AI/ML Advancements:

    • Custom model training pipeline
    • Federated learning support
    • Explainable AI (XAI)
  2. Ecosystem Integration:

    • FHIR server integration
    • HL7 v3 support
    • DICOM image analysis
  3. Global Expansion:

    • Multi-language support
    • Regional compliance (GDPR, etc.)
    • Edge deployment

Appendix A: Configuration Reference

Environment Variables

| Variable | Description | Default | Required |
|---|---|---|---|
| DATABASE_URL | PostgreSQL connection string | - | No |
| SECRET_KEY | Application secret key | - | Yes |
| JWT_SECRET_KEY | JWT signing key | - | Yes |
| HF_HOME | Hugging Face cache directory | /tmp/huggingface | No |
| TORCH_HOME | PyTorch cache directory | /tmp/torch | No |
| WHISPER_CACHE | Whisper model cache | /tmp/whisper | No |
| HF_SPACES | Hugging Face Spaces mode | false | No |
| PRELOAD_GGUF | Preload GGUF models | false | No |
| MAX_NEW_TOKENS | Max output tokens | 8192 | No |
| MAX_INPUT_TOKENS | Max input tokens | 2048 | No |

Appendix B: API Reference

Complete Endpoint List

| Method | Endpoint | Description |
|---|---|---|
| GET | / | Root endpoint |
| GET | /health/live | Liveness probe |
| GET | /health/ready | Readiness probe |
| GET | /metrics | Prometheus metrics |
| POST | /upload | Upload document |
| POST | /transcribe | Transcribe audio |
| POST | /generate_patient_summary | Generate patient summary |
| POST | /api/generate_summary | Generate text summary |
| POST | /api/patient_summary_openvino | OpenVINO summary |
| POST | /extract_medical_data | Extract medical data |
| GET | /get_updated_medical_data | Get processed data |
| PUT | /update_medical_data | Update medical data |
| POST | /api/load_model | Load model |
| GET | /api/model_info | Get model info |
| POST | /api/switch_model | Switch model |

Appendix C: Troubleshooting Guide

Common Issues

Model Loading Failures

Symptom: Model fails to load
Causes:

  • Insufficient memory
  • Missing dependencies
  • Network issues (download)

Solutions:

  1. Check memory availability
  2. Verify dependencies installed
  3. Check network connectivity
  4. Use fallback model

Token Limit Errors

Symptom: "Input exceeds token limit"
Causes:

  • Input too long
  • Model context window exceeded

Solutions:

  1. Reduce input size
  2. Use chunking strategy
  3. Switch to larger context model

Performance Issues

Symptom: Slow inference
Causes:

  • CPU-only inference
  • Large model size
  • Memory pressure

Solutions:

  1. Enable GPU acceleration
  2. Use quantized models (GGUF)
  3. Reduce batch size
  4. Clear model cache

Appendix D: Glossary

| Term | Definition |
|---|---|
| PHI | Protected Health Information |
| HIPAA | Health Insurance Portability and Accountability Act |
| EHR | Electronic Health Record |
| FHIR | Fast Healthcare Interoperability Resources |
| HL7 | Health Level 7 (healthcare data standard) |
| GGUF | GPT-Generated Unified Format (quantized models) |
| OpenVINO | Open Visual Inference and Neural Network Optimization |
| T4 | NVIDIA Tesla T4 GPU |
| LRU | Least Recently Used (cache eviction) |
| SSE | Server-Sent Events |
| ASGI | Asynchronous Server Gateway Interface |

Document Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-12-05 | System | Initial comprehensive documentation |

End of Technical Architecture Documentation