HNTAI - Comprehensive Technical Architecture Documentation

Version: 1.0
Last Updated: December 5, 2025
Project: Medical Data Extraction & AI Processing Platform

1. Executive Summary
2. System Overview
3. Architecture Design
4. Technology Stack
5. Core Components
6. AI/ML Architecture
7. API Architecture
8. Data Flow & Processing
9. Database Design
10. Security Architecture
11. Deployment Architecture
12. Performance Optimization
13. Monitoring & Observability
14. Development Workflow
15. Integration Patterns
16. Scalability Considerations
17. Future Roadmap

1. Executive Summary

HNTAI (Healthcare AI Text Analysis & Interpretation) is a production-ready medical AI platform for extracting, processing, and analyzing medical data. It provides HIPAA-compliant document processing, PHI scrubbing, and AI-powered patient summary generation, and supports multiple AI model backends.

Key Capabilities

Multi-format Document Processing: PDF, DOCX, images, and audio transcription
HIPAA Compliance: Automated PHI scrubbing with comprehensive audit logging
Multi-Model AI Support: Transformers, OpenVINO, and GGUF models with automatic optimization
Scalable Architecture: Kubernetes-ready with horizontal scaling capabilities
Production-Ready: Health checks, metrics, structured logging, and error handling

Target Deployment Environments

Hugging Face Spaces (T4 Medium GPU)
Kubernetes Clusters (On-premise or cloud)
Docker Containers (Standalone or orchestrated)
Local Development (CPU or GPU)

2. System Overview

2.1 Purpose & Scope

HNTAI serves as a comprehensive medical AI platform that bridges the gap between raw medical documents and actionable clinical insights. The system is designed to:

Extract structured medical data from unstructured documents
Anonymize protected health information (PHI) for compliance
Summarize patient records into comprehensive clinical assessments
Process multi-modal medical data (text, images, audio)

2.2 Design Principles

Simplicity: Clean, maintainable codebase with essential features
Flexibility: Support for multiple AI model types and backends
Security: HIPAA-compliant with comprehensive audit trails
Performance: Optimized for T4 GPU with intelligent caching
Reliability: Robust error handling and automatic fallback mechanisms

2.3 High-Level Architecture

graph TB
    subgraph "Client Layer"
        A[Web Client]
        B[Mobile Client]
        C[API Client]
    end
    
    subgraph "API Gateway"
        D[FastAPI Application]
        E[Health Endpoints]
        F[Metrics Endpoint]
    end
    
    subgraph "Service Layer"
        G[Document Processing Service]
        H[PHI Scrubbing Service]
        I[Patient Summary Service]
        J[Model Management Service]
    end
    
    subgraph "AI/ML Layer"
        K[Unified Model Manager]
        L[Transformers Models]
        M[GGUF Models]
        N[OpenVINO Models]
        O[Whisper Audio Models]
    end
    
    subgraph "Data Layer"
        P[PostgreSQL - Audit Logs]
        Q[File Storage]
        R[Model Cache]
    end
    
    A --> D
    B --> D
    C --> D
    D --> E
    D --> F
    D --> G
    D --> H
    D --> I
    D --> J
    G --> K
    H --> K
    I --> K
    J --> K
    K --> L
    K --> M
    K --> N
    K --> O
    D --> P
    G --> Q
    K --> R

3. Architecture Design

3.1 Architectural Style

HNTAI follows a Layered Monolithic Architecture with clear separation of concerns:

Presentation Layer: FastAPI routes and endpoints
Service Layer: Business logic and orchestration
Agent Layer: Specialized AI agents for specific tasks
Utility Layer: Shared utilities and helpers
Data Layer: Database and file storage

3.2 Component Architecture

graph LR
    subgraph "FastAPI Application"
        A[routes_fastapi.py]
        B[app.py]
        C[main.py]
    end
    
    subgraph "Agents"
        D[patient_summary_agent.py]
        E[phi_scrubber.py]
        F[text_extractor.py]
        G[medical_data_extractor.py]
    end
    
    subgraph "Services"
        H[job_manager.py]
        I[request_queue.py]
        J[error_handler.py]
        K[sse_generator.py]
    end
    
    subgraph "Utils"
        L[unified_model_manager.py]
        M[model_config.py]
        N[robust_json_parser.py]
        O[memory_manager.py]
    end
    
    A --> D
    A --> E
    A --> F
    A --> G
    A --> H
    A --> I
    D --> L
    E --> L
    F --> L
    G --> L
    L --> M
    L --> O

3.3 Directory Structure

HNTAI/
├── services/
│   └── ai-service/
│       └── src/
│           └── ai_med_extract/
│               ├── agents/              # AI agents for specific tasks
│               │   ├── patient_summary_agent.py
│               │   ├── phi_scrubber.py
│               │   ├── text_extractor.py
│               │   └── medical_data_extractor.py
│               ├── api/                 # FastAPI routes
│               │   └── routes_fastapi.py
│               ├── services/            # Business logic services
│               │   ├── job_manager.py
│               │   ├── request_queue.py
│               │   ├── error_handler.py
│               │   └── sse_generator.py
│               ├── utils/               # Utilities and helpers
│               │   ├── unified_model_manager.py
│               │   ├── model_config.py
│               │   ├── robust_json_parser.py
│               │   ├── memory_manager.py
│               │   ├── openvino_summarizer_utils.py
│               │   └── patient_summary_utils.py
│               ├── app.py               # FastAPI app factory
│               ├── main.py              # Entry point
│               ├── health_endpoints.py  # Health checks
│               └── database_audit.py    # HIPAA audit logging
├── docs/                                # Documentation
├── infra/                               # Infrastructure configs
│   └── k8s/                            # Kubernetes manifests
├── app.py                               # HF Spaces entry point
├── Dockerfile                           # Multi-stage Docker build
├── Dockerfile.hf-spaces                 # HF Spaces optimized
├── .huggingface.yaml                    # HF Spaces config
├── models_config.json                   # Model configuration
├── requirements.txt                     # Python dependencies
└── README.md                            # Project documentation

4. Technology Stack

4.1 Core Technologies

Category	Technology	Version	Purpose
Runtime	Python	3.10+	Primary language
Web Framework	FastAPI	Latest	REST API framework
ASGI Server	Uvicorn	Latest	Production server
AI/ML Framework	PyTorch	2.x	Deep learning
Transformers	Hugging Face Transformers	Latest	Model loading
GGUF Support	llama-cpp-python	Latest	Quantized models
OpenVINO	optimum-intel	Latest	Intel optimization
Audio Processing	Whisper	Latest	Speech-to-text

4.2 Supporting Technologies

Category	Technology	Purpose
Database	PostgreSQL 13+	Audit logs (optional)
Caching	In-memory LRU	Model caching
Document Processing	PyPDF2, python-docx	PDF/DOCX parsing
OCR	Tesseract	Image text extraction
Audio	FFmpeg	Audio processing
Containerization	Docker	Deployment
Orchestration	Kubernetes	Scaling
Monitoring	Prometheus	Metrics

4.3 Development Tools

Code Quality: Black, isort, flake8, mypy
Testing: pytest
Version Control: Git
CI/CD: GitHub Actions (under consideration)
Documentation: Markdown, Mermaid diagrams

5. Core Components

5.1 FastAPI Application (`app.py`)

Purpose: Application factory and initialization

Key Responsibilities:

Create and configure FastAPI application
Initialize agents and services
Register routes and middleware
Configure CORS and security

Key Functions:

def create_app(initialize: bool = True) -> FastAPI
def initialize_agents(app: FastAPI, preload_small_models: bool = False)
def run_dev()  # Development server

5.2 API Routes (`routes_fastapi.py`)

Purpose: RESTful API endpoints

Endpoint Categories:

Health & Monitoring

GET /health/live - Liveness probe
GET /health/ready - Readiness probe
GET /metrics - Prometheus metrics

Document Processing

POST /upload - Upload and process documents
POST /transcribe - Audio transcription
GET /get_updated_medical_data - Retrieve processed data
PUT /update_medical_data - Update medical records

AI Processing

POST /generate_patient_summary - Generate patient summaries
POST /api/generate_summary - Text summarization
POST /api/patient_summary_openvino - OpenVINO summaries
POST /extract_medical_data - Extract structured data

Model Management

POST /api/load_model - Load specific models
GET /api/model_info - Model information
POST /api/switch_model - Switch models

5.3 Agents

5.3.1 Patient Summary Agent (`patient_summary_agent.py`)

Purpose: Generate comprehensive patient summaries

Key Features:

Dynamic model configuration
Multi-section summary generation
Chronological narrative building
Clinical guideline evaluation
Fallback text-based summarization

Core Methods:

def configure_model(model_name: str, model_type: str)
def generate_clinical_summary(patient_data: Union[List[str], Dict])
def generate_patient_summary(patient_data: Union[List[str], Dict])
def build_chronological_narrative(patient_data: dict)
def format_clinical_output(raw_summary: str, patient_data: dict)

5.3.2 PHI Scrubber (`phi_scrubber.py`)

Purpose: Remove protected health information

Scrubbing Capabilities:

Patient names
Medical record numbers (MRN)
Dates of birth
Phone numbers
Email addresses
Social Security Numbers
Addresses

Compliance: HIPAA-compliant with audit logging

5.3.3 Text Extractor (`text_extractor.py`)

Purpose: Extract text from various document formats

Supported Formats:

PDF documents
DOCX files
Images (via OCR)
Plain text

5.3.4 Medical Data Extractor (`medical_data_extractor.py`)

Purpose: Extract structured medical data from text

Extraction Targets:

Diagnoses
Medications
Procedures
Lab results
Vital signs
Allergies

5.4 Services

5.4.1 Job Manager (`job_manager.py`)

Purpose: Manage long-running jobs

Features:

Job lifecycle management
Progress tracking
Status updates
Result caching
Cleanup of completed jobs

5.4.2 Request Queue (`request_queue.py`)

Purpose: Queue and prioritize requests

Features:

Request queuing
Priority handling
Concurrency control
Timeout management
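
The queuing and priority features above can be illustrated with a small heap-based queue. This is a minimal sketch, not the contents of `request_queue.py`: the class name `PriorityRequestQueue`, the "lower number is more urgent" convention, and the `max_size` parameter are assumptions made for the example.

```python
import heapq
import itertools


class PriorityRequestQueue:
    """Hypothetical bounded priority queue; lower priority value = served first."""

    def __init__(self, max_size: int = 100):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order per priority
        self._max_size = max_size

    def put(self, request, priority: int = 5) -> bool:
        """Enqueue a request; return False when the queue is full."""
        if len(self._heap) >= self._max_size:
            return False  # the caller could translate this into 503 Service Unavailable
        heapq.heappush(self._heap, (priority, next(self._counter), request))
        return True

    def get(self):
        """Return the most urgent request, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = PriorityRequestQueue()
queue.put("routine summary", priority=5)
queue.put("urgent summary", priority=1)
print(queue.get())  # the urgent request is served before the routine one
```

The counter in each heap entry avoids comparing request objects directly and preserves arrival order among requests of equal priority.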

5.4.3 Error Handler (`error_handler.py`)

Purpose: Centralized error handling

Features:

Error categorization
Contextual logging
Job error updates
Graceful degradation

5.4.4 SSE Generator (`sse_generator.py`)

Purpose: Server-Sent Events for real-time updates

Features:

Progress streaming
Status updates
Error notifications
Completion events
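
As an illustration of how such events are framed on the wire, the sketch below encodes Server-Sent Events as `event:` and `data:` lines followed by a blank line. The function names and the event names `progress` and `complete` are assumptions for the example and may differ from `sse_generator.py`.

```python
import json
from typing import Iterable, Iterator


def format_sse(event: str, data: dict) -> str:
    """Encode one SSE frame: an event name, a JSON payload, and a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def job_event_stream(job_id: str, progress_steps: Iterable[int]) -> Iterator[str]:
    """Yield progress frames followed by a final completion frame (hypothetical)."""
    for percent in progress_steps:
        yield format_sse("progress", {"job_id": job_id, "percent": percent})
    yield format_sse("complete", {"job_id": job_id})


for frame in job_event_stream("abc123", [0, 50, 100]):
    print(frame, end="")
```

In a FastAPI application a generator like this would typically be wrapped in a `StreamingResponse` with media type `text/event-stream`.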

6. AI/ML Architecture

6.1 Unified Model Manager

File: unified_model_manager.py

Purpose: Single interface for all AI model types

Architecture:

classDiagram
    class BaseModel {
        <<abstract>>
        +name: str
        +model_type: str
        +status: ModelStatus
        +load()
        +generate(prompt, config)*
        +unload()
    }
    
    class TransformersModel {
        +_model: Pipeline
        +_load_implementation()
        +generate(prompt, config)
    }
    
    class GGUFModel {
        +_model: Llama
        +filename: str
        +_extract_filename()
        +_load_implementation()
        +generate(prompt, config)
    }
    
    class OpenVINOModel {
        +_model: OVModelForCausalLM
        +_tokenizer: AutoTokenizer
        +_load_implementation()
        +generate(prompt, config)
    }
    
    class FallbackModel {
        +_load_implementation()
        +generate(prompt, config)
    }
    
    class UnifiedModelManager {
        +max_models: int
        +max_memory_mb: int
        +get_model(name, type)
        +generate_text(name, prompt)
        +cleanup()
    }
    
    BaseModel <|-- TransformersModel
    BaseModel <|-- GGUFModel
    BaseModel <|-- OpenVINOModel
    BaseModel <|-- FallbackModel
    UnifiedModelManager --> BaseModel
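
The hierarchy in the diagram can be sketched in Python with an abstract base class. This is an illustration only: the `ModelStatus` values, the `EchoFallbackModel` class, and its truncation behaviour are assumptions; only the method names `load`, `generate`, `unload`, and `_load_implementation` come from the diagram.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import Optional


class ModelStatus(Enum):
    # Status values are assumed; the diagram only names the type.
    UNLOADED = "unloaded"
    LOADED = "loaded"
    ERROR = "error"


class BaseModel(ABC):
    def __init__(self, name: str, model_type: str) -> None:
        self.name = name
        self.model_type = model_type
        self.status = ModelStatus.UNLOADED

    def load(self) -> None:
        """Template method: delegate to the backend-specific loader and track status."""
        try:
            self._load_implementation()
            self.status = ModelStatus.LOADED
        except Exception:
            self.status = ModelStatus.ERROR
            raise

    def unload(self) -> None:
        self.status = ModelStatus.UNLOADED

    @abstractmethod
    def _load_implementation(self) -> None:
        """Load backend-specific weights (Transformers, GGUF, OpenVINO, ...)."""

    @abstractmethod
    def generate(self, prompt: str, config: Optional[dict] = None) -> str:
        """Produce text for the given prompt."""


class EchoFallbackModel(BaseModel):
    """Hypothetical stand-in for FallbackModel: needs no weights."""

    def _load_implementation(self) -> None:
        pass  # nothing to load

    def generate(self, prompt: str, config: Optional[dict] = None) -> str:
        max_chars = (config or {}).get("max_chars", 200)
        return prompt[:max_chars]  # simply truncates the input
```

Because every backend exposes the same three methods, a manager can hold any mix of model types behind one interface.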

6.2 Model Types

6.2.1 Transformers Models

Backend: Hugging Face Transformers
Device: GPU (CUDA) or CPU
Use Cases: General text generation, summarization

Supported Models:

microsoft/Phi-3-mini-4k-instruct
facebook/bart-large-cnn (deprecated)
google/flan-t5-large

Configuration:

{
    "model_name": "microsoft/Phi-3-mini-4k-instruct",
    "model_type": "text-generation",
    "device_map": "auto",
    "torch_dtype": "float16"
}

6.2.2 GGUF Models

Backend: llama-cpp-python
Device: CPU or GPU (via Metal/CUDA)
Use Cases: Efficient inference with quantized models

Supported Models:

microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf (PRIMARY)

Configuration:

{
    "model_path": "path/to/model.gguf",
    "n_ctx": 8192,
    "n_threads": 4,
    "n_gpu_layers": 35  # GPU acceleration
}

6.2.3 OpenVINO Models

Backend: Intel OpenVINO
Device: CPU (Intel optimized) or GPU
Use Cases: Production deployment on Intel hardware

Supported Models:

OpenVINO/Phi-3-mini-4k-instruct-fp16-ov

Configuration:

{
    "model_path": "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov",
    "device": "GPU" if gpu_available else "CPU"  # gpu_available: set by runtime device detection
}

6.3 Model Selection Strategy

flowchart TD
    A[Request with model_name] --> B{Model specified?}
    B -->|Yes| C{Model type?}
    B -->|No| D[Use default: Phi-3 GGUF]
    
    C -->|GGUF| E[Load GGUF Model]
    C -->|OpenVINO| F[Load OpenVINO Model]
    C -->|Transformers| G[Load Transformers Model]
    C -->|Unknown| H[Auto-detect type]
    
    E --> I{Load successful?}
    F --> I
    G --> I
    H --> I
    D --> I
    
    I -->|Yes| J[Generate with model]
    I -->|No| K[Try fallback model]
    
    K --> L{Fallback successful?}
    L -->|Yes| J
    L -->|No| M[Use text-based fallback]
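
The fallback chain in the flowchart can be expressed as a loop over candidate generators that ends in a text-only fallback. The sketch below is a simplified illustration: `generate_with_fallback` and `text_based_fallback` are hypothetical names, and the "first few sentences" heuristic is an assumption rather than the project's actual fallback.

```python
from typing import Callable, Sequence

Generator = Callable[[str], str]


def text_based_fallback(prompt: str, max_sentences: int = 3) -> str:
    """Last resort: return the first few sentences of the input unchanged."""
    sentences = [s.strip() for s in prompt.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + ("." if sentences else "")


def generate_with_fallback(prompt: str, candidates: Sequence[Generator]) -> str:
    """Try each model-backed generator in order; fall back to plain text."""
    for generate in candidates:
        try:
            return generate(prompt)
        except Exception:
            continue  # a real system would log the failure before moving on
    return text_based_fallback(prompt)


def broken_model(prompt: str) -> str:
    raise RuntimeError("model failed to load")


def working_model(prompt: str) -> str:
    return "summary: " + prompt


print(generate_with_fallback("Patient stable", [broken_model, working_model]))
```

Ordering the candidates from preferred to least preferred keeps the selection policy in one place, so adding a new backend does not change the control flow.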

6.4 Model Configuration

File: models_config.json

{
  "patient_summary_models": [
    {
      "name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
      "type": "gguf",
      "is_active": true,
      "cached": true,
      "description": "Phi-3 Mini GGUF Q4 quantized - PRIMARY MODEL",
      "use_case": "Fast patient summary generation with CPU/GPU",
      "repo_id": "microsoft/Phi-3-mini-4k-instruct-gguf",
      "filename": "Phi-3-mini-4k-instruct-q4.gguf"
    }
  ],
  "runtime_behavior": {
    "allow_runtime_downloads": true,
    "cache_runtime_downloads": true,
    "fallback_to_cached": true
  }
}
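
A configuration in this shape can be read with the standard `json` module. The helper below is a hedged sketch: `select_active_model` is a hypothetical function, and the inline example is a shortened copy of the structure above with one extra inactive entry added for illustration.

```python
import json


def select_active_model(config_text: str) -> dict:
    """Return the first patient-summary model whose is_active flag is true."""
    config = json.loads(config_text)
    for entry in config.get("patient_summary_models", []):
        if entry.get("is_active"):
            return entry
    raise LookupError("no active patient summary model configured")


example = """
{
  "patient_summary_models": [
    {"name": "inactive-model", "type": "transformers", "is_active": false},
    {"name": "Phi-3-mini-4k-instruct-q4.gguf", "type": "gguf", "is_active": true,
     "repo_id": "microsoft/Phi-3-mini-4k-instruct-gguf",
     "filename": "Phi-3-mini-4k-instruct-q4.gguf"}
  ]
}
"""

model = select_active_model(example)
print(model["type"], model["repo_id"], model["filename"])
```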

6.5 Token Management

Token Limit Handling:

Automatic token counting (heuristic: ~4 chars/token)
Pre-generation validation
Token limit error detection
Graceful degradation

Token Limits by Model:

Phi-3 models: 4096 tokens (context window)
BART models: 1024 tokens
T5 models: 512 tokens
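
The character-based heuristic and the pre-generation check can be sketched as follows. The limits are copied from the list above; the dictionary keys, the function names, and the `reserved_output` parameter are assumptions for illustration. A real tokenizer will give different counts, so the estimate is only a cheap first filter.

```python
import math

# Context limits from the list above; the keys are illustrative labels.
TOKEN_LIMITS = {"phi-3": 4096, "bart": 1024, "t5": 512}


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4 characters per token heuristic."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(text: str, model_family: str, reserved_output: int = 0) -> bool:
    """Check the estimated input plus reserved output tokens against the limit."""
    limit = TOKEN_LIMITS[model_family]
    return estimate_tokens(text) + reserved_output <= limit


note = "Patient presents with chest pain. " * 200
print(estimate_tokens(note), fits_context(note, "t5"), fits_context(note, "phi-3"))
```

Reserving room for the output matters because the context window covers prompt and generated tokens together.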

6.6 Generation Configuration

from dataclasses import dataclass

@dataclass
class GenerationConfig:
    max_tokens: int = 8192      # Maximum output tokens
    min_tokens: int = 50        # Minimum output tokens
    temperature: float = 0.3    # Low temperature for consistent (not strictly deterministic) medical output
    top_p: float = 0.9          # Nucleus sampling threshold
    timeout: float = 180.0      # Generation timeout in seconds, tuned for T4
    stream: bool = False        # Enable streaming responses

6.7 T4 GPU Optimizations

Hardware Target: NVIDIA T4 Medium (16GB GPU, 16GB RAM)

Optimizations:

Memory Management:
- Max 2 models in memory
- Automatic model unloading
- GPU memory clearing
- Garbage collection
Model Loading:
- Lazy loading (on-demand)
- Intelligent caching
- LRU eviction policy
Inference:
- FP16 precision
- Batch size: 1
- Context window: 8192 tokens
- GPU layer offloading (GGUF)
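
The combination of lazy loading, a two-model cap, and LRU eviction can be illustrated with an `OrderedDict`. This is a minimal sketch under stated assumptions: `LRUModelCache` and its `loader` callback are hypothetical, and the real manager would also unload the evicted model and release GPU memory at the point marked in the comment.

```python
from collections import OrderedDict
from typing import Callable


class LRUModelCache:
    """Hypothetical model cache: loads on demand, evicts the least recently used."""

    def __init__(self, loader: Callable[[str], object], max_models: int = 2):
        self._loader = loader
        self._max_models = max_models
        self._models = OrderedDict()  # insertion order doubles as recency order
        self.evicted = []

    def get(self, name: str) -> object:
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return self._models[name]
        model = self._loader(name)  # lazy load on first request
        self._models[name] = model
        while len(self._models) > self._max_models:
            old_name, _old_model = self._models.popitem(last=False)
            # A real implementation would unload the model and clear GPU memory here.
            self.evicted.append(old_name)
        return model

    def loaded(self) -> list:
        return list(self._models)


cache = LRUModelCache(lambda name: f"model:{name}", max_models=2)
for requested in ["gguf", "openvino", "gguf", "transformers"]:
    cache.get(requested)
print(cache.loaded(), cache.evicted)
```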

7. API Architecture

7.1 RESTful Design

Principles:

Resource-oriented URLs
HTTP methods for CRUD operations
JSON request/response format
Stateless communication
Proper HTTP status codes

7.2 Request/Response Flow

sequenceDiagram
    participant C as Client
    participant A as API Gateway
    participant S as Service Layer
    participant M as Model Manager
    participant D as Database
    
    C->>A: POST /generate_patient_summary
    A->>A: Validate request
    A->>S: Create job
    S->>D: Log job creation
    A-->>C: 202 Accepted (job_id)
    
    S->>M: Load model
    M->>M: Check cache
    M->>M: Load if needed
    M-->>S: Model ready
    
    S->>M: Generate summary
    M->>M: Process prompt
    M-->>S: Generated text
    
    S->>D: Log completion
    S->>A: Update job status
    A-->>C: SSE: Progress updates
    
    C->>A: GET /job/{job_id}
    A->>S: Get job status
    S->>D: Retrieve job
    S-->>A: Job result
    A-->>C: 200 OK (result)

7.3 Authentication & Authorization

Current State: Basic API key authentication (optional)

Planned Enhancements:

JWT-based authentication
Role-based access control (RBAC)
OAuth2 integration
API rate limiting

7.4 Error Handling

Error Response Format:

{
  "error": {
    "code": "MODEL_LOAD_FAILED",
    "message": "Failed to load model: microsoft/Phi-3-mini-4k-instruct",
    "details": {
      "model_name": "microsoft/Phi-3-mini-4k-instruct",
      "error_type": "initialization_error",
      "timestamp": "2025-12-05T17:23:52Z"
    }
  }
}

HTTP Status Codes:

200 OK - Successful request
202 Accepted - Job created
400 Bad Request - Invalid input
404 Not Found - Resource not found
500 Internal Server Error - Server error
503 Service Unavailable - Service degraded

7.5 Rate Limiting

Strategy: Token bucket algorithm

Limits:

100 requests/minute per IP
1000 requests/hour per API key
Burst allowance: 20 requests
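
A token bucket with the per-IP numbers above (100 requests per minute, burst of 20) could look like the sketch below. The `TokenBucket` class and its injectable `clock` parameter are assumptions for the example; a deployed limiter would keep one bucket per IP address or API key.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: refills continuously, capped at capacity."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so a burst is allowed immediately
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never beyond the burst capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


per_ip = TokenBucket(rate_per_sec=100 / 60, capacity=20)
accepted = sum(per_ip.allow() for _ in range(25))
print(accepted)  # roughly the burst size when requests arrive back to back
```

The capacity sets the burst allowance, while the refill rate enforces the long-run average.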

8. Data Flow & Processing

8.1 Document Processing Pipeline

flowchart LR
    A[Upload Document] --> B{File Type?}
    B -->|PDF| C[PDF Parser]
    B -->|DOCX| D[DOCX Parser]
    B -->|Image| E[OCR Engine]
    B -->|Audio| F[Whisper Transcription]
    
    C --> G[Text Extraction]
    D --> G
    E --> G
    F --> G
    
    G --> H[PHI Scrubbing]
    H --> I[Medical Data Extraction]
    I --> J[Store Processed Data]
    J --> K[Return Results]
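
The file-type branch at the start of this pipeline amounts to a lookup from file extension to extractor. The sketch below is illustrative: the extension list and the extractor labels are assumptions, not the set supported by `text_extractor.py`.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the pipeline stage that handles it.
EXTRACTOR_BY_SUFFIX = {
    ".pdf": "pdf_parser",
    ".docx": "docx_parser",
    ".png": "ocr_engine",
    ".jpg": "ocr_engine",
    ".jpeg": "ocr_engine",
    ".wav": "whisper_transcription",
    ".mp3": "whisper_transcription",
    ".txt": "plain_text",
}


def route_document(filename: str) -> str:
    """Pick an extractor from the file suffix, case-insensitively."""
    suffix = Path(filename).suffix.lower()
    try:
        return EXTRACTOR_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}") from None


print(route_document("Discharge_Summary.PDF"))
```

Rejecting unknown types up front lets the API answer with 400 Bad Request before any processing starts.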

8.2 Patient Summary Generation Flow

flowchart TD
    A[Patient Data Input] --> B[Parse EHR Data]
    B --> C[Convert to Plain Text]
    C --> D{Data Size Check}
    
    D -->|Small| E[Single-pass Generation]
    D -->|Large| F[Chunking Strategy]
    
    F --> G[Chunk by Date/Size]
    G --> H[Process Chunks in Parallel]
    H --> I[Combine Chunk Summaries]
    
    E --> J[Generate with Model]
    I --> J
    
    J --> K[Format Clinical Output]
    K --> L[Evaluate Against Guidelines]
    L --> M[Return Summary]
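
The large-input branch of this flow (chunk, summarize chunks in parallel, combine) can be sketched with a thread pool. This is a simplified illustration: it chunks by size only, `summarize` stands in for a model call, and the function names and the default `max_chars` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def chunk_records(records: List[str], max_chars: int) -> List[List[str]]:
    """Group consecutive records so each chunk stays within max_chars where possible."""
    chunks, current, size = [], [], 0
    for record in records:
        if current and size + len(record) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(record)  # an oversized single record still gets its own chunk
        size += len(record)
    if current:
        chunks.append(current)
    return chunks


def summarize_in_chunks(records: List[str], summarize: Callable[[str], str],
                        max_chars: int = 2000) -> str:
    chunks = chunk_records(records, max_chars)
    if len(chunks) <= 1:
        return summarize("\n".join(records))  # small input: single-pass generation
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map preserves chunk order, which keeps the narrative chronological.
        partials = list(pool.map(lambda chunk: summarize("\n".join(chunk)), chunks))
    return summarize("\n".join(partials))  # combine the chunk summaries


visits = ["2024-01-05 visit note " * 10, "2024-03-12 visit note " * 10]
print(summarize_in_chunks(visits, lambda text: text[:40], max_chars=250))
```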

8.3 Data Transformation

Input Formats:

Raw EHR JSON
HL7 FHIR resources
Plain text documents
Scanned images
Audio recordings

Output Formats:

Structured JSON
Clinical summary (Markdown)
FHIR-compliant resources
Audit logs

8.4 Caching Strategy

Multi-Level Caching:

Model Cache: Loaded models in memory
Result Cache: Generated summaries (LRU)
File Cache: Processed documents
Hugging Face Cache: Downloaded models

Cache Invalidation:

Time-based expiration
Manual invalidation
Memory pressure-based eviction
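
Time-based expiration and manual invalidation can be combined with size-based eviction in one small structure. The sketch below is hypothetical (`TTLCache` and its parameters are not taken from the codebase) and uses entry count as a stand-in for memory pressure.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Hypothetical result cache with a time-to-live and an LRU size cap."""

    def __init__(self, max_items: int = 100, ttl_seconds: float = 3600.0,
                 clock=time.monotonic):
        self._max_items = max_items
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]  # time-based expiration
            return default
        self._data.move_to_end(key)  # refresh recency on a hit
        return value

    def set(self, key, value) -> None:
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max_items:
            self._data.popitem(last=False)  # evict the least recently used entry

    def invalidate(self, key) -> None:
        self._data.pop(key, None)  # manual invalidation


summaries = TTLCache(max_items=100, ttl_seconds=600)
summaries.set("patient456", "cached summary text")
print(summaries.get("patient456"))
```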

9. Database Design

9.1 Database Schema

Primary Database: PostgreSQL (optional, for audit logs)

Audit Logs Table

CREATE TABLE audit_logs (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMP NOT NULL DEFAULT NOW(),
    user_id VARCHAR(255),
    action VARCHAR(100) NOT NULL,
    resource_type VARCHAR(100),
    resource_id VARCHAR(255),
    phi_accessed BOOLEAN DEFAULT FALSE,
    ip_address INET,
    user_agent TEXT,
    request_data JSONB,
    response_status INTEGER,
    error_message TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_audit_timestamp ON audit_logs(timestamp);
CREATE INDEX idx_audit_user ON audit_logs(user_id);
CREATE INDEX idx_audit_action ON audit_logs(action);
CREATE INDEX idx_audit_phi ON audit_logs(phi_accessed);

9.2 Data Models

Patient Data Model (In-memory):

{
    "patient_id": "string",
    "demographics": {
        "name": "string",
        "dob": "date",
        "gender": "string",
        "mrn": "string"
    },
    "visits": [
        {
            "visit_id": "string",
            "date": "datetime",
            "chief_complaint": "string",
            "diagnoses": ["string"],
            "medications": ["string"],
            "procedures": ["string"],
            "vitals": {},
            "labs": []
        }
    ]
}

9.3 File Storage

Storage Strategy: Local filesystem or cloud storage

Directory Structure:

/data/
├── uploads/           # Uploaded documents
├── processed/         # Processed documents
├── cache/            # Temporary cache
└── models/           # Model files

10. Security Architecture

10.1 HIPAA Compliance

Requirements Met:

Access Controls: Authentication and authorization
Audit Logging: Comprehensive activity logs
Data Encryption: In-transit and at-rest
PHI Scrubbing: Automated anonymization
Secure Communication: HTTPS/TLS

10.2 PHI Scrubbing

Scrubbing Patterns:

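PATTERNS = {
    # Matches any two capitalized words, so expect false positives (e.g. "Chest Pain")
    "name": r'\b[A-Z][a-z]+ [A-Z][a-z]+\b',
    "mrn": r'\bMRN[:\s]*\d{6,10}\b',
    # Matches every slash-separated date, not only dates of birth
    "dob": r'\b\d{1,2}/\d{1,2}/\d{2,4}\b',
    "phone": r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    "email": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    "ssn": r'\b\d{3}-\d{2}-\d{4}\b'
}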
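
Applying such patterns comes down to substituting a placeholder for each match and counting replacements for the audit trail. The sketch below is an assumption-laden illustration, not the code in `phi_scrubber.py`: the patterns are adapted from the table above (the broad `name` pattern is omitted, the email top-level domain class is written `[A-Za-z]`, and `mrn` runs before `phone` so a ten-digit MRN is not labeled as a phone number), and the `[LABEL]` placeholder format is hypothetical.

```python
import re

# Order matters: more specific patterns run before more general ones.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "dob": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def scrub_phi(text: str):
    """Replace each match with a [LABEL] placeholder; return text and match counts."""
    counts = {}
    for label, pattern in PHI_PATTERNS.items():
        text, n = pattern.subn(f"[{label.upper()}]", text)
        if n:
            counts[label] = n
    return text, counts


sample = ("MRN: 1234567 DOB 01/02/1980, call 555-123-4567 or "
          "jane.doe@example.com, SSN 123-45-6789.")
scrubbed, found = scrub_phi(sample)
print(scrubbed)
print(found)
```

Regular expressions catch only well-formatted identifiers, so names and free-text addresses generally need an additional approach such as named-entity recognition.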

10.3 Container Security

Security Measures:

Non-root user execution
Read-only root filesystem
Resource limits (CPU, memory)
Network policies
Secrets management
Minimal base images

10.4 API Security

Security Headers:

{
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "X-XSS-Protection": "1; mode=block",
    "Strict-Transport-Security": "max-age=31536000"
}

11. Deployment Architecture

11.1 Deployment Options

11.1.1 Hugging Face Spaces

Configuration: .huggingface.yaml

runtime: docker
sdk: docker
python_version: "3.10"

build:
  dockerfile: Dockerfile.hf-spaces
  cache: true

hardware:
  gpu: t4-medium  # 16GB GPU RAM, 16GB System RAM

env:
  - SPACE_ID=$SPACE_ID
  - HF_HOME=/app/.cache/huggingface
  - TORCH_HOME=/app/.cache/torch
  - MODEL_CACHE_DIR=/app/models
  - PRELOAD_GGUF=true
  - HF_SPACES=true

Optimizations:

Pre-cached models in Docker image
Lazy model loading
Memory-efficient inference
Automatic GPU detection

11.1.2 Kubernetes

Deployment Manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hntai-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hntai
  template:
    metadata:
      labels:
        app: hntai
    spec:
      containers:
      - name: hntai
        image: hntai:latest
        ports:
        - containerPort: 7860
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
        livenessProbe:
          httpGet:
            path: /health/live
            port: 7860
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 7860
          initialDelaySeconds: 10
          periodSeconds: 5

11.1.3 Docker

Multi-Stage Dockerfile:

# Stage 1: Builder
FROM python:3.10-slim AS builder
RUN apt-get update && apt-get install -y build-essential
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt

# Stage 2: Runtime
FROM python:3.10-slim AS runtime
COPY --from=builder /install /usr/local
WORKDIR /app
COPY . .
ENV PYTHONUNBUFFERED=1
EXPOSE 7860
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

11.2 Scaling Strategy

Horizontal Scaling:

Multiple replicas behind load balancer
Stateless design for easy scaling
Shared model cache (optional)

Vertical Scaling:

Increase CPU/memory per instance
GPU acceleration for inference
Larger model support

11.3 High Availability

Components:

Load Balancer: Distribute traffic
Health Checks: Automatic failover
Auto-scaling: Based on CPU/memory
Graceful Shutdown: Drain connections

12. Performance Optimization

12.1 Model Optimization

Techniques:

Quantization: GGUF Q4 models (4-bit)
Precision: FP16 for GPU inference
Batching: Batch size optimization
Caching: Model and result caching
Lazy Loading: On-demand model loading

12.2 Memory Management

Strategies:

Automatic garbage collection
GPU memory clearing
Model unloading (LRU)
Memory pressure monitoring

Memory Limits:

T4 Medium: 16GB GPU, 16GB RAM
Max 2 models in memory
Automatic eviction at 80% usage

12.3 Inference Optimization

T4-Specific Optimizations:

{
    "max_models": 2,
    "max_memory_mb": 14000,
    "n_ctx": 8192,
    "n_threads": 4,
    "n_gpu_layers": 35,
    "torch_dtype": "float16",
    "device_map": "auto"
}

12.4 Caching Strategy

Cache Hierarchy:

L1 - Model Cache: In-memory loaded models
L2 - Result Cache: Generated summaries (LRU, 100 items)
L3 - File Cache: Processed documents (disk)
L4 - HF Cache: Downloaded models (disk)

12.5 Performance Metrics

Target Metrics:

Model load time: < 10 seconds
Summary generation: < 60 seconds (small), < 180 seconds (large)
API response time: < 100ms (excluding generation)
Memory usage: < 80% of available
GPU utilization: > 70% during inference

13. Monitoring & Observability

13.1 Health Checks

Liveness Probe (/health/live):

{
    "status": "alive",
    "timestamp": "2025-12-05T17:23:52Z"
}

Readiness Probe (/health/ready):

{
    "status": "ready",
    "checks": {
        "database": "ok",
        "model_manager": "ok",
        "file_storage": "ok"
    },
    "timestamp": "2025-12-05T17:23:52Z"
}
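
A readiness response in this shape can be assembled by running each dependency check and aggregating the results. The following is a sketch: `readiness_report`, the `"failed"` and `"not_ready"` strings, and the use of 503 for a failing probe are assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Tuple


def readiness_report(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run each check; return an HTTP status code and a JSON-serializable body."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception:
            results[name] = "failed"  # a crashing check counts as not ready
    ready = all(value == "ok" for value in results.values())
    body = {
        "status": "ready" if ready else "not_ready",
        "checks": results,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return (200 if ready else 503), body


status_code, report = readiness_report({
    "database": lambda: True,
    "model_manager": lambda: True,
    "file_storage": lambda: True,
})
print(status_code, report["status"])
```

Returning a non-2xx status when any check fails is what lets a Kubernetes readiness probe take the pod out of rotation.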

13.2 Metrics

Prometheus Metrics (/metrics):

# Model metrics
model_load_time_seconds{model_name="phi-3-gguf"} 8.5
model_inference_time_seconds{model_name="phi-3-gguf"} 45.2
model_memory_usage_bytes{model_name="phi-3-gguf"} 4294967296

# API metrics
http_requests_total{method="POST",endpoint="/generate_patient_summary"} 1234
http_request_duration_seconds{method="POST",endpoint="/generate_patient_summary"} 52.3

# System metrics
memory_usage_percent 65.2
gpu_memory_usage_percent 72.1
cpu_usage_percent 45.8

13.3 Logging

Structured Logging:

{
    "timestamp": "2025-12-05T17:23:52Z",
    "level": "INFO",
    "logger": "ai_med_extract.agents.patient_summary_agent",
    "message": "Generated patient summary",
    "context": {
        "job_id": "abc123",
        "model_name": "phi-3-gguf",
        "duration_seconds": 45.2,
        "token_count": 2048
    }
}
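
Log lines in this format can be produced with the standard `logging` module and a custom formatter. The sketch below is one possible approach, not necessarily the project's: `JsonFormatter` is a hypothetical class, and passing the context through `extra={"context": ...}` is an assumed convention. The context should never carry PHI.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Values passed via `extra` become attributes on the record.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("ai_med_extract.example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Generated patient summary",
    extra={"context": {"job_id": "abc123", "duration_seconds": 45.2}},
)
```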

Log Levels:

DEBUG: Detailed diagnostic information
INFO: General informational messages
WARNING: Warning messages
ERROR: Error messages
CRITICAL: Critical failures

13.4 Audit Logging

HIPAA Audit Trail:

{
    "timestamp": "2025-12-05T17:23:52Z",
    "user_id": "user123",
    "action": "PHI_ACCESS",
    "resource_type": "patient_summary",
    "resource_id": "patient456",
    "phi_accessed": true,
    "ip_address": "192.168.1.100",
    "user_agent": "Mozilla/5.0...",
    "request_data": {...},
    "response_status": 200
}

14. Development Workflow

14.1 Local Development

Setup:

# Clone repository
git clone <repository-url>
cd HNTAI

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export DATABASE_URL="postgresql://user:pass@localhost:5432/hntai"
export SECRET_KEY="your-secret-key"
export HF_HOME="/tmp/huggingface"

# Run development server
cd services/ai-service/src
python -m ai_med_extract.app run_dev

14.2 Testing

Test Structure:

tests/
├── unit/
│   ├── test_agents.py
│   ├── test_model_manager.py
│   └── test_utils.py
├── integration/
│   ├── test_api.py
│   └── test_workflows.py
└── conftest.py

Running Tests:

# Unit tests
python -m pytest tests/unit/

# Integration tests
python -m pytest tests/integration/

# Coverage report
python -m pytest --cov=ai_med_extract tests/

14.3 Code Quality

Tools:

# Format code
black .
isort .

# Lint code
flake8 .

# Type checking
mypy services/ai-service/src/ai_med_extract/

14.4 Git Workflow

Branching Strategy:

main: Production-ready code
develop: Integration branch
feature/*: Feature branches
bugfix/*: Bug fix branches
hotfix/*: Production hotfixes

Commit Convention:

<type>(<scope>): <subject>

<body>

<footer>

Types: feat, fix, docs, style, refactor, test, chore

15. Integration Patterns

15.1 External System Integration

Integration Points:

EHR Systems: HL7, FHIR APIs
Document Management: File uploads, cloud storage
Authentication: OAuth2, SAML
Monitoring: Prometheus, Grafana
Logging: ELK Stack, CloudWatch

15.2 API Integration

Client Libraries (Planned):

Python SDK
JavaScript SDK
REST API documentation (OpenAPI/Swagger)

Example Integration:

import time

import requests

BASE_URL = "https://api.hntai.com"
HEADERS = {"Authorization": "Bearer <token>"}

# Upload document (the context manager closes the file after the request)
with open("document.pdf", "rb") as f:
    response = requests.post(f"{BASE_URL}/upload", files={"file": f}, headers=HEADERS)
response.raise_for_status()

# Generate patient summary
response = requests.post(
    f"{BASE_URL}/generate_patient_summary",
    json={
        "patient_data": {...},  # placeholder: replace with the patient record payload
        "model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
        "model_type": "gguf"
    },
    headers=HEADERS
)
response.raise_for_status()

job_id = response.json()["job_id"]

# Poll for results, giving up after about five minutes instead of looping forever
for _ in range(60):
    response = requests.get(f"{BASE_URL}/job/{job_id}", headers=HEADERS)
    response.raise_for_status()
    if response.json()["status"] == "completed":
        break
    time.sleep(5)
else:
    raise TimeoutError(f"job {job_id} did not complete in time")

15.3 Webhook Support

Planned Feature: Webhook notifications for job completion

{
    "event": "job.completed",
    "job_id": "abc123",
    "timestamp": "2025-12-05T17:23:52Z",
    "data": {
        "status": "completed",
        "result": {...}
    }
}

16. Scalability Considerations

16.1 Horizontal Scaling

Strategies:

Stateless Design: No session state in application
Load Balancing: Distribute requests across instances
Shared Cache: Redis for distributed caching
Message Queue: RabbitMQ/Kafka for async processing

16.2 Vertical Scaling

Resource Scaling:

CPU: 2-8 cores per instance
Memory: 8-32 GB per instance
GPU: T4, V100, A100 for inference

16.3 Database Scaling

Strategies:

Read Replicas: For audit log queries
Partitioning: Time-based partitioning for logs
Indexing: Optimize query performance
Archiving: Move old logs to cold storage

16.4 Model Serving

Scaling Options:

Model Replication: Same model on multiple instances
Model Sharding: Different models on different instances
Model Versioning: A/B testing with multiple versions
Dedicated Inference: Separate inference service

17. Future Roadmap

17.1 Short-Term (3-6 months)

Enhanced Model Support:
- Support for Llama 3, Mistral models
- Fine-tuned medical models
- Multi-modal models (text + images)
Improved Performance:
- Model quantization (INT8, INT4)
- Batch inference support
- Streaming responses
Additional Features:
- Real-time collaboration
- Version control for summaries
- Template-based summaries

17.2 Medium-Term (6-12 months)

Advanced AI Capabilities:
- Multi-agent orchestration
- Retrieval-Augmented Generation (RAG)
- Knowledge graph integration
Enterprise Features:
- Multi-tenancy support
- Advanced RBAC
- SSO integration
- Compliance reporting
Platform Enhancements:
- Web UI for management
- Mobile app support
- Plugin architecture

17.3 Long-Term (12+ months)

AI/ML Advancements:
- Custom model training pipeline
- Federated learning support
- Explainable AI (XAI)
Ecosystem Integration:
- FHIR server integration
- HL7 v3 support
- DICOM image analysis
Global Expansion:
- Multi-language support
- Regional compliance (GDPR, etc.)
- Edge deployment

Appendix A: Configuration Reference

Environment Variables

Variable	Description	Default	Required
`DATABASE_URL`	PostgreSQL connection string	-	No
`SECRET_KEY`	Application secret key	-	Yes
`JWT_SECRET_KEY`	JWT signing key	-	Yes
`HF_HOME`	Hugging Face cache directory	`/tmp/huggingface`	No
`TORCH_HOME`	PyTorch cache directory	`/tmp/torch`	No
`WHISPER_CACHE`	Whisper model cache	`/tmp/whisper`	No
`HF_SPACES`	Hugging Face Spaces mode	`false`	No
`PRELOAD_GGUF`	Preload GGUF models	`false`	No
`MAX_NEW_TOKENS`	Max output tokens	`8192`	No
`MAX_INPUT_TOKENS`	Max input tokens	`2048`	No

Appendix B: API Reference

Complete Endpoint List

| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/` | Root endpoint |
| `GET` | `/health/live` | Liveness probe |
| `GET` | `/health/ready` | Readiness probe |
| `GET` | `/metrics` | Prometheus metrics |
| `POST` | `/upload` | Upload document |
| `POST` | `/transcribe` | Transcribe audio |
| `POST` | `/generate_patient_summary` | Generate patient summary |
| `POST` | `/api/generate_summary` | Generate text summary |
| `POST` | `/api/patient_summary_openvino` | OpenVINO summary |
| `POST` | `/extract_medical_data` | Extract medical data |
| `GET` | `/get_updated_medical_data` | Get processed data |
| `PUT` | `/update_medical_data` | Update medical data |
| `POST` | `/api/load_model` | Load model |
| `GET` | `/api/model_info` | Get model info |
| `POST` | `/api/switch_model` | Switch model |
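A quick smoke test of a deployment could poll a few of the read-only `GET` endpoints listed above. The sketch below uses only the Python standard library. The `probe` and `http_status` helpers and the base URL in the usage line are assumptions for illustration, and it presumes these endpoints can be reached without authentication, which this reference does not state.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Read-only endpoints taken from the endpoint list above.
READ_ONLY_PATHS = ("/health/live", "/health/ready", "/api/model_info")


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses raise HTTPError but still carry a status code.
        return exc.code


def probe(base_url: str, fetch: Callable[[str], int] = http_status) -> Dict[str, int]:
    """Map each read-only path to the status code returned by `fetch`."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) for path in READ_ONLY_PATHS}
```

For example, `probe("http://localhost:8000")` would query a locally running instance (hypothetical address). Connection failures surface as `urllib.error.URLError`, which the sketch deliberately does not swallow. The `fetch` parameter lets a stub replace the network call in tests.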

Appendix C: Troubleshooting Guide

Common Issues

Model Loading Failures

Symptom: Model fails to load
Causes:

Insufficient memory
Missing dependencies
Network issues (download)

Solutions:

Check memory availability
Verify that all required dependencies are installed
Check network connectivity
Use fallback model
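The "Use fallback model" step can be sketched as an ordered list of candidate loaders that are tried until one succeeds. The `load_with_fallback` function is a hypothetical helper, not the platform's model manager. The caught exception types are chosen to mirror the causes listed above (memory, dependencies, network), and real backends may raise other types, for example GPU out-of-memory errors, so the tuple would need adjusting.

```python
import logging
from typing import Callable, List, Sequence, Tuple, TypeVar

logger = logging.getLogger(__name__)
Model = TypeVar("Model")


def load_with_fallback(
    candidates: Sequence[Tuple[str, Callable[[], Model]]],
) -> Tuple[str, Model]:
    """Try each (name, loader) pair in order and return the first that loads."""
    errors: List[str] = []
    for name, loader in candidates:
        try:
            return name, loader()
        except (MemoryError, ImportError, OSError) as exc:
            # MemoryError: insufficient memory; ImportError: missing dependency;
            # OSError: file or network problems (ConnectionError is a subclass).
            logger.warning("Loading %s failed: %s", name, exc)
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All model candidates failed: " + "; ".join(errors))
```

Listing the smallest or most portable model last gives the service a degraded but working mode, and the collected error messages keep the original failure visible in logs.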

Token Limit Errors

Symptom: "Input exceeds token limit"
Causes:

Input too long
Model context window exceeded

Solutions:

Reduce input size
Use chunking strategy
Switch to larger context model
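One way to realize the chunking strategy is to split the tokenized input into windows that fit the input limit, with a small overlap so that context spanning a boundary is not lost. The sketch below is an assumption-laden illustration: `chunk_tokens` is a hypothetical helper, and it operates on an already tokenized list, so the model's own tokenizer should produce the tokens in practice.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_tokens: int, overlap: int = 0) -> List[List[str]]:
    """Split `tokens` into windows of at most `max_tokens`, sharing `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Whitespace splitting is only a stand-in for a real tokenizer.
words = "patient presents with chest pain and shortness of breath since yesterday".split()
windows = chunk_tokens(words, max_tokens=4, overlap=1)
```

Each chunk can then be summarized separately and the partial summaries combined. With the default `MAX_INPUT_TOKENS` of `2048` from Appendix A, `max_tokens` would have to leave room for the prompt template as well.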

Performance Issues

Symptom: Slow inference
Causes:

CPU-only inference
Large model size
Memory pressure

Solutions:

Enable GPU acceleration
Use quantized models (GGUF)
Reduce batch size
Clear model cache

Appendix D: Glossary

| Term | Definition |
|------|------------|
| PHI | Protected Health Information |
| HIPAA | Health Insurance Portability and Accountability Act |
| EHR | Electronic Health Record |
| FHIR | Fast Healthcare Interoperability Resources |
| HL7 | Health Level Seven (family of healthcare data exchange standards) |
| GGUF | GPT-Generated Unified Format (model file format commonly used for quantized models) |
| OpenVINO | Open Visual Inference and Neural Network Optimization |
| T4 | NVIDIA Tesla T4 GPU |
| LRU | Least Recently Used (cache eviction) |
| SSE | Server-Sent Events |
| ASGI | Asynchronous Server Gateway Interface |

Document Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-12-05 | System | Initial comprehensive documentation |

End of Technical Architecture Documentation
