# HNTAI - Comprehensive Technical Architecture Documentation

**Version:** 1.0  
**Last Updated:** December 5, 2025  
**Project:** Medical Data Extraction & AI Processing Platform

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [System Overview](#2-system-overview)
3. [Architecture Design](#3-architecture-design)
4. [Technology Stack](#4-technology-stack)
5. [Core Components](#5-core-components)
6. [AI/ML Architecture](#6-aiml-architecture)
7. [API Architecture](#7-api-architecture)
8. [Data Flow & Processing](#8-data-flow--processing)
9. [Database Design](#9-database-design)
10. [Security Architecture](#10-security-architecture)
11. [Deployment Architecture](#11-deployment-architecture)
12. [Performance Optimization](#12-performance-optimization)
13. [Monitoring & Observability](#13-monitoring--observability)
14. [Development Workflow](#14-development-workflow)
15. [Integration Patterns](#15-integration-patterns)
16. [Scalability Considerations](#16-scalability-considerations)
17. [Future Roadmap](#17-future-roadmap)

---

## 1. Executive Summary

HNTAI (Healthcare AI Text Analysis & Interpretation) is a production-ready medical AI platform for extracting, processing, and analyzing medical data. It provides HIPAA-compliant document processing, PHI scrubbing, and AI-powered patient summary generation, and supports multiple AI model backends.

### Key Capabilities

- **Multi-format Document Processing**: PDF, DOCX, images, and audio transcription
- **HIPAA Compliance**: Automated PHI scrubbing with comprehensive audit logging
- **Multi-Model AI Support**: Transformers, OpenVINO, and GGUF models with automatic optimization
- **Scalable Architecture**: Kubernetes-ready with horizontal scaling capabilities
- **Production-Ready**: Health checks, metrics, structured logging, and error handling

### Target Deployment Environments

- **Hugging Face Spaces** (T4 Medium GPU)
- **Kubernetes Clusters** (On-premise or cloud)
- **Docker Containers** (Standalone or orchestrated)
- **Local Development** (CPU or GPU)

---

## 2. System Overview

### 2.1 Purpose & Scope

HNTAI serves as a comprehensive medical AI platform that bridges the gap between raw medical documents and actionable clinical insights. The system is designed to:

1. **Extract** structured medical data from unstructured documents
2. **Anonymize** protected health information (PHI) for compliance
3. **Summarize** patient records into comprehensive clinical assessments
4. **Process** multi-modal medical data (text, images, audio)

### 2.2 Design Principles

- **Simplicity**: Clean, maintainable codebase with essential features
- **Flexibility**: Support for multiple AI model types and backends
- **Security**: HIPAA-compliant with comprehensive audit trails
- **Performance**: Optimized for T4 GPU with intelligent caching
- **Reliability**: Robust error handling and automatic fallback mechanisms

### 2.3 High-Level Architecture

```mermaid
graph TB
    subgraph "Client Layer"
        A[Web Client]
        B[Mobile Client]
        C[API Client]
    end
    
    subgraph "API Gateway"
        D[FastAPI Application]
        E[Health Endpoints]
        F[Metrics Endpoint]
    end
    
    subgraph "Service Layer"
        G[Document Processing Service]
        H[PHI Scrubbing Service]
        I[Patient Summary Service]
        J[Model Management Service]
    end
    
    subgraph "AI/ML Layer"
        K[Unified Model Manager]
        L[Transformers Models]
        M[GGUF Models]
        N[OpenVINO Models]
        O[Whisper Audio Models]
    end
    
    subgraph "Data Layer"
        P[PostgreSQL - Audit Logs]
        Q[File Storage]
        R[Model Cache]
    end
    
    A --> D
    B --> D
    C --> D
    D --> E
    D --> F
    D --> G
    D --> H
    D --> I
    D --> J
    G --> K
    H --> K
    I --> K
    J --> K
    K --> L
    K --> M
    K --> N
    K --> O
    D --> P
    G --> Q
    K --> R
```

---

## 3. Architecture Design

### 3.1 Architectural Style

HNTAI follows a **Layered Monolithic Architecture** with clear separation of concerns:

1. **Presentation Layer**: FastAPI routes and endpoints
2. **Service Layer**: Business logic and orchestration
3. **Agent Layer**: Specialized AI agents for specific tasks
4. **Utility Layer**: Shared utilities and helpers
5. **Data Layer**: Database and file storage

### 3.2 Component Architecture

```mermaid
graph LR
    subgraph "FastAPI Application"
        A[routes_fastapi.py]
        B[app.py]
        C[main.py]
    end
    
    subgraph "Agents"
        D[patient_summary_agent.py]
        E[phi_scrubber.py]
        F[text_extractor.py]
        G[medical_data_extractor.py]
    end
    
    subgraph "Services"
        H[job_manager.py]
        I[request_queue.py]
        J[error_handler.py]
        K[sse_generator.py]
    end
    
    subgraph "Utils"
        L[unified_model_manager.py]
        M[model_config.py]
        N[robust_json_parser.py]
        O[memory_manager.py]
    end
    
    A --> D
    A --> E
    A --> F
    A --> G
    A --> H
    A --> I
    D --> L
    E --> L
    F --> L
    G --> L
    L --> M
    L --> O
```

### 3.3 Directory Structure

```
HNTAI/
├── services/
│   └── ai-service/
│       └── src/
│           └── ai_med_extract/
│               ├── agents/              # AI agents for specific tasks
│               │   ├── patient_summary_agent.py
│               │   ├── phi_scrubber.py
│               │   ├── text_extractor.py
│               │   └── medical_data_extractor.py
│               ├── api/                 # FastAPI routes
│               │   └── routes_fastapi.py
│               ├── services/            # Business logic services
│               │   ├── job_manager.py
│               │   ├── request_queue.py
│               │   ├── error_handler.py
│               │   └── sse_generator.py
│               ├── utils/               # Utilities and helpers
│               │   ├── unified_model_manager.py
│               │   ├── model_config.py
│               │   ├── robust_json_parser.py
│               │   ├── memory_manager.py
│               │   ├── openvino_summarizer_utils.py
│               │   └── patient_summary_utils.py
│               ├── app.py               # FastAPI app factory
│               ├── main.py              # Entry point
│               ├── health_endpoints.py  # Health checks
│               └── database_audit.py    # HIPAA audit logging
├── docs/                                # Documentation
├── infra/                               # Infrastructure configs
│   └── k8s/                            # Kubernetes manifests
├── app.py                               # HF Spaces entry point
├── Dockerfile                           # Multi-stage Docker build
├── Dockerfile.hf-spaces                 # HF Spaces optimized
├── .huggingface.yaml                    # HF Spaces config
├── models_config.json                   # Model configuration
├── requirements.txt                     # Python dependencies
└── README.md                            # Project documentation
```

---

## 4. Technology Stack

### 4.1 Core Technologies

| Category | Technology | Version | Purpose |
|----------|-----------|---------|---------|
| **Runtime** | Python | 3.10+ | Primary language |
| **Web Framework** | FastAPI | Latest | REST API framework |
| **ASGI Server** | Uvicorn | Latest | Production server |
| **AI/ML Framework** | PyTorch | 2.x | Deep learning |
| **Transformers** | Hugging Face Transformers | Latest | Model loading |
| **GGUF Support** | llama-cpp-python | Latest | Quantized models |
| **OpenVINO** | optimum-intel | Latest | Intel optimization |
| **Audio Processing** | Whisper | Latest | Speech-to-text |

### 4.2 Supporting Technologies

| Category | Technology | Purpose |
|----------|-----------|---------|
| **Database** | PostgreSQL 13+ | Audit logs (optional) |
| **Caching** | In-memory LRU | Model caching |
| **Document Processing** | PyPDF2, python-docx | PDF/DOCX parsing |
| **OCR** | Tesseract | Image text extraction |
| **Audio** | FFmpeg | Audio processing |
| **Containerization** | Docker | Deployment |
| **Orchestration** | Kubernetes | Scaling |
| **Monitoring** | Prometheus | Metrics |

### 4.3 Development Tools

- **Code Quality**: Black, isort, flake8, mypy
- **Testing**: pytest
- **Version Control**: Git
- **CI/CD**: GitHub Actions (potential)
- **Documentation**: Markdown, Mermaid diagrams

---

## 5. Core Components

### 5.1 FastAPI Application (`app.py`)

**Purpose**: Application factory and initialization

**Key Responsibilities**:
- Create and configure FastAPI application
- Initialize agents and services
- Register routes and middleware
- Configure CORS and security

**Key Functions**:
```python
from fastapi import FastAPI

def create_app(initialize: bool = True) -> FastAPI: ...
def initialize_agents(app: FastAPI, preload_small_models: bool = False): ...
def run_dev(): ...  # Development server
```

### 5.2 API Routes (`routes_fastapi.py`)

**Purpose**: RESTful API endpoints

**Endpoint Categories**:

#### Health & Monitoring
- `GET /health/live` - Liveness probe
- `GET /health/ready` - Readiness probe
- `GET /metrics` - Prometheus metrics

#### Document Processing
- `POST /upload` - Upload and process documents
- `POST /transcribe` - Audio transcription
- `GET /get_updated_medical_data` - Retrieve processed data
- `PUT /update_medical_data` - Update medical records

#### AI Processing
- `POST /generate_patient_summary` - Generate patient summaries
- `POST /api/generate_summary` - Text summarization
- `POST /api/patient_summary_openvino` - OpenVINO summaries
- `POST /extract_medical_data` - Extract structured data

#### Model Management
- `POST /api/load_model` - Load specific models
- `GET /api/model_info` - Model information
- `POST /api/switch_model` - Switch models

### 5.3 Agents

#### 5.3.1 Patient Summary Agent (`patient_summary_agent.py`)

**Purpose**: Generate comprehensive patient summaries

**Key Features**:
- Dynamic model configuration
- Multi-section summary generation
- Chronological narrative building
- Clinical guideline evaluation
- Fallback text-based summarization

**Core Methods**:
```python
from typing import Dict, List, Union

def configure_model(model_name: str, model_type: str): ...
def generate_clinical_summary(patient_data: Union[List[str], Dict]): ...
def generate_patient_summary(patient_data: Union[List[str], Dict]): ...
def build_chronological_narrative(patient_data: dict): ...
def format_clinical_output(raw_summary: str, patient_data: dict): ...
```

#### 5.3.2 PHI Scrubber (`phi_scrubber.py`)

**Purpose**: Remove protected health information

**Scrubbing Capabilities**:
- Patient names
- Medical record numbers (MRN)
- Dates of birth
- Phone numbers
- Email addresses
- Social Security Numbers
- Addresses

**Compliance**: HIPAA-compliant with audit logging

#### 5.3.3 Text Extractor (`text_extractor.py`)

**Purpose**: Extract text from various document formats

**Supported Formats**:
- PDF documents
- DOCX files
- Images (via OCR)
- Plain text

#### 5.3.4 Medical Data Extractor (`medical_data_extractor.py`)

**Purpose**: Extract structured medical data from text

**Extraction Targets**:
- Diagnoses
- Medications
- Procedures
- Lab results
- Vital signs
- Allergies

### 5.4 Services

#### 5.4.1 Job Manager (`job_manager.py`)

**Purpose**: Manage long-running jobs

**Features**:
- Job lifecycle management
- Progress tracking
- Status updates
- Result caching
- Cleanup of completed jobs
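
The lifecycle features above can be sketched with an in-memory store. This is a minimal illustration, not the contents of `job_manager.py`: the class name `InMemoryJobManager`, the status values, and the method names are all assumptions.

```python
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional


class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class Job:
    job_id: str
    status: JobStatus = JobStatus.PENDING
    progress: float = 0.0          # fraction between 0.0 and 1.0
    result: Optional[Any] = None
    error: Optional[str] = None
    updated_at: float = field(default_factory=time.monotonic)


class InMemoryJobManager:
    """Hypothetical stand-in for job_manager.py; state lives in a dict."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def create(self) -> Job:
        job = Job(job_id=uuid.uuid4().hex)
        self._jobs[job.job_id] = job
        return job

    def update(self, job_id: str, progress: float) -> None:
        job = self._jobs[job_id]
        job.status = JobStatus.RUNNING
        job.progress = max(0.0, min(1.0, progress))  # clamp to [0, 1]
        job.updated_at = time.monotonic()

    def complete(self, job_id: str, result: Any) -> None:
        job = self._jobs[job_id]
        job.status, job.progress, job.result = JobStatus.COMPLETED, 1.0, result
        job.updated_at = time.monotonic()

    def fail(self, job_id: str, error: str) -> None:
        job = self._jobs[job_id]
        job.status, job.error = JobStatus.FAILED, error
        job.updated_at = time.monotonic()

    def get(self, job_id: str) -> Optional[Job]:
        return self._jobs.get(job_id)

    def cleanup(self, max_age_seconds: float) -> int:
        """Drop finished jobs older than max_age_seconds; return how many."""
        now = time.monotonic()
        done = (JobStatus.COMPLETED, JobStatus.FAILED)
        stale = [
            jid for jid, job in self._jobs.items()
            if job.status in done and now - job.updated_at >= max_age_seconds
        ]
        for jid in stale:
            del self._jobs[jid]
        return len(stale)
```

A dict-backed store like this only works within a single process; sharing job state across replicas would need an external store.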

#### 5.4.2 Request Queue (`request_queue.py`)

**Purpose**: Queue and prioritize requests

**Features**:
- Request queuing
- Priority handling
- Concurrency control
- Timeout management
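
One way to combine priority handling with a concurrency cap is a heap guarded by a lock. The sketch below is an assumption about how such a queue could look, not the code in `request_queue.py`; `PriorityRequestQueue` and its methods are hypothetical names, and timeout handling is left out.

```python
import heapq
import itertools
import threading
from typing import Any, List, Optional, Tuple


class PriorityRequestQueue:
    """Hypothetical sketch: a lower priority number is served first."""

    def __init__(self, max_concurrent: int = 1) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._counter = itertools.count()   # tie-breaker keeps FIFO order
        self._lock = threading.Lock()
        self._max_concurrent = max_concurrent
        self._in_flight = 0

    def submit(self, request: Any, priority: int = 10) -> None:
        with self._lock:
            heapq.heappush(self._heap, (priority, next(self._counter), request))

    def acquire(self) -> Optional[Any]:
        """Return the next request, or None if empty or at the concurrency cap."""
        with self._lock:
            if not self._heap or self._in_flight >= self._max_concurrent:
                return None
            _, _, request = heapq.heappop(self._heap)
            self._in_flight += 1
            return request

    def release(self) -> None:
        """Mark one in-flight request as finished."""
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)
```

The counter in each heap entry ensures that requests with equal priority are served in submission order and that the payloads themselves are never compared.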

#### 5.4.3 Error Handler (`error_handler.py`)

**Purpose**: Centralized error handling

**Features**:
- Error categorization
- Contextual logging
- Job error updates
- Graceful degradation

#### 5.4.4 SSE Generator (`sse_generator.py`)

**Purpose**: Server-Sent Events for real-time updates

**Features**:
- Progress streaming
- Status updates
- Error notifications
- Completion events
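
Server-Sent Events use a simple text wire format: an optional `event:` line, one or more `data:` lines, and a blank line that ends the frame. The sketch below shows that framing; the event names `progress`, `error`, and `complete` and the function names are assumptions, not taken from `sse_generator.py`.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def format_sse(event: str, data: Dict[str, Any]) -> str:
    """Serialize one event in the text/event-stream wire format."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def stream_job_events(updates: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Yield SSE frames for a job; stop at the first update carrying an error."""
    for update in updates:
        if "error" in update:
            yield format_sse("error", update)
            return
        yield format_sse("progress", update)
    yield format_sse("complete", {"status": "completed"})
```

Because the JSON payload is serialized onto a single line, each frame needs only one `data:` field.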

---

## 6. AI/ML Architecture

### 6.1 Unified Model Manager

**File**: `unified_model_manager.py`

**Purpose**: Single interface for all AI model types

**Architecture**:

```mermaid
classDiagram
    class BaseModel {
        <<abstract>>
        +name: str
        +model_type: str
        +status: ModelStatus
        +load()
        +generate(prompt, config)*
        +unload()
    }
    
    class TransformersModel {
        +_model: Pipeline
        +_load_implementation()
        +generate(prompt, config)
    }
    
    class GGUFModel {
        +_model: Llama
        +filename: str
        +_extract_filename()
        +_load_implementation()
        +generate(prompt, config)
    }
    
    class OpenVINOModel {
        +_model: OVModelForCausalLM
        +_tokenizer: AutoTokenizer
        +_load_implementation()
        +generate(prompt, config)
    }
    
    class FallbackModel {
        +_load_implementation()
        +generate(prompt, config)
    }
    
    class UnifiedModelManager {
        +max_models: int
        +max_memory_mb: int
        +get_model(name, type)
        +generate_text(name, prompt)
        +cleanup()
    }
    
    BaseModel <|-- TransformersModel
    BaseModel <|-- GGUFModel
    BaseModel <|-- OpenVINOModel
    BaseModel <|-- FallbackModel
    UnifiedModelManager --> BaseModel
```

### 6.2 Model Types

#### 6.2.1 Transformers Models

**Backend**: Hugging Face Transformers  
**Device**: GPU (CUDA) or CPU  
**Use Cases**: General text generation, summarization

**Supported Models**:
- `microsoft/Phi-3-mini-4k-instruct`
- `facebook/bart-large-cnn` (deprecated)
- `google/flan-t5-large`

**Configuration**:
```python
{
    "model_name": "microsoft/Phi-3-mini-4k-instruct",
    "model_type": "text-generation",
    "device_map": "auto",
    "torch_dtype": "float16"
}
```

#### 6.2.2 GGUF Models

**Backend**: llama-cpp-python  
**Device**: CPU or GPU (via Metal/CUDA)  
**Use Cases**: Efficient inference with quantized models

**Supported Models**:
- `microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf` (PRIMARY)

**Configuration**:
```python
{
    "model_path": "path/to/model.gguf",
    "n_ctx": 8192,
    "n_threads": 4,
    "n_gpu_layers": 35  # GPU acceleration
}
```

#### 6.2.3 OpenVINO Models

**Backend**: Intel OpenVINO  
**Device**: CPU (Intel optimized) or GPU  
**Use Cases**: Production deployment on Intel hardware

**Supported Models**:
- `OpenVINO/Phi-3-mini-4k-instruct-fp16-ov`

**Configuration**:
```python
{
    "model_path": "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov",
    "device": "GPU"  # fall back to "CPU" when no GPU is available
}
```

### 6.3 Model Selection Strategy

```mermaid
flowchart TD
    A[Request with model_name] --> B{Model specified?}
    B -->|Yes| C{Model type?}
    B -->|No| D[Use default: Phi-3 GGUF]
    
    C -->|GGUF| E[Load GGUF Model]
    C -->|OpenVINO| F[Load OpenVINO Model]
    C -->|Transformers| G[Load Transformers Model]
    C -->|Unknown| H[Auto-detect type]
    
    E --> I{Load successful?}
    F --> I
    G --> I
    H --> I
    D --> I
    
    I -->|Yes| J[Generate with model]
    I -->|No| K[Try fallback model]
    
    K --> L{Fallback successful?}
    L -->|Yes| J
    L -->|No| M[Use text-based fallback]
```
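
The flowchart can be read as a short chain of attempts. The sketch below follows that chain under stated assumptions: the `loaders` mapping, the name-based rules in `detect_model_type`, and the `"text-based-fallback"` sentinel are hypothetical and do not reflect the actual `UnifiedModelManager` API.

```python
from typing import Callable, Dict, Optional

DEFAULT_MODEL = "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"


def detect_model_type(model_name: str) -> str:
    """Guess the backend from the name; these rules are illustrative only."""
    lowered = model_name.lower()
    if "gguf" in lowered:
        return "gguf"
    if lowered.startswith("openvino/") or lowered.endswith("-ov"):
        return "openvino"
    return "transformers"


def select_model(
    loaders: Dict[str, Callable[[str], object]],
    model_name: Optional[str] = None,
    model_type: Optional[str] = None,
    fallback_name: str = DEFAULT_MODEL,
) -> object:
    """Try the requested model, then the fallback model, then text-based fallback."""
    name = model_name or DEFAULT_MODEL
    # Unknown or missing types are auto-detected from the model name.
    requested_type = model_type if model_type in loaders else detect_model_type(name)
    candidates = [(name, requested_type)]
    if fallback_name != name:
        candidates.append((fallback_name, detect_model_type(fallback_name)))
    for candidate_name, candidate_type in candidates:
        loader = loaders.get(candidate_type)
        if loader is None:
            continue
        try:
            return loader(candidate_name)
        except Exception:
            continue  # load failed, move on to the next candidate
    return "text-based-fallback"
```

Passing the loaders in as a mapping keeps the selection logic testable without importing any model backend.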

### 6.4 Model Configuration

**File**: `models_config.json`

```json
{
  "patient_summary_models": [
    {
      "name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
      "type": "gguf",
      "is_active": true,
      "cached": true,
      "description": "Phi-3 Mini GGUF Q4 quantized - PRIMARY MODEL",
      "use_case": "Fast patient summary generation with CPU/GPU",
      "repo_id": "microsoft/Phi-3-mini-4k-instruct-gguf",
      "filename": "Phi-3-mini-4k-instruct-q4.gguf"
    }
  ],
  "runtime_behavior": {
    "allow_runtime_downloads": true,
    "cache_runtime_downloads": true,
    "fallback_to_cached": true
  }
}
```

### 6.5 Token Management

**Token Limit Handling**:
- Automatic token estimation (heuristic: ~4 characters per token)
- Pre-generation validation
- Token limit error detection
- Graceful degradation

**Token Limits by Model**:
- Phi-3-mini-4k models: 4096 tokens (context window, shared by prompt and output)
- BART models: 1024 tokens
- T5 models: 512 tokens
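
The character-based heuristic and the pre-generation check can be sketched as follows. The limits are copied from the list above; the family keys (`"phi-3"`, `"bart"`, `"t5"`) and the function names are assumptions, and a real tokenizer would give different counts.

```python
import math

CHARS_PER_TOKEN = 4  # heuristic from the text, not a real tokenizer

# Limits copied from the list above, keyed by a hypothetical family name.
TOKEN_LIMITS = {"phi-3": 4096, "bart": 1024, "t5": 512}


def estimate_tokens(text: str) -> int:
    """Estimate the token count, rounding up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_context(prompt: str, family: str, reserved_output_tokens: int = 0) -> bool:
    """Pre-generation check: prompt plus reserved output must fit the limit."""
    return estimate_tokens(prompt) + reserved_output_tokens <= TOKEN_LIMITS[family]


def truncate_to_limit(prompt: str, family: str, reserved_output_tokens: int = 0) -> str:
    """Graceful degradation: keep only as many characters as the budget allows."""
    budget = max(0, TOKEN_LIMITS[family] - reserved_output_tokens)
    return prompt[: budget * CHARS_PER_TOKEN]
```

Reserving output tokens matters because the prompt and the generated text share the same context window.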

### 6.6 Generation Configuration

```python
from dataclasses import dataclass


@dataclass
class GenerationConfig:
    max_tokens: int = 8192      # Maximum output tokens
    min_tokens: int = 50        # Minimum output tokens
    temperature: float = 0.3    # Low temperature for more consistent medical output
    top_p: float = 0.9          # Nucleus sampling
    timeout: float = 180.0      # Generation timeout in seconds (tuned for T4)
    stream: bool = False        # Streaming support
```

### 6.7 T4 GPU Optimizations

**Hardware Target**: NVIDIA T4 Medium (16GB GPU, 16GB RAM)

**Optimizations**:
1. **Memory Management**:
   - Max 2 models in memory
   - Automatic model unloading
   - GPU memory clearing
   - Garbage collection

2. **Model Loading**:
   - Lazy loading (on-demand)
   - Intelligent caching
   - LRU eviction policy

3. **Inference**:
   - FP16 precision
   - Batch size: 1
   - Context window: 8192 tokens
   - GPU layer offloading (GGUF)
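
The combination of lazy loading, a two-model cap, and LRU eviction can be illustrated with an `OrderedDict`. This is a sketch only: `LRUModelCache` and the `on_evict` hook are hypothetical, and the real manager also tracks memory usage, which is omitted here.

```python
from collections import OrderedDict
from typing import Callable, List, Optional


class LRUModelCache:
    """Hypothetical sketch of the 'max 2 models' policy with LRU eviction."""

    def __init__(
        self,
        max_models: int = 2,
        on_evict: Optional[Callable[[str, object], None]] = None,
    ) -> None:
        self._models: "OrderedDict[str, object]" = OrderedDict()
        self._max_models = max_models
        self._on_evict = on_evict  # e.g. unload weights and clear GPU memory

    def get(self, name: str, loader: Callable[[str], object]) -> object:
        if name in self._models:
            self._models.move_to_end(name)       # mark as most recently used
            return self._models[name]
        # Make room before loading so the cap is never exceeded.
        while len(self._models) >= self._max_models:
            old_name, old_model = self._models.popitem(last=False)
            if self._on_evict is not None:
                self._on_evict(old_name, old_model)
        model = loader(name)                     # lazy load on first request
        self._models[name] = model
        return model

    def loaded(self) -> List[str]:
        """Names of cached models, least recently used first."""
        return list(self._models)
```

Evicting before loading, rather than after, avoids briefly holding three models in memory on a 16GB GPU.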

---

## 7. API Architecture

### 7.1 RESTful Design

**Principles**:
- Resource-oriented URLs
- HTTP methods for CRUD operations
- JSON request/response format
- Stateless communication
- Proper HTTP status codes

### 7.2 Request/Response Flow

```mermaid
sequenceDiagram
    participant C as Client
    participant A as API Gateway
    participant S as Service Layer
    participant M as Model Manager
    participant D as Database
    
    C->>A: POST /generate_patient_summary
    A->>A: Validate request
    A->>S: Create job
    S->>D: Log job creation
    A-->>C: 202 Accepted (job_id)
    
    S->>M: Load model
    M->>M: Check cache
    M->>M: Load if needed
    M-->>S: Model ready
    
    S->>M: Generate summary
    M->>M: Process prompt
    M-->>S: Generated text
    
    S->>D: Log completion
    S->>A: Update job status
    A-->>C: SSE: Progress updates
    
    C->>A: GET /job/{job_id}
    A->>S: Get job status
    S->>D: Retrieve job
    S-->>A: Job result
    A-->>C: 200 OK (result)
```

### 7.3 Authentication & Authorization

**Current State**: Basic API key authentication (optional)

**Planned Enhancements**:
- JWT-based authentication
- Role-based access control (RBAC)
- OAuth2 integration
- API rate limiting

### 7.4 Error Handling

**Error Response Format**:
```json
{
  "error": {
    "code": "MODEL_LOAD_FAILED",
    "message": "Failed to load model: microsoft/Phi-3-mini-4k-instruct",
    "details": {
      "model_name": "microsoft/Phi-3-mini-4k-instruct",
      "error_type": "initialization_error",
      "timestamp": "2025-12-05T17:23:52Z"
    }
  }
}
```

**HTTP Status Codes**:
- `200 OK` - Successful request
- `202 Accepted` - Job created
- `400 Bad Request` - Invalid input
- `404 Not Found` - Resource not found
- `500 Internal Server Error` - Server error
- `503 Service Unavailable` - Service degraded
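
A small helper can keep the error envelope and status codes consistent across endpoints. The sketch below builds the format shown above; the `STATUS_BY_CODE` mapping and every error code other than `MODEL_LOAD_FAILED` are assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional, Tuple

# Hypothetical mapping from error codes to the HTTP status codes listed above.
STATUS_BY_CODE = {
    "INVALID_INPUT": 400,
    "NOT_FOUND": 404,
    "MODEL_LOAD_FAILED": 503,
}


def build_error_response(
    code: str, message: str, details: Optional[Dict[str, Any]] = None
) -> Tuple[int, Dict[str, Any]]:
    """Return (http_status, body) using the error envelope shown above."""
    body_details = dict(details or {})
    # Add a UTC timestamp unless the caller already supplied one.
    body_details.setdefault(
        "timestamp", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    )
    body = {"error": {"code": code, "message": message, "details": body_details}}
    return STATUS_BY_CODE.get(code, 500), body
```

Unmapped codes default to `500`, so a new error type never produces a success status by accident.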

### 7.5 Rate Limiting

**Strategy**: Token bucket algorithm (rate limiting is listed as a planned enhancement in 7.3)

**Limits**:
- 100 requests/minute per IP
- 1000 requests/hour per API key
- Burst allowance: 20 requests
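
A token bucket refills at a steady rate up to a fixed capacity, so the capacity sets the burst size and the rate sets the sustained limit. The sketch below applies the per-IP numbers above; the class and the injectable `clock` parameter are illustrative assumptions, not the service's implementation.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._tokens = capacity      # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


def per_ip_bucket(clock: Callable[[], float] = time.monotonic) -> TokenBucket:
    """100 requests/minute sustained with a burst of 20, per the limits above."""
    return TokenBucket(rate=100 / 60, capacity=20, clock=clock)
```

The hourly per-API-key limit would use a second bucket with its own rate, checked alongside the per-IP bucket.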

---

## 8. Data Flow & Processing

### 8.1 Document Processing Pipeline

```mermaid
flowchart LR
    A[Upload Document] --> B{File Type?}
    B -->|PDF| C[PDF Parser]
    B -->|DOCX| D[DOCX Parser]
    B -->|Image| E[OCR Engine]
    B -->|Audio| F[Whisper Transcription]
    
    C --> G[Text Extraction]
    D --> G
    E --> G
    F --> G
    
    G --> H[PHI Scrubbing]
    H --> I[Medical Data Extraction]
    I --> J[Store Processed Data]
    J --> K[Return Results]
```
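
The branching step in this pipeline amounts to a lookup from file type to extraction stage, followed by fixed downstream stages. In the sketch below, the extension list, the stage names, and the callables passed to `process_document` are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, Dict

# Extension-to-stage mapping; the extensions and stage names are assumptions.
HANDLERS: Dict[str, str] = {
    ".pdf": "pdf_parser",
    ".docx": "docx_parser",
    ".png": "ocr_engine",
    ".jpg": "ocr_engine",
    ".jpeg": "ocr_engine",
    ".wav": "whisper_transcription",
    ".mp3": "whisper_transcription",
}


def route_document(filename: str) -> str:
    """Pick the extraction stage from the file extension (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    try:
        return HANDLERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or filename!r}") from None


def process_document(
    filename: str,
    extract: Callable[[str, str], str],
    scrub: Callable[[str], str],
    extract_medical: Callable[[str], dict],
) -> dict:
    """Run stages in pipeline order: extraction, PHI scrubbing, data extraction."""
    text = extract(route_document(filename), filename)
    return extract_medical(scrub(text))
```

Scrubbing runs before medical data extraction, matching the diagram, so downstream stages never see raw PHI.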

### 8.2 Patient Summary Generation Flow

```mermaid
flowchart TD
    A[Patient Data Input] --> B[Parse EHR Data]
    B --> C[Convert to Plain Text]
    C --> D{Data Size Check}
    
    D -->|Small| E[Single-pass Generation]
    D -->|Large| F[Chunking Strategy]
    
    F --> G[Chunk by Date/Size]
    G --> H[Process Chunks in Parallel]
    H --> I[Combine Chunk Summaries]
    
    E --> J[Generate with Model]
    I --> J
    
    J --> K[Format Clinical Output]
    K --> L[Evaluate Against Guidelines]
    L --> M[Return Summary]
```
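
The size check and chunking branch can be sketched as a two-level summarization. This is an illustration under assumptions: visits are already sorted by date, size is measured in characters, `generate` is any callable wrapping the model, and chunks are processed sequentially rather than in parallel.

```python
from typing import Callable, List


def chunk_visits(visits: List[str], max_chars: int) -> List[List[str]]:
    """Group date-ordered visit texts into chunks of at most max_chars each.

    A single visit longer than max_chars becomes its own chunk.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    size = 0
    for visit in visits:
        # Close the current chunk if adding this visit would exceed the budget.
        if current and size + len(visit) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(visit)
        size += len(visit)
    if current:
        chunks.append(current)
    return chunks


def summarize(
    visits: List[str], max_chars: int, generate: Callable[[str], str]
) -> str:
    """Single pass for small inputs; otherwise summarize chunks, then combine."""
    chunks = chunk_visits(visits, max_chars)
    if len(chunks) <= 1:
        return generate("\n".join(visits))
    partials = [generate("\n".join(chunk)) for chunk in chunks]
    return generate("\n".join(partials))
```

Keeping visits whole inside each chunk preserves the chronological order that the final combining pass relies on.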

### 8.3 Data Transformation

**Input Formats**:
- Raw EHR JSON
- HL7 FHIR resources
- Plain text documents
- Scanned images
- Audio recordings

**Output Formats**:
- Structured JSON
- Clinical summary (Markdown)
- FHIR-compliant resources
- Audit logs

### 8.4 Caching Strategy

**Multi-Level Caching**:

1. **Model Cache**: Loaded models in memory
2. **Result Cache**: Generated summaries (LRU)
3. **File Cache**: Processed documents
4. **Hugging Face Cache**: Downloaded models

**Cache Invalidation**:
- Time-based expiration
- Manual invalidation
- Memory pressure-based eviction
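
For the result cache, time-based expiration and LRU eviction can live in one small class. The sketch below is a hypothetical illustration: `ResultCache`, the one-hour default TTL, and the key scheme are assumptions, and memory-pressure eviction is not modeled.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class ResultCache:
    """LRU cache with time-based expiry; the TTL default is illustrative."""

    def __init__(
        self,
        max_items: int = 100,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._items: "OrderedDict[str, Tuple[float, str]]" = OrderedDict()
        self._max_items = max_items
        self._ttl = ttl_seconds
        self._clock = clock

    @staticmethod
    def key_for(model_name: str, prompt: str) -> str:
        """Hash model and prompt so raw patient text is never used as a key."""
        return hashlib.sha256(f"{model_name}\0{prompt}".encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._items[key]                 # time-based expiration
            return None
        self._items.move_to_end(key)             # mark as most recently used
        return value

    def put(self, key: str, value: str) -> None:
        self._items[key] = (self._clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self._max_items:
            self._items.popitem(last=False)      # evict least recently used

    def invalidate(self, key: str) -> None:
        self._items.pop(key, None)               # manual invalidation
```

Including the model name in the key prevents a summary generated by one model from being served for a request that names another.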

---

## 9. Database Design

### 9.1 Database Schema

**Primary Database**: PostgreSQL (optional, for audit logs)

#### Audit Logs Table

```sql
CREATE TABLE audit_logs (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMP NOT NULL DEFAULT NOW(),
    user_id VARCHAR(255),
    action VARCHAR(100) NOT NULL,
    resource_type VARCHAR(100),
    resource_id VARCHAR(255),
    phi_accessed BOOLEAN DEFAULT FALSE,
    ip_address INET,
    user_agent TEXT,
    request_data JSONB,
    response_status INTEGER,
    error_message TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_audit_timestamp ON audit_logs(timestamp);
CREATE INDEX idx_audit_user ON audit_logs(user_id);
CREATE INDEX idx_audit_action ON audit_logs(action);
CREATE INDEX idx_audit_phi ON audit_logs(phi_accessed);
```

### 9.2 Data Models

**Patient Data Model** (In-memory):
```python
{
    "patient_id": "string",
    "demographics": {
        "name": "string",
        "dob": "date",
        "gender": "string",
        "mrn": "string"
    },
    "visits": [
        {
            "visit_id": "string",
            "date": "datetime",
            "chief_complaint": "string",
            "diagnoses": ["string"],
            "medications": ["string"],
            "procedures": ["string"],
            "vitals": {},
            "labs": []
        }
    ]
}
```

### 9.3 File Storage

**Storage Strategy**: Local filesystem or cloud storage

**Directory Structure**:
```
/data/
├── uploads/           # Uploaded documents
├── processed/         # Processed documents
├── cache/            # Temporary cache
└── models/           # Model files
```

---

## 10. Security Architecture

### 10.1 HIPAA Compliance

**Requirements Met**:
1. **Access Controls**: API key authentication (JWT and RBAC are planned; see 7.3)
2. **Audit Logging**: Comprehensive activity logs
3. **Data Encryption**: In-transit and at-rest
4. **PHI Scrubbing**: Automated anonymization
5. **Secure Communication**: HTTPS/TLS

### 10.2 PHI Scrubbing

**Scrubbing Patterns**:
```python
PATTERNS = {
    # Heuristic: two capitalized words; also matches non-name phrases
    "name": r'\b[A-Z][a-z]+ [A-Z][a-z]+\b',
    "mrn": r'\bMRN[:\s]*\d{6,10}\b',
    # Matches any numeric date, not only dates of birth
    "dob": r'\b\d{1,2}/\d{1,2}/\d{2,4}\b',
    "phone": r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    "email": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    "ssn": r'\b\d{3}-\d{2}-\d{4}\b'
}
```
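
Applying patterns like these is a matter of ordered substitution. The sketch below is a minimal illustration, not the logic in `phi_scrubber.py`: the `[LABEL]` placeholder style, the function name, and the pattern order are assumptions, and the name pattern is left out because it is prone to false positives.

```python
import re
from typing import Dict, Tuple

# Order matters: MRN runs before phone so a 10-digit MRN is not tagged as a phone.
PHI_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
    "dob": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def scrub_phi(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace each match with a [LABEL] placeholder and count replacements."""
    counts: Dict[str, int] = {}
    for label, pattern in PHI_PATTERNS.items():
        text, n = pattern.subn(f"[{label.upper()}]", text)
        if n:
            counts[label] = n
    return text, counts
```

The returned counts record how many items of each kind were removed, which suits an audit log entry because it contains no PHI itself.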

### 10.3 Container Security

**Security Measures**:
- Non-root user execution
- Read-only root filesystem
- Resource limits (CPU, memory)
- Network policies
- Secrets management
- Minimal base images

### 10.4 API Security

**Security Headers**:
```python
{
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "X-XSS-Protection": "1; mode=block",
    "Strict-Transport-Security": "max-age=31536000"
}
```

---

## 11. Deployment Architecture

### 11.1 Deployment Options

#### 11.1.1 Hugging Face Spaces

**Configuration**: `.huggingface.yaml`

```yaml
runtime: docker
sdk: docker
python_version: "3.10"

build:
  dockerfile: Dockerfile.hf-spaces
  cache: true

hardware:
  gpu: t4-medium  # 16GB GPU RAM, 16GB System RAM

env:
  - SPACE_ID=$SPACE_ID
  - HF_HOME=/app/.cache/huggingface
  - TORCH_HOME=/app/.cache/torch
  - MODEL_CACHE_DIR=/app/models
  - PRELOAD_GGUF=true
  - HF_SPACES=true
```

**Optimizations**:
- Pre-cached models in Docker image
- Lazy model loading
- Memory-efficient inference
- Automatic GPU detection

#### 11.1.2 Kubernetes

**Deployment Manifest**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hntai-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hntai
  template:
    metadata:
      labels:
        app: hntai
    spec:
      containers:
      - name: hntai
        image: hntai:latest
        ports:
        - containerPort: 7860
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
        livenessProbe:
          httpGet:
            path: /health/live
            port: 7860
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 7860
          initialDelaySeconds: 10
          periodSeconds: 5
```

#### 11.1.3 Docker

**Multi-Stage Dockerfile**:

```dockerfile
# Stage 1: Builder
FROM python:3.10-slim AS builder
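# Install build tools without recommended extras and drop the apt cache
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt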

# Stage 2: Runtime
FROM python:3.10-slim AS runtime
COPY --from=builder /install /usr/local
WORKDIR /app
COPY . .
ENV PYTHONUNBUFFERED=1
EXPOSE 7860
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
```

### 11.2 Scaling Strategy

**Horizontal Scaling**:
- Multiple replicas behind load balancer
- Stateless design for easy scaling
- Shared model cache (optional)

**Vertical Scaling**:
- Increase CPU/memory per instance
- GPU acceleration for inference
- Larger model support

### 11.3 High Availability

**Components**:
1. **Load Balancer**: Distribute traffic
2. **Health Checks**: Automatic failover
3. **Auto-scaling**: Based on CPU/memory
4. **Graceful Shutdown**: Drain connections

---

## 12. Performance Optimization

### 12.1 Model Optimization

**Techniques**:
1. **Quantization**: GGUF Q4 models (4-bit)
2. **Precision**: FP16 for GPU inference
3. **Batching**: Batch size optimization
4. **Caching**: Model and result caching
5. **Lazy Loading**: On-demand model loading

### 12.2 Memory Management

**Strategies**:
- Automatic garbage collection
- GPU memory clearing
- Model unloading (LRU)
- Memory pressure monitoring

**Memory Limits**:
- T4 Medium: 16GB GPU, 16GB RAM
- Max 2 models in memory
- Automatic eviction at 80% usage

### 12.3 Inference Optimization

**T4-Specific Optimizations**:
```python
{
    "max_models": 2,
    "max_memory_mb": 14000,
    "n_ctx": 8192,
    "n_threads": 4,
    "n_gpu_layers": 35,
    "torch_dtype": "float16",
    "device_map": "auto"
}
```

### 12.4 Caching Strategy

**Cache Hierarchy**:
1. **L1 - Model Cache**: In-memory loaded models
2. **L2 - Result Cache**: Generated summaries (LRU, 100 items)
3. **L3 - File Cache**: Processed documents (disk)
4. **L4 - HF Cache**: Downloaded models (disk)

### 12.5 Performance Metrics

**Target Metrics**:
- Model load time: < 10 seconds
- Summary generation: < 60 seconds (small), < 180 seconds (large)
- API response time: < 100ms (excluding generation)
- Memory usage: < 80% of available
- GPU utilization: > 70% during inference

---

## 13. Monitoring & Observability

### 13.1 Health Checks

**Liveness Probe** (`/health/live`):
```python
{
    "status": "alive",
    "timestamp": "2025-12-05T17:23:52Z"
}
```

**Readiness Probe** (`/health/ready`):
```python
{
    "status": "ready",
    "checks": {
        "database": "ok",
        "model_manager": "ok",
        "file_storage": "ok"
    },
    "timestamp": "2025-12-05T17:23:52Z"
}
```
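
A readiness response of this shape can be produced by running one check per dependency and aggregating the results. In the sketch below, the function name, the `"failed"` and `"not_ready"` values, and the use of `503` for a failing probe are assumptions rather than the behavior of `health_endpoints.py`.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run each dependency check; any failure or exception marks 'not_ready'."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check must not crash the probe
            results[name] = f"error: {type(exc).__name__}"
    ready = all(value == "ok" for value in results.values())
    body = {
        "status": "ready" if ready else "not_ready",
        "checks": results,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return (200 if ready else 503), body
```

Returning a non-2xx status is what lets a Kubernetes readiness probe take the pod out of rotation while a dependency is down.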

### 13.2 Metrics

**Prometheus Metrics** (`/metrics`):
```
# Model metrics
model_load_time_seconds{model_name="phi-3-gguf"} 8.5
model_inference_time_seconds{model_name="phi-3-gguf"} 45.2
model_memory_usage_bytes{model_name="phi-3-gguf"} 4294967296

# API metrics
http_requests_total{method="POST",endpoint="/generate_patient_summary"} 1234
http_request_duration_seconds{method="POST",endpoint="/generate_patient_summary"} 52.3

# System metrics
memory_usage_percent 65.2
gpu_memory_usage_percent 72.1
cpu_usage_percent 45.8
```

### 13.3 Logging

**Structured Logging**:
```python
{
    "timestamp": "2025-12-05T17:23:52Z",
    "level": "INFO",
    "logger": "ai_med_extract.agents.patient_summary_agent",
    "message": "Generated patient summary",
    "context": {
        "job_id": "abc123",
        "model_name": "phi-3-gguf",
        "duration_seconds": 45.2,
        "token_count": 2048
    }
}
```
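
Log entries of this shape can be emitted with the standard `logging` module and a custom formatter. The sketch below is one possible approach, not the project's logging setup; `JsonFormatter`, `make_logger`, and the convention of passing `context` through `extra` are assumptions.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, matching the shape above."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # `context` is set by callers via extra={"context": {...}}
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def make_logger(name: str, stream=None) -> logging.Logger:
    """Build a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `make_logger("ai_med_extract.agents").info("Generated patient summary", extra={"context": {"job_id": "abc123"}})` would then emit one JSON line; the context should carry identifiers and timings only, never PHI.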

**Log Levels**:
- `DEBUG`: Detailed diagnostic information
- `INFO`: General informational messages
- `WARNING`: Warning messages
- `ERROR`: Error messages
- `CRITICAL`: Critical failures

### 13.4 Audit Logging

**HIPAA Audit Trail**:
```python
{
    "timestamp": "2025-12-05T17:23:52Z",
    "user_id": "user123",
    "action": "PHI_ACCESS",
    "resource_type": "patient_summary",
    "resource_id": "patient456",
    "phi_accessed": True,
    "ip_address": "192.168.1.100",
    "user_agent": "Mozilla/5.0...",
    "request_data": {...},
    "response_status": 200
}
```

---

## 14. Development Workflow

### 14.1 Local Development

**Setup**:
```bash
# Clone repository
git clone <repository-url>
cd HNTAI

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export DATABASE_URL="postgresql://user:pass@localhost:5432/hntai"
export SECRET_KEY="your-secret-key"
export HF_HOME="/tmp/huggingface"

# Run development server
cd services/ai-service/src
python -m ai_med_extract.app run_dev
```

### 14.2 Testing

**Test Structure**:
```
tests/
├── unit/
│   ├── test_agents.py
│   ├── test_model_manager.py
│   └── test_utils.py
├── integration/
│   ├── test_api.py
│   └── test_workflows.py
└── conftest.py
```

**Running Tests**:
```bash
# Unit tests
python -m pytest tests/unit/

# Integration tests
python -m pytest tests/integration/

# Coverage report
python -m pytest --cov=ai_med_extract tests/
```

### 14.3 Code Quality

**Tools**:
```bash
# Format code
black .
isort .

# Lint code
flake8 .

# Type checking
mypy services/ai-service/src/ai_med_extract/
```

### 14.4 Git Workflow

**Branching Strategy**:
- `main`: Production-ready code
- `develop`: Integration branch
- `feature/*`: Feature branches
- `bugfix/*`: Bug fix branches
- `hotfix/*`: Production hotfixes

**Commit Convention**:
```
<type>(<scope>): <subject>

<body>

<footer>
```

Types: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore`

---

## 15. Integration Patterns

### 15.1 External System Integration

**Integration Points**:
1. **EHR Systems**: HL7, FHIR APIs
2. **Document Management**: File uploads, cloud storage
3. **Authentication**: OAuth2, SAML
4. **Monitoring**: Prometheus, Grafana
5. **Logging**: ELK Stack, CloudWatch

### 15.2 API Integration

**Client Libraries** (Planned):
- Python SDK
- JavaScript SDK
- REST API documentation (OpenAPI/Swagger)

**Example Integration**:
```python
import time

import requests

BASE_URL = "https://api.hntai.com"
HEADERS = {"Authorization": "Bearer <token>"}

# Upload document (the context manager closes the file handle)
with open("document.pdf", "rb") as f:
    response = requests.post(f"{BASE_URL}/upload", files={"file": f}, headers=HEADERS)
response.raise_for_status()

# Generate patient summary
response = requests.post(
    f"{BASE_URL}/generate_patient_summary",
    json={
        "patient_data": {...},
        "model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
        "model_type": "gguf"
    },
    headers=HEADERS,
)
response.raise_for_status()
job_id = response.json()["job_id"]

# Poll for results, giving up after 60 attempts (about 5 minutes)
for _ in range(60):
    response = requests.get(f"{BASE_URL}/job/{job_id}", headers=HEADERS)
    response.raise_for_status()
    if response.json()["status"] == "completed":
        break
    time.sleep(5)
else:
    raise TimeoutError(f"Job {job_id} did not complete in time")
```

### 15.3 Webhook Support

**Planned Feature**: Webhook notifications for job completion

```python
{
    "event": "job.completed",
    "job_id": "abc123",
    "timestamp": "2025-12-05T17:23:52Z",
    "data": {
        "status": "completed",
        "result": {...}
    }
}
```
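
On the receiving side, a consumer would parse the payload and route it by event name. Because webhooks are still planned, the sketch below is only illustrative: `dispatch_webhook` and the handler registry are hypothetical, and the field names are taken from the example payload above.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical handler type; handlers are keyed by event name.
# Only "job.completed" appears in the planned payload.
Handler = Callable[[Dict[str, Any]], Any]


def dispatch_webhook(raw_body: bytes, handlers: Dict[str, Handler]) -> Any:
    """Parse a webhook body and call the handler registered for its event."""
    payload = json.loads(raw_body)
    # Field names follow the example payload; validate before use.
    missing = [key for key in ("event", "job_id", "data") if key not in payload]
    if missing:
        raise ValueError(f"Webhook payload missing fields: {missing}")
    handler = handlers.get(payload["event"])
    if handler is None:
        return None  # Unknown events are ignored rather than rejected.
    return handler(payload)
```

Ignoring unknown events keeps existing consumers working if new event types are added later.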

---

## 16. Scalability Considerations

### 16.1 Horizontal Scaling

**Strategies**:
1. **Stateless Design**: No session state in application
2. **Load Balancing**: Distribute requests across instances
3. **Shared Cache**: Redis for distributed caching
4. **Message Queue**: RabbitMQ/Kafka for async processing

### 16.2 Vertical Scaling

**Resource Scaling**:
- CPU: 2-8 cores per instance
- Memory: 8-32 GB per instance
- GPU: T4, V100, A100 for inference

### 16.3 Database Scaling

**Strategies**:
1. **Read Replicas**: For audit log queries
2. **Partitioning**: Time-based partitioning for logs
3. **Indexing**: Optimize query performance
4. **Archiving**: Move old logs to cold storage
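
As an illustration of time-based partitioning, the sketch below builds the DDL for one monthly partition using PostgreSQL declarative range partitioning. The table name `audit_logs` and the helper `monthly_partition_ddl` are hypothetical, and the parent table is assumed to be declared with `PARTITION BY RANGE` on a timestamp column.

```python
from datetime import date


def monthly_partition_ddl(parent_table: str, month_start: date) -> str:
    """Build PostgreSQL DDL for one monthly range partition of parent_table."""
    if month_start.day != 1:
        raise ValueError("month_start must be the first day of a month")
    # The upper bound is exclusive: the first day of the following month.
    if month_start.month == 12:
        next_month = date(month_start.year + 1, 1, 1)
    else:
        next_month = date(month_start.year, month_start.month + 1, 1)
    # parent_table must come from trusted code, never from user input,
    # because it is interpolated directly into the statement.
    partition = f"{parent_table}_{month_start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {partition} PARTITION OF {parent_table} "
        f"FOR VALUES FROM ('{month_start.isoformat()}') "
        f"TO ('{next_month.isoformat()}');"
    )
```

Old partitions created this way can later be detached and archived as whole tables, which fits the archiving strategy listed above.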

### 16.4 Model Serving

**Scaling Options**:
1. **Model Replication**: Same model on multiple instances
2. **Model Partitioning**: Different models on different instances, with requests routed by model
3. **Model Versioning**: A/B testing with multiple versions
4. **Dedicated Inference**: Separate inference service
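
For A/B testing across model versions, one common approach is deterministic weighted routing, so that a given key always reaches the same version no matter which instance handles the request. The sketch below assumes this approach; `pick_model_version` and the version labels are hypothetical and do not describe the platform's current routing.

```python
import hashlib
from typing import Dict


def pick_model_version(request_key: str, weights: Dict[str, float]) -> str:
    """Deterministically map request_key to a version according to weights."""
    # Drop versions with no traffic share and fix the iteration order.
    active = sorted((v, w) for v, w in weights.items() if w > 0)
    if not active:
        raise ValueError("at least one version needs a positive weight")
    total = sum(w for _, w in active)
    # Hash the key to a stable point in [0, 1) so that the same key
    # always lands on the same version.
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for version, weight in active:
        cumulative += weight / total
        if point < cumulative:
            return version
    return active[-1][0]  # Guard against floating-point rounding.
```

Keying on a stable identifier (for example a job or tenant ID) rather than a random draw keeps results comparable across retries.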

---

## 17. Future Roadmap

### 17.1 Short-Term (3-6 months)

1. **Enhanced Model Support**:
   - Support for Llama 3, Mistral models
   - Fine-tuned medical models
   - Multi-modal models (text + images)

2. **Improved Performance**:
   - Model quantization (INT8, INT4)
   - Batch inference support
   - Streaming responses

3. **Additional Features**:
   - Real-time collaboration
   - Version control for summaries
   - Template-based summaries

### 17.2 Medium-Term (6-12 months)

1. **Advanced AI Capabilities**:
   - Multi-agent orchestration
   - Retrieval-Augmented Generation (RAG)
   - Knowledge graph integration

2. **Enterprise Features**:
   - Multi-tenancy support
   - Advanced RBAC
   - SSO integration
   - Compliance reporting

3. **Platform Enhancements**:
   - Web UI for management
   - Mobile app support
   - Plugin architecture

### 17.3 Long-Term (12+ months)

1. **AI/ML Advancements**:
   - Custom model training pipeline
   - Federated learning support
   - Explainable AI (XAI)

2. **Ecosystem Integration**:
   - FHIR server integration
   - HL7 v3 support
   - DICOM image analysis

3. **Global Expansion**:
   - Multi-language support
   - Regional compliance (GDPR, etc.)
   - Edge deployment

---

## Appendix A: Configuration Reference

### Environment Variables

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `DATABASE_URL` | PostgreSQL connection string | - | No |
| `SECRET_KEY` | Application secret key | - | Yes |
| `JWT_SECRET_KEY` | JWT signing key | - | Yes |
| `HF_HOME` | Hugging Face cache directory | `/tmp/huggingface` | No |
| `TORCH_HOME` | PyTorch cache directory | `/tmp/torch` | No |
| `WHISPER_CACHE` | Whisper model cache | `/tmp/whisper` | No |
| `HF_SPACES` | Hugging Face Spaces mode | `false` | No |
| `PRELOAD_GGUF` | Preload GGUF models | `false` | No |
| `MAX_NEW_TOKENS` | Max output tokens | `8192` | No |
| `MAX_INPUT_TOKENS` | Max input tokens | `2048` | No |
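
The table can be mirrored in a small typed loader that fails fast when a required variable is missing. This is a minimal sketch, not the service's actual configuration code: `Settings`, `load_settings`, and the accepted boolean spellings are assumptions, while the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    # Accepted truthy spellings are an assumption, not taken from the service.
    return value.strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class Settings:
    secret_key: str
    jwt_secret_key: str
    database_url: Optional[str]
    hf_home: str
    torch_home: str
    whisper_cache: str
    hf_spaces: bool
    preload_gguf: bool
    max_new_tokens: int
    max_input_tokens: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the documented defaults."""
    required = ("SECRET_KEY", "JWT_SECRET_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        secret_key=env["SECRET_KEY"],
        jwt_secret_key=env["JWT_SECRET_KEY"],
        database_url=env.get("DATABASE_URL"),
        hf_home=env.get("HF_HOME", "/tmp/huggingface"),
        torch_home=env.get("TORCH_HOME", "/tmp/torch"),
        whisper_cache=env.get("WHISPER_CACHE", "/tmp/whisper"),
        hf_spaces=_as_bool(env.get("HF_SPACES", "false")),
        preload_gguf=_as_bool(env.get("PRELOAD_GGUF", "false")),
        max_new_tokens=int(env.get("MAX_NEW_TOKENS", "8192")),
        max_input_tokens=int(env.get("MAX_INPUT_TOKENS", "2048")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.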

---

## Appendix B: API Reference

### Complete Endpoint List

| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/` | Root endpoint |
| `GET` | `/health/live` | Liveness probe |
| `GET` | `/health/ready` | Readiness probe |
| `GET` | `/metrics` | Prometheus metrics |
| `POST` | `/upload` | Upload document |
| `POST` | `/transcribe` | Transcribe audio |
| `POST` | `/generate_patient_summary` | Generate patient summary |
| `POST` | `/api/generate_summary` | Generate text summary |
| `POST` | `/api/patient_summary_openvino` | OpenVINO summary |
| `POST` | `/extract_medical_data` | Extract medical data |
| `GET` | `/get_updated_medical_data` | Get processed data |
| `PUT` | `/update_medical_data` | Update medical data |
| `POST` | `/api/load_model` | Load model |
| `GET` | `/api/model_info` | Get model info |
| `POST` | `/api/switch_model` | Switch model |

---

## Appendix C: Troubleshooting Guide

### Common Issues

#### Model Loading Failures

**Symptom**: Model fails to load  
**Causes**:
- Insufficient memory
- Missing dependencies
- Network issues (download)

**Solutions**:
1. Check memory availability
2. Verify dependencies installed
3. Check network connectivity
4. Use fallback model
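
The fallback step can be sketched as trying an ordered list of candidates. `load_with_fallback` is a hypothetical helper, the loader is passed in rather than tied to a specific backend, and the mapping of exception types to the causes above is an assumption; real loaders may raise other errors.

```python
import logging
from typing import Any, Callable, Sequence

logger = logging.getLogger(__name__)


def load_with_fallback(
    loader: Callable[[str], Any], model_names: Sequence[str]
) -> Any:
    """Try each model name in order and return the first that loads."""
    errors = []
    for name in model_names:
        try:
            return loader(name)
        except (MemoryError, OSError, ImportError) as exc:
            # Assumed mapping: MemoryError for insufficient memory,
            # ImportError for missing dependencies, OSError for missing
            # files or network/download failures.
            logger.warning("Loading %s failed: %s", name, exc)
            errors.append(f"{name}: {exc}")
    raise RuntimeError("No model could be loaded; " + "; ".join(errors))
```

Ordering the candidates from preferred to smallest lets a memory-constrained instance degrade to a lighter model instead of failing outright.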

#### Token Limit Errors

**Symptom**: "Input exceeds token limit"  
**Causes**:
- Input too long
- Model context window exceeded

**Solutions**:
1. Reduce input size
2. Use chunking strategy
3. Switch to larger context model
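
A chunking strategy can be sketched as a sliding window with overlap. The example below counts whitespace-separated words as a rough stand-in for tokens, which is an assumption: a real implementation would count with the model's own tokenizer. `chunk_words` is a hypothetical helper.

```python
from typing import List


def chunk_words(text: str, max_words: int, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most max_words words, sharing overlap words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap carries a little context across chunk boundaries, at the cost of processing some words twice.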

#### Performance Issues

**Symptom**: Slow inference  
**Causes**:
- CPU-only inference
- Large model size
- Memory pressure

**Solutions**:
1. Enable GPU acceleration
2. Use quantized models (GGUF)
3. Reduce batch size
4. Clear model cache

---

## Appendix D: Glossary

| Term | Definition |
|------|------------|
| **PHI** | Protected Health Information |
| **HIPAA** | Health Insurance Portability and Accountability Act |
| **EHR** | Electronic Health Record |
| **FHIR** | Fast Healthcare Interoperability Resources |
| **HL7** | Health Level Seven (healthcare data exchange standards) |
| **GGUF** | GPT-Generated Unified Format (quantized models) |
| **OpenVINO** | Open Visual Inference and Neural Network Optimization |
| **T4** | NVIDIA Tesla T4 GPU |
| **LRU** | Least Recently Used (cache eviction) |
| **SSE** | Server-Sent Events |
| **ASGI** | Asynchronous Server Gateway Interface |

---

## Document Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-12-05 | System | Initial comprehensive documentation |

---

**End of Technical Architecture Documentation**