# HNTAI - Comprehensive Technical Architecture Documentation
**Version:** 1.0
**Last Updated:** December 5, 2025
**Project:** Medical Data Extraction & AI Processing Platform
---
## Table of Contents
1. [Executive Summary](#executive-summary)
2. [System Overview](#system-overview)
3. [Architecture Design](#architecture-design)
4. [Technology Stack](#technology-stack)
5. [Core Components](#core-components)
6. [AI/ML Architecture](#aiml-architecture)
7. [API Architecture](#api-architecture)
8. [Data Flow & Processing](#data-flow--processing)
9. [Database Design](#database-design)
10. [Security Architecture](#security-architecture)
11. [Deployment Architecture](#deployment-architecture)
12. [Performance Optimization](#performance-optimization)
13. [Monitoring & Observability](#monitoring--observability)
14. [Development Workflow](#development-workflow)
15. [Integration Patterns](#integration-patterns)
16. [Scalability Considerations](#scalability-considerations)
17. [Future Roadmap](#future-roadmap)
---
## 1. Executive Summary
HNTAI (Healthcare AI Text Analysis & Interpretation) is a medical AI platform for extracting, processing, and analyzing medical data. It combines document processing, automated PHI scrubbing with audit logging to support HIPAA compliance, and AI-powered patient summary generation across multiple model backends.
### Key Capabilities
- **Multi-format Document Processing**: PDF, DOCX, images, and audio transcription
- **HIPAA Compliance**: Automated PHI scrubbing with comprehensive audit logging
- **Multi-Model AI Support**: Transformers, OpenVINO, and GGUF models with automatic optimization
- **Scalable Architecture**: Kubernetes-ready with horizontal scaling capabilities
- **Production-Ready**: Health checks, metrics, structured logging, and error handling
### Target Deployment Environments
- **Hugging Face Spaces** (T4 Medium GPU)
- **Kubernetes Clusters** (On-premise or cloud)
- **Docker Containers** (Standalone or orchestrated)
- **Local Development** (CPU or GPU)
---
## 2. System Overview
### 2.1 Purpose & Scope
HNTAI turns raw medical documents into structured data and clinical summaries. The system is designed to:
1. **Extract** structured medical data from unstructured documents
2. **Anonymize** protected health information (PHI) for compliance
3. **Summarize** patient records into comprehensive clinical assessments
4. **Process** multi-modal medical data (text, images, audio)
### 2.2 Design Principles
- **Simplicity**: Clean, maintainable codebase with essential features
- **Flexibility**: Support for multiple AI model types and backends
- **Security**: HIPAA-compliant with comprehensive audit trails
- **Performance**: Optimized for T4 GPU with intelligent caching
- **Reliability**: Robust error handling and automatic fallback mechanisms
### 2.3 High-Level Architecture
```mermaid
graph TB
subgraph "Client Layer"
A[Web Client]
B[Mobile Client]
C[API Client]
end
subgraph "API Gateway"
D[FastAPI Application]
E[Health Endpoints]
F[Metrics Endpoint]
end
subgraph "Service Layer"
G[Document Processing Service]
H[PHI Scrubbing Service]
I[Patient Summary Service]
J[Model Management Service]
end
subgraph "AI/ML Layer"
K[Unified Model Manager]
L[Transformers Models]
M[GGUF Models]
N[OpenVINO Models]
O[Whisper Audio Models]
end
subgraph "Data Layer"
P[PostgreSQL - Audit Logs]
Q[File Storage]
R[Model Cache]
end
A --> D
B --> D
C --> D
D --> E
D --> F
D --> G
D --> H
D --> I
D --> J
G --> K
H --> K
I --> K
J --> K
K --> L
K --> M
K --> N
K --> O
D --> P
G --> Q
K --> R
```
---
## 3. Architecture Design
### 3.1 Architectural Style
HNTAI follows a **Layered Monolithic Architecture** with clear separation of concerns:
1. **Presentation Layer**: FastAPI routes and endpoints
2. **Service Layer**: Business logic and orchestration
3. **Agent Layer**: Specialized AI agents for specific tasks
4. **Utility Layer**: Shared utilities and helpers
5. **Data Layer**: Database and file storage
### 3.2 Component Architecture
```mermaid
graph LR
subgraph "FastAPI Application"
A[routes_fastapi.py]
B[app.py]
C[main.py]
end
subgraph "Agents"
D[patient_summary_agent.py]
E[phi_scrubber.py]
F[text_extractor.py]
G[medical_data_extractor.py]
end
subgraph "Services"
H[job_manager.py]
I[request_queue.py]
J[error_handler.py]
K[sse_generator.py]
end
subgraph "Utils"
L[unified_model_manager.py]
M[model_config.py]
N[robust_json_parser.py]
O[memory_manager.py]
end
A --> D
A --> E
A --> F
A --> G
A --> H
A --> I
D --> L
E --> L
F --> L
G --> L
L --> M
L --> O
```
### 3.3 Directory Structure
```
HNTAI/
├── services/
│   └── ai-service/
│       └── src/
│           └── ai_med_extract/
│               ├── agents/                  # AI agents for specific tasks
│               │   ├── patient_summary_agent.py
│               │   ├── phi_scrubber.py
│               │   ├── text_extractor.py
│               │   └── medical_data_extractor.py
│               ├── api/                     # FastAPI routes
│               │   └── routes_fastapi.py
│               ├── services/                # Business logic services
│               │   ├── job_manager.py
│               │   ├── request_queue.py
│               │   ├── error_handler.py
│               │   └── sse_generator.py
│               ├── utils/                   # Utilities and helpers
│               │   ├── unified_model_manager.py
│               │   ├── model_config.py
│               │   ├── robust_json_parser.py
│               │   ├── memory_manager.py
│               │   ├── openvino_summarizer_utils.py
│               │   └── patient_summary_utils.py
│               ├── app.py                   # FastAPI app factory
│               ├── main.py                  # Entry point
│               ├── health_endpoints.py      # Health checks
│               └── database_audit.py        # HIPAA audit logging
├── docs/                                    # Documentation
├── infra/                                   # Infrastructure configs
│   └── k8s/                                 # Kubernetes manifests
├── app.py                                   # HF Spaces entry point
├── Dockerfile                               # Multi-stage Docker build
├── Dockerfile.hf-spaces                     # HF Spaces optimized
├── .huggingface.yaml                        # HF Spaces config
├── models_config.json                       # Model configuration
├── requirements.txt                         # Python dependencies
└── README.md                                # Project documentation
```
---
## 4. Technology Stack
### 4.1 Core Technologies
| Category | Technology | Version | Purpose |
|----------|-----------|---------|---------|
| **Runtime** | Python | 3.10+ | Primary language |
| **Web Framework** | FastAPI | Latest | REST API framework |
| **ASGI Server** | Uvicorn | Latest | Production server |
| **AI/ML Framework** | PyTorch | 2.x | Deep learning |
| **Transformers** | Hugging Face Transformers | Latest | Model loading |
| **GGUF Support** | llama-cpp-python | Latest | Quantized models |
| **OpenVINO** | optimum-intel | Latest | Intel optimization |
| **Audio Processing** | Whisper | Latest | Speech-to-text |
### 4.2 Supporting Technologies
| Category | Technology | Purpose |
|----------|-----------|---------|
| **Database** | PostgreSQL 13+ | Audit logs (optional) |
| **Caching** | In-memory LRU | Model caching |
| **Document Processing** | PyPDF2, python-docx | PDF/DOCX parsing |
| **OCR** | Tesseract | Image text extraction |
| **Audio** | FFmpeg | Audio processing |
| **Containerization** | Docker | Deployment |
| **Orchestration** | Kubernetes | Scaling |
| **Monitoring** | Prometheus | Metrics |
### 4.3 Development Tools
- **Code Quality**: Black, isort, flake8, mypy
- **Testing**: pytest
- **Version Control**: Git
- **CI/CD**: GitHub Actions (under consideration)
- **Documentation**: Markdown, Mermaid diagrams
---
## 5. Core Components
### 5.1 FastAPI Application (`app.py`)
**Purpose**: Application factory and initialization
**Key Responsibilities**:
- Create and configure FastAPI application
- Initialize agents and services
- Register routes and middleware
- Configure CORS and security
**Key Functions**:
```python
def create_app(initialize: bool = True) -> FastAPI
def initialize_agents(app: FastAPI, preload_small_models: bool = False)
def run_dev() # Development server
```
### 5.2 API Routes (`routes_fastapi.py`)
**Purpose**: RESTful API endpoints
**Endpoint Categories**:
#### Health & Monitoring
- `GET /health/live` - Liveness probe
- `GET /health/ready` - Readiness probe
- `GET /metrics` - Prometheus metrics
#### Document Processing
- `POST /upload` - Upload and process documents
- `POST /transcribe` - Audio transcription
- `GET /get_updated_medical_data` - Retrieve processed data
- `PUT /update_medical_data` - Update medical records
#### AI Processing
- `POST /generate_patient_summary` - Generate patient summaries
- `POST /api/generate_summary` - Text summarization
- `POST /api/patient_summary_openvino` - OpenVINO summaries
- `POST /extract_medical_data` - Extract structured data
#### Model Management
- `POST /api/load_model` - Load specific models
- `GET /api/model_info` - Model information
- `POST /api/switch_model` - Switch models
### 5.3 Agents
#### 5.3.1 Patient Summary Agent (`patient_summary_agent.py`)
**Purpose**: Generate comprehensive patient summaries
**Key Features**:
- Dynamic model configuration
- Multi-section summary generation
- Chronological narrative building
- Clinical guideline evaluation
- Fallback text-based summarization
**Core Methods**:
```python
def configure_model(model_name: str, model_type: str)
def generate_clinical_summary(patient_data: Union[List[str], Dict])
def generate_patient_summary(patient_data: Union[List[str], Dict])
def build_chronological_narrative(patient_data: dict)
def format_clinical_output(raw_summary: str, patient_data: dict)
```
#### 5.3.2 PHI Scrubber (`phi_scrubber.py`)
**Purpose**: Remove protected health information
**Scrubbing Capabilities**:
- Patient names
- Medical record numbers (MRN)
- Dates of birth
- Phone numbers
- Email addresses
- Social Security Numbers
- Addresses
**Compliance**: HIPAA-compliant with audit logging
#### 5.3.3 Text Extractor (`text_extractor.py`)
**Purpose**: Extract text from various document formats
**Supported Formats**:
- PDF documents
- DOCX files
- Images (via OCR)
- Plain text
#### 5.3.4 Medical Data Extractor (`medical_data_extractor.py`)
**Purpose**: Extract structured medical data from text
**Extraction Targets**:
- Diagnoses
- Medications
- Procedures
- Lab results
- Vital signs
- Allergies
### 5.4 Services
#### 5.4.1 Job Manager (`job_manager.py`)
**Purpose**: Manage long-running jobs
**Features**:
- Job lifecycle management
- Progress tracking
- Status updates
- Result caching
- Cleanup of completed jobs
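The lifecycle features above could be sketched as a small in-memory registry. This is an illustration only, not the contents of `job_manager.py`: the `Job`, `JobStatus`, and `InMemoryJobManager` names, the status values, and the age-based cleanup rule are all assumptions.

```python
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional


class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class Job:
    job_id: str
    status: JobStatus = JobStatus.PENDING
    progress: float = 0.0                 # fraction complete, 0.0 to 1.0
    result: Optional[Any] = None          # cached result once completed
    error: Optional[str] = None
    updated_at: float = field(default_factory=time.time)


class InMemoryJobManager:
    """Hypothetical job registry; all names here are illustrative."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def create(self) -> Job:
        job = Job(job_id=uuid.uuid4().hex)
        self._jobs[job.job_id] = job
        return job

    def update(self, job_id: str, **changes: Any) -> Job:
        job = self._jobs[job_id]
        for key, value in changes.items():
            setattr(job, key, value)
        job.updated_at = time.time()
        return job

    def get(self, job_id: str) -> Optional[Job]:
        return self._jobs.get(job_id)

    def cleanup(self, max_age_seconds: float) -> int:
        """Drop finished jobs older than max_age_seconds; return how many were removed."""
        cutoff = time.time() - max_age_seconds
        finished = (JobStatus.COMPLETED, JobStatus.FAILED)
        stale = [jid for jid, job in self._jobs.items()
                 if job.status in finished and job.updated_at < cutoff]
        for jid in stale:
            del self._jobs[jid]
        return len(stale)
```

Because the registry lives in process memory, each replica would hold its own jobs; a shared store would be needed if job status must be visible across instances.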
#### 5.4.2 Request Queue (`request_queue.py`)
**Purpose**: Queue and prioritize requests
**Features**:
- Request queuing
- Priority handling
- Concurrency control
- Timeout management
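One way to combine these four features is an `asyncio.PriorityQueue` drained by a fixed number of workers. The sketch below is an assumption about how such a queue might look, not the code in `request_queue.py`; the `PriorityRequestQueue` class and its method names are hypothetical.

```python
import asyncio
import itertools
from typing import Any, Awaitable, Callable


class PriorityRequestQueue:
    """Hypothetical queue: a lower priority number runs first.

    Create instances inside a running event loop.
    """

    def __init__(self, max_concurrent: int = 2, timeout: float = 180.0) -> None:
        self._queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
        self._counter = itertools.count()     # tie-breaker keeps FIFO order per priority
        self._max_concurrent = max_concurrent
        self._timeout = timeout

    async def submit(self, priority: int,
                     handler: Callable[[], Awaitable[Any]]) -> asyncio.Future:
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((priority, next(self._counter), handler, future))
        return future

    async def _worker(self) -> None:
        while True:
            _, _, handler, future = await self._queue.get()
            try:
                # Timeout management: abort handlers that run too long.
                result = await asyncio.wait_for(handler(), timeout=self._timeout)
                if not future.done():
                    future.set_result(result)
            except Exception as exc:
                if not future.done():
                    future.set_exception(exc)
            finally:
                self._queue.task_done()

    async def run_until_empty(self) -> None:
        # Concurrency control: at most max_concurrent handlers run at once.
        workers = [asyncio.create_task(self._worker())
                   for _ in range(self._max_concurrent)]
        await self._queue.join()
        for worker in workers:
            worker.cancel()
        await asyncio.gather(*workers, return_exceptions=True)
```

With `max_concurrent=1`, a request submitted with priority 1 is handled before one submitted earlier with priority 5.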
#### 5.4.3 Error Handler (`error_handler.py`)
**Purpose**: Centralized error handling
**Features**:
- Error categorization
- Contextual logging
- Job error updates
- Graceful degradation
#### 5.4.4 SSE Generator (`sse_generator.py`)
**Purpose**: Server-Sent Events for real-time updates
**Features**:
- Progress streaming
- Status updates
- Error notifications
- Completion events
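As an illustration of these event types, the following sketch serializes job updates into the standard `text/event-stream` wire format (an `event:` line, a `data:` line, and a blank line as terminator). The event names (`progress`, `error`, `complete`) and the shape of the update dictionaries are assumptions, not taken from `sse_generator.py`.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def format_sse(event: str, data: Dict[str, Any]) -> str:
    """Serialize one event in the text/event-stream wire format."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def job_event_stream(updates: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Yield SSE frames for job updates; stop after a terminal status."""
    for update in updates:
        status = update.get("status")
        if status == "failed":
            yield format_sse("error", update)
            return
        if status == "completed":
            yield format_sse("complete", update)
            return
        yield format_sse("progress", update)
```

A web framework would typically wrap a generator like this in a streaming response with the `text/event-stream` media type.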
---
## 6. AI/ML Architecture
### 6.1 Unified Model Manager
**File**: `unified_model_manager.py`
**Purpose**: Single interface for all AI model types
**Architecture**:
```mermaid
classDiagram
class BaseModel {
<<abstract>>
+name: str
+model_type: str
+status: ModelStatus
+load()
+generate(prompt, config)*
+unload()
}
class TransformersModel {
+_model: Pipeline
+_load_implementation()
+generate(prompt, config)
}
class GGUFModel {
+_model: Llama
+filename: str
+_extract_filename()
+_load_implementation()
+generate(prompt, config)
}
class OpenVINOModel {
+_model: OVModelForCausalLM
+_tokenizer: AutoTokenizer
+_load_implementation()
+generate(prompt, config)
}
class FallbackModel {
+_load_implementation()
+generate(prompt, config)
}
class UnifiedModelManager {
+max_models: int
+max_memory_mb: int
+get_model(name, type)
+generate_text(name, prompt)
+cleanup()
}
BaseModel <|-- TransformersModel
BaseModel <|-- GGUFModel
BaseModel <|-- OpenVINOModel
BaseModel <|-- FallbackModel
UnifiedModelManager --> BaseModel
```
### 6.2 Model Types
#### 6.2.1 Transformers Models
**Backend**: Hugging Face Transformers
**Device**: GPU (CUDA) or CPU
**Use Cases**: General text generation, summarization
**Supported Models**:
- `microsoft/Phi-3-mini-4k-instruct`
- `facebook/bart-large-cnn` (deprecated)
- `google/flan-t5-large`
**Configuration**:
```python
{
"model_name": "microsoft/Phi-3-mini-4k-instruct",
"model_type": "text-generation",
"device_map": "auto",
"torch_dtype": "float16"
}
```
#### 6.2.2 GGUF Models
**Backend**: llama-cpp-python
**Device**: CPU or GPU (via Metal/CUDA)
**Use Cases**: Efficient inference with quantized models
**Supported Models**:
- `microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf` (PRIMARY)
**Configuration**:
```python
{
"model_path": "path/to/model.gguf",
"n_ctx": 8192,
"n_threads": 4,
"n_gpu_layers": 35 # GPU acceleration
}
```
#### 6.2.3 OpenVINO Models
**Backend**: Intel OpenVINO
**Device**: CPU (Intel optimized) or GPU
**Use Cases**: Production deployment on Intel hardware
**Supported Models**:
- `OpenVINO/Phi-3-mini-4k-instruct-fp16-ov`
**Configuration**:
```python
{
    "model_path": "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov",
    "device": "GPU"  # set to "CPU" when no supported GPU is available
}
```
### 6.3 Model Selection Strategy
```mermaid
flowchart TD
A[Request with model_name] --> B{Model specified?}
B -->|Yes| C{Model type?}
B -->|No| D[Use default: Phi-3 GGUF]
C -->|GGUF| E[Load GGUF Model]
C -->|OpenVINO| F[Load OpenVINO Model]
C -->|Transformers| G[Load Transformers Model]
C -->|Unknown| H[Auto-detect type]
E --> I{Load successful?}
F --> I
G --> I
H --> I
D --> I
I -->|Yes| J[Generate with model]
I -->|No| K[Try fallback model]
K --> L{Fallback successful?}
L -->|Yes| J
L -->|No| M[Use text-based fallback]
```
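The flowchart above can be read as a three-step fallback chain: try the requested (or default) model, then a fallback model, then text-based summarization. The sketch below is a hedged illustration of that logic; `detect_model_type`, `select_model`, the name-based detection rules, and the `loaders` mapping are hypothetical and do not describe the actual `UnifiedModelManager` API.

```python
from typing import Callable, Dict, Optional, Tuple

# Default taken from the flowchart ("Use default: Phi-3 GGUF").
DEFAULT_MODEL = (
    "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
    "gguf",
)

Loader = Callable[[str], object]


def detect_model_type(model_name: str) -> str:
    """Guess the backend from the model name (assumed heuristic)."""
    lowered = model_name.lower()
    if "gguf" in lowered:
        return "gguf"
    if lowered.startswith("openvino/") or lowered.endswith("-ov"):
        return "openvino"
    return "transformers"


def select_model(
    loaders: Dict[str, Loader],
    model_name: Optional[str] = None,
    model_type: Optional[str] = None,
    fallback: Tuple[str, str] = DEFAULT_MODEL,
) -> Tuple[Optional[object], str]:
    """Return (model, source) where source is 'primary', 'fallback', or 'text'."""
    if model_name is None:
        model_name, model_type = DEFAULT_MODEL
    elif model_type not in loaders:
        model_type = detect_model_type(model_name)   # "Unknown" branch: auto-detect
    attempts = (("primary", model_name, model_type), ("fallback",) + tuple(fallback))
    for source, name, kind in attempts:
        try:
            return loaders[kind](name), source
        except Exception:
            continue                                  # try the next option
    return None, "text"                               # caller uses text-based fallback
```

Returning the source alongside the model lets the caller log which path was taken, which is useful when diagnosing silent fallbacks.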
### 6.4 Model Configuration
**File**: `models_config.json`
```json
{
  "patient_summary_models": [
    {
      "name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
      "type": "gguf",
      "is_active": true,
      "cached": true,
      "description": "Phi-3 Mini GGUF Q4 quantized - PRIMARY MODEL",
      "use_case": "Fast patient summary generation with CPU/GPU",
      "repo_id": "microsoft/Phi-3-mini-4k-instruct-gguf",
      "filename": "Phi-3-mini-4k-instruct-q4.gguf"
    }
  ],
  "runtime_behavior": {
    "allow_runtime_downloads": true,
    "cache_runtime_downloads": true,
    "fallback_to_cached": true
  }
}
```
### 6.5 Token Management
**Token Limit Handling**:
- Automatic token counting (heuristic: ~4 chars/token)
- Pre-generation validation
- Token limit error detection
- Graceful degradation
**Token Limits by Model**:
- Phi-3 models: 4096 tokens (context window)
- BART models: 1024 tokens
- T5 models: 512 tokens
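The character-based heuristic and the pre-generation check can be sketched as follows. The limits are the ones listed above; the function names, the family keys, and the `reserved_output_tokens` parameter are assumptions made for illustration.

```python
import math

# Context limits from the list above, keyed by an assumed family label.
CONTEXT_LIMITS = {"phi-3": 4096, "bart": 1024, "t5": 512}
CHARS_PER_TOKEN = 4  # heuristic: roughly 4 characters per token


def estimate_tokens(text: str) -> int:
    """Approximate the token count without running a tokenizer."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_context(text: str, model_family: str, reserved_output_tokens: int = 0) -> bool:
    """Pre-generation validation: does the prompt plus reserved output fit?"""
    limit = CONTEXT_LIMITS[model_family]
    return estimate_tokens(text) + reserved_output_tokens <= limit
```

The heuristic is only an estimate; a real tokenizer can produce noticeably more tokens for text dense in numbers or abbreviations, so leaving headroom is prudent.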
### 6.6 Generation Configuration
```python
from dataclasses import dataclass


@dataclass
class GenerationConfig:
    max_tokens: int = 8192    # Maximum output tokens (bounded in practice by the model's context window)
    min_tokens: int = 50      # Minimum output tokens
    temperature: float = 0.3  # Low temperature for more consistent medical output
    top_p: float = 0.9        # Nucleus sampling
    timeout: float = 180.0    # Generation timeout in seconds (tuned for T4)
    stream: bool = False      # Streaming support
```
### 6.7 T4 GPU Optimizations
**Hardware Target**: NVIDIA T4 Medium (16GB GPU, 16GB RAM)
**Optimizations**:
1. **Memory Management**:
- Max 2 models in memory
- Automatic model unloading
- GPU memory clearing
- Garbage collection
2. **Model Loading**:
- Lazy loading (on-demand)
- Intelligent caching
- LRU eviction policy
3. **Inference**:
- FP16 precision
- Batch size: 1
- Context window: 8192 tokens (`n_ctx` setting; note that Section 6.5 lists a 4096-token limit for Phi-3 models)
- GPU layer offloading (GGUF)
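The "max 2 models" rule with lazy loading and LRU eviction could look roughly like the sketch below. `LRUModelCache` and its `on_evict` hook are hypothetical names; in a real deployment the hook would be the place to unload the model, run garbage collection, and clear GPU memory.

```python
from collections import OrderedDict
from typing import Any, Callable, List, Optional


class LRUModelCache:
    """Hypothetical cache that keeps at most `max_models` loaded models."""

    def __init__(self, max_models: int = 2,
                 on_evict: Optional[Callable[[str, Any], None]] = None) -> None:
        if max_models < 1:
            raise ValueError("max_models must be at least 1")
        self._max_models = max_models
        self._on_evict = on_evict or (lambda name, model: None)
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str, loader: Callable[[str], Any]) -> Any:
        if name in self._models:
            self._models.move_to_end(name)        # mark as most recently used
            return self._models[name]
        # Evict before loading so peak memory stays bounded.
        while len(self._models) >= self._max_models:
            old_name, old_model = self._models.popitem(last=False)
            self._on_evict(old_name, old_model)
        model = loader(name)                       # lazy, on-demand load
        self._models[name] = model
        return model

    def loaded(self) -> List[str]:
        """Model names ordered from least to most recently used."""
        return list(self._models)
```

Evicting before loading, rather than after, matters on a 16GB GPU: loading a third model while two are still resident could exceed memory before the eviction ever runs.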
---
## 7. API Architecture
### 7.1 RESTful Design
**Principles**:
- Resource-oriented URLs
- HTTP methods for CRUD operations
- JSON request/response format
- Stateless communication
- Proper HTTP status codes
### 7.2 Request/Response Flow
```mermaid
sequenceDiagram
participant C as Client
participant A as API Gateway
participant S as Service Layer
participant M as Model Manager
participant D as Database
C->>A: POST /generate_patient_summary
A->>A: Validate request
A->>S: Create job
S->>D: Log job creation
A-->>C: 202 Accepted (job_id)
S->>M: Load model
M->>M: Check cache
M->>M: Load if needed
M-->>S: Model ready
S->>M: Generate summary
M->>M: Process prompt
M-->>S: Generated text
S->>D: Log completion
S->>A: Update job status
A-->>C: SSE: Progress updates
C->>A: GET /job/{job_id}
A->>S: Get job status
S->>D: Retrieve job
S-->>A: Job result
A-->>C: 200 OK (result)
```
### 7.3 Authentication & Authorization
**Current State**: Basic API key authentication (optional)
**Planned Enhancements**:
- JWT-based authentication
- Role-based access control (RBAC)
- OAuth2 integration
- API rate limiting
### 7.4 Error Handling
**Error Response Format**:
```json
{
  "error": {
    "code": "MODEL_LOAD_FAILED",
    "message": "Failed to load model: microsoft/Phi-3-mini-4k-instruct",
    "details": {
      "model_name": "microsoft/Phi-3-mini-4k-instruct",
      "error_type": "initialization_error",
      "timestamp": "2025-12-05T17:23:52Z"
    }
  }
}
```
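A small helper can keep every error response in this shape. The sketch below assumes a hypothetical `build_error_response` function; only the field layout comes from the example above.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def build_error_response(code: str, message: str, **details: Any) -> Dict[str, Any]:
    """Build an error body matching the documented format (helper name is hypothetical)."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "error": {
            "code": code,
            "message": message,
            "details": {**details, "timestamp": timestamp},
        }
    }
```

For example, `build_error_response("MODEL_LOAD_FAILED", "Failed to load model", model_name="...")` yields a body with the same three top-level error fields as the sample.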
**HTTP Status Codes**:
- `200 OK` - Successful request
- `202 Accepted` - Job created
- `400 Bad Request` - Invalid input
- `404 Not Found` - Resource not found
- `500 Internal Server Error` - Server error
- `503 Service Unavailable` - Service degraded
### 7.5 Rate Limiting
**Strategy**: Token bucket algorithm (rate limiting is listed under planned enhancements in Section 7.3)
**Limits**:
- 100 requests/minute per IP
- 1000 requests/hour per API key
- Burst allowance: 20 requests
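For reference, a token bucket refills at a steady rate up to a fixed capacity, and each request spends one token. The sketch below is a generic, single-process illustration; the `TokenBucket` class and its injectable `clock` parameter are assumptions, and nothing here describes HNTAI's actual limiter.

```python
import time
from typing import Callable


class TokenBucket:
    """Generic token bucket: `capacity` sets the burst size, `rate_per_second` the refill rate."""

    def __init__(self, rate_per_second: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_second
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)     # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

Under the limits above, a per-IP bucket would be created as `TokenBucket(rate_per_second=100 / 60, capacity=20)`, with one bucket kept per client address.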
---
## 8. Data Flow & Processing
### 8.1 Document Processing Pipeline
```mermaid
flowchart LR
A[Upload Document] --> B{File Type?}
B -->|PDF| C[PDF Parser]
B -->|DOCX| D[DOCX Parser]
B -->|Image| E[OCR Engine]
B -->|Audio| F[Whisper Transcription]
C --> G[Text Extraction]
D --> G
E --> G
F --> G
G --> H[PHI Scrubbing]
H --> I[Medical Data Extraction]
I --> J[Store Processed Data]
J --> K[Return Results]
```
### 8.2 Patient Summary Generation Flow
```mermaid
flowchart TD
A[Patient Data Input] --> B[Parse EHR Data]
B --> C[Convert to Plain Text]
C --> D{Data Size Check}
D -->|Small| E[Single-pass Generation]
D -->|Large| F[Chunking Strategy]
F --> G[Chunk by Date/Size]
G --> H[Process Chunks in Parallel]
H --> I[Combine Chunk Summaries]
E --> J[Generate with Model]
I --> J
J --> K[Format Clinical Output]
K --> L[Evaluate Against Guidelines]
L --> M[Return Summary]
```
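The "Large" branch of this flow (chunk, summarize chunks in parallel, combine) can be sketched as below. This is a size-based illustration only: `chunk_text`, `summarize_large`, the paragraph-boundary splitting rule, and the default sizes are assumptions, and date-based chunking is not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Split on blank lines, packing paragraphs into chunks of at most max_chars."""
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Hard-split a paragraph that is too long on its own.
        while len(paragraph) > max_chars:
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks


def summarize_large(text: str, summarize: Callable[[str], str],
                    max_chars: int = 8000, workers: int = 2) -> str:
    """Single pass for small inputs; otherwise summarize chunks, then the combined result."""
    chunks = chunk_text(text, max_chars)
    if len(chunks) <= 1:
        return summarize(text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize, chunks))   # preserves chunk order
    return summarize("\n\n".join(partials))
```

`pool.map` returns results in input order, so the combined text keeps the chronology of the source even though chunks finish at different times.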
### 8.3 Data Transformation
**Input Formats**:
- Raw EHR JSON
- HL7 FHIR resources
- Plain text documents
- Scanned images
- Audio recordings
**Output Formats**:
- Structured JSON
- Clinical summary (Markdown)
- FHIR-compliant resources
- Audit logs
### 8.4 Caching Strategy
**Multi-Level Caching**:
1. **Model Cache**: Loaded models in memory
2. **Result Cache**: Generated summaries (LRU)
3. **File Cache**: Processed documents
4. **Hugging Face Cache**: Downloaded models
**Cache Invalidation**:
- Time-based expiration
- Manual invalidation
- Memory pressure-based eviction
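Time-based expiration and manual invalidation can be layered on an LRU result cache as in the sketch below. `ResultCache`, its default TTL, and the injectable `clock` are hypothetical; memory-pressure eviction is approximated here by a simple item cap.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class ResultCache:
    """Hypothetical LRU cache whose entries also expire after ttl_seconds."""

    def __init__(self, max_items: int = 100, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_items = max_items
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: "OrderedDict[Hashable, tuple]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:       # time-based expiration
            del self._items[key]
            return None
        self._items.move_to_end(key)          # mark as most recently used
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = (self._clock() + self._ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self._max_items:
            self._items.popitem(last=False)   # evict least recently used

    def invalidate(self, key: Hashable) -> None:
        """Manual invalidation; a missing key is ignored."""
        self._items.pop(key, None)
```

For patient summaries, the cache key would need to cover both the input data and the model used, otherwise a summary from one model could be served for a request that asked for another.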
---
## 9. Database Design
### 9.1 Database Schema
**Primary Database**: PostgreSQL (optional, for audit logs)
#### Audit Logs Table
```sql
CREATE TABLE audit_logs (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMP NOT NULL DEFAULT NOW(),
    user_id VARCHAR(255),
    action VARCHAR(100) NOT NULL,
    resource_type VARCHAR(100),
    resource_id VARCHAR(255),
    phi_accessed BOOLEAN DEFAULT FALSE,
    ip_address INET,
    user_agent TEXT,
    request_data JSONB,
    response_status INTEGER,
    error_message TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_audit_timestamp ON audit_logs(timestamp);
CREATE INDEX idx_audit_user ON audit_logs(user_id);
CREATE INDEX idx_audit_action ON audit_logs(action);
CREATE INDEX idx_audit_phi ON audit_logs(phi_accessed);
```
### 9.2 Data Models
**Patient Data Model** (In-memory):
```python
{
    "patient_id": "string",
    "demographics": {
        "name": "string",
        "dob": "date",
        "gender": "string",
        "mrn": "string"
    },
    "visits": [
        {
            "visit_id": "string",
            "date": "datetime",
            "chief_complaint": "string",
            "diagnoses": ["string"],
            "medications": ["string"],
            "procedures": ["string"],
            "vitals": {},
            "labs": []
        }
    ]
}
```
### 9.3 File Storage
**Storage Strategy**: Local filesystem or cloud storage
**Directory Structure**:
```
/data/
├── uploads/ # Uploaded documents
├── processed/ # Processed documents
├── cache/ # Temporary cache
└── models/ # Model files
```
---
## 10. Security Architecture
### 10.1 HIPAA Compliance
**Requirements Met**:
1. **Access Controls**: Authentication and authorization
2. **Audit Logging**: Comprehensive activity logs
3. **Data Encryption**: In-transit and at-rest
4. **PHI Scrubbing**: Automated anonymization
5. **Secure Communication**: HTTPS/TLS
### 10.2 PHI Scrubbing
**Scrubbing Patterns**:
```python
PATTERNS = {
    "name": r'\b[A-Z][a-z]+ [A-Z][a-z]+\b',
    "mrn": r'\bMRN[:\s]*\d{6,10}\b',
    "dob": r'\b\d{1,2}/\d{1,2}/\d{2,4}\b',
    "phone": r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    "email": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    "ssn": r'\b\d{3}-\d{2}-\d{4}\b'
}
```
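Applying such patterns amounts to a sequence of regex substitutions. The sketch below is an illustration rather than the code in `phi_scrubber.py`: the `scrub_phi` function and the `[LABEL]` placeholder style are assumptions. The patterns mirror the table above, with the email top-level-domain class written as `[A-Za-z]`.

```python
import re
from typing import Dict, Tuple

# Order matters: structured identifiers are replaced before the broad name pattern.
PHI_PATTERNS = {
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "dob": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "name": re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),
}


def scrub_phi(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace each match with a [LABEL] placeholder and count replacements per type."""
    counts: Dict[str, int] = {}
    for label, pattern in PHI_PATTERNS.items():
        text, replaced = pattern.subn(f"[{label.upper()}]", text)
        counts[label] = replaced
    return text, counts
```

The per-type counts can feed the audit log without storing the removed values. Note that the name pattern matches any two adjacent capitalized words (for example "Blood Pressure"), and the date pattern matches every date rather than only birth dates, so regex-only scrubbing tends to both over-match and miss cases.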
### 10.3 Container Security
**Security Measures**:
- Non-root user execution
- Read-only root filesystem
- Resource limits (CPU, memory)
- Network policies
- Secrets management
- Minimal base images
### 10.4 API Security
**Security Headers**:
```python
{
"X-Content-Type-Options": "nosniff",
"X-Frame-Options": "DENY",
"X-XSS-Protection": "1; mode=block",
"Strict-Transport-Security": "max-age=31536000"
}
```
---
## 11. Deployment Architecture
### 11.1 Deployment Options
#### 11.1.1 Hugging Face Spaces
**Configuration**: `.huggingface.yaml`
```yaml
runtime: docker
sdk: docker
python_version: "3.10"
build:
  dockerfile: Dockerfile.hf-spaces
  cache: true
hardware:
  gpu: t4-medium # 16GB GPU RAM, 16GB System RAM
env:
  - SPACE_ID=$SPACE_ID
  - HF_HOME=/app/.cache/huggingface
  - TORCH_HOME=/app/.cache/torch
  - MODEL_CACHE_DIR=/app/models
  - PRELOAD_GGUF=true
  - HF_SPACES=true
```
**Optimizations**:
- Pre-cached models in Docker image
- Lazy model loading
- Memory-efficient inference
- Automatic GPU detection
#### 11.1.2 Kubernetes
**Deployment Manifest**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hntai-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hntai
  template:
    metadata:
      labels:
        app: hntai
    spec:
      containers:
        - name: hntai
          image: hntai:latest
          ports:
            - containerPort: 7860
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
          livenessProbe:
            httpGet:
              path: /health/live
              port: 7860
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 7860
            initialDelaySeconds: 10
            periodSeconds: 5
```
#### 11.1.3 Docker
**Multi-Stage Dockerfile**:
```dockerfile
# Stage 1: Builder
FROM python:3.10-slim AS builder
RUN apt-get update && apt-get install -y build-essential
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt
# Stage 2: Runtime
FROM python:3.10-slim AS runtime
COPY --from=builder /install /usr/local
WORKDIR /app
COPY . .
ENV PYTHONUNBUFFERED=1
EXPOSE 7860
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
```
### 11.2 Scaling Strategy
**Horizontal Scaling**:
- Multiple replicas behind load balancer
- Stateless design for easy scaling
- Shared model cache (optional)
**Vertical Scaling**:
- Increase CPU/memory per instance
- GPU acceleration for inference
- Larger model support
### 11.3 High Availability
**Components**:
1. **Load Balancer**: Distribute traffic
2. **Health Checks**: Automatic failover
3. **Auto-scaling**: Based on CPU/memory
4. **Graceful Shutdown**: Drain connections
---
## 12. Performance Optimization
### 12.1 Model Optimization
**Techniques**:
1. **Quantization**: GGUF Q4 models (4-bit)
2. **Precision**: FP16 for GPU inference
3. **Batching**: Batch size optimization
4. **Caching**: Model and result caching
5. **Lazy Loading**: On-demand model loading
### 12.2 Memory Management
**Strategies**:
- Automatic garbage collection
- GPU memory clearing
- Model unloading (LRU)
- Memory pressure monitoring
**Memory Limits**:
- T4 Medium: 16GB GPU, 16GB RAM
- Max 2 models in memory
- Automatic eviction at 80% usage
### 12.3 Inference Optimization
**T4-Specific Optimizations**:
```python
{
"max_models": 2,
"max_memory_mb": 14000,
"n_ctx": 8192,
"n_threads": 4,
"n_gpu_layers": 35,
"torch_dtype": "float16",
"device_map": "auto"
}
```
### 12.4 Caching Strategy
**Cache Hierarchy**:
1. **L1 - Model Cache**: In-memory loaded models
2. **L2 - Result Cache**: Generated summaries (LRU, 100 items)
3. **L3 - File Cache**: Processed documents (disk)
4. **L4 - HF Cache**: Downloaded models (disk)
### 12.5 Performance Metrics
**Target Metrics**:
- Model load time: < 10 seconds
- Summary generation: < 60 seconds (small), < 180 seconds (large)
- API response time: < 100ms (excluding generation)
- Memory usage: < 80% of available
- GPU utilization: > 70% during inference
---
## 13. Monitoring & Observability
### 13.1 Health Checks
**Liveness Probe** (`/health/live`):
```json
{
  "status": "alive",
  "timestamp": "2025-12-05T17:23:52Z"
}
```
**Readiness Probe** (`/health/ready`):
```json
{
  "status": "ready",
  "checks": {
    "database": "ok",
    "model_manager": "ok",
    "file_storage": "ok"
  },
  "timestamp": "2025-12-05T17:23:52Z"
}
```
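A readiness handler typically runs each dependency check and reports "ready" only if all pass. The sketch below shows one possible aggregation; `readiness_report`, the `"failed"` and `"not_ready"` values, and the choice of HTTP 503 for a failing probe are assumptions rather than the behavior of `health_endpoints.py`.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Tuple


def readiness_report(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, Any]]:
    """Run every check and return (http_status, body) in the documented shape."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:
            # A crashing check counts as not ready instead of failing the probe itself.
            results[name] = f"error: {exc}"
    ready = all(value == "ok" for value in results.values())
    body = {
        "status": "ready" if ready else "not_ready",
        "checks": results,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return (200 if ready else 503), body
```

Since the database is optional in this architecture, a deployment without PostgreSQL would simply omit the `database` entry from the checks it passes in.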
### 13.2 Metrics
**Prometheus Metrics** (`/metrics`):
```
# Model metrics
model_load_time_seconds{model_name="phi-3-gguf"} 8.5
model_inference_time_seconds{model_name="phi-3-gguf"} 45.2
model_memory_usage_bytes{model_name="phi-3-gguf"} 4294967296
# API metrics
http_requests_total{method="POST",endpoint="/generate_patient_summary"} 1234
http_request_duration_seconds{method="POST",endpoint="/generate_patient_summary"} 52.3
# System metrics
memory_usage_percent 65.2
gpu_memory_usage_percent 72.1
cpu_usage_percent 45.8
```
### 13.3 Logging
**Structured Logging**:
```json
{
  "timestamp": "2025-12-05T17:23:52Z",
  "level": "INFO",
  "logger": "ai_med_extract.agents.patient_summary_agent",
  "message": "Generated patient summary",
  "context": {
    "job_id": "abc123",
    "model_name": "phi-3-gguf",
    "duration_seconds": 45.2,
    "token_count": 2048
  }
}
```
**Log Levels**:
- `DEBUG`: Detailed diagnostic information
- `INFO`: General informational messages
- `WARNING`: Warning messages
- `ERROR`: Error messages
- `CRITICAL`: Critical failures
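Log entries in the structure shown above can be produced with the standard `logging` module and a custom formatter. The sketch below is one possible approach; the `JsonFormatter` class and the convention of passing a `context` dictionary through `extra` are assumptions.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object (hypothetical formatter)."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # `context` is attached by callers via logger.info(..., extra={"context": {...}})
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)
```

To use it, attach the formatter to a handler with `handler.setFormatter(JsonFormatter())` and log with `logger.info("Generated patient summary", extra={"context": {"job_id": "abc123"}})`. Context values should be scrubbed identifiers only, since log output is a common place for PHI to leak.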
### 13.4 Audit Logging
**HIPAA Audit Trail**:
```json
{
  "timestamp": "2025-12-05T17:23:52Z",
  "user_id": "user123",
  "action": "PHI_ACCESS",
  "resource_type": "patient_summary",
  "resource_id": "patient456",
  "phi_accessed": true,
  "ip_address": "192.168.1.100",
  "user_agent": "Mozilla/5.0...",
  "request_data": {...},
  "response_status": 200
}
```
---
## 14. Development Workflow
### 14.1 Local Development
**Setup**:
```bash
# Clone repository
git clone <repository-url>
cd HNTAI
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set environment variables
export DATABASE_URL="postgresql://user:pass@localhost:5432/hntai"
export SECRET_KEY="your-secret-key"
export HF_HOME="/tmp/huggingface"
# Run development server
cd services/ai-service/src
python -m ai_med_extract.app run_dev
```
### 14.2 Testing
**Test Structure**:
```
tests/
├── unit/
│   ├── test_agents.py
│   ├── test_model_manager.py
│   └── test_utils.py
├── integration/
│   ├── test_api.py
│   └── test_workflows.py
└── conftest.py
```
**Running Tests**:
```bash
# Unit tests
python -m pytest tests/unit/
# Integration tests
python -m pytest tests/integration/
# Coverage report
python -m pytest --cov=ai_med_extract tests/
```
### 14.3 Code Quality
**Tools**:
```bash
# Format code
black .
isort .
# Lint code
flake8 .
# Type checking
mypy services/ai-service/src/ai_med_extract/
```
### 14.4 Git Workflow
**Branching Strategy**:
- `main`: Production-ready code
- `develop`: Integration branch
- `feature/*`: Feature branches
- `bugfix/*`: Bug fix branches
- `hotfix/*`: Production hotfixes
**Commit Convention**:
```
<type>(<scope>): <subject>

<body>

<footer>
```
Types: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore`
---
## 15. Integration Patterns
### 15.1 External System Integration
**Integration Points**:
1. **EHR Systems**: HL7, FHIR APIs
2. **Document Management**: File uploads, cloud storage
3. **Authentication**: OAuth2, SAML
4. **Monitoring**: Prometheus, Grafana
5. **Logging**: ELK Stack, CloudWatch
### 15.2 API Integration
**Client Libraries** (Planned):
- Python SDK
- JavaScript SDK
- REST API documentation (OpenAPI/Swagger)
**Example Integration**:
```python
import time

import requests

BASE_URL = "https://api.hntai.com"
HEADERS = {"Authorization": "Bearer <token>"}

# Upload document (the context manager closes the file afterwards)
with open("document.pdf", "rb") as fh:
    response = requests.post(f"{BASE_URL}/upload", files={"file": fh}, headers=HEADERS)
response.raise_for_status()

# Generate patient summary
response = requests.post(
    f"{BASE_URL}/generate_patient_summary",
    json={
        "patient_data": {...},  # placeholder: supply the patient record here
        "model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
        "model_type": "gguf"
    },
    headers=HEADERS,
)
response.raise_for_status()
job_id = response.json()["job_id"]

# Poll for results every 5 seconds until the job completes
while True:
    response = requests.get(f"{BASE_URL}/job/{job_id}", headers=HEADERS)
    if response.json()["status"] == "completed":
        break
    time.sleep(5)
```
### 15.3 Webhook Support
**Planned Feature**: Webhook notifications for job completion
```json
{
  "event": "job.completed",
  "job_id": "abc123",
  "timestamp": "2025-12-05T17:23:52Z",
  "data": {
    "status": "completed",
    "result": {...}
  }
}
```
---
## 16. Scalability Considerations
### 16.1 Horizontal Scaling
**Strategies**:
1. **Stateless Design**: No session state in application
2. **Load Balancing**: Distribute requests across instances
3. **Shared Cache**: Redis for distributed caching
4. **Message Queue**: RabbitMQ/Kafka for async processing
### 16.2 Vertical Scaling
**Resource Scaling**:
- CPU: 2-8 cores per instance
- Memory: 8-32 GB per instance
- GPU: T4, V100, A100 for inference
### 16.3 Database Scaling
**Strategies**:
1. **Read Replicas**: For audit log queries
2. **Partitioning**: Time-based partitioning for logs
3. **Indexing**: Optimize query performance
4. **Archiving**: Move old logs to cold storage
### 16.4 Model Serving
**Scaling Options**:
1. **Model Replication**: Same model on multiple instances
2. **Model Sharding**: Different models on different instances
3. **Model Versioning**: A/B testing with multiple versions
4. **Dedicated Inference**: Separate inference service
---
## 17. Future Roadmap
### 17.1 Short-Term (3-6 months)
1. **Enhanced Model Support**:
- Support for Llama 3, Mistral models
- Fine-tuned medical models
- Multi-modal models (text + images)
2. **Improved Performance**:
- Model quantization (INT8, INT4)
- Batch inference support
- Streaming responses
3. **Additional Features**:
- Real-time collaboration
- Version control for summaries
- Template-based summaries
### 17.2 Medium-Term (6-12 months)
1. **Advanced AI Capabilities**:
- Multi-agent orchestration
- Retrieval-Augmented Generation (RAG)
- Knowledge graph integration
2. **Enterprise Features**:
- Multi-tenancy support
- Advanced RBAC
- SSO integration
- Compliance reporting
3. **Platform Enhancements**:
- Web UI for management
- Mobile app support
- Plugin architecture
### 17.3 Long-Term (12+ months)
1. **AI/ML Advancements**:
- Custom model training pipeline
- Federated learning support
- Explainable AI (XAI)
2. **Ecosystem Integration**:
- FHIR server integration
- HL7 v3 support
- DICOM image analysis
3. **Global Expansion**:
- Multi-language support
- Regional compliance (GDPR, etc.)
- Edge deployment
---
## Appendix A: Configuration Reference
### Environment Variables
| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `DATABASE_URL` | PostgreSQL connection string | - | No |
| `SECRET_KEY` | Application secret key | - | Yes |
| `JWT_SECRET_KEY` | JWT signing key | - | Yes |
| `HF_HOME` | Hugging Face cache directory | `/tmp/huggingface` | No |
| `TORCH_HOME` | PyTorch cache directory | `/tmp/torch` | No |
| `WHISPER_CACHE` | Whisper model cache | `/tmp/whisper` | No |
| `HF_SPACES` | Hugging Face Spaces mode | `false` | No |
| `PRELOAD_GGUF` | Preload GGUF models | `false` | No |
| `MAX_NEW_TOKENS` | Max output tokens | `8192` | No |
| `MAX_INPUT_TOKENS` | Max input tokens | `2048` | No |
---
## Appendix B: API Reference
### Complete Endpoint List
| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/` | Root endpoint |
| `GET` | `/health/live` | Liveness probe |
| `GET` | `/health/ready` | Readiness probe |
| `GET` | `/metrics` | Prometheus metrics |
| `POST` | `/upload` | Upload document |
| `POST` | `/transcribe` | Transcribe audio |
| `POST` | `/generate_patient_summary` | Generate patient summary |
| `POST` | `/api/generate_summary` | Generate text summary |
| `POST` | `/api/patient_summary_openvino` | OpenVINO summary |
| `POST` | `/extract_medical_data` | Extract medical data |
| `GET` | `/get_updated_medical_data` | Get processed data |
| `PUT` | `/update_medical_data` | Update medical data |
| `POST` | `/api/load_model` | Load model |
| `GET` | `/api/model_info` | Get model info |
| `POST` | `/api/switch_model` | Switch model |
---
## Appendix C: Troubleshooting Guide
### Common Issues
#### Model Loading Failures
**Symptom**: Model fails to load
**Causes**:
- Insufficient memory
- Missing dependencies
- Network issues (download)
**Solutions**:
1. Check memory availability
2. Verify dependencies installed
3. Check network connectivity
4. Use fallback model
#### Token Limit Errors
**Symptom**: "Input exceeds token limit"
**Causes**:
- Input too long
- Model context window exceeded
**Solutions**:
1. Reduce input size
2. Use chunking strategy
3. Switch to larger context model
#### Performance Issues
**Symptom**: Slow inference
**Causes**:
- CPU-only inference
- Large model size
- Memory pressure
**Solutions**:
1. Enable GPU acceleration
2. Use quantized models (GGUF)
3. Reduce batch size
4. Clear model cache
---
## Appendix D: Glossary
| Term | Definition |
|------|------------|
| **PHI** | Protected Health Information |
| **HIPAA** | Health Insurance Portability and Accountability Act |
| **EHR** | Electronic Health Record |
| **FHIR** | Fast Healthcare Interoperability Resources |
| **HL7** | Health Level 7 (healthcare data standard) |
| **GGUF** | GPT-Generated Unified Format (model file format used by llama.cpp, commonly for quantized models) |
| **OpenVINO** | Open Visual Inference and Neural Network Optimization |
| **T4** | NVIDIA Tesla T4 GPU |
| **LRU** | Least Recently Used (cache eviction) |
| **SSE** | Server-Sent Events |
| **ASGI** | Asynchronous Server Gateway Interface |
---
## Document Revision History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-12-05 | System | Initial comprehensive documentation |
---
**End of Technical Architecture Documentation**