
	# HNTAI - Comprehensive Technical Architecture Documentation

	Version: 1.0
	Last Updated: December 5, 2025
	Project: Medical Data Extraction & AI Processing Platform

	---

	## Table of Contents

	1. [Executive Summary](#executive-summary)
	2. [System Overview](#system-overview)
	3. [Architecture Design](#architecture-design)
	4. [Technology Stack](#technology-stack)
	5. [Core Components](#core-components)
	6. [AI/ML Architecture](#aiml-architecture)
	7. [API Architecture](#api-architecture)
	8. [Data Flow & Processing](#data-flow--processing)
	9. [Database Design](#database-design)
	10. [Security Architecture](#security-architecture)
	11. [Deployment Architecture](#deployment-architecture)
	12. [Performance Optimization](#performance-optimization)
	13. [Monitoring & Observability](#monitoring--observability)
	14. [Development Workflow](#development-workflow)
	15. [Integration Patterns](#integration-patterns)
	16. [Scalability Considerations](#scalability-considerations)
	17. [Future Roadmap](#future-roadmap)

	---

	## 1. Executive Summary

	HNTAI (Healthcare AI Text Analysis & Interpretation) is a production-ready medical AI platform for extracting, processing, and analyzing medical data. It provides HIPAA-compliant document processing, PHI scrubbing, and AI-powered patient summary generation across three model backends: Transformers, OpenVINO, and GGUF.

	### Key Capabilities

	- Multi-format Document Processing: PDF, DOCX, images, and audio transcription
	- HIPAA Compliance: Automated PHI scrubbing with comprehensive audit logging
	- Multi-Model AI Support: Transformers, OpenVINO, and GGUF models with automatic optimization
	- Scalable Architecture: Kubernetes-ready with horizontal scaling capabilities
	- Production-Ready: Health checks, metrics, structured logging, and error handling

	### Target Deployment Environments

	- Hugging Face Spaces (T4 Medium GPU)
	- Kubernetes Clusters (On-premise or cloud)
	- Docker Containers (Standalone or orchestrated)
	- Local Development (CPU or GPU)

	---

	## 2. System Overview

	### 2.1 Purpose & Scope

	HNTAI serves as a comprehensive medical AI platform that bridges the gap between raw medical documents and actionable clinical insights. The system is designed to:

	1. Extract structured medical data from unstructured documents
	2. Anonymize protected health information (PHI) for compliance
	3. Summarize patient records into comprehensive clinical assessments
	4. Process multi-modal medical data (text, images, audio)

	### 2.2 Design Principles

	- Simplicity: Clean, maintainable codebase with essential features
	- Flexibility: Support for multiple AI model types and backends
	- Security: HIPAA-compliant with comprehensive audit trails
	- Performance: Optimized for T4 GPU with intelligent caching
	- Reliability: Robust error handling and automatic fallback mechanisms

	### 2.3 High-Level Architecture

	```mermaid
	graph TB
	subgraph "Client Layer"
	A[Web Client]
	B[Mobile Client]
	C[API Client]
	end

	subgraph "API Gateway"
	D[FastAPI Application]
	E[Health Endpoints]
	F[Metrics Endpoint]
	end

	subgraph "Service Layer"
	G[Document Processing Service]
	H[PHI Scrubbing Service]
	I[Patient Summary Service]
	J[Model Management Service]
	end

	subgraph "AI/ML Layer"
	K[Unified Model Manager]
	L[Transformers Models]
	M[GGUF Models]
	N[OpenVINO Models]
	O[Whisper Audio Models]
	end

	subgraph "Data Layer"
	P[PostgreSQL - Audit Logs]
	Q[File Storage]
	R[Model Cache]
	end

	A --> D
	B --> D
	C --> D
	D --> E
	D --> F
	D --> G
	D --> H
	D --> I
	D --> J
	G --> K
	H --> K
	I --> K
	J --> K
	K --> L
	K --> M
	K --> N
	K --> O
	D --> P
	G --> Q
	K --> R
	```

	---

	## 3. Architecture Design

	### 3.1 Architectural Style

	HNTAI follows a Layered Monolithic Architecture with clear separation of concerns:

	1. Presentation Layer: FastAPI routes and endpoints
	2. Service Layer: Business logic and orchestration
	3. Agent Layer: Specialized AI agents for specific tasks
	4. Utility Layer: Shared utilities and helpers
	5. Data Layer: Database and file storage

	### 3.2 Component Architecture

	```mermaid
	graph LR
	subgraph "FastAPI Application"
	A[routes_fastapi.py]
	B[app.py]
	C[main.py]
	end

	subgraph "Agents"
	D[patient_summary_agent.py]
	E[phi_scrubber.py]
	F[text_extractor.py]
	G[medical_data_extractor.py]
	end

	subgraph "Services"
	H[job_manager.py]
	I[request_queue.py]
	J[error_handler.py]
	K[sse_generator.py]
	end

	subgraph "Utils"
	L[unified_model_manager.py]
	M[model_config.py]
	N[robust_json_parser.py]
	O[memory_manager.py]
	end

	A --> D
	A --> E
	A --> F
	A --> G
	A --> H
	A --> I
	D --> L
	E --> L
	F --> L
	G --> L
	L --> M
	L --> O
	```

	### 3.3 Directory Structure

	```
	HNTAI/
	├── services/
	│   └── ai-service/
	│       └── src/
	│           └── ai_med_extract/
	│               ├── agents/                 # AI agents for specific tasks
	│               │   ├── patient_summary_agent.py
	│               │   ├── phi_scrubber.py
	│               │   ├── text_extractor.py
	│               │   └── medical_data_extractor.py
	│               ├── api/                    # FastAPI routes
	│               │   └── routes_fastapi.py
	│               ├── services/               # Business logic services
	│               │   ├── job_manager.py
	│               │   ├── request_queue.py
	│               │   ├── error_handler.py
	│               │   └── sse_generator.py
	│               ├── utils/                  # Utilities and helpers
	│               │   ├── unified_model_manager.py
	│               │   ├── model_config.py
	│               │   ├── robust_json_parser.py
	│               │   ├── memory_manager.py
	│               │   ├── openvino_summarizer_utils.py
	│               │   └── patient_summary_utils.py
	│               ├── app.py                  # FastAPI app factory
	│               ├── main.py                 # Entry point
	│               ├── health_endpoints.py     # Health checks
	│               └── database_audit.py       # HIPAA audit logging
	├── docs/                                   # Documentation
	├── infra/                                  # Infrastructure configs
	│   └── k8s/                                # Kubernetes manifests
	├── app.py                                  # HF Spaces entry point
	├── Dockerfile                              # Multi-stage Docker build
	├── Dockerfile.hf-spaces                    # HF Spaces optimized
	├── .huggingface.yaml                       # HF Spaces config
	├── models_config.json                      # Model configuration
	├── requirements.txt                        # Python dependencies
	└── README.md                               # Project documentation
	```

	---

	## 4. Technology Stack

	### 4.1 Core Technologies

	| Category | Technology | Version | Purpose |
	|----------|-----------|---------|---------|
	| Runtime | Python | 3.10+ | Primary language |
	| Web Framework | FastAPI | Latest | REST API framework |
	| ASGI Server | Uvicorn | Latest | Production server |
	| AI/ML Framework | PyTorch | 2.x | Deep learning |
	| Transformers | Hugging Face Transformers | Latest | Model loading |
	| GGUF Support | llama-cpp-python | Latest | Quantized models |
	| OpenVINO | optimum-intel | Latest | Intel optimization |
	| Audio Processing | Whisper | Latest | Speech-to-text |

	### 4.2 Supporting Technologies

	| Category | Technology | Purpose |
	|----------|-----------|---------|
	| Database | PostgreSQL 13+ | Audit logs (optional) |
	| Caching | In-memory LRU | Model caching |
	| Document Processing | PyPDF2, python-docx | PDF/DOCX parsing |
	| OCR | Tesseract | Image text extraction |
	| Audio | FFmpeg | Audio processing |
	| Containerization | Docker | Deployment |
	| Orchestration | Kubernetes | Scaling |
	| Monitoring | Prometheus | Metrics |

	### 4.3 Development Tools

	- Code Quality: Black, isort, flake8, mypy
	- Testing: pytest
	- Version Control: Git
	- CI/CD: GitHub Actions (candidate, not yet adopted)
	- Documentation: Markdown, Mermaid diagrams

	---

	## 5. Core Components

	### 5.1 FastAPI Application (`app.py`)

	Purpose: Application factory and initialization

	Key Responsibilities:
	- Create and configure FastAPI application
	- Initialize agents and services
	- Register routes and middleware
	- Configure CORS and security

	Key Functions:
	```python
	# Signatures only; bodies omitted
	def create_app(initialize: bool = True) -> FastAPI: ...
	def initialize_agents(app: FastAPI, preload_small_models: bool = False): ...
	def run_dev(): ...  # Development server
	```

	### 5.2 API Routes (`routes_fastapi.py`)

	Purpose: RESTful API endpoints

	Endpoint Categories:

	#### Health & Monitoring
	- `GET /health/live` - Liveness probe
	- `GET /health/ready` - Readiness probe
	- `GET /metrics` - Prometheus metrics

	#### Document Processing
	- `POST /upload` - Upload and process documents
	- `POST /transcribe` - Audio transcription
	- `GET /get_updated_medical_data` - Retrieve processed data
	- `PUT /update_medical_data` - Update medical records

	#### AI Processing
	- `POST /generate_patient_summary` - Generate patient summaries
	- `POST /api/generate_summary` - Text summarization
	- `POST /api/patient_summary_openvino` - OpenVINO summaries
	- `POST /extract_medical_data` - Extract structured data

	#### Model Management
	- `POST /api/load_model` - Load specific models
	- `GET /api/model_info` - Model information
	- `POST /api/switch_model` - Switch models

	### 5.3 Agents

	#### 5.3.1 Patient Summary Agent (`patient_summary_agent.py`)

	Purpose: Generate comprehensive patient summaries

	Key Features:
	- Dynamic model configuration
	- Multi-section summary generation
	- Chronological narrative building
	- Clinical guideline evaluation
	- Fallback text-based summarization

	Core Methods:
	```python
	from typing import Dict, List, Union

	# Signatures only; bodies omitted
	def configure_model(model_name: str, model_type: str): ...
	def generate_clinical_summary(patient_data: Union[List[str], Dict]): ...
	def generate_patient_summary(patient_data: Union[List[str], Dict]): ...
	def build_chronological_narrative(patient_data: dict): ...
	def format_clinical_output(raw_summary: str, patient_data: dict): ...
	```

	#### 5.3.2 PHI Scrubber (`phi_scrubber.py`)

	Purpose: Remove protected health information

	Scrubbing Capabilities:
	- Patient names
	- Medical record numbers (MRN)
	- Dates of birth
	- Phone numbers
	- Email addresses
	- Social Security Numbers
	- Addresses

	Compliance: HIPAA-compliant with audit logging

	#### 5.3.3 Text Extractor (`text_extractor.py`)

	Purpose: Extract text from various document formats

	Supported Formats:
	- PDF documents
	- DOCX files
	- Images (via OCR)
	- Plain text

	#### 5.3.4 Medical Data Extractor (`medical_data_extractor.py`)

	Purpose: Extract structured medical data from text

	Extraction Targets:
	- Diagnoses
	- Medications
	- Procedures
	- Lab results
	- Vital signs
	- Allergies

	### 5.4 Services

	#### 5.4.1 Job Manager (`job_manager.py`)

	Purpose: Manage long-running jobs

	Features:
	- Job lifecycle management
	- Progress tracking
	- Status updates
	- Result caching
	- Cleanup of completed jobs
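
The lifecycle features listed above can be illustrated with a minimal in-memory sketch. The class name `InMemoryJobManager`, the status strings, and the method names are hypothetical and are not taken from `job_manager.py`; the real service may track jobs quite differently.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Job:
    job_id: str
    status: str = "pending"            # pending -> running -> completed (assumed states)
    progress: int = 0                  # percent, clamped to 0-100
    result: Optional[Any] = None       # cached result once completed
    finished_at: Optional[float] = None


class InMemoryJobManager:
    """Hypothetical job store illustrating lifecycle, progress, and cleanup."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def create(self) -> Job:
        job = Job(job_id=uuid.uuid4().hex)
        self._jobs[job.job_id] = job
        return job

    def update(self, job_id: str, progress: int) -> None:
        job = self._jobs[job_id]
        job.status = "running"
        job.progress = max(0, min(100, progress))

    def complete(self, job_id: str, result: Any) -> None:
        job = self._jobs[job_id]
        job.status, job.progress, job.result = "completed", 100, result
        job.finished_at = time.time()

    def get(self, job_id: str) -> Optional[Job]:
        return self._jobs.get(job_id)

    def cleanup(self, max_age_seconds: float, now: Optional[float] = None) -> int:
        """Drop finished jobs older than max_age_seconds; return how many were removed."""
        now = time.time() if now is None else now
        stale = [j.job_id for j in self._jobs.values()
                 if j.finished_at is not None and now - j.finished_at > max_age_seconds]
        for job_id in stale:
            del self._jobs[job_id]
        return len(stale)
```

Keeping the result on the job object is what makes it act as a result cache: a status request can return it until `cleanup` removes the job.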

	#### 5.4.2 Request Queue (`request_queue.py`)

	Purpose: Queue and prioritize requests

	Features:
	- Request queuing
	- Priority handling
	- Concurrency control
	- Timeout management
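
As a rough illustration of queuing, priority handling, and concurrency control, the sketch below uses a heap-based queue. `PriorityRequestQueue` and its methods are assumed names for this example only, and timeout management is left out.

```python
import heapq
import itertools
from typing import Any, List, Optional, Tuple


class PriorityRequestQueue:
    """Hypothetical queue: lower priority numbers are served first, ties are FIFO."""

    def __init__(self, max_concurrent: int = 1) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._counter = itertools.count()   # tie-breaker that preserves arrival order
        self._max_concurrent = max_concurrent
        self._in_flight = 0

    def submit(self, request: Any, priority: int = 10) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_request(self) -> Optional[Any]:
        """Return the next request, or None if the queue is empty or the limit is reached."""
        if not self._heap or self._in_flight >= self._max_concurrent:
            return None
        _, _, request = heapq.heappop(self._heap)
        self._in_flight += 1
        return request

    def done(self) -> None:
        """Mark one in-flight request as finished."""
        self._in_flight = max(0, self._in_flight - 1)
```

With a single GPU, a `max_concurrent` of 1 serializes inference while still letting urgent requests jump ahead of queued routine ones.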

	#### 5.4.3 Error Handler (`error_handler.py`)

	Purpose: Centralized error handling

	Features:
	- Error categorization
	- Contextual logging
	- Job error updates
	- Graceful degradation

	#### 5.4.4 SSE Generator (`sse_generator.py`)

	Purpose: Server-Sent Events for real-time updates

	Features:
	- Progress streaming
	- Status updates
	- Error notifications
	- Completion events
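
For reference, each Server-Sent Event on the wire is an `event:` line, a `data:` line, and a blank line. The sketch below formats such events; the event names (`progress`, `complete`) and payload fields are assumptions, not the names used by `sse_generator.py`.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def format_sse(event: str, data: Dict[str, Any]) -> str:
    """Serialize one event in the text/event-stream wire format."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def job_events(job_id: str, steps: Iterable[int]) -> Iterator[str]:
    """Yield a progress event per step, then a final completion event."""
    for percent in steps:
        yield format_sse("progress", {"job_id": job_id, "progress": percent})
    yield format_sse("complete", {"job_id": job_id, "status": "completed"})
```

A generator of this shape can be handed to a streaming HTTP response served with the `text/event-stream` media type.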

	---

	## 6. AI/ML Architecture

	### 6.1 Unified Model Manager

	File: `unified_model_manager.py`

	Purpose: Single interface for all AI model types

	Architecture:

	```mermaid
	classDiagram
	class BaseModel {
	<<abstract>>
	+name: str
	+model_type: str
	+status: ModelStatus
	+load()
	+generate(prompt, config)*
	+unload()
	}

	class TransformersModel {
	+_model: Pipeline
	+_load_implementation()
	+generate(prompt, config)
	}

	class GGUFModel {
	+_model: Llama
	+filename: str
	+_extract_filename()
	+_load_implementation()
	+generate(prompt, config)
	}

	class OpenVINOModel {
	+_model: OVModelForCausalLM
	+_tokenizer: AutoTokenizer
	+_load_implementation()
	+generate(prompt, config)
	}

	class FallbackModel {
	+_load_implementation()
	+generate(prompt, config)
	}

	class UnifiedModelManager {
	+max_models: int
	+max_memory_mb: int
	+get_model(name, type)
	+generate_text(name, prompt)
	+cleanup()
	}

	BaseModel <|-- TransformersModel
	BaseModel <|-- GGUFModel
	BaseModel <|-- OpenVINOModel
	BaseModel <|-- FallbackModel
	UnifiedModelManager --> BaseModel
	```

	### 6.2 Model Types

	#### 6.2.1 Transformers Models

	Backend: Hugging Face Transformers
	Device: GPU (CUDA) or CPU
	Use Cases: General text generation, summarization

	Supported Models:
	- `microsoft/Phi-3-mini-4k-instruct`
	- `facebook/bart-large-cnn` (deprecated)
	- `google/flan-t5-large`

	Configuration:
	```python
	{
	"model_name": "microsoft/Phi-3-mini-4k-instruct",
	"model_type": "text-generation",
	"device_map": "auto",
	"torch_dtype": "float16"
	}
	```

	#### 6.2.2 GGUF Models

	Backend: llama-cpp-python
	Device: CPU or GPU (via Metal/CUDA)
	Use Cases: Efficient inference with quantized models

	Supported Models:
	- `microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf` (PRIMARY)

	Configuration:
	```python
	{
	"model_path": "path/to/model.gguf",
	"n_ctx": 8192,
	"n_threads": 4,
	"n_gpu_layers": 35 # GPU acceleration
	}
	```

	#### 6.2.3 OpenVINO Models

	Backend: Intel OpenVINO
	Device: CPU (Intel optimized) or GPU
	Use Cases: Production deployment on Intel hardware

	Supported Models:
	- `OpenVINO/Phi-3-mini-4k-instruct-fp16-ov`

	Configuration:
	```python
	{
	    "model_path": "OpenVINO/Phi-3-mini-4k-instruct-fp16-ov",
	    "device": "GPU"  # falls back to "CPU" when no GPU is available
	}
	```

	### 6.3 Model Selection Strategy

	```mermaid
	flowchart TD
	A[Request with model_name] --> B{Model specified?}
	B -->|Yes| C{Model type?}
	B -->|No| D[Use default: Phi-3 GGUF]

	C -->|GGUF| E[Load GGUF Model]
	C -->|OpenVINO| F[Load OpenVINO Model]
	C -->|Transformers| G[Load Transformers Model]
	C -->|Unknown| H[Auto-detect type]

	E --> I{Load successful?}
	F --> I
	G --> I
	H --> I
	D --> I

	I -->|Yes| J[Generate with model]
	I -->|No| K[Try fallback model]

	K --> L{Fallback successful?}
	L -->|Yes| J
	L -->|No| M[Use text-based fallback]
	```
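
The fallback path in the flowchart can be sketched as an ordered list of loaders that are tried in turn. Everything here is illustrative: `generate_with_fallback`, the loader callables, and the truncating `text_based_fallback` are hypothetical stand-ins, not the Unified Model Manager's actual API.

```python
from typing import Callable, List, Tuple

# A loader returns a generate(prompt) callable, or raises if the model cannot be loaded.
Loader = Callable[[], Callable[[str], str]]


def text_based_fallback(prompt: str) -> str:
    """Last resort when no model loads: return a truncated excerpt (assumed behavior)."""
    return prompt[:200]


def generate_with_fallback(prompt: str, loaders: List[Tuple[str, Loader]]) -> Tuple[str, str]:
    """Try each (name, loader) in order; return (backend_used, generated_text)."""
    for name, load in loaders:
        try:
            generate = load()
        except Exception:
            continue  # load failed, move on to the next candidate
        return name, generate(prompt)
    return "text-fallback", text_based_fallback(prompt)
```

Returning the backend name alongside the text lets the caller log which path actually served the request.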

	### 6.4 Model Configuration

	File: `models_config.json`

	```json
	{
	"patient_summary_models": [
	{
	"name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
	"type": "gguf",
	"is_active": true,
	"cached": true,
	"description": "Phi-3 Mini GGUF Q4 quantized - PRIMARY MODEL",
	"use_case": "Fast patient summary generation with CPU/GPU",
	"repo_id": "microsoft/Phi-3-mini-4k-instruct-gguf",
	"filename": "Phi-3-mini-4k-instruct-q4.gguf"
	}
	],
	"runtime_behavior": {
	"allow_runtime_downloads": true,
	"cache_runtime_downloads": true,
	"fallback_to_cached": true
	}
	}
	```

	### 6.5 Token Management

	Token Limit Handling:
	- Automatic token counting (heuristic: ~4 chars/token)
	- Pre-generation validation
	- Token limit error detection
	- Graceful degradation

	Token Limits by Model:
	- Phi-3 Mini 4k models: 4096 tokens (context window, shared by the prompt and the generated output)
	- BART models: 1024 tokens
	- T5 models: 512 tokens
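
A minimal sketch of the pre-generation check, assuming the ~4 characters per token heuristic and the limits listed above. The function names and the dictionary keys (`phi-3`, `bart`, `t5`) are hypothetical labels chosen for this example.

```python
import math

CHARS_PER_TOKEN = 4  # heuristic from the list above, not a real tokenizer
CONTEXT_LIMITS = {"phi-3": 4096, "bart": 1024, "t5": 512}  # keys are illustrative


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_context(text: str, model_family: str, reserved_output_tokens: int = 0) -> bool:
    """Check that the prompt plus reserved output fits the model's context window."""
    limit = CONTEXT_LIMITS[model_family]
    return estimate_tokens(text) + reserved_output_tokens <= limit
```

Because the heuristic can undercount for text dense in numbers or abbreviations, a real tokenizer count is safer whenever one is available for the model.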

	### 6.6 Generation Configuration

	```python
	from dataclasses import dataclass

	@dataclass
	class GenerationConfig:
	    max_tokens: int = 8192     # Maximum output tokens
	    min_tokens: int = 50       # Minimum output tokens
	    temperature: float = 0.3   # Low temperature for more consistent medical output
	    top_p: float = 0.9         # Nucleus sampling
	    timeout: float = 180.0     # Generation timeout in seconds (T4)
	    stream: bool = False       # Streaming support

	### 6.7 T4 GPU Optimizations

	Hardware Target: NVIDIA T4 Medium (16GB GPU, 16GB RAM)

	Optimizations:
	1. Memory Management:
	   - Max 2 models in memory
	   - Automatic model unloading
	   - GPU memory clearing
	   - Garbage collection

	2. Model Loading:
	   - Lazy loading (on-demand)
	   - Intelligent caching
	   - LRU eviction policy

	3. Inference:
	   - FP16 precision
	   - Batch size: 1
	   - Context window: 8192 tokens
	   - GPU layer offloading (GGUF)
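
Lazy loading combined with an LRU policy and a two-model limit can be sketched with an `OrderedDict`. `LRUModelCache` is a hypothetical name, and the sketch only drops the reference on eviction; actual GPU memory clearing and garbage collection are omitted.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class LRUModelCache:
    """Hypothetical cache: keeps at most max_models loaded, evicting the least recently used."""

    def __init__(self, max_models: int = 2) -> None:
        self.max_models = max_models
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str, loader: Callable[[str], Any]) -> Any:
        if name in self._models:
            self._models.move_to_end(name)      # mark as most recently used
            return self._models[name]
        while len(self._models) >= self.max_models:
            self._models.popitem(last=False)    # evict the least recently used model
        model = loader(name)                    # lazy: loaded only on first request
        self._models[name] = model
        return model

    def loaded(self) -> List[str]:
        """Names of loaded models, least recently used first."""
        return list(self._models)
```

Evicting before loading keeps peak memory at no more than `max_models` models, which matters when two models already fill most of the GPU.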

	---

	## 7. API Architecture

	### 7.1 RESTful Design

	Principles:
	- Resource-oriented URLs
	- HTTP methods for CRUD operations
	- JSON request/response format
	- Stateless communication
	- Proper HTTP status codes

	### 7.2 Request/Response Flow

	```mermaid
	sequenceDiagram
	participant C as Client
	participant A as API Gateway
	participant S as Service Layer
	participant M as Model Manager
	participant D as Database

	C->>A: POST /generate_patient_summary
	A->>A: Validate request
	A->>S: Create job
	S->>D: Log job creation
	A-->>C: 202 Accepted (job_id)

	S->>M: Load model
	M->>M: Check cache
	M->>M: Load if needed
	M-->>S: Model ready

	S->>M: Generate summary
	M->>M: Process prompt
	M-->>S: Generated text

	S->>D: Log completion
	S->>A: Update job status
	A-->>C: SSE: Progress updates

	C->>A: GET /job/{job_id}
	A->>S: Get job status
	S->>D: Retrieve job
	S-->>A: Job result
	A-->>C: 200 OK (result)
	```

	### 7.3 Authentication & Authorization

	Current State: Basic API key authentication (optional)

	Planned Enhancements:
	- JWT-based authentication
	- Role-based access control (RBAC)
	- OAuth2 integration
	- API rate limiting

	### 7.4 Error Handling

	Error Response Format:
	```json
	{
	"error": {
	"code": "MODEL_LOAD_FAILED",
	"message": "Failed to load model: microsoft/Phi-3-mini-4k-instruct",
	"details": {
	"model_name": "microsoft/Phi-3-mini-4k-instruct",
	"error_type": "initialization_error",
	"timestamp": "2025-12-05T17:23:52Z"
	}
	}
	}
	```

	HTTP Status Codes:
	- `200 OK` - Successful request
	- `202 Accepted` - Job created
	- `400 Bad Request` - Invalid input
	- `404 Not Found` - Resource not found
	- `500 Internal Server Error` - Server error
	- `503 Service Unavailable` - Service degraded

	### 7.5 Rate Limiting

	Strategy: Token bucket algorithm (rate limiting is listed under Planned Enhancements in Section 7.3)

	Limits:
	- 100 requests/minute per IP
	- 1000 requests/hour per API key
	- Burst allowance: 20 requests
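
A token bucket refills at a steady rate and caps stored tokens at the burst size. The sketch below is one way to express the per-IP limit above (100 requests/minute, burst of 20); the `TokenBucket` class and its injectable `now` parameter are assumptions made for illustration and testing.

```python
import time
from typing import Optional


class TokenBucket:
    """Illustrative token bucket: refills continuously, allows bursts up to capacity."""

    def __init__(self, rate_per_second: float, capacity: int, now: Optional[float] = None) -> None:
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = float(capacity)   # start full so an initial burst is allowed
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Consume one token if available; return whether the request may proceed."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per client IP: 100 requests/minute with a burst allowance of 20.
per_ip_bucket = TokenBucket(rate_per_second=100 / 60, capacity=20)
```

A rejected request would typically map to an HTTP 429 response, though that status code is not part of the list in Section 7.4.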

	---

	## 8. Data Flow & Processing

	### 8.1 Document Processing Pipeline

	```mermaid
	flowchart LR
	A[Upload Document] --> B{File Type?}
	B -->|PDF| C[PDF Parser]
	B -->|DOCX| D[DOCX Parser]
	B -->|Image| E[OCR Engine]
	B -->|Audio| F[Whisper Transcription]

	C --> G[Text Extraction]
	D --> G
	E --> G
	F --> G

	G --> H[PHI Scrubbing]
	H --> I[Medical Data Extraction]
	I --> J[Store Processed Data]
	J --> K[Return Results]
	```

	### 8.2 Patient Summary Generation Flow

	```mermaid
	flowchart TD
	A[Patient Data Input] --> B[Parse EHR Data]
	B --> C[Convert to Plain Text]
	C --> D{Data Size Check}

	D -->|Small| E[Single-pass Generation]
	D -->|Large| F[Chunking Strategy]

	F --> G[Chunk by Date/Size]
	G --> H[Process Chunks in Parallel]
	H --> I[Combine Chunk Summaries]

	E --> J[Generate with Model]
	I --> J

	J --> K[Format Clinical Output]
	K --> L[Evaluate Against Guidelines]
	L --> M[Return Summary]
	```
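
The size check and chunking branch can be sketched as a small map-then-combine routine. This is a simplified assumption of how it might work: it splits on paragraph boundaries by character count rather than by date, processes chunks sequentially rather than in parallel, and takes the model call as a `summarize_fn` parameter.

```python
from typing import Callable, List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Group paragraphs into chunks that stay under max_chars where possible."""
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate          # still fits, or a single oversized paragraph
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def summarize(text: str, summarize_fn: Callable[[str], str], max_chars: int = 8000) -> str:
    """Single pass for small inputs; otherwise summarize chunks, then combine."""
    if len(text) <= max_chars:
        return summarize_fn(text)
    partials = [summarize_fn(chunk) for chunk in chunk_text(text, max_chars)]
    return summarize_fn("\n\n".join(partials))
```

The final call summarizes the joined partial summaries, which corresponds to the "Combine Chunk Summaries" step feeding into "Generate with Model".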

	### 8.3 Data Transformation

	Input Formats:
	- Raw EHR JSON
	- HL7 FHIR resources
	- Plain text documents
	- Scanned images
	- Audio recordings

	Output Formats:
	- Structured JSON
	- Clinical summary (Markdown)
	- FHIR-compliant resources
	- Audit logs

	### 8.4 Caching Strategy

	Multi-Level Caching:

	1. Model Cache: Loaded models in memory
	2. Result Cache: Generated summaries (LRU)
	3. File Cache: Processed documents
	4. Hugging Face Cache: Downloaded models

	Cache Invalidation:
	- Time-based expiration
	- Manual invalidation
	- Memory pressure-based eviction

	---

	## 9. Database Design

	### 9.1 Database Schema

	Primary Database: PostgreSQL (optional, for audit logs)

	#### Audit Logs Table

	```sql
	CREATE TABLE audit_logs (
	    id SERIAL PRIMARY KEY,
	    timestamp TIMESTAMP NOT NULL DEFAULT NOW(),
	    user_id VARCHAR(255),
	    action VARCHAR(100) NOT NULL,
	    resource_type VARCHAR(100),
	    resource_id VARCHAR(255),
	    phi_accessed BOOLEAN DEFAULT FALSE,
	    ip_address INET,
	    user_agent TEXT,
	    request_data JSONB,
	    response_status INTEGER,
	    error_message TEXT,
	    created_at TIMESTAMP DEFAULT NOW()
	);

	CREATE INDEX idx_audit_timestamp ON audit_logs(timestamp);
	CREATE INDEX idx_audit_user ON audit_logs(user_id);
	CREATE INDEX idx_audit_action ON audit_logs(action);
	CREATE INDEX idx_audit_phi ON audit_logs(phi_accessed);
	```

	### 9.2 Data Models

	Patient Data Model (In-memory):
	```python
	{
	"patient_id": "string",
	"demographics": {
	"name": "string",
	"dob": "date",
	"gender": "string",
	"mrn": "string"
	},
	"visits": [
	{
	"visit_id": "string",
	"date": "datetime",
	"chief_complaint": "string",
	"diagnoses": ["string"],
	"medications": ["string"],
	"procedures": ["string"],
	"vitals": {},
	"labs": []
	}
	]
	}
	```

	### 9.3 File Storage

	Storage Strategy: Local filesystem or cloud storage

	Directory Structure:
	```
	/data/
	├── uploads/ # Uploaded documents
	├── processed/ # Processed documents
	├── cache/ # Temporary cache
	└── models/ # Model files
	```

	---

	## 10. Security Architecture

	### 10.1 HIPAA Compliance

	Requirements Met:
	1. Access Controls: Authentication and authorization
	2. Audit Logging: Comprehensive activity logs
	3. Data Encryption: In-transit and at-rest
	4. PHI Scrubbing: Automated anonymization
	5. Secure Communication: HTTPS/TLS

	### 10.2 PHI Scrubbing

	Scrubbing Patterns:
	```python
	PATTERNS = {
	    "name": r'\b[A-Z][a-z]+ [A-Z][a-z]+\b',  # any two capitalized words; also matches non-names
	    "mrn": r'\bMRN[:\s]*\d{6,10}\b',
	    "dob": r'\b\d{1,2}/\d{1,2}/\d{2,4}\b',  # matches any slash-separated date
	    "phone": r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
	    "email": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
	    "ssn": r'\b\d{3}-\d{2}-\d{4}\b'
	}
	```
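
To show how such patterns might be applied, here is a minimal scrubbing sketch. It is an assumption about usage, not the code in `phi_scrubber.py`: the `scrub_phi` function and the bracketed placeholder format are hypothetical. The name pattern is left out because two capitalized words also match ordinary phrases such as "Blood Pressure", and the MRN pattern runs first so a long MRN is not consumed by the phone pattern.

```python
import re

# Patterns follow the list above; the placeholder format is an assumption.
PATTERNS = {
    "mrn": r"\bMRN[:\s]*\d{6,10}\b",
    "dob": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
}


def scrub_phi(text: str) -> str:
    """Replace each pattern match with a typed placeholder such as [SSN]."""
    for label, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{label.upper()}]", text)
    return text
```

Regex scrubbing of this kind only catches identifiers in the expected formats, so free-text names and addresses generally need an additional method such as named-entity recognition.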

	### 10.3 Container Security

	Security Measures:
	- Non-root user execution
	- Read-only root filesystem
	- Resource limits (CPU, memory)
	- Network policies
	- Secrets management
	- Minimal base images

	### 10.4 API Security

	Security Headers:
	```python
	{
	"X-Content-Type-Options": "nosniff",
	"X-Frame-Options": "DENY",
	"X-XSS-Protection": "1; mode=block",
	"Strict-Transport-Security": "max-age=31536000"
	}
	```

	---

	## 11. Deployment Architecture

	### 11.1 Deployment Options

	#### 11.1.1 Hugging Face Spaces

	Configuration: `.huggingface.yaml`

	```yaml
	runtime: docker
	sdk: docker
	python_version: "3.10"

	build:
	  dockerfile: Dockerfile.hf-spaces
	  cache: true

	hardware:
	  gpu: t4-medium  # 16GB GPU RAM, 16GB System RAM

	env:
	  - SPACE_ID=$SPACE_ID
	  - HF_HOME=/app/.cache/huggingface
	  - TORCH_HOME=/app/.cache/torch
	  - MODEL_CACHE_DIR=/app/models
	  - PRELOAD_GGUF=true
	  - HF_SPACES=true
	```

	Optimizations:
	- Pre-cached models in Docker image
	- Lazy model loading
	- Memory-efficient inference
	- Automatic GPU detection

	#### 11.1.2 Kubernetes

	Deployment Manifest:
	```yaml
	apiVersion: apps/v1
	kind: Deployment
	metadata:
	  name: hntai-deployment
	spec:
	  replicas: 3
	  selector:
	    matchLabels:
	      app: hntai
	  template:
	    metadata:
	      labels:
	        app: hntai
	    spec:
	      containers:
	        - name: hntai
	          image: hntai:latest
	          ports:
	            - containerPort: 7860
	          resources:
	            requests:
	              memory: "4Gi"
	              cpu: "2"
	            limits:
	              memory: "8Gi"
	              cpu: "4"
	          livenessProbe:
	            httpGet:
	              path: /health/live
	              port: 7860
	            initialDelaySeconds: 30
	            periodSeconds: 10
	          readinessProbe:
	            httpGet:
	              path: /health/ready
	              port: 7860
	            initialDelaySeconds: 10
	            periodSeconds: 5
	```

	#### 11.1.3 Docker

	Multi-Stage Dockerfile:

	```dockerfile
	# Stage 1: Builder
	FROM python:3.10-slim AS builder
	RUN apt-get update && apt-get install -y build-essential
	COPY requirements.txt .
	RUN pip install --prefix=/install -r requirements.txt

	# Stage 2: Runtime
	FROM python:3.10-slim AS runtime
	COPY --from=builder /install /usr/local
	WORKDIR /app
	COPY . .
	ENV PYTHONUNBUFFERED=1
	EXPOSE 7860
	CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
	```

	### 11.2 Scaling Strategy

	Horizontal Scaling:
	- Multiple replicas behind load balancer
	- Stateless design for easy scaling
	- Shared model cache (optional)

	Vertical Scaling:
	- Increase CPU/memory per instance
	- GPU acceleration for inference
	- Larger model support

	### 11.3 High Availability

	Components:
	1. Load Balancer: Distribute traffic
	2. Health Checks: Automatic failover
	3. Auto-scaling: Based on CPU/memory
	4. Graceful Shutdown: Drain connections

	---

	## 12. Performance Optimization

	### 12.1 Model Optimization

	Techniques:
	1. Quantization: GGUF Q4 models (4-bit)
	2. Precision: FP16 for GPU inference
	3. Batching: Batch size optimization
	4. Caching: Model and result caching
	5. Lazy Loading: On-demand model loading

	### 12.2 Memory Management

	Strategies:
	- Automatic garbage collection
	- GPU memory clearing
	- Model unloading (LRU)
	- Memory pressure monitoring

	Memory Limits:
	- T4 Medium: 16GB GPU, 16GB RAM
	- Max 2 models in memory
	- Automatic eviction at 80% usage

	### 12.3 Inference Optimization

	T4-Specific Optimizations:
	```python
	{
	"max_models": 2,
	"max_memory_mb": 14000,
	"n_ctx": 8192,
	"n_threads": 4,
	"n_gpu_layers": 35,
	"torch_dtype": "float16",
	"device_map": "auto"
	}
	```

	### 12.4 Caching Strategy

	Cache Hierarchy:
	1. L1 - Model Cache: In-memory loaded models
	2. L2 - Result Cache: Generated summaries (LRU, 100 items)
	3. L3 - File Cache: Processed documents (disk)
	4. L4 - HF Cache: Downloaded models (disk)

	### 12.5 Performance Metrics

	Target Metrics:
	- Model load time: < 10 seconds
	- Summary generation: < 60 seconds (small), < 180 seconds (large)
	- API response time: < 100ms (excluding generation)
	- Memory usage: < 80% of available
	- GPU utilization: > 70% during inference

	---

	## 13. Monitoring & Observability

	### 13.1 Health Checks

	Liveness Probe (`/health/live`):
	```python
	{
	"status": "alive",
	"timestamp": "2025-12-05T17:23:52Z"
	}
	```

	Readiness Probe (`/health/ready`):
	```python
	{
	"status": "ready",
	"checks": {
	"database": "ok",
	"model_manager": "ok",
	"file_storage": "ok"
	},
	"timestamp": "2025-12-05T17:23:52Z"
	}
	```
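
One way to assemble the readiness payload is to run each dependency check and report 503 if any fails. The `readiness` function, the `"failed"` and `"not_ready"` strings, and the check callables are assumptions for illustration; `health_endpoints.py` may do this differently.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run each check and return (http_status, response_body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception:
            results[name] = "failed"   # a crashing check counts as a failed dependency
    ready = all(value == "ok" for value in results.values())
    body = {
        "status": "ready" if ready else "not_ready",
        "checks": results,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return (200 if ready else 503), body
```

Returning 503 on failure is what lets a Kubernetes readiness probe take the pod out of rotation until its dependencies recover.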

	### 13.2 Metrics

	Prometheus Metrics (`/metrics`):
	```
	# Model metrics
	model_load_time_seconds{model_name="phi-3-gguf"} 8.5
	model_inference_time_seconds{model_name="phi-3-gguf"} 45.2
	model_memory_usage_bytes{model_name="phi-3-gguf"} 4294967296

	# API metrics
	http_requests_total{method="POST",endpoint="/generate_patient_summary"} 1234
	http_request_duration_seconds{method="POST",endpoint="/generate_patient_summary"} 52.3

	# System metrics
	memory_usage_percent 65.2
	gpu_memory_usage_percent 72.1
	cpu_usage_percent 45.8
	```
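
Lines in this format can be produced with a small helper, sketched below under the assumption that metrics are rendered by hand. In practice a client library would normally do this, and the `render_metric` name is hypothetical. The sketch does not escape special characters in label values.

```python
from typing import Dict, Optional, Union


def render_metric(name: str, value: Union[int, float],
                  labels: Optional[Dict[str, str]] = None) -> str:
    """Render one sample line in the Prometheus text exposition format."""
    if labels:
        label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        return f"{name}{{{label_text}}} {value}"
    return f"{name} {value}"
```

For example, `render_metric("model_load_time_seconds", 8.5, {"model_name": "phi-3-gguf"})` yields the first model metric line shown above.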

	### 13.3 Logging

	Structured Logging:
	```python
	{
	"timestamp": "2025-12-05T17:23:52Z",
	"level": "INFO",
	"logger": "ai_med_extract.agents.patient_summary_agent",
	"message": "Generated patient summary",
	"context": {
	"job_id": "abc123",
	"model_name": "phi-3-gguf",
	"duration_seconds": 45.2,
	"token_count": 2048
	}
	}
	```
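
A log record of this shape can be produced with the standard `logging` module and a custom formatter. The `JsonFormatter` class below is a sketch, and passing the context through `extra={"context": ...}` is an assumed convention rather than the project's confirmed approach.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format log records as single-line JSON with an optional context object."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set by callers via logger.info(..., extra={"context": {...}})
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)
```

To use it, attach the formatter to a handler with `handler.setFormatter(JsonFormatter())` and log with `logger.info("Generated patient summary", extra={"context": {"job_id": "abc123"}})`.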

	Log Levels:
	- `DEBUG`: Detailed diagnostic information
	- `INFO`: General informational messages
	- `WARNING`: Warning messages
	- `ERROR`: Error messages
	- `CRITICAL`: Critical failures

	### 13.4 Audit Logging

	HIPAA Audit Trail:
	```python
	{
	"timestamp": "2025-12-05T17:23:52Z",
	"user_id": "user123",
	"action": "PHI_ACCESS",
	"resource_type": "patient_summary",
	"resource_id": "patient456",
	"phi_accessed": true,
	"ip_address": "192.168.1.100",
	"user_agent": "Mozilla/5.0...",
	"request_data": {...},
	"response_status": 200
	}
	```

	---

	## 14. Development Workflow

	### 14.1 Local Development

	Setup:
	```bash
	# Clone repository
	git clone <repository-url>
	cd HNTAI

	# Create virtual environment
	python -m venv venv
	source venv/bin/activate # Windows: venv\Scripts\activate

	# Install dependencies
	pip install -r requirements.txt

	# Set environment variables
	export DATABASE_URL="postgresql://user:pass@localhost:5432/hntai"
	export SECRET_KEY="your-secret-key"
	export HF_HOME="/tmp/huggingface"

	# Run development server
	cd services/ai-service/src
	python -m ai_med_extract.app run_dev
	```

	### 14.2 Testing

	Test Structure:
	```
	tests/
	├── unit/
	│   ├── test_agents.py
	│   ├── test_model_manager.py
	│   └── test_utils.py
	├── integration/
	│   ├── test_api.py
	│   └── test_workflows.py
	└── conftest.py
	```

	Running Tests:
	```bash
	# Unit tests
	python -m pytest tests/unit/

	# Integration tests
	python -m pytest tests/integration/

	# Coverage report
	python -m pytest --cov=ai_med_extract tests/
	```

	### 14.3 Code Quality

	Tools:
	```bash
	# Format code
	black .
	isort .

	# Lint code
	flake8 .

	# Type checking
	mypy services/ai-service/src/ai_med_extract/
	```

	### 14.4 Git Workflow

	Branching Strategy:
	- `main`: Production-ready code
	- `develop`: Integration branch
	- `feature/*`: Feature branches
	- `bugfix/*`: Bug fix branches
	- `hotfix/*`: Production hotfixes

	Commit Convention:
	```
	<type>(<scope>): <subject>

	<body>

	<footer>
	```

	Types: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore`

	---

	## 15. Integration Patterns

	### 15.1 External System Integration

	Integration Points:
	1. EHR Systems: HL7, FHIR APIs
	2. Document Management: File uploads, cloud storage
	3. Authentication: OAuth2, SAML
	4. Monitoring: Prometheus, Grafana
	5. Logging: ELK Stack, CloudWatch

	### 15.2 API Integration

	Client Libraries (Planned):
	- Python SDK
	- JavaScript SDK
	- REST API documentation (OpenAPI/Swagger)

	Example Integration:
	```python
	import time

	import requests

	# Upload document (the context manager closes the file after the request)
	with open("document.pdf", "rb") as f:
	    response = requests.post(
	        "https://api.hntai.com/upload",
	        files={"file": f},
	        headers={"Authorization": "Bearer <token>"}
	    )

	# Generate patient summary
	response = requests.post(
	    "https://api.hntai.com/generate_patient_summary",
	    json={
	        "patient_data": {...},  # placeholder: supply the patient record here
	        "model_name": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
	        "model_type": "gguf"
	    },
	    headers={"Authorization": "Bearer <token>"}
	)

	job_id = response.json()["job_id"]

	# Poll for results every 5 seconds until the job completes
	while True:
	    response = requests.get(
	        f"https://api.hntai.com/job/{job_id}",
	        headers={"Authorization": "Bearer <token>"}
	    )
	    if response.json()["status"] == "completed":
	        break
	    time.sleep(5)
	```

	### 15.3 Webhook Support

	Planned Feature: Webhook notifications for job completion

	```python
	{
	"event": "job.completed",
	"job_id": "abc123",
	"timestamp": "2025-12-05T17:23:52Z",
	"data": {
	"status": "completed",
	"result": {...}
	}
	}
	```

	---

	## 16. Scalability Considerations

	### 16.1 Horizontal Scaling

	Strategies:
	1. Stateless Design: No session state in application
	2. Load Balancing: Distribute requests across instances
	3. Shared Cache: Redis for distributed caching
	4. Message Queue: RabbitMQ/Kafka for async processing

	### 16.2 Vertical Scaling

	Resource Scaling:
	- CPU: 2-8 cores per instance
	- Memory: 8-32 GB per instance
	- GPU: T4, V100, A100 for inference

	### 16.3 Database Scaling

	Strategies:
	1. Read Replicas: For audit log queries
	2. Partitioning: Time-based partitioning for logs
	3. Indexing: Optimize query performance
	4. Archiving: Move old logs to cold storage

	### 16.4 Model Serving

	Scaling Options:
	1. Model Replication: Same model on multiple instances
	2. Model Sharding: Different models on different instances
	3. Model Versioning: A/B testing with multiple versions
	4. Dedicated Inference: Separate inference service

	---

	## 17. Future Roadmap

	### 17.1 Short-Term (3-6 months)

	1. Enhanced Model Support:
	   - Support for Llama 3, Mistral models
	   - Fine-tuned medical models
	   - Multi-modal models (text + images)

	2. Improved Performance:
	   - Model quantization (INT8, INT4)
	   - Batch inference support
	   - Streaming responses

	3. Additional Features:
	   - Real-time collaboration
	   - Version control for summaries
	   - Template-based summaries

	### 17.2 Medium-Term (6-12 months)

	1. Advanced AI Capabilities:
	   - Multi-agent orchestration
	   - Retrieval-Augmented Generation (RAG)
	   - Knowledge graph integration

	2. Enterprise Features:
	   - Multi-tenancy support
	   - Advanced RBAC
	   - SSO integration
	   - Compliance reporting

	3. Platform Enhancements:
	   - Web UI for management
	   - Mobile app support
	   - Plugin architecture

	### 17.3 Long-Term (12+ months)

	1. AI/ML Advancements:
	   - Custom model training pipeline
	   - Federated learning support
	   - Explainable AI (XAI)

	2. Ecosystem Integration:
	   - FHIR server integration
	   - HL7 v3 support
	   - DICOM image analysis

	3. Global Expansion:
	   - Multi-language support
	   - Regional compliance (GDPR, etc.)
	   - Edge deployment

	---

	## Appendix A: Configuration Reference

	### Environment Variables

	| Variable | Description | Default | Required |
	|----------|-------------|---------|----------|
	| `DATABASE_URL` | PostgreSQL connection string | - | No |
	| `SECRET_KEY` | Application secret key | - | Yes |
	| `JWT_SECRET_KEY` | JWT signing key | - | Yes |
	| `HF_HOME` | Hugging Face cache directory | `/tmp/huggingface` | No |
	| `TORCH_HOME` | PyTorch cache directory | `/tmp/torch` | No |
	| `WHISPER_CACHE` | Whisper model cache | `/tmp/whisper` | No |
	| `HF_SPACES` | Hugging Face Spaces mode | `false` | No |
	| `PRELOAD_GGUF` | Preload GGUF models | `false` | No |
	| `MAX_NEW_TOKENS` | Max output tokens | `8192` | No |
	| `MAX_INPUT_TOKENS` | Max input tokens | `2048` | No |

	---

	## Appendix B: API Reference

	### Complete Endpoint List

	| Method | Endpoint | Description |
	|--------|----------|-------------|
	| `GET` | `/` | Root endpoint |
	| `GET` | `/health/live` | Liveness probe |
	| `GET` | `/health/ready` | Readiness probe |
	| `GET` | `/metrics` | Prometheus metrics |
	| `POST` | `/upload` | Upload document |
	| `POST` | `/transcribe` | Transcribe audio |
	| `POST` | `/generate_patient_summary` | Generate patient summary |
	| `POST` | `/api/generate_summary` | Generate text summary |
	| `POST` | `/api/patient_summary_openvino` | OpenVINO summary |
	| `POST` | `/extract_medical_data` | Extract medical data |
	| `GET` | `/get_updated_medical_data` | Get processed data |
	| `PUT` | `/update_medical_data` | Update medical data |
	| `POST` | `/api/load_model` | Load model |
	| `GET` | `/api/model_info` | Get model info |
	| `POST` | `/api/switch_model` | Switch model |

	---

	## Appendix C: Troubleshooting Guide

	### Common Issues

	#### Model Loading Failures

	Symptom: Model fails to load
	Causes:
	- Insufficient memory
	- Missing dependencies
	- Network issues (download)

	Solutions:
	1. Check memory availability
	2. Verify dependencies installed
	3. Check network connectivity
	4. Use fallback model

	#### Token Limit Errors

	Symptom: "Input exceeds token limit"
	Causes:
	- Input too long
	- Model context window exceeded

	Solutions:
	1. Reduce input size
	2. Use chunking strategy
	3. Switch to larger context model

	#### Performance Issues

	Symptom: Slow inference
	Causes:
	- CPU-only inference
	- Large model size
	- Memory pressure

	Solutions:
	1. Enable GPU acceleration
	2. Use quantized models (GGUF)
	3. Reduce batch size
	4. Clear model cache

	---

	## Appendix D: Glossary

	| Term | Definition |
	|------|------------|
	| PHI | Protected Health Information |
	| HIPAA | Health Insurance Portability and Accountability Act |
	| EHR | Electronic Health Record |
	| FHIR | Fast Healthcare Interoperability Resources |
	| HL7 | Health Level 7 (healthcare data standard) |
	| GGUF | GPT-Generated Unified Format (quantized models) |
	| OpenVINO | Open Visual Inference and Neural Network Optimization |
	| T4 | NVIDIA Tesla T4 GPU |
	| LRU | Least Recently Used (cache eviction) |
	| SSE | Server-Sent Events |
	| ASGI | Asynchronous Server Gateway Interface |

	---

	## Document Revision History

	| Version | Date | Author | Changes |
	|---------|------|--------|---------|
	| 1.0 | 2025-12-05 | System | Initial comprehensive documentation |

	---

	End of Technical Architecture Documentation