RetinaSense-ViT: System Architecture Document

Version: 1.0
Date: March 10, 2026
Author: Tanishq
Status: Production Ready

1. Introduction

1.1 Purpose

This document describes the system architecture of RetinaSense-ViT, a deep learning system for automated multi-class retinal disease classification from fundus images. The system assigns each image to one of five classes: Normal, Diabetic Retinopathy (DR), Glaucoma, Cataract, or Age-related Macular Degeneration (AMD). With the default accuracy-focused thresholds it achieves 84.48% accuracy and a macro F1 score of 0.840.

1.2 Scope

This architecture covers:

Data ingestion and preprocessing pipeline
Model architecture (Vision Transformer and EfficientNet variants)
Training infrastructure and GPU optimization
Inference pipeline with threshold optimization
Evaluation and monitoring subsystems

1.3 Intended Audience

ML engineers, software architects, clinical researchers, and deployment teams.

2. High-Level Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                      RetinaSense-ViT System                        │
│                                                                     │
│  ┌──────────┐   ┌───────────────┐   ┌──────────┐   ┌────────────┐ │
│  │ Data     │──▶│ Preprocessing │──▶│ Model    │──▶│ Inference  │ │
│  │ Ingestion│   │ Pipeline      │   │ Training │   │ Pipeline   │ │
│  └──────────┘   └───────────────┘   └──────────┘   └────────────┘ │
│       │               │                  │               │         │
│       ▼               ▼                  ▼               ▼         │
│  ┌──────────┐   ┌───────────────┐   ┌──────────┐   ┌────────────┐ │
│  │ ODIR-5K  │   │ Ben Graham    │   │ViT-Base  │   │ Threshold  │ │
│  │ APTOS-19 │   │ Enhancement   │   │Patch16   │   │ Optimizer  │ │
│  │ Combined │   │ + Caching     │   │-224      │   │            │ │
│  └──────────┘   └───────────────┘   └──────────┘   └────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

3. Data Architecture

3.1 Data Sources

Source	Images	Resolution	Classes	Notes
ODIR-5K	4,966	512×512	All 5	Preprocessed fundus images
APTOS-2019	3,662	~1949×1500	DR only	Raw fundus, 5-level severity
Combined	8,540	224×224 (resized)	5 classes	Single-disease filtered

3.2 Class Distribution

Class	Samples	Percentage	Imbalance Ratio (vs. AMD)
Normal	2,071	24.3%	7.8x
Diabetic Retinopathy (DR)	5,581	65.4%	21.1x
Glaucoma	308	3.6%	1.2x
Cataract	315	3.7%	1.2x
AMD	265	3.1%	1.0x
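As a quick check on the table, the percentages and the 21.1× figure can be reproduced from the sample counts. This is a minimal sketch: the dictionary name is illustrative, and it assumes the 21.1× ratio is the largest class (DR) divided by the smallest (AMD).

```python
# Sample counts taken from the class distribution table.
class_counts = {
    "Normal": 2071,
    "DR": 5581,
    "Glaucoma": 308,
    "Cataract": 315,
    "AMD": 265,
}

total = sum(class_counts.values())

# Share of the combined dataset held by each class, in percent.
percentages = {name: 100 * n / total for name, n in class_counts.items()}

# Ratio between the largest and the smallest class.
max_ratio = max(class_counts.values()) / min(class_counts.values())

for name, pct in percentages.items():
    print(f"{name:<9} {class_counts[name]:>5}  {pct:4.1f}%")
print(f"largest/smallest ratio: {max_ratio:.1f}x")
```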

3.3 Data Split Strategy

Training: 6,832 samples (80%, stratified)
Validation: 1,708 samples (20%, stratified)
Stratified split preserves class distribution in both sets
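The split can be illustrated with a small standard-library sketch. It is not the project's actual splitting code (scikit-learn's `train_test_split` with the `stratify` argument is the more usual route); the function name, the seed, and the per-class rounding rule are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_indices, val_indices) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Each class contributes the same fraction to the validation set.
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Rebuild a label list with the class counts from Section 3.2.
counts = {"Normal": 2071, "DR": 5581, "Glaucoma": 308, "Cataract": 315, "AMD": 265}
labels = [name for name, n in counts.items() for _ in range(n)]

train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

With these counts, rounding each class to the nearest sample reproduces the 6,832 / 1,708 sizes listed above.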

3.4 Domain Shift: Critical Architectural Consideration

APTOS images have 10.7× lower sharpness than ODIR (25.5 vs 272.6)
All APTOS images map exclusively to the DR class
This creates two distinct visual subpopulations within DR
In this project the ViT handled this domain gap better than the CNN variants, likely because of its global attention mechanism

4. Preprocessing Architecture

4.1 Ben Graham Enhancement Pipeline

Input Image ──▶ Resize (224×224) ──▶ Gaussian Blur (σ=10)
                                            │
                                            ▼
                                     Weighted Subtraction
                                     4*img - 4*blur + 128
                                            │
                                            ▼
                                     Circular Mask
                                     (r = 0.48 × size)
                                            │
                                            ▼
                                     ImageNet Normalization
                                     μ=[0.485,0.456,0.406]
                                     σ=[0.229,0.224,0.225]
                                            │
                                            ▼
                                     Output Tensor (3×224×224)
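The steps in the diagram can be sketched in NumPy as follows. Several details are assumptions rather than facts from this document: the input is taken to be an RGB `uint8` array already resized to 224×224, a hand-written separable blur (kernel truncated at 3σ) stands in for the OpenCV call a real pipeline would likely use, the enhanced image is clipped to [0, 255], and pixels outside the circular mask are filled with 0.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def gaussian_blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur of an HxWxC float image (kernel radius 3*sigma)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1, dtype=np.float32)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    # Reflect-pad so that "valid" convolution returns the original size.
    padded = np.pad(img, ((radius, radius), (radius, radius), (0, 0)), mode="reflect")

    def conv(v):
        return np.convolve(v, kernel, mode="valid")

    rows = np.apply_along_axis(conv, 1, padded)  # blur along width
    return np.apply_along_axis(conv, 0, rows)    # blur along height


def ben_graham(img_uint8: np.ndarray, sigma: float = 10.0, fill: float = 0.0) -> np.ndarray:
    """Return a normalized 3xHxW float32 array for a square HxWx3 uint8 image."""
    img = img_uint8.astype(np.float32)
    size = img.shape[0]

    # Weighted subtraction: 4*img - 4*blur + 128 (clipping is an assumption).
    enhanced = np.clip(4 * img - 4 * gaussian_blur(img, sigma) + 128, 0, 255)

    # Circular mask with radius 0.48 * size around the image centre.
    yy, xx = np.ogrid[:size, :size]
    center = (size - 1) / 2
    inside = (yy - center) ** 2 + (xx - center) ** 2 <= (0.48 * size) ** 2
    enhanced = np.where(inside[..., None], enhanced, fill)

    # ImageNet normalization, then HWC -> CHW.
    normalized = (enhanced / 255.0 - IMAGENET_MEAN) / IMAGENET_STD
    return normalized.transpose(2, 0, 1).astype(np.float32)


tensor = ben_graham(np.full((224, 224, 3), 100, dtype=np.uint8))
print(tensor.shape)
```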

4.2 Pre-Caching Architecture

To eliminate the CPU bottleneck (Ben Graham preprocessing: 100–200ms/image), a caching layer stores preprocessed images as .npy files:

One-Time Caching Phase:
  Raw Image → Ben Graham Preprocessing → np.save('cache/{id}.npy')
  Cost: ~60 seconds for 8,540 images

Training Phase:
  np.load('cache/{id}.npy') → GPU tensor        (~1ms vs 100–200ms)

Impact: GPU utilization rose from 5–10% to 60–85%, and training ran about 4× faster.
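A minimal sketch of the load-or-compute pattern behind this cache is shown here. The helper name `load_cached` and the stand-in `fake_preprocess` function are hypothetical; the real pipeline would pass its Ben Graham preprocessing function instead.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_cached(image_id: str, cache_dir: Path, preprocess) -> np.ndarray:
    """Return the preprocessed array for image_id, computing it only on a cache miss."""
    cache_path = cache_dir / f"{image_id}.npy"
    if cache_path.exists():
        return np.load(cache_path)   # fast path used during training
    array = preprocess(image_id)     # slow path: one-time preprocessing
    np.save(cache_path, array)
    return array


# Stand-in for the real preprocessing step; it counts how often it runs.
calls = {"n": 0}


def fake_preprocess(image_id: str) -> np.ndarray:
    calls["n"] += 1
    return np.zeros((3, 224, 224), dtype=np.float32)


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    first = load_cached("img_001", cache_dir, fake_preprocess)   # miss: computes and saves
    second = load_cached("img_001", cache_dir, fake_preprocess)  # hit: loads from disk

print(calls["n"])
```

If the cached arrays are stored as 3×224×224 `float32` (an assumption), each file is about 0.6 MB, so caching all 8,540 images takes roughly 5 GB of disk space.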

4.3 Data Augmentation (Training Only)

Augmentation	Parameters	Purpose
RandomHorizontalFlip	p=0.5	Geometric invariance
RandomVerticalFlip	p=0.3	Geometric invariance
RandomRotation	20°	Orientation invariance
RandomAffine	translate=0.05, scale=0.95–1.05	Position invariance
ColorJitter	brightness=0.3, contrast=0.3	Lighting robustness
RandomErasing	p=0.2	Occlusion robustness

5. Model Architecture

5.1 Production Model: Vision Transformer (ViT-Base-Patch16-224)

Input Image (3×224×224)
        │
        ▼
┌──────────────────────────────────┐
│   Patch Embedding Layer          │
│   14×14 = 196 patches (16×16)    │
│   + 1 [CLS] token                │
│   + Position Embeddings          │
│   → 197 × 768                    │
└──────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────┐
│   12× Transformer Encoder Blocks │
│   ┌────────────────────────────┐ │
│   │ Multi-Head Self-Attention  │ │
│   │ (12 heads, 768 dim)       │ │
│   ├────────────────────────────┤ │
│   │ Layer Norm + Residual      │ │
│   ├────────────────────────────┤ │
│   │ MLP (768 → 3072 → 768)   │ │
│   ├────────────────────────────┤ │
│   │ Layer Norm + Residual      │ │
│   └────────────────────────────┘ │
└──────────────────────────────────┘
        │
        ▼  [CLS] token output (768-dim)
        │
   ┌────┴────┐
   ▼         ▼
┌────────┐ ┌──────────┐
│Disease │ │ Severity │
│Head    │ │ Head     │
│        │ │          │
│768→512 │ │768→256   │
│BN+ReLU │ │BN+ReLU  │
│Drop 0.3│ │Drop 0.3 │
│512→256 │ │256→5    │
│BN+ReLU │ │(severity)│
│Drop 0.2│ └──────────┘
│256→5   │
│(class) │
└────────┘

Key Specifications:

Property	Value
Architecture	ViT-Base-Patch16-224 (timm)
Parameters	~86M
Pre-trained	ImageNet-21k
Feature Dimension	768
Patch Size	16×16
Sequence Length	197 (196 patches + 1 CLS)
Model File Size	331 MB
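The sequence length, parameter count, and file size in the table can be cross-checked with simple arithmetic. The sketch below assumes the standard ViT-Base layout (biases on every linear layer, two LayerNorms per block, a final LayerNorm) and counts only the learnable weight and bias of each BatchNorm layer in the heads; the helper names are illustrative.

```python
def linear(n_in: int, n_out: int) -> int:
    """Parameters of a fully connected layer with bias."""
    return n_in * n_out + n_out


def norm(dim: int) -> int:
    """Learnable scale and shift of a LayerNorm or BatchNorm layer."""
    return 2 * dim


image_size, patch_size, dim, depth, mlp_dim = 224, 16, 768, 12, 3072

num_patches = (image_size // patch_size) ** 2   # 14 x 14 grid of patches
seq_len = num_patches + 1                       # plus the [CLS] token

patch_embed = linear(3 * patch_size * patch_size, dim)
cls_and_pos = dim + seq_len * dim               # [CLS] token + position embeddings
block = (
    norm(dim) + linear(dim, 3 * dim)            # norm + fused Q, K, V projection
    + linear(dim, dim)                          # attention output projection
    + norm(dim) + linear(dim, mlp_dim) + linear(mlp_dim, dim)  # norm + MLP
)
backbone = patch_embed + cls_and_pos + depth * block + norm(dim)

# Heads as drawn in the diagram of Section 5.1.
disease_head = linear(768, 512) + norm(512) + linear(512, 256) + norm(256) + linear(256, 5)
severity_head = linear(768, 256) + norm(256) + linear(256, 5)

total = backbone + disease_head + severity_head
size_mib = total * 4 / 2**20                    # float32 weights, 4 bytes each

print(seq_len, backbone, total, round(size_mib))
```

Under these assumptions the total is about 86.5M parameters and about 330 MiB of float32 weights, in line with the ~86M and 331 MB figures in the table.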

5.2 Backup Model: EfficientNet-B3

Property	Value
Architecture	EfficientNet-B3 (timm)
Parameters	~12M
Feature Dimension	1,536
Image Size	300×300
Model File Size	47 MB

5.3 Multi-Task Learning Design

Both models share a common backbone with two specialized heads:

Disease Classification Head → 5-class output (softmax)
Severity Grading Head → 5-level DR severity (for APTOS-sourced samples)

Loss = Focal Loss (disease) + 0.2 × CrossEntropy (severity)
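One way to combine the two terms when only some samples carry a severity grade is sketched below in NumPy. The document does not say how samples without a grade are handled; marking them with `-1` and excluding them from the severity term is an assumption, as are the function names.

```python
import numpy as np


def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Per-sample cross-entropy from raw logits (numerically stable log-softmax)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]


def multitask_loss(disease_loss, severity_logits, severity_targets,
                   severity_weight=0.2, ignore_index=-1):
    """disease_loss + 0.2 * mean severity cross-entropy over graded samples only."""
    has_grade = severity_targets != ignore_index
    if not has_grade.any():
        return float(disease_loss)   # batch without any graded sample
    severity_loss = cross_entropy(
        severity_logits[has_grade], severity_targets[has_grade]
    ).mean()
    return float(disease_loss + severity_weight * severity_loss)


# Two samples: the first has severity grade 2, the second has no grade.
logits = np.zeros((2, 5))
targets = np.array([2, -1])
print(multitask_loss(1.0, logits, targets))
```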

6. Training Architecture

6.1 Training Configuration

Parameter	Value	Rationale
Epochs	30	Best checkpoint at epoch 30
Batch Size	32 (effective 64)	Gradient accumulation ×2
Optimizer	AdamW	Weight decay regularization
Learning Rate	3×10⁻⁴	Stable for ViT fine-tuning
LR Scheduler	Cosine Annealing (T_max=30, η_min=1×10⁻⁷)	Smooth decay
Mixed Precision	AMP (GradScaler)	2× speed, reduced VRAM
Early Stopping	Patience=10 on macro F1	Prevent overfitting
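For reference, the cosine annealing schedule in the table follows the closed form η_t = η_min + ½(η_max − η_min)(1 + cos(π·t / T_max)). The short sketch below evaluates it with the listed values; the function name is illustrative.

```python
import math


def cosine_lr(epoch: int, base_lr: float = 3e-4, eta_min: float = 1e-7, t_max: int = 30) -> float:
    """Learning rate after `epoch` scheduler steps under cosine annealing."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 15, 30):
    print(epoch, f"{cosine_lr(epoch):.3e}")
```

The rate starts at 3×10⁻⁴, passes the midpoint between the two bounds at epoch 15, and reaches η_min at epoch 30.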

6.2 Loss Function: Focal Loss

FL(p_t) = −α_t × (1 − p_t)^γ × log(p_t)

Parameters:
  γ = 1.0   (focusing parameter)
  α = class_weights  (inverse class frequency)

Focal Loss down-weights easy (well-classified) examples so that training concentrates on hard minority-class samples, which is critical given the roughly 21:1 imbalance between the largest class (DR) and the smallest (AMD).
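The formula can be written out in NumPy as a minimal sketch. The training code would use a PyTorch equivalent; the exact normalization of the inverse-frequency weights is not stated in this document, so the "total / (classes × count)" form used here is an assumption.

```python
import numpy as np


def inverse_frequency_weights(counts) -> np.ndarray:
    """Class weights proportional to 1 / class frequency (assumed normalization)."""
    counts = np.asarray(counts, dtype=np.float64)
    return counts.sum() / (len(counts) * counts)


def focal_loss(logits: np.ndarray, targets: np.ndarray, alpha: np.ndarray, gamma: float = 1.0) -> float:
    """Mean of -alpha_t * (1 - p_t)^gamma * log(p_t) over the batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_pt = log_probs[np.arange(len(targets)), targets]   # log-probability of the true class
    pt = np.exp(log_pt)
    return float((-alpha[targets] * (1 - pt) ** gamma * log_pt).mean())


# Class order: Normal, DR, Glaucoma, Cataract, AMD.
alpha = inverse_frequency_weights([2071, 5581, 308, 315, 265])

easy = np.array([[0.0, 6.0, 0.0, 0.0, 0.0]])   # confident and correct (DR)
hard = np.array([[0.0, 0.0, 0.0, 0.0, 0.0]])   # uniform, uncertain prediction
target = np.array([1])
print(focal_loss(easy, target, alpha), focal_loss(hard, target, alpha))
```

With γ = 0 and all weights equal to 1 the expression reduces to ordinary cross-entropy, which makes a convenient sanity check.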

6.3 GPU Optimization Architecture

Original Pipeline:                  Optimized Pipeline:
┌────────┐ ┌──────────┐ ┌──────┐   ┌─────────┐ ┌──────┐
│Disk I/O│→│Ben Graham│→│ GPU  │   │Cache I/O│→│ GPU  │
│ 10ms   │ │ 100-200ms│ │ 20ms │   │  1ms    │ │ 25ms │
└────────┘ └──────────┘ └──────┘   └─────────┘ └──────┘

GPU Util: 6%                        GPU Util: 96%
Speed: ~1 it/s                      Speed: ~4-5 it/s

Optimizations applied:

Pre-cached preprocessing (100× faster data loading)
Batch size: 32 → 128 (4× larger)
DataLoader workers: 2 → 8 (4× parallel loading)
Persistent workers, prefetch_factor=2
Non-blocking GPU transfers
optimizer.zero_grad(set_to_none=True)

7. Inference Architecture

7.1 Single-Image Inference Pipeline

Input Image ──▶ Ben Graham Preprocess ──▶ ImageNet Normalize
                                                │
                                                ▼
                                         ViT Forward Pass
                                         (disease_logits, severity_logits)
                                                │
                                                ▼
                                            Softmax
                                                │
                                                ▼
                                     ┌──────────────────────┐
                                     │ Threshold-Based      │
                                     │ Decision Logic        │
                                     │                      │
                                     │ Per-Class Thresholds: │
                                     │  Normal:    0.540    │
                                     │  DR:        0.240    │
                                     │  Glaucoma:  0.810    │
                                     │  Cataract:  0.930    │
                                     │  AMD:       0.850    │
                                     └──────────────────────┘
                                                │
                                                ▼
                                     Prediction + Confidence Score
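The diagram does not spell out how the per-class thresholds are combined into a single prediction. One plausible rule, shown here purely as an assumption, is to pick the most probable class among those that clear their threshold, and to fall back to the plain argmax (flagged for review) when none does. The function and key names are hypothetical.

```python
CLASSES = ["Normal", "DR", "Glaucoma", "Cataract", "AMD"]
THRESHOLDS = {"Normal": 0.540, "DR": 0.240, "Glaucoma": 0.810, "Cataract": 0.930, "AMD": 0.850}


def decide(probs):
    """probs: five softmax probabilities in CLASSES order."""
    scored = dict(zip(CLASSES, probs))
    # Classes whose probability reaches their own threshold.
    passing = {c: p for c, p in scored.items() if p >= THRESHOLDS[c]}
    # Assumed fallback: if no class passes, use the argmax and flag the case.
    pool = passing or scored
    label = max(pool, key=pool.get)
    return {
        "class": label,
        "confidence": scored[label],
        "flag_for_review": not passing,
        "probabilities": scored,
    }


clear_case = decide([0.30, 0.35, 0.05, 0.25, 0.05])
unclear_case = decide([0.50, 0.20, 0.10, 0.10, 0.10])
print(clear_case["class"], clear_case["flag_for_review"])
print(unclear_case["class"], unclear_case["flag_for_review"])
```

In the first case only DR clears its threshold (0.35 ≥ 0.240), so DR is returned. In the second case no class clears its threshold, so the argmax (Normal) is returned and flagged.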

7.2 Threshold Optimization Method

Per-class thresholds are optimized via grid search (0.05 to 0.95, step 0.05) on the validation set, converting each class to a one-vs-rest binary problem and maximizing F1 score per class.
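The one-vs-rest grid search can be sketched with the standard library alone. The function names are illustrative, ties are resolved in favour of the lowest threshold (an assumption), and the grid bounds and step are left as parameters so that a finer grid can be used.

```python
def f1_score_binary(y_true, y_pred) -> float:
    """F1 for boolean labels and predictions; 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, probs, lo=0.05, hi=0.95, step=0.05):
    """Grid-search the threshold that maximizes binary F1 for one class."""
    n_steps = round((hi - lo) / step)
    best_t, best_f1 = lo, -1.0
    for i in range(n_steps + 1):
        t = round(lo + i * step, 10)   # avoid floating-point drift on the grid
        f1 = f1_score_binary(y_true, [p >= t for p in probs])
        if f1 > best_f1:               # strict '>' keeps the lowest threshold on ties
            best_t, best_f1 = t, f1
    return best_t, best_f1


def optimize_thresholds(labels, prob_rows, num_classes=5):
    """One threshold per class, treating each class as a one-vs-rest problem."""
    return [
        best_threshold([y == c for y in labels], [row[c] for row in prob_rows])[0]
        for c in range(num_classes)
    ]


# Toy one-vs-rest example: two positives, two negatives.
threshold, f1 = best_threshold([True, True, False, False], [0.9, 0.6, 0.4, 0.1])
print(threshold, f1)
```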

Two threshold strategies available:

Strategy	Accuracy	Macro F1	Use Case
Accuracy-focused (Default)	84.48%	0.840	General screening
F1-focused	80.44%	0.858	Rare disease detection

7.3 Inference Performance

Config	Latency	Throughput	GPU Memory
ViT Solo	~15ms	~66 img/s	~2 GB
ViT + TTA (8×)	~120ms	~8 img/s	~2 GB
Ensemble (3 models)	~45ms	~22 img/s	~4 GB

7.4 Optional: Hybrid Inference Architecture

Image ──▶ ViT First-Pass (fast, 15ms)
              │
              ├─ Confidence ≥ 0.75 AND majority class ──▶ Return prediction
              │
              └─ Confidence < 0.75 OR rare class ──▶ Ensemble Second-Pass ──▶ Return
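The routing rule above can be sketched as a small function. The two predictor callables are placeholders for the ViT and ensemble pipelines, and treating Normal and DR as the "majority" classes is an assumption based on the class distribution in Section 3.2.

```python
# Assumption: the two largest classes count as "majority"; the rest are "rare".
MAJORITY_CLASSES = {"Normal", "DR"}


def hybrid_predict(image, vit_predict, ensemble_predict, min_confidence=0.75):
    """Return (label, confidence, route) using the ViT first and the ensemble as fallback."""
    label, confidence = vit_predict(image)
    if confidence >= min_confidence and label in MAJORITY_CLASSES:
        return label, confidence, "vit"
    # Low confidence or rare class: pay for the slower second pass.
    label, confidence = ensemble_predict(image)
    return label, confidence, "ensemble"


# Placeholder predictors returning fixed (label, confidence) pairs.
fast_path = hybrid_predict(None, lambda img: ("DR", 0.92), lambda img: ("DR", 0.95))
rare_path = hybrid_predict(None, lambda img: ("AMD", 0.90), lambda img: ("AMD", 0.88))
unsure_path = hybrid_predict(None, lambda img: ("Normal", 0.60), lambda img: ("DR", 0.70))
print(fast_path, rare_path, unsure_path)
```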

8. Ensemble Architecture (Optional)

Model	Weight	Architecture	Size
ViT-Base-Patch16-224	0.85	Vision Transformer	331 MB
EfficientNet-B3 Extended	0.10	CNN (50 epochs)	47 MB
EfficientNet-B3 v2	0.05	CNN (20 epochs)	47 MB

Ensemble Strategy: Weighted probability averaging
final_prob = 0.85×ViT + 0.10×EffNetExt + 0.05×EffNetv2
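The weighted average can be written in a few lines. The model keys in the dictionary are illustrative names, and the example probabilities are made up to show the arithmetic.

```python
WEIGHTS = {"vit": 0.85, "effnet_ext": 0.10, "effnet_v2": 0.05}


def ensemble_probs(model_probs):
    """Weighted average of per-model probability vectors (same class order in each)."""
    num_classes = len(next(iter(model_probs.values())))
    return [
        sum(WEIGHTS[name] * probs[c] for name, probs in model_probs.items())
        for c in range(num_classes)
    ]


combined = ensemble_probs({
    "vit":        [0.6, 0.4, 0.0, 0.0, 0.0],
    "effnet_ext": [0.2, 0.8, 0.0, 0.0, 0.0],
    "effnet_v2":  [0.0, 1.0, 0.0, 0.0, 0.0],
})
print([round(p, 3) for p in combined])
```

Because the weights sum to 1 and each input is a probability vector, the combined output is also a valid probability vector.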

9. Technology Stack

Layer	Technology	Version
Framework	PyTorch	2.0+
Model Library	timm	0.9+
Vision Utils	torchvision	0.18+
Image Processing	OpenCV	4.8+
Data Handling	pandas	2.0+
ML Metrics	scikit-learn	1.3+
Visualization	matplotlib, seaborn	Latest
GPU	NVIDIA H200	141 GB VRAM
Training	CUDA + AMP (Mixed Precision)	—

10. File and Directory Structure

retina-sense/
├── Notebooks
│   ├── RetinaSense_Production.ipynb        # Production inference ⭐
│   ├── RetinaSense_ViT_Training.ipynb      # ViT training process
│   └── RetinaSense_Optimized.ipynb         # GPU optimization experiments
│
├── Training Scripts
│   ├── retinasense_vit.py                  # ViT training (84.48%)
│   ├── retinasense_v2_extended.py          # Extended CNN (50 epochs)
│   ├── retinasense_v2.py                   # Baseline CNN (20 epochs)
│   └── retinasense_fixed.py                # Bug-fixed original
│
├── Optimization Scripts
│   ├── threshold_optimization_vit.py       # ViT threshold tuning
│   ├── threshold_optimization_simple.py    # v2 threshold tuning
│   ├── ensemble_inference.py               # Model ensemble
│   ├── tta_evaluation.py                   # Test-time augmentation
│   └── data_analysis.py                    # Dataset analysis
│
├── Model Outputs
│   ├── outputs_vit/                        # ViT checkpoints + results
│   ├── outputs_v2/                         # v2 baseline outputs
│   ├── outputs_v2_extended/                # Extended training outputs
│   ├── outputs_optimized/                  # GPU optimization outputs
│   ├── outputs_ensemble/                   # Ensemble results
│   └── outputs_analysis/                   # Data analysis outputs
│
└── Data
    ├── data/combined_dataset.csv           # Unified metadata
    └── final_unified_metadata.csv          # Full metadata file

11. Deployment Architecture

11.1 Production Deployment Specification

Model:
  Architecture: ViT-Base-Patch16-224
  Checkpoint: outputs_vit/best_model.pth
  Size: 331 MB
  Parameters: ~86M

Input:
  Image Size: 224×224 pixels
  Format: RGB fundus image
  Preprocessing: Ben Graham + ImageNet normalization

Output:
  Class: [Normal, DR, Glaucoma, Cataract, AMD]
  Confidence: Float [0.0, 1.0]
  All Probabilities: Array of 5 floats
  Flag for Review: If confidence < threshold

Hardware Requirements:
  GPU: NVIDIA (CUDA required), 2+ GB VRAM
  Inference Speed: ~66 images/sec

11.2 Monitoring Requirements

Track prediction class distribution for data drift
Monitor confidence score calibration over time
Log flagged (low-confidence) cases for expert review
Alert on out-of-distribution inputs
Track inference latency and throughput
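As one possible way to implement the first item, the distribution of recent predictions can be compared with a reference distribution. The sketch below uses total variation distance; the choice of metric, the use of the training distribution from Section 3.2 as the reference, and the 0.15 alert level are all assumptions for illustration.

```python
from collections import Counter

CLASSES = ["Normal", "DR", "Glaucoma", "Cataract", "AMD"]

# Reference shares taken from the class distribution table (assumed baseline).
REFERENCE = {"Normal": 0.243, "DR": 0.654, "Glaucoma": 0.036, "Cataract": 0.037, "AMD": 0.031}


def class_distribution(predictions):
    """Fraction of predictions per class for a non-empty list of class names."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    counts = Counter(predictions)
    return {c: counts.get(c, 0) / len(predictions) for c in CLASSES}


def total_variation(p, q) -> float:
    """Half the L1 distance between two class distributions (0 = identical)."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in CLASSES)


def drift_alert(predictions, max_distance=0.15) -> bool:
    """True when recent predictions deviate from the reference by more than max_distance."""
    return total_variation(class_distribution(predictions), REFERENCE) > max_distance


typical = ["Normal"] * 243 + ["DR"] * 654 + ["Glaucoma"] * 36 + ["Cataract"] * 37 + ["AMD"] * 31
shifted = ["Glaucoma"] * 100
print(drift_alert(typical), drift_alert(shifted))
```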

12. Limitations and Constraints

Population Bias: Trained primarily on Asian populations (ODIR dataset)
Equipment Sensitivity: May not generalize across different fundus cameras
Image Quality Dependence: Requires high-quality fundus images
Single-Label: Does not handle co-morbidities (multi-label not supported)
Domain Shift: APTOS/ODIR quality gap (10× sharpness difference) is partially addressed by ViT but remains a concern
Not FDA/CE Approved: Research/educational use only
