
RetinaSense-ViT: System Architecture Document

Version: 1.0
Date: March 10, 2026
Author: Tanishq
Status: Production Ready


1. Introduction

1.1 Purpose

This document describes the system architecture of RetinaSense-ViT, a deep learning system for automated multi-class retinal disease classification from fundus images. The system assigns each image to one of five classes: Normal, Diabetic Retinopathy (DR), Glaucoma, Cataract, and Age-related Macular Degeneration (AMD). It achieves 84.48% accuracy and a macro F1 score of 0.840.

1.2 Scope

This architecture covers:

  • Data ingestion and preprocessing pipeline
  • Model architecture (Vision Transformer and EfficientNet variants)
  • Training infrastructure and GPU optimization
  • Inference pipeline with threshold optimization
  • Evaluation and monitoring subsystems

1.3 Intended Audience

ML engineers, software architects, clinical researchers, and deployment teams.


2. High-Level Architecture

┌───────────────────────────────────────────────────────────────────┐
│                      RetinaSense-ViT System                       │
│                                                                   │
│  ┌──────────┐   ┌───────────────┐   ┌──────────┐   ┌────────────┐ │
│  │ Data     │──▶│ Preprocessing │──▶│ Model    │──▶│ Inference  │ │
│  │ Ingestion│   │ Pipeline      │   │ Training │   │ Pipeline   │ │
│  └──────────┘   └───────────────┘   └──────────┘   └────────────┘ │
│       │               │                  │               │        │
│       ▼               ▼                  ▼               ▼        │
│  ┌──────────┐   ┌───────────────┐   ┌──────────┐   ┌────────────┐ │
│  │ ODIR-5K  │   │ Ben Graham    │   │ViT-Base  │   │ Threshold  │ │
│  │ APTOS-19 │   │ Enhancement   │   │Patch16   │   │ Optimizer  │ │
│  │ Combined │   │ + Caching     │   │-224      │   │            │ │
│  └──────────┘   └───────────────┘   └──────────┘   └────────────┘ │
└───────────────────────────────────────────────────────────────────┘

3. Data Architecture

3.1 Data Sources

| Source | Images | Resolution | Classes | Notes |
| --- | --- | --- | --- | --- |
| ODIR-5K | 4,966 | 512×512 | All 5 | Preprocessed fundus images |
| APTOS-2019 | 3,662 | ~1949×1500 | DR only | Raw fundus, 5-level severity |
| Combined | 8,540 | 224×224 (resized) | 5 classes | Single-disease filtered |

3.2 Class Distribution

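| Class | Samples | Percentage | Ratio to Normal |
| --- | --- | --- | --- |
| Normal | 2,071 | 24.3% | 1.0x |
| Diabetes/DR | 5,581 | 65.4% | 2.7x |
| Glaucoma | 308 | 3.6% | 0.15x |
| Cataract | 315 | 3.7% | 0.15x |
| AMD | 265 | 3.1% | 0.13x |

The largest class (DR) outnumbers the smallest (AMD) by 5,581 / 265 ≈ 21.1 to 1.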

3.3 Data Split Strategy

  • Training: 6,832 samples (80%, stratified)
  • Validation: 1,708 samples (20%, stratified)
  • Stratified split preserves class distribution in both sets
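
The split sizes above can be reproduced with a small stratified split. This is a minimal sketch rather than the project's actual code: the function name, the seed, and the per-class rounding rule are assumptions, and in practice scikit-learn's `train_test_split` with its `stratify` argument is the usual tool.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_indices, val_indices) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)  # assumed rounding rule
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Class counts from Section 3.2
counts = {"Normal": 2071, "DR": 5581, "Glaucoma": 308, "Cataract": 315, "AMD": 265}
labels = [name for name, n in counts.items() for _ in range(n)]
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

With these counts, splitting each class separately gives the same 6,832 / 1,708 totals as listed above.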

3.4 Domain Shift: Critical Architectural Consideration

  • APTOS images are 10.7× less sharp than ODIR images (sharpness score 25.5 vs 272.6)
  • All APTOS images map exclusively to the DR class
  • The DR class therefore contains two visually distinct subpopulations: sharper ODIR images and blurrier APTOS images
  • The ViT is reported to handle this domain gap better than the CNN variants, which this design attributes to its global attention mechanism

4. Preprocessing Architecture

4.1 Ben Graham Enhancement Pipeline

Input Image ──▶ Resize (224×224) ──▶ Gaussian Blur (σ=10)
                                            │
                                            ▼
                                     Weighted Subtraction
                                     4*img - 4*blur + 128
                                            │
                                            ▼
                                     Circular Mask
                                     (r = 0.48 × size)
                                            │
                                            ▼
                                     ImageNet Normalization
                                     μ=[0.485,0.456,0.406]
                                     σ=[0.229,0.224,0.225]
                                            │
                                            ▼
                                     Output Tensor (3×224×224)
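
The pipeline above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the project's implementation. The input is assumed to be already resized to 224×224. The blur is a simple separable stand-in so the example needs no OpenCV (`cv2.GaussianBlur` would be the usual choice). Clipping to [0, 255] and zeroing pixels outside the circular mask are assumptions the diagram does not specify.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with reflect padding (stand-in for cv2.GaussianBlur)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1, dtype=np.float32)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    out = img.astype(np.float32)
    for axis in (0, 1):  # blur along rows, then along columns
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="reflect")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out

def ben_graham(img, sigma=10.0, radius_frac=0.48):
    """img: HxWx3 uint8 RGB array, already resized to the model input size."""
    img = img.astype(np.float32)
    blur = gaussian_blur(img, sigma)
    # Weighted subtraction: 4*img - 4*blur + 128 (clipping is an assumption)
    enhanced = np.clip(4.0 * img - 4.0 * blur + 128.0, 0, 255)

    # Circular mask with r = 0.48 * size, centered on the image
    h, w = img.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    r = radius_frac * min(h, w)
    mask = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= r ** 2
    enhanced = enhanced * mask[..., None]  # assumption: zero outside the circle

    normalized = (enhanced / 255.0 - IMAGENET_MEAN) / IMAGENET_STD
    return normalized.transpose(2, 0, 1).astype(np.float32)  # HWC -> CHW
```

On a uniform image the blur equals the image, so every pixel inside the mask becomes 128 before normalization. This is a quick sanity check for the subtraction step.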

4.2 Pre-Caching Architecture

To eliminate the CPU bottleneck (Ben Graham preprocessing: 100–200ms/image), a caching layer stores preprocessed images as .npy files:

One-Time Caching Phase:
  Raw Image → Ben Graham Preprocessing → np.save('cache/{id}.npy')
  Cost: ~60 seconds for 8,540 images

Training Phase:
  np.load('cache/{id}.npy') → GPU tensor        (~1ms vs 100–200ms)

Impact: GPU utilization improved from 5–10% to 60–85%; training speedup ~4×.
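
The two phases can be sketched as a pair of helpers around `np.save` and `np.load`. The function names, the dictionary-based input, and the stand-in preprocessing function are hypothetical and only illustrate the pattern.

```python
import tempfile
from pathlib import Path

import numpy as np

def cache_path(cache_dir, image_id):
    return Path(cache_dir) / f"{image_id}.npy"

def build_cache(images, preprocess, cache_dir):
    """One-time phase. images: dict mapping image_id -> raw array."""
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    for image_id, raw in images.items():
        path = cache_path(cache_dir, image_id)
        if not path.exists():  # skip images that are already cached
            np.save(path, preprocess(raw))

def load_cached(cache_dir, image_id):
    """Training-time read: a single file load, no preprocessing."""
    return np.load(cache_path(cache_dir, image_id))

# Usage with a stand-in preprocessing function
demo_images = {"img_001": np.zeros((224, 224, 3), dtype=np.uint8)}
demo_dir = tempfile.mkdtemp()
build_cache(demo_images, lambda a: a.astype(np.float32) / 255.0, demo_dir)
tensor = load_cached(demo_dir, "img_001")
```

Because the cached arrays are stored after preprocessing, any change to the preprocessing parameters requires rebuilding the cache.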

4.3 Data Augmentation (Training Only)

| Augmentation | Parameters | Purpose |
| --- | --- | --- |
| RandomHorizontalFlip | p=0.5 | Geometric invariance |
| RandomVerticalFlip | p=0.3 | Geometric invariance |
| RandomRotation | 20° | Orientation invariance |
| RandomAffine | translate=0.05, scale=0.95–1.05 | Position invariance |
| ColorJitter | brightness=0.3, contrast=0.3 | Lighting robustness |
| RandomErasing | p=0.2 | Occlusion robustness |

5. Model Architecture

5.1 Production Model: Vision Transformer (ViT-Base-Patch16-224)

Input Image (3×224×224)
        │
        ▼
┌──────────────────────────────────┐
│   Patch Embedding Layer          │
│   14×14 = 196 patches (16×16)    │
│   + 1 [CLS] token                │
│   + Position Embeddings          │
│   → 197 × 768                    │
└──────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────┐
│   12× Transformer Encoder Blocks │
│   ┌────────────────────────────┐ │
│   │ Multi-Head Self-Attention  │ │
│   │ (12 heads, 768 dim)        │ │
│   ├────────────────────────────┤ │
│   │ Layer Norm + Residual      │ │
│   ├────────────────────────────┤ │
│   │ MLP (768 → 3072 → 768)     │ │
│   ├────────────────────────────┤ │
│   │ Layer Norm + Residual      │ │
│   └────────────────────────────┘ │
└──────────────────────────────────┘
        │
        ▼  [CLS] token output (768-dim)
        │
   ┌────┴────┐
   ▼         ▼
┌────────┐ ┌──────────┐
│Disease │ │ Severity │
│Head    │ │ Head     │
│        │ │          │
│768→512 │ │768→256   │
│BN+ReLU │ │BN+ReLU   │
│Drop 0.3│ │Drop 0.3  │
│512→256 │ │256→5     │
│BN+ReLU │ │(severity)│
│Drop 0.2│ └──────────┘
│256→5   │
│(class) │
└────────┘

Key Specifications:

| Property | Value |
| --- | --- |
| Architecture | ViT-Base-Patch16-224 (timm) |
| Parameters | ~86M |
| Pre-trained | ImageNet-21k |
| Feature Dimension | 768 |
| Patch Size | 16×16 |
| Sequence Length | 197 (196 patches + 1 CLS) |
| Model File Size | 331 MB |
5.2 Backup Model: EfficientNet-B3

| Property | Value |
| --- | --- |
| Architecture | EfficientNet-B3 (timm) |
| Parameters | ~12M |
| Feature Dimension | 1,536 |
| Image Size | 300×300 |
| Model File Size | 47 MB |

5.3 Multi-Task Learning Design

In both models, a single shared backbone feeds two task-specific heads:

  1. Disease Classification Head → 5-class output (softmax)
  2. Severity Grading Head → 5-level DR severity (for APTOS-sourced samples)

Loss = Focal Loss (disease) + 0.2 × CrossEntropy (severity)
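
A minimal NumPy sketch of how the two terms might be combined is shown below. Restricting the severity term to samples that carry a severity grade is an assumption based on the note that severity applies to APTOS-sourced samples. The function names and the `has_severity` mask are hypothetical.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Per-sample softmax cross-entropy. logits: (N, C), labels: (N,) ints."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def multitask_loss(disease_loss, severity_logits, severity_labels,
                   has_severity, weight=0.2):
    """disease_loss: scalar focal loss from the disease head.
    has_severity: boolean mask marking samples with a severity grade."""
    if has_severity.any():
        severity_loss = cross_entropy(
            severity_logits[has_severity], severity_labels[has_severity]
        ).mean()
    else:
        severity_loss = 0.0  # no graded samples in this batch
    return disease_loss + weight * severity_loss
```

The 0.2 weight keeps the auxiliary severity task from dominating the gradient that reaches the shared backbone.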


6. Training Architecture

6.1 Training Configuration

| Parameter | Value | Rationale |
| --- | --- | --- |
| Epochs | 30 | Best checkpoint at epoch 30 |
| Batch Size | 32 (effective 64) | Gradient accumulation ×2 |
| Optimizer | AdamW | Weight decay regularization |
| Learning Rate | 3×10⁻⁴ | Stable for ViT fine-tuning |
| LR Scheduler | Cosine Annealing (T_max=30, η_min=1×10⁻⁷) | Smooth decay |
| Mixed Precision | AMP (GradScaler) | 2× speed, reduced VRAM |
| Early Stopping | Patience=10 on macro F1 | Prevent overfitting |
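
To see what the cosine schedule in the table does to the learning rate, the closed-form curve can be evaluated directly. This sketch assumes the schedule is stepped once per epoch with no warmup. The function name is illustrative.

```python
import math

def cosine_annealing_lr(epoch, base_lr=3e-4, eta_min=1e-7, t_max=30):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 15, 30):
    print(epoch, f"{cosine_annealing_lr(epoch):.3e}")
```

The rate starts at 3×10⁻⁴, passes through roughly half that value at epoch 15, and reaches η_min at epoch 30.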

6.2 Loss Function: Focal Loss

FL(p_t) = −α_t × (1 − p_t)^γ × log(p_t)

Parameters:
  γ = 1.0   (focusing parameter)
  α = class_weights  (inverse class frequency)

Focal Loss down-weights easy (well-classified) examples so that training concentrates on hard minority-class samples, which is critical given the 21:1 class imbalance.
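
The formula above can be written out in NumPy as a sketch. Normalizing the inverse-frequency weights to a mean of 1 is an assumption, since the document states only that α is based on inverse class frequency. The function names are illustrative.

```python
import numpy as np

def inverse_frequency_weights(class_counts):
    """alpha_c proportional to 1 / count_c, normalized to mean 1 (assumed)."""
    counts = np.asarray(class_counts, dtype=np.float64)
    weights = 1.0 / counts
    return weights / weights.mean()

def focal_loss(logits, labels, alpha, gamma=1.0):
    """Mean focal loss. logits: (N, C), labels: (N,) ints, alpha: (C,) weights."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_pt = log_probs[np.arange(len(labels)), labels]  # log p_t of the true class
    pt = np.exp(log_pt)
    return float(np.mean(-alpha[labels] * (1.0 - pt) ** gamma * log_pt))

# Class counts from Section 3.2: Normal, DR, Glaucoma, Cataract, AMD
alpha = inverse_frequency_weights([2071, 5581, 308, 315, 265])
```

With γ = 0 and uniform α the expression reduces to ordinary cross-entropy. With these counts, an AMD sample is weighted about 21 times more heavily than a DR sample.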

6.3 GPU Optimization Architecture

Original Pipeline:                  Optimized Pipeline:
┌────────┐ ┌──────────┐ ┌──────┐    ┌─────────┐ ┌──────┐
│Disk I/O│→│Ben Graham│→│ GPU  │    │Cache I/O│→│ GPU  │
│ 10ms   │ │ 100-200ms│ │ 20ms │    │  1ms    │ │ 25ms │
└────────┘ └──────────┘ └──────┘    └─────────┘ └──────┘

GPU Util: 6%                        GPU Util: 96%
Speed: ~1 it/s                      Speed: ~4-5 it/s

Optimizations applied:

  • Pre-cached preprocessing (100× faster data loading)
  • Batch size: 32 → 128 (4× larger)
  • DataLoader workers: 2 → 8 (4× parallel loading)
  • Persistent workers, prefetch_factor=2
  • Non-blocking GPU transfers
  • optimizer.zero_grad(set_to_none=True)

7. Inference Architecture

7.1 Single-Image Inference Pipeline

Input Image ──▶ Ben Graham Preprocess ──▶ ImageNet Normalize
                                                │
                                                ▼
                                         ViT Forward Pass
                                         (disease_logits, severity_logits)
                                                │
                                                ▼
                                            Softmax
                                                │
                                                ▼
                                     ┌──────────────────────┐
                                     │ Threshold-Based      │
                                     │ Decision Logic       │
                                     │                      │
                                     │ Per-Class Thresholds:│
                                     │  Normal:    0.540    │
                                     │  DR:        0.240    │
                                     │  Glaucoma:  0.810    │
                                     │  Cataract:  0.930    │
                                     │  AMD:       0.850    │
                                     └──────────────────────┘
                                                │
                                                ▼
                                     Prediction + Confidence Score

7.2 Threshold Optimization Method

Per-class thresholds are optimized via grid search (0.05 to 0.95, step 0.05) on the validation set, converting each class to a one-vs-rest binary problem and maximizing F1 score per class.

Two threshold strategies are available:

| Strategy | Accuracy | Macro F1 | Use Case |
| --- | --- | --- | --- |
| Accuracy-focused (Default) | 84.48% | 0.840 | General screening |
| F1-focused | 80.44% | 0.858 | Rare disease detection |
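
The per-class grid search described in this section can be sketched as below. The function names are illustrative, the grid step is exposed as a parameter, and ties are resolved in favor of the lowest threshold, which is an assumption.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 score for boolean arrays of targets and predictions."""
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def optimize_thresholds(probs, labels, grid=None):
    """probs: (N, C) softmax outputs; labels: (N,) ints.
    Returns one F1-maximizing threshold per class."""
    if grid is None:
        grid = np.arange(0.05, 0.951, 0.05)  # 0.05 to 0.95, step 0.05
    thresholds = []
    for c in range(probs.shape[1]):
        y_true = labels == c  # one-vs-rest target for class c
        scores = [f1_binary(y_true, probs[:, c] >= t) for t in grid]
        thresholds.append(float(grid[int(np.argmax(scores))]))  # first best wins
    return thresholds
```

Because each class is tuned independently, the resulting thresholds need not sum to 1 and more than one class (or none) can clear its threshold for a given image, so the downstream decision logic has to define how such cases are resolved.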

7.3 Inference Performance

| Config | Latency | Throughput | GPU Memory |
| --- | --- | --- | --- |
| ViT Solo | ~15ms | ~66 img/s | ~2 GB |
| ViT + TTA (8×) | ~120ms | ~8 img/s | ~2 GB |
| Ensemble (3 models) | ~45ms | ~22 img/s | ~4 GB |

7.4 Optional: Hybrid Inference Architecture

Image ──▶ ViT First-Pass (fast, 15ms)
              │
              ├─ Confidence ≥ 0.75 AND majority class ──▶ Return prediction
              │
              └─ Confidence < 0.75 OR rare class ──▶ Ensemble Second-Pass ──▶ Return
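
The routing rule in the diagram can be sketched as a small function. Treating Normal and DR as the "majority" classes is an assumption based on the class counts in Section 3.2. The two predictor callables and the returned tuple layout are hypothetical.

```python
CLASSES = ["Normal", "DR", "Glaucoma", "Cataract", "AMD"]
MAJORITY_CLASSES = {"Normal", "DR"}  # assumption: the two largest classes

def hybrid_predict(image, vit_predict, ensemble_predict, min_confidence=0.75):
    """vit_predict / ensemble_predict: callables returning 5 class probabilities.
    Returns (class_name, confidence, which_model_answered)."""
    probs = vit_predict(image)
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    if probs[best] >= min_confidence and CLASSES[best] in MAJORITY_CLASSES:
        return CLASSES[best], probs[best], "vit"
    # Low confidence or a rare-class prediction: run the slower ensemble pass
    probs = ensemble_predict(image)
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best], probs[best], "ensemble"
```

Only images that fail the first-pass check pay the ensemble latency, so average latency depends on how often the ViT is confident about a majority class.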

8. Ensemble Architecture (Optional)

| Model | Weight | Architecture | Size |
| --- | --- | --- | --- |
| ViT-Base-Patch16-224 | 0.85 | Vision Transformer | 331 MB |
| EfficientNet-B3 Extended | 0.10 | CNN (50 epochs) | 47 MB |
| EfficientNet-B3 v2 | 0.05 | CNN (20 epochs) | 47 MB |

Ensemble Strategy: Weighted probability averaging
final_prob = 0.85×ViT + 0.10×EffNetExt + 0.05×EffNetv2
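
As a sketch, the weighted average can be computed over per-model probability arrays as follows. The dictionary keys and function name are illustrative. Dividing by the weight sum is a small safeguard added here, not something the formula above requires, since the listed weights already sum to 1.

```python
import numpy as np

ENSEMBLE_WEIGHTS = {"vit": 0.85, "effnet_ext": 0.10, "effnet_v2": 0.05}

def ensemble_probs(model_probs, weights=ENSEMBLE_WEIGHTS):
    """model_probs: dict name -> (N, 5) array of softmax probabilities."""
    total = sum(weights.values())
    combined = sum(weights[name] * np.asarray(model_probs[name]) for name in weights)
    return combined / total  # stays a valid distribution if weights are retuned

demo = {
    "vit": [[0.7, 0.2, 0.1, 0.0, 0.0]],
    "effnet_ext": [[0.5, 0.5, 0.0, 0.0, 0.0]],
    "effnet_v2": [[0.2, 0.8, 0.0, 0.0, 0.0]],
}
final_prob = ensemble_probs(demo)
```

With a 0.85 weight on the ViT, the two CNNs can only change the final class when the ViT's top two probabilities are close.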


9. Technology Stack

| Layer | Technology | Version |
| --- | --- | --- |
| Framework | PyTorch | 2.0+ |
| Model Library | timm | 0.9+ |
| Vision Utils | torchvision | 0.18+ |
| Image Processing | OpenCV | 4.8+ |
| Data Handling | pandas | 2.0+ |
| ML Metrics | scikit-learn | 1.3+ |
| Visualization | matplotlib, seaborn | Latest |
| GPU | NVIDIA H200 | 150 GB VRAM |
| Training | CUDA + AMP (Mixed Precision) | n/a |

10. File and Directory Structure

retina-sense/
├── Notebooks
│   ├── RetinaSense_Production.ipynb        # Production inference ⭐
│   ├── RetinaSense_ViT_Training.ipynb      # ViT training process
│   └── RetinaSense_Optimized.ipynb         # GPU optimization experiments
│
├── Training Scripts
│   ├── retinasense_vit.py                  # ViT training (84.48%)
│   ├── retinasense_v2_extended.py          # Extended CNN (50 epochs)
│   ├── retinasense_v2.py                   # Baseline CNN (20 epochs)
│   └── retinasense_fixed.py                # Bug-fixed original
│
├── Optimization Scripts
│   ├── threshold_optimization_vit.py       # ViT threshold tuning
│   ├── threshold_optimization_simple.py    # v2 threshold tuning
│   ├── ensemble_inference.py               # Model ensemble
│   ├── tta_evaluation.py                   # Test-time augmentation
│   └── data_analysis.py                    # Dataset analysis
│
├── Model Outputs
│   ├── outputs_vit/                        # ViT checkpoints + results
│   ├── outputs_v2/                         # v2 baseline outputs
│   ├── outputs_v2_extended/                # Extended training outputs
│   ├── outputs_optimized/                  # GPU optimization outputs
│   ├── outputs_ensemble/                   # Ensemble results
│   └── outputs_analysis/                   # Data analysis outputs
│
└── Data
    ├── data/combined_dataset.csv           # Unified metadata
    └── final_unified_metadata.csv          # Full metadata file

11. Deployment Architecture

11.1 Production Deployment Specification

Model:
  Architecture: ViT-Base-Patch16-224
  Checkpoint: outputs_vit/best_model.pth
  Size: 331 MB
  Parameters: ~86M

Input:
  Image Size: 224×224 pixels
  Format: RGB fundus image
  Preprocessing: Ben Graham + ImageNet normalization

Output:
  Class: [Normal, DR, Glaucoma, Cataract, AMD]
  Confidence: Float [0.0, 1.0]
  All Probabilities: Array of 5 floats
  Flag for Review: If confidence < threshold

Hardware Requirements:
  GPU: NVIDIA (CUDA required), 2+ GB VRAM
  Inference Speed: ~66 images/sec

11.2 Monitoring Requirements

  • Track prediction class distribution for data drift
  • Monitor confidence score calibration over time
  • Log flagged (low-confidence) cases for expert review
  • Alert on out-of-distribution inputs
  • Track inference latency and throughput

12. Limitations and Constraints

  1. Population Bias: Trained primarily on Asian populations (ODIR dataset)
  2. Equipment Sensitivity: May not generalize across different fundus cameras
  3. Image Quality Dependence: Requires high-quality fundus images
  4. Single-Label: Does not handle co-morbidities (multi-label not supported)
  5. Domain Shift: The APTOS/ODIR quality gap (roughly 10× sharpness difference) is partially addressed by ViT but remains a concern
  6. Not FDA/CE Approved: Research/educational use only

Document Version: 1.0 | Last Updated: March 10, 2026