RetinaSense-ViT: System Architecture Document
Version: 1.0
Date: March 10, 2026
Author: Tanishq
Status: Production Ready
1. Introduction
1.1 Purpose
This document describes the system architecture of RetinaSense-ViT, a deep learning system for automated multi-class retinal disease classification from fundus images. The system assigns each image to one of five classes: Normal, Diabetic Retinopathy (DR), Glaucoma, Cataract, or Age-related Macular Degeneration (AMD). It achieves 84.48% accuracy and a macro F1 score of 0.840.
1.2 Scope
This architecture covers:
- Data ingestion and preprocessing pipeline
- Model architecture (Vision Transformer and EfficientNet variants)
- Training infrastructure and GPU optimization
- Inference pipeline with threshold optimization
- Evaluation and monitoring subsystems
1.3 Intended Audience
ML engineers, software architects, clinical researchers, and deployment teams.
2. High-Level Architecture
┌────────────────────────────────────────────────────────────────────┐
│                       RetinaSense-ViT System                       │
│                                                                    │
│  ┌──────────┐   ┌───────────────┐   ┌──────────┐   ┌────────────┐  │
│  │   Data   │──▶│ Preprocessing │──▶│  Model   │──▶│ Inference  │  │
│  │ Ingestion│   │   Pipeline    │   │ Training │   │  Pipeline  │  │
│  └──────────┘   └───────────────┘   └──────────┘   └────────────┘  │
│       │                 │                │               │         │
│       ▼                 ▼                ▼               ▼         │
│  ┌──────────┐   ┌───────────────┐   ┌──────────┐   ┌────────────┐  │
│  │ ODIR-5K  │   │ Ben Graham    │   │ ViT-Base │   │ Threshold  │  │
│  │ APTOS-19 │   │ Enhancement   │   │ Patch16  │   │ Optimizer  │  │
│  │ Combined │   │ + Caching     │   │ -224     │   │            │  │
│  └──────────┘   └───────────────┘   └──────────┘   └────────────┘  │
└────────────────────────────────────────────────────────────────────┘
3. Data Architecture
3.1 Data Sources
| Source | Images | Resolution | Classes | Notes |
|---|---|---|---|---|
| ODIR-5K | 4,966 | 512×512 | All 5 | Preprocessed fundus images |
| APTOS-2019 | 3,662 | ~1949×1500 | DR only | Raw fundus, 5-level severity |
| Combined | 8,540 | 224×224 (resized) | 5 classes | Single-disease filtered |
3.2 Class Distribution
| Class | Samples | Percentage | Imbalance Ratio (vs. AMD) |
|---|---|---|---|
| Normal | 2,071 | 24.3% | 7.8x |
| Diabetes/DR | 5,581 | 65.4% | 21.1x |
| Glaucoma | 308 | 3.6% | 1.2x |
| Cataract | 315 | 3.7% | 1.2x |
| AMD | 265 | 3.1% | 1.0x |
3.3 Data Split Strategy
- Training: 6,832 samples (80%, stratified)
- Validation: 1,708 samples (20%, stratified)
- Stratified split preserves class distribution in both sets
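The split above can be illustrated with a dependency-free sketch. This is not the project's actual code (which would more commonly use a library helper such as scikit-learn's `train_test_split` with `stratify=`); the function name `stratified_split`, the seed, and the per-class rounding rule are assumptions made for illustration. The class counts are taken from the class-distribution table.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Assumption: each class contributes round(20%) of its samples to validation.
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Class counts from the class-distribution table.
CLASS_COUNTS = {"Normal": 2071, "DR": 5581, "Glaucoma": 308, "Cataract": 315, "AMD": 265}
labels = [name for name, n in CLASS_COUNTS.items() for _ in range(n)]
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 6832 1708
```

Splitting each class separately is what keeps the rare classes (AMD, Glaucoma, Cataract) represented in the validation set at the same rate as in training.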
3.4 Domain Shift: Critical Architectural Consideration
- APTOS images have 10.7× lower sharpness than ODIR (25.5 vs 272.6)
- All APTOS images map exclusively to the DR class
- This creates two distinct visual subpopulations within DR
- The ViT architecture handles this domain gap better than CNNs due to its global attention mechanism
4. Preprocessing Architecture
4.1 Ben Graham Enhancement Pipeline
Input Image ──▶ Resize (224×224) ──▶ Gaussian Blur (σ=10)
                                              │
                                              ▼
                                     Weighted Subtraction
                                     4*img - 4*blur + 128
                                              │
                                              ▼
                                        Circular Mask
                                      (r = 0.48 × size)
                                              │
                                              ▼
                                    ImageNet Normalization
                                    μ=[0.485,0.456,0.406]
                                    σ=[0.229,0.224,0.225]
                                              │
                                              ▼
                                  Output Tensor (3×224×224)
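The per-pixel arithmetic of the pipeline above can be sketched without any imaging library. This is an illustrative sketch, not the project's implementation: the blurred value is assumed to come from a Gaussian blur computed elsewhere (for example with OpenCV), and the clipping to [0, 255], the choice of image centre, the black fill outside the mask, and the division by 255 before normalization are all assumptions. The function names are hypothetical.

```python
import math

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def enhance_value(pixel, blurred):
    """Weighted subtraction 4*img - 4*blur + 128, clipped to the 8-bit range (assumed)."""
    return max(0, min(255, 4 * pixel - 4 * blurred + 128))

def inside_circular_mask(row, col, size=224, radius_fraction=0.48):
    """True if the pixel lies within a centred circle of radius 0.48 * size."""
    centre = (size - 1) / 2
    return math.hypot(row - centre, col - centre) <= radius_fraction * size

def normalize_value(value, channel):
    """Scale an 8-bit value to [0, 1] (assumed), then apply ImageNet mean/std."""
    return (value / 255.0 - IMAGENET_MEAN[channel]) / IMAGENET_STD[channel]

def process_pixel(rgb, blurred_rgb, row, col, size=224):
    """Apply enhancement, circular mask, and normalization to one RGB pixel."""
    out = []
    for channel in range(3):
        value = enhance_value(rgb[channel], blurred_rgb[channel])
        if not inside_circular_mask(row, col, size):
            value = 0  # assumption: pixels outside the mask are set to black
        out.append(normalize_value(value, channel))
    return out

# A flat region (pixel equals its local blur) maps to mid-grey 128 before normalization.
print(enhance_value(100, 100))
```

Because the blur is subtracted, slowly varying illumination cancels out and only local contrast (vessels, lesions) remains around the 128 mid-grey level.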
4.2 Pre-Caching Architecture
To eliminate the CPU bottleneck (Ben Graham preprocessing takes 100–200 ms per image), a caching layer stores preprocessed images as .npy files:
One-Time Caching Phase:
Raw Image → Ben Graham Preprocessing → np.save('cache/{id}.npy')
Cost: ~60 seconds for 8,540 images
Training Phase:
np.load('cache/{id}.npy') → GPU tensor (~1 ms vs 100–200 ms)
Impact: GPU utilization improved from 5–10% to 60–85%; training speedup ~4×.
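A minimal sketch of the cache-or-compute pattern described above, assuming NumPy is available. The helper name `load_or_preprocess` and the stand-in `fake_preprocess` function are hypothetical; the real pipeline would pass its Ben Graham preprocessing function instead.

```python
import os
import tempfile

import numpy as np

def load_or_preprocess(image_id, cache_dir, preprocess_fn):
    """Return the preprocessed array for image_id, computing and caching it on a miss."""
    path = os.path.join(cache_dir, f"{image_id}.npy")
    if os.path.exists(path):
        return np.load(path)           # fast path used during training
    array = preprocess_fn(image_id)    # slow path: full preprocessing
    np.save(path, array)
    return array

# Stand-in preprocessing function (hypothetical, for illustration only).
calls = []
def fake_preprocess(image_id):
    calls.append(image_id)
    return np.zeros((3, 224, 224), dtype=np.float32)

with tempfile.TemporaryDirectory() as cache_dir:
    first = load_or_preprocess("img_001", cache_dir, fake_preprocess)   # computes and saves
    second = load_or_preprocess("img_001", cache_dir, fake_preprocess)  # loads from disk

print(len(calls))  # the expensive function ran only once
```

One design consequence worth noting: because the cache holds the output of deterministic preprocessing only, random augmentations still have to be applied after loading.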
4.3 Data Augmentation (Training Only)
| Augmentation | Parameters | Purpose |
|---|---|---|
| RandomHorizontalFlip | p=0.5 | Geometric invariance |
| RandomVerticalFlip | p=0.3 | Geometric invariance |
| RandomRotation | 20° | Orientation invariance |
| RandomAffine | translate=0.05, scale=0.95–1.05 | Position invariance |
| ColorJitter | brightness=0.3, contrast=0.3 | Lighting robustness |
| RandomErasing | p=0.2 | Occlusion robustness |
5. Model Architecture
5.1 Production Model: Vision Transformer (ViT-Base-Patch16-224)
Input Image (3×224×224)
        │
        ▼
┌──────────────────────────────────┐
│  Patch Embedding Layer           │
│  14×14 = 196 patches (16×16)     │
│  + 1 [CLS] token                 │
│  + Position Embeddings           │
│  → 197 × 768                     │
└──────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────┐
│  12× Transformer Encoder Blocks  │
│  ┌────────────────────────────┐  │
│  │ Multi-Head Self-Attention  │  │
│  │ (12 heads, 768 dim)        │  │
│  ├────────────────────────────┤  │
│  │ Layer Norm + Residual      │  │
│  ├────────────────────────────┤  │
│  │ MLP (768 → 3072 → 768)     │  │
│  ├────────────────────────────┤  │
│  │ Layer Norm + Residual      │  │
│  └────────────────────────────┘  │
└──────────────────────────────────┘
        │
        ▼  [CLS] token output (768-dim)
        │
   ┌────┴────┐
   ▼         ▼
┌────────┐ ┌──────────┐
│Disease │ │ Severity │
│Head    │ │ Head     │
│        │ │          │
│768→512 │ │768→256   │
│BN+ReLU │ │BN+ReLU   │
│Drop 0.3│ │Drop 0.3  │
│512→256 │ │256→5     │
│BN+ReLU │ │(severity)│
│Drop 0.2│ └──────────┘
│256→5   │
│(class) │
└────────┘
Key Specifications:
| Property | Value |
|---|---|
| Architecture | ViT-Base-Patch16-224 (timm) |
| Parameters | ~86M |
| Pre-trained | ImageNet-21k |
| Feature Dimension | 768 |
| Patch Size | 16×16 |
| Sequence Length | 197 (196 patches + 1 CLS) |
| Model File Size | 331 MB |
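As a quick check of the sequence length in the table, the token count follows directly from the image and patch sizes. The helper name below is hypothetical.

```python
def vit_sequence_length(image_size=224, patch_size=16, extra_tokens=1):
    """Number of tokens: one per non-overlapping patch plus the [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + extra_tokens

print(vit_sequence_length())  # 14 * 14 + 1 = 197
```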
5.2 Backup Model: EfficientNet-B3
| Property | Value |
|---|---|
| Architecture | EfficientNet-B3 (timm) |
| Parameters | ~12M |
| Feature Dimension | 1,536 |
| Image Size | 300×300 |
| Model File Size | 47 MB |
5.3 Multi-Task Learning Design
Each model uses a single shared backbone that feeds two specialized heads:
- Disease Classification Head: 5-class output (softmax)
- Severity Grading Head: 5-level DR severity (for APTOS-sourced samples)
Loss = Focal Loss (disease) + 0.2 × CrossEntropy (severity)
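One way the combined loss could be computed per batch is sketched below. The source states only the 0.2 weighting and that severity applies to APTOS-sourced samples; averaging the severity term over just the samples that carry a severity label is an assumption, and the function name and argument layout are hypothetical.

```python
def multitask_loss(disease_losses, severity_losses, has_severity, severity_weight=0.2):
    """Batch loss = mean disease loss + 0.2 * mean severity loss over labelled samples."""
    disease_term = sum(disease_losses) / len(disease_losses)
    # Assumption: samples without a severity grade are excluded from the severity term.
    labelled = [loss for loss, flag in zip(severity_losses, has_severity) if flag]
    severity_term = sum(labelled) / len(labelled) if labelled else 0.0
    return disease_term + severity_weight * severity_term

# Two samples; only the first has a severity grade.
print(multitask_loss([1.0, 3.0], [2.0, 10.0], [True, False]))
```

The small weight keeps severity grading as an auxiliary signal so that it does not dominate the disease-classification objective.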
6. Training Architecture
6.1 Training Configuration
| Parameter | Value | Rationale |
|---|---|---|
| Epochs | 30 | Best checkpoint at epoch 30 |
| Batch Size | 32 (effective 64) | Gradient accumulation ×2 |
| Optimizer | AdamW | Weight decay regularization |
| Learning Rate | 3×10⁻⁴ | Stable for ViT fine-tuning |
| LR Scheduler | Cosine Annealing (T_max=30, η_min=1×10⁻⁷) | Smooth decay |
| Mixed Precision | AMP (GradScaler) | 2× speed, reduced VRAM |
| Early Stopping | Patience=10 on macro F1 | Prevent overfitting |
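The learning-rate schedule in the table can be made concrete with the closed-form cosine annealing formula. This sketch evaluates the formula directly rather than calling a framework scheduler; it assumes one scheduler step per epoch and no warm-up, neither of which the table states.

```python
import math

def cosine_annealing_lr(epoch, base_lr=3e-4, eta_min=1e-7, t_max=30):
    """Learning rate after `epoch` steps: eta_min + (base - eta_min)/2 * (1 + cos(pi*t/T))."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 15, 30):
    print(epoch, f"{cosine_annealing_lr(epoch):.3e}")
```

The rate starts at 3×10⁻⁴, passes roughly half that value at epoch 15, and reaches η_min at epoch 30.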
6.2 Loss Function: Focal Loss
FL(p_t) = −α_t × (1 − p_t)^γ × log(p_t)
Parameters:
γ = 1.0 (focusing parameter)
α = class_weights (inverse class frequency)
Focal Loss down-weights easy (well-classified) examples so that training focuses on hard minority-class samples, which is critical given the 21:1 class imbalance.
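The formula can be worked through in plain Python for a single sample. The source specifies only that α is derived from inverse class frequency; the normalization used here (α_c = N / (K × n_c)) is one common choice and is an assumption, as are the function names and the example probability vectors.

```python
import math

CLASS_COUNTS = [2071, 5581, 308, 315, 265]  # Normal, DR, Glaucoma, Cataract, AMD

def inverse_frequency_weights(counts):
    """alpha_c = N / (K * n_c): rarer classes get larger weights (assumed normalization)."""
    total, num_classes = sum(counts), len(counts)
    return [total / (num_classes * n) for n in counts]

def focal_loss(probs, target, alpha, gamma=1.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for one sample."""
    p_t = probs[target]
    return -alpha[target] * (1.0 - p_t) ** gamma * math.log(p_t)

alpha = inverse_frequency_weights(CLASS_COUNTS)
easy = focal_loss([0.05, 0.90, 0.02, 0.02, 0.01], target=1, alpha=alpha)  # confident DR
hard = focal_loss([0.10, 0.60, 0.05, 0.05, 0.20], target=4, alpha=alpha)  # missed AMD
print(f"easy={easy:.4f} hard={hard:.4f}")
```

The confidently classified DR sample contributes almost nothing, while the misclassified AMD sample is amplified both by its class weight and by the (1 − p_t) factor.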
6.3 GPU Optimization Architecture
Original Pipeline:                      Optimized Pipeline:
┌────────┐ ┌──────────┐ ┌──────┐        ┌─────────┐ ┌──────┐
│Disk I/O│→│Ben Graham│→│ GPU  │        │Cache I/O│→│ GPU  │
│  10ms  │ │ 100-200ms│ │ 20ms │        │   1ms   │ │ 25ms │
└────────┘ └──────────┘ └──────┘        └─────────┘ └──────┘
GPU Util: 6%                            GPU Util: 96%
Speed: ~1 it/s                          Speed: ~4-5 it/s
Optimizations applied:
- Pre-cached preprocessing (100× faster data loading)
- Batch size: 32 → 128 (4× larger)
- DataLoader workers: 2 → 8 (4× parallel loading)
- Persistent workers, prefetch_factor=2
- Non-blocking GPU transfers
- optimizer.zero_grad(set_to_none=True)
7. Inference Architecture
7.1 Single-Image Inference Pipeline
Input Image ──▶ Ben Graham Preprocess ──▶ ImageNet Normalize
                                                  │
                                                  ▼
                                          ViT Forward Pass
                                  (disease_logits, severity_logits)
                                                  │
                                                  ▼
                                               Softmax
                                                  │
                                                  ▼
                                      ┌────────────────────────┐
                                      │ Threshold-Based        │
                                      │ Decision Logic         │
                                      │                        │
                                      │ Per-Class Thresholds:  │
                                      │   Normal:   0.540      │
                                      │   DR:       0.240      │
                                      │   Glaucoma: 0.810      │
                                      │   Cataract: 0.930      │
                                      │   AMD:      0.850      │
                                      └────────────────────────┘
                                                  │
                                                  ▼
                                    Prediction + Confidence Score
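The diagram does not spell out how the per-class thresholds turn softmax probabilities into a single prediction, so the sketch below shows one plausible rule, stated here as an assumption: choose the most probable class among those that clear their own threshold, and if none does, fall back to the plain argmax and flag the case for review. The function name and the output keys are hypothetical.

```python
CLASSES = ["Normal", "DR", "Glaucoma", "Cataract", "AMD"]
THRESHOLDS = {"Normal": 0.540, "DR": 0.240, "Glaucoma": 0.810, "Cataract": 0.930, "AMD": 0.850}

def decide(probs):
    """Pick the most probable class that clears its threshold; otherwise flag for review."""
    scored = dict(zip(CLASSES, probs))
    passing = {name: p for name, p in scored.items() if p >= THRESHOLDS[name]}
    if passing:
        label = max(passing, key=passing.get)
        flagged = False
    else:
        label = max(scored, key=scored.get)  # assumed fallback: plain argmax
        flagged = True
    return {"class": label, "confidence": scored[label], "flag_for_review": flagged}

# Normal has the highest probability but misses its 0.540 threshold; DR clears 0.240.
print(decide([0.50, 0.30, 0.10, 0.05, 0.05]))
```

Under this rule the low DR threshold makes the system err toward reporting DR when the model is undecided between Normal and DR.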
7.2 Threshold Optimization Method
Per-class thresholds are optimized via grid search (0.05 to 0.95, step 0.05) on the validation set, converting each class to a one-vs-rest binary problem and maximizing F1 score per class.
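The grid search described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the project's script: the function names are hypothetical, a sample is predicted positive when its class probability is greater than or equal to the threshold, and ties between thresholds are resolved in favour of the lowest one.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for boolean ground truth and predictions; 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(class_probs, is_class, start=0.05, stop=0.95, step=0.05):
    """Grid-search the threshold that maximizes one-vs-rest F1 for a single class."""
    n_steps = round((stop - start) / step)
    best_t, best_f1 = start, -1.0
    for i in range(n_steps + 1):
        t = round(start + i * step, 2)
        preds = [p >= t for p in class_probs]
        score = f1_score_binary(is_class, preds)
        if score > best_f1:  # strict comparison keeps the lowest tied threshold
            best_t, best_f1 = t, score
    return best_t, best_f1

def optimize_thresholds(probs, labels, num_classes=5):
    """probs: per-sample probability lists; labels: integer class ids."""
    return [
        best_threshold([row[c] for row in probs], [y == c for y in labels])[0]
        for c in range(num_classes)
    ]

# Toy validation data for one class: two positives, three negatives.
print(best_threshold([0.9, 0.8, 0.3, 0.2, 0.1], [True, True, False, False, False]))
```

Each class is tuned independently, so the resulting thresholds need not sum to anything in particular and can differ widely between common and rare classes.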
Two threshold strategies are available:
| Strategy | Accuracy | Macro F1 | Use Case |
|---|---|---|---|
| Accuracy-focused (Default) | 84.48% | 0.840 | General screening |
| F1-focused | 80.44% | 0.858 | Rare disease detection |
7.3 Inference Performance
| Config | Latency | Throughput | GPU Memory |
|---|---|---|---|
| ViT Solo | ~15ms | ~66 img/s | ~2 GB |
| ViT + TTA (8×) | ~120ms | ~8 img/s | ~2 GB |
| Ensemble (3 models) | ~45ms | ~22 img/s | ~4 GB |
7.4 Optional: Hybrid Inference Architecture
Image ──▶ ViT First-Pass (fast, 15ms)
              │
              ├─ Confidence ≥ 0.75 AND majority class ──▶ Return prediction
              │
              └─ Confidence < 0.75 OR rare class ──▶ Ensemble Second-Pass ──▶ Return
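The routing rule in the diagram can be expressed as a small function. Which classes count as "majority" is not stated in the source; treating Normal and DR (the two largest classes) as majority is an assumption, and `vit_predict` / `ensemble_predict` are hypothetical callables that return a `(class, confidence)` pair.

```python
MAJORITY_CLASSES = {"Normal", "DR"}  # assumption: the two largest classes
CONFIDENCE_CUTOFF = 0.75

def needs_ensemble(predicted_class, confidence):
    """True when the ViT first-pass result should be re-scored by the ensemble."""
    return confidence < CONFIDENCE_CUTOFF or predicted_class not in MAJORITY_CLASSES

def hybrid_predict(image, vit_predict, ensemble_predict):
    """Run the fast ViT pass first and escalate to the ensemble only when needed."""
    predicted_class, confidence = vit_predict(image)
    if needs_ensemble(predicted_class, confidence):
        return ensemble_predict(image)
    return predicted_class, confidence

# Stand-in predictors for illustration only.
result = hybrid_predict(
    image=None,
    vit_predict=lambda img: ("AMD", 0.99),
    ensemble_predict=lambda img: ("AMD", 0.97),
)
print(result)  # rare class, so the ensemble result is returned
```

With this routing, only low-confidence or rare-class cases pay the higher ensemble latency.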
8. Ensemble Architecture (Optional)
| Model | Weight | Architecture | Size |
|---|---|---|---|
| ViT-Base-Patch16-224 | 0.85 | Vision Transformer | 331 MB |
| EfficientNet-B3 Extended | 0.10 | CNN (50 epochs) | 47 MB |
| EfficientNet-B3 v2 | 0.05 | CNN (20 epochs) | 47 MB |
Ensemble Strategy: Weighted probability averaging
final_prob = 0.85×ViT + 0.10×EffNetExt + 0.05×EffNetv2
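A minimal sketch of the weighted averaging, using the weights from the table. The dictionary keys and the example probability vectors are made up for illustration.

```python
ENSEMBLE_WEIGHTS = {"vit": 0.85, "effnet_ext": 0.10, "effnet_v2": 0.05}

def ensemble_probs(model_probs, weights=ENSEMBLE_WEIGHTS):
    """Weighted average of per-model probability vectors (model name -> list of floats)."""
    num_classes = len(next(iter(model_probs.values())))
    return [
        sum(weights[name] * probs[c] for name, probs in model_probs.items())
        for c in range(num_classes)
    ]

# Hypothetical softmax outputs for one image (Normal, DR, Glaucoma, Cataract, AMD).
example = {
    "vit":        [0.10, 0.70, 0.10, 0.05, 0.05],
    "effnet_ext": [0.20, 0.50, 0.10, 0.10, 0.10],
    "effnet_v2":  [0.30, 0.40, 0.10, 0.10, 0.10],
}
final_prob = ensemble_probs(example)
print([round(p, 3) for p in final_prob])
```

Because the weights sum to 1.0 and each input is a probability vector, the output is itself a valid probability distribution.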
9. Technology Stack
| Layer | Technology | Version |
|---|---|---|
| Framework | PyTorch | 2.0+ |
| Model Library | timm | 0.9+ |
| Vision Utils | torchvision | 0.18+ |
| Image Processing | OpenCV | 4.8+ |
| Data Handling | pandas | 2.0+ |
| ML Metrics | scikit-learn | 1.3+ |
| Visualization | matplotlib, seaborn | Latest |
| GPU | NVIDIA H200 | 141 GB VRAM |
| Training | CUDA + AMP (Mixed Precision) | N/A |
10. File and Directory Structure
retina-sense/
├── Notebooks
│   ├── RetinaSense_Production.ipynb        # Production inference
│   ├── RetinaSense_ViT_Training.ipynb      # ViT training process
│   └── RetinaSense_Optimized.ipynb         # GPU optimization experiments
│
├── Training Scripts
│   ├── retinasense_vit.py                  # ViT training (84.48% accuracy)
│   ├── retinasense_v2_extended.py          # Extended CNN (50 epochs)
│   ├── retinasense_v2.py                   # Baseline CNN (20 epochs)
│   └── retinasense_fixed.py                # Bug-fixed original
│
├── Optimization Scripts
│   ├── threshold_optimization_vit.py       # ViT threshold tuning
│   ├── threshold_optimization_simple.py    # v2 threshold tuning
│   ├── ensemble_inference.py               # Model ensemble
│   ├── tta_evaluation.py                   # Test-time augmentation
│   └── data_analysis.py                    # Dataset analysis
│
├── Model Outputs
│   ├── outputs_vit/                        # ViT checkpoints + results
│   ├── outputs_v2/                         # v2 baseline outputs
│   ├── outputs_v2_extended/                # Extended training outputs
│   ├── outputs_optimized/                  # GPU optimization outputs
│   ├── outputs_ensemble/                   # Ensemble results
│   └── outputs_analysis/                   # Data analysis outputs
│
└── Data
    ├── data/combined_dataset.csv           # Unified metadata
    └── final_unified_metadata.csv          # Full metadata file
11. Deployment Architecture
11.1 Production Deployment Specification
Model:
  Architecture: ViT-Base-Patch16-224
  Checkpoint: outputs_vit/best_model.pth
  Size: 331 MB
  Parameters: ~86M
Input:
  Image Size: 224×224 pixels
  Format: RGB fundus image
  Preprocessing: Ben Graham + ImageNet normalization
Output:
  Class: [Normal, DR, Glaucoma, Cataract, AMD]
  Confidence: Float [0.0, 1.0]
  All Probabilities: Array of 5 floats
  Flag for Review: If confidence < threshold
Hardware Requirements:
  GPU: NVIDIA (CUDA required), 2+ GB VRAM
  Inference Speed: ~66 images/sec
11.2 Monitoring Requirements
- Track prediction class distribution for data drift
- Monitor confidence score calibration over time
- Log flagged (low-confidence) cases for expert review
- Alert on out-of-distribution inputs
- Track inference latency and throughput
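As one illustration of the first monitoring item, drift in the predicted class mix could be summarized with a single number. The source does not name a drift metric; the total variation distance used here, the training-set class mix as the reference, and the function name are all assumptions.

```python
from collections import Counter

# Reference mix: class counts of the combined dataset.
REFERENCE_COUNTS = {"Normal": 2071, "DR": 5581, "Glaucoma": 308, "Cataract": 315, "AMD": 265}

def prediction_drift(predicted_classes, reference_counts=REFERENCE_COUNTS):
    """Total variation distance (0 to 1) between recent predictions and the reference mix."""
    if not predicted_classes:
        raise ValueError("need at least one prediction")
    ref_total = sum(reference_counts.values())
    counts = Counter(predicted_classes)
    n = len(predicted_classes)
    return 0.5 * sum(
        abs(counts[name] / n - ref / ref_total) for name, ref in reference_counts.items()
    )

# A window in which every prediction is AMD is far from the reference mix.
print(round(prediction_drift(["AMD"] * 10), 3))
```

A deployment population may legitimately differ from the training mix, so any alert level on this value would need to be calibrated against a baseline window from the target site.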
12. Limitations and Constraints
- Population Bias: Trained primarily on Asian populations (ODIR dataset)
- Equipment Sensitivity: May not generalize across different fundus cameras
- Image Quality Dependence: Requires high-quality fundus images
- Single-Label: Does not handle co-morbidities (multi-label not supported)
- Domain Shift: APTOS/ODIR quality gap (~10× sharpness difference) is partially addressed by ViT but remains a concern
- Not FDA/CE Approved: Research/educational use only
Document Version: 1.0 | Last Updated: March 10, 2026