# RetinaSense-ViT: System Architecture Document

**Version:** 1.0  
**Date:** March 10, 2026  
**Author:** Tanishq  
**Status:** Production Ready  

---

## 1. Introduction

### 1.1 Purpose
This document describes the system architecture of **RetinaSense-ViT**, a deep learning system for automated multi-class retinal disease classification from fundus images. The system detects five retinal conditions — Normal, Diabetic Retinopathy (DR), Glaucoma, Cataract, and Age-related Macular Degeneration (AMD) — achieving **84.48% accuracy** and **0.840 macro F1 score**.

### 1.2 Scope
This architecture covers:
- Data ingestion and preprocessing pipeline
- Model architecture (Vision Transformer and EfficientNet variants)
- Training infrastructure and GPU optimization
- Inference pipeline with threshold optimization
- Evaluation and monitoring subsystems

### 1.3 Intended Audience
ML engineers, software architects, clinical researchers, and deployment teams.

---

## 2. High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                      RetinaSense-ViT System                        │
│                                                                     │
│  ┌──────────┐   ┌───────────────┐   ┌──────────┐   ┌────────────┐ │
│  │ Data     │──▶│ Preprocessing │──▶│ Model    │──▶│ Inference  │ │
│  │ Ingestion│   │ Pipeline      │   │ Training │   │ Pipeline   │ │
│  └──────────┘   └───────────────┘   └──────────┘   └────────────┘ │
│       │               │                  │               │         │
│       ▼               ▼                  ▼               ▼         │
│  ┌──────────┐   ┌───────────────┐   ┌──────────┐   ┌────────────┐ │
│  │ ODIR-5K  │   │ Ben Graham    │   │ViT-Base  │   │ Threshold  │ │
│  │ APTOS-19 │   │ Enhancement   │   │Patch16   │   │ Optimizer  │ │
│  │ Combined │   │ + Caching     │   │-224      │   │            │ │
│  └──────────┘   └───────────────┘   └──────────┘   └────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
```

---

## 3. Data Architecture

### 3.1 Data Sources

| Source | Images | Resolution | Classes | Notes |
|--------|--------|-----------|---------|-------|
| **ODIR-5K** | 4,966 | 512×512 | All 5 | Preprocessed fundus images |
| **APTOS-2019** | 3,662 | ~1949×1500 | DR only | Raw fundus, 5-level severity |
| **Combined** | **8,540** | 224×224 (resized) | 5 classes | Single-disease filtered |

### 3.2 Class Distribution

| Class | Samples | Percentage | Imbalance Ratio |
|-------|---------|-----------|-----------------|
| Normal | 2,071 | 24.3% | 1.0x |
| Diabetes/DR | 5,581 | 65.4% | **21.1x** |
| Glaucoma | 308 | 3.6% | 0.1x |
| Cataract | 315 | 3.7% | 0.1x |
| AMD | 265 | 3.1% | 0.1x |

### 3.3 Data Split Strategy
- **Training:** 6,832 samples (80%, stratified)
- **Validation:** 1,708 samples (20%, stratified)
- Stratified split preserves class distribution in both sets

### 3.4 Domain Shift: Critical Architectural Consideration
- **APTOS images** have 10.7× lower sharpness than ODIR (25.5 vs 272.6)
- All APTOS images map exclusively to the DR class
- This creates two distinct visual subpopulations within DR
- The ViT architecture handles this domain gap better than CNNs due to its global attention mechanism

---

## 4. Preprocessing Architecture

### 4.1 Ben Graham Enhancement Pipeline

```
Input Image ──▶ Resize (224×224) ──▶ Gaussian Blur (σ=10)
                                            │
                                            ▼
                                     Weighted Subtraction
                                     4*img - 4*blur + 128
                                            │
                                            ▼
                                     Circular Mask
                                     (r = 0.48 × size)
                                            │
                                            ▼
                                     ImageNet Normalization
                                     μ=[0.485,0.456,0.406]
                                     σ=[0.229,0.224,0.225]
                                            │
                                            ▼
                                     Output Tensor (3×224×224)
```

### 4.2 Pre-Caching Architecture
To eliminate the CPU bottleneck (Ben Graham preprocessing: 100–200ms/image), a caching layer stores preprocessed images as `.npy` files:

```
One-Time Caching Phase:
  Raw Image → Ben Graham Preprocessing → np.save('cache/{id}.npy')
  Cost: ~60 seconds for 8,540 images

Training Phase:
  np.load('cache/{id}.npy') → GPU tensor        (~1ms vs 100–200ms)
```

**Impact:** GPU utilization improved from 5–10% → 60–85%; training speedup ~4×.

### 4.3 Data Augmentation (Training Only)

| Augmentation | Parameters | Purpose |
|-------------|-----------|---------|
| RandomHorizontalFlip | p=0.5 | Geometric invariance |
| RandomVerticalFlip | p=0.3 | Geometric invariance |
| RandomRotation | 20° | Orientation invariance |
| RandomAffine | translate=0.05, scale=0.95–1.05 | Position invariance |
| ColorJitter | brightness=0.3, contrast=0.3 | Lighting robustness |
| RandomErasing | p=0.2 | Occlusion robustness |

---

## 5. Model Architecture

### 5.1 Production Model: Vision Transformer (ViT-Base-Patch16-224)

```
Input Image (3×224×224)
        │
        ▼
┌──────────────────────────────────┐
│   Patch Embedding Layer          │
│   14×14 = 196 patches (16×16)    │
│   + 1 [CLS] token                │
│   + Position Embeddings          │
│   → 197 × 768                    │
└──────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────┐
│   12× Transformer Encoder Blocks │
│   ┌────────────────────────────┐ │
│   │ Multi-Head Self-Attention  │ │
│   │ (12 heads, 768 dim)       │ │
│   ├────────────────────────────┤ │
│   │ Layer Norm + Residual      │ │
│   ├────────────────────────────┤ │
│   │ MLP (768 → 3072 → 768)   │ │
│   ├────────────────────────────┤ │
│   │ Layer Norm + Residual      │ │
│   └────────────────────────────┘ │
└──────────────────────────────────┘
        │
        ▼  [CLS] token output (768-dim)
        │
   ┌────┴────┐
   ▼         ▼
┌────────┐ ┌──────────┐
│Disease │ │ Severity │
│Head    │ │ Head     │
│        │ │          │
│768→512 │ │768→256   │
│BN+ReLU │ │BN+ReLU  │
│Drop 0.3│ │Drop 0.3 │
│512→256 │ │256→5    │
│BN+ReLU │ │(severity)│
│Drop 0.2│ └──────────┘
│256→5   │
│(class) │
└────────┘
```

**Key Specifications:**

| Property | Value |
|----------|-------|
| Architecture | ViT-Base-Patch16-224 (timm) |
| Parameters | ~86M |
| Pre-trained | ImageNet-21k |
| Feature Dimension | 768 |
| Patch Size | 16×16 |
| Sequence Length | 197 (196 patches + 1 CLS) |
| Model File Size | 331 MB |

### 5.2 Backup Model: EfficientNet-B3

| Property | Value |
|----------|-------|
| Architecture | EfficientNet-B3 (timm) |
| Parameters | ~12M |
| Feature Dimension | 1,536 |
| Image Size | 300×300 |
| Model File Size | 47 MB |

### 5.3 Multi-Task Learning Design
Both models share a common backbone with two specialized heads:
1. **Disease Classification Head** → 5-class output (softmax)
2. **Severity Grading Head** → 5-level DR severity (for APTOS-sourced samples)

Loss = `Focal Loss (disease)` + `0.2 × CrossEntropy (severity)`

---

## 6. Training Architecture

### 6.1 Training Configuration

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Epochs | 30 | Best checkpoint at epoch 30 |
| Batch Size | 32 (effective 64) | Gradient accumulation ×2 |
| Optimizer | AdamW | Weight decay regularization |
| Learning Rate | 3×10⁻⁴ | Stable for ViT fine-tuning |
| LR Scheduler | Cosine Annealing (T_max=30, η_min=1×10⁻⁷) | Smooth decay |
| Mixed Precision | AMP (GradScaler) | 2× speed, reduced VRAM |
| Early Stopping | Patience=10 on macro F1 | Prevent overfitting |

### 6.2 Loss Function: Focal Loss

```
FL(p_t) = −α_t × (1 − p_t)^γ × log(p_t)

Parameters:
  γ = 1.0   (focusing parameter)
  α = class_weights  (inverse class frequency)
```

Focal Loss down-weights easy (well-classified) examples, forcing the model to focus on hard minority samples — critical for the 21:1 class imbalance.

### 6.3 GPU Optimization Architecture

```
Original Pipeline:                  Optimized Pipeline:
┌────────┐ ┌──────────┐ ┌──────┐   ┌─────────┐ ┌──────┐
│Disk I/O│→│Ben Graham│→│ GPU  │   │Cache I/O│→│ GPU  │
│ 10ms   │ │ 100-200ms│ │ 20ms │   │  1ms    │ │ 25ms │
└────────┘ └──────────┘ └──────┘   └─────────┘ └──────┘

GPU Util: 6%                        GPU Util: 96%
Speed: ~1 it/s                      Speed: ~4-5 it/s
```

**Optimizations applied:**
- Pre-cached preprocessing (100× faster data loading)
- Batch size: 32 → 128 (4× larger)
- DataLoader workers: 2 → 8 (4× parallel loading)
- Persistent workers, prefetch_factor=2
- Non-blocking GPU transfers
- `optimizer.zero_grad(set_to_none=True)`

---

## 7. Inference Architecture

### 7.1 Single-Image Inference Pipeline

```
Input Image ──▶ Ben Graham Preprocess ──▶ ImageNet Normalize
                                                │
                                                ▼
                                         ViT Forward Pass
                                         (disease_logits, severity_logits)
                                                │
                                                ▼
                                            Softmax
                                                │
                                                ▼
                                     ┌──────────────────────┐
                                     │ Threshold-Based      │
                                     │ Decision Logic        │
                                     │                      │
                                     │ Per-Class Thresholds: │
                                     │  Normal:    0.540    │
                                     │  DR:        0.240    │
                                     │  Glaucoma:  0.810    │
                                     │  Cataract:  0.930    │
                                     │  AMD:       0.850    │
                                     └──────────────────────┘
                                                │
                                                ▼
                                     Prediction + Confidence Score
```

### 7.2 Threshold Optimization Method
Per-class thresholds are optimized via grid search (0.05 to 0.95, step 0.05) on the validation set, converting each class to a one-vs-rest binary problem and maximizing F1 score per class.

**Two threshold strategies available:**

| Strategy | Accuracy | Macro F1 | Use Case |
|----------|----------|----------|----------|
| Accuracy-focused (Default) | **84.48%** | 0.840 | General screening |
| F1-focused | 80.44% | **0.858** | Rare disease detection |

### 7.3 Inference Performance

| Config | Latency | Throughput | GPU Memory |
|--------|---------|-----------|------------|
| ViT Solo | ~15ms | ~66 img/s | ~2 GB |
| ViT + TTA (8×) | ~120ms | ~8 img/s | ~2 GB |
| Ensemble (3 models) | ~45ms | ~22 img/s | ~4 GB |

### 7.4 Optional: Hybrid Inference Architecture

```
Image ──▶ ViT First-Pass (fast, 15ms)
              │
              ├─ Confidence ≥ 0.75 AND majority class ──▶ Return prediction
              │
              └─ Confidence < 0.75 OR rare class ──▶ Ensemble Second-Pass ──▶ Return
```

---

## 8. Ensemble Architecture (Optional)

| Model | Weight | Architecture | Size |
|-------|--------|-------------|------|
| ViT-Base-Patch16-224 | 0.85 | Vision Transformer | 331 MB |
| EfficientNet-B3 Extended | 0.10 | CNN (50 epochs) | 47 MB |
| EfficientNet-B3 v2 | 0.05 | CNN (20 epochs) | 47 MB |

**Ensemble Strategy:** Weighted probability averaging  
`final_prob = 0.85×ViT + 0.10×EffNetExt + 0.05×EffNetv2`

---

## 9. Technology Stack

| Layer | Technology | Version |
|-------|-----------|---------|
| Framework | PyTorch | 2.0+ |
| Model Library | timm | 0.9+ |
| Vision Utils | torchvision | 0.18+ |
| Image Processing | OpenCV | 4.8+ |
| Data Handling | pandas | 2.0+ |
| ML Metrics | scikit-learn | 1.3+ |
| Visualization | matplotlib, seaborn | Latest |
| GPU | NVIDIA H200 | 150 GB VRAM |
| Training | CUDA + AMP (Mixed Precision) | — |

---

## 10. File and Directory Structure

```
retina-sense/
├── Notebooks
│   ├── RetinaSense_Production.ipynb        # Production inference ⭐
│   ├── RetinaSense_ViT_Training.ipynb      # ViT training process
│   └── RetinaSense_Optimized.ipynb         # GPU optimization experiments
│
├── Training Scripts
│   ├── retinasense_vit.py                  # ViT training (84.48%)
│   ├── retinasense_v2_extended.py          # Extended CNN (50 epochs)
│   ├── retinasense_v2.py                   # Baseline CNN (20 epochs)
│   └── retinasense_fixed.py                # Bug-fixed original
│
├── Optimization Scripts
│   ├── threshold_optimization_vit.py       # ViT threshold tuning
│   ├── threshold_optimization_simple.py    # v2 threshold tuning
│   ├── ensemble_inference.py               # Model ensemble
│   ├── tta_evaluation.py                   # Test-time augmentation
│   └── data_analysis.py                    # Dataset analysis
│
├── Model Outputs
│   ├── outputs_vit/                        # ViT checkpoints + results
│   ├── outputs_v2/                         # v2 baseline outputs
│   ├── outputs_v2_extended/                # Extended training outputs
│   ├── outputs_optimized/                  # GPU optimization outputs
│   ├── outputs_ensemble/                   # Ensemble results
│   └── outputs_analysis/                   # Data analysis outputs
│
└── Data
    ├── data/combined_dataset.csv           # Unified metadata
    └── final_unified_metadata.csv          # Full metadata file
```

---

## 11. Deployment Architecture

### 11.1 Production Deployment Specification

```yaml
Model:
  Architecture: ViT-Base-Patch16-224
  Checkpoint: outputs_vit/best_model.pth
  Size: 331 MB
  Parameters: ~86M

Input:
  Image Size: 224×224 pixels
  Format: RGB fundus image
  Preprocessing: Ben Graham + ImageNet normalization

Output:
  Class: [Normal, DR, Glaucoma, Cataract, AMD]
  Confidence: Float [0.0, 1.0]
  All Probabilities: Array of 5 floats
  Flag for Review: If confidence < threshold

Hardware Requirements:
  GPU: NVIDIA (CUDA required), 2+ GB VRAM
  Inference Speed: ~66 images/sec
```

### 11.2 Monitoring Requirements
- Track prediction class distribution for data drift
- Monitor confidence score calibration over time
- Log flagged (low-confidence) cases for expert review
- Alert on out-of-distribution inputs
- Track inference latency and throughput

---

## 12. Limitations and Constraints

1. **Population Bias:** Trained primarily on Asian populations (ODIR dataset)
2. **Equipment Sensitivity:** May not generalize across different fundus cameras
3. **Image Quality Dependence:** Requires high-quality fundus images
4. **Single-Label:** Does not handle co-morbidities (multi-label not supported)
5. **Domain Shift:** APTOS/ODIR quality gap (10× sharpness difference) is partially addressed by ViT but remains a concern
6. **Not FDA/CE Approved:** Research/educational use only

---

*Document Version: 1.0 | Last Updated: March 10, 2026*