# RetinaSense-ViT: Functional Document — Module Description

**Version:** 1.0  
**Date:** March 10, 2026  
**Author:** Tanishq  
**Status:** Production Ready  

---

## 1. System Overview

RetinaSense-ViT is a multi-class retinal disease classification system that analyzes fundus images and assigns each one to one of five classes: **Normal, Diabetic Retinopathy (DR), Glaucoma, Cataract, and Age-related Macular Degeneration (AMD)**. The system is organized into nine functional modules (M1–M9), mapped in Section 2 and described in Section 3.

---

## 2. Module Map

```
┌───────────────────────────────────────────────────────────────┐
│                     RetinaSense-ViT System                    │
│                                                               │
│  M1: Data         M2: Preprocessing   M3: Model              │
│  Ingestion        Pipeline             Architecture           │
│                                                               │
│  M4: Training     M5: Threshold       M6: Inference           │
│  Engine           Optimization         Pipeline               │
│                                                               │
│  M7: Ensemble     M8: Evaluation      M9: Data               │
│  System           & Visualization      Analysis               │
└───────────────────────────────────────────────────────────────┘
```

---

## 3. Module Descriptions

### M1: Data Ingestion Module

**Purpose:** Load, validate, and unify data from ODIR-5K and APTOS-2019 datasets.

| Attribute | Detail |
|-----------|--------|
| **Input** | ODIR images (512×512), APTOS images (~1949×1500), metadata CSVs |
| **Output** | `combined_dataset.csv` with 8,540 entries (path, disease_label, severity_label) |
| **Key Files** | `final_unified_metadata.csv`, `data/combined_dataset.csv` |
| **Functions** | Path cleaning (remove `./` prefixes), single-disease filtering, stratified train/val split (80/20) |

**Key Logic:**
- APTOS images are exclusively DR class with 5-level severity grading
- ODIR images span all 5 disease classes; multi-disease samples are filtered out
- Paths are normalized for cross-platform compatibility
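
The path cleaning and 80/20 stratified split described above can be sketched in plain Python. This is a minimal illustration, not the project's actual loader: the helper names `clean_path` and `stratified_split`, the fixed seed, and the toy rows are assumptions made for the example.

```python
import random
from collections import defaultdict

def clean_path(path: str) -> str:
    """Drop leading './' prefixes and use forward slashes throughout."""
    path = path.replace("\\", "/")
    while path.startswith("./"):
        path = path[2:]
    return path

def stratified_split(rows, label_key="disease_label", val_fraction=0.2, seed=42):
    """Split rows into train/val so every class keeps about the same ratio."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed (assumed) keeps the split reproducible
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val

# Toy metadata: 100 rows spread evenly over the five disease labels.
rows = [
    {"path": clean_path(f"./images/{i}.jpg"), "disease_label": i % 5}
    for i in range(100)
]
train_rows, val_rows = stratified_split(rows)
```

Splitting per class rather than over the whole table matters here because the minority classes are small; a plain random split could leave very few AMD or Glaucoma samples in validation.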

---

### M2: Preprocessing Pipeline Module

**Purpose:** Apply Ben Graham contrast enhancement to each fundus image and cache the result so that images are ready for model input.

| Attribute | Detail |
|-----------|--------|
| **Input** | Raw fundus image (any resolution) |
| **Output** | Normalized tensor (3×224×224) |
| **Key Files** | All training scripts (`ben_graham_preprocess` function) |
| **Dependencies** | OpenCV, NumPy, torchvision transforms |

**Functional Steps:**
1. **Resize** to target resolution (224×224 for ViT, 300×300 for EfficientNet)
2. **Ben Graham Enhancement:** `4×img − 4×GaussianBlur(σ=10) + 128`
3. **Circular Mask** application (radius = 0.48 × image_size)
4. **Caching:** Pre-compute and store as `.npy` files (one-time; ~60s for 8,540 images)
5. **Augmentation** (training only): flip, rotate, affine, color jitter, random erasing
6. **ImageNet Normalization:** μ=[0.485,0.456,0.406], σ=[0.229,0.224,0.225]
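
Steps 2, 3, and 6 above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the project's `ben_graham_preprocess`: the document lists OpenCV as a dependency, whereas this sketch swaps in a small hand-written separable Gaussian blur so that it runs without OpenCV. The function names, the edge padding, and the clipping to [0, 255] are assumptions.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def gaussian_blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur along height and width with edge padding."""
    radius = int(3 * sigma)
    offsets = np.arange(-radius, radius + 1, dtype=np.float32)
    kernel = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    out = img.astype(np.float32)
    for axis in (0, 1):
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out

def preprocess(img: np.ndarray, sigma: float = 10.0) -> np.ndarray:
    """Enhance, mask, and normalize an already-resized square HxWx3 image."""
    img = img.astype(np.float32)
    # Step 2: 4*img - 4*GaussianBlur(img) + 128, clipped to the 8-bit range.
    enhanced = np.clip(4.0 * img - 4.0 * gaussian_blur(img, sigma) + 128.0, 0, 255)
    # Step 3: zero out everything outside a circle of radius 0.48 * image_size.
    size = img.shape[0]
    yy, xx = np.ogrid[:size, :size]
    center = (size - 1) / 2.0
    inside = (xx - center) ** 2 + (yy - center) ** 2 <= (0.48 * size) ** 2
    enhanced = enhanced * inside[..., None]
    # Step 6: scale to [0, 1], apply ImageNet statistics, return channels-first.
    normalized = (enhanced / 255.0 - IMAGENET_MEAN) / IMAGENET_STD
    return normalized.transpose(2, 0, 1).astype(np.float32)
```

On a perfectly uniform image the blur equals the image, so every pixel inside the mask maps to 128 before normalization. That is a convenient sanity check that the enhancement removes the local average and keeps only local contrast.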

---

### M3: Model Architecture Module

**Purpose:** Define the neural network architectures for disease classification.

#### M3.1: ViT-Base-Patch16-224 (Production Model)

| Attribute | Detail |
|-----------|--------|
| **Backbone** | ViT-Base-Patch16-224 (timm, pre-trained on ImageNet) |
| **Parameters** | ~86M |
| **Feature Dim** | 768 |
| **Disease Head** | 768→512→256→5 (BatchNorm, ReLU, Dropout) |
| **Severity Head** | 768→256→5 (BatchNorm, ReLU, Dropout) |
| **Model Size** | 331 MB |

**Why ViT Excels:**
- Global self-attention captures vessel patterns across the entire fundus
- Position encoding preserves spatial relationships (optic disc, macula location)
- Handles APTOS/ODIR domain shift better than CNNs (less texture-dependent)
- Superior on minority classes: Glaucoma +144%, AMD +199% over CNN baseline

#### M3.2: EfficientNet-B3 (Backup Model)

| Attribute | Detail |
|-----------|--------|
| **Backbone** | EfficientNet-B3 (timm, pre-trained on ImageNet) |
| **Parameters** | ~12M |
| **Feature Dim** | 1,536 |
| **Model Size** | 47 MB |

---

### M4: Training Engine Module

**Purpose:** Train the model with class-imbalance-aware strategies and GPU optimization.

| Attribute | Detail |
|-----------|--------|
| **Key Files** | `retinasense_vit.py`, `retinasense_v2_extended.py`, `retinasense_v2.py` |
| **Loss Function** | Focal Loss (γ=1.0, α=class_weights) + 0.2×CE (severity) |
| **Optimizer** | AdamW (lr=3×10⁻⁴) |
| **Scheduler** | Cosine Annealing (T_max=30, η_min=1×10⁻⁷) |
| **Mixed Precision** | AMP with GradScaler |
| **Gradient Accumulation** | 2 steps (effective batch=64) |
| **Early Stopping** | Patience=10 on macro F1 |
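
The loss row in the table above can be illustrated with a NumPy sketch of class-weighted focal loss on the disease head plus 0.2 × cross-entropy on the severity head. The training scripts presumably use PyTorch; this NumPy version, the mean reduction, and the function names are assumptions made to keep the example dependency-light.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy(logits, targets) -> float:
    """Mean negative log-probability of the true class."""
    log_p = log_softmax(logits)[np.arange(len(targets)), targets]
    return float(-log_p.mean())

def focal_loss(logits, targets, class_weights, gamma=1.0) -> float:
    """Mean of -alpha_y * (1 - p_y)^gamma * log(p_y) over the batch."""
    log_p = log_softmax(logits)[np.arange(len(targets)), targets]
    p = np.exp(log_p)
    return float((-class_weights[targets] * (1.0 - p) ** gamma * log_p).mean())

def total_loss(disease_logits, disease_y, severity_logits, severity_y,
               class_weights, gamma=1.0, severity_weight=0.2) -> float:
    """Focal loss on the disease head plus weighted CE on the severity head."""
    return (focal_loss(disease_logits, disease_y, class_weights, gamma)
            + severity_weight * cross_entropy(severity_logits, severity_y))
```

With γ = 0 and unit class weights the focal term reduces to ordinary cross-entropy; raising γ shrinks the contribution of samples the model already classifies confidently.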

**GPU Optimization Features:**
- Pre-cached preprocessing (100× faster data loading)
- Batch size scaling (32→128 for raw speed, 64 recommended for stability)
- 8 `DataLoader` workers with `persistent_workers` and `prefetch_factor=2`
- Non-blocking GPU transfers

**Training Duration:**
- ViT: ~6 minutes (30 epochs on H200)
- EfficientNet-B3 Extended: ~15 minutes (50 epochs)
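
For reference, the cosine annealing schedule listed in the M4 table (lr=3×10⁻⁴, T_max=30, η_min=1×10⁻⁷) follows the standard closed form η_t = η_min + ½(η_max − η_min)(1 + cos(πt / T_max)). The sketch below evaluates it once per epoch; whether the training scripts step the scheduler per epoch or per batch is not stated in this document, so per-epoch stepping is an assumption.

```python
import math

def cosine_annealing_lr(epoch: int, base_lr: float = 3e-4,
                        eta_min: float = 1e-7, t_max: int = 30) -> float:
    """Learning rate at a given epoch under cosine annealing."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)

# Learning rate at the start of each epoch, plus the final value at epoch 30.
schedule = [cosine_annealing_lr(epoch) for epoch in range(31)]
```

The schedule starts at the base rate, passes roughly half of it at epoch 15, and reaches η_min at epoch 30.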

---

### M5: Threshold Optimization Module

**Purpose:** Post-training optimization of per-class decision thresholds to maximize F1 score.

| Attribute | Detail |
|-----------|--------|
| **Key Files** | `threshold_optimization_vit.py`, `threshold_optimization_simple.py` |
| **Method** | Grid search (0.05–0.95, step 0.05) per class, one-vs-rest binary F1 |
| **Input** | Model softmax probabilities on validation set |
| **Output** | JSON file with optimal thresholds per class |
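
The grid search in the table above can be sketched as follows. This is an illustration rather than the code in `threshold_optimization_vit.py`: the function names and the tie-breaking rule (keep the lowest threshold among equally good ones) are assumptions.

```python
def binary_f1(y_true, y_pred) -> float:
    """F1 score for boolean labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def search_thresholds(probs, labels, num_classes):
    """Pick, per class, the grid threshold with the best one-vs-rest F1."""
    grid = [round(0.05 * k, 2) for k in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best = {}
    for c in range(num_classes):
        is_c = [y == c for y in labels]
        scored = [
            (binary_f1(is_c, [row[c] >= t for row in probs]), t) for t in grid
        ]
        best_f1 = max(f1 for f1, _ in scored)
        # Assumed tie-break: the lowest threshold that reaches the best F1.
        best[c] = min(t for f1, t in scored if f1 == best_f1)
    return best

# Toy validation set with two classes (rows are softmax probabilities).
toy_probs = [[0.92, 0.08], [0.62, 0.38], [0.43, 0.57], [0.18, 0.82]]
toy_labels = [0, 0, 1, 1]
toy_thresholds = search_thresholds(toy_probs, toy_labels, num_classes=2)
```

The resulting dictionary can be written out with `json.dump` to produce the per-class threshold file listed as the module output.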

**Optimal Thresholds (ViT, Accuracy-focused):**

| Class | Threshold | Clinical Rationale |
|-------|-----------|-------------------|
| Normal | 0.540 | Balanced |
| Diabetic Retinopathy (DR) | 0.240 | Lenient → high sensitivity (miss as few DR cases as possible) |
| Glaucoma | 0.810 | Strict → high specificity (require confidence) |
| Cataract | 0.930 | Very strict → minimize false positives |
| AMD | 0.850 | Strict → rare disease, need confidence |

**Impact:** +2.22 percentage points of accuracy for ViT (82.26% → 84.48%); +9.84 percentage points for the v2 baseline (63.52% → 73.36%).

---

### M6: Inference Pipeline Module

**Purpose:** Classify new fundus images using the trained model and optimized thresholds.

| Attribute | Detail |
|-----------|--------|
| **Key Files** | `RetinaSense_Production.ipynb` |
| **Latency** | ~15ms per image |
| **Throughput** | ~66 images/sec |
| **GPU Memory** | ~2 GB |

**Inference Flow:**
1. Load and preprocess image (Ben Graham)
2. Forward pass through ViT → disease logits + severity logits
3. Apply softmax → class probabilities
4. Apply per-class thresholds → final prediction
5. If no class probability reaches its own threshold → flag for expert review
6. Return: class label, confidence score, all probabilities
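
Steps 3 to 6 of the flow above can be sketched in plain Python using the ViT thresholds from M5. The document does not say how several classes passing their thresholds are resolved, so this sketch assumes the passing class with the highest probability wins. The function names and the shape of the returned dictionary are also assumptions.

```python
import math

CLASS_NAMES = ["Normal", "Diabetic Retinopathy", "Glaucoma", "Cataract", "AMD"]
THRESHOLDS = [0.540, 0.240, 0.810, 0.930, 0.850]  # ViT values from M5

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decide(disease_logits):
    """Apply per-class thresholds; flag for review when no class passes."""
    probs = softmax(disease_logits)
    passing = [i for i, p in enumerate(probs) if p >= THRESHOLDS[i]]
    if not passing:
        return {"label": None, "confidence": max(probs),
                "probabilities": probs, "needs_review": True}
    # Assumed rule: among passing classes, take the most probable one.
    best = max(passing, key=lambda i: probs[i])
    return {"label": CLASS_NAMES[best], "confidence": probs[best],
            "probabilities": probs, "needs_review": False}
```

Under this rule, logits of `[2.0, 1.5, 0.0, 0.0, 0.0]` yield a DR prediction even though Normal has the larger probability (about 0.50 versus 0.30), because Normal misses its 0.540 threshold while DR clears its lenient 0.240 threshold.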

---

### M7: Ensemble System Module

**Purpose:** Combine predictions from multiple models for improved minority class detection.

| Attribute | Detail |
|-----------|--------|
| **Key Files** | `ensemble_inference.py` |
| **Models** | ViT (85%), EfficientNet-Extended (10%), EfficientNet-v2 (5%) |
| **Strategy** | Weighted probability averaging |

**Performance Trade-off:**
- Ensemble: 80.44% accuracy, 0.858 macro F1, Cataract F1=0.952, AMD F1=0.920
- ViT Solo: 84.48% accuracy, 0.840 macro F1 (simpler, faster, recommended)
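
Weighted probability averaging with the 85/10/5 weights above can be sketched as below. This is not the code in `ensemble_inference.py`; the model keys and the function name are hypothetical.

```python
ENSEMBLE_WEIGHTS = {
    "vit": 0.85,
    "efficientnet_extended": 0.10,
    "efficientnet_v2": 0.05,
}

def ensemble_probs(model_probs, weights=ENSEMBLE_WEIGHTS):
    """Weighted average of per-model class probability vectors."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("ensemble weights must sum to 1")
    num_classes = len(next(iter(model_probs.values())))
    combined = [0.0] * num_classes
    for name, probs in model_probs.items():
        for i, p in enumerate(probs):
            combined[i] += weights[name] * p
    return combined

# Toy probabilities for one image from the three models.
combined = ensemble_probs({
    "vit": [0.6, 0.1, 0.1, 0.1, 0.1],
    "efficientnet_extended": [0.2, 0.2, 0.2, 0.2, 0.2],
    "efficientnet_v2": [0.0, 1.0, 0.0, 0.0, 0.0],
})
```

Because the weights sum to 1 and each input is a probability vector, the combined vector is also a valid probability distribution and can be passed to the same thresholding step as a single model's output.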

---

### M8: Evaluation & Visualization Module

**Purpose:** Comprehensive model evaluation with per-class metrics and visual dashboards.

| Attribute | Detail |
|-----------|--------|
| **Key Files** | Training scripts (eval sections), `RetinaSense_Production.ipynb` |
| **Primary Metrics** | Macro F1, accuracy, per-class F1/precision/recall |
| **Secondary Metrics** | Weighted F1, Macro AUC-ROC, confusion matrix |
| **Outputs** | `dashboard.png`, `threshold_comparison.png`, `training_curves.png` |

**Why Macro F1 (not accuracy):** Accuracy is misleading under a 21:1 class imbalance: a model that always predicts DR already reaches 65% accuracy. Macro F1 averages the per-class F1 scores, so every class counts equally regardless of its size.
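
A small worked example makes the gap concrete. The label counts below are invented for illustration (only the 65% DR share comes from this document), and the helper name is an assumption.

```python
def per_class_f1(y_true, y_pred, num_classes):
    """One-vs-rest F1 for each class; 0.0 when a class is never hit."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Toy labels: 65 DR samples (class 1) and 35 samples over the other classes.
y_true = [1] * 65 + [0] * 20 + [2] * 5 + [3] * 5 + [4] * 5
y_pred = [1] * len(y_true)  # degenerate model that always predicts DR

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_scores = per_class_f1(y_true, y_pred, num_classes=5)
macro_f1 = sum(f1_scores) / len(f1_scores)
```

Here accuracy is 0.65, while macro F1 is only about 0.158, because four of the five classes score an F1 of zero.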

---

### M9: Data Analysis Module

**Purpose:** Comprehensive dataset exploration to inform training strategy.

| Attribute | Detail |
|-----------|--------|
| **Key Files** | `data_analysis.py` |
| **Outputs** | `outputs_analysis/` (11 files: plots, reports, CSVs) |

**Analyses Performed:**
1. **Class distribution** — confirmed 21.1× imbalance
2. **Image quality metrics** — brightness, contrast, sharpness per class
3. **APTOS domain shift discovery** — 10.7× sharpness difference vs ODIR
4. **Error analysis** — most-confused class pairs (DR↔Normal, Normal↔AMD)
5. **Augmentation effectiveness** — light augmentation best during warmup
6. **Preprocessing impact** — Ben Graham boosts Glaucoma brightness most (+34.2)

---

## 4. Module Interaction Matrix

| From \ To | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 | M9 |
|-----------|----|----|----|----|----|----|----|----|-----|
| **M1** Data | — | ✓ | | | | | | | ✓ |
| **M2** Preprocess | | — | | ✓ | | ✓ | | | |
| **M3** Model | | | — | ✓ | | ✓ | ✓ | | |
| **M4** Training | | | ✓ | — | ✓ | | | ✓ | |
| **M5** Threshold | | | | | — | ✓ | ✓ | ✓ | |
| **M6** Inference | | ✓ | ✓ | | ✓ | — | | | |
| **M7** Ensemble | | | ✓ | | ✓ | | — | ✓ | |
| **M8** Evaluation | | | | | | | | — | |
| **M9** Analysis | ✓ | | | | | | | | — |

---

## 5. Test-Time Augmentation (TTA) Sub-Module

**Purpose:** Improve predictions by averaging over augmented versions of the input.

**8 Augmentations:** Original, H-flip, V-flip, Both flips, Rot 90°, Rot 180°, Rot 270°, Brightness  
**Impact:** +0.29% accuracy (modest; optional for production)  
**Trade-off:** 8× slower inference  
**Recommendation:** Use selectively for uncertain cases (confidence < threshold)
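
The eight-view averaging can be sketched with NumPy. The brightness factor of 1.1, the function names, and the `predict_fn` callable (standing in for the model's forward pass plus softmax) are assumptions; the document does not specify them.

```python
import numpy as np

def tta_views(img: np.ndarray, brightness: float = 1.1):
    """Eight views of a square HxWxC float image with values in [0, 1]."""
    return [
        img,                                  # original
        img[:, ::-1],                         # horizontal flip
        img[::-1, :],                         # vertical flip
        img[::-1, ::-1],                      # both flips
        np.rot90(img, k=1, axes=(0, 1)),      # rotate 90 degrees
        np.rot90(img, k=2, axes=(0, 1)),      # rotate 180 degrees
        np.rot90(img, k=3, axes=(0, 1)),      # rotate 270 degrees
        np.clip(img * brightness, 0.0, 1.0),  # brightness (assumed factor)
    ]

def predict_with_tta(img: np.ndarray, predict_fn) -> np.ndarray:
    """Average the class probabilities predicted for every view."""
    probs = np.stack([predict_fn(view) for view in tta_views(img)])
    return probs.mean(axis=0)
```

One detail worth knowing: flipping along both axes is the same operation as a 180° rotation, so in this sketch the fourth and sixth views are identical and that orientation is effectively weighted twice in the average.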

---

## 6. Configuration Parameters

| Parameter | Default | Range | Notes |
|-----------|---------|-------|-------|
| `IMG_SIZE` | 224 | 224–300 | 224 for ViT, 300 for EfficientNet |
| `BATCH_SIZE` | 32 | 16–128 | 64 recommended for stability |
| `NUM_WORKERS` | 8 | 0–16 | Match to CPU cores |
| `USE_CACHE` | True | True/False | 4× speedup when True |
| `EPOCHS` | 30 | 10–100 | ViT converges by 30 |
| `ACCUM_STEPS` | 2 | 1–8 | Gradient accumulation factor |
| `PATIENCE` | 10 | 5–15 | Early stopping on macro F1 |
| `FOCAL_GAMMA` | 1.0 | 0.5–3.0 | Focusing parameter for class imbalance |
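
One way to keep these defaults and ranges in a single place is a validated dataclass. This is a sketch, not how the training scripts define their constants; the `TrainConfig` class, the lowercase field names, and the `effective_batch_size` property are hypothetical.

```python
from dataclasses import dataclass

# Allowed ranges copied from the table above (inclusive bounds).
RANGES = {
    "img_size": (224, 300),
    "batch_size": (16, 128),
    "num_workers": (0, 16),
    "epochs": (10, 100),
    "accum_steps": (1, 8),
    "patience": (5, 15),
    "focal_gamma": (0.5, 3.0),
}

@dataclass(frozen=True)
class TrainConfig:
    img_size: int = 224
    batch_size: int = 32
    num_workers: int = 8
    use_cache: bool = True
    epochs: int = 30
    accum_steps: int = 2
    patience: int = 10
    focal_gamma: float = 1.0

    def __post_init__(self):
        # Reject any value that falls outside its documented range.
        for name, (low, high) in RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")

    @property
    def effective_batch_size(self) -> int:
        """Batch size seen by the optimizer after gradient accumulation."""
        return self.batch_size * self.accum_steps
```

With the defaults, `TrainConfig().effective_batch_size` is 64, matching the effective batch size listed for the training engine in M4.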
