# RetinaSense v4 — Progress Report

> Last updated: 2026-03-11
> Status: **ALL MILESTONES COMPLETE**

---

## 5-Fold Cross-Validation Results (HEADLINE)

| Metric | Mean | Std | Min | Max |
|--------|------|-----|-----|-----|
| **Accuracy** | **91.13%** | +/- 0.55% | 90.4% | 92.1% |
| **Macro F1** | **0.910** | +/- 0.006 | 0.903 | 0.920 |
| **Macro AUC** | **0.986** | +/- 0.001 | 0.985 | 0.988 |

| Fold | Accuracy | Macro F1 | AUC |
|------|----------|----------|-----|
| 1 | 91.2% | 0.910 | 0.986 |
| 2 | 91.0% | 0.909 | 0.985 |
| 3 | 92.1% | 0.920 | 0.988 |
| 4 | 90.4% | 0.903 | 0.986 |
| 5 | 91.0% | 0.908 | 0.987 |

- Dataset: 10,000 balanced images (2,000/class), 5 folds
- Training recipe: v3 transfer learning + LLRD + OneCycleLR + FocalLoss + MixUp
- ~7.5 min per fold on A100 80GB (~38 min total)
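
The summary row can be re-derived from the per-fold table. The sketch below uses only the standard library and assumes the reported Std is the population standard deviation, since that is what reproduces 0.55 from the rounded per-fold accuracies. Because it starts from rounded inputs, the mean can differ from the reported 91.13% in the last digit.

```python
from statistics import mean, pstdev

# Per-fold values copied from the table above (already rounded).
folds = {
    "accuracy_pct": [91.2, 91.0, 92.1, 90.4, 91.0],
    "macro_f1": [0.910, 0.909, 0.920, 0.903, 0.908],
    "macro_auc": [0.986, 0.985, 0.988, 0.986, 0.987],
}


def summarize(values):
    """Return mean, population std, min and max for one metric."""
    return {
        "mean": mean(values),
        "std": pstdev(values),  # assumption: population std (ddof=0)
        "min": min(values),
        "max": max(values),
    }


summary = {name: summarize(vals) for name, vals in folds.items()}

for name, stats in summary.items():
    print(f"{name}: {stats['mean']:.3f} +/- {stats['std']:.3f} "
          f"(min {stats['min']}, max {stats['max']})")
```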

---

## Held-Out Test Set Results (1,486 samples)

| Metric | Value |
|--------|-------|
| Accuracy | 80.9% (82.0% with optimized thresholds) |
| Macro F1 | 0.813 (0.822 with optimized thresholds) |
| AUC (macro) | 0.969 |
| Cohen Kappa | 0.761 |
| Matthews Correlation | 0.768 |
| MC Dropout Acc@90% retention | 86.0% |
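
"Acc@90% retention" is read here as the accuracy on the 90% of test samples with the lowest MC Dropout uncertainty; that reading is an assumption, as is the choice of uncertainty score. A minimal sketch of the metric on toy data (the function name and the values are hypothetical):

```python
def accuracy_at_retention(correct, uncertainty, retention=0.9):
    """Accuracy on the `retention` fraction of samples with lowest uncertainty."""
    if not correct or len(correct) != len(uncertainty):
        raise ValueError("need two non-empty lists of equal length")
    if not 0.0 < retention <= 1.0:
        raise ValueError("retention must be in (0, 1]")
    # Sort sample indices from most certain to least certain.
    order = sorted(range(len(correct)), key=lambda i: uncertainty[i])
    keep = max(1, int(round(retention * len(order))))
    kept = order[:keep]
    return sum(correct[i] for i in kept) / keep


# Toy data (hypothetical, not taken from the RetinaSense test set).
correct = [1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
uncertainty = [0.1, 0.2, 0.9, 0.3, 0.1, 0.4, 0.2, 0.3, 0.1, 0.2]

full = accuracy_at_retention(correct, uncertainty, retention=1.0)
top90 = accuracy_at_retention(correct, uncertainty, retention=0.9)
print(f"all samples: {full:.3f}, 90% retention: {top90:.3f}")
```

Dropping the single most uncertain sample removes one error here, which is the effect the metric is meant to capture.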

### Per-Class Test Results

| Class | F1 | AUC | Precision | Recall |
|-------|----|-----|-----------|--------|
| Normal | 0.69 | 0.926 | 0.573 | 0.857 |
| Diabetes/DR | 0.78 | 0.965 | 0.844 | 0.726 |
| Glaucoma | 0.78 | 0.981 | 0.925 | 0.670 |
| Cataract | 0.95 | 0.997 | 0.940 | 0.966 |
| AMD | 0.87 | 0.977 | 0.917 | 0.827 |
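
The F1 column follows from the precision and recall columns, and the macro F1 is the unweighted mean over the five classes. A short sketch that recomputes both from the table values:

```python
# (precision, recall) copied from the per-class table above.
per_class = {
    "Normal": (0.573, 0.857),
    "Diabetes/DR": (0.844, 0.726),
    "Glaucoma": (0.925, 0.670),
    "Cataract": (0.940, 0.966),
    "AMD": (0.917, 0.827),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = {name: f1_score(p, r) for name, (p, r) in per_class.items()}
macro_f1 = sum(f1.values()) / len(f1)

for name, value in f1.items():
    print(f"{name}: F1 = {value:.2f}")
print(f"Macro F1 = {macro_f1:.3f}")
```

The Normal class shows the pattern most clearly: high recall (0.857) with low precision (0.573) pulls its F1 down to 0.69.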

---

## Key Training Fixes Applied (vs. initial v4 config)

1. **Loaded v3 pretrained weights** -- biggest single improvement (+10% accuracy)
2. Removed WeightedRandomSampler and Focal Loss alpha (data already balanced)
3. Added LLRD (Layer-wise Learning Rate Decay)
4. Added LR warmup and label smoothing (0.1)
5. Added SWA (Stochastic Weight Averaging)
6. Raised gradient clipping from 1.0 to 5.0
7. Raised weight decay from 1e-4 to 0.01
8. Raised batch size from 32 to 64 (effective 128 with gradient accumulation)
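
As an illustration of fixes 3 and 8, the sketch below computes a layer-wise learning rate schedule and the effective batch size. The base learning rate and decay factor are hypothetical, since this report does not state them; only the 12-block depth of ViT-Base and the 64 to 128 batch arithmetic are taken as given.

```python
def llrd_learning_rates(num_layers, base_lr, decay):
    """Learning rate per layer: the top layer gets base_lr, and each
    layer below it is scaled by one more factor of `decay`."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# Hypothetical values; the report does not give the base LR or decay factor.
NUM_VIT_BLOCKS = 12  # ViT-Base has 12 transformer blocks
BASE_LR = 1e-4
DECAY = 0.75

lrs = llrd_learning_rates(NUM_VIT_BLOCKS, BASE_LR, DECAY)
for block, lr in enumerate(lrs):
    print(f"block {block:2d}: lr = {lr:.2e}")

# Effective batch size: per-step batch times gradient accumulation steps.
BATCH_SIZE = 64
ACCUM_STEPS = 2  # implied by batch 64 with an effective batch of 128
effective_batch = BATCH_SIZE * ACCUM_STEPS
print("effective batch size:", effective_batch)
```

In a PyTorch training script, each rate would typically be attached to its own optimizer parameter group.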

---

## Completed Pipeline

| Step | Status | Output |
|------|--------|--------|
| Dataset merge + clean + balance | DONE | 10,000 images (2,000/class) |
| Preprocessing cache | DONE | 10,000 .npy files |
| Main training (3 rounds) | DONE | outputs_v4/best_model.pth |
| Temperature scaling | DONE | outputs_v4/temperature.json |
| Threshold optimization | DONE | outputs_v4/thresholds.json |
| Evaluation dashboard | DONE | outputs_v4/evaluation/ |
| FAISS retrieval index | DONE | outputs_v4/retrieval/ |
| 5-fold cross-validation | DONE | outputs_v4/kfold/ |
| HuggingFace upload | DONE | tanishq74/retinasense-vit |
| Lesion attention training | DONE | outputs_v4/lesion_attention/ |
| Gradio demo app | DONE | app.py |
| HuggingFace model card | DONE | README.md |

---

## Lesion-Aware Attention Training (COMPLETE)

- Model: HybridRetinaModel (ViT-Base + EfficientNet-B3), fine-tuned from v4 best_model.pth
- Method: GradCAM-derived attention maps + pseudo-mask supervision (508 masks generated)
- Loss: Classification (Focal) + 0.2 * Attention (soft-IoU + entropy regularizer)
- GPU: NVIDIA A100-SXM4-80GB, ~30 min total (10 epochs)
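
The combined loss can be sketched without a deep learning framework so the arithmetic is explicit. Only the 0.2 attention weight and the soft-IoU plus entropy structure come from the list above; the exact form of the entropy term, its weight, and the function names are assumptions, and the real training code presumably operates on tensors rather than flat lists.

```python
import math

ATTENTION_WEIGHT = 0.2  # weight on the attention term, from the Loss line above


def soft_iou_loss(attention, mask, eps=1e-6):
    """1 - soft IoU between an attention map and a pseudo-mask.

    Both inputs are flat lists of values in [0, 1].
    """
    intersection = sum(a * m for a, m in zip(attention, mask))
    union = sum(attention) + sum(mask) - intersection
    return 1.0 - (intersection + eps) / (union + eps)


def entropy_regularizer(attention, eps=1e-6):
    """Mean binary entropy of the attention values (low for decisive maps).

    Assumption: the report does not define its entropy term.
    """
    total = 0.0
    for a in attention:
        a = min(max(a, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= a * math.log(a) + (1.0 - a) * math.log(1.0 - a)
    return total / len(attention)


def total_loss(cls_loss, attention, mask, entropy_weight=0.1):
    """Classification loss plus the weighted attention loss.

    entropy_weight is a hypothetical value, not taken from the report.
    """
    attn_loss = soft_iou_loss(attention, mask)
    attn_loss += entropy_weight * entropy_regularizer(attention)
    return cls_loss + ATTENTION_WEIGHT * attn_loss


# Toy 2x2 maps flattened to lists (hypothetical values).
mask = [1.0, 1.0, 0.0, 0.0]
good_attention = [0.9, 0.8, 0.1, 0.1]
bad_attention = [0.1, 0.2, 0.9, 0.8]

print(total_loss(0.30, good_attention, mask))
print(total_loss(0.30, bad_attention, mask))
```

Attention that overlaps the pseudo-mask yields the lower total, so the attention term pushes the model toward the masked regions.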

### Final Results

| Metric | Value |
|--------|-------|
| **Best Val Acc** | **86.0%** (epoch 9) |
| **Best Macro F1** | **0.8607** (epoch 9) |
| Train Acc (final) | 90.0% |
| Pseudo-masks generated | 508 |

### Per-Class F1 (best epoch 9)

| Normal | DR | Glaucoma | Cataract | AMD |
|--------|----|----------|----------|-----|
| 0.75 | 0.79 | 0.86 | 0.97 | 0.94 |

### Full Epoch Log

| Epoch | Loss | Cls Loss | Attn Loss | Train Acc | Val Acc | Macro F1 | Notes |
|-------|------|----------|-----------|-----------|---------|----------|-------|
| 1 | 0.528 | 0.471 | 0.284 | 75.6% | 85.2% | 0.8517 | BEST |
| 2 | 0.355 | 0.335 | 0.101 | 81.6% | 84.6% | 0.8462 | |
| 3 | 0.287 | 0.269 | 0.092 | 83.8% | 83.5% | 0.8353 | |
| 4 | 0.249 | 0.232 | 0.084 | 85.9% | 85.0% | 0.8501 | |
| 5 | 0.212 | 0.195 | 0.086 | 88.4% | 85.2% | 0.8537 | BEST |
| 6 | 0.190 | 0.173 | 0.089 | 88.9% | 85.8% | 0.8595 | BEST |
| 7 | 0.185 | 0.169 | 0.082 | 88.7% | 83.6% | 0.8390 | |
| 8 | 0.179 | 0.162 | 0.085 | 89.2% | 85.4% | 0.8546 | |
| 9 | 0.164 | 0.147 | 0.082 | 89.8% | 86.0% | 0.8607 | BEST |
| 10 | 0.161 | 0.146 | 0.076 | 90.0% | 85.0% | 0.8513 | |
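
The Loss column is consistent with the stated weighting, Cls Loss + 0.2 * Attn Loss. A quick check over all ten epochs, with a tolerance that allows for the three-decimal rounding of the logged values:

```python
ATTENTION_WEIGHT = 0.2  # from the loss definition: Cls + 0.2 * Attn

# (epoch, total loss, classification loss, attention loss) from the log above.
epoch_log = [
    (1, 0.528, 0.471, 0.284),
    (2, 0.355, 0.335, 0.101),
    (3, 0.287, 0.269, 0.092),
    (4, 0.249, 0.232, 0.084),
    (5, 0.212, 0.195, 0.086),
    (6, 0.190, 0.173, 0.089),
    (7, 0.185, 0.169, 0.082),
    (8, 0.179, 0.162, 0.085),
    (9, 0.164, 0.147, 0.082),
    (10, 0.161, 0.146, 0.076),
]

# Each logged value is rounded to 3 decimals, so allow a small tolerance.
TOLERANCE = 0.0015

residuals = {
    epoch: abs(total - (cls + ATTENTION_WEIGHT * attn))
    for epoch, total, cls, attn in epoch_log
}

for epoch, residual in residuals.items():
    status = "ok" if residual <= TOLERANCE else "MISMATCH"
    print(f"epoch {epoch:2d}: residual {residual:.4f} {status}")
```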

---

## All Tasks Complete

- [x] Lesion attention training
- [x] Gradio demo app (`app.py`)
- [x] Full HuggingFace model card (`README.md`)
- [x] Upload kfold results + plots to HuggingFace

---

## Output Files

```
outputs_v4/
  best_model.pth          (391MB) -- hybrid model checkpoint
  final_metrics.json               -- test set metrics
  temperature.json                 -- calibration temperature
  thresholds.json                  -- per-class thresholds
  training_curves.png              -- loss/acc/F1 plots
  history.json                     -- epoch-by-epoch history
  progress_snapshot.json           -- quick progress reference
  evaluation/
    confusion_matrix.png           -- 5-class confusion matrix
    roc_curves.png                 -- per-class ROC (all AUC > 0.92)
    uncertainty_analysis.png       -- MC Dropout uncertainty
    metrics_report.json            -- comprehensive metrics JSON
    evaluation_report.txt          -- human-readable summary
  retrieval/
    index_flat_l2.faiss            -- exact search index (7,038 vectors)
    index_ivf_flat.faiss           -- approximate search index
    embeddings.npy                 -- 768-dim ViT embeddings
    metadata.json                  -- image paths + labels
  kfold/
    fold_1_best.pth ... fold_5_best.pth  -- per-fold checkpoints
    kfold_results.json             -- aggregate CV results
    fold_comparison.png            -- fold comparison bar charts
    perclass_f1_boxplot.png        -- per-class F1 boxplot
  lesion_attention/
    best_model.pth                 -- lesion-attention fine-tuned checkpoint
    training_history.json          -- epoch-by-epoch attention training log
  pseudo_masks/                    -- 508 GradCAM-derived pseudo lesion masks
```

---

## Model Architecture

```
Input (B, 3, 224, 224)
    |
    +-- EfficientNet-B3 --> (B, 1536)  [local/texture features]
    |
    +-- ViT-Base/16     --> (B, 768)   [global/structural features]
    |
    v
Concatenate --> (B, 2304)
    |
    v
Linear(2304, 512) + ReLU + Dropout(0.3)
Linear(512, 256)  + ReLU + Dropout(0.3)
Linear(256, 5)
    |
    v
Logits (B, 5)

Total Parameters: 97,807,661
```
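
The sizes in the diagram can be cross-checked with simple arithmetic. The sketch below counts the parameters of the fusion head, assuming the Linear layers have biases (ReLU and Dropout add none), and estimates the weight size assuming float32 storage. That estimate lines up with the 391MB checkpoint listed under Output Files.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


EFFICIENTNET_B3_DIM = 1536
VIT_BASE_DIM = 768
fused_dim = EFFICIENTNET_B3_DIM + VIT_BASE_DIM  # 2304 after concatenation

# Layer sizes from the diagram; biases are assumed to be enabled.
head_layers = [(fused_dim, 512), (512, 256), (256, 5)]
head_params = sum(linear_params(i, o) for i, o in head_layers)

TOTAL_PARAMS = 97_807_661  # from the diagram
BYTES_PER_FLOAT32 = 4      # assumes weights are stored as float32
checkpoint_mb = TOTAL_PARAMS * BYTES_PER_FLOAT32 / 1e6

print("fused feature size:", fused_dim)
print("classifier head parameters:", head_params)
print(f"head share of total: {head_params / TOTAL_PARAMS:.2%}")
print(f"float32 weight size: {checkpoint_mb:.0f} MB")
```

Under these assumptions the head accounts for just over 1% of the parameters; nearly all of the model sits in the two backbones.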

---

## For Claude: Resume Instructions

When starting a new session, read this file, MEMORY.md, and PROJECT_SUMMARY.md.

### Current State (2026-03-11)
- **ALL v4 MILESTONES COMPLETE.** Nothing remaining from the original plan.
- Everything has been pushed to HuggingFace: `tanishq74/retinasense-vit`

### Key Files
- v3 weights (transfer learning source): `best_model.pth`, `efficientnet_b3.pth`
- v4 main model: `outputs_v4/best_model.pth` (391MB, epoch 5 + SWA)
- Lesion attention model: `outputs_v4/lesion_attention/best_model.pth` (preferred for inference)
- Gradio demo: `python app.py --share` (auto-selects lesion attention model)
- Full project summary: `PROJECT_SUMMARY.md`
- HuggingFace model card: `README.md`

### Performance Summary
- **5-Fold CV**: 91.1% accuracy, 0.910 F1, 0.986 AUC
- **Held-out test**: 80.9% accuracy (82.0% with optimized thresholds), 0.969 AUC
- **Lesion attention**: 86.0% val accuracy, 0.861 F1
- **Calibration**: ECE reduced from 0.140 to 0.026
- **Uncertainty**: MC Dropout Acc@90% retention = 86.0%
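
For reference, expected calibration error (ECE) compares confidence with accuracy inside confidence bins. The sketch below assumes 10 equal-width bins, which this report does not specify, and runs on toy values rather than project data.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence|."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty lists of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / len(confidences) * abs(accuracy - avg_conf)
    return ece


# Toy predictions (hypothetical): the same 4-of-5 hit rate, reported with
# overconfident and with well-matched confidence values.
correct = [1, 1, 1, 1, 0]
overconfident = [0.99, 0.99, 0.99, 0.99, 0.99]
calibrated = [0.80, 0.80, 0.80, 0.80, 0.80]

ece_over = expected_calibration_error(overconfident, correct)
ece_cal = expected_calibration_error(calibrated, correct)
print(f"overconfident: {ece_over:.3f}, calibrated: {ece_cal:.3f}")
```

Temperature scaling lowers ECE by dividing the logits by a learned scalar before the softmax, which softens overconfident probabilities without changing the predicted class.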

### What Has Been Completed
1. Dataset: merge APTOS+ODIR -> clean -> balance (10K) -> split (patient-aware)
2. Preprocessing: CLAHE + circular mask -> 10K cached .npy files
3. Main training: Hybrid ViT+EfficientNet-B3, transfer learning, LLRD, SWA, FocalLoss
4. 5-fold cross-validation: 91.1% mean accuracy across 5 folds
5. Evaluation dashboard: confusion matrix, ROC curves, MC Dropout uncertainty
6. Temperature calibration + per-class threshold optimization
7. FAISS retrieval index: 7,038 vectors, FlatL2 + IVFFlat
8. Lesion attention training: GradCAM-guided, 508 pseudo-masks, 10 epochs
9. Gradio demo app: classification + GradCAM + uncertainty + retrieval
10. HuggingFace: model card, all outputs, plots uploaded
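
Item 1 mentions a patient-aware split. One way to sketch the idea is to shuffle and split at the patient level, so that all images from one patient land on the same side. The record keys, the test fraction, and the function name are hypothetical and not taken from the project's scripts.

```python
import random
from collections import defaultdict


def patient_aware_split(records, test_fraction=0.15, seed=42):
    """Split records so every patient's images stay on one side.

    records: list of dicts with hypothetical keys "patient_id" and "image".
    test_fraction is measured in patients, not images.
    """
    by_patient = defaultdict(list)
    for record in records:
        by_patient[record["patient_id"]].append(record)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(test_fraction * len(patients)))
    test_patients = set(patients[:n_test])

    train, test = [], []
    for patient, items in by_patient.items():
        (test if patient in test_patients else train).extend(items)
    return train, test


# Toy records (hypothetical): 10 patients with two images each.
records = [
    {"patient_id": f"p{i}", "image": f"p{i}_{eye}.png"}
    for i in range(10)
    for eye in ("left", "right")
]
train, test = patient_aware_split(records, test_fraction=0.2)
train_ids = {r["patient_id"] for r in train}
test_ids = {r["patient_id"] for r in test}
print(len(train), len(test), train_ids & test_ids)
```

Splitting by patient avoids leakage where two images of the same eye pair end up in both training and evaluation data.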

### Potential Future Work (not started)
- Deploy to HuggingFace Spaces (permanent hosting)
- ONNX export for edge deployment
- Additional datasets (REFUGE, MESSIDOR, ADAM)
- Multi-scale ViT patches (8x8) for finer lesion detection
- Full LLM-based RAG report generation (Claude API)
- Lesion segmentation head (needs pixel-level annotations)

### HuggingFace Access
- Repo: `tanishq74/retinasense-vit`
- Token: stored in user's environment (do NOT hardcode)

### Technical Notes for Future Sessions
- Column name in CSVs is `label` (not `disease_label`)
- Validation file is `val_split.csv` (not `calib_split.csv`)
- Use `weights_only=False` when loading v3/v4 checkpoints with `torch.load` (this unpickles arbitrary objects, so load only trusted checkpoint files)
- Grad clip = 5.0 across all training scripts
- Norm stats: mean=[0.4298, 0.2784, 0.1559], std=[0.2857, 0.2065, 0.1465]
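
The normalization statistics in the last note could be applied per channel as sketched below. This assumes RGB channel order and that the statistics were computed on pixels scaled to [0, 1]; both are assumptions, and the function names and sample pixel are hypothetical.

```python
# Per-channel statistics from the technical notes above (RGB order assumed).
MEAN = [0.4298, 0.2784, 0.1559]
STD = [0.2857, 0.2065, 0.1465]


def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel.

    Assumption: the statistics were computed on [0, 1]-scaled pixels.
    """
    return [
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb_uint8, MEAN, STD)
    ]


def denormalize_pixel(normalized):
    """Invert normalize_pixel, e.g. before displaying an image."""
    return [
        round((value * std + mean) * 255.0)
        for value, mean, std in zip(normalized, MEAN, STD)
    ]


pixel = [180, 90, 40]  # hypothetical fundus pixel
normalized = normalize_pixel(pixel)
print(normalized)
print(denormalize_pixel(normalized))
```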