# RetinaSense v4 — Progress Report

> Last updated: 2026-03-11
> Status: **ALL MILESTONES COMPLETE**

---

## 5-Fold Cross-Validation Results (HEADLINE)

| Metric | Mean | Std | Min | Max |
|--------|------|-----|-----|-----|
| **Accuracy** | **91.13%** | +/- 0.55% | 90.4% | 92.1% |
| **Macro F1** | **0.910** | +/- 0.006 | 0.903 | 0.920 |
| **Macro AUC** | **0.986** | +/- 0.001 | 0.985 | 0.988 |

| Fold | Accuracy | Macro F1 | AUC |
|------|----------|----------|-----|
| 1 | 91.2% | 0.910 | 0.986 |
| 2 | 91.0% | 0.909 | 0.985 |
| 3 | 92.1% | 0.920 | 0.988 |
| 4 | 90.4% | 0.903 | 0.986 |
| 5 | 91.0% | 0.908 | 0.987 |

- Dataset: 10,000 balanced images (2,000/class), 5 folds
- v3 transfer learning + LLRD + OneCycleLR + FocalLoss + MixUp
- ~7.5 min per fold on A100 80GB (~38 min total)

---

## Held-Out Test Set Results (1,486 samples)

| Metric | Value |
|--------|-------|
| Accuracy | 80.9% (82.0% with optimized thresholds) |
| Macro F1 | 0.813 (0.822 with thresholds) |
| AUC (macro) | 0.969 |
| Cohen Kappa | 0.761 |
| Matthews Correlation | 0.768 |
| MC Dropout Acc@90% retention | 86.0% |

### Per-Class Test Results

| Class | F1 | AUC | Precision | Recall |
|-------|----|-----|-----------|--------|
| Normal | 0.69 | 0.926 | 0.573 | 0.857 |
| Diabetes/DR | 0.78 | 0.965 | 0.844 | 0.726 |
| Glaucoma | 0.78 | 0.981 | 0.925 | 0.670 |
| Cataract | 0.95 | 0.997 | 0.940 | 0.966 |
| AMD | 0.87 | 0.977 | 0.917 | 0.827 |

---

## Key Training Fixes Applied (vs. initial v4 config)

1. **Loaded v3 pretrained weights** -- biggest single improvement (+10% accuracy)
2. Removed WeightedRandomSampler and Focal Loss alpha (data already balanced)
3. Added LLRD (Layer-wise Learning Rate Decay)
4. Added LR warmup and label smoothing (0.1)
5. Added SWA (Stochastic Weight Averaging)
6. Grad clip 1.0 -> 5.0
7. Weight decay 1e-4 -> 0.01
8. Batch size 32 -> 64 (effective 128 with grad accumulation)

---

## Completed Pipeline

| Step | Status | Output |
|------|--------|--------|
| Dataset merge + clean + balance | DONE | 10,000 images (2,000/class) |
| Preprocessing cache | DONE | 10,000 .npy files |
| Main training (3 rounds) | DONE | outputs_v4/best_model.pth |
| Temperature scaling | DONE | outputs_v4/temperature.json |
| Threshold optimization | DONE | outputs_v4/thresholds.json |
| Evaluation dashboard | DONE | outputs_v4/evaluation/ |
| FAISS retrieval index | DONE | outputs_v4/retrieval/ |
| 5-fold cross-validation | DONE | outputs_v4/kfold/ |
| HuggingFace upload | DONE | tanishq74/retinasense-vit |
| Lesion attention training | DONE | outputs_v4/lesion_attention/ |
| Gradio demo app | DONE | app.py |
| HuggingFace model card | DONE | README.md |

---

## Lesion-Aware Attention Training (COMPLETE)

- Model: HybridRetinaModel (ViT-Base + EfficientNet-B3), fine-tuned from v4 best_model.pth
- Method: GradCAM-derived attention maps + pseudo-mask supervision (508 masks generated)
- Loss: Classification (Focal) + 0.2 * Attention (soft-IoU + entropy regularizer)
- GPU: NVIDIA A100-SXM4-80GB, ~30 min total (10 epochs)

### Final Results

| Metric | Value |
|--------|-------|
| **Best Val Acc** | **86.0%** (epoch 9) |
| **Best Macro F1** | **0.8607** (epoch 9) |
| Train Acc (final) | 90.0% |
| Pseudo-masks generated | 508 |

### Per-Class F1 (best epoch 9)

| Normal | DR | Glaucoma | Cataract | AMD |
|--------|----|----------|----------|-----|
| 0.75 | 0.79 | 0.86 | 0.97 | 0.94 |

### Full Epoch Log

| Epoch | Loss | Cls Loss | Attn Loss | Train Acc | Val Acc | Macro F1 | Notes |
|-------|------|----------|-----------|-----------|---------|----------|-------|
| 1 | 0.528 | 0.471 | 0.284 | 75.6% | 85.2% | 0.8517 | BEST |
| 2 | 0.355 | 0.335 | 0.101 | 81.6% | 84.6% | 0.8462 | |
| 3 | 0.287 | 0.269 | 0.092 | 83.8% | 83.5% | 0.8353 | |
| 4 | 0.249 | 0.232 | 0.084 | 85.9% | 85.0% | 0.8501 | |
| 5 | 0.212 | 0.195 | 0.086 | 88.4% | 85.2% | 0.8537 | BEST |
| 6 | 0.190 | 0.173 | 0.089 | 88.9% | 85.8% | 0.8595 | BEST |
| 7 | 0.185 | 0.169 | 0.082 | 88.7% | 83.6% | 0.8390 | |
| 8 | 0.179 | 0.162 | 0.085 | 89.2% | 85.4% | 0.8546 | |
| 9 | 0.164 | 0.147 | 0.082 | 89.8% | 86.0% | 0.8607 | BEST |
| 10 | 0.161 | 0.146 | 0.076 | 90.0% | 85.0% | 0.8513 | |

---

## All Tasks Complete

- [x] Lesion attention training
- [x] Gradio demo app (`app.py`)
- [x] Full HuggingFace model card (`README.md`)
- [x] Upload kfold results + plots to HuggingFace

---

## Output Files

```
outputs_v4/
  best_model.pth (391MB)       -- hybrid model checkpoint
  final_metrics.json           -- test set metrics
  temperature.json             -- calibration temperature
  thresholds.json              -- per-class thresholds
  training_curves.png          -- loss/acc/F1 plots
  history.json                 -- epoch-by-epoch history
  progress_snapshot.json       -- quick progress reference
  evaluation/
    confusion_matrix.png       -- 5-class confusion matrix
    roc_curves.png             -- per-class ROC (all AUC > 0.92)
    uncertainty_analysis.png   -- MC Dropout uncertainty
    metrics_report.json        -- comprehensive metrics JSON
    evaluation_report.txt      -- human-readable summary
  retrieval/
    index_flat_l2.faiss        -- exact search index (7,038 vectors)
    index_ivf_flat.faiss       -- approximate search index
    embeddings.npy             -- 768-dim ViT embeddings
    metadata.json              -- image paths + labels
  kfold/
    fold_1_best.pth ... fold_5_best.pth  -- per-fold checkpoints
    kfold_results.json         -- aggregate CV results
    fold_comparison.png        -- fold comparison bar charts
    perclass_f1_boxplot.png    -- per-class F1 boxplot
  lesion_attention/
    best_model.pth             -- lesion-attention fine-tuned checkpoint
    training_history.json      -- epoch-by-epoch attention training log
    pseudo_masks/              -- 508 GradCAM-derived pseudo lesion masks
```

---

## Model Architecture

```
Input (B, 3, 224, 224)
    |
    +-- EfficientNet-B3 --> (B, 1536)   [local/texture features]
    |
    +-- ViT-Base/16     --> (B, 768)    [global/structural features]
    |
    v
Concatenate --> (B, 2304)
    |
    v
Linear(2304, 512) + ReLU + Dropout(0.3)
Linear(512, 256)  + ReLU + Dropout(0.3)
Linear(256, 5)
    |
    v
Logits (B, 5)

Total Parameters: 97,807,661
```

---

## For Claude: Resume Instructions

When starting a new session, read this file, MEMORY.md, and PROJECT_SUMMARY.md.

### Current State (2026-03-11)

- **ALL v4 MILESTONES COMPLETE.** Nothing remaining from the original plan.
- Everything has been pushed to HuggingFace: `tanishq74/retinasense-vit`

### Key Files

- v3 weights (transfer learning source): `best_model.pth`, `efficientnet_b3.pth`
- v4 main model: `outputs_v4/best_model.pth` (391MB, epoch 5 + SWA)
- Lesion attention model: `outputs_v4/lesion_attention/best_model.pth` (preferred for inference)
- Gradio demo: `python app.py --share` (auto-selects lesion attention model)
- Full project summary: `PROJECT_SUMMARY.md`
- HuggingFace model card: `README.md`

### Performance Summary

- **5-Fold CV**: 91.1% accuracy, 0.910 F1, 0.986 AUC
- **Held-out test**: 80.9% accuracy (82.0% w/ thresholds), 0.969 AUC
- **Lesion attention**: 86.0% val accuracy, 0.861 F1
- **Calibration**: ECE reduced from 0.140 to 0.026
- **Uncertainty**: MC Dropout Acc@90% retention = 86.0%

### What Has Been Completed

1. Dataset: merge APTOS+ODIR -> clean -> balance (10K) -> split (patient-aware)
2. Preprocessing: CLAHE + circular mask -> 10K cached .npy files
3. Main training: Hybrid ViT+EfficientNet-B3, transfer learning, LLRD, SWA, FocalLoss
4. 5-fold cross-validation: 91.1% mean accuracy across 5 folds
5. Evaluation dashboard: confusion matrix, ROC curves, MC Dropout uncertainty
6. Temperature calibration + per-class threshold optimization
7. FAISS retrieval index: 7,038 vectors, FlatL2 + IVFFlat
8. Lesion attention training: GradCAM-guided, 508 pseudo-masks, 10 epochs
9. Gradio demo app: classification + GradCAM + uncertainty + retrieval
10. HuggingFace: model card, all outputs, plots uploaded

### Potential Future Work (not started)

- Deploy to HuggingFace Spaces (permanent hosting)
- ONNX export for edge deployment
- Additional datasets (REFUGE, MESSIDOR, ADAM)
- Multi-scale ViT patches (8x8) for finer lesion detection
- Full LLM-based RAG report generation (Claude API)
- Lesion segmentation head (needs pixel-level annotations)

### HuggingFace Access

- Repo: `tanishq74/retinasense-vit`
- Token: stored in user's environment (do NOT hardcode)

### Technical Notes for Future Sessions

- Column name in CSVs is `label` (not `disease_label`)
- Validation file is `val_split.csv` (not `calib_split.csv`)
- Use `weights_only=False` when loading v3/v4 checkpoints with torch.load
- Grad clip = 5.0 across all training scripts
- Norm stats: mean=[0.4298, 0.2784, 0.1559], std=[0.2857, 0.2065, 0.1465]
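
Two of the notes above (the `label` column name and the normalization statistics) are easy to get wrong when writing a new script, so here is a minimal sketch of how they might be applied. It is an illustration, not the project's actual data loader. The helper names `normalize_image` and `read_labels`, the `image_path` column in the usage example, and the assumption that images arrive as uint8 arrays in (H, W, 3) RGB order are all hypothetical; only the mean/std values and the `label` column name come from this report.

```python
import csv
import io

import numpy as np

# Channel statistics copied from the technical notes (RGB order is an assumption).
NORM_MEAN = np.array([0.4298, 0.2784, 0.1559], dtype=np.float32)
NORM_STD = np.array([0.2857, 0.2065, 0.1465], dtype=np.float32)


def normalize_image(image_hwc):
    """Scale a uint8 (H, W, 3) image to [0, 1], standardize per channel,
    and return it in channels-first (3, H, W) layout."""
    if image_hwc.ndim != 3 or image_hwc.shape[-1] != 3:
        raise ValueError("expected an array of shape (H, W, 3)")
    scaled = image_hwc.astype(np.float32) / 255.0
    standardized = (scaled - NORM_MEAN) / NORM_STD  # broadcasts over the last axis
    return np.transpose(standardized, (2, 0, 1))


def read_labels(csv_file):
    """Read the `label` column from an open CSV file, failing early if the
    file uses a different column name such as `disease_label`."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or "label" not in reader.fieldnames:
        raise KeyError("expected a 'label' column (not 'disease_label')")
    return [row["label"] for row in reader]  # values are kept as strings
```

A quick usage example with made-up data:

```python
black = np.zeros((224, 224, 3), dtype=np.uint8)
tensor = normalize_image(black)          # shape (3, 224, 224)

demo_csv = io.StringIO("image_path,label\na.png,0\nb.png,3\n")
labels = read_labels(demo_csv)           # list of label strings
```

If the cached `.npy` files already hold float arrays in [0, 1], the division by 255 should be dropped; the report does not say which format the cache uses.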