--- license: mit library_name: pytorch pipeline_tag: depth-estimation tags: - depth-estimation - monocular-depth - knowledge-distillation - robotics - indoor-navigation - semantic-segmentation - efficientvit - bootstrap-perception - vortex-depth datasets: - sayakpaul/nyu_depth_v2 metrics: - rmse - mae - miou model-index: - name: vortex-depth-v5-general results: - task: type: depth-estimation name: Monocular Indoor Depth Estimation dataset: name: NYU Depth V2 (val) type: nyu_depth_v2 metrics: - type: rmse value: 0.572 name: NYU val RMSE (m) - type: mIoU value: 63.7 name: 6-class Segmentation mIoU (%) --- # Vortex-Depth-V5-General (Atlas) A 5.31 × 10⁶ parameter monocular depth + 6-class segmentation student model for general-purpose indoor depth estimation. The recommended deployable checkpoint of the Vortex-Depth lineage for unconstrained indoor scenes (apartments, kitchens, offices, mixed room geometries). | Property | Value | |---|---| | Codename | **Atlas** | | Lineage version | V5 | | Architecture | EfficientViT-B1 encoder + dual transposed-convolution decoder | | Parameters | 5.31 × 10⁶ | | Input | RGB, 240 × 320, ImageNet-normalized within forward pass | | Output | depth `[B, 1, 240, 320]` in meters; segmentation `[B, 6, 240, 320]` logits | | Training corpus | NYU Depth V2 with deployment-targeted augmentation pipeline | | Teacher | DA3-Metric-Large | | Loss | berHu (depth) + cross-entropy (segmentation) + edge-aware smoothness, Kendall-weighted | | Inference latency | ~5 ms on Jetson Orin Nano (TensorRT FP16) | ## Use case Recommended for general indoor depth estimation across diverse room geometries. This checkpoint is the lineage's most well-rounded model on standard indoor benchmarks: - NYU val RMSE: **0.572 m** - NYU val mIoU (6-class: floor, wall, person, furniture, glass, other): **63.7 %** For corridor-class environments specifically, the [vortex-depth-v9-corridor (Lighthouse)](https://huggingface.co/NishantPushparaju/vortex-depth-v9-corridor) checkpoint achieves 0.382 m corridor RMSE and is the recommended choice when the deployment domain is restricted to corridors. For users intending to fine-tune for additional domain specialists, the [vortex-depth-v6-pretrained (Cornerstone)](https://huggingface.co/NishantPushparaju/vortex-depth-v6-pretrained) checkpoint is the recommended initialization. ## Loading ```python import torch from models.student import build_student # from the Vortex codebase from config import Config cfg = Config() model = build_student(num_classes=cfg.NUM_CLASSES, pretrained=False, backbone=cfg.BACKBONE) state = torch.load("best_depth_v5.pt", map_location="cpu") model.load_state_dict(state) model.eval() # Inference with torch.no_grad(): depth, seg_logits = model(rgb_tensor) # rgb_tensor: [B, 3, 240, 320] ``` ## Training The configuration applies three augmentation operations to RGB inputs at training time, on top of the V4 baseline: - Horizontal flip (probability 0.5) - ColorJitter: brightness ± 0.2, contrast ± 0.2, saturation ± 0.2, hue ± 0.1 - Random crop or bilinear resize to 240 × 320 Training schedule: AdamW optimizer with encoder LR 3 × 10⁻⁵ and decoder LR 3 × 10⁻⁴ (10 × encoder LR), cosine annealing over 200 epochs, batch size 16. Encoder frozen for the first 5 epochs. Training was performed on NVIDIA L40S 48 GB hardware (NYU Greene HPC, partition `l40s_public`), HPC job 3070058. ## Bootstrap perception context This checkpoint is one component of a three-checkpoint family released as part of the Vortex bootstrap-perception pipeline for indoor robot navigation under hardware depth failure. The pipeline addresses the operational reality that Time-of-Flight depth sensors lose ~78 % of their pixels on reflective indoor surfaces (polished floors, glass walls). The student model fills the dead pixels with consistent learned geometry; runtime fusion combines surviving sensor pixels with the student output. The deployment pipeline applies confidence-gated fusion: where the ToF confidence map exceeds 0.5 and depth lies in [0.05, 10.0] m, the sensor reading is used directly; elsewhere the student depth (median-scale aligned to surviving pixels per frame) is used. ## Project resources - **Codebase**: [github.com/Nishant-ZFYII/ml_inference](https://github.com/Nishant-ZFYII/ml_inference) - **Documentation**: [nishant-zfyii.github.io/ml_inference](https://nishant-zfyii.github.io/ml_inference/) - **V5 model page**: [Atlas (V5)](https://nishant-zfyii.github.io/ml_inference/models/v5-deployment-aug) ## Reference If you use this model in your work, please reference the project repository: ``` https://github.com/Nishant-ZFYII/ml_inference ``` ## License MIT.