metadata
license: mit
library_name: pytorch
pipeline_tag: depth-estimation
tags:
  - depth-estimation
  - monocular-depth
  - knowledge-distillation
  - robotics
  - indoor-navigation
  - semantic-segmentation
  - efficientvit
  - bootstrap-perception
  - vortex-depth
datasets:
  - sayakpaul/nyu_depth_v2
metrics:
  - rmse
  - mae
  - miou
model-index:
  - name: vortex-depth-v5-general
    results:
      - task:
          type: depth-estimation
          name: Monocular Indoor Depth Estimation
        dataset:
          name: NYU Depth V2 (val)
          type: nyu_depth_v2
        metrics:
          - type: rmse
            value: 0.572
            name: NYU val RMSE (m)
          - type: miou
            value: 63.7
            name: 6-class Segmentation mIoU (%)

Vortex-Depth-V5-General (Atlas)

A 5.31 × 10⁶-parameter student model that predicts monocular depth and 6-class segmentation for general-purpose indoor scenes. It is the recommended deployable checkpoint of the Vortex-Depth lineage for unconstrained indoor scenes (apartments, kitchens, offices, mixed room geometries).

| Property | Value |
|---|---|
| Codename | Atlas |
| Lineage version | V5 |
| Architecture | EfficientViT-B1 encoder + dual transposed-convolution decoder |
| Parameters | 5.31 × 10⁶ |
| Input | RGB, 240 × 320, ImageNet-normalized within forward pass |
| Output | depth [B, 1, 240, 320] in meters; segmentation [B, 6, 240, 320] logits |
| Training corpus | NYU Depth V2 with deployment-targeted augmentation pipeline |
| Teacher | DA3-Metric-Large |
| Loss | berHu (depth) + cross-entropy (segmentation) + edge-aware smoothness, Kendall-weighted |
| Inference latency | ~5 ms on Jetson Orin Nano (TensorRT FP16) |
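The loss entry above names a berHu depth term and Kendall-style weighting of the task losses. The following is a minimal NumPy sketch of those two pieces, not the project's implementation: the threshold c = 0.2 · max|error|, the log-variance form of the weighting, the placeholder values for the other two loss terms, and the function names `berhu` and `kendall_weighted` are all assumptions made for illustration.

```python
import numpy as np

def berhu(pred, target, c_frac=0.2):
    """Reverse Huber (berHu): L1 below threshold c, scaled L2 above it.

    c = c_frac * max|error| is a common choice and an assumption here.
    """
    err = np.abs(pred - target)
    c = max(c_frac * float(err.max()), 1e-6)  # guard against c == 0 when pred == target
    per_pixel = np.where(err <= c, err, (err ** 2 + c ** 2) / (2.0 * c))
    return float(per_pixel.mean())

def kendall_weighted(losses, log_vars):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i, with s_i = log(sigma_i^2).

    In training the s_i would be learnable parameters; here they are plain floats.
    """
    return float(sum(np.exp(-s) * l + s for l, s in zip(losses, log_vars)))

pred = np.array([1.0, 2.0, 3.0])
target = np.array([1.1, 2.0, 5.0])
depth_loss = berhu(pred, target)

# Hypothetical values standing in for the segmentation and smoothness terms.
total = kendall_weighted([depth_loss, 0.7, 0.05], [0.0, 0.0, 0.0])
```

With all log-variances at zero the combination reduces to a plain sum, which is a convenient sanity check before letting the weights train.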

Use case

This checkpoint is recommended for general indoor depth estimation across diverse room geometries. It is the most well-rounded model in the lineage on standard indoor benchmarks:

  • NYU val RMSE: 0.572 m
  • NYU val mIoU (6-class: floor, wall, person, furniture, glass, other): 63.7 %
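For readers reproducing the two metrics above, here is a small NumPy sketch of RMSE over valid pixels and mean IoU over the six classes. It is an illustration under stated assumptions: zero is treated as missing ground-truth depth, mIoU is computed on a single label map rather than accumulated over a dataset, and the class index order simply follows the list above. The project's evaluation code may differ on all three points.

```python
import numpy as np

CLASSES = ["floor", "wall", "person", "furniture", "glass", "other"]

def rmse(pred, gt, valid=None):
    """Root-mean-square error in meters over valid ground-truth pixels."""
    if valid is None:
        valid = gt > 0  # assumption: zero marks missing ground truth
    diff = pred[valid] - gt[valid]
    return float(np.sqrt(np.mean(diff ** 2)))

def mean_iou(pred_labels, gt_labels, num_classes=len(CLASSES)):
    """Mean intersection-over-union, skipping classes absent from both maps."""
    ious = []
    for k in range(num_classes):
        p, g = pred_labels == k, gt_labels == k
        union = np.logical_or(p, g).sum()
        if union > 0:
            ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

# Toy 2 x 2 example
gt_depth = np.array([[1.0, 2.0], [0.0, 4.0]])    # 0.0 = no ground truth
pred_depth = np.array([[1.0, 2.5], [3.0, 3.5]])
gt_seg = np.array([[0, 0], [1, 1]])
pred_seg = np.array([[0, 1], [1, 1]])            # e.g. argmax over the class axis of the logits

depth_rmse = rmse(pred_depth, gt_depth)
seg_miou = mean_iou(pred_seg, gt_seg)
```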

For corridor-class environments specifically, the vortex-depth-v9-corridor (Lighthouse) checkpoint achieves 0.382 m corridor RMSE and is the recommended choice when the deployment domain is restricted to corridors.

To fine-tune additional domain specialists, start from the vortex-depth-v6-pretrained (Cornerstone) checkpoint, which is the recommended initialization.

Loading

import torch
from models.student import build_student   # from the Vortex codebase
from config import Config

cfg = Config()
model = build_student(num_classes=cfg.NUM_CLASSES, pretrained=False, backbone=cfg.BACKBONE)
state = torch.load("best_depth_v5.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()

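# Inference
# rgb_tensor: float RGB batch of shape [B, 3, 240, 320]. A random tensor stands in for a real image here;
# the [0, 1] value range is an assumption (normalization happens inside the forward pass).
rgb_tensor = torch.rand(1, 3, 240, 320)
with torch.no_grad():
    depth, seg_logits = model(rgb_tensor)  # depth: [B, 1, 240, 320] in meters; seg_logits: [B, 6, 240, 320]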

Training

The configuration applies three augmentation operations to RGB inputs at training time, on top of the V4 baseline:

  • Horizontal flip (probability 0.5)
  • ColorJitter: brightness ± 0.2, contrast ± 0.2, saturation ± 0.2, hue ± 0.1
  • Random crop or bilinear resize to 240 × 320
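The augmentations listed above can be sketched in NumPy as follows. This is a simplified illustration, not the project's pipeline: it assumes the flip is mirrored on the depth and segmentation targets so pixels stay aligned, it covers only the brightness and contrast parts of the color jitter (saturation and hue are omitted), and it uses the global image mean for contrast rather than a grayscale mean. The function name `augment` is hypothetical.

```python
import numpy as np

def augment(rgb, depth, seg, rng):
    """rgb: [H, W, 3] float in [0, 1]; depth: [H, W]; seg: [H, W] integer labels."""
    # Horizontal flip with probability 0.5, applied to inputs and targets together.
    if rng.random() < 0.5:
        rgb, depth, seg = rgb[:, ::-1], depth[:, ::-1], seg[:, ::-1]
    # Brightness and contrast factors drawn from [0.8, 1.2], applied to RGB only.
    brightness = rng.uniform(0.8, 1.2)
    contrast = rng.uniform(0.8, 1.2)
    rgb = rgb * brightness
    mean = rgb.mean()
    rgb = (rgb - mean) * contrast + mean
    return np.clip(rgb, 0.0, 1.0), depth.copy(), seg.copy()

rng = np.random.default_rng(0)
rgb = rng.random((240, 320, 3))
depth = rng.uniform(0.5, 5.0, (240, 320))
seg = rng.integers(0, 6, (240, 320))
rgb_aug, depth_aug, seg_aug = augment(rgb, depth, seg, rng)
```

Photometric changes touch only the RGB input, while geometric changes must be applied identically to every target; keeping both in one function makes that pairing hard to get wrong.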

Training schedule: AdamW optimizer with encoder LR 3 × 10⁻⁵ and decoder LR 3 × 10⁻⁴ (10 × encoder LR), cosine annealing over 200 epochs, batch size 16. The encoder is frozen for the first 5 epochs.
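To make the schedule concrete, the sketch below computes the per-epoch learning rates for both parameter groups using only the standard library. Several details are assumptions rather than facts from the training code: the schedule is stepped once per epoch, the minimum learning rate is 0, there is no warmup, the cosine clock keeps running while the encoder is frozen, and a frozen encoder is modeled as a learning rate of 0.

```python
import math

EPOCHS = 200
ENCODER_LR = 3e-5
DECODER_LR = 3e-4    # 10 x encoder LR
FREEZE_EPOCHS = 5

def cosine_lr(base_lr, epoch, total=EPOCHS, eta_min=0.0):
    """Cosine annealing: base_lr at epoch 0, eta_min at epoch == total."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / total))

def learning_rates(epoch):
    """Return (encoder_lr, decoder_lr); a frozen encoder is modeled as LR 0."""
    enc = 0.0 if epoch < FREEZE_EPOCHS else cosine_lr(ENCODER_LR, epoch)
    return enc, cosine_lr(DECODER_LR, epoch)

for epoch in (0, 5, 100, 199):
    enc, dec = learning_rates(epoch)
    print(f"epoch {epoch:3d}: encoder {enc:.2e}, decoder {dec:.2e}")
```

In PyTorch the same effect is usually obtained with two optimizer parameter groups and `torch.optim.lr_scheduler.CosineAnnealingLR`, with the encoder's `requires_grad` toggled off during the freeze.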

Training was performed on NVIDIA L40S 48 GB hardware (NYU Greene HPC, partition l40s_public), HPC job 3070058.

Bootstrap perception context

This checkpoint is one component of a three-checkpoint family released as part of the Vortex bootstrap-perception pipeline for indoor robot navigation under hardware depth failure. The pipeline addresses the operational reality that Time-of-Flight depth sensors lose ~78 % of their pixels on reflective indoor surfaces (polished floors, glass walls). The student model fills the dead pixels with consistent learned geometry; runtime fusion combines surviving sensor pixels with the student output.

The deployment pipeline applies confidence-gated fusion: where the ToF confidence map exceeds 0.5 and depth lies in [0.05, 10.0] m, the sensor reading is used directly; elsewhere the student depth (median-scale aligned to surviving pixels per frame) is used.
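The fusion rule described above can be sketched in a few lines of NumPy. The thresholds come from the text; everything else is an assumption for illustration: the scale is taken as a ratio of medians over the surviving pixels (a median of per-pixel ratios would be another reasonable reading), the scale falls back to 1.0 when no sensor pixel survives, and the function name `fuse` is hypothetical.

```python
import numpy as np

def fuse(tof_depth, tof_conf, student_depth,
         conf_thresh=0.5, d_min=0.05, d_max=10.0):
    """Confidence-gated fusion of ToF depth with median-scale-aligned student depth."""
    # Sensor pixels that survive the gate: confident and inside the valid range.
    valid = (tof_conf > conf_thresh) & (tof_depth >= d_min) & (tof_depth <= d_max)
    # Per-frame median-scale alignment against the surviving sensor pixels.
    scale = 1.0
    if valid.any():
        student_med = np.median(student_depth[valid])
        if student_med > 0:
            scale = np.median(tof_depth[valid]) / student_med
    return np.where(valid, tof_depth, scale * student_depth), valid

tof = np.array([[2.0, 0.0], [4.0, 12.0]])       # 0.0 = dropout, 12.0 = out of range
conf = np.array([[0.9, 0.1], [0.8, 0.9]])
student = np.array([[1.0, 1.5], [2.0, 3.0]])    # consistently half the metric scale
fused, valid = fuse(tof, conf, student)
```

In this toy frame the two surviving pixels give a scale of 2, so the dead pixels are filled with the student values doubled while the surviving sensor readings pass through unchanged.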

Reference

If you use this model in your work, please reference the project repository:

https://github.com/Nishant-ZFYII/ml_inference

License

MIT.