---
license: mit
library_name: pytorch
pipeline_tag: depth-estimation
tags:
- depth-estimation
- monocular-depth
- knowledge-distillation
- robotics
- indoor-navigation
- semantic-segmentation
- efficientvit
- bootstrap-perception
- vortex-depth
datasets:
- sayakpaul/nyu_depth_v2
metrics:
- rmse
- mae
- miou
model-index:
- name: vortex-depth-v5-general
  results:
  - task:
      type: depth-estimation
      name: Monocular Indoor Depth Estimation
    dataset:
      name: NYU Depth V2 (val)
      type: nyu_depth_v2
    metrics:
    - type: rmse
      value: 0.572
      name: NYU val RMSE (m)
    - type: mIoU
      value: 63.7
      name: 6-class Segmentation mIoU (%)
---
# Vortex-Depth-V5-General (Atlas)
A 5.31 × 10⁶ parameter student model that jointly predicts monocular depth and 6-class segmentation for general-purpose indoor scenes. It is the recommended deployable checkpoint of the Vortex-Depth lineage for unconstrained indoor environments (apartments, kitchens, offices, mixed room geometries).
| Property | Value |
|---|---|
| Codename | **Atlas** |
| Lineage version | V5 |
| Architecture | EfficientViT-B1 encoder + dual transposed-convolution decoder |
| Parameters | 5.31 × 10⁶ |
| Input | RGB, 240 × 320, ImageNet-normalized within forward pass |
| Output | depth `[B, 1, 240, 320]` in meters; segmentation `[B, 6, 240, 320]` logits |
| Training corpus | NYU Depth V2 with deployment-targeted augmentation pipeline |
| Teacher | DA3-Metric-Large |
| Loss | berHu (depth) + cross-entropy (segmentation) + edge-aware smoothness, Kendall-weighted |
| Inference latency | ~5 ms on Jetson Orin Nano (TensorRT FP16) |
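The loss row names berHu and Kendall weighting without giving their form. The following is a minimal NumPy sketch of both, not the project's training code. It assumes the common berHu threshold of 20 % of the largest absolute error in the batch and the simplified uncertainty weighting `exp(-s) * L + s`, where `s` is a learnable log-variance per task; the function names `berhu_loss` and `kendall_total` are hypothetical.

```python
import numpy as np

def berhu_loss(pred, target, valid=None, c_frac=0.2):
    """Reverse Huber loss: L1 below the threshold c, scaled L2 above it."""
    err = np.abs(pred - target)
    if valid is not None:
        err = err[valid]              # keep only pixels with ground truth
    if err.size == 0:
        return 0.0
    c = c_frac * err.max()            # assumed threshold: 20 % of the max error
    if c == 0:
        return 0.0                    # perfect prediction
    quadratic = (err ** 2 + c ** 2) / (2 * c)
    return float(np.where(err <= c, err, quadratic).mean())

def kendall_total(losses, log_vars):
    """Combine task losses as sum(exp(-s_i) * L_i + s_i) (assumed form)."""
    losses = np.asarray(losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-log_vars) * losses + log_vars))
```

In training the log-variances would be optimized alongside the network weights; here they are plain numbers so the arithmetic is easy to check.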
## Use case
Recommended for general indoor depth estimation across diverse room geometries. This checkpoint is the lineage's most well-rounded model on standard indoor benchmarks:
- NYU val RMSE: **0.572 m**
- NYU val mIoU (6-class: floor, wall, person, furniture, glass, other): **63.7 %**
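For readers reproducing the depth numbers, the sketch below shows one way RMSE and MAE in meters can be computed over valid pixels. It assumes that pixels with non-positive ground truth are invalid and that errors are pooled over all valid pixels; the project's evaluation script may mask or average differently, and `depth_errors` is a hypothetical name.

```python
import numpy as np

def depth_errors(pred, gt, min_depth=1e-3):
    """Return (RMSE, MAE) in meters over pixels with valid ground truth."""
    valid = gt > min_depth            # assumed validity rule: positive depth
    diff = pred[valid] - gt[valid]
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    mae = float(np.mean(np.abs(diff)))
    return rmse, mae
```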
For corridor-class environments specifically, the [vortex-depth-v9-corridor (Lighthouse)](https://huggingface.co/NishantPushparaju/vortex-depth-v9-corridor) checkpoint achieves 0.382 m corridor RMSE and is the recommended choice when the deployment domain is restricted to corridors.
For users intending to fine-tune for additional domain specialists, the [vortex-depth-v6-pretrained (Cornerstone)](https://huggingface.co/NishantPushparaju/vortex-depth-v6-pretrained) checkpoint is the recommended initialization.
## Loading
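```python
import torch

from models.student import build_student  # from the Vortex codebase
from config import Config

cfg = Config()
model = build_student(num_classes=cfg.NUM_CLASSES, pretrained=False, backbone=cfg.BACKBONE)

# Load the released weights on CPU; move the model to a GPU afterwards if needed.
state = torch.load("best_depth_v5.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()

# Placeholder batch; replace with real images (the [0, 1] value range is an assumption).
# ImageNet normalization happens inside the forward pass, so do not normalize here.
rgb_tensor = torch.rand(1, 3, 240, 320)

# Inference
with torch.no_grad():
    depth, seg_logits = model(rgb_tensor)  # depth: [B, 1, 240, 320], seg_logits: [B, 6, 240, 320]
```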
## Training
The configuration applies three augmentation operations to RGB inputs at training time, on top of the V4 baseline:
- Horizontal flip (probability 0.5)
- ColorJitter: brightness ± 0.2, contrast ± 0.2, saturation ± 0.2, hue ± 0.1
- Random crop or bilinear resize to 240 × 320
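The list above can be illustrated with a small NumPy sketch. It is not the project's augmentation pipeline: it assumes `(H, W, 3)` float images in [0, 1] with `(H, W)` depth and label maps, covers only the flip and the brightness part of the jitter, and uses hypothetical function names. A geometric operation such as the flip has to be applied to the targets as well so that pixels stay aligned, while photometric jitter touches the RGB image only.

```python
import numpy as np

def paired_hflip(rgb, depth, seg, rng, p=0.5):
    """Flip an (H, W, 3) image and its (H, W) depth/label maps together."""
    if rng.random() < p:
        # Reverse the width axis on all three arrays so pixels stay aligned.
        rgb, depth, seg = rgb[:, ::-1], depth[:, ::-1], seg[:, ::-1]
    return rgb, depth, seg

def jitter_brightness(rgb, rng, max_delta=0.2):
    """Scale a float RGB image in [0, 1] by a factor drawn from [1 - d, 1 + d]."""
    factor = rng.uniform(1.0 - max_delta, 1.0 + max_delta)
    return np.clip(rgb * factor, 0.0, 1.0)

rng = np.random.default_rng(0)
rgb = np.full((240, 320, 3), 0.5)
depth = np.ones((240, 320))
seg = np.zeros((240, 320), dtype=np.int64)
rgb, depth, seg = paired_hflip(rgb, depth, seg, rng)
rgb = jitter_brightness(rgb, rng)
```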
Training schedule: AdamW optimizer with encoder LR 3 × 10⁻⁵ and decoder LR 3 × 10⁻⁴ (10 × encoder LR), cosine annealing over 200 epochs, batch size 16. The encoder is frozen for the first 5 epochs.
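The schedule can be made concrete with a short, dependency-free sketch of the per-epoch learning rates. It assumes annealing down to zero, no warm-up, and a schedule clock that starts at epoch 0 and keeps running during the frozen phase; none of these details are stated in this card, and `learning_rates` is a hypothetical helper rather than part of the codebase.

```python
import math

def learning_rates(epoch, total_epochs=200, enc_lr=3e-5, dec_lr=3e-4, freeze_epochs=5):
    """Return (encoder_lr, decoder_lr) for a given epoch under cosine annealing."""
    # Cosine factor goes from 1 at epoch 0 to 0 at total_epochs.
    factor = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    # A frozen encoder receives no updates, modeled here as a zero learning rate.
    enc = 0.0 if epoch < freeze_epochs else enc_lr * factor
    return enc, dec_lr * factor

for epoch in (0, 5, 100, 199):
    enc, dec = learning_rates(epoch)
    print(f"epoch {epoch:3d}: encoder {enc:.2e}, decoder {dec:.2e}")
```

With PyTorch the same effect is usually obtained with two optimizer parameter groups and a cosine scheduler, with the encoder's `requires_grad` flags switched on after the frozen phase.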
Training was performed on NVIDIA L40S 48 GB hardware (NYU Greene HPC, partition `l40s_public`), HPC job 3070058.
## Bootstrap perception context
This checkpoint is one component of a three-checkpoint family released as part of the Vortex bootstrap-perception pipeline for indoor robot navigation under hardware depth failure. The pipeline addresses the operational reality that Time-of-Flight depth sensors lose ~78 % of their pixels on reflective indoor surfaces (polished floors, glass walls). The student model fills the dead pixels with consistent learned geometry; runtime fusion combines surviving sensor pixels with the student output.
The deployment pipeline applies confidence-gated fusion: where the ToF confidence map exceeds 0.5 and depth lies in [0.05, 10.0] m, the sensor reading is used directly; elsewhere the student depth (median-scale aligned to surviving pixels per frame) is used.
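The fusion rule described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the deployment code: the per-frame scale is taken as the ratio of medians over trusted pixels (a median of per-pixel ratios is an equally plausible reading), the scale falls back to 1.0 when no sensor pixel survives, and `fuse_depth` is a hypothetical name.

```python
import numpy as np

def fuse_depth(tof_depth, tof_conf, student_depth,
               conf_thresh=0.5, d_min=0.05, d_max=10.0):
    """Confidence-gated fusion of ToF and student depth maps (all (H, W), meters)."""
    # Sensor pixels are trusted when confidence and range checks both pass.
    valid = (tof_conf > conf_thresh) & (tof_depth >= d_min) & (tof_depth <= d_max)
    # Align the student output to the surviving sensor pixels of this frame.
    ref = valid & (student_depth > 0)
    if ref.any():
        scale = np.median(tof_depth[ref]) / np.median(student_depth[ref])
    else:
        scale = 1.0                   # assumed fallback: no surviving pixels
    fused = np.where(valid, tof_depth, scale * student_depth)
    return fused, valid
```

Returning the `valid` mask alongside the fused map lets downstream code tell measured pixels from predicted ones.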
## Project resources
- **Codebase**: [github.com/Nishant-ZFYII/ml_inference](https://github.com/Nishant-ZFYII/ml_inference)
- **Documentation**: [nishant-zfyii.github.io/ml_inference](https://nishant-zfyii.github.io/ml_inference/)
- **V5 model page**: [Atlas (V5)](https://nishant-zfyii.github.io/ml_inference/models/v5-deployment-aug)
## Reference
If you use this model in your work, please reference the project repository:
```
https://github.com/Nishant-ZFYII/ml_inference
```
## License
MIT.