---
license: apache-2.0
library_name: pytorch
pipeline_tag: image-to-image
tags:
- panoramic vision
- multi-task-learning
- computer-vision
- scene-understanding
- 360vision
base_model:
- facebook/dinov3-vitl16-pretrain-lvd1689m
---

# 🌐 MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors


## 📜 Introduction

MTPano is a robust multi-task panoramic foundation model designed to overcome the limitations of geometric distortions and the scarcity of high-resolution annotations in 360° vision. By leveraging powerful perspective dense priors, MTPano establishes a unified representation for spherical scene understanding.

### Key Contributions

- **Label-Free Training Pipeline:** Circumvents data scarcity by projecting panoramas into distortion-free perspective patches, generating high-quality pseudo-labels using foundation models (InternImage-H, MoGe-2), and re-projecting them for patch-wise supervision.
- **Panoramic Dual BridgeNet (PD-BridgeNet):** A dual-stream architecture that disentangles **rotation-invariant** features (Semantic Segmentation, Depth) from **rotation-variant** features (Surface Normals).
- **ERP Token Mixer:** A latitude-adaptive mechanism that handles Equirectangular Projection (ERP) distortion by dynamically adjusting kernels based on pixel stretching.
- **Truncated Gradient Flow:** Facilitates synergistic cross-task interaction while strictly blocking conflicting gradients between feature branches to avoid negative transfer.

## 📊 Training Data

MTPano is trained on a large-scale composite dataset of over 408k images, combining real-world captures with high-fidelity synthetic scenes.

### Model Versions Comparison

| Dataset | 140k Weights | 408k Weights |
| :--- | :---: | :---: |
| [Structured3D](https://github.com/bertjiazheng/Structured3D/) | 16.6k | 16.6k |
| Sun360 | 34.3k | 34.3k |
| [Matterport3D](https://github.com/niessner/Matterport/) | 7.9k | 7.9k |
| [DiT360](https://github.com/Insta360-Research-Team/DiT360) (Synthetic) | 82k | 182k |
| [Hunyuan](https://github.com/Tencent-Hunyuan/HunyuanWorld-1.0) (Synthetic) | - | 100k |
| [ZInD](https://github.com/zillow/zind) | - | 67.4k |
| **Total Images** | **140k** | **408k** |

## 🚀 Performance

MTPano achieves state-of-the-art performance across all tasks on both synthetic and real-world benchmarks, consistently outperforming previous single-task specialists and multi-task models.

### Structured3D (Synthetic Benchmark)

| Task | Metric | MTPano |
| :--- | :--- | :---: |
| **Semantic Segmentation** | mIoU (↑) | **75.66** |
| **Depth Estimation** | AbsRel (↓) | **0.0248** |
| | RMSE (↓) | **0.0968** |
| | $\delta_1$ (↑) | **99.27** |
| **Surface Normal** | Mean Error (↓) | **3.85°** |
| | Median Error (↓) | **0.01°** |
| | $<11.5^\circ$ (↑) | **91.66** |

### Stanford2D3D (Real-World Benchmark)

| Task | Metric | MTPano (Ours) |
| :--- | :--- | :---: |
| **Semantic Segmentation** | mIoU (↑) | **69.47** |
| **Depth Estimation** | AbsRel (↓) | **0.0675** |
| | RMSE (↓) | **0.4317** |
| | $\delta_1$ (↑) | **96.86** |
| **Surface Normal** | Mean Error (↓) | **9.71°** |
| | Median Error (↓) | **0.93°** |
| | $<11.5^\circ$ (↑) | **80.65** |

*Note: On the real-world Stanford2D3D dataset, MTPano (a multi-task model) achieves performance highly competitive with single-task specialist foundation models while maintaining superior cross-task consistency.*

## 🛠️ Implementation

Please refer to [https://github.com/Evergreen0929/MTPano](https://github.com/Evergreen0929/MTPano) for detailed implementations.

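For readers who want to sanity-check numbers like those in the Performance tables, the depth and surface-normal metrics can be sketched in plain Python as below. This is a minimal illustration that assumes the conventional definitions of AbsRel, RMSE, $\delta_1$ (threshold 1.25), and per-pixel angular error. The function names `depth_metrics` and `normal_metrics` and the `gt > 0` validity rule are hypothetical choices made for this sketch, not the API of the MTPano repository, whose evaluation code may differ in masking or alignment.

```python
import math
import statistics


def depth_metrics(pred, gt):
    """Return (AbsRel, RMSE, delta_1 in percent) over pixels with gt > 0.

    `pred` and `gt` are flat sequences of depth values. Treating gt > 0 as
    the validity mask is an assumption made for this sketch.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # delta_1: share of pixels whose ratio max(p/g, g/p) is below 1.25.
    # Non-positive predictions are counted as failures.
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < 1.25)
    return abs_rel, rmse, 100.0 * hits / n


def normal_metrics(pred, gt, threshold_deg=11.5):
    """Return (mean error, median error, percent below threshold) in degrees.

    `pred` and `gt` are sequences of 3-vectors; each is normalized before
    the angle is computed. The default threshold mirrors the tables above.
    """
    errors = []
    for a, b in zip(pred, gt):
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        cos = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
        cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
        errors.append(math.degrees(math.acos(cos)))
    below = sum(1 for e in errors if e < threshold_deg)
    return (
        statistics.mean(errors),
        statistics.median(errors),
        100.0 * below / len(errors),
    )


if __name__ == "__main__":
    print(depth_metrics([1.0, 2.0], [1.0, 4.0]))
    print(normal_metrics([(0, 0, 1), (1, 0, 0)], [(0, 0, 1), (0, 1, 0)]))
```

In practice these would be computed over full-resolution panoramas with array libraries; the list-based version here only keeps the arithmetic explicit.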
## 🎓 Citation

```bibtex
@article{zhang2026mtpano,
  title={MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors},
  author={Zhang, Jingdong and Zhan, Xiaohang and Zhang, Lingzhi and Wang, Yizhou and Yu, Zhengming and Wang, Jionghao and Wang, Wenping and Li, Xin},
  journal={arXiv preprint},
  year={2026}
}
```

## 👏 Acknowledgement

This work is supported by researchers from **Texas A&M University** and **Adobe**. We thank the creators of DINOv3, InternImage, and MoGe for their foundational contributions to the field.