---
license: cc-by-nc-sa-4.0
tags:
- semantic-segmentation
- panoramic
- spherical-images
- modality-fusion
- sam
library_name: panosamic
pipeline_tag: image-segmentation
datasets:
- stanford2d3ds
- matterport3d
---

# PanoSAMic

PanoSAMic is a multi-modal semantic segmentation model for panoramic (360°) images. It integrates the **frozen** Segment Anything Model (SAM) encoder, modified to output multi-stage features, with a spatio-modal fusion module (MCBAM), a spherical-attention semantic decoder, and dual-view fusion to handle the distortion and edge discontinuity of equirectangular images.

- **Paper:** PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion (ICPR 2026)
- **Code:** https://github.com/dfki-av/PanoSAMic
- **arXiv:** https://arxiv.org/abs/2601.07447
- **Authors:** Mahdi Chamseddine, Didier Stricker, Jason Rambach (DFKI / RPTU Kaiserslautern-Landau)

## What is in this repository

Only the **trainable** PanoSAMic components are hosted here:

- **Feature fusion blocks (MCBAM):** spatio-modal cross-attention applied to the branch features extracted by the frozen encoder
- **Semantic decoder:** convolutional decoder with spherical attention and dual-view fusion head

The full model state dict has two parts:

| Module prefix | Trainable | In Hub checkpoint |
|---|---|---|
| `feature_fuser.*` | ✅ yes | ✅ yes |
| `semantic_decoder.*` | ✅ yes | ✅ yes |
| `image_encoder.*` | ❌ frozen (SAM ViT) | ❌ no |
| `prompt_encoder.*` | ❌ frozen (SAM) | ❌ no |
| `mask_decoder.*` | ❌ frozen (SAM) | ❌ no |

The **frozen SAM ViT backbone is NOT hosted here.** It is downloaded separately from Meta's official release (Apache-2.0) and combined at load time. This keeps each checkpoint small and avoids redistributing the SAM weights.

## Available checkpoints

Each variant lives in its own subfolder of `dfki-av/PanoSAMic` (e.g. `stanford2d3ds-vith-rgbdn-fold1/model.safetensors`).
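
The subfolder names appear to follow a `dataset-backbone-modalities[-foldN]` pattern. As a minimal sketch under that assumption, the helpers below build the in-repo path of a checkpoint file. The function names and the modality-to-tag mapping are hypothetical and not part of the `panosamic` package, and the sketch assumes every subfolder stores its weights as `model.safetensors`, which this card only shows for one example. `download_checkpoint` relies on `hf_hub_download` from `huggingface_hub` and needs network access.

```python
from typing import Optional, Sequence

# Assumed mapping from modality tuples to the tags used in checkpoint names.
MODALITY_TAGS = {
    ("image",): "rgb",
    ("image", "depth"): "rgbd",
    ("image", "depth", "normals"): "rgbdn",
}


def checkpoint_subfolder(
    dataset: str,
    vit_model: str,
    modalities: Sequence[str],
    fold: Optional[int] = None,
) -> str:
    """Hypothetical helper: build a name like 'stanford2d3ds-vith-rgbdn-fold1'."""
    backbone = vit_model.replace("_", "")  # "vit_h" -> "vith"
    name = f"{dataset}-{backbone}-{MODALITY_TAGS[tuple(modalities)]}"
    # Matterport3D checkpoints carry no fold suffix, so fold is optional.
    return name if fold is None else f"{name}-fold{fold}"


def checkpoint_filename(subfolder: str) -> str:
    """Path of the weights file inside the Hub repository (assumed layout)."""
    return f"{subfolder}/model.safetensors"


def download_checkpoint(subfolder: str, repo_id: str = "dfki-av/PanoSAMic") -> str:
    """Fetch the trainable weights only; returns the local cache path."""
    # Imported lazily so the naming helpers work without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=checkpoint_filename(subfolder))


if __name__ == "__main__":
    sub = checkpoint_subfolder(
        "stanford2d3ds", "vit_h", ("image", "depth", "normals"), fold=1
    )
    print(checkpoint_filename(sub))
```

For normal use, the `from_pretrained_panosamic` loader described under "Load a checkpoint" takes the subfolder name directly, so a manual download like this is only needed to inspect the weights file itself.
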
3-fold checkpoints are published per fold so each can be evaluated on its held-out split.

| Checkpoint | Backbone | Modalities | Dataset | Split |
|---|---|---|---|---|
| `stanford2d3ds-vith-rgb-fold1` | ViT-H | RGB | Stanford2D3DS | Fold 1 |
| `stanford2d3ds-vith-rgb-fold2` | ViT-H | RGB | Stanford2D3DS | Fold 2 |
| `stanford2d3ds-vith-rgb-fold3` | ViT-H | RGB | Stanford2D3DS | Fold 3 |
| `stanford2d3ds-vith-rgbd-fold1` | ViT-H | RGB-D | Stanford2D3DS | Fold 1 |
| `stanford2d3ds-vith-rgbd-fold2` | ViT-H | RGB-D | Stanford2D3DS | Fold 2 |
| `stanford2d3ds-vith-rgbd-fold3` | ViT-H | RGB-D | Stanford2D3DS | Fold 3 |
| `stanford2d3ds-vith-rgbdn-fold1` | ViT-H | RGB-D-N | Stanford2D3DS | Fold 1 |
| `stanford2d3ds-vith-rgbdn-fold2` | ViT-H | RGB-D-N | Stanford2D3DS | Fold 2 |
| `stanford2d3ds-vith-rgbdn-fold3` | ViT-H | RGB-D-N | Stanford2D3DS | Fold 3 |
| `stanford2d3ds-vitl-rgbdn-fold1` | ViT-L | RGB-D-N | Stanford2D3DS | Fold 1 |
| `stanford2d3ds-vitl-rgbdn-fold2` | ViT-L | RGB-D-N | Stanford2D3DS | Fold 2 |
| `stanford2d3ds-vitl-rgbdn-fold3` | ViT-L | RGB-D-N | Stanford2D3DS | Fold 3 |
| `stanford2d3ds-vitb-rgbdn-fold1` | ViT-B | RGB-D-N | Stanford2D3DS | Fold 1 |
| `stanford2d3ds-vitb-rgbdn-fold2` | ViT-B | RGB-D-N | Stanford2D3DS | Fold 2 |
| `stanford2d3ds-vitb-rgbdn-fold3` | ViT-B | RGB-D-N | Stanford2D3DS | Fold 3 |
| `matterport3d-vith-rgb` | ViT-H | RGB | Matterport3D | BEV360 |
| `matterport3d-vith-rgbd` | ViT-H | RGB-D | Matterport3D | BEV360 |

## Reported results

**Stanford2D3DS (3-fold validation), main table:**

| Checkpoint | mIoU % | mAcc % | Trainable params (M) |
|---|---|---|---|
| `stanford2d3ds-vith-rgb` | 59.62 | 74.11 | 178 |
| `stanford2d3ds-vith-rgbd` | 60.90 | 73.95 | 184 |
| `stanford2d3ds-vith-rgbdn` | 61.57 | 74.04 | 191 |

**Encoder-size study (Stanford2D3DS, 3-fold, RGB-D-N):**

| Checkpoint | mIoU % | mAcc % |
|---|---|---|
| `stanford2d3ds-vitb-rgbdn` | 56.68 | 70.49 |
| `stanford2d3ds-vitl-rgbdn` | 60.90 | 73.09 |
| `stanford2d3ds-vith-rgbdn` | 61.57 | 74.04 |

**Matterport3D (BEV360 splits):**

| Checkpoint | mIoU % |
|---|---|
| `matterport3d-vith-rgb` | 46.59 |
| `matterport3d-vith-rgbd` | 48.43 |

## How to reproduce

### 1. Environment

- Python 3.11+
- Install with `uv sync` from the GitHub repo (`pyproject.toml` pins dependencies)
- 1× GPU with ≥16 GB VRAM for ViT-H inference (≥24 GB for training)

### 2. Get the frozen SAM backbone

Download the official SAM weights from Meta and place them in `sam_weights/`:

- `sam_vit_h_4b8939.pth`
- `sam_vit_l_0b3195.pth`
- `sam_vit_b_01ec64.pth`

(See https://github.com/facebookresearch/segment-anything#model-checkpoints)

### 3. Load a checkpoint

```python
from panosamic.model import PanoSAMic

model = PanoSAMic.from_pretrained_panosamic(
    "dfki-av/PanoSAMic",
    subfolder="stanford2d3ds-vith-rgbdn-fold1",
    config_path="config/config_stanford2d3ds_dv.json",
    vit_model="vit_h",
    modalities=("image", "depth", "normals"),
    num_classes=13,
    sam_weights_path="./sam_weights",  # omit to auto-download from Meta's servers
)
```

`from_pretrained_panosamic` loads only the trainable weights from the Hub, initialises the frozen SAM backbone from the local `sam_weights/` directory (auto-downloaded if not present), and returns the model in `eval()` mode.

### 4. Run inference

```python
import torch
from panosamic.model.instance_semantic_fusion import refine_semantic_with_instances

# batched_input: list of dicts, one per image.
# Each dict maps modality name → float tensor (3, H, W), values in [0, 255].
# Image must be equirectangular 2:1 (e.g. 512 × 1024).
batched_input = [{"image": image_tensor, "depth": depth_tensor, "normals": normals_tensor}]

with torch.no_grad():
    outputs = model(batched_input)

sem_preds = outputs[0]["sem_preds"]  # logits, shape (num_classes, H, W)
instance_masks = outputs[0]["instance_masks"]

# Instance-guided refinement: each SAM mask is assigned the majority
# semantic class within it, sharpening boundaries.
if instance_masks:
    sem_preds = refine_semantic_with_instances(sem_preds, instance_masks)

seg_map = sem_preds.argmax(dim=0)  # integer class indices, shape (H, W)
```

### 5. Prepare the data

Use the exact splits reported in the paper:

- **Stanford2D3DS:** the authors' 3-fold cross-validation splits. Source: https://github.com/alexsax/2D-3D-Semantics . Preprocess with `panosamic/data_preparation/` into the processed structure documented in the repo README.
- **Matterport3D:** the **BEV360** pre-processed data and splits (20-class subset) for a fair comparison. Source: https://github.com/InSAI-Lab/360BEV .

### 6. Run evaluation

**From a released Hub checkpoint** (trainable weights only, SAM loaded separately):

```bash
python panosamic/evaluation/evaluate.py \
    --dataset_path /path/to/processed/dataset \
    --config_path config/config_stanford2d3ds_dv.json \
    --checkpoint dfki-av/PanoSAMic \
    --subfolder stanford2d3ds-vith-rgbdn-fold1 \
    --sam_weights_path ./sam_weights \
    --dataset stanford2d3ds \
    --fold 1 \
    --vit_model vit_h \
    --modalities image,depth,normals \
    --num_gpus 1
```

**From a local training run** (full checkpoint including frozen backbone):

```bash
python panosamic/evaluation/evaluate.py \
    --dataset_path /path/to/processed/dataset \
    --config_path config/config_stanford2d3ds_dv.json \
    --experiments_path ./experiments \
    --dataset stanford2d3ds \
    --fold 1 \
    --vit_model vit_h \
    --modalities image,depth,normals \
    --num_gpus 1
```

Repeat for folds 1–3 and average for the 3-fold numbers. For Matterport3D use `config/config_matterport3d_dv.json`, `--dataset matterport3d`, and the modalities for that row.

### 7. Key configuration (matches the paper)

- Frozen SAM ViT-H, encoder depth 32, global attention at blocks [8, 16, 24, 32]
- Batch size 8, 50 epochs, Ranger21 optimizer
- Max LR 0.0005 (Stanford2D3DS) / 0.001 (Matterport3D)
- Input resized to 512 × 1024
- MCBAM window 8×8, stride 4; spherical attention kernel 7×7, stride 1
- Dual-view shift s = W/2
- Loss: Jaccard (Stanford2D3DS); alternating Cross-Entropy/Jaccard schedule (Matterport3D)
- Depth preprocessed to pseudo-disparity (threshold = 99.5th percentile of train depths, rounded to nearest 10 cm), replicated to 3 channels

## Intended use and limitations

Indoor panoramic semantic segmentation with RGB / RGB-D / RGB-D-N input. Evaluated only on indoor datasets; outdoor generalization is not guaranteed.

## License and access terms

- This model card and the released trainable weights: **CC BY-NC-SA 4.0** (Attribution–NonCommercial–ShareAlike). Use is restricted to **non-commercial** purposes.
- The frozen SAM backbone (downloaded separately) remains under its original **Apache-2.0** license from Meta AI.

## Citation

```bibtex
@article{chamseddine2026panosamic,
  title   = {PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion},
  author  = {Chamseddine, Mahdi and Stricker, Didier and Rambach, Jason},
  journal = {arXiv preprint arXiv:2601.07447},
  year    = {2026}
}
```

## Acknowledgement

Funded by the European Union as part of the projects HumanTech (Grant Agreement 101058236) and ShieldBOT (Grant Agreement 101235093).