---
license: cc-by-nc-sa-4.0
tags:
  - semantic-segmentation
  - panoramic
  - spherical-images
  - modality-fusion
  - sam
library_name: panosamic
pipeline_tag: image-segmentation
datasets:
  - stanford2d3ds
  - matterport3d
---

# PanoSAMic

PanoSAMic is a multi-modal semantic segmentation model for panoramic (360°)
images. It integrates the **frozen** Segment Anything Model (SAM) encoder,
modified to output multi-stage features, with a spatio-modal fusion module
(MCBAM), a spherical-attention semantic decoder, and dual-view fusion to handle
the distortion and edge discontinuity of equirectangular images.

- **Paper:** PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion (ICPR 2026)
- **Code:** https://github.com/dfki-av/PanoSAMic
- **arXiv:** https://arxiv.org/abs/2601.07447
- **Authors:** Mahdi Chamseddine, Didier Stricker, Jason Rambach (DFKI / RPTU Kaiserslautern-Landau)

## What is in this repository

Only the **trainable** PanoSAMic components are hosted here:

- **Feature fusion blocks (MCBAM)** — spatio-modal cross-attention applied to the branch features extracted by the frozen encoder
- **Semantic decoder** — convolutional decoder with spherical attention and dual-view fusion head

The full model state dict splits into trainable and frozen modules:

| Module prefix | Trainable | In Hub checkpoint |
|---|---|---|
| `feature_fuser.*` | ✅ yes | ✅ yes |
| `semantic_decoder.*` | ✅ yes | ✅ yes |
| `image_encoder.*` | ❌ frozen (SAM ViT) | ❌ no |
| `prompt_encoder.*` | ❌ frozen (SAM) | ❌ no |
| `mask_decoder.*` | ❌ frozen (SAM) | ❌ no |

The **frozen SAM ViT backbone is NOT hosted here.** It is downloaded separately
from Meta's official release (Apache-2.0) and combined at load time. This keeps
each checkpoint small and avoids redistributing the SAM weights.
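
To illustrate the split, here is a minimal sketch that separates a full state dict by the module prefixes in the table above. The helper name `split_state_dict` and the example keys are hypothetical and not part of the `panosamic` package; plain integers stand in for tensors.

```python
# Prefixes of the trainable modules, taken from the table above.
TRAINABLE_PREFIXES = ("feature_fuser.", "semantic_decoder.")


def split_state_dict(state_dict):
    """Split a full state dict into (trainable, frozen) parts by key prefix."""
    trainable = {
        key: value
        for key, value in state_dict.items()
        if key.startswith(TRAINABLE_PREFIXES)
    }
    frozen = {key: value for key, value in state_dict.items() if key not in trainable}
    return trainable, frozen


# Toy example: the key names are hypothetical, the values are placeholders.
full = {
    "feature_fuser.block0.weight": 1,
    "semantic_decoder.head.bias": 2,
    "image_encoder.patch_embed.weight": 3,
    "prompt_encoder.pe": 4,
}
trainable, frozen = split_state_dict(full)
print(sorted(trainable))  # only these entries would belong in a Hub checkpoint
```

The same filter can be applied to a real `model.state_dict()` to check that only `feature_fuser.*` and `semantic_decoder.*` entries end up in a released file.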

## Available checkpoints

Each variant lives in its own subfolder of `dfki-av/PanoSAMic`
(e.g. `stanford2d3ds-vith-rgbdn-fold1/model.safetensors`).
Stanford2D3DS checkpoints are published per fold so that each can be evaluated on its own held-out split.

| Checkpoint | Backbone | Modalities | Dataset | Split |
|---|---|---|---|---|
| `stanford2d3ds-vith-rgb-fold1` | ViT-H | RGB | Stanford2D3DS | Fold 1 |
| `stanford2d3ds-vith-rgb-fold2` | ViT-H | RGB | Stanford2D3DS | Fold 2 |
| `stanford2d3ds-vith-rgb-fold3` | ViT-H | RGB | Stanford2D3DS | Fold 3 |
| `stanford2d3ds-vith-rgbd-fold1` | ViT-H | RGB-D | Stanford2D3DS | Fold 1 |
| `stanford2d3ds-vith-rgbd-fold2` | ViT-H | RGB-D | Stanford2D3DS | Fold 2 |
| `stanford2d3ds-vith-rgbd-fold3` | ViT-H | RGB-D | Stanford2D3DS | Fold 3 |
| `stanford2d3ds-vith-rgbdn-fold1` | ViT-H | RGB-D-N | Stanford2D3DS | Fold 1 |
| `stanford2d3ds-vith-rgbdn-fold2` | ViT-H | RGB-D-N | Stanford2D3DS | Fold 2 |
| `stanford2d3ds-vith-rgbdn-fold3` | ViT-H | RGB-D-N | Stanford2D3DS | Fold 3 |
| `stanford2d3ds-vitl-rgbdn-fold1` | ViT-L | RGB-D-N | Stanford2D3DS | Fold 1 |
| `stanford2d3ds-vitl-rgbdn-fold2` | ViT-L | RGB-D-N | Stanford2D3DS | Fold 2 |
| `stanford2d3ds-vitl-rgbdn-fold3` | ViT-L | RGB-D-N | Stanford2D3DS | Fold 3 |
| `stanford2d3ds-vitb-rgbdn-fold1` | ViT-B | RGB-D-N | Stanford2D3DS | Fold 1 |
| `stanford2d3ds-vitb-rgbdn-fold2` | ViT-B | RGB-D-N | Stanford2D3DS | Fold 2 |
| `stanford2d3ds-vitb-rgbdn-fold3` | ViT-B | RGB-D-N | Stanford2D3DS | Fold 3 |
| `matterport3d-vith-rgb` | ViT-H | RGB | Matterport3D | BEV360 |
| `matterport3d-vith-rgbd` | ViT-H | RGB-D | Matterport3D | BEV360 |
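
The subfolder names follow a regular pattern, so they can be built from the table columns. The sketch below assumes that pattern holds for every row; `checkpoint_subfolder` and `download_checkpoint` are hypothetical helpers, and the `model.safetensors` filename is taken from the example path above. `hf_hub_download` comes from `huggingface_hub` and needs network access, so it is only imported when called.

```python
# Short backbone tags as used in the checkpoint names above.
BACKBONES = {"ViT-H": "vith", "ViT-L": "vitl", "ViT-B": "vitb"}


def checkpoint_subfolder(dataset, backbone, modalities, fold=None):
    """Build a subfolder name such as 'stanford2d3ds-vith-rgbdn-fold1'."""
    tag = modalities.replace("-", "").lower()  # "RGB-D-N" -> "rgbdn"
    name = f"{dataset.lower()}-{BACKBONES[backbone]}-{tag}"
    return f"{name}-fold{fold}" if fold is not None else name


def download_checkpoint(subfolder, repo_id="dfki-av/PanoSAMic"):
    """Fetch the trainable weights of one variant (requires network access)."""
    from huggingface_hub import hf_hub_download  # lazy import

    return hf_hub_download(
        repo_id=repo_id, filename="model.safetensors", subfolder=subfolder
    )


print(checkpoint_subfolder("Stanford2D3DS", "ViT-H", "RGB-D-N", fold=1))
print(checkpoint_subfolder("Matterport3D", "ViT-H", "RGB-D"))
```

For normal use, `from_pretrained_panosamic` with the `subfolder` argument handles the download; the manual route is only useful for inspecting the file directly.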

## Reported results

**Stanford2D3DS (3-fold cross-validation, averaged over folds), main table:**

| Checkpoint | mIoU % | mAcc % | Trainable params (M) |
|---|---|---|---|
| `stanford2d3ds-vith-rgb` | 59.62 | 74.11 | 178 |
| `stanford2d3ds-vith-rgbd` | 60.90 | 73.95 | 184 |
| `stanford2d3ds-vith-rgbdn` | 61.57 | 74.04 | 191 |

**Encoder-size study (Stanford2D3DS, 3-fold, RGB-D-N):**

| Checkpoint | mIoU % | mAcc % |
|---|---|---|
| `stanford2d3ds-vitb-rgbdn` | 56.68 | 70.49 |
| `stanford2d3ds-vitl-rgbdn` | 60.90 | 73.09 |
| `stanford2d3ds-vith-rgbdn` | 61.57 | 74.04 |

**Matterport3D (BEV360 splits):**

| Checkpoint | mIoU % |
|---|---|
| `matterport3d-vith-rgb` | 46.59 |
| `matterport3d-vith-rgbd` | 48.43 |

## How to reproduce

### 1. Environment

- Python 3.11+
- Install with `uv sync` from the GitHub repo (`pyproject.toml` pins dependencies)
- 1× GPU with ≥16 GB VRAM for ViT-H inference (≥24 GB for training)

### 2. Get the frozen SAM backbone

Download the official SAM weights from Meta and place them in `sam_weights/`:

- `sam_vit_h_4b8939.pth`
- `sam_vit_l_0b3195.pth`
- `sam_vit_b_01ec64.pth`

(See https://github.com/facebookresearch/segment-anything#model-checkpoints)
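
Before loading a model, it can help to confirm that the matching SAM file is in place. The following sketch maps a backbone name to the filenames listed above; the helper `sam_weights_file` is hypothetical, and the `vit_l` and `vit_b` keys are assumed by analogy with the `vit_h` value used in the loading example.

```python
from pathlib import Path

# Filenames from the list above, keyed by backbone name.
SAM_FILES = {
    "vit_h": "sam_vit_h_4b8939.pth",
    "vit_l": "sam_vit_l_0b3195.pth",
    "vit_b": "sam_vit_b_01ec64.pth",
}


def sam_weights_file(vit_model, weights_dir="sam_weights"):
    """Return the expected path of the SAM checkpoint for a backbone name."""
    if vit_model not in SAM_FILES:
        raise ValueError(f"unknown backbone {vit_model!r}; expected one of {sorted(SAM_FILES)}")
    return Path(weights_dir) / SAM_FILES[vit_model]


path = sam_weights_file("vit_h")
print(path, "found" if path.is_file() else "missing")
```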

### 3. Load a checkpoint

```python
from panosamic.model import PanoSAMic

model = PanoSAMic.from_pretrained_panosamic(
    "dfki-av/PanoSAMic",
    subfolder="stanford2d3ds-vith-rgbdn-fold1",
    config_path="config/config_stanford2d3ds_dv.json",
    vit_model="vit_h",
    modalities=("image", "depth", "normals"),
    num_classes=13,
    sam_weights_path="./sam_weights",  # omit to auto-download from Meta's servers
)
```

`from_pretrained_panosamic` loads only the trainable weights from the Hub,
initialises the frozen SAM backbone from the local `sam_weights/` directory
(auto-downloaded if not present), and returns the model in `eval()` mode.

### 4. Run inference

```python
import torch
from panosamic.model.instance_semantic_fusion import refine_semantic_with_instances

# batched_input: list of dicts, one per image.
# Each dict maps modality name → float tensor (3, H, W), values in [0, 255].
# Image must be equirectangular 2:1 (e.g. 512 × 1024).
batched_input = [{"image": image_tensor, "depth": depth_tensor, "normals": normals_tensor}]

with torch.no_grad():
    outputs = model(batched_input)

sem_preds = outputs[0]["sem_preds"]        # (num_classes, H, W) — logits
instance_masks = outputs[0]["instance_masks"]

# Instance-guided refinement: each SAM mask is assigned the majority
# semantic class within it, sharpening boundaries.
if instance_masks:
    sem_preds = refine_semantic_with_instances(sem_preds, instance_masks)

seg_map = sem_preds.argmax(dim=0)  # (H, W) — integer class indices
```
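
The refinement idea described in the comments above can be illustrated with a small NumPy sketch. This is not the implementation of `refine_semantic_with_instances`, which operates on the model outputs; it is a simplified majority vote on an integer label map, with a hypothetical helper name and toy data.

```python
import numpy as np


def majority_vote_refine(seg_map, instance_masks):
    """Give every pixel of each mask the most frequent class found inside it."""
    refined = seg_map.copy()
    for mask in instance_masks:
        labels = seg_map[mask]  # votes come from the unrefined map
        if labels.size == 0:  # skip empty masks
            continue
        refined[mask] = np.bincount(labels).argmax()
    return refined


# Toy 2 x 4 label map with one boolean mask over the two left columns.
seg_map = np.array([[0, 0, 1, 1],
                    [0, 2, 1, 1]])
mask = np.zeros_like(seg_map, dtype=bool)
mask[:, :2] = True  # covers labels 0, 0, 0, 2

refined = majority_vote_refine(seg_map, [mask])
print(refined)  # the stray label 2 inside the mask is replaced by class 0
```

If masks overlap, later masks overwrite earlier ones in this sketch; a real implementation may resolve overlaps differently.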

### 5. Prepare the data

Use the exact splits reported in the paper:

- **Stanford2D3DS:** the authors' 3-fold cross-validation splits. Source:
  https://github.com/alexsax/2D-3D-Semantics . Preprocess with
  `panosamic/data_preparation/` into the processed structure documented in the
  repo README.
- **Matterport3D:** the **BEV360** pre-processed data and splits (20-class
  subset) for a fair comparison. Source:
  https://github.com/InSAI-Lab/360BEV .

### 6. Run evaluation

**From a released Hub checkpoint** (trainable weights only, SAM loaded separately):

```bash
python panosamic/evaluation/evaluate.py \
    --dataset_path /path/to/processed/dataset \
    --config_path config/config_stanford2d3ds_dv.json \
    --checkpoint dfki-av/PanoSAMic \
    --subfolder stanford2d3ds-vith-rgbdn-fold1 \
    --sam_weights_path ./sam_weights \
    --dataset stanford2d3ds \
    --fold 1 \
    --vit_model vit_h \
    --modalities image,depth,normals \
    --num_gpus 1
```

**From a local training run** (full checkpoint including frozen backbone):

```bash
python panosamic/evaluation/evaluate.py \
    --dataset_path /path/to/processed/dataset \
    --config_path config/config_stanford2d3ds_dv.json \
    --experiments_path ./experiments \
    --dataset stanford2d3ds \
    --fold 1 \
    --vit_model vit_h \
    --modalities image,depth,normals \
    --num_gpus 1
```

Repeat for folds 1–3 and average for the 3-fold numbers. For Matterport3D use
`config/config_matterport3d_dv.json`, `--dataset matterport3d`, and the
modalities for that row.
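
Averaging the per-fold metrics takes only a few lines. The sketch below assumes the per-fold values have been copied by hand from the evaluation output; the helper `average_folds` is hypothetical, and the numbers are placeholders for illustration, not results from the paper.

```python
from statistics import mean


def average_folds(per_fold):
    """Average each metric over folds; per_fold maps fold -> {metric: value}."""
    metrics = next(iter(per_fold.values())).keys()
    return {metric: mean(result[metric] for result in per_fold.values()) for metric in metrics}


# Placeholder values for illustration only.
per_fold = {
    1: {"mIoU": 60.0, "mAcc": 73.0},
    2: {"mIoU": 62.0, "mAcc": 75.0},
    3: {"mIoU": 61.0, "mAcc": 74.0},
}
summary = average_folds(per_fold)
print({metric: round(value, 2) for metric, value in summary.items()})
```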

### 7. Key configuration (matches the paper)

- Frozen SAM ViT-H, encoder depth 32, global attention at blocks [8, 16, 24, 32]
- Batch size 8, 50 epochs, Ranger21 optimizer
- Max LR 0.0005 (Stanford2D3DS) / 0.001 (Matterport3D)
- Input resized to 512 × 1024
- MCBAM window 8×8, stride 4; spherical attention kernel 7×7, stride 1
- Dual-view shift s = W/2
- Loss: Jaccard (Stanford2D3DS); alternating Cross-Entropy/Jaccard schedule (Matterport3D)
- Depth preprocessed to pseudo-disparity (threshold = 99.5th percentile of train depths, rounded to nearest 10 cm), replicated to 3 channels
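
The dual-view shift s = W/2 listed above can be pictured as a circular roll along the width axis, which moves the left/right seam of the equirectangular image to the center of the second view. The NumPy sketch below only illustrates that shift on a toy array; the helper `second_view` is hypothetical, and how the model fuses the two views is not shown.

```python
import numpy as np


def second_view(pano):
    """Circularly shift an equirectangular array (..., H, W) by W/2 columns."""
    width = pano.shape[-1]
    return np.roll(pano, shift=width // 2, axis=-1)


# Toy panorama with C=1, H=1, W=8; the values are column indices.
pano = np.arange(8).reshape(1, 1, 8)
shifted = second_view(pano)
restored = second_view(shifted)  # shifting twice by W/2 undoes the shift for even W

print(shifted[0, 0])
print(np.array_equal(restored, pano))
```

Columns 0 and 7, which are neighbours on the sphere but sit at opposite borders of the original view, become adjacent in the middle of the shifted view.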

## Intended use and limitations

Indoor panoramic semantic segmentation with RGB / RGB-D / RGB-D-N input.
Evaluated only on indoor datasets; outdoor generalization is not guaranteed.

## License and access terms

- This model card and the released trainable weights: **CC BY-NC-SA 4.0**
  (Attribution–NonCommercial–ShareAlike). Use is restricted to **non-commercial**
  purposes.
- The frozen SAM backbone (downloaded separately) remains under its original
  **Apache-2.0** license from Meta AI.

## Citation

```bibtex
@article{chamseddine2026panosamic,
  title   = {PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion},
  author  = {Chamseddine, Mahdi and Stricker, Didier and Rambach, Jason},
  journal = {arXiv preprint arXiv:2601.07447},
  year    = {2026}
}
```

## Acknowledgement

Funded by the European Union as part of the projects HumanTech (Grant Agreement
101058236) and ShieldBOT (Grant Agreement 101235093).