---
license: cc-by-nc-4.0
library_name: warpconvnet
pipeline_tag: image-segmentation
tags:
  - 3d
  - point-cloud
  - instance-segmentation
  - open-vocabulary
  - scannet
  - scannet200
  - scannetpp
  - replica
  - spaceformer
  - warpconvnet
---

# SpaceFormer — Open-Vocabulary 3D Instance Segmentation

**SpaceFormer** performs **proposal-free, open-vocabulary 3D instance segmentation**.
A Mask2Former-style query decoder (learned queries + rotary position embeddings) runs
on top of the WarpConvNet [`SpaCeFormer`](https://github.com/NVlabs/WarpConvNet) sparse
point backbone. A single forward pass over an RGB point cloud produces a fixed set of
query masks plus a per-query CLIP feature; each mask is labeled by comparing its CLIP
feature against text embeddings of **arbitrary class names** (SigLIP2 text encoder, with
prompt ensembling). The vocabulary is chosen at inference time — it is not baked into the
weights — so the model can be queried with any label set.

Project page: https://nvlabs.github.io/SpaCeFormer/

## Model details

- **Task:** open-vocabulary 3D instance segmentation on RGB point clouds.
- **Architecture:** WarpConvNet `SpaCeFormer` backbone (mixed space/curve sparse
  attention U-Net, `ssccc` encoder) → proposal-free query decoder (hidden dim 512,
  200 learned queries, RoPE cross/self-attention, 3 decoder iterations) → objectness +
  per-point mask + per-query CLIP heads. ~85.8M parameters.
- **CLIP/text embedding:** `google/siglip2-so400m-patch14-224` (1152-d), used only at
  inference to embed class names; not stored in this checkpoint.
- **Input:** point coordinates in meters + RGB; voxelized internally at 2 cm.
- **Naming:** `spaceformer_512_siglip2_ssccc` = hidden dim 512 · SigLIP2 embedding ·
  `ssccc` encoder attention (space, space, curve, curve, curve).
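
As an illustration of the input format described above, the following NumPy sketch packs one scene into `coord`, `feat`, and `offset` arrays. The helper name `prepare_inputs` is hypothetical, and the linear mapping from uint8 colors to [-1, 1] is an assumption: the card only states that RGB features lie in [-1, 1].

```python
import numpy as np


def prepare_inputs(xyz_m, rgb_uint8):
    """Build coord/feat/offset arrays for a single scene.

    Assumptions (not confirmed by the model card): colors arrive as uint8 in
    [0, 255] and are mapped linearly to [-1, 1]; offset holds the batch
    boundaries [0, N] for one scene.
    """
    xyz_m = np.asarray(xyz_m, dtype=np.float32)
    rgb_uint8 = np.asarray(rgb_uint8)
    if xyz_m.ndim != 2 or xyz_m.shape[1] != 3:
        raise ValueError("xyz_m must have shape [N, 3]")
    if rgb_uint8.shape != xyz_m.shape:
        raise ValueError("rgb_uint8 must have the same shape as xyz_m")
    feat = rgb_uint8.astype(np.float32) / 127.5 - 1.0  # 0 -> -1, 255 -> +1
    offset = np.array([0, xyz_m.shape[0]], dtype=np.int64)
    return {"coord": xyz_m, "feat": feat, "offset": offset}
```

The arrays can then be moved to the GPU with `torch.from_numpy(...).to(device)` before calling the network.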

## Evaluation

Test-set mAP with the released recipe (**prompt ensembling on, TTA off, default
proposal-free post-processing**):

| Benchmark | mAP | mAP50 | recall (class-agnostic) |
|---|---:|---:|---:|
| ScanNet200 | **0.1265** | 0.210 | 0.756 |
| ScanNet++ | 0.2217 | — | — |
| Replica | 0.2644 | — | — |

## How to use

The model lives in WarpConvNet as `warpconvnet.models.spaceformer`. The backbone needs
WarpConvNet's compiled CUDA extension, so install a pre-built wheel or build from source.
The model returns **raw** predictions; open-vocabulary labeling and mask post-processing
live in the demo repo / Hugging Face Space, not in WarpConvNet.

```python
import torch
from warpconvnet.models.spaceformer import build_spaceformer, load_spaceformer_checkpoint
from huggingface_hub import hf_hub_download

device = torch.device("cuda")
ckpt = hf_hub_download("chrischoy/SpaCeFormer", "spaceformer_512_siglip2_ssccc.ckpt")

net = build_spaceformer(device=device)
load_spaceformer_checkpoint(net, ckpt)          # 487 tensors, strict=False

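# Placeholder inputs so the snippet runs end to end; replace with a real scene.
# (offset as an int64 tensor is an assumption, not confirmed by this card)
N = 50_000
coord = torch.rand(N, 3, device=device) * 4.0        # [N,3] float, meters
feat = torch.rand(N, 3, device=device) * 2.0 - 1.0   # [N,3] RGB in [-1,1]
offset = torch.tensor([0, N], device=device)         # [0, N] for a single scene

with torch.no_grad():                                # inference only
    out = net({"coord": coord, "feat": feat, "offset": offset})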
# raw outputs: {"logit":[B,Q,2], "mask":List[[N,Q]], "clip_feat":[B,Q,1152]}
```
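
The raw per-query masks usually contain low-confidence, tiny, and near-duplicate predictions. The sketch below shows one generic cleanup (score filter, minimum point count, greedy mask NMS) in NumPy. It is not the released post-processing: the function name, the thresholds, and the per-query `scores` input are hypothetical, and it assumes the masks have already been binarized.

```python
import numpy as np


def clean_masks(masks, scores, min_points=100, score_thresh=0.5, iou_thresh=0.5):
    """Greedy mask NMS with score and minimum-size filters.

    masks:  [N, Q] boolean, one column per query (already binarized).
    scores: [Q] float, a per-query confidence (hypothetical; how it is derived
            from `logit` is up to the released pipeline).
    Returns the indices of the kept queries, highest score first.
    """
    masks = np.asarray(masks, dtype=bool)
    scores = np.asarray(scores, dtype=np.float32)
    sizes = masks.sum(axis=0)
    # Visit queries by descending score, dropping weak or tiny ones up front.
    order = [q for q in np.argsort(-scores, kind="stable")
             if scores[q] >= score_thresh and sizes[q] >= min_points]
    kept = []
    for q in order:
        duplicate = False
        for k in kept:
            inter = np.logical_and(masks[:, q], masks[:, k]).sum()
            union = sizes[q] + sizes[k] - inter
            if union > 0 and inter / union > iou_thresh:
                duplicate = True  # overlaps a higher-scoring mask too much
                break
        if not duplicate:
            kept.append(int(q))
    return kept
```

Greedy NMS keeps the highest-scoring mask of each overlapping group, which is a common choice when queries are not guaranteed to be unique.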

To turn `clip_feat` into open-vocabulary labels (SigLIP2 text + prompt ensembling) and
clean up masks (NMS/min-points), use the inference pipeline in the demo repo / Space
(`pipeline.py`, `clip_eval.py`, `text_encoder.py`, `postprocessing.py`, `labels.py`) —
e.g. its `inference.py` CLI or the Gradio `app.py`.
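
For orientation, the labeling step can be sketched in a few lines of NumPy. This is a simplified illustration, not the code in the demo repo: it assumes the text embeddings have already been computed by the SigLIP2 text encoder for several prompt templates per class, and it uses plain cosine similarity with an argmax. The function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def ensemble_text_embeddings(prompt_emb):
    """prompt_emb: [C, T, D], one embedding per class and prompt template.

    Normalize each prompt embedding, average over templates, renormalize.
    """
    unit = l2_normalize(np.asarray(prompt_emb, dtype=np.float32))
    return l2_normalize(unit.mean(axis=1))  # [C, D]


def label_queries(clip_feat, class_emb):
    """clip_feat: [Q, D] per-query features; class_emb: [C, D] unit vectors.

    Returns (class index, cosine similarity) for each query.
    """
    feat = l2_normalize(np.asarray(clip_feat, dtype=np.float32))
    sim = feat @ np.asarray(class_emb, dtype=np.float32).T  # [Q, C]
    return sim.argmax(axis=1), sim.max(axis=1)
```

With the model output above, `clip_feat` for one scene would be `out["clip_feat"][0].cpu().numpy()`, and the class index maps back to the list of class names passed to the text encoder.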

## Demo (run locally)

A small local demo lives under [`demo/`](./demo). It needs no cloud GPU or HF Space and
runs on your own machine (requires WarpConvNet with its compiled extension). It takes text
class names, runs segmentation, and shows the result in an interactive 3D
[**viser**](https://viser.studio) viewer:

```bash
pip install -r demo/requirements.txt          # + warpconvnet (compiled)
python demo/demo_viser.py --port 8080         # uses a bundled sample point cloud
# your own scene + vocabulary:
python demo/demo_viser.py --ply my_scene.ply --class-names "chair" "table" "lamp" "other"
```

Open the printed URL (`http://localhost:8080`); each predicted instance is drawn in a distinct color.
A headless CLI (`demo/inference.py`) and a Gradio app (`demo/app.py`) are also included.

## Intended use & limitations

- **Intended:** research on open-vocabulary 3D scene understanding; segmenting indoor RGB
  point clouds (ScanNet-like) against custom class vocabularies.
- **Open-vocab mAP is semantics-bottlenecked:** rare/fine-grained classes are weaker than
  head classes, and class-agnostic mask recall (0.756 on ScanNet200) is much higher than
  the open-vocab mAP (0.1265), so labeling limits the score more than mask quality does.
- **Domain:** trained on indoor scenes (ScanNet, ScanNet++, ARKitScenes, Matterport3D)
  and evaluated on ScanNet200 / ScanNet++ / Replica (Replica zero-shot); outdoor or very
  different sensor domains are out of distribution.
- **Large scenes:** very large clouds can exceed memory in the eval forward; the
  inference code skips such a scene (single-process) rather than crashing.
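
The skip-instead-of-crash behavior described in the last bullet can be sketched as a small loop. This is an illustration, not the released inference code; `run_scenes` and its arguments are hypothetical names.

```python
def run_scenes(scenes, infer, oom_error=MemoryError):
    """Run `infer` on each (name, data) pair, skipping scenes that run out of memory.

    oom_error is the exception type treated as out-of-memory. With PyTorch on a
    GPU one would pass torch.cuda.OutOfMemoryError (assumes PyTorch >= 1.13).
    """
    results, skipped = {}, []
    for name, data in scenes:
        try:
            results[name] = infer(data)
        except oom_error:
            skipped.append(name)  # record the scene and move on
    return results, skipped
```

When adapting this to GPU inference, calling `torch.cuda.empty_cache()` after a skipped scene may help later scenes fit.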

## Files

- `spaceformer_512_siglip2_ssccc.ckpt` — weights-only Lightning `state_dict` (487
  tensors; `net.*` decoder/backbone + `caption_loss.logit_scale`). Load via
  `load_spaceformer_checkpoint` (strips the `net.` prefix, `strict=False`).
- `spaceformer_512_siglip2_ssccc.ckpt.provenance.json` — architecture, eval numbers, md5.
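
To make the prefix handling concrete, here is a minimal sketch of the key remapping such a loader performs. It is not the implementation of `load_spaceformer_checkpoint`; the helper name is hypothetical, and dropping keys outside `net.*` is an assumption about how the extra `caption_loss.logit_scale` entry is handled.

```python
def strip_prefix(state_dict, prefix="net."):
    """Keep only keys under `prefix` and remove the prefix from them."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }
```

The remapped dictionary could then be passed to `net.load_state_dict(..., strict=False)`, which tolerates missing or unexpected keys.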

## License & usage

**These weights are released for non-commercial research use only, under
[CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/).** They are a derivative
of datasets governed by non-commercial research Terms of Use, so they are **not** released
under the permissive Apache-2.0 license that covers the *code*.

The model was trained on the following datasets, each of which restricts use to
**non-commercial research/education** under its own terms — by using these weights you
agree to comply with all of them:

- **ScanNet / ScanNet200** — [ScanNet Terms of Use](http://kaldir.vc.in.tum.de/scannet/ScanNet_TOS.pdf)
- **ScanNet++** — [ScanNet++ Terms of Use](https://kaldir.vc.in.tum.de/scannetpp/static/scannetpp-terms-of-use.pdf)
- **ARKitScenes** — [Apple ARKitScenes license](https://github.com/apple/ARKitScenes/blob/main/LICENSE) (non-commercial)
- **Matterport3D** — [Matterport3D Terms of Use](https://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf) (non-commercial academic)

Evaluation additionally used **Replica** ([Replica Research Terms](https://github.com/facebookresearch/Replica-Dataset/blob/main/LICENSE), non-commercial), zero-shot.

The accompanying **code** in [WarpConvNet](https://github.com/NVlabs/WarpConvNet) is
licensed separately under **Apache-2.0**.

> Note: this is not legal advice; for commercial use, consult the individual dataset
> licensors. Please also cite the datasets above and the SpaceFormer project.