---
license: cc-by-nc-4.0
library_name: warpconvnet
pipeline_tag: image-segmentation
tags:
- 3d
- point-cloud
- instance-segmentation
- open-vocabulary
- scannet
- scannet200
- scannetpp
- replica
- spaceformer
- warpconvnet
---

# SpaceFormer: Open-Vocabulary 3D Instance Segmentation

**SpaceFormer** performs **proposal-free, open-vocabulary 3D instance segmentation**. A Mask2Former-style query decoder (learned queries + rotary position embeddings) runs on top of the WarpConvNet [`SpaCeFormer`](https://github.com/NVlabs/WarpConvNet) sparse point backbone. A single forward pass over an RGB point cloud produces a fixed set of query masks plus a per-query CLIP feature; each mask is labeled by comparing its CLIP feature against text embeddings of **arbitrary class names** (SigLIP2 text encoder, with prompt ensembling). The vocabulary is chosen at inference time rather than baked into the weights, so the model can be queried with any label set.

Project page: https://nvlabs.github.io/SpaCeFormer/

## Model details

- **Task:** open-vocabulary 3D instance segmentation on RGB point clouds.
- **Architecture:** WarpConvNet `SpaCeFormer` backbone (mixed space/curve sparse attention U-Net, `ssccc` encoder) → proposal-free query decoder (hidden dim 512, 200 learned queries, RoPE cross/self-attention, 3 decoder iterations) → objectness + per-point mask + per-query CLIP heads. ~85.8M parameters.
- **CLIP/text embedding:** `google/siglip2-so400m-patch14-224` (1152-d), used only at inference to embed class names; not stored in this checkpoint.
- **Input:** point coordinates in meters + RGB; voxelized internally at 2 cm.
- **Naming:** `spaceformer_512_siglip2_ssccc` = hidden dim 512 · SigLIP2 embedding · `ssccc` encoder attention (space, space, curve, curve, curve).

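
The labeling step described above (compare each query's CLIP feature against prompt-ensembled text embeddings) can be sketched in a few lines. This is a minimal illustration, not the demo pipeline's actual code: the function names `ensemble_text_embeddings` and `label_queries` are hypothetical, random vectors stand in for real SigLIP2 text embeddings and model outputs, and plain cosine similarity with an argmax is an assumption about the scoring rule.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def ensemble_text_embeddings(prompt_embeds):
    """Prompt ensembling (hypothetical helper).

    prompt_embeds: [C, P, D], one embedding per class and prompt template.
    Returns [C, D]: normalize each template embedding, average over
    templates, then renormalize.
    """
    return l2_normalize(l2_normalize(prompt_embeds).mean(axis=1))

def label_queries(clip_feat, class_embeds):
    """Assign each query the class with the highest cosine similarity.

    clip_feat: [Q, D] per-query features; class_embeds: [C, D], unit norm.
    Returns (labels [Q], scores [Q]).
    """
    sim = l2_normalize(clip_feat) @ class_embeds.T  # [Q, C]
    return sim.argmax(axis=1), sim.max(axis=1)

# Toy data: random stand-ins for text embeddings and query features.
rng = np.random.default_rng(0)
class_names = ["chair", "table", "lamp", "other"]
C, P, D, Q = len(class_names), 3, 1152, 5
prompt_embeds = rng.normal(size=(C, P, D))
class_embeds = ensemble_text_embeddings(prompt_embeds)

# Build query features that point toward known classes, plus noise.
true_classes = np.array([2, 0, 3, 1, 2])
clip_feat = 5.0 * class_embeds[true_classes] + 0.1 * rng.normal(size=(Q, D))

labels, scores = label_queries(clip_feat, class_embeds)
for q, (c, s) in enumerate(zip(labels, scores)):
    print(f"query {q}: {class_names[c]} (cosine {s:.3f})")
```

Because the class embeddings are computed from whatever names are passed in, changing the vocabulary only changes `class_embeds`; the query features from the model stay the same.
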
## Evaluation

Test-set mAP with the released recipe (**prompt ensembling on, TTA off, default proposal-free post-processing**):

| Benchmark | mAP | mAP50 | Recall (class-agnostic) |
|---|---:|---:|---:|
| ScanNet200 | **0.1265** | 0.210 | 0.756 |
| ScanNet++ | 0.2217 | - | - |
| Replica | 0.2644 | - | - |

## How to use

The model lives in WarpConvNet as `warpconvnet.models.spaceformer` (the backbone needs WarpConvNet's compiled CUDA extension; install a pre-built wheel or build from source). It returns **raw** predictions; open-vocab labeling + mask post-processing live in the demo repo / HuggingFace Space, not in WarpConvNet.

```python
import torch
from warpconvnet.models.spaceformer import build_spaceformer, load_spaceformer_checkpoint
from huggingface_hub import hf_hub_download

device = torch.device("cuda")
ckpt = hf_hub_download("chrischoy/SpaCeFormer", "spaceformer_512_siglip2_ssccc.ckpt")
net = build_spaceformer(device=device)
load_spaceformer_checkpoint(net, ckpt)   # 487 tensors, strict=False

# coord, feat, and offset are tensors you supply for your scene:
# coord [N,3] float meters; feat [N,3] RGB in [-1,1]; offset [0, N]
out = net({"coord": coord, "feat": feat, "offset": offset})
# raw outputs: {"logit":[B,Q,2], "mask":List[[N,Q]], "clip_feat":[B,Q,1152]}
```

To turn `clip_feat` into open-vocabulary labels (SigLIP2 text + prompt ensembling) and clean up masks (NMS/min-points), use the inference pipeline in the demo repo / Space (`pipeline.py`, `clip_eval.py`, `text_encoder.py`, `postprocessing.py`, `labels.py`), e.g. its `inference.py` CLI or the Gradio `app.py`.

## Demo (run locally)

A small local demo lives under [`demo/`](./demo). No GPU cloud or HF Space is needed; it runs on your own machine (requires WarpConvNet with its compiled extension).
It takes text class names, runs segmentation, and shows the result in an interactive 3D [**viser**](https://viser.studio) viewer:

```bash
pip install -r demo/requirements.txt        # + warpconvnet (compiled)
python demo/demo_viser.py --port 8080       # uses a bundled sample point cloud
# your own scene + vocabulary:
python demo/demo_viser.py --ply my_scene.ply --class-names "chair" "table" "lamp" "other"
```

Open the printed `http://localhost:8080`; each predicted instance is shown in a distinct color. A headless CLI (`demo/inference.py`) and a Gradio app (`demo/app.py`) are also included.

## Intended use & limitations

- **Intended:** research on open-vocabulary 3D scene understanding; segmenting indoor RGB point clouds (ScanNet-like) against custom class vocabularies.
- **Open-vocab mAP is semantics-bottlenecked:** rare/fine-grained classes are weaker than head classes; class-agnostic mask recall is higher than the open-vocab mAP.
- **Domain:** trained on indoor scenes (ScanNet, ScanNet++, ARKitScenes, Matterport3D) and evaluated on ScanNet200 / ScanNet++ / Replica (Replica zero-shot); outdoor or very different sensor domains are out of distribution.
- **Large scenes:** very large clouds can exceed memory in the eval forward; the inference code skips such a scene (single-process) rather than crashing.

## Files

- `spaceformer_512_siglip2_ssccc.ckpt`: weights-only Lightning `state_dict` (487 tensors; `net.*` decoder/backbone + `caption_loss.logit_scale`). Load via `load_spaceformer_checkpoint` (strips the `net.` prefix, `strict=False`).
- `spaceformer_512_siglip2_ssccc.ckpt.provenance.json`: architecture, eval numbers, md5.

## License & usage

**These weights are released for non-commercial research use only, under [CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/).** They are a derivative of datasets governed by non-commercial research Terms of Use, so they are **not** released under the permissive Apache-2.0 license that covers the *code*.

The model was trained on the following datasets, each of which restricts use to **non-commercial research/education** under its own terms. By using these weights you agree to comply with all of them:

- **ScanNet / ScanNet200**: [ScanNet Terms of Use](http://kaldir.vc.in.tum.de/scannet/ScanNet_TOS.pdf)
- **ScanNet++**: [ScanNet++ Terms of Use](https://kaldir.vc.in.tum.de/scannetpp/static/scannetpp-terms-of-use.pdf)
- **ARKitScenes**: [Apple ARKitScenes license](https://github.com/apple/ARKitScenes/blob/main/LICENSE) (non-commercial)
- **Matterport3D**: [Matterport3D Terms of Use](https://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf) (non-commercial academic)

Evaluation additionally used **Replica** ([Replica Research Terms](https://github.com/facebookresearch/Replica-Dataset/blob/main/LICENSE), non-commercial), zero-shot.

The accompanying **code** in [WarpConvNet](https://github.com/NVlabs/WarpConvNet) is licensed separately under **Apache-2.0**.

> Note: this is not legal advice; for commercial use, consult the individual dataset
> licensors. Please also cite the datasets above and the SpaceFormer project.