---
title: SpaceFormer Open-Vocab 3D Instance Segmentation
emoji: 🧩
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
tags:
  - 3d
  - point-cloud
  - instance-segmentation
  - open-vocabulary
---

# SpaceFormer: Open-Vocabulary 3D Instance Segmentation (demo)

Proposal-free **open-vocabulary 3D instance segmentation**. A Mask2Former-style query decoder (learned queries + RoPE) sits on top of the WarpConvNet `SpaCeFormer` backbone. One forward pass over an RGB point cloud produces query masks and per-query CLIP features, which are labeled against text embeddings of **arbitrary** class names (SigLIP2, with prompt ensembling), so the vocabulary is chosen at inference time.

Released checkpoint:

| Benchmark | mAP |
|---|---|
| ScanNet200 | 0.1265 |
| ScanNet++ | 0.2217 |
| Replica | 0.2644 |

This repo is the **demo / inference layer**. The model itself lives in WarpConvNet (`warpconvnet.models.spaceformer`); this repo only adds the Gradio UI (`app.py`) and a CLI inference entry point (`inference.py`).

## Requirements

```bash
pip install -r requirements.txt
```

> **WarpConvNet must be installed with its compiled extension** (a pre-built wheel, or
> build from source). It is intentionally not pinned in `requirements.txt` because it is
> environment-specific. `transformers` pulls the SigLIP2 text encoder
> (`google/siglip2-so400m-patch14-224`) on first use.

## Live demo (Gradio / HuggingFace Space)

```bash
HF_REPO_ID=chrischoy/SpaCeFormer python app.py
# or a local checkpoint:
SPACEFORMER_CKPT=/path/to/spaceformer_512_siglip2_ssccc.ckpt python app.py
```

Upload a point cloud and type comma-separated class names to get an interactive 3D view colored by predicted instance, plus a ranked table.

As a **HuggingFace Space**: create a **GPU** Gradio Space, install WarpConvNet + `requirements.txt` in the image, and set the Space variable `HF_REPO_ID` (and optionally `HF_FILENAME`, default `spaceformer_512_siglip2_ssccc.ckpt`).
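
The environment variables above suggest a simple lookup order for the checkpoint. The following is a minimal sketch of how such a lookup could work, not the actual code in `app.py`: the function name `resolve_checkpoint_source` is hypothetical, and the rule that `SPACEFORMER_CKPT` takes precedence over `HF_REPO_ID` is an assumption.

```python
import os

# Default filename as stated in this README.
DEFAULT_FILENAME = "spaceformer_512_siglip2_ssccc.ckpt"


def resolve_checkpoint_source(env=None):
    """Decide where the checkpoint comes from, based on environment variables.

    Returns ("local", path) or ("hub", repo_id, filename).
    Assumption: a local SPACEFORMER_CKPT wins over HF_REPO_ID when both are set.
    """
    env = os.environ if env is None else env

    local_path = env.get("SPACEFORMER_CKPT", "").strip()
    if local_path:
        return ("local", local_path)

    repo_id = env.get("HF_REPO_ID", "").strip()
    if repo_id:
        # HF_FILENAME is optional; fall back to the documented default.
        filename = env.get("HF_FILENAME", "").strip() or DEFAULT_FILENAME
        return ("hub", repo_id, filename)

    raise RuntimeError("Set SPACEFORMER_CKPT or HF_REPO_ID to locate a checkpoint")


if __name__ == "__main__":
    print(resolve_checkpoint_source({"HF_REPO_ID": "chrischoy/SpaCeFormer"}))
```

In the `"hub"` case, the file could then be fetched with `huggingface_hub.hf_hub_download(repo_id=..., filename=...)`, which returns a local path to the cached file.
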
## Local demo (viser)

An interactive, self-contained local demo that takes **text class names**, runs segmentation, and visualizes the result in the browser with [viser](https://viser.studio): each predicted instance gets a distinct color, unassigned points stay grey, and a GUI panel lists the top instances.

```bash
# auto-download the checkpoint + use a bundled sample point cloud
python demo_viser.py --port 8080

# your own cloud + vocabulary, local checkpoint
python demo_viser.py --ckpt /path/to/spaceformer_512_siglip2_ssccc.ckpt \
    --ply my_scene.ply --class-names chair table monitor wall floor

# full ScanNet200 label set
python demo_viser.py --ply my_scene.ply --use-scannet200
```

Then open the printed URL (default `http://localhost:8080`) in a browser. With no `--ply`, the demo uses an open3d bundled sample cloud (or a synthesized random RGB cloud). A generic cloud won't segment meaningfully; it only demonstrates that the pipeline + viewer run end to end.

The demo colors the model's **output** points (`out["backbone_pc"].coordinates`), which are what the predicted masks index into after the model's internal voxelization, not the raw `.ply` points, whose count may differ.

## CLI inference

```bash
# local checkpoint
python inference.py --ckpt /path/to/spaceformer_512_siglip2_ssccc.ckpt \
    --scene /path/to/scene_dir   # dir with coord.npy + color.npy

# or auto-download from a HuggingFace model repo
HF_REPO_ID=chrischoy/SpaCeFormer python inference.py \
    --scene my_scene.ply --class-names "office chair" "desk" "monitor" "other"

# full ScanNet200 label set
python inference.py --ckpt <ckpt> --scene <scene> --use-scannet200
```

`--scene` accepts a directory with `coord.npy` (`[N,3]` float, meters) + `color.npy` (`[N,3]`, 0–255), a `.npz` `{coord,color}`, an `[N,6]` `.npy` (xyz,rgb), or a `.ply`. Coordinates stay in **meters**; the model voxelizes internally at 2 cm.

Output: a ranked list of `{label, score, #points}`, where `score = objectness · mask_quality · class_prob`.
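
To make the scoring rule concrete, here is a minimal sketch of ranking instances by `score = objectness · mask_quality · class_prob`. It is an illustration only, not the code in `inference.py`: the `Instance` class, the `rank_instances` helper, and its `min_score` threshold are hypothetical, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One predicted instance (hypothetical container for illustration)."""
    label: str
    objectness: float    # how likely the query is a real object
    mask_quality: float  # confidence in the predicted mask
    class_prob: float    # probability of the assigned text label
    num_points: int      # number of points in the mask

    @property
    def score(self) -> float:
        # Product of the three terms, as described in this README.
        return self.objectness * self.mask_quality * self.class_prob


def rank_instances(instances, min_score=0.0):
    """Drop instances below min_score (an assumed option) and sort by score, highest first."""
    kept = [inst for inst in instances if inst.score >= min_score]
    return sorted(kept, key=lambda inst: inst.score, reverse=True)


# Made-up values, only to show the ordering.
demo = [
    Instance("monitor", objectness=0.95, mask_quality=0.5, class_prob=0.4, num_points=310),
    Instance("chair", objectness=0.9, mask_quality=0.8, class_prob=0.7, num_points=1520),
    Instance("desk", objectness=0.6, mask_quality=0.9, class_prob=0.5, num_points=4100),
]

for inst in rank_instances(demo):
    print(f"{inst.label:<8} {inst.score:.3f} {inst.num_points}")
```

Because the score is a product, an instance needs all three terms to be reasonably high to rank well: in the sample data, the monitor has the highest objectness but ranks last because its mask quality and class probability are low.
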
## License

Apache-2.0, matching the WarpConvNet `space_former.py` SPDX header.