Merge SpaceFormer demo (viser + CLI + Gradio) under demo/

Files changed (10)

demo/README.md +112 -0
demo/app.py +183 -0
demo/clip_eval.py +91 -0
demo/demo_viser.py +341 -0
demo/inference.py +100 -0
demo/labels.py +245 -0
demo/pipeline.py +203 -0
demo/postprocessing.py +577 -0
demo/requirements.txt +15 -0
demo/text_encoder.py +218 -0

demo/README.md ADDED

	@@ -0,0 +1,112 @@

+---
+title: SpaceFormer Open-Vocab 3D Instance Segmentation
+emoji: 🧩
+colorFrom: indigo
+colorTo: green
+sdk: gradio
+sdk_version: 4.44.0
+app_file: app.py
+pinned: false
+license: apache-2.0
+tags:
+  - 3d
+  - point-cloud
+  - instance-segmentation
+  - open-vocabulary
+---
+# SpaceFormer — Open-Vocabulary 3D Instance Segmentation (demo)
+Proposal-free **open-vocabulary 3D instance segmentation**. A Mask2Former-style query
+decoder (learned queries + RoPE) on top of the WarpConvNet `SpaCeFormer` backbone: one
+forward pass over an RGB point cloud produces query masks + per-query CLIP features,
+which are labeled against text embeddings of **arbitrary** class names (SigLIP2, with
+prompt ensembling) — the vocabulary is chosen at inference time.
+Released checkpoint:
+| Benchmark | mAP |
+|---|---|
+| ScanNet200 | 0.1265 |
+| ScanNet++ | 0.2217 |
+| Replica | 0.2644 |
+This repo is the **demo / inference layer**. The model itself lives in WarpConvNet
+(`warpconvnet.models.spaceformer`); this repo only adds the Gradio UI (`app.py`), a
+local viser viewer (`demo_viser.py`), and a CLI inference entry point (`inference.py`).
+## Requirements
+```bash
+pip install -r requirements.txt
+```
+> **WarpConvNet must be installed with its compiled extension** (a pre-built wheel, or
+> build from source). It is intentionally not pinned in `requirements.txt` because it is
+> environment-specific. `transformers` pulls the SigLIP2 text encoder
+> (`google/siglip2-so400m-patch14-224`) on first use.
+## Live demo (Gradio / HuggingFace Space)
+```bash
+HF_REPO_ID=chrischoy/SpaCeFormer python app.py
+# or a local checkpoint:
+SPACEFORMER_CKPT=/path/to/spaceformer_512_siglip2_ssccc.ckpt python app.py
+```
+Upload a point cloud, type comma-separated class names, get an interactive 3D view
+colored by predicted instance + a ranked table. As a **HuggingFace Space**: create a
+**GPU** Gradio Space, install WarpConvNet + `requirements.txt` in the image, and set the
+Space variables `HF_REPO_ID` (and optional `HF_FILENAME`, default
+`spaceformer_512_siglip2_ssccc.ckpt`).
+## Local demo (viser)
+An interactive, self-contained local demo that takes **text class names**, runs
+segmentation, and visualizes the result in the browser with
+[viser](https://viser.studio) — each predicted instance gets a distinct color,
+unassigned points stay grey, and a GUI panel lists the top instances.
+```bash
+# auto-download the checkpoint + use a bundled sample point cloud
+python demo_viser.py --port 8080
+# your own cloud + vocabulary, local checkpoint
+python demo_viser.py --ckpt /path/to/spaceformer_512_siglip2_ssccc.ckpt \
+    --ply my_scene.ply --class-names chair table monitor wall floor
+# full ScanNet200 label set
+python demo_viser.py --ply my_scene.ply --use-scannet200
+```
+Then open the printed URL (default `http://localhost:8080`) in a browser.
+With no `--ply`, the demo uses an open3d bundled sample cloud (or a synthesized
+random RGB cloud) — a generic cloud won't segment meaningfully; it only
+demonstrates that the pipeline + viewer run end to end. The demo colors the
+model's **output** points (`out["backbone_pc"].coordinates`), which are what the
+predicted masks index into after the model's internal voxelization — not the raw
+`.ply` points, whose count may differ.
+## CLI inference
+```bash
+# local checkpoint
+python inference.py --ckpt /path/to/spaceformer_512_siglip2_ssccc.ckpt \
+    --scene /path/to/scene_dir                 # dir with coord.npy + color.npy
+# or auto-download from a HuggingFace model repo
+HF_REPO_ID=chrischoy/SpaCeFormer python inference.py \
+    --scene my_scene.ply --class-names "office chair" "desk" "monitor" "other"
+# full ScanNet200 label set
+python inference.py --ckpt <ckpt> --scene <scene> --use-scannet200
+```
+`--scene` accepts a directory with `coord.npy` (`[N,3]` float, meters) + `color.npy`
+(`[N,3]`, 0–255), a `.npz` with keys `{coord,color}`, an `[N,6]` `.npy` (xyz,rgb), or a
+`.ply`. Coordinates stay in **meters**; the model voxelizes internally at 2 cm. Output:
+a ranked list of `{label, score, #points}`, where
+`score = objectness · mask_quality · class_prob`.
+## License
+Apache-2.0, matching the WarpConvNet `space_former.py` SPDX header.
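
The README's scoring rule, `score = objectness · mask_quality · class_prob`, can be illustrated with a minimal pure-Python sketch. It assumes `objectness` and `mask_quality` are already computed per query (the demo gets them from `postprocessing.py`) and that `class_prob` is the largest softmax probability over the text-class logits, as in `demo_viser.py`. The names `softmax` and `instance_score` are hypothetical and exist only for this illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def instance_score(objectness, mask_quality, class_logits):
    """Return (score, class_id) for a single query.

    score = objectness * mask_quality * class_prob, where class_prob is the
    largest softmax probability over the text classes.
    """
    probs = softmax(class_logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return objectness * mask_quality * probs[class_id], class_id


# One query with a confident match to class 0 out of three classes.
score, class_id = instance_score(0.9, 0.8, [2.0, 0.0, 0.0])
print(f"class {class_id}, score {score:.3f}")
```

Because the three factors are multiplied, a query ranks highly only if it is a confident foreground object, has a clean mask, and matches one class name clearly.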

demo/app.py ADDED

	@@ -0,0 +1,183 @@

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""HuggingFace Space: SpaceFormer open-vocabulary 3D instance segmentation release.
+Presentation/deployment layer. All model + inference logic is imported from the
+installed ``warpconvnet`` library (``warpconvnet.models.spaceformer``); this file
+only adds the Gradio UI, the 3D Plotly viewer, and checkpoint download.
+Upload an RGB point cloud (.ply / [N,6] .npy / .npz), type comma-separated class
+names, and get an interactive 3D view colored by predicted instance + a ranked
+table of {label, score, #points}.
+WarpConvNet (with its compiled extension) and transformers must be installed in
+the Space image. Configure the checkpoint via Space variables:
+    HF_REPO_ID        model repo holding the checkpoint (e.g. chrischoy/SpaCeFormer)
+    HF_FILENAME       checkpoint filename (default: spaceformer_512_siglip2_ssccc.ckpt)
+    SPACEFORMER_CKPT  explicit local checkpoint path (overrides the HF download)
+"""
+import os
+import numpy as np
+import torch
+from warpconvnet.models.spaceformer import (
+    build_spaceformer,
+    load_spaceformer_checkpoint,
+)
+from labels import DEFAULT_CLASS_NAMES, PROMPT_TEMPLATES
+from pipeline import (
+    SIGLIP_MODEL_ID,
+    load_scene,
+    make_batch,
+    predict_instances,
+)
+HF_REPO_ID = os.environ.get("HF_REPO_ID", "")
+HF_FILENAME = os.environ.get("HF_FILENAME", "spaceformer_512_siglip2_ssccc.ckpt")
+_DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+_STATE = {"net": None, "clip": None}  # lazy singletons, kept resident across requests
+def _resolve_ckpt() -> str:
+    explicit = os.environ.get("SPACEFORMER_CKPT")
+    if explicit:
+        return explicit
+    if not HF_REPO_ID:
+        raise RuntimeError(
+            "Set SPACEFORMER_CKPT to a local checkpoint, or HF_REPO_ID to a "
+            "HuggingFace model repo to auto-download from."
+        )
+    from huggingface_hub import hf_hub_download
+    return hf_hub_download(repo_id=HF_REPO_ID, filename=HF_FILENAME)
+def _get_model():
+    if _STATE["net"] is None:
+        net = build_spaceformer(device=_DEVICE)
+        load_spaceformer_checkpoint(net, _resolve_ckpt())
+        _STATE["net"] = net
+    return _STATE["net"]
+def _get_clip_encoder():
+    if _STATE["clip"] is None:
+        from text_encoder import get_text_encoder
+        _STATE["clip"] = get_text_encoder(
+            model_type="siglip2", model_id=SIGLIP_MODEL_ID, device=str(_DEVICE)
+        )
+    return _STATE["clip"]
+def _text_eval(class_names):
+    from clip_eval import CLIPAlignmentEval
+    evaluator = CLIPAlignmentEval(normalize_input=False)
+    evaluator.prepare_target_embedding(
+        class_names=list(class_names),
+        clip_encoder=_get_clip_encoder(),
+        device=_DEVICE,
+        prompt_templates=list(PROMPT_TEMPLATES),
+    )
+    return evaluator
+def _palette(n):
+    rng = np.random.default_rng(0)
+    return [tuple(int(x) for x in rng.integers(40, 230, size=3)) for _ in range(max(n, 1))]
+def _plot(coord_np, results, top_k, score_thresh):
+    """Plotly 3D scatter colored by instance (top_k by score)."""
+    import plotly.graph_objects as go
+    kept = [r for r in results if r["score"] >= score_thresh][:top_k]
+    rgb = np.full((coord_np.shape[0], 3), 160, dtype=np.uint8)  # grey background
+    palette = _palette(len(kept))
+    for i, r in enumerate(kept):
+        rgb[r["mask"]] = palette[i]
+    colors = [f"rgb({c[0]},{c[1]},{c[2]})" for c in rgb]
+    # Subsample for browser responsiveness.
+    n = coord_np.shape[0]
+    if n > 120_000:
+        idx = np.random.default_rng(0).choice(n, 120_000, replace=False)
+    else:
+        idx = np.arange(n)
+    fig = go.Figure(
+        data=[go.Scatter3d(
+            x=coord_np[idx, 0], y=coord_np[idx, 1], z=coord_np[idx, 2],
+            mode="markers",
+            marker=dict(size=1.5, color=[colors[i] for i in idx]),
+        )]
+    )
+    fig.update_layout(
+        scene=dict(aspectmode="data"),
+        margin=dict(l=0, r=0, t=0, b=0),
+        showlegend=False,
+    )
+    return fig
+def segment(scene_file, class_text, top_k, score_thresh):
+    """Gradio callback: file + class names -> (3D figure, results table)."""
+    if scene_file is None:
+        return None, [["(upload a point cloud first)", "", ""]]
+    class_names = [c.strip() for c in class_text.split(",") if c.strip()] \
+        or list(DEFAULT_CLASS_NAMES)
+    path = scene_file.name if hasattr(scene_file, "name") else scene_file
+    coord_np, color_np = load_scene(path)
+    batch = make_batch(coord_np, color_np, _DEVICE)
+    net = _get_model()
+    results = predict_instances(net, batch, _text_eval(class_names), class_names)
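+    fig = _plot(coord_np, results, int(top_k), float(score_thresh))
+    # Apply the same threshold + top-k cut as the plot so the table matches it.
+    kept = [r for r in results if r["score"] >= float(score_thresh)][: int(top_k)]
+    table = [
+        [r["label"], f"{r['score']:.3f}", int(r["mask"].sum())]
+        for r in kept
+    ]
+    if not table:
+        table = [["(no instances above threshold)", "", ""]]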
+    return fig, table
+def build_interface():
+    import gradio as gr
+    with gr.Blocks(title="SpaceFormer — Open-Vocab 3D Instance Segmentation") as demo:
+        gr.Markdown(
+            "# SpaceFormer\n"
+            "Proposal-free **open-vocabulary 3D instance segmentation**. Upload an "
+            "RGB point cloud, type any class names, and get instance masks labeled "
+            "against your vocabulary (SigLIP2 text + prompt ensembling).\n\n"
+            "Released checkpoint: ScanNet200 **0.1265** / ScanNet++ 0.2217 / Replica 0.2644."
+        )
+        with gr.Row():
+            with gr.Column(scale=1):
+                scene_file = gr.File(label="Point cloud (.ply / .npy[N,6] / .npz)")
+                class_text = gr.Textbox(
+                    label="Class names (comma-separated)",
+                    value=", ".join(DEFAULT_CLASS_NAMES),
+                )
+                top_k = gr.Slider(1, 100, value=30, step=1, label="Max instances shown")
+                score_thresh = gr.Slider(0.0, 1.0, value=0.0, step=0.01, label="Score threshold")
+                run = gr.Button("Segment", variant="primary")
+            with gr.Column(scale=2):
+                plot = gr.Plot(label="Predicted instances (colored)")
+                table = gr.Dataframe(
+                    headers=["label", "score", "#points"],
+                    label="Instances",
+                    wrap=True,
+                )
+        run.click(segment, [scene_file, class_text, top_k, score_thresh], [plot, table])
+    return demo
+if __name__ == "__main__":
+    build_interface().launch(server_name="0.0.0.0")
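
The checkpoint lookup order used by `app.py` (explicit `SPACEFORMER_CKPT` first, then a download from `HF_REPO_ID` / `HF_FILENAME`) can be sketched without any network access. This is a minimal illustration, not the shipped code: `resolve_checkpoint` and `fake_download` are hypothetical names, the environment is passed in as a plain dict, and the injected `download` callable stands in for `huggingface_hub.hf_hub_download`.

```python
DEFAULT_FILENAME = "spaceformer_512_siglip2_ssccc.ckpt"


def resolve_checkpoint(env, download):
    """Pick a checkpoint path from an environment mapping.

    Precedence: SPACEFORMER_CKPT (local path) wins; otherwise HF_REPO_ID plus
    HF_FILENAME are handed to `download(repo_id, filename)`.
    """
    explicit = env.get("SPACEFORMER_CKPT")
    if explicit:
        return explicit
    repo_id = env.get("HF_REPO_ID", "")
    if not repo_id:
        raise RuntimeError("Set SPACEFORMER_CKPT or HF_REPO_ID.")
    filename = env.get("HF_FILENAME", DEFAULT_FILENAME)
    return download(repo_id, filename)


def fake_download(repo_id, filename):
    """Hypothetical stand-in for a real download; returns a made-up cache path."""
    return f"/cache/{repo_id}/{filename}"


local = resolve_checkpoint({"SPACEFORMER_CKPT": "/tmp/model.ckpt"}, fake_download)
remote = resolve_checkpoint({"HF_REPO_ID": "chrischoy/SpaCeFormer"}, fake_download)
print(local)
print(remote)
```

Passing the environment and the downloader as arguments keeps the precedence logic testable in isolation; the real scripts read `os.environ` directly.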

demo/clip_eval.py ADDED

	@@ -0,0 +1,91 @@

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""Open-vocabulary CLIP alignment for inference.
+Self-contained extract of the training repo's ``CLIPAlignmentEval``: encode class
+names into a text-embedding matrix (optionally with prompt ensembling) and score
+per-query CLIP features against it via cosine similarity.
+"""
+import logging
+from typing import List, Optional
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+log = logging.getLogger(__name__)
+class CLIPAlignmentEval(nn.Module):
+    """Cosine-similarity classifier between per-query CLIP features and text embeddings.
+    Args:
+        normalize_input: L2-normalize the query features before the product with the
+            (L2-normalized) text embeddings. For SpaceFormer set this to ``False``: the
+            clip head output is used as-is (matches the official eval recipe).
+    """
+    def __init__(self, normalize_input: bool = False):
+        super().__init__()
+        self.normalize_input = normalize_input
+        self.emb_target: Optional[torch.Tensor] = None  # [C, D] L2-normalized
+    def set_target_embedding(self, text_embeddings: torch.Tensor) -> None:
+        self.emb_target = text_embeddings.float()
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        if self.normalize_input:
+            return F.normalize(x, p=2, dim=1)
+        return x
+    def predict(self, x: torch.Tensor, return_logit: bool = False) -> torch.Tensor:
+        """Score features ``x`` [Q, D] against the text embeddings -> [Q, C]."""
+        assert self.emb_target is not None, "call prepare_target_embedding() first"
+        pred = self.forward(x)
+        logit = torch.matmul(pred, self.emb_target.t().to(pred.dtype))
+        if return_logit:
+            return logit
+        return logit.argmax(dim=1)
+    @torch.inference_mode()
+    def prepare_target_embedding(
+        self,
+        class_names: List[str],
+        clip_encoder: nn.Module,
+        device: torch.device,
+        use_prompt: bool = False,
+        prompt_template: Optional[str] = None,
+        prompt_templates: Optional[List[str]] = None,
+    ) -> None:
+        """Encode ``class_names`` into the [C, D] target matrix.
+        Three prompting modes, checked in this order (the first one set wins):
+          - ``prompt_templates``: prompt ensembling. Render each class under every
+            ``"... {} ..."`` template, per-row L2-normalize, mean, re-normalize.
+            This is the recommended eval-time free win.
+          - ``prompt_template``: a single ``"... {c} ..."`` format string.
+          - ``use_prompt``: the OpenScene default ``"a {} in a scene"``.
+        With none of them set, the bare class names are encoded.
+        In the ``prompt_templates`` and ``use_prompt`` modes, any class name that
+        contains ``"other"`` is encoded as the bare string ``"other"`` (no template)
+        so the background/void class stays neutral; ``prompt_template`` applies its
+        template to every name.
+        """
+        log.info("Preparing CLIP target embedding for %d classes", len(class_names))
+        if prompt_templates:
+            log.info("Prompt ensembling over %d templates", len(prompt_templates))
+            ensembled = None
+            for template in prompt_templates:
+                rendered = [
+                    template.format(c) if "other" not in c else "other" for c in class_names
+                ]
+                emb = F.normalize(clip_encoder(rendered, normalize=True).float(), p=2, dim=-1)
+                ensembled = emb if ensembled is None else ensembled + emb
+            text_embedding = F.normalize(ensembled / float(len(prompt_templates)), p=2, dim=-1)
+        elif prompt_template is not None:
+            rendered = [prompt_template.format(c=c) for c in class_names]
+            text_embedding = clip_encoder(rendered, normalize=True)
+        elif use_prompt:
+            rendered = [f"a {c} in a scene" if "other" not in c else "other" for c in class_names]
+            text_embedding = clip_encoder(rendered, normalize=True)
+        else:
+            text_embedding = clip_encoder(class_names, normalize=True)
+        self.set_target_embedding(text_embedding.to(device))
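
The prompt-ensembling branch of `prepare_target_embedding` (render every class under every template, L2-normalize each row, average, re-normalize) can be sketched in pure Python. This is a toy illustration under stated assumptions: `toy_encode` is a hypothetical stand-in for the SigLIP2 text encoder, the templates are examples rather than the repo's `PROMPT_TEMPLATES`, the list of templates is assumed non-empty, and the special handling of `"other"` is omitted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def ensemble_text_embeddings(class_names, templates, encode):
    """Average per-template embeddings for each class, then re-normalize.

    `encode` maps a list of strings to a list of equal-length float vectors.
    """
    summed = None
    for template in templates:
        prompts = [template.format(name) for name in class_names]
        rows = [l2_normalize(v) for v in encode(prompts)]
        if summed is None:
            summed = rows
        else:
            summed = [[a + b for a, b in zip(s, r)] for s, r in zip(summed, rows)]
    count = float(len(templates))
    return [l2_normalize([x / count for x in row]) for row in summed]


def toy_encode(prompts):
    """Hypothetical encoder: 2-D features derived from simple string statistics."""
    return [[float(len(p)), float(p.count("a") + 1)] for p in prompts]


embeddings = ensemble_text_embeddings(
    ["chair", "table"], ["a photo of a {}", "a {} in a scene"], toy_encode
)
print(len(embeddings), len(embeddings[0]))
```

Normalizing each row before averaging gives every template equal weight regardless of its raw embedding magnitude; the final normalization restores unit length so the result can be used in a cosine-style product.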

demo/demo_viser.py ADDED

	@@ -0,0 +1,341 @@

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""Local viser demo: SpaceFormer open-vocabulary 3D instance segmentation.
+Takes TEXT class names, runs one forward pass of the released SpaceFormer over a
+point cloud, and visualizes the predicted instances in a browser with
+`viser <https://viser.studio>`_ (each kept instance a distinct color; unassigned
+points grey). Model + pipeline come from the installed ``warpconvnet`` library
+and the sibling ``pipeline.py`` / ``postprocessing.py`` in this repo — this file
+only adds the forward + viser visualization glue.
+  # local checkpoint, custom vocabulary
+  python demo_viser.py --ckpt /path/to/spaceformer_512_siglip2_ssccc.ckpt \
+      --ply my_scene.ply --class-names chair table monitor wall floor
+  # auto-download the checkpoint from HuggingFace and use a bundled sample cloud
+  python demo_viser.py --port 8080          # falls back to chrischoy/SpaCeFormer
+Then open the printed URL (default http://localhost:8080) in a browser.
+CRITICAL coord/mask alignment (see space_former_seg.py):
+  The model voxelizes internally (PointToSparseWrapper), so its output masks are
+  over the model's OUTPUT points, whose count may NOT equal the raw .ply point
+  count. In eval mode the forward returns ``out["backbone_pc"]`` (a warpconvnet
+  ``Points``) whose ``.coordinates`` correspond 1:1 with the per-point mask rows
+  in ``out["mask"][0]``. We therefore run the forward pass HERE (mirroring
+  pipeline.predict_instances) so we can pull the backbone_pc coordinates and the
+  post-processed masks together, and color THOSE coordinates — never the raw
+  input coords, which may differ in length/order.
+NOTE: WarpConvNet (with its compiled ``_C`` extension) + transformers + a
+checkpoint are required to actually run. ``--help`` and import of this file work
+without viser/WarpConvNet (both are imported lazily inside ``main``).
+"""
+import argparse
+import os
+import tempfile
+import time
+import numpy as np
+import torch
+# Reused, task-specific glue from this repo (scene I/O, eval transforms, text
+# embeddings) — we deliberately do NOT reuse predict_instances() wholesale
+# because we also need the aligned backbone_pc coordinates, so we inline the
+# forward below and call apply_post_processing() directly.
+from labels import CLASS_LABELS_200, DEFAULT_CLASS_NAMES
+from pipeline import (
+    POST_PROCESSING_CFG,
+    build_text_embeddings,
+    load_scene,
+    make_batch,
+)
+from postprocessing import apply_post_processing
+HF_REPO_ID = os.environ.get("HF_REPO_ID", "chrischoy/SpaCeFormer")
+HF_FILENAME = os.environ.get("HF_FILENAME", "spaceformer_512_siglip2_ssccc.ckpt")
+# --------------------------------------------------------------------------- #
+# Checkpoint + sample-scene resolution
+# --------------------------------------------------------------------------- #
+def resolve_ckpt(ckpt_arg):
+    """Return a local checkpoint path: --ckpt, $SPACEFORMER_CKPT, or HF download."""
+    if ckpt_arg:
+        return ckpt_arg
+    explicit = os.environ.get("SPACEFORMER_CKPT")
+    if explicit:
+        return explicit
+    from huggingface_hub import hf_hub_download
+    print(f"[ckpt] downloading {HF_FILENAME} from HuggingFace repo {HF_REPO_ID} ...")
+    return hf_hub_download(repo_id=HF_REPO_ID, filename=HF_FILENAME)
+def resolve_sample_ply(ply_arg):
+    """Return a path to a .ply. If --ply is unset, use a zero-config sample.
+    Preference order (no fragile external URLs):
+      1) open3d's bundled ``PLYPointCloud`` sample (a small RGB point cloud), or
+      2) a synthesized random RGB point cloud written to a temp .ply.
+    A random/sample cloud will NOT segment into meaningful instances — it only
+    demonstrates that the pipeline + visualization run end to end.
+    """
+    if ply_arg:
+        return ply_arg
+    # 1) open3d bundled sample (downloaded/cached by open3d itself, offline after).
+    try:
+        import open3d as o3d
+        sample_path = o3d.data.PLYPointCloud().path
+        if os.path.isfile(sample_path):
+            print(f"[sample] using open3d bundled PLYPointCloud sample: {sample_path}")
+            print("[sample] NOTE: a generic sample cloud won't segment meaningfully; "
+                  "it's only to demo the pipeline + viz.")
+            return sample_path
+    except Exception as exc:  # noqa: BLE001 - any open3d issue -> fall back to synthetic
+        print(f"[sample] open3d sample unavailable ({exc}); synthesizing a random cloud.")
+    # 2) Synthesize a small random RGB point cloud and write a temp .ply.
+    rng = np.random.default_rng(0)
+    n = 20_000
+    coord = rng.uniform(-2.0, 2.0, size=(n, 3)).astype(np.float32)  # meters
+    color = rng.integers(0, 256, size=(n, 3)).astype(np.uint8)
+    tmp = tempfile.NamedTemporaryFile(suffix=".ply", delete=False)
+    tmp.close()
+    _write_ply(tmp.name, coord, color)
+    print(f"[sample] wrote synthetic random RGB cloud ({n} pts) to {tmp.name}")
+    print("[sample] NOTE: a random cloud won't segment meaningfully; it's only to "
+          "demo the pipeline + viz.")
+    return tmp.name
+def _write_ply(path, coord, color):
+    """Write an RGB .ply: ASCII via plyfile if available, else fall back to open3d."""
+    try:
+        from plyfile import PlyData, PlyElement
+        verts = np.empty(
+            coord.shape[0],
+            dtype=[("x", "f4"), ("y", "f4"), ("z", "f4"),
+                   ("red", "u1"), ("green", "u1"), ("blue", "u1")],
+        )
+        verts["x"], verts["y"], verts["z"] = coord[:, 0], coord[:, 1], coord[:, 2]
+        verts["red"], verts["green"], verts["blue"] = color[:, 0], color[:, 1], color[:, 2]
+        PlyData([PlyElement.describe(verts, "vertex")], text=True).write(path)
+        return
+    except ImportError:
+        pass
+    import open3d as o3d
+    pcd = o3d.geometry.PointCloud()
+    pcd.points = o3d.utility.Vector3dVector(coord.astype(np.float64))
+    pcd.colors = o3d.utility.Vector3dVector(color.astype(np.float64) / 255.0)
+    o3d.io.write_point_cloud(path, pcd)
+# --------------------------------------------------------------------------- #
+# Forward + post-processing (aligned to backbone_pc coordinates)
+# --------------------------------------------------------------------------- #
+@torch.inference_mode()
+def segment_aligned(net, batch, text_eval, class_names):
+    """One forward pass -> post-processed instances aligned to backbone_pc coords.
+    Mirrors ``pipeline.predict_instances`` but ALSO returns the model's output
+    coordinates (``out["backbone_pc"].coordinates``), which are the coordinates
+    the returned masks index into (see module docstring / space_former_seg.py).
+    Returns ``(coords[M,3] float32, results)`` where each result is
+    ``{mask: bool[M], label, label_id, score}`` and ``M`` is the backbone point
+    count (== number of rows in ``out["mask"][0]``), NOT the raw .ply count.
+    """
+    out = net(batch)
+    # backbone_pc is present only in eval; its coordinates align 1:1 with the
+    # mask rows. batch size is 1 here, so every point belongs to sample 0.
+    backbone_pc = out["backbone_pc"]
+    coords = backbone_pc.coordinates.detach().cpu().numpy().astype(np.float32)  # [M, 3]
+    binary_logits = out["logit"][0]        # [Q, 2] objectness over {fg, bg}
+    mask_logits = out["mask"][0].T         # [M, Q] -> [Q, M]
+    clip_feats = out["clip_feat"][0]       # [Q, D]
+    pred_iou = out["pred_iou"][0] if "pred_iou" in out else None
+    # Sanity: mask columns must match the backbone point count we will color.
+    assert mask_logits.shape[1] == coords.shape[0], (
+        f"mask columns {mask_logits.shape[1]} != backbone points {coords.shape[0]}; "
+        "coord/mask alignment broken"
+    )
+    class_logits = text_eval.predict(clip_feats, return_logit=True)  # [Q, C]
+    masks, scores, _classes, indices = apply_post_processing(
+        mask_logits,
+        binary_logits,
+        mask_threshold=0.0,
+        point_coords=None,
+        pp_cfg=POST_PROCESSING_CFG,
+        pred_iou=pred_iou,
+    )
+    results = []
+    if len(indices) > 0:
+        probs = torch.softmax(class_logits[indices], dim=-1)  # [K, C]
+        class_probs, class_ids = probs.max(dim=1)
+        final_scores = scores * class_probs
+        for k in range(len(indices)):
+            results.append(
+                {
+                    "mask": masks[k].cpu().numpy().astype(bool),  # bool[M], aligned to coords
+                    "label": class_names[int(class_ids[k])],
+                    "label_id": int(class_ids[k]),
+                    "score": float(final_scores[k]),
+                }
+            )
+        results.sort(key=lambda r: r["score"], reverse=True)
+    return coords, results
+# --------------------------------------------------------------------------- #
+# Coloring + viser visualization
+# --------------------------------------------------------------------------- #
+def _instance_colors(coords, results, top_k, score_thresh):
+    """Grey base cloud + a distinct random color per kept (top-k, thresholded) instance.
+    Returns ``(rgb[M,3] uint8, kept)`` where ``kept`` is the list of instances
+    actually colored (already sorted by score, high first).
+    """
+    rgb = np.full((coords.shape[0], 3), 160, dtype=np.uint8)  # grey background
+    kept = [r for r in results if r["score"] >= score_thresh][:top_k]
+    rng = np.random.default_rng(0)
+    for r in kept:
+        color = rng.integers(40, 230, size=3).astype(np.uint8)
+        r["color"] = tuple(int(c) for c in color)  # stash for GUI legend
+        rgb[r["mask"]] = color
+    return rgb, kept
+def visualize(coords, results, port, top_k, score_thresh, point_size):
+    """Start a viser server, add the colored point cloud + a GUI legend, block."""
+    import viser  # imported lazily so --help works without viser installed
+    rgb, kept = _instance_colors(coords, results, top_k, score_thresh)
+    server = viser.ViserServer(port=port)
+    # Point cloud: positions from backbone_pc coords, colors per instance.
+    server.scene.add_point_cloud(
+        name="/scene",
+        points=coords.astype(np.float32),
+        colors=rgb,               # uint8 [M, 3]
+        point_size=point_size,
+    )
+    # A small GUI panel listing the kept instances (label + score + #points).
+    with server.gui.add_folder(f"Top {len(kept)} instances"):
+        if not kept:
+            server.gui.add_markdown("_(no instances above the score threshold)_")
+        for r in kept:
+            c = r["color"]
+            swatch = f'<span style="color:rgb({c[0]},{c[1]},{c[2]})">&#9632;</span>'
+            server.gui.add_markdown(
+                f"{swatch} **{r['label']}** — score {r['score']:.3f}, "
+                f"{int(r['mask'].sum())} pts"
+            )
+    url = f"http://localhost:{port}"
+    print(f"\n[viser] serving at {url}  (open this URL in your browser)")
+    print("[viser] press Ctrl-C to stop.")
+    try:
+        while True:
+            time.sleep(2.0)
+    except KeyboardInterrupt:
+        print("\n[viser] shutting down.")
+def print_summary(results, top_k):
+    """Ranked text summary of instances (label, score, #points) to stdout."""
+    print(f"\n=== {len(results)} predicted instances ===")
+    header = f"{'#':>3}  {'score':>6}  {'points':>8}  label"
+    print(header)
+    print("-" * len(header))
+    for i, r in enumerate(results[:top_k]):
+        print(f"{i:>3}  {r['score']:>6.3f}  {int(r['mask'].sum()):>8}  {r['label']}")
+    if len(results) > top_k:
+        print(f"... ({len(results) - top_k} more)")
+# --------------------------------------------------------------------------- #
+# Entry point (linear top-to-bottom)
+# --------------------------------------------------------------------------- #
+def main():
+    ap = argparse.ArgumentParser(
+        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
+    )
+    ap.add_argument("--ply", default=None,
+                    help="input point cloud .ply (default: an open3d bundled sample "
+                         "or a synthesized random cloud)")
+    ap.add_argument("--class-names", nargs="+", default=None,
+                    help="open-vocab class names (default: a short built-in list)")
+    ap.add_argument("--use-scannet200", action="store_true",
+                    help="use the full ScanNet200 label set as class names")
+    ap.add_argument("--ckpt", default=None,
+                    help="checkpoint path (else $SPACEFORMER_CKPT, else HF download)")
+    ap.add_argument("--iou-head", action="store_true",
+                    help="build the learned IoU head (only for a checkpoint that has one)")
+    ap.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
+    ap.add_argument("--port", type=int, default=8080, help="viser server port")
+    ap.add_argument("--score-thresh", type=float, default=0.0,
+                    help="only color instances with score >= this")
+    ap.add_argument("--top-k", type=int, default=30,
+                    help="max number of instances to color / list")
+    ap.add_argument("--point-size", type=float, default=0.02,
+                    help="viser point size (world units / meters)")
+    args = ap.parse_args()
+    device = torch.device(args.device)
+    # Vocabulary (text class names — the open-vocabulary input).
+    if args.use_scannet200:
+        class_names = list(CLASS_LABELS_200)
+    else:
+        class_names = args.class_names or list(DEFAULT_CLASS_NAMES)
+    print(f"[vocab] {len(class_names)} classes: {', '.join(class_names[:8])}"
+          f"{' ...' if len(class_names) > 8 else ''}")
+    # 1) Resolve/load the scene (.ply -> coord[N,3] meters, color[N,3] 0-255).
+    ply_path = resolve_sample_ply(args.ply)
+    coord_np, color_np = load_scene(ply_path)
+    # 2) Build the eval batch (CenterShift + NormalizeColor + offsets).
+    batch = make_batch(coord_np, color_np, device)
+    # 3) Build model + load released weights (WarpConvNet required here).
+    from warpconvnet.models.spaceformer import (
+        build_spaceformer,
+        load_spaceformer_checkpoint,
+    )
+    net = build_spaceformer(use_iou_head=args.iou_head, device=device)
+    missing, unexpected = load_spaceformer_checkpoint(net, resolve_ckpt(args.ckpt))
+    print(f"[weights] {len(missing)} missing, {len(unexpected)} unexpected")
+    # 4) Encode the text class names (SigLIP2 + prompt ensembling).
+    text_eval = build_text_embeddings(class_names, device)
+    # 5) Forward + post-process -> masks + labels + scores aligned to backbone coords.
+    coords, results = segment_aligned(net, batch, text_eval, class_names)
+    print(f"[model] backbone output points: {coords.shape[0]} "
+          f"(raw input was {coord_np.shape[0]})")
+    # 6) Text summary + interactive viser visualization.
+    print_summary(results, args.top_k)
+    visualize(coords, results, args.port, args.top_k, args.score_thresh, args.point_size)
+if __name__ == "__main__":
+    main()
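
The coloring rule in `_instance_colors` (grey base cloud, one seeded random color per instance that passes the score threshold and the top-k cut) can be sketched with plain lists instead of NumPy arrays. `color_instances` is a hypothetical name for this illustration; it assumes `results` is already sorted by score, high first, as `segment_aligned` returns it.

```python
import random

GREY = (160, 160, 160)


def color_instances(num_points, results, top_k, score_thresh, seed=0):
    """Return (per-point RGB list, kept instances) for a ranked result list.

    Each entry of `results` has a boolean `mask` of length `num_points` and a
    float `score`. Points not covered by any kept mask stay grey.
    """
    kept = [r for r in results if r["score"] >= score_thresh][:top_k]
    rng = random.Random(seed)
    rgb = [GREY] * num_points
    colored = []
    for r in kept:
        color = tuple(rng.randrange(40, 230) for _ in range(3))
        for i, inside in enumerate(r["mask"]):
            if inside:
                rgb[i] = color
        colored.append({**r, "color": color})
    return rgb, colored


results = [
    {"mask": [True, True, False, False], "score": 0.9},
    {"mask": [False, False, True, False], "score": 0.2},
]
rgb, kept = color_instances(4, results, top_k=5, score_thresh=0.5)
print(len(kept), rgb[3])
```

As in the original loop, instances are painted in score order, so where masks overlap a later (lower-scoring) instance overwrites an earlier one.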

demo/inference.py ADDED

	@@ -0,0 +1,100 @@

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""SpaceFormer single-scene open-vocabulary 3D instance segmentation (CLI inference).
+Self-contained command-line entry point for the released
+SpaceFormer. The model + inference pipeline come from the installed ``warpconvnet``
+library (``warpconvnet.models.spaceformer``); this script only wires up checkpoint
+resolution (local path or HuggingFace auto-download), runs one scene, and
+prints/saves the predicted instances.
+  # weights from a local checkpoint
+  python inference.py --ckpt /path/to/spaceformer_512_siglip2_ssccc.ckpt \
+      --scene /path/to/scene_dir
+  # or auto-download from a HuggingFace model repo
+  HF_REPO_ID=chrischoy/SpaCeFormer python inference.py \
+      --scene my_scene.ply --class-names "office chair" "desk" "monitor" "other"
+Requires WarpConvNet (with its compiled extension) + transformers in the environment.
+"""
+import argparse
+import os
+import torch
+from warpconvnet.models.spaceformer import (
+    build_spaceformer,
+    load_spaceformer_checkpoint,
+)
+from labels import CLASS_LABELS_200, DEFAULT_CLASS_NAMES
+from pipeline import (
+    build_text_embeddings,
+    load_scene,
+    make_batch,
+    predict_instances,
+    print_summary,
+    save_results,
+)
+HF_FILENAME = os.environ.get("HF_FILENAME", "spaceformer_512_siglip2_ssccc.ckpt")
+def resolve_ckpt(ckpt_arg: str | None) -> str:
+    """Return a local checkpoint path: explicit --ckpt, $SPACEFORMER_CKPT, or HF download."""
+    if ckpt_arg:
+        return ckpt_arg
+    explicit = os.environ.get("SPACEFORMER_CKPT")
+    if explicit:
+        return explicit
+    repo_id = os.environ.get("HF_REPO_ID")
+    if not repo_id:
+        raise SystemExit(
+            "No checkpoint: pass --ckpt, or set SPACEFORMER_CKPT (local path) or "
+            "HF_REPO_ID (HuggingFace model repo to download from)."
+        )
+    from huggingface_hub import hf_hub_download
+    return hf_hub_download(repo_id=repo_id, filename=HF_FILENAME)
+def main() -> None:
+    ap = argparse.ArgumentParser(description=__doc__,
+                                 formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("--ckpt", default=None,
+                    help="checkpoint path (else $SPACEFORMER_CKPT, else HF download via $HF_REPO_ID)")
+    ap.add_argument("--scene", required=True,
+                    help="scene dir (coord.npy+color.npy) or .npy/.npz/.ply file")
+    ap.add_argument("--class-names", nargs="+", default=None,
+                    help="open-vocab class names (default: a short built-in list)")
+    ap.add_argument("--use-scannet200", action="store_true",
+                    help="use the full ScanNet200 label set as class names")
+    ap.add_argument("--iou-head", action="store_true",
+                    help="build the learned IoU head (only for a checkpoint that has one)")
+    ap.add_argument("--save", default=None, help="optional output .npz path")
+    ap.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
+    args = ap.parse_args()
+    device = torch.device(args.device)
+    if args.use_scannet200:
+        class_names = list(CLASS_LABELS_200)
+    else:
+        class_names = args.class_names or list(DEFAULT_CLASS_NAMES)
+    print(f"[vocab] {len(class_names)} classes")
+    net = build_spaceformer(use_iou_head=args.iou_head, device=device)
+    missing, unexpected = load_spaceformer_checkpoint(net, resolve_ckpt(args.ckpt))
+    print(f"[weights] {len(missing)} missing, {len(unexpected)} unexpected")
+    coord_np, color_np = load_scene(args.scene)
+    batch = make_batch(coord_np, color_np, device)
+    text_eval = build_text_embeddings(class_names, device)
+    results = predict_instances(net, batch, text_eval, class_names)
+    print_summary(results)
+    if args.save:
+        save_results(results, coord_np, args.save)
+if __name__ == "__main__":
+    main()
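
The checkpoint lookup in `resolve_ckpt` follows a fixed precedence: `--ckpt` first, then `$SPACEFORMER_CKPT`, then a download from `$HF_REPO_ID`. The sketch below shows only that ordering, with the download step left out so it runs offline. `pick_checkpoint_source` and its `(source, value)` return tuple are hypothetical names for illustration and are not part of the demo.

```python
from typing import Mapping, Optional, Tuple


def pick_checkpoint_source(
    ckpt_arg: Optional[str], env: Mapping[str, str]
) -> Tuple[str, str]:
    """Return (source, value) using the same precedence as resolve_ckpt.

    Hypothetical helper: it only reports which source would win and does
    not touch the filesystem or the network.
    """
    if ckpt_arg:  # explicit --ckpt always wins
        return ("cli", ckpt_arg)
    if env.get("SPACEFORMER_CKPT"):  # then a local path from the environment
        return ("env", env["SPACEFORMER_CKPT"])
    if env.get("HF_REPO_ID"):  # finally a HuggingFace repo to download from
        return ("hub", env["HF_REPO_ID"])
    raise ValueError("no checkpoint source configured")
```

Passing the environment as a plain mapping keeps the precedence testable without mutating `os.environ`.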

demo/labels.py ADDED

	@@ -0,0 +1,245 @@

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""Label sets and prompt templates for open-vocabulary SpaceFormer evaluation."""
+# Prompt-ensembling templates. Each must contain a single positional ``{}``
+# placeholder. Averaging class embeddings across these templates improved
+# eval-time accuracy for the released checkpoint.
+PROMPT_TEMPLATES = (
+    "a {} in a scene",
+    "a photo of a {} in a scene",
+    "a {}",
+    "a photo of a {}",
+    "there is a {} in the scene",
+    "a 3d model of a {}",
+)
+# A short, readable default vocabulary for the demo. Pass your own class names to
+# override, or use CLASS_LABELS_200 for the full ScanNet200 benchmark label set.
+DEFAULT_CLASS_NAMES = (
+    "wall",
+    "floor",
+    "ceiling",
+    "chair",
+    "table",
+    "desk",
+    "couch",
+    "bed",
+    "cabinet",
+    "shelf",
+    "door",
+    "window",
+    "monitor",
+    "keyboard",
+    "lamp",
+    "picture",
+    "whiteboard",
+    "trash can",
+    "backpack",
+    "plant",
+    "other",
+)
+# The 200 ScanNet200 instance/semantic class names (benchmark order).
+CLASS_LABELS_200 = (
+    "wall",
+    "chair",
+    "floor",
+    "table",
+    "door",
+    "couch",
+    "cabinet",
+    "shelf",
+    "desk",
+    "office chair",
+    "bed",
+    "pillow",
+    "sink",
+    "picture",
+    "window",
+    "toilet",
+    "bookshelf",
+    "monitor",
+    "curtain",
+    "book",
+    "armchair",
+    "coffee table",
+    "box",
+    "refrigerator",
+    "lamp",
+    "kitchen cabinet",
+    "towel",
+    "clothes",
+    "tv",
+    "nightstand",
+    "counter",
+    "dresser",
+    "stool",
+    "cushion",
+    "plant",
+    "ceiling",
+    "bathtub",
+    "end table",
+    "dining table",
+    "keyboard",
+    "bag",
+    "backpack",
+    "toilet paper",
+    "printer",
+    "tv stand",
+    "whiteboard",
+    "blanket",
+    "shower curtain",
+    "trash can",
+    "closet",
+    "stairs",
+    "microwave",
+    "stove",
+    "shoe",
+    "computer tower",
+    "bottle",
+    "bin",
+    "ottoman",
+    "bench",
+    "board",
+    "washing machine",
+    "mirror",
+    "copier",
+    "basket",
+    "sofa chair",
+    "file cabinet",
+    "fan",
+    "laptop",
+    "shower",
+    "paper",
+    "person",
+    "paper towel dispenser",
+    "oven",
+    "blinds",
+    "rack",
+    "plate",
+    "blackboard",
+    "piano",
+    "suitcase",
+    "rail",
+    "radiator",
+    "recycling bin",
+    "container",
+    "wardrobe",
+    "soap dispenser",
+    "telephone",
+    "bucket",
+    "clock",
+    "stand",
+    "light",
+    "laundry basket",
+    "pipe",
+    "clothes dryer",
+    "guitar",
+    "toilet paper holder",
+    "seat",
+    "speaker",
+    "column",
+    "bicycle",
+    "ladder",
+    "bathroom stall",
+    "shower wall",
+    "cup",
+    "jacket",
+    "storage bin",
+    "coffee maker",
+    "dishwasher",
+    "paper towel roll",
+    "machine",
+    "mat",
+    "windowsill",
+    "bar",
+    "toaster",
+    "bulletin board",
+    "ironing board",
+    "fireplace",
+    "soap dish",
+    "kitchen counter",
+    "doorframe",
+    "toilet paper dispenser",
+    "mini fridge",
+    "fire extinguisher",
+    "ball",
+    "hat",
+    "shower curtain rod",
+    "water cooler",
+    "paper cutter",
+    "tray",
+    "shower door",
+    "pillar",
+    "ledge",
+    "toaster oven",
+    "mouse",
+    "toilet seat cover dispenser",
+    "furniture",
+    "cart",
+    "storage container",
+    "scale",
+    "tissue box",
+    "light switch",
+    "crate",
+    "power outlet",
+    "decoration",
+    "sign",
+    "projector",
+    "closet door",
+    "vacuum cleaner",
+    "candle",
+    "plunger",
+    "stuffed animal",
+    "headphones",
+    "dish rack",
+    "broom",
+    "guitar case",
+    "range hood",
+    "dustpan",
+    "hair dryer",
+    "water bottle",
+    "handicap bar",
+    "purse",
+    "vent",
+    "shower floor",
+    "water pitcher",
+    "mailbox",
+    "bowl",
+    "paper bag",
+    "alarm clock",
+    "music stand",
+    "projector screen",
+    "divider",
+    "laundry detergent",
+    "bathroom counter",
+    "object",
+    "bathroom vanity",
+    "closet wall",
+    "laundry hamper",
+    "bathroom stall door",
+    "ceiling light",
+    "trash bin",
+    "dumbbell",
+    "stair rail",
+    "tube",
+    "bathroom cabinet",
+    "cd case",
+    "closet rod",
+    "coffee kettle",
+    "structure",
+    "shower head",
+    "keyboard piano",
+    "case of water bottles",
+    "coat rack",
+    "storage organizer",
+    "folded chair",
+    "fire alarm",
+    "power strip",
+    "calendar",
+    "poster",
+    "potted plant",
+    "luggage",
+    "mattress",
+)
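
Prompt ensembling with `PROMPT_TEMPLATES` can be illustrated as follows. This is a sketch under assumptions: the real averaging lives in `clip_eval.py`, which is not shown here, so the normalize, average, renormalize order is assumed, and `toy_encode` is a hypothetical stand-in for the SigLIP2 text encoder.

```python
import math
from typing import List, Sequence

# Same templates as PROMPT_TEMPLATES in labels.py.
TEMPLATES = (
    "a {} in a scene",
    "a photo of a {} in a scene",
    "a {}",
    "a photo of a {}",
    "there is a {} in the scene",
    "a 3d model of a {}",
)


def expand_prompts(class_name: str, templates: Sequence[str] = TEMPLATES) -> List[str]:
    """Fill the single positional placeholder of every template."""
    return [t.format(class_name) for t in templates]


def toy_encode(text: str, dim: int = 8) -> List[float]:
    """Hypothetical encoder stand-in: a deterministic bag-of-characters vector."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec


def l2_normalize(vec: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def ensemble_embedding(class_name: str) -> List[float]:
    """Average the unit-norm per-template embeddings, then renormalize (assumed order)."""
    embs = [l2_normalize(toy_encode(p)) for p in expand_prompts(class_name)]
    mean = [sum(col) / len(embs) for col in zip(*embs)]
    return l2_normalize(mean)
```

Each class ends up with one unit-length vector regardless of how many templates are used, so the downstream similarity computation does not change when templates are added or removed.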

demo/pipeline.py ADDED

	@@ -0,0 +1,203 @@

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""Open-vocabulary inference pipeline for the SpaceFormer release.
+Task-specific glue that turns the raw model outputs (logit / mask / clip_feat,
+from ``warpconvnet.models.spaceformer.SpaCeFormerInstSeg``) into labeled instances:
+scene I/O + minimal eval transforms, SigLIP2 text embedding (prompt-ensembled),
+and SAM2-style mask post-processing. Kept in the release repo (not WarpConvNet),
+since labeling/post-processing is downstream of the model.
+"""
+import os
+import numpy as np
+import torch
+from postprocessing import apply_post_processing
+from labels import PROMPT_TEMPLATES
+# SigLIP2 text encoder used at training time. text_dim 1152 == model clip-head dim.
+SIGLIP_MODEL_ID = "google/siglip2-so400m-patch14-224"
+# Voxel size the model was trained/exported at (the model voxelizes internally;
+# this only labels the data_dict for parity).
+GRID_SIZE = 0.02
+# Release eval post-processing: NMS on, min 20 pts/mask, DBSCAN/stability off.
+POST_PROCESSING_CFG = {
+    "use_dbscan": False,
+    "use_stability_score": False,
+    "use_nms": True,
+    "nms_thresh": 0.7,
+    "min_mask_points": 20,
+    "objectness_thresh": 0.0,
+}
+# --------------------------------------------------------------------------- #
+# Scene loading + minimal eval transforms
+# --------------------------------------------------------------------------- #
+def load_scene(scene_path: str):
+    """Load one scene as (coord[N,3] float meters, color[N,3] float 0-255)."""
+    if os.path.isdir(scene_path):
+        coord = np.load(os.path.join(scene_path, "coord.npy"))
+        color = np.load(os.path.join(scene_path, "color.npy"))
+    elif scene_path.endswith(".npz"):
+        z = np.load(scene_path)
+        coord = z[_first_key(z, ("coord", "coords", "xyz", "points"))]
+        color = z[_first_key(z, ("color", "colors", "rgb"))]
+    elif scene_path.endswith(".npy"):
+        arr = np.load(scene_path)
+        assert arr.ndim == 2 and arr.shape[1] >= 6, "expected [N,6] xyz+rgb .npy"
+        coord, color = arr[:, :3], arr[:, 3:6]
+    elif scene_path.endswith(".ply"):
+        coord, color = _load_ply(scene_path)
+    else:
+        raise ValueError(f"unsupported scene format: {scene_path}")
+    coord = np.ascontiguousarray(coord, dtype=np.float32)
+    color = np.ascontiguousarray(color)
+    if color.dtype != np.uint8 and color.max() <= 1.0 + 1e-6:
+        color = (color * 255.0).round()
+    color = color.astype(np.float32)
+    print(f"[scene] {scene_path}: {coord.shape[0]} points")
+    return coord, color
+def _first_key(z, candidates):
+    for c in candidates:
+        if c in z:
+            return c
+    raise KeyError(f"none of {candidates} in {list(z.keys())}")
+def _load_ply(path: str):
+    try:
+        from plyfile import PlyData
+        v = PlyData.read(path)["vertex"]
+        coord = np.stack([v["x"], v["y"], v["z"]], axis=1)
+        if "red" in v.data.dtype.names:
+            color = np.stack([v["red"], v["green"], v["blue"]], axis=1)
+        else:
+            color = np.full_like(coord, 127.5)
+        return coord, color
+    except ImportError:
+        import open3d as o3d
+        pcd = o3d.io.read_point_cloud(path)
+        coord = np.asarray(pcd.points)
+        color = np.asarray(pcd.colors) * 255.0 if pcd.has_colors() else np.full_like(coord, 127.5)
+        return coord, color
+def make_batch(coord_np, color_np, device):
+    """Apply the minimal eval transforms and build the single-sample data dict.
+    CenterShift(apply_z) + NormalizeColor(0-255 -> [-1,1]) + offset [0, N].
+    Coords stay in meters — the model voxelizes internally at 2 cm.
+    """
+    coord = torch.from_numpy(coord_np).float()
+    color = torch.from_numpy(color_np).float()
+    cmin = coord.min(dim=0).values
+    cmax = coord.max(dim=0).values
+    shift = torch.tensor([(cmin[0] + cmax[0]) / 2, (cmin[1] + cmax[1]) / 2, cmin[2]])
+    coord = coord - shift
+    feat = color / 127.5 - 1.0
+    n = coord.shape[0]
+    offset = torch.tensor([0, n], dtype=torch.int32)
+    return {
+        "coord": coord.to(device),
+        "feat": feat.to(device),
+        "offset": offset.to(device),
+        "grid_size": GRID_SIZE,
+    }
+# --------------------------------------------------------------------------- #
+# Text embeddings (prompt-ensembled) + prediction
+# --------------------------------------------------------------------------- #
+def build_text_embeddings(class_names, device):
+    """Encode class names with SigLIP2 under multiple templates and average."""
+    from text_encoder import get_text_encoder
+    from clip_eval import CLIPAlignmentEval
+    clip_encoder = get_text_encoder(model_type="siglip2", model_id=SIGLIP_MODEL_ID, device=str(device))
+    evaluator = CLIPAlignmentEval(normalize_input=False)  # matches official eval
+    evaluator.prepare_target_embedding(
+        class_names=list(class_names),
+        clip_encoder=clip_encoder,
+        device=device,
+        prompt_templates=list(PROMPT_TEMPLATES),  # prompt ensembling ON
+    )
+    return evaluator
+@torch.inference_mode()
+def predict_instances(net, batch, text_eval, class_names):
+    """Single forward pass -> post-processing -> open-vocab labels."""
+    out = net(batch)
+    binary_logits = out["logit"][0]  # [Q, 2] objectness over {fg, bg}
+    mask_logits = out["mask"][0].T  # [N, Q] -> [Q, N]
+    clip_feats = out["clip_feat"][0]  # [Q, D]
+    pred_iou = out["pred_iou"][0] if "pred_iou" in out else None
+    class_logits = text_eval.predict(clip_feats, return_logit=True)  # [Q, C]
+    masks, scores, _classes, indices = apply_post_processing(
+        mask_logits,
+        binary_logits,
+        mask_threshold=0.0,
+        point_coords=None,
+        pp_cfg=POST_PROCESSING_CFG,
+        pred_iou=pred_iou,
+    )
+    results = []
+    if len(indices) > 0:
+        probs = torch.softmax(class_logits[indices], dim=-1)  # [K, C]
+        class_probs, class_ids = probs.max(dim=1)
+        final_scores = scores * class_probs
+        for k in range(len(indices)):
+            results.append(
+                {
+                    "mask": masks[k].cpu().numpy().astype(bool),
+                    "label": class_names[int(class_ids[k])],
+                    "label_id": int(class_ids[k]),
+                    "score": float(final_scores[k]),
+                }
+            )
+        results.sort(key=lambda r: r["score"], reverse=True)
+    return results
+# --------------------------------------------------------------------------- #
+# Output helpers
+# --------------------------------------------------------------------------- #
+def print_summary(results, top_k=20):
+    print(f"\n=== {len(results)} predicted instances ===")
+    header = f"{'#':>3}  {'score':>6}  {'points':>8}  label"
+    print(header)
+    print("-" * len(header))
+    for i, r in enumerate(results[:top_k]):
+        print(f"{i:>3}  {r['score']:>6.3f}  {int(r['mask'].sum()):>8}  {r['label']}")
+    if len(results) > top_k:
+        print(f"... ({len(results) - top_k} more)")
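+def save_results(results, coord_np, out_path):
+    if not results:
+        # Keep the same keys as the non-empty case so readers need no special-casing.
+        np.savez(out_path, coord=coord_np,
+                 masks=np.zeros((0, coord_np.shape[0]), dtype=bool),
+                 labels=np.array([], dtype=str), label_ids=np.array([], dtype=np.int64),
+                 scores=np.array([], dtype=np.float64))
+        print(f"[save] wrote 0 instances to {out_path}")
+        return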
+    np.savez(
+        out_path,
+        coord=coord_np,
+        masks=np.stack([r["mask"] for r in results]),
+        labels=np.array([r["label"] for r in results]),
+        label_ids=np.array([r["label_id"] for r in results]),
+        scores=np.array([r["score"] for r in results]),
+    )
+    print(f"[save] wrote {len(results)} instances to {out_path}")
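
The two eval transforms in `make_batch` are small enough to check by hand. The sketch below restates them in plain Python, without torch, purely for illustration. The function names `center_shift` and `normalize_color` are borrowed from the docstring's transform names and are not functions defined in `pipeline.py`.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def center_shift(points: Sequence[Point]) -> List[Point]:
    """Center x and y on the bounding-box midpoint and put the lowest z at 0."""
    xs, ys, zs = zip(*points)
    shift = ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2, min(zs))
    return [(x - shift[0], y - shift[1], z - shift[2]) for x, y, z in points]


def normalize_color(rgb: Sequence[float]) -> Tuple[float, ...]:
    """Map 0-255 color channels to the [-1, 1] range."""
    return tuple(c / 127.5 - 1.0 for c in rgb)
```

For two points `(0, 0, 1)` and `(2, 4, 3)`, the shift is `(1, 2, 1)`, so the floor lands at `z = 0` and the scene is centered in the horizontal plane. Coordinates stay in meters throughout.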

demo/postprocessing.py ADDED

	@@ -0,0 +1,577 @@

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""
+SAM2-style post-processing utilities for mask segmentation.
+This module provides shared post-processing functions used by both the
+MaskLanguageLitModule (validation/testing) and the demo script.
+"""
+from typing import Tuple, Optional, Dict
+import time
+import numpy as np
+import torch
+try:
+    from cuml.cluster import DBSCAN
+except ImportError:
+    DBSCAN = None
+def calculate_stability_score(
+    masks: torch.Tensor,
+    mask_threshold: float = 0.0,
+    threshold_offset: float = 1.0,
+) -> torch.Tensor:
+    """
+    Computes the stability score for a set of masks.
+    The stability score is the IoU between the binary masks obtained by
+    thresholding at (mask_threshold + threshold_offset) and
+    (mask_threshold - threshold_offset).
+    High stability means sharp mask boundaries.
+    Args:
+        masks: [Q, N] mask logits
+        mask_threshold: Base threshold (usually 0.0 for logits)
+        threshold_offset: Offset to apply for high/low thresholds
+    Returns:
+        stability_score: [Q] stability score per mask
+    """
+    high_thresh_mask = masks > (mask_threshold + threshold_offset)
+    low_thresh_mask = masks > (mask_threshold - threshold_offset)
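+    # With a non-negative offset, the high-threshold mask is a subset of the
+    # low-threshold mask, so the intersection is the high mask and the union
+    # is the low mask.
+    intersection = high_thresh_mask.float().sum(-1)
+    union = low_thresh_mask.float().sum(-1)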
+    stability_score = intersection / (union + 1e-6)
+    return stability_score
+def apply_nms(
+    masks_binary: torch.Tensor,
+    scores: torch.Tensor,
+    nms_thresh: float = 0.7,
+) -> torch.Tensor:
+    """
+    Applies greedy NMS on masks using pairwise IoU.
+    Args:
+        masks_binary: [Q, N] binary masks (booleans or 0/1 floats)
+        scores: [Q] mask scores for ranking
+        nms_thresh: IoU threshold for suppression
+    Returns:
+        keep_indices: Tensor of indices to keep after NMS
+    """
+    # Sort by score descending
+    order = torch.argsort(scores, descending=True)
+    masks_binary = masks_binary.bool()
+    keep = []
+    indices = order
+    while indices.numel() > 0:
+        current = indices[0]
+        keep.append(current.item())
+        if indices.numel() == 1:
+            break
+        # Compare current mask with rest
+        current_mask = masks_binary[current].unsqueeze(0)  # [1, N]
+        rest_indices = indices[1:]
+        rest_masks = masks_binary[rest_indices]  # [K, N]
+        intersection = (current_mask & rest_masks).float().sum(dim=1)
+        union = (current_mask | rest_masks).float().sum(dim=1)
+        iou = intersection / (union + 1e-6)
+        # Keep masks with IoU < thresh
+        mask_keep = iou < nms_thresh
+        indices = rest_indices[mask_keep]
+    return torch.tensor(keep, device=masks_binary.device, dtype=torch.long)
+def apply_dbscan_clustering(
+    current_masks: torch.Tensor,
+    point_coords: torch.Tensor,
+    current_scores: torch.Tensor,
+    current_classes: torch.Tensor,
+    eps: float = 0.95,
+    min_samples: int = 1,
+    backend: str = "auto",
+) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+    """
+    Applies DBSCAN to each mask to split spatially disconnected components.
+    Args:
+        current_masks: [Q, N] boolean masks
+        point_coords: [N, 3] point coordinates
+        current_scores: [Q] scores
+        current_classes: [Q] classes
+        eps: DBSCAN eps parameter
+        min_samples: DBSCAN min_samples parameter
+        backend: "auto", "cuml", or "cpu"
+    Returns:
+        new_masks: [Q', N] expanded boolean masks
+        new_scores: [Q'] expanded scores
+        new_classes: [Q'] expanded classes
+        new_indices: [Q'] indices mapping to original queries
+    """
+    # 0. No global size check: oversized masks are skipped per mask below.
+    # 1. Determine Backend
+    use_cuml = False
+    if backend == "auto":
+        use_cuml = DBSCAN is not None
+    elif backend == "cuml":
+        if DBSCAN is None:
+            print("Warning: backend='cuml' requested but cuML not found. Falling back to CPU.")
+            use_cuml = False
+        else:
+            use_cuml = True
+    elif backend == "cpu":
+        use_cuml = False
+    device = current_masks.device
+    num_queries = current_masks.shape[0]
+    # Initialize lists to hold the new split masks
+    new_masks_list = []
+    # We'll store indices pointing to original scores/classes to avoid duplicating them early
+    new_indices_list = []
+    # 2. Execution Path
+    if use_cuml:
+        # --- cuML (GPU) Path ---
+        # print(f"DBSCAN (cuML): Processing {point_coords.shape[0]} points")
+        # Ensure data is on GPU and valid types
+        # cuML DBSCAN expects input of shape (n_samples, n_features)
+        # We process each mask independently.
+        # Optimization: To avoid loop overhead, we could try to batch, but DBSCAN isn't batched.
+        # We iterate over queries.
+        for i in range(num_queries):
+            mask = current_masks[i]
+            # Skip empty masks
+            if not mask.any():
+                continue
+            # Filter points for this mask
+            # mask is [N], point_coords is [N, 3]
+            # Slicing creates a new tensor on GPU
+            points = point_coords[mask]
+            # Check per-mask size limit
+            if points.shape[0] > 100000:
+                # Skip DBSCAN for this mask, keep original
+                print(
+                    f"DBSCAN (cuML): Skipping mask {i} due to large point cloud ({points.shape[0]} points > 100k)"
+                )
+                new_masks_list.append(mask)
+                new_indices_list.append(i)
+                continue
+            if points.shape[0] < min_samples:
+                # Keep original
+                print(
+                    f"DBSCAN (cuML): Skipping mask {i} due to small point cloud ({points.shape[0]} points < {min_samples})"
+                )
+                new_masks_list.append(mask)
+                new_indices_list.append(i)
+                continue
+            try:
+                # Run cuML DBSCAN on the GPU tensor directly. This relies on cuML
+                # accepting inputs that expose __cuda_array_interface__. The return
+                # type (e.g. a cupy array or cudf Series) depends on the installed
+                # version and is converted to a torch tensor below.
+                start_time = time.time()
+                clusterer = DBSCAN(eps=eps, min_samples=min_samples)
+                labels = clusterer.fit_predict(points)
+                db_time = time.time() - start_time
+                # Labels is likely a cupy array or similar on GPU
+                # Convert to torch for easier handling
+                if hasattr(labels, "to_dlpack"):
+                    from torch.utils.dlpack import from_dlpack
+                    labels = from_dlpack(labels.to_dlpack())
+                elif hasattr(labels, "__cuda_array_interface__"):
+                    labels = torch.as_tensor(labels, device=device)
+                unique_labels = torch.unique(labels)
+                # Count valid clusters (excluding noise -1)
+                valid_clusters = unique_labels[unique_labels != -1]
+                # If every point is labeled noise (-1), valid_clusters is empty, the
+                # loop below adds nothing, and this mask is dropped.
+                found_cluster = False
+                # Reconstruct masks
+                # We need global indices of the points
+                mask_indices = torch.nonzero(mask, as_tuple=True)[0]
+                for label in valid_clusters:
+                    found_cluster = True
+                    # Create new boolean mask
+                    # 1. Start with zeros
+                    new_mask = torch.zeros_like(mask)
+                    # 2. Get local indices where label matches
+                    local_indices = (labels == label).nonzero(as_tuple=True)[0]
+                    # 3. Map to global indices
+                    global_indices = mask_indices[local_indices]
+                    # 4. Set True
+                    new_mask[global_indices] = True
+                    new_masks_list.append(new_mask)
+                    new_indices_list.append(i)
+                if not found_cluster:
+                    # All points were labeled noise: the mask is dropped
+                    # (same behavior as the CPU path).
+                    pass
+            except Exception as e:
+                print(f"DBSCAN (cuML) Error Query {i}: {e}")
+                # Fallback: keep original
+                new_masks_list.append(mask)
+                new_indices_list.append(i)
+    else:
+        # --- CPU Path ---
+        # print(f"DBSCAN (CPU): Processing {point_coords.shape[0]} points")
+        # Move inputs to CPU
+        masks_cpu = current_masks.detach().cpu().numpy()
+        coords_cpu = point_coords.detach().cpu().numpy()
+        try:
+            from sklearn.cluster import DBSCAN as SklearnDBSCAN
+        except ImportError:
+            print("Scikit-learn not found. Returning original masks.")
+            return (
+                current_masks,
+                current_scores,
+                current_classes,
+                torch.arange(num_queries, device=device),
+            )
+        for i in range(num_queries):
+            mask = masks_cpu[i]
+            if not mask.any():
+                continue
+            points = coords_cpu[mask]
+            # Check per-mask size limit
+            if points.shape[0] > 100000:
+                # Skip DBSCAN for this mask, keep original
+                print(
+                    f"DBSCAN (CPU): Skipping mask {i} due to large point cloud ({points.shape[0]} points > 100k)"
+                )
+                new_masks_list.append(current_masks[i])
+                new_indices_list.append(i)
+                continue
+            if points.shape[0] < min_samples:
+                # Keep original
+                print(
+                    f"DBSCAN (CPU): Skipping mask {i} due to small point cloud ({points.shape[0]} points < {min_samples})"
+                )
+                new_masks_list.append(current_masks[i])
+                new_indices_list.append(i)
+                continue
+            try:
+                # Ensure float32 for sklearn
+                start_time = time.time()
+                clusterer = SklearnDBSCAN(eps=eps, min_samples=min_samples)
+                labels = clusterer.fit_predict(points.astype(np.float32))
+                db_time = time.time() - start_time
+                unique_labels = np.unique(labels)
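+                # Exclude the noise label (-1) from the reported cluster count.
+                num_clusters = int((unique_labels != -1).sum())
+                print(
+                    f"DBSCAN (CPU): Processing {points.shape[0]} points took {db_time:.4f} seconds, found {num_clusters} clusters"
+                )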
+                found_cluster = False
+                # Build each split mask on the CPU, then move it to `device` so the
+                # list holds torch tensors, matching the cuML path.
+                mask_indices_cpu = np.nonzero(mask)[0]
+                for label in unique_labels:
+                    if label == -1:
+                        continue
+                    found_cluster = True
+                    # Construct new mask
+                    # It's easier to create on CPU then convert
+                    new_mask_cpu = np.zeros_like(mask)  # bool/uint8
+                    local_mask = labels == label
+                    active_indices = mask_indices_cpu[local_mask]
+                    new_mask_cpu[active_indices] = 1  # True
+                    # Convert to tensor on device
+                    new_masks_list.append(
+                        torch.from_numpy(new_mask_cpu).to(device, dtype=torch.bool)
+                    )
+                    new_indices_list.append(i)
+                if not found_cluster:
+                    # All points were labeled noise: the mask is dropped.
+                    pass
+            except Exception as e:
+                print(f"DBSCAN (CPU) Error Query {i}: {e}")
+                new_masks_list.append(current_masks[i])
+                new_indices_list.append(i)
+    # 3. Assemble Results
+    if len(new_masks_list) == 0:
+        return (
+            torch.zeros((0, current_masks.shape[1]), device=device, dtype=torch.bool),
+            torch.zeros((0,), device=device, dtype=current_scores.dtype),
+            torch.zeros((0,), device=device, dtype=current_classes.dtype),
+            torch.zeros((0,), device=device, dtype=torch.long),
+        )
+    final_masks = torch.stack(new_masks_list)
+    # Gather scores and classes using indices
+    indices_tensor = torch.tensor(new_indices_list, device=device, dtype=torch.long)
+    final_scores = current_scores[indices_tensor]
+    final_classes = current_classes[indices_tensor]
+    return final_masks, final_scores, final_classes, indices_tensor
+def apply_post_processing(
+    pred_masks: torch.Tensor,
+    pred_logits: torch.Tensor,
+    mask_threshold: float = 0.0,
+    point_coords: Optional[torch.Tensor] = None,
+    pp_cfg: Optional[Dict] = None,
+    pred_iou: Optional[torch.Tensor] = None,
+) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+    """
+    Applies configured post-processing filters.
+    Args:
+        pred_masks: [Q, N] mask logits
+        pred_logits: [Q, 2] class logits (objectness is class 0)
+        mask_threshold: Threshold for mask binarization (usually 0.0 for logits)
+        pred_iou: Optional [Q] learned IoU logits from SpaceFormer's IoU head.
+            When provided, `sigmoid(pred_iou)` replaces the hand-coded
+            `mask_quality = (sigmoid(masks) * binary).sum / binary.sum` proxy in
+            the score = obj * quality formula. DBSCAN expansion copies the same
+            scalar to every component of an expanded query.
+        pp_cfg: Post-processing configuration dict with keys:
+            - objectness_thresh: float (default 0.0, disabled)
+            - min_mask_points: int (default 0, disabled)
+            - use_stability_score: bool (default False)
+            - stability_score_thresh: float (default 0.9)
+            - stability_score_offset: float (default 1.0)
+            - use_nms: bool (default False)
+            - nms_thresh: float (default 0.7)
+            - use_dbscan: bool (default False)
+            - dbscan_eps: float (default 0.95)
+            - dbscan_min_points: int (default 1)
+            - dbscan_backend: str (default "auto")
+    Returns:
+        final_masks: [Q', N] final binary masks
+        final_scores: [Q'] final scores
+        final_classes: [Q'] final classes
+        final_indices: [Q'] indices mapping to original queries
+    """
+    if pp_cfg is None:
+        pp_cfg = {}
+    # Basic preparation
+    masks_binary = pred_masks > mask_threshold
+    # 0. Min Point Count Filtering (FIRST STEP - early rejection)
+    # Filter out small masks before expensive operations like DBSCAN
+    keep = torch.arange(pred_masks.shape[0], device=pred_masks.device)
+    if pp_cfg.get("min_mask_points", 0) > 0:
+        counts = masks_binary.float().sum(1)
+        keep_size = counts >= pp_cfg["min_mask_points"]
+        keep = keep[keep_size]
+        if len(keep) == 0:
+            return (
+                torch.zeros((0, pred_masks.shape[1]), device=pred_masks.device, dtype=torch.bool),
+                torch.zeros((0,), device=pred_masks.device, dtype=pred_masks.dtype),
+                torch.zeros((0,), device=pred_masks.device, dtype=torch.long),
+                torch.zeros((0,), device=pred_masks.device, dtype=torch.long),
+            )
+        # Filter all inputs
+        masks_binary = masks_binary[keep]
+        pred_masks = pred_masks[keep]
+        pred_logits = pred_logits[keep]
+        if pred_iou is not None:
+            pred_iou = pred_iou[keep]
+    # 1. DBSCAN Expansion
+    # If DBSCAN is used, we expand masks immediately.
+    # We maintain a mapping to original logits to allow stability calculation later.
+    current_masks = masks_binary
+    current_logits = pred_masks
+    current_pred_logits = pred_logits
+    # Track indices (now relative to filtered set if min_mask_points was applied)
+    current_indices = keep.clone()
+    # Objectness component
+    # Class 0 is the objectness class (see the `pred_logits` entry in the docstring).
+    obj_probs = pred_logits.softmax(dim=-1)[:, 0]
+    # Mask quality component (IoU proxy) — learned if pred_iou is provided
+    # (P3-SAM-style IoU head), otherwise the hand-coded sigmoid-mean proxy.
+    if pred_iou is not None:
+        mask_quality = pred_iou.sigmoid()
+    else:
+        masks_sigmoid = pred_masks.sigmoid()
+        mask_quality = (masks_sigmoid * masks_binary.float()).sum(1) / (
+            masks_binary.float().sum(1) + 1e-6
+        )
+    scores = obj_probs * mask_quality
+    classes = torch.zeros_like(scores, dtype=torch.long)  # class 0
+    if pp_cfg.get("use_dbscan", False) and point_coords is not None:
+        current_masks, scores, classes, dbscan_indices = apply_dbscan_clustering(
+            current_masks,
+            point_coords,
+            scores,
+            classes,
+            eps=pp_cfg.get("dbscan_eps", 0.95),
+            min_samples=pp_cfg.get("dbscan_min_points", 1),
+            backend=pp_cfg.get("dbscan_backend", "auto"),
+        )
+        # We need to map them back to original query indices
+        current_indices = keep[dbscan_indices]
+        # Expand logits and other properties to match split masks
+        # Use dbscan_indices (relative to current filtered set) for indexing current tensors
+        current_logits = current_logits[dbscan_indices]
+        current_pred_logits = current_pred_logits[dbscan_indices]
+        obj_probs = obj_probs[dbscan_indices]
+        # MASK THE LOGITS (Stability Fix)
+        # Key step: constrain the logits to the new binary mask shape
+        # so stability score is calculated on the component, not the whole original mask.
+        # We use a large negative value for background.
+        current_logits = torch.where(current_masks, current_logits, -100.0)
+        # Recalculate mask quality for the NEW masks. With learned IoU we copy
+        # the parent query's scalar to every expanded component (no per-component
+        # IoU prediction is available); without it, recompute the sigmoid-mean
+        # proxy from the masked logits.
+        if pred_iou is not None:
+            mask_quality = pred_iou[dbscan_indices].sigmoid()
+        else:
+            masks_sigmoid = current_logits.sigmoid()
+            mask_quality = (masks_sigmoid * current_masks.float()).sum(1) / (
+                current_masks.float().sum(1) + 1e-6
+            )
+        # Recalculate scores (Obj * Quality)
+        scores = obj_probs * mask_quality
+    # Now we have `current_masks` (binary) and `current_logits` (masked logits).
+    # All subsequent steps operate on these.
+    # 2. Objectness Filtering
+    keep = torch.arange(current_masks.shape[0], device=current_masks.device)
+    if pp_cfg.get("objectness_thresh", 0.0) > 0:
+        # obj_probs is aligned with current set
+        keep_obj = obj_probs > pp_cfg["objectness_thresh"]
+        keep = keep[keep_obj[keep]]
+    if len(keep) == 0:
+        return (
+            torch.zeros((0, pred_masks.shape[1]), device=pred_masks.device, dtype=torch.bool),
+            torch.zeros((0,), device=pred_masks.device, dtype=scores.dtype),
+            torch.zeros((0,), device=pred_masks.device, dtype=classes.dtype),
+            torch.zeros((0,), device=pred_masks.device, dtype=torch.long),
+        )
+    # 3. Stability Score
+    if pp_cfg.get("use_stability_score", False):
+        active_logits = current_logits[keep]
+        stability = calculate_stability_score(
+            active_logits,
+            mask_threshold,
+            pp_cfg.get("stability_score_offset", 1.0),
+        )
+        keep_stable = stability >= pp_cfg.get("stability_score_thresh", 0.9)
+        keep = keep[keep_stable]
+    if len(keep) == 0:
+        return (
+            torch.zeros((0, pred_masks.shape[1]), device=pred_masks.device, dtype=torch.bool),
+            torch.zeros((0,), device=pred_masks.device, dtype=scores.dtype),
+            torch.zeros((0,), device=pred_masks.device, dtype=classes.dtype),
+            torch.zeros((0,), device=pred_masks.device, dtype=torch.long),
+        )
+    # 4. NMS
+    if pp_cfg.get("use_nms", False):
+        active_masks = current_masks[keep]
+        active_scores = scores[keep]
+        keep_nms = apply_nms(active_masks, active_scores, pp_cfg.get("nms_thresh", 0.7))
+        keep = keep[keep_nms]
+    # Final gather
+    final_masks = current_masks[keep]
+    final_scores = scores[keep]
+    final_classes = classes[keep]
+    final_indices = current_indices[keep]
+    return final_masks, final_scores, final_classes, final_indices
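
When no `pred_iou` is supplied, `apply_post_processing` scores each query as objectness times a sigmoid-mean mask-quality proxy. The arithmetic for a single query can be sketched without any dependencies as below. `query_score` is a hypothetical helper written only for illustration (it is not part of `postprocessing.py`), and it uses plain Python lists where the real code works on batched tensors.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def query_score(
    class_logits: Sequence[float],
    mask_logits: Sequence[float],
    mask_threshold: float = 0.0,
) -> float:
    """Score one query as objectness * mask quality (hypothetical helper)."""
    # Objectness: softmax over the class logits, then take class 0.
    peak = max(class_logits)
    exps = [math.exp(v - peak) for v in class_logits]
    objectness = exps[0] / sum(exps)

    # Mask quality: mean sigmoid over the points above the threshold.
    # The small epsilon mirrors the 1e-6 guard against empty masks.
    foreground = [v for v in mask_logits if v > mask_threshold]
    quality = sum(sigmoid(v) for v in foreground) / (len(foreground) + 1e-6)

    return objectness * quality


# Same objectness, but the first mask is far more confident on its foreground.
confident = query_score([2.0, -2.0], [8.0, 6.0, -5.0])
hesitant = query_score([2.0, -2.0], [0.2, 0.1, -5.0])
print(confident > hesitant)
```

A mask whose foreground logits sit just above the threshold gets a quality near 0.5, so it ranks below a mask with the same objectness but sharper logits.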

demo/requirements.txt ADDED

	@@ -0,0 +1,15 @@

+# Demo/inference dependencies.
+# WarpConvNet (with its compiled _C extension) must be installed separately — a
+# pre-built wheel or built from source; it is environment-specific so not pinned here.
+torch
+einops
+transformers
+numpy
+huggingface_hub
+gradio
+plotly
+# optional point-cloud input formats
+plyfile
+# local viser demo (demo_viser.py): interactive 3D viewer + sample .ply I/O
+viser
+open3d
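
Because WarpConvNet is deliberately left out of this file, an import check before launching the demo can catch a missing package early. The sketch below assumes each package's import name matches the name listed above and that WarpConvNet imports as `warpconvnet` (as the README suggests); `missing_modules` is a hypothetical helper, not part of the demo.

```python
import importlib.util
from typing import List

# Import names assumed to match the entries in requirements.txt.
REQUIRED = ["torch", "einops", "transformers", "numpy", "huggingface_hub", "gradio", "plotly"]
# Optional extras, plus WarpConvNet, which must be installed separately.
OPTIONAL = ["plyfile", "viser", "open3d", "warpconvnet"]


def missing_modules(names: List[str]) -> List[str]:
    """Return the names that cannot be located on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing required:", missing_modules(REQUIRED))
print("missing optional:", missing_modules(OPTIONAL))
```

`find_spec` only locates a package; it does not load it, so this check cannot confirm that WarpConvNet's compiled extension actually imports.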

demo/text_encoder.py ADDED

	@@ -0,0 +1,218 @@

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+from typing import List, Tuple
+import hashlib
+import os
+import abc
+import numpy as np
+import torch
+from transformers import AutoTokenizer, AutoModel
+# Assume that models are already cached
+os.environ["HF_HUB_OFFLINE"] = "1"
+# Use a deterministic hash function for strings
+def string_hash(s: str) -> int:
+    return int(hashlib.md5(s.encode()).hexdigest(), 16)
+class CLIPTextEncoderInterace(abc.ABC):
+    model: torch.nn.Module
+    CHANNEL_DIM: int
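+    # NOTE: `__post_init__` is only invoked automatically for dataclasses. This class
+    # is not one, so subclasses must call `freeze_encoder()` themselves.
+    def __post_init__(self):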
+        self.freeze_encoder()
+    def freeze_encoder(self):
+        for params in self.model.parameters():
+            params.requires_grad = False
+    @abc.abstractmethod
+    def __call__(self, list_of_texts: List[str], normalize: bool = True) -> torch.Tensor:
+        raise NotImplementedError
+    @torch.inference_mode()
+    def get_unique_text_embedding(
+        self,
+        list_of_texts: List[str] | List[List[str]],
+        normalize: bool = True,
+    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+        """Get unique embeddings for a list of texts.
+        Args:
+            list_of_texts: List[str] | List[List[str]]
+                List of texts or list of list of texts to get unique embeddings for.
+                Total number of texts is N.
+        Returns:
+            embeddings: torch.Tensor, shape (M, D)
+                Unique embeddings for the list of texts.
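+            from_unique_indices: torch.Tensor, shape (N,)
+                For each input text, the row of `embeddings` holding its embedding,
+                so `embeddings[from_unique_indices]` has shape (N, D).
+            to_unique_indices: torch.Tensor, shape (M,)
+                Index of the first occurrence of each unique text in the flattened list.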
+        """
+        # Flatten the list of texts
+        if isinstance(list_of_texts, list) and isinstance(list_of_texts[0], list):
+            # list of lists
+            list_of_texts = [item for sublist in list_of_texts for item in sublist]
+        # cchoy: Deduplicate on an MD5 digest of each text instead of Python's built-in
+        # hash(), which is salted per process for str and so is not reproducible across runs.
+        flat_caption_hash = [string_hash(caption) for caption in list_of_texts]
+        _, to_unique_indices, from_unique_indices = np.unique(
+            flat_caption_hash, return_index=True, return_inverse=True
+        )
+        # Get unique texts
+        unique_texts = [list_of_texts[i] for i in to_unique_indices]
+        # Get embeddings
+        embeddings = self(unique_texts, normalize=normalize)
+        # Return embeddings and indices
+        return embeddings, torch.tensor(from_unique_indices), torch.tensor(to_unique_indices)
+def get_text_encoder(
+    model_type: str,
+    device: str,
+    **kwargs,
+) -> CLIPTextEncoderInterace:
+    if model_type == "siglip2":
+        return Siglip2TextEncoder(device=device, **kwargs)
+    elif model_type == "openclip":  # Recap CLIP is also openclip
+        return OpenCLIPTextEncoder(device=device, **kwargs)
+    else:
+        raise ValueError(f"Model type {model_type} not supported")
+class OpenCLIPTextEncoder(CLIPTextEncoderInterace):
+    CHANNEL_DIM = None
+    def __init__(
+        self,
+        model_id: str,
+        device: str = "cuda",
+        torch_dtype: torch.dtype = torch.bfloat16,
+        context_length: int = None,
+        **kwargs,
+    ):
+        # This is not a required dependency, so we need to import it here
+        try:
+            from open_clip import create_model_from_pretrained, get_tokenizer
+        except ImportError:
+            raise ImportError(
+                "open_clip is not installed. Please install it with `pip install open_clip_torch`"
+            )
+        self.prepare_data(model_id)
+        self.tokenizer = get_tokenizer(model_id)
+        precision = {torch.float16: "fp16", torch.bfloat16: "bf16"}[torch_dtype]
+        self.model, _ = create_model_from_pretrained(
+            model_id,
+            device=device,
+            precision=precision,
+        )
+        self.device = device
+        # Set context_length: use provided value, or infer from model, or use default
+        if context_length is not None:
+            self.context_length = context_length
+        elif hasattr(self.model, "context_length"):
+            self.context_length = self.model.context_length
+        elif hasattr(self.model, "text") and hasattr(self.model.text, "context_length"):
+            self.context_length = self.model.text.context_length
+        else:
+            # Default to 77 for standard CLIP models
+            self.context_length = 77
+            print(
+                f"Warning: Could not infer context_length from model, using default: {self.context_length}"
+            )
+    def prepare_data(self, model_id: str):
+        from open_clip.factory import download_pretrained_from_hf
+        # Remove hf-hub: prefix if it exists
+        model_id = model_id[len("hf-hub:") :] if model_id.startswith("hf-hub:") else model_id
+        ckpt_path = download_pretrained_from_hf(
+            model_id, cache_dir=os.environ.get("HF_HUB_CACHE", os.path.expanduser("~/.cache/"))
+        )
+        return ckpt_path
+    @torch.inference_mode()
+    @torch.amp.autocast(enabled=True, device_type="cuda")
+    def __call__(self, list_of_texts: List[str], normalize: bool = True) -> torch.Tensor:
+        text_tokens = self.tokenizer(list_of_texts, context_length=self.context_length).to(
+            self.device
+        )
+        embeddings = self.model.encode_text(text_tokens)
+        if normalize:
+            embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
+        return embeddings
+class Siglip2TextEncoder(CLIPTextEncoderInterace):
+    CHANNEL_DIM = 1152
+    def __init__(
+        self,
+        model_id: str = "google/siglip2-so400m-patch16-384",
+        device: str = "cuda",
+        attn_implementation: str = "flash_attention_2",
+        torch_dtype: torch.dtype = torch.bfloat16,
+        **kwargs,
+    ):
+        # Disable tokenizer parallelism
+        os.environ["TOKENIZERS_PARALLELISM"] = "false"
+        # Try loading from local cache first to avoid 429 errors
+        try:
+            self.tokenizer = AutoTokenizer.from_pretrained(model_id, local_files_only=True)
+            self.model = AutoModel.from_pretrained(
+                model_id,
+                attn_implementation=attn_implementation,
+                torch_dtype=torch_dtype,
+                device_map=device,
+                local_files_only=True,
+            )
+            print(f"Successfully loaded {model_id} from local cache.")
+        except OSError:
+            print(
+                f"Model {model_id} not found locally. Downloading/Updating from Hugging Face Hub..."
+            )
+            # Fall back to downloading when the model is not cached locally.
+            # This can still hit HTTP 429 if many ranks download at once; if that
+            # persists in a multi-node setup, download on rank 0 only.
+            self.tokenizer = AutoTokenizer.from_pretrained(model_id)
+            self.model = AutoModel.from_pretrained(
+                model_id,
+                attn_implementation=attn_implementation,
+                torch_dtype=torch_dtype,
+                device_map=device,
+            )
+        self.model.vision_model = None  # Remove vision model
+        self.device = device
+    @torch.inference_mode()
+    @torch.amp.autocast(enabled=True, device_type="cuda")
+    def __call__(self, list_of_texts: List[str], normalize: bool = True) -> torch.Tensor:
+        # Length is 64 https://huggingface.co/docs/transformers/main/model_doc/siglip2
+        text_inputs = self.tokenizer(
+            list_of_texts,
+            padding="max_length",
+            truncation=True,
+            max_length=64,
+            return_tensors="pt",
+        ).to(self.device)
+        outputs = self.model.get_text_features(**text_inputs)
+        # In newer transformers, get_text_features may return a
+        # BaseModelOutputWithPooling instead of a plain tensor.
+        if not isinstance(outputs, torch.Tensor):
+            outputs = outputs.pooler_output
+        if normalize:
+            outputs = torch.nn.functional.normalize(outputs, dim=-1)
+        return outputs
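
The index contract of `get_unique_text_embedding` can be illustrated without loading a model. The sketch below is a pure-Python stand-in for the `np.unique(..., return_index=True, return_inverse=True)` call; `unique_with_indices` is a hypothetical helper, and text strings stand in for embedding rows.

```python
import hashlib
from typing import List, Tuple


def string_hash(s: str) -> int:
    # Deterministic across processes, unlike the built-in hash() for str.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)


def unique_with_indices(texts: List[str]) -> Tuple[List[int], List[int]]:
    """Return (to_unique, from_unique) with the same meaning as in np.unique."""
    hashes = [string_hash(t) for t in texts]
    # Remember where each distinct hash first appears.
    first_seen = {}
    for i, h in enumerate(hashes):
        first_seen.setdefault(h, i)
    # np.unique returns unique values in sorted order, so sort the hashes.
    sorted_hashes = sorted(first_seen)
    to_unique = [first_seen[h] for h in sorted_hashes]
    rank = {h: r for r, h in enumerate(sorted_hashes)}
    from_unique = [rank[h] for h in hashes]
    return to_unique, from_unique


texts = ["chair", "table", "chair", "sofa"]
to_unique, from_unique = unique_with_indices(texts)
unique_texts = [texts[i] for i in to_unique]  # what the encoder would embed
restored = [unique_texts[j] for j in from_unique]  # one entry per input text
print(restored == texts)
```

Only the unique texts go through the encoder; indexing the result with `from_unique` then expands it back to one row per input text.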