red_cube PRISM-JEPA deploy bundle: LeWM WM + PRISM prior + self-contained inference + README

Files changed (7)

README.md +110 -0
arx_inference_demo.py +333 -0
jepa.py +153 -0
lewm_red_cube_epoch_100_object.ckpt +3 -0
module.py +306 -0
prior_head.py +73 -0
prior_head_red_cube.pt +3 -0

README.md ADDED

	@@ -0,0 +1,110 @@

+---
+license: apache-2.0
+library_name: pytorch
+tags:
+- robotics
+- world-model
+- jepa
+- lewm
+- prism
+- arx
+- visuomotor
+---
+# PRISM-JEPA · red_cube (ARX-X5) — LeWM world model + PRISM action prior
+Complete deployable stack for goal-conditioned visuomotor planning on the ARX-X5
+**"red cube"** task: the **LeWM JEPA world model** + the **PRISM goal-conditioned action
+prior** + **self-contained PRISM-MPPI inference code**.
+## ⚠️ Status — read first
+This is a **research artifact / deployment hand-off package, NOT a validated policy.**
+- Trained on 201 teleop demos (`Xia-2004/red_cube`, 42,165 frames, 5-DoF).
+- The world model **converges cleanly and its forward model is accurate** (rollout pred/id
+  ≈ 0.25 < 0.5), **but its MPPI cost surface is weak/flat** on this small-real-robot,
+  small-action data (CV ≈ **0.14** ≪ 0.30 "discriminative" threshold). Consequence: the
+  planner's distinctive **cost-rescoring is dormant**, so in practice **PRISM ≈ a
+  goal-conditioned BC prior** — in offline A/B it produces ~31% more expert-like actions
+  than vanilla LeWM-MPPI (which wanders), but adds **no measurable goal-progress** in the
+  world model's own latent metric (paired t-test p = 0.57).
+- **Has never been run on the real robot.** Treat it as a starting point; add workspace and
+  velocity safety limits and validate before any hardware run.
+- Full analysis & how these numbers were obtained: project doc
+  `docs/30_red_cube_cv_investigation_and_prism.md`.
+## Contents
+| file | description |
+|---|---|
+| `lewm_red_cube_epoch_100_object.ckpt` | LeWM world model — pickled JEPA: ViT-tiny encoder + AR transformer predictor + action encoder (~18M params) |
+| `prior_head_red_cube.pt` | PRISM goal-conditioned action prior — `state_dict` + `config` + action `StandardScaler` (mean/scale) |
+| `arx_inference_demo.py` | self-contained `PrismMPPIInference` (PoG-fused PRISM-MPPI; `use_prism=False` → vanilla LeWM-MPPI) |
+| `jepa.py`, `module.py`, `prior_head.py` | model classes required to unpickle the ckpt and run the prior |
+## Observation / action space
+- **Observation:** single top-down RGB frame, **224×224×3 uint8** (RealSense `camera_third`).
+- **Goal:** an RGB goal image, same format (the prior + cost are conditioned on it).
+- **Action:** 5-DoF delta end-effector **`[dx, dy, dz, dyaw, d_gripper]`**, raw units, one per
+  control tick. `plan()` returns one plan-step = `A_block = 5` ticks → shape **`(5, 5)`**.
+## Dependencies
+`torch`, `numpy`, `einops`, and `transformers` (the encoder inside the ckpt is a HuggingFace
+ViT, needed at unpickle time). The three bundled `.py` files must be importable from the
+working directory. If unpickling complains about a missing class, also `pip install
+stable-pretraining`. The `__main__` sanity test in `arx_inference_demo.py` additionally
+imports `h5py`.
+## Deploy — receding-horizon control loop
+```python
+from arx_inference_demo import PrismMPPIInference
+planner = PrismMPPIInference(
+    lewm_ckpt  = "lewm_red_cube_epoch_100_object.ckpt",
+    prior_ckpt = "prior_head_red_cube.pt",
+    use_prism  = True,     # True = PRISM (prior ⊗ MPPI via PoG fusion); False = vanilla LeWM-MPPI
+    device     = "cuda",
+)
+goal_img = load_goal_image()                   # (224,224,3) uint8 — the task goal image
+while not done:
+    obs     = camera.read()                    # (224,224,3) uint8, top-down camera_third view
+    actions = planner.plan(obs, goal_img)      # (5, 5) raw [dx,dy,dz,dyaw,d_gripper]
+    for a in actions:                          # receding horizon: execute the block, then replan
+        robot.execute(a)                       # (or execute fewer than 5 and replan more often)
+```
+`plan()` runs one full PRISM-MPPI optimization and returns the first `A_block = 5` env-step
+actions of the optimized plan, in **raw action units** (already de-normalized).
+## Key hyperparameters (`PrismMPPIInference` constructor)
+| arg | default | meaning |
+|---|---|---|
+| `H` | 5 | planning horizon (plan-steps) |
+| `A_block` | 5 | env-steps (ticks) per plan-step ("frameskip") |
+| `K` | 128 | MPPI samples per iteration |
+| `n_iters` | 30 | MPPI refinement iterations |
+| `var_scale` | 1.0 | initial planner sampling std |
+| `prior_sigma_scale` | 2.0 | multiplier on the prior σ before PoG fusion (PRISM only) |
+| `temperature` | 0.5 | MPPI softmax temperature |
+| `history_size` | 3 | LeWM history-window length (**must match training**) |
+`H`, `A_block`, `A_raw`, and `history_size` must match the checkpoints. The constructor asserts
+only `H` and `A_block` against the prior head's config; `A_raw` is read from that config, and
+`history_size` is not checked. Change them only if you retrain.
+## PRISM vs vanilla (A/B)
+Build a second planner with `use_prism=False` for a baseline (plain LeWM-MPPI, no prior,
+same encoder/predictor/MPPI loop). On this task PRISM produces more expert-like actions;
+vanilla tends to wander because the cost surface is flat.
+## Provenance
+Data: [`Xia-2004/red_cube`](https://huggingface.co/datasets/Xia-2004/red_cube) (ARX-X5
+left-arm teleop). Sibling of `Xia-2004/arx-left-cube`. World-model architecture is identical
+to the sim LeWM (ViT-tiny, embed_dim 192, predictor depth 6 / heads 16) — part of the
+PRISM-JEPA project (sister of Newt-PRISM, CoRL 2026).
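
The README's status section asks for workspace and velocity safety limits before any hardware run. A minimal sketch of a per-tick clamp on the `(5, 5)` block returned by `plan()` might look like the following. The limit values and the helper names `clamp_action` / `clamp_block` are hypothetical placeholders, not values from the dataset or from the ARX-X5 specification; real limits would have to come from the robot's own envelope.

```python
import math
from typing import List, Sequence

# HYPOTHETICAL per-tick limits in raw action units for
# [dx, dy, dz, dyaw, d_gripper]. Placeholders only: replace them with
# limits derived from the actual robot and workspace.
MAX_ABS_DELTA = (0.01, 0.01, 0.01, 0.05, 0.2)


def clamp_action(action: Sequence[float],
                 limits: Sequence[float] = MAX_ABS_DELTA) -> List[float]:
    """Clamp one 5-DoF delta action element-wise to [-limit, +limit]."""
    if len(action) != len(limits):
        raise ValueError(f"expected {len(limits)} values, got {len(action)}")
    values = [float(a) for a in action]
    if not all(math.isfinite(v) for v in values):
        # Refuse NaN/inf instead of silently forwarding them to the robot.
        raise ValueError(f"non-finite action: {values}")
    return [max(-lim, min(lim, v)) for v, lim in zip(values, limits)]


def clamp_block(actions: Sequence[Sequence[float]],
                limits: Sequence[float] = MAX_ABS_DELTA) -> List[List[float]]:
    """Clamp every tick of an (A_block, A_raw) action block."""
    return [clamp_action(a, limits) for a in actions]


if __name__ == "__main__":
    # A made-up block with two out-of-range components per tick.
    block = [[0.5, -0.002, 0.0, -1.0, 0.1]] * 5
    print(clamp_block(block)[0])
```

In the control loop this would sit between `planner.plan(...)` and `robot.execute(...)`. A delta clamp bounds per-tick motion only; an absolute workspace check on the resulting end-effector pose would still be needed.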

arx_inference_demo.py ADDED

	@@ -0,0 +1,333 @@

+"""arx_inference_demo.py — standalone PRISM-MPPI inference for ARX cube task.
+This file is **self-contained**: it depends only on the bundled
+`jepa.py`, `module.py`, `prior_head.py`, plus torch / numpy, `einops` (imported by the
+bundled modules), and `transformers` (needed to unpickle the ViT encoder).
+No `stable_worldmodel` import — the MPPI loop is re-implemented inline.
+Intended use by a downstream consumer (e.g., the ARX deployment side):
+    from arx_inference_demo import PrismMPPIInference
+    planner = PrismMPPIInference(
+        lewm_ckpt    = "lewm_red_cube_epoch_100_object.ckpt",
+        prior_ckpt   = "prior_head_red_cube.pt",
+        device       = "cuda",
+    )
+    # In the control loop:
+    while not done:
+        obs_uint8  = camera.read()          # (224, 224, 3) uint8 RGB
+        goal_uint8 = goal_image             # (224, 224, 3) uint8 RGB
+        actions    = planner.plan(obs_uint8, goal_uint8)
+                                            # → (A_block, 5) float32, raw action units
+        for a in actions:
+            robot.execute(a)                # step the robot
+`plan()` performs one full PRISM-MPPI optimization and returns the first
+A_block = 5 env-step actions of the optimized plan. The caller may choose
+to execute all 5 then replan (receding-horizon, k=A_block), or execute
+fewer and replan more often.
+PRISM-MPPI summary:
+  1. JEPA encoder turns current obs + goal image into latent embeddings z_t, z_g.
+  2. PRISM prior head maps (z_t, z_g) → (μ_p, σ_p) over the next
+     H × A_block × A_raw normalized actions.
+  3. We seed an MPPI distribution N(0, var_scale² I) and PoG-fuse with the
+     prior to get N(fused_μ, fused_σ²). The variance is FROZEN through MPPI
+     iterations (this is the PRISM-MPPI signature; see paper §3).
+  4. Each iteration samples K candidate action sequences, rolls them out via
+     the LeWM ARPredictor in latent space, computes cost = MSE(predicted
+     final z, z_g), reweights candidates by exp(-β·cost), updates the mean.
+  5. After n_iters iterations, the first A_block entries of the mean are
+     returned (denormalized to raw env action units via the saved
+     StandardScaler).
+"""
+from __future__ import annotations
+from pathlib import Path
+import numpy as np
+import torch
+import torch.nn.functional as F
+# Required for unpickling the LeWM ckpt — these modules must be importable
+import jepa   # noqa: F401  — registers JEPA class
+import module  # noqa: F401  — registers ARPredictor, Embedder, etc.
+from prior_head import PriorHead
+IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
+IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
+def _preprocess(img_uint8: np.ndarray, device: torch.device) -> torch.Tensor:
+    """uint8 (H, W, 3) → float (1, 3, 224, 224), ImageNet-normalized."""
+    assert img_uint8.shape == (224, 224, 3), \
+        f"Expected (224, 224, 3) image, got {img_uint8.shape}"
+    t = torch.from_numpy(img_uint8).permute(2, 0, 1).float().div(255.0).unsqueeze(0)
+    t = t.to(device)
+    mean = IMAGENET_MEAN.to(device)
+    std = IMAGENET_STD.to(device)
+    return (t - mean) / std
+def _pog_fusion(mean, std, mu_p, sg_p, sigma_floor=0.05):
+    """Product-of-Gaussians fusion. Matches prism_mppi.pog_fusion."""
+    eps = 1e-8
+    tau_base = 1.0 / (std ** 2 + eps)
+    tau_p = 1.0 / (sg_p ** 2 + eps)
+    tau_c = tau_base + tau_p
+    fused_mean = (tau_base * mean + tau_p * mu_p) / tau_c
+    fused_std = (1.0 / tau_c).sqrt().clamp(min=sigma_floor)
+    return fused_mean, fused_std
+class PrismMPPIInference:
+    """Standalone PRISM-MPPI planner for ARX cube task.
+    Supports two modes via the `use_prism` constructor flag — kept on a
+    single class so that PRISM and vanilla-MPPI A/B comparisons use the
+    exact same encoder, predictor, MPPI loop, and StandardScaler. The
+    only difference between the two modes is whether the PoG fusion at
+    init time uses the prior head's (μ_p, σ_p) or not. The prior-head
+    checkpoint is always loaded — its StandardScaler (action
+    normalization) is shared by both modes so the comparison is
+    apples-to-apples in raw action units.
+    Args (paper defaults — change only if you know what you're doing):
+        lewm_ckpt:        path to lewm_arx.ckpt (pickled JEPA module)
+        prior_ckpt:       path to prior_head_arx.pt (PRISM head state_dict + scaler)
+        use_prism:        if True (default), inject the PRISM prior via PoG fusion.
+                          If False, skip the prior — the planner becomes vanilla
+                          LeWM-MPPI from N(0, var_scale) seed. Use this for paper-
+                          grade real-robot A/B against PRISM-MPPI.
+        H:                planning horizon in plan-steps (default 5)
+        A_block:          env-steps per plan-step (default 5, "frameskip")
+        K:                num MPPI samples per iteration (default 128)
+        n_iters:          num MPPI refinement iterations (default 30)
+        var_scale:        initial planner std (default 1.0)
+        temperature:      MPPI softmax temperature β = 1/temperature (default 0.5)
+        sigma_floor:      lower bound on fused σ (default 0.05); only used by PRISM
+        prior_sigma_scale: multiplier on prior σ_p before fusion (default 2.0,
+                          matches the paper's PRISM-MPPI s=2.0 setting); only used by PRISM
+        history_size:     LeWM history-window length (default 3; must match training)
+        device:           'cuda' or 'cpu'
+    """
+    def __init__(
+        self,
+        lewm_ckpt: str | Path,
+        prior_ckpt: str | Path,
+        use_prism: bool = True,
+        H: int = 5,
+        A_block: int = 5,
+        K: int = 128,
+        n_iters: int = 30,
+        var_scale: float = 1.0,
+        temperature: float = 0.5,
+        sigma_floor: float = 0.05,
+        prior_sigma_scale: float = 2.0,
+        history_size: int = 3,
+        device: str = "cuda",
+    ):
+        self.device = torch.device(device)
+        self.use_prism = bool(use_prism)
+        self.H = H
+        self.A_block = A_block
+        self.K = K
+        self.n_iters = n_iters
+        self.var_scale = var_scale
+        self.beta = 1.0 / temperature
+        self.sigma_floor = sigma_floor
+        self.prior_sigma_scale = prior_sigma_scale
+        self.history_size = history_size
+        # ---- Load LeWM (encoder + AR predictor, pickled) ----
+        print(f"[init] loading LeWM ckpt: {lewm_ckpt}")
+        self.lewm = torch.load(
+            str(lewm_ckpt), map_location=self.device, weights_only=False,
+        )
+        self.lewm.to(self.device).eval()
+        for p in self.lewm.parameters():
+            p.requires_grad_(False)
+        # ---- Load PRISM prior head + scaler (scaler always used; head conditionally) ----
+        print(f"[init] loading prior head + scaler: {prior_ckpt}")
+        pck = torch.load(str(prior_ckpt), map_location=self.device, weights_only=False)
+        cfg = pck["config"]
+        self.A_raw = int(cfg["A_raw"])
+        assert cfg["H"] == self.H and cfg["A_block"] == self.A_block, (
+            f"Ckpt config mismatch: H={cfg['H']} A_block={cfg['A_block']} "
+            f"vs runtime H={self.H} A_block={self.A_block}"
+        )
+        if self.use_prism:
+            self.head = PriorHead(**cfg).to(self.device).eval()
+            self.head.load_state_dict(pck["state_dict"])
+            for p in self.head.parameters():
+                p.requires_grad_(False)
+        else:
+            self.head = None     # vanilla LeWM-MPPI mode: skip PoG fusion
+        # Action denormalization (raw_action = norm_action * scale + mean) — always loaded
+        self.scaler_mean = torch.tensor(pck["scaler_mean"], device=self.device).float()
+        self.scaler_scale = torch.tensor(pck["scaler_scale"], device=self.device).float()
+        mode_str = "PRISM-MPPI" if self.use_prism else "vanilla LeWM-MPPI (PRISM off)"
+        print(f"[init] mode = {mode_str}")
+        print(f"[init] z_dim={cfg['z_dim']}  H={self.H}  A_block={self.A_block}  "
+              f"A_raw={self.A_raw}")
+        print(f"[init] device={self.device}  K={self.K}  n_iters={self.n_iters}")
+    @torch.no_grad()
+    def _encode(self, img_uint8: np.ndarray) -> torch.Tensor:
+        """uint8 image → (1, D) CLS embedding."""
+        x = _preprocess(img_uint8, self.device)
+        # JEPA.encode expects a dict with 'pixels' shape (B, T, C, H, W)
+        info = {"pixels": x.unsqueeze(1)}  # add T=1 dim
+        info = self.lewm.encode(info)
+        return info["emb"][:, 0]  # (1, D)
+    @torch.no_grad()
+    def _prior(self, z_t: torch.Tensor, z_g: torch.Tensor):
+        """PRISM head: (1, D), (1, D) → (μ, σ) of shape (1, H, A_block, A_raw)
+        in normalized action space."""
+        return self.head(z_t, z_g)
+    @torch.no_grad()
+    def _rollout_costs(
+        self,
+        z_t: torch.Tensor,                # (1, D)
+        z_g: torch.Tensor,                # (1, D)
+        action_candidates: torch.Tensor,   # (1, K, H*A_block, A_raw) normalized
+    ) -> torch.Tensor:                     # (1, K) cost per candidate
+        """Rollout each candidate via LeWM AR predictor, compute final-z MSE to z_g."""
+        B, K, T_total, A = action_candidates.shape
+        assert T_total == self.H * self.A_block
+        D = z_t.shape[-1]
+        HS = self.history_size
+        # Seed embedding history with the current z_t (tile to HS length)
+        # emb: (B*K, HS, D)
+        emb = z_t.unsqueeze(1).expand(B, K, D).reshape(B * K, D)
+        emb = emb.unsqueeze(1).expand(-1, HS, -1).contiguous()
+        # action_seq: (B*K, T_total, A) — env-step actions; predictor consumes them block-by-block
+        act_seq = action_candidates.reshape(B * K, T_total, A)
+        # Group actions into plan-steps of A_block: (B*K, H, A_block * A)
+        act_plan = act_seq.reshape(B * K, self.H, self.A_block * A)
+        # Embed actions via the predictor's action_encoder (Embedder)
+        # act_emb: (B*K, H, action_emb_dim)
+        act_emb = self.lewm.action_encoder(act_plan)
+        # AR rollout
+        for t in range(self.H):
+            emb_trunc = emb[:, -HS:]                 # (B*K, HS, D)
+            act_trunc = act_emb[:, max(0, t - HS + 1): t + 1]  # last HS actions seen
+            # Pad on the left if we don't have HS history of actions yet
+            if act_trunc.shape[1] < HS:
+                pad = act_trunc[:, :1].expand(-1, HS - act_trunc.shape[1], -1)
+                act_trunc = torch.cat([pad, act_trunc], dim=1)
+            pred = self.lewm.predict(emb_trunc, act_trunc)[:, -1:]  # (B*K, 1, D)
+            emb = torch.cat([emb, pred], dim=1)
+        # Final predicted embedding: emb[:, -1]
+        pred_final = emb[:, -1]                       # (B*K, D)
+        goal = z_g.unsqueeze(1).expand(B, K, D).reshape(B * K, D)
+        cost = F.mse_loss(pred_final, goal, reduction="none").sum(dim=-1)  # (B*K,)
+        return cost.reshape(B, K)
+    @torch.no_grad()
+    def plan(self, obs_uint8: np.ndarray, goal_uint8: np.ndarray) -> np.ndarray:
+        """One MPPI optimization (PRISM or vanilla depending on `use_prism`).
+        Returns (A_block, A_raw) actions in raw env units.
+        """
+        # 1. Encode
+        z_t = self._encode(obs_uint8)   # (1, D)
+        z_g = self._encode(goal_uint8)  # (1, D)
+        # 2. Init MPPI distribution N(0, var_scale² I); var_scale is the std
+        shape = (1, self.H * self.A_block, self.A_raw)
+        mean = torch.zeros(shape, device=self.device)
+        std = torch.full(shape, self.var_scale, device=self.device)
+        # 3. (PRISM only) prior in normalized action space + PoG fusion
+        if self.use_prism:
+            mu_p, sg_p = self._prior(z_t, z_g)                # (1, H, A_block, A_raw)
+            mu_p_flat = mu_p.reshape(*shape)
+            sg_p_flat = sg_p.reshape(*shape) * self.prior_sigma_scale
+            mean, std = _pog_fusion(mean, std, mu_p_flat, sg_p_flat, self.sigma_floor)
+        # 4. MPPI iterations (frozen σ — PRISM-MPPI signature when use_prism=True;
+        #    matches stable_worldmodel.solver.MPPISolver default when use_prism=False)
+        for it in range(self.n_iters):
+            noise = torch.randn(
+                1, self.K, self.H * self.A_block, self.A_raw, device=self.device,
+            )
+            cands = mean.unsqueeze(1) + noise * std.unsqueeze(1)
+            # cands: (1, K, H*A_block, A_raw)
+            cost = self._rollout_costs(z_t, z_g, cands)  # (1, K)
+            log_w = -self.beta * (cost - cost.min(dim=-1, keepdim=True).values)
+            w = torch.softmax(log_w, dim=-1)             # (1, K)
+            # Importance-weighted mean update; std FROZEN (PRISM-MPPI)
+            mean = (w.unsqueeze(-1).unsqueeze(-1) * cands).sum(dim=1)
+            # mean: (1, H*A_block, A_raw)
+        # 5. First A_block actions, denormalized
+        first_block_norm = mean[0, : self.A_block]  # (A_block, A_raw)
+        first_block_raw = first_block_norm * self.scaler_scale + self.scaler_mean
+        return first_block_raw.cpu().numpy().astype(np.float32)
+# ===========================================================================
+# Sanity test: load + plan on a sample from the ARX h5
+# ===========================================================================
+if __name__ == "__main__":
+    import argparse
+    ap = argparse.ArgumentParser()
+    ap.add_argument(
+        "--lewm-ckpt", default=".stable-wm/lewm_arx_epoch_100_object.ckpt",
+    )
+    ap.add_argument("--prior-ckpt", default="prior_head_arx.pt")
+    ap.add_argument("--h5", default=".stable-wm/arx_left_cube.h5")
+    ap.add_argument("--seed", type=int, default=0)
+    ap.add_argument("--no-prism", action="store_true",
+                    help="Run vanilla LeWM-MPPI (no PRISM prior). Use for A/B comparison.")
+    args = ap.parse_args()
+    # Load a sample from the ARX h5 — first frame of episode 0 + its goal
+    import h5py
+    print(f"\n[demo] loading sample from {args.h5}")
+    with h5py.File(args.h5, "r") as f:
+        obs = f["pixels"][0]
+        goal = f["goal_pixels"][0]
+        ground_truth_action = f["action"][0]
+    print(f"[demo] obs.shape={obs.shape}  goal.shape={goal.shape}  "
+          f"obs.dtype={obs.dtype}")
+    # Build planner
+    print()
+    planner = PrismMPPIInference(
+        lewm_ckpt=args.lewm_ckpt,
+        prior_ckpt=args.prior_ckpt,
+        use_prism=not args.no_prism,
+        device="cuda" if torch.cuda.is_available() else "cpu",
+    )
+    # Plan
+    mode = "vanilla LeWM-MPPI" if args.no_prism else "PRISM-MPPI"
+    print(f"\n[demo] running {mode} on the sample obs + its goal image…")
+    import time
+    t0 = time.time()
+    actions = planner.plan(obs, goal)
+    dt = time.time() - t0
+    print(f"[demo] planned in {dt:.2f}s")
+    print(f"[demo] action sequence (A_block × A_raw): shape={actions.shape}")
+    print(f"[demo]   first action:        {actions[0].tolist()}")
+    print(f"[demo]   ground-truth (t=0):  {ground_truth_action.tolist()}")
+    print(f"[demo]   |Δ|: {np.linalg.norm(actions[0] - ground_truth_action):.4f}")
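
The two numerical steps that define PRISM-MPPI in the file above, the product-of-Gaussians fusion and the softmax reweighting of candidates, can be worked through on scalars. The sketch below is a dependency-free illustration, not the bundled implementation: it drops the `eps` term used in `_pog_fusion`, and the function names and input numbers are made up for the example.

```python
import math
from typing import List, Sequence, Tuple


def pog_fuse(mean: float, std: float, mu_p: float, sigma_p: float,
             sigma_floor: float = 0.05) -> Tuple[float, float]:
    """Fuse N(mean, std^2) with N(mu_p, sigma_p^2) by adding precisions."""
    tau_base = 1.0 / std ** 2
    tau_p = 1.0 / sigma_p ** 2
    tau = tau_base + tau_p
    fused_mean = (tau_base * mean + tau_p * mu_p) / tau
    fused_std = max(math.sqrt(1.0 / tau), sigma_floor)
    return fused_mean, fused_std


def mppi_weights(costs: Sequence[float], temperature: float = 0.5) -> List[float]:
    """Softmax over -beta * cost with beta = 1 / temperature."""
    beta = 1.0 / temperature
    c_min = min(costs)  # subtracting the minimum keeps exp() well-behaved
    unnorm = [math.exp(-beta * (c - c_min)) for c in costs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


if __name__ == "__main__":
    # Seed N(0, 1); illustrative prior mean 0.8, prior sigma 0.25 widened
    # by prior_sigma_scale = 2.0 before fusion.
    m, s = pog_fuse(0.0, 1.0, 0.8, 0.25 * 2.0)
    print(round(m, 3), round(s, 3))  # about 0.64 and 0.447

    flat = mppi_weights([1.00, 1.01, 0.99, 1.00])   # nearly uniform weights
    sharp = mppi_weights([1.0, 3.0, 0.2, 2.0])      # mass on the cheapest sample
    print([round(w, 3) for w in flat])
    print([round(w, 3) for w in sharp])
```

With a nearly flat cost vector the weights stay close to uniform, so the weighted mean barely moves away from the fused prior mean. This is one way to read the README's remark that cost-rescoring is dormant on this task.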

jepa.py ADDED

	@@ -0,0 +1,153 @@

+"""JEPA Implementation"""
+import torch
+import torch.nn.functional as F
+from einops import rearrange
+from torch import nn
+def detach_clone(v):
+    return v.detach().clone() if torch.is_tensor(v) else v
+class JEPA(nn.Module):
+    def __init__(
+        self,
+        encoder,
+        predictor,
+        action_encoder,
+        projector=None,
+        pred_proj=None,
+    ):
+        super().__init__()
+        self.encoder = encoder
+        self.predictor = predictor
+        self.action_encoder = action_encoder
+        self.projector = projector or nn.Identity()
+        self.pred_proj = pred_proj or nn.Identity()
+    def encode(self, info):
+        """Encode observations and actions into embeddings.
+        info: dict with a 'pixels' key of shape (B, T, C, H, W) and an optional 'action' key
+        """
+        pixels = info['pixels'].float()
+        b = pixels.size(0)
+        pixels = rearrange(pixels, "b t ... -> (b t) ...") # flatten for encoding
+        output = self.encoder(pixels, interpolate_pos_encoding=True)
+        pixels_emb = output.last_hidden_state[:, 0]  # cls token
+        emb = self.projector(pixels_emb)
+        info["emb"] = rearrange(emb, "(b t) d -> b t d", b=b)
+        if "action" in info:
+            info["act_emb"] = self.action_encoder(info["action"])
+        return info
+    def predict(self, emb, act_emb):
+        """Predict next state embedding
+        emb: (B, T, D)
+        act_emb: (B, T, A_emb)
+        """
+        preds = self.predictor(emb, act_emb)
+        preds = self.pred_proj(rearrange(preds, "b t d -> (b t) d"))
+        preds = rearrange(preds, "(b t) d -> b t d", b=emb.size(0))
+        return preds
+    ####################
+    ## Inference only ##
+    ####################
+    def rollout(self, info, action_sequence, history_size: int = 3):
+        """Rollout the model given an initial info dict and action sequence.
+        pixels: (B, S, T, C, H, W)
+        action_sequence: (B, S, T, action_dim)
+         - S is the number of action plan samples
+         - T is the time horizon
+        """
+        assert "pixels" in info, "pixels not in info_dict"
+        H = info["pixels"].size(2)
+        B, S, T = action_sequence.shape[:3]
+        act_0, act_future = torch.split(action_sequence, [H, T - H], dim=2)
+        info["action"] = act_0
+        n_steps = T - H
+        # copy and encode initial info dict
+        _init = {k: v[:, 0] for k, v in info.items() if torch.is_tensor(v)}
+        _init = self.encode(_init)
+        emb = info["emb"] = _init["emb"].unsqueeze(1).expand(B, S, -1, -1)
+        _init = {k: detach_clone(v) for k, v in _init.items()}
+        # flatten batch and sample dimensions for rollout
+        emb = rearrange(emb, "b s ... -> (b s) ...").clone()
+        act = rearrange(act_0, "b s ... -> (b s) ...")
+        act_future = rearrange(act_future, "b s ... -> (b s) ...")
+        # rollout predictor autoregressively for n_steps
+        HS = history_size
+        for t in range(n_steps):
+            act_emb = self.action_encoder(act)
+            emb_trunc = emb[:, -HS:]  # (BS, HS, D)
+            act_trunc = act_emb[:, -HS:]  # (BS, HS, A_emb)
+            pred_emb = self.predict(emb_trunc, act_trunc)[:, -1:]  # (BS, 1, D)
+            emb = torch.cat([emb, pred_emb], dim=1)  # (BS, T+1, D)
+            next_act = act_future[:, t : t + 1, :]  # (BS, 1, action_dim)
+            act = torch.cat([act, next_act], dim=1)  # (BS, T+1, action_dim)
+        # predict the last state
+        act_emb = self.action_encoder(act)  # (BS, T, A_emb)
+        emb_trunc = emb[:, -HS:]  # (BS, HS, D)
+        act_trunc = act_emb[:, -HS:]  # (BS, HS, A_emb)
+        pred_emb = self.predict(emb_trunc, act_trunc)[:, -1:]  # (BS, 1, D)
+        emb = torch.cat([emb, pred_emb], dim=1)
+        # unflatten batch and sample dimensions
+        pred_rollout = rearrange(emb, "(b s) ... -> b s ...", b=B, s=S)
+        info["predicted_emb"] = pred_rollout
+        return info
+    def criterion(self, info_dict: dict):
+        """Compute the cost between predicted embeddings and goal embeddings."""
+        pred_emb = info_dict["predicted_emb"]  # (B,S, T-1, dim)
+        goal_emb = info_dict["goal_emb"]  # (B, S, T, dim)
+        goal_emb = goal_emb[..., -1:, :].expand_as(pred_emb)
+        # return last-step cost per action candidate
+        cost = F.mse_loss(
+            pred_emb[..., -1:, :],
+            goal_emb[..., -1:, :].detach(),
+            reduction="none",
+        ).sum(dim=tuple(range(2, pred_emb.ndim)))  # (B, S)
+        return cost
+    def get_cost(self, info_dict: dict, action_candidates: torch.Tensor):
+        """ Compute the cost of action candidates given an info dict with goal and initial state."""
+        assert "goal" in info_dict, "goal not in info_dict"
+        device = next(self.parameters()).device
+        for k in list(info_dict.keys()):
+            if torch.is_tensor(info_dict[k]):
+                info_dict[k] = info_dict[k].to(device)
+        goal = {k: v[:, 0] for k, v in info_dict.items() if torch.is_tensor(v)}
+        goal["pixels"] = goal["goal"]
+        for k in info_dict:
+            if k.startswith("goal_"):
+                goal[k[len("goal_") :]] = goal.pop(k)
+        goal.pop("action")
+        goal = self.encode(goal)
+        info_dict["goal_emb"] = goal["emb"]
+        info_dict = self.rollout(info_dict, action_candidates)
+        cost = self.criterion(info_dict)
+        return cost
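
The control flow of `JEPA.rollout` and `JEPA.criterion` above (truncate to the last `history_size` steps, predict one step, append, and finally score the last predicted state against the goal) can be traced on plain Python lists. This is a toy sketch under stated assumptions: embeddings and actions are scalars, and `toy_predict` is a made-up stand-in for the learned predictor, so only the bookkeeping mirrors the class.

```python
from typing import Callable, List, Sequence

Predictor = Callable[[Sequence[float], Sequence[float]], float]


def rollout(context_emb: Sequence[float], actions: Sequence[float],
            predict: Predictor, history_size: int = 3) -> List[float]:
    """Autoregressive rollout with a sliding history window.

    context_emb: embeddings of the H observed frames.
    actions:     T actions; the first H pair with the context frames.
    Returns T + 1 embeddings (H observed, then T - H + 1 predicted).
    """
    h = len(context_emb)
    emb = list(context_emb)
    act = list(actions[:h])
    for next_act in actions[h:]:
        # Only the most recent `history_size` steps are visible to the predictor.
        emb.append(predict(emb[-history_size:], act[-history_size:]))
        act.append(next_act)
    # One more prediction consumes the final action.
    emb.append(predict(emb[-history_size:], act[-history_size:]))
    return emb


def last_step_cost(pred_emb: Sequence[float], goal: float) -> float:
    """Squared error between the final predicted state and the goal."""
    return (pred_emb[-1] - goal) ** 2


def toy_predict(emb_window: Sequence[float], act_window: Sequence[float]) -> float:
    """Hypothetical dynamics: next state = latest state + latest action."""
    return emb_window[-1] + act_window[-1]


if __name__ == "__main__":
    traj = rollout([0.0], [1.0, 2.0, 3.0], toy_predict)
    print(traj)                            # [0.0, 1.0, 3.0, 6.0]
    print(last_step_cost(traj, goal=5.0))  # 1.0
```

As in the class, a window shorter than `history_size` is simply used as-is, because slicing with `[-history_size:]` returns whatever history exists.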

lewm_red_cube_epoch_100_object.ckpt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ca701f5496e3a23b0dbae8fa9a66f5de980abe729845739f75fce3556b97fe8e
+size 72355236
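
The three lines above are a Git LFS pointer: the spec version, the SHA-256 of the real file, and its size in bytes. Because the loader unpickles this checkpoint with `torch.load(..., weights_only=False)`, and unpickling can execute arbitrary code, it may be worth checking a downloaded copy against the pointer first. The sketch below uses only the standard library; the helper names are hypothetical and it assumes the checkpoint sits in the working directory.

```python
import hashlib
from pathlib import Path

# Pointer text copied from the diff above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:ca701f5496e3a23b0dbae8fa9a66f5de980abe729845739f75fce3556b97fe8e
size 72355236
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo,
            "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's size and hash both match the pointer."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False
    h = hashlib.new(pointer["algo"])
    with path.open("rb") as f:
        # Stream in chunks so the ~72 MB file is never fully held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == pointer["digest"]


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER)
    ckpt = Path("lewm_red_cube_epoch_100_object.ckpt")
    if ckpt.exists():
        print("checkpoint matches pointer:", matches_pointer(ckpt, pointer))
    else:
        print("checkpoint not found; expected size:", pointer["size"])
```

A matching hash only shows that the file is the one committed here; it says nothing about whether the pickled contents are trustworthy.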

module.py ADDED

	@@ -0,0 +1,306 @@

+import torch
+from torch import nn
+import torch.nn.functional as F
+from einops import rearrange
+def modulate(x, shift, scale):
+    """AdaLN-zero modulation"""
+    return x * (1 + scale) + shift
+class SIGReg(torch.nn.Module):
+    """Sketched Isotropic Gaussian Regularizer (single-GPU!)"""
+    def __init__(self, knots=17, num_proj=1024):
+        super().__init__()
+        self.num_proj = num_proj
+        t = torch.linspace(0, 3, knots, dtype=torch.float32)
+        dt = 3 / (knots - 1)
+        weights = torch.full((knots,), 2 * dt, dtype=torch.float32)
+        weights[[0, -1]] = dt
+        window = torch.exp(-t.square() / 2.0)
+        self.register_buffer("t", t)
+        self.register_buffer("phi", window)
+        self.register_buffer("weights", weights * window)
+    def forward(self, proj):
+        """
+        proj: (T, B, D)
+        """
+        # sample random projections
+        A = torch.randn(proj.size(-1), self.num_proj, device=proj.device)
+        A = A.div_(A.norm(p=2, dim=0))
+        # compute the Epps-Pulley statistic
+        x_t = (proj @ A).unsqueeze(-1) * self.t
+        err = (x_t.cos().mean(-3) - self.phi).square() + x_t.sin().mean(-3).square()
+        statistic = (err @ self.weights) * proj.size(-2)
+        return statistic.mean() # average over projections and time
+class FeedForward(nn.Module):
+    """FeedForward network used in Transformers"""
+    def __init__(self, dim, hidden_dim, dropout=0.0):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.LayerNorm(dim),
+            nn.Linear(dim, hidden_dim),
+            nn.GELU(),
+            nn.Dropout(dropout),
+            nn.Linear(hidden_dim, dim),
+            nn.Dropout(dropout),
+        )
+    def forward(self, x):
+        return self.net(x)
+class Attention(nn.Module):
+    """Scaled dot-product attention with causal masking"""
+    def __init__(self, dim, heads=8, dim_head=64, dropout=0.0):
+        super().__init__()
+        inner_dim = dim_head * heads
+        project_out = not (heads == 1 and dim_head == dim)
+        self.heads = heads
+        self.scale = dim_head**-0.5
+        self.dropout = dropout
+        self.norm = nn.LayerNorm(dim)
+        self.attend = nn.Softmax(dim=-1)
+        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
+        self.to_out = (
+            nn.Sequential(nn.Linear(inner_dim, dim), nn.Dropout(dropout))
+            if project_out
+            else nn.Identity()
+        )
+    def forward(self, x, causal=True):
+        """
+        x : (B, T, D)
+        """
+        x = self.norm(x)
+        drop = self.dropout if self.training else 0.0
+        qkv = self.to_qkv(x).chunk(3, dim=-1)  # three tensors of shape (B, T, heads * dim_head)
+        # q, k, v: (B, heads, T, dim_head)
+        q, k, v = (rearrange(t, "b t (h d) -> b h t d", h=self.heads) for t in qkv)
+        out = F.scaled_dot_product_attention(q, k, v, dropout_p=drop, is_causal=causal)
+        out = rearrange(out, "b h t d -> b t (h d)")
+        return self.to_out(out)
+class ConditionalBlock(nn.Module):
+    """Transformer block with AdaLN-zero conditioning"""
+    def __init__(self, dim, heads, dim_head, mlp_dim, dropout=0.0):
+        super().__init__()
+        self.attn = Attention(dim, heads=heads, dim_head=dim_head, dropout=dropout)
+        self.mlp = FeedForward(dim, mlp_dim, dropout=dropout)
+        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
+        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
+        self.adaLN_modulation = nn.Sequential(
+            nn.SiLU(), nn.Linear(dim, 6 * dim, bias=True)
+        )
+        nn.init.constant_(self.adaLN_modulation[-1].weight, 0)
+        nn.init.constant_(self.adaLN_modulation[-1].bias, 0)
+    def forward(self, x, c):
+        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
+            self.adaLN_modulation(c).chunk(6, dim=-1)
+        )
+        x = x + gate_msa * self.attn(modulate(self.norm1(x), shift_msa, scale_msa))
+        x = x + gate_mlp * self.mlp(modulate(self.norm2(x), shift_mlp, scale_mlp))
+        return x
+class Block(nn.Module):
+    """Standard Transformer block"""
+    def __init__(self, dim, heads, dim_head, mlp_dim, dropout=0.0):
+        super().__init__()
+        self.attn = Attention(dim, heads=heads, dim_head=dim_head, dropout=dropout)
+        self.mlp = FeedForward(dim, mlp_dim, dropout=dropout)
+        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
+        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
+    def forward(self, x):
+        x = x + self.attn(self.norm1(x))
+        x = x + self.mlp(self.norm2(x))
+        return x
+class Transformer(nn.Module):
+    """Standard Transformer with support for AdaLN-zero blocks"""
+    def __init__(
+        self,
+        input_dim,
+        hidden_dim,
+        output_dim,
+        depth,
+        heads,
+        dim_head,
+        mlp_dim,
+        dropout=0.0,
+        block_class=Block,
+    ):
+        super().__init__()
+        self.norm = nn.LayerNorm(hidden_dim)
+        self.layers = nn.ModuleList([])
+        self.input_proj = (
+            nn.Linear(input_dim, hidden_dim)
+            if input_dim != hidden_dim
+            else nn.Identity()
+        )
+        self.cond_proj = (
+            nn.Linear(input_dim, hidden_dim)
+            if input_dim != hidden_dim
+            else nn.Identity()
+        )
+        self.output_proj = (
+            nn.Linear(hidden_dim, output_dim)
+            if hidden_dim != output_dim
+            else nn.Identity()
+        )
+        for _ in range(depth):
+            self.layers.append(
+                block_class(hidden_dim, heads, dim_head, mlp_dim, dropout)
+            )
+    def forward(self, x, c=None):
+        if hasattr(self, "input_proj"):
+            x = self.input_proj(x)
+        if c is not None and hasattr(self, "cond_proj"):
+            c = self.cond_proj(c)
+        for block in self.layers:
+            x = block(x) if isinstance(block, Block) else block(x, c)
+        x = self.norm(x)
+        if hasattr(self, "output_proj"):
+            x = self.output_proj(x)
+        return x
+class Embedder(nn.Module):
+    def __init__(
+        self,
+        input_dim=10,
+        smoothed_dim=10,
+        emb_dim=10,
+        mlp_scale=4,
+    ):
+        super().__init__()
+        self.patch_embed = nn.Conv1d(input_dim, smoothed_dim, kernel_size=1, stride=1)
+        self.embed = nn.Sequential(
+            nn.Linear(smoothed_dim, mlp_scale * emb_dim),
+            nn.SiLU(),
+            nn.Linear(mlp_scale * emb_dim, emb_dim),
+        )
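+    def forward(self, x):
+        """
+        x: (B, T, D) -> returns (B, T, emb_dim)
+        """
+        x = x.float()
+        # Conv1d expects channels first: (B, T, D) -> (B, D, T)
+        x = x.permute(0, 2, 1)
+        x = self.patch_embed(x)
+        # Back to (B, T, smoothed_dim) for the per-step MLP
+        x = x.permute(0, 2, 1)
+        x = self.embed(x)
+        return x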
+class MLP(nn.Module):
+    """Simple MLP with optional normalization and activation"""
+    def __init__(
+        self,
+        input_dim,
+        hidden_dim,
+        output_dim=None,
+        norm_fn=nn.LayerNorm,
+        act_fn=nn.GELU,
+    ):
+        super().__init__()
+        norm_fn = norm_fn(hidden_dim) if norm_fn is not None else nn.Identity()
+        self.net = nn.Sequential(
+            nn.Linear(input_dim, hidden_dim),
+            norm_fn,
+            act_fn(),
+            nn.Linear(hidden_dim, output_dim or input_dim),
+        )
+    def forward(self, x):
+        """
+        x: (B*T, D)
+        """
+        return self.net(x)
+class ActionEncoder2DWrapper(nn.Module):
+    """Slices 6-dim raw action input down to 2-dim (dx, dy) before encoding.
+    Accepts either (..., frameskip*6) or (..., frameskip*2) trailing dim
+    and adapts. Lives in module.py so it's importable from any script
+    that already imports `module` to use other LeWM components.
+    """
+    def __init__(self, inner: nn.Module, frameskip: int = 5):
+        super().__init__()
+        self.inner = inner
+        self.frameskip = frameskip
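+    def forward(self, x):
+        # Only the 6-dim layout is sliced; a (..., frameskip*2) input passes through unchanged.
+        if x.shape[-1] == self.frameskip * 6:
+            lead = x.shape[:-1]  # all leading (batch/time) dims
+            x = x.reshape(*lead, self.frameskip, 6)
+            x = x[..., :2]  # keep (dx, dy) of each sub-step
+            x = x.reshape(*lead, self.frameskip * 2)
+        return self.inner(x)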
+class ARPredictor(nn.Module):
+    """Autoregressive predictor for next-step embedding prediction."""
+    def __init__(
+        self,
+        *,
+        num_frames,
+        depth,
+        heads,
+        mlp_dim,
+        input_dim,
+        hidden_dim,
+        output_dim=None,
+        dim_head=64,
+        dropout=0.0,
+        emb_dropout=0.0,
+    ):
+        super().__init__()
+        self.pos_embedding = nn.Parameter(torch.randn(1, num_frames, input_dim))
+        self.dropout = nn.Dropout(emb_dropout)
+        self.transformer = Transformer(
+            input_dim,
+            hidden_dim,
+            output_dim or input_dim,
+            depth,
+            heads,
+            dim_head,
+            mlp_dim,
+            dropout,
+            block_class=ConditionalBlock,
+        )
+    def forward(self, x, c):
+        """
+        x: (B, T, d)
+        c: (B, T, act_dim)
+        """
+        T = x.size(1)
+        x = x + self.pos_embedding[:, :T]
+        x = self.dropout(x)
+        x = self.transformer(x, c)
+        return x
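
The slicing done by `ActionEncoder2DWrapper.forward` in module.py can be illustrated without PyTorch. The block below is a plain-Python analogue on a flat list, offered only as a sketch: `slice_actions_to_xy` and its `raw_dim` / `keep` parameters are hypothetical names that do not exist in the bundle, and the real wrapper operates on tensors with arbitrary leading dimensions.

```python
def slice_actions_to_xy(flat, frameskip=5, raw_dim=6, keep=2):
    """Keep the first `keep` values of every `raw_dim`-sized sub-step.

    Illustrative stand-in for the tensor reshape/slice/reshape in
    ActionEncoder2DWrapper.forward; not part of the shipped code.
    """
    if len(flat) != frameskip * raw_dim:
        # Mirrors the wrapper: any other layout is passed through untouched.
        return list(flat)
    # Split the flat block into `frameskip` sub-steps of `raw_dim` values each.
    steps = [flat[i * raw_dim:(i + 1) * raw_dim] for i in range(frameskip)]
    # Keep only the leading (dx, dy) pair of each sub-step and flatten again.
    return [value for step in steps for value in step[:keep]]


raw_block = list(range(30))        # 5 sub-steps x 6 raw action dims
sliced = slice_actions_to_xy(raw_block)
already_2d = slice_actions_to_xy(list(range(10)))
print(sliced)
print(already_2d)
```

With the defaults, a 30-value block shrinks to 10 values, while an input that is already `frameskip * 2` long comes back unchanged, matching the two layouts the wrapper's docstring says it accepts.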

prior_head.py ADDED

	@@ -0,0 +1,73 @@

+"""PriorHead — MLP that maps (z_t, z_g) → Gaussian over an action sequence.
+Per the §9 design: the input is concat(z_t, z_g) ∈ R^{2D} and the output is (μ, σ)
+over H × A_block × A_raw normalized actions. σ is input-dependent: softplus of the
+raw network output plus a fixed floor. The head's output lives in
+StandardScaler-normalized action space; the eval-side policy must apply the
+inverse transform to recover env action units.
+"""
+from __future__ import annotations
+import torch
+from torch import nn
+import torch.nn.functional as F
+class PriorHead(nn.Module):
+    def __init__(
+        self,
+        z_dim: int,
+        H: int,
+        A_block: int,
+        A_raw: int,
+        hidden: int = 512,
+        sigma_floor: float = 0.05,
+    ):
+        super().__init__()
+        self.z_dim = z_dim
+        self.H = H
+        self.A_block = A_block
+        self.A_raw = A_raw
+        self.action_seq_dim = H * A_block * A_raw
+        self.sigma_floor = sigma_floor
+        self.mlp = nn.Sequential(
+            nn.Linear(2 * z_dim, hidden),
+            nn.GELU(),
+            nn.Linear(hidden, hidden),
+            nn.GELU(),
+            nn.Linear(hidden, 2 * self.action_seq_dim),
+        )
+    def forward(self, z_t: torch.Tensor, z_g: torch.Tensor):
+        """z_t, z_g: (B, D). Returns (mu, sigma) each of shape (B, H, A_block, A_raw)."""
+        x = torch.cat([z_t, z_g], dim=-1)
+        out = self.mlp(x)
+        # Second half is an unconstrained pre-activation, not log σ: softplus maps it to σ > 0.
+        mu_flat, raw_sigma_flat = out.chunk(2, dim=-1)
+        sigma_flat = F.softplus(raw_sigma_flat) + self.sigma_floor
+        B = mu_flat.size(0)
+        shape = (B, self.H, self.A_block, self.A_raw)
+        return mu_flat.view(shape), sigma_flat.view(shape)
+def beta_nll_loss(
+    mu: torch.Tensor,
+    sigma: torch.Tensor,
+    target: torch.Tensor,
+    beta: float = 0.5,
+) -> torch.Tensor:
+    """β-NLL (Seitzer et al. 2022).
+    Standard Gaussian NLL per element (dropping additive constant):
+        nll_i = 0.5 (y_i − μ_i)² / σ_i² + log σ_i
+    β-NLL multiplies by stop_grad(σ_i^(2β)) before averaging:
+        L = mean_i [ stop_grad(σ_i^{2β}) · nll_i ]
+    β=0.5 is the recommended robust default — keeps σ-gradient alive but
+    prevents the σ-blow-up pathology of vanilla NLL when μ is hard to fit.
+    """
+    var = sigma.pow(2)
+    log_sigma = sigma.log()
+    sq_err = (target - mu).pow(2)
+    nll = 0.5 * sq_err / var + log_sigma
+    weight = sigma.detach().pow(2 * beta)
+    return (weight * nll).mean()
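
The β-NLL formula in the docstring above can be checked by hand on single numbers. The following is a minimal scalar sketch in plain Python, assuming only the two formulas stated in that docstring; `beta_nll_scalar` is a hypothetical helper, not part of prior_head.py, and it has no notion of gradients, so the stop-gradient is represented by treating the weight as an ordinary constant.

```python
import math


def beta_nll_scalar(mu, sigma, target, beta=0.5):
    """Scalar version of the beta-NLL term for one element (illustration only)."""
    # Gaussian NLL without the additive constant: 0.5 * (y - mu)^2 / sigma^2 + log(sigma)
    nll = 0.5 * (target - mu) ** 2 / sigma ** 2 + math.log(sigma)
    # In the torch code this factor is detached; here it is just a number.
    weight = sigma ** (2 * beta)
    return weight * nll


# sigma = 1: the weight is 1 and log(sigma) is 0, so the loss is half the squared error.
loss_unit = beta_nll_scalar(mu=0.0, sigma=1.0, target=2.0)

# beta = 0 turns the weight into 1 and recovers the plain Gaussian NLL.
plain = beta_nll_scalar(mu=0.0, sigma=2.0, target=2.0, beta=0.0)

# beta = 0.5 multiplies the same NLL by sigma itself (sigma^(2 * 0.5)).
weighted = beta_nll_scalar(mu=0.0, sigma=2.0, target=2.0, beta=0.5)

print(loss_unit, plain, weighted)
```

With σ = 2 and β = 0.5 the weight is exactly σ, so `weighted` is twice `plain`. In the tensor version this scaling affects the gradients of each element rather than the location of the optimum, because the weight is detached.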

prior_head_red_cube.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a0026eda7168dc515df27fadeb415ac2e6bd25fb98206b0db6f34fb365d6c591
+size 2359141