ACT: Pantheon YAM 'swap screwdriver head' (Velocity-Normalized)

ACT (Action-Chunking Transformer) policy for a bimanual YAM manipulation task, trained as one arm of a Velocity-Normalization (VN) ablation. This checkpoint: VN.

Trained on velocity-normalized demonstrations (VN re-times each demo to a consistent speed profile: idle removed, fast motion slowed, cruising aligned).

Companion model: atharva-pantheon/act-pantheon-yam-screwdriver-naive, the other arm of the ablation (same data, same config, uniform 30→10 Hz downsample).


Research summary

Question. Does Velocity-Normalization (VN) preprocessing, which re-times teleop demos so a policy sees a consistent end-effector speed distribution, change what an ACT policy learns when the underlying demonstrations are held fixed?

Setup. A controlled ablation: two ACT policies, identical architecture, hyperparameters, and seed; the only difference is how source frames are selected when building the 10 Hz training set.

| | this model (VN) | companion |
|---|---|---|
| frame selection | velocity-normalized 30→10 Hz | uniform 30→10 Hz downsample |

Data

  • Task: "swap the tool head on the screwdriver" (task_index 16): 379 episodes, ~5.2 h, the task with the most episodes. Bimanual YAM, 14-D joints (zero-padded to 20), 3 cameras (top / wrist-L / wrist-R), 30 fps.
  • Both training sets built at 224×224, 10 Hz, same writer, same action labeling. VN frame selection is the sole variable.
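The zero-padding mentioned above can be illustrated with a minimal sketch. It assumes the six padding zeros are appended after the 14 joint values; the card does not state where the padding sits, and the helper names `pad_joints` and `unpad_action` are hypothetical, not part of the dataset writer.

```python
from typing import List, Sequence

JOINT_DIMS = 14   # valid YAM joint dimensions (valid_action_dims in this card)
PADDED_DIMS = 20  # state/action width stored in the dataset


def pad_joints(joints: Sequence[float], width: int = PADDED_DIMS) -> List[float]:
    """Right-pad a joint vector with zeros up to `width` (assumed layout)."""
    if len(joints) > width:
        raise ValueError(f"expected at most {width} values, got {len(joints)}")
    return list(joints) + [0.0] * (width - len(joints))


def unpad_action(action: Sequence[float], valid_dims: int = JOINT_DIMS) -> List[float]:
    """Keep only the leading `valid_dims` entries of a padded action."""
    return list(action[:valid_dims])


padded = pad_joints([0.1] * JOINT_DIMS)
recovered = unpad_action(padded)
```

Under this assumption, only the first 14 entries of a predicted 20-D action would be sent to the robot.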

Velocity Normalization (VN)

Implementation: https://github.com/vovw/vn-pipeline. End-effector speed via forward kinematics (pinocchio, YAM URDF link_6), bimanual speed = max over arms. Two stages:

  • Stage 1 (inter-episode): align each episode's cruising speed (30th percentile) toward the median (clamp 0.75–1.5×).
  • Stage 2 (intra-episode): a smooth monotonic speed map H(s) that slows the fast tail; gripper-event windows and trailing idle preserved.
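The Stage 1 idea can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the vn-pipeline code: the factor is taken to be a speed multiplier (target cruising speed divided by the episode's cruising speed), the median is taken across episodes, and the function names are hypothetical. Per-frame speeds are assumed to be precomputed from forward kinematics.

```python
from statistics import median
from typing import List, Sequence


def bimanual_speed(left: Sequence[float], right: Sequence[float]) -> List[float]:
    """Per-frame bimanual speed: the faster of the two end effectors."""
    return [max(l, r) for l, r in zip(left, right)]


def percentile(values: Sequence[float], pct: float) -> float:
    """Linearly interpolated percentile, with pct in [0, 100]."""
    if not values:
        raise ValueError("percentile of an empty sequence")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def stage1_factors(
    episode_speeds: Sequence[Sequence[float]],
    pct: float = 30.0,
    lo: float = 0.75,
    hi: float = 1.5,
) -> List[float]:
    """Clamped per-episode speed multipliers toward the median cruising speed."""
    cruise = [percentile(s, pct) for s in episode_speeds]
    target = median(cruise)
    factors = []
    for c in cruise:
        raw = target / c if c > 0 else 1.0  # assumption: leave all-idle episodes unscaled
        factors.append(min(hi, max(lo, raw)))
    return factors


factors = stage1_factors([[0.1] * 10, [0.2] * 10, [0.05] * 10])
```

In the toy call, the median cruising speed is 0.1 m/s, so the first episode is unchanged while the other two hit the lower and upper clamps.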

Applied to this task: breakpoints m=0.024, M=0.162 m/s; 566,204 source frames → 211,501 VN frames (1.12× duration ratio; 1.1% idle dropped; Stage-1 factor median 1.0×, range 0.75–1.5). The naive baseline is a plain uniform 3× downsample (≈189k frames, idle kept).
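The frame counts above can be cross-checked with a short calculation. The sketch assumes the naive baseline keeps every third frame and, for simplicity, treats the whole task as one sequence (the real pipeline presumably downsamples per episode, which would shift the count slightly). `uniform_downsample_indices` is a hypothetical helper.

```python
from typing import List

SOURCE_FRAMES = 566_204  # 30 fps source frames for this task
VN_FRAMES = 211_501      # frames after velocity normalization
SOURCE_FPS = 30


def uniform_downsample_indices(n_frames: int, stride: int = 3) -> List[int]:
    """Indices kept by a plain uniform downsample (every `stride`-th frame)."""
    return list(range(0, n_frames, stride))


naive_frames = len(uniform_downsample_indices(SOURCE_FRAMES))
duration_ratio = VN_FRAMES / naive_frames          # VN length relative to naive
source_hours = SOURCE_FRAMES / SOURCE_FPS / 3600   # total demonstration time

print(f"naive frames: {naive_frames}")
print(f"VN / naive duration ratio: {duration_ratio:.2f}")
print(f"source duration: {source_hours:.1f} h")
```

The ratio above 1 reflects that, at 10 Hz, the VN set is longer than the naive one even though some idle time is dropped, which is consistent with fast motion being slowed.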

Training

ACT, ResNet18 (ImageNet) vision backbone, chunk_size=30, n_action_steps=30, n_obs_steps=1, 224×224, batch 64, lr 1e-5, 10,000 steps (≈6.7 epochs), seed 1000. A100 80 GB, both runs concurrent. Logged to W&B project vn-act-screwdriver.

Result

| run | start loss | final loss (L1+KL) |
|---|---|---|
| VN | 5.27 | 0.294 |
| naive (no-VN) | 5.19 | 0.310 |

[Figure: loss comparison]

Both runs converge to similar losses, as expected given the shared demonstrations. Training loss is not the verdict: VN's intended benefit is consistent execution speed at inference, which requires a robot or sim rollout to evaluate. What this run establishes is that both policies train cleanly to convergence on identical data with VN frame selection as the only difference.

Files

  • model.safetensors: ACT weights (~52 M params)
  • config.json, train_config.json: policy + training config
  • policy_preprocessor*/policy_postprocessor*: input/output normalization (required to run)
  • comparison.png: VN vs no-VN training-loss curves

Usage (LeRobot)

from lerobot.policies.act.modeling_act import ACTPolicy
policy = ACTPolicy.from_pretrained("atharva-pantheon/act-pantheon-yam-screwdriver-vn")

Notes / provenance

  • State/action are 20-D (14-D YAM joints zero-padded; valid_action_dims=14).
  • Built via a v2.0→v3.0 adapter (the vn-pipeline run_vn.py targets v3.0 input).
  • Engineering note: on the 128-core training box, capping OMP_NUM_THREADS was essential. Uncapped, ToTensor cost 236 ms/image (thread-dispatch overhead) versus 0.27 ms capped (roughly 900× faster), and the uncapped cost starved the GPU.
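
One way to apply the thread cap from the engineering note is to set the environment variable at the top of the training script, before any numerical library initializes its thread pool. The value 8 below is a placeholder assumption, not the setting used for these runs, and `cap_omp_threads` is a hypothetical helper.

```python
import os


def cap_omp_threads(n_threads: int = 8) -> None:
    """Cap the OpenMP thread pool; call before importing torch or numpy."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    os.environ["OMP_NUM_THREADS"] = str(n_threads)


cap_omp_threads(8)  # placeholder value; tune per machine and per dataloader worker
```

Exporting the variable in the shell before launching training has the same effect, and `torch.set_num_threads` is an in-process alternative for PyTorch's own intra-op threads.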