
ACT: Pantheon YAM 'swap screwdriver head', Velocity-Normalized

ACT (Action Chunking with Transformers) policy for a bimanual YAM manipulation task, trained as one arm of a Velocity-Normalization (VN) ablation. This checkpoint is the VN arm.

Trained on velocity-normalized demonstrations (VN re-times each demo to a consistent speed profile: idle removed, fast motion slowed, cruising aligned).

Companion model: atharva-pantheon/act-pantheon-yam-screwdriver-naive, the other arm of the ablation (same data, same config, uniform 30→10 Hz downsample).


Research summary

Question. Does Velocity-Normalization (VN) preprocessing, i.e. re-timing teleop demos so a policy sees a consistent end-effector speed distribution, change what an ACT policy learns, holding the underlying demonstrations fixed?

Setup. A controlled ablation: two ACT policies, identical architecture, hyperparameters, and seed; the only difference is how source frames are selected when building the 10 Hz training set.

| | this model (VN) | companion |
|---|---|---|
| frame selection | velocity-normalized 30→10 Hz | uniform 30→10 Hz downsample |
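
To make the difference between the two frame-selection schemes concrete, here is a minimal sketch. The uniform baseline keeps every third source frame. The re-timed variant stretches each source frame's duration by `speed / target_speed` and then samples at a fixed 10 Hz in the warped timeline. The function names and the `speed_map` argument are hypothetical and are not taken from vn-pipeline; idle removal and gripper-event handling are omitted.

```python
import numpy as np

def uniform_indices(n_frames, stride=3):
    """Naive baseline: keep every `stride`-th source frame (30 Hz -> 10 Hz)."""
    return np.arange(0, n_frames, stride)

def retimed_indices(speed, speed_map, src_fps=30.0, out_fps=10.0):
    """Select source frames so that playback at out_fps moves at speed_map(speed).

    speed:     per-frame end-effector speed in m/s, shape (n_frames,)
    speed_map: hypothetical monotonic map from source speed to target speed
    """
    speed = np.asarray(speed, dtype=float)
    target = np.maximum(speed_map(speed), 1e-6)
    # Stretch factor per source frame; stationary frames are left unchanged.
    stretch = np.where(speed > 0, speed / target, 1.0)
    # Warped start time of each source frame, measured in source-frame units.
    warped = np.concatenate([[0.0], np.cumsum(stretch[:-1])])
    ticks = np.arange(0.0, warped[-1] + 1e-9, src_fps / out_fps)
    # For each output tick, take the latest source frame that has started.
    return np.searchsorted(warped, ticks, side="right") - 1

speed = np.full(10, 0.2)
same = retimed_indices(speed, lambda s: s)          # identity map
slowed = retimed_indices(speed, lambda s: 0.5 * s)  # halve the speed
```

With an identity map the selection reduces to the uniform downsample; halving the target speed roughly doubles the number of output frames for the same motion.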

Data

  • Task: "swap the tool head on the screwdriver" (task_index 16), 379 episodes, ~5.2 h, the task with the most episodes. Bimanual YAM, 14-D joints (zero-padded to 20), 3 cameras (top / wrist-L / wrist-R), 30 fps.
  • Both training sets built at 224×224, 10 Hz, same writer, same action labeling. VN frame selection is the sole variable.
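
The 14-D to 20-D zero-padding can be sketched as follows. This assumes the padding is appended as trailing zeros, so that the first 14 dimensions are the valid ones; the helper names are illustrative only.

```python
import numpy as np

VALID_DIMS = 14   # YAM joint dimensions (both arms)
PADDED_DIMS = 20  # state/action width expected by the policy

def pad_joints(x):
    """Append zeros along the last axis so the width is PADDED_DIMS (assumed layout)."""
    x = np.asarray(x, dtype=np.float32)
    pad = [(0, 0)] * (x.ndim - 1) + [(0, PADDED_DIMS - x.shape[-1])]
    return np.pad(x, pad)

def unpad_action(action):
    """Keep only the valid joint dimensions of a predicted action."""
    return np.asarray(action)[..., :VALID_DIMS]

state = np.ones((5, VALID_DIMS), dtype=np.float32)
padded = pad_joints(state)
restored = unpad_action(padded)
```

Only the first `VALID_DIMS` entries of a predicted action would be sent to the robot under this assumption.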

Velocity Normalization (VN)

Implementation: https://github.com/vovw/vn-pipeline. End-effector speed is computed via forward kinematics (pinocchio, YAM URDF link_6); the bimanual speed is the maximum over the two arms. Two stages:

  • Stage 1 (inter-episode): align each episode's cruising speed (30th-pct) toward the median (clamp 0.75–1.5×).
  • Stage 2 (intra-episode): a smooth monotonic speed map H(s) that slows the fast tail; gripper-event windows and trailing idle preserved.
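
A rough sketch of the speed signal and the Stage 1 factor described above. It assumes end-effector positions are already available (the pipeline obtains them via forward kinematics, which is not reproduced here), and it assumes the factor is a duration stretch equal to `cruise / median`, so that faster-than-median episodes are slowed. The function names are hypothetical; the actual definitions live in vn-pipeline and may differ.

```python
import numpy as np

def ee_speed(positions, fps=30.0):
    """Finite-difference speed (m/s) from positions of shape (T, 3), T >= 2."""
    positions = np.asarray(positions, dtype=float)
    step = np.linalg.norm(np.diff(positions, axis=0), axis=1) * fps
    return np.concatenate([step[:1], step])  # repeat first value to keep length T

def bimanual_speed(left_positions, right_positions, fps=30.0):
    """Per-frame speed of the faster arm."""
    return np.maximum(ee_speed(left_positions, fps), ee_speed(right_positions, fps))

def stage1_factors(episode_speeds, pct=30.0, lo=0.75, hi=1.5):
    """Clamped per-episode factor aligning cruising speed toward the median."""
    cruise = np.array([np.percentile(s, pct) for s in episode_speeds])
    target = np.median(cruise)
    return np.clip(cruise / target, lo, hi)

# Left arm moves 1 cm per frame along x; right arm is stationary.
left = np.stack([np.arange(5) * 0.01, np.zeros(5), np.zeros(5)], axis=1)
right = np.zeros((5, 3))
speed = bimanual_speed(left, right)

factors = stage1_factors([np.full(10, 0.1), np.full(10, 0.2), np.full(10, 0.4)])
```

The clamp keeps any single episode from being stretched or compressed by more than the stated 0.75–1.5× range, however far its cruising speed is from the median.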

Applied to this task: breakpoints m=0.024 and M=0.162 m/s; 566,204 source frames → 211,501 VN frames (1.12× duration ratio; 1.1% idle dropped; Stage-1 factor median 1.0×, range 0.75–1.5). The naive baseline is a plain uniform 3× downsample (≈189k frames, idle kept).
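
The frame counts above are mutually consistent, as this short check shows. It assumes the 1.12× duration ratio compares the VN frame count with the naive 3× downsample, both at 10 Hz.

```python
source_frames = 566_204  # 30 Hz source frames
vn_frames = 211_501      # 10 Hz frames after VN

naive_frames = source_frames / 3          # uniform 3x downsample
duration_ratio = vn_frames / naive_frames

naive_hours = naive_frames / 10 / 3600    # 10 Hz -> seconds -> hours
vn_hours = vn_frames / 10 / 3600

print(round(naive_frames), round(duration_ratio, 2))  # 188735 1.12
print(round(naive_hours, 1), round(vn_hours, 1))      # 5.2 5.9
```

The naive duration of about 5.2 h matches the task duration listed under Data.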

Training

ACT, ResNet18 (ImageNet) vision backbone, chunk_size=30, n_action_steps=30, n_obs_steps=1, 224×224, batch 64, lr 1e-5, 10,000 steps (≈6.7 epochs), seed 1000. A100 80 GB, both runs concurrent. Logged to W&B project vn-act-screwdriver.

Result

| run | start loss | final loss (L1+KL) |
|---|---|---|
| VN | 5.27 | 0.294 |
| naive (no-VN) | 5.19 | 0.310 |

[Figure: loss comparison]

Both runs converge to nearly the same training loss (0.294 vs 0.310), as expected given that they share the same demonstrations. Training loss is not the verdict: VN's intended benefit is consistent execution speed at inference, which requires a robot or sim rollout to evaluate. What this run establishes is that both policies train cleanly to convergence on identical data, with VN frame selection as the only difference.

Files

  • model.safetensors: ACT weights (~52 M params)
  • config.json, train_config.json: policy + training config
  • policy_preprocessor*/policy_postprocessor*: input/output normalization (required to run)
  • comparison.png: VN vs no-VN training-loss curves

Usage (LeRobot)

```python
from lerobot.policies.act.modeling_act import ACTPolicy
# Load the policy weights and config from the Hugging Face Hub
policy = ACTPolicy.from_pretrained("atharva-pantheon/act-pantheon-yam-screwdriver-vn")
```

Notes / provenance

  • State/action are 20-D (14-D YAM joints zero-padded; valid_action_dims=14).
  • Built via a v2.0→v3.0 adapter (the vn-pipeline run_vn.py targets v3.0 input).
  • Engineering note: on the 128-core training box, capping OMP_NUM_THREADS was essential. Uncapped, ToTensor cost 236 ms/image (thread-dispatch overhead) vs 0.27 ms capped (roughly 870× faster); without the cap the data loader starved the GPU.
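
One way to apply the thread cap mentioned in the engineering note is to set the environment variables before any numerical library is imported. The sketch below is illustrative: the helper name is hypothetical, and the value 4 is a placeholder, since the cap actually used for this model is not stated.

```python
import os

def cap_cpu_threads(n_threads: int = 4) -> None:
    """Cap OpenMP/MKL thread pools; call before importing numpy or torch."""
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = str(n_threads)

cap_cpu_threads(4)  # placeholder value, not the one used for this model

# Import heavy libraries only after the cap is in place, e.g.:
# import torch
# torch.set_num_threads(4)  # optional: also caps PyTorch's intra-op pool
```

Exporting `OMP_NUM_THREADS` in the shell before launching training has the same effect and also covers data-loader worker processes, which inherit the environment.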