| # ACT: Pantheon YAM 'swap screwdriver head' (**Velocity-Normalized**) |
|
|
| ACT (Action-Chunking Transformer) policy for a bimanual YAM manipulation task, trained as |
| one arm of a **Velocity-Normalization (VN) ablation**. This checkpoint: VN. |
|
|
| Trained on **velocity-normalized** demonstrations (VN re-times each demo to a consistent speed profile: idle removed, fast motion slowed, cruising aligned). |
|
|
| > **Companion model:** [atharva-pantheon/act-pantheon-yam-screwdriver-naive](https://huggingface.co/atharva-pantheon/act-pantheon-yam-screwdriver-naive), the other arm |
| > of the ablation (same data, same config, uniform 30→10 Hz downsample). |
|
|
| --- |
|
|
| ## Research summary |
|
|
| **Question.** Does Velocity-Normalization (VN) preprocessing, which re-times teleop demos so a |
| policy sees a consistent end-effector speed distribution, change what an ACT policy learns |
| when the underlying demonstrations are held fixed? |
|
|
| **Setup.** A controlled ablation: two ACT policies, identical architecture, hyperparameters, |
| and seed; the **only** difference is how source frames are selected when building the 10 Hz |
| training set. |
|
|
| | | this model (VN) | companion | |
| |---|---|---| |
| | frame selection | velocity-normalized 30→10 Hz | uniform 30→10 Hz downsample | |
|
|
| ### Data |
| - **Task:** *"swap the tool head on the screwdriver"* (task_index 16): **379 episodes, ~5.2 h**, |
| the task with the most episodes. Bimanual YAM, 14-D joints (zero-padded to 20), 3 cameras |
| (top / wrist-L / wrist-R), 30 fps. |
| - Both training sets built at **224×224, 10 Hz**, same writer, same action labeling. VN frame |
| selection is the sole variable. |
| |
| ### Velocity Normalization (VN) |
| Implementation: <https://github.com/vovw/vn-pipeline>. End-effector speed via forward |
| kinematics (pinocchio, YAM URDF `link_6`), bimanual speed = max over arms. Two stages: |
| - **Stage 1 (inter-episode):** align each episode's cruising speed (30th percentile) toward the median (clamp 0.75–1.5×). |
| - **Stage 2 (intra-episode):** a smooth monotonic speed map H(s) that slows the fast tail; gripper-event windows and trailing idle preserved. |
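As a rough illustration of the Stage 1 description above, the sketch below computes a per-episode cruising speed (30th percentile of the bimanual speed) and a clamped ratio to the median across episodes. It is not taken from the vn-pipeline code: the function names are hypothetical, and the reading of the factor as `cruise / median` (rather than its inverse) is an assumption.

```python
import numpy as np


def bimanual_speed(left_speed, right_speed):
    """Per-frame bimanual speed: the faster of the two end-effectors."""
    return np.maximum(np.asarray(left_speed), np.asarray(right_speed))


def cruising_speed(speed, pct=30.0):
    """Cruising speed of one episode: the given percentile of its speed trace."""
    return float(np.percentile(speed, pct))


def stage1_factors(episode_speeds, lo=0.75, hi=1.5):
    """Clamped ratio of each episode's cruising speed to the median.

    Assumption: factor = cruise / median, so values above 1 mark episodes
    that cruise faster than the median. The real pipeline may define it
    the other way around.
    """
    cruise = np.array([cruising_speed(s) for s in episode_speeds])
    target = np.median(cruise)
    return np.clip(cruise / target, lo, hi)


# Three toy episodes with constant speeds (m/s), for illustration only.
episodes = [np.full(100, 0.05), np.full(100, 0.10), np.full(100, 0.40)]
factors = stage1_factors(episodes)
```

With the clamp in place, an episode that cruises far from the median is only partially aligned, which matches the reported factor range of 0.75–1.5.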
|
|
| Applied to this task: breakpoints m=0.024, M=0.162 m/s; **566,204 source frames → 211,501 VN frames** |
| (1.12× duration ratio; 1.1% idle dropped; Stage-1 factor median 1.0×, range 0.75–1.5). The naive |
| baseline is a plain uniform 3× downsample (≈189k frames, idle kept). |
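The frame counts above can be cross-checked with a few lines of arithmetic. The sketch below also shows what a uniform 30→10 Hz downsample looks like as index selection; `uniform_downsample_indices` is a hypothetical helper, not a function from the pipeline, and the total is computed over all frames at once, so per-episode rounding may shift it slightly.

```python
def uniform_downsample_indices(n_frames, src_hz=30, dst_hz=10):
    """Indices kept by a uniform downsample (every 3rd frame for 30 -> 10 Hz)."""
    if src_hz % dst_hz != 0:
        raise ValueError("src_hz must be an integer multiple of dst_hz")
    stride = src_hz // dst_hz
    return list(range(0, n_frames, stride))


source_frames = 566_204   # 30 Hz source frames (from the text above)
vn_frames = 211_501       # frames kept by VN (from the text above)

naive_frames = source_frames / 3            # uniform 3x downsample
duration_ratio = vn_frames / naive_frames   # VN duration relative to naive

print(f"naive frames ~ {naive_frames:,.0f}, duration ratio ~ {duration_ratio:.2f}")
```

The ratio comes out at about 1.12, consistent with the reported duration ratio: VN slows the fast tail more than it removes idle frames.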
|
|
| ### Training |
| ACT, ResNet18 (ImageNet) vision backbone, `chunk_size=30`, `n_action_steps=30`, `n_obs_steps=1`, |
| 224×224, batch 64, lr 1e-5, **10,000 steps** (≈6.7 epochs), seed 1000. A100 80 GB, both runs concurrent. |
| Logged to W&B project `vn-act-screwdriver`. |
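To make the chunking settings concrete: with `chunk_size=30` and `n_action_steps=30` on 10 Hz data, each prediction covers 3 s of actions and the whole chunk is executed before the policy is queried again. The sketch below illustrates that open-loop pattern with a plain queue. `predict_chunk` is a hypothetical stand-in for the policy, and a 10 Hz control rate at deployment is an assumption.

```python
from collections import deque

CHUNK_SIZE = 30        # actions predicted per forward pass
N_ACTION_STEPS = 30    # actions executed before re-querying the policy
CONTROL_HZ = 10        # assumption: deployment runs at the training rate

horizon_s = CHUNK_SIZE / CONTROL_HZ              # seconds covered per chunk
replan_interval_s = N_ACTION_STEPS / CONTROL_HZ  # seconds between queries


def run_chunked(predict_chunk, n_steps, n_action_steps=N_ACTION_STEPS):
    """Execute n_steps actions, refilling the queue only when it is empty."""
    queue = deque()
    executed, calls = [], 0
    for t in range(n_steps):
        if not queue:
            # Keep only the first n_action_steps actions of the new chunk.
            queue.extend(predict_chunk(t)[:n_action_steps])
            calls += 1
        executed.append(queue.popleft())
    return executed, calls


# Toy "policy": the chunk predicted at step t is just t, t+1, ..., t+29.
executed, calls = run_chunked(lambda t: [t + i for i in range(CHUNK_SIZE)], 90)
```

Because the full chunk is consumed before re-planning, any speed inconsistency in the demonstrations is baked into 3 s of motion at a time, which is why execution speed is the quantity of interest in a rollout.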
|
|
| ### Result |
| | run | start loss | final loss (L1+KL) | |
| |---|---|---| |
| | VN | 5.27 | **0.294** | |
| | naive (no-VN) | 5.19 | **0.310** | |
|
|
|  |
|
|
| Both converge tightly (same demonstrations). **Training loss is not the verdict**: VN's intended |
| benefit is *consistent execution speed* at inference, which requires a robot/sim rollout to |
| evaluate. What this run establishes: both policies train cleanly to convergence on identical |
| data with VN frame-selection as the only difference. |
|
|
| ## Files |
| - `model.safetensors`: ACT weights (~52 M params) |
| - `config.json`, `train_config.json`: policy + training config |
| - `policy_preprocessor*/policy_postprocessor*`: input/output normalization (required to run) |
| - `comparison.png`: VN vs no-VN training-loss curves |
|
|
| ## Usage (LeRobot) |
| ```python |
| from lerobot.policies.act.modeling_act import ACTPolicy |
| policy = ACTPolicy.from_pretrained("atharva-pantheon/act-pantheon-yam-screwdriver-vn") |
| ``` |
|
|
| ## Notes / provenance |
| - State/action are 20-D (14-D YAM joints zero-padded; `valid_action_dims=14`). |
| - Built via a v2.0→v3.0 adapter (the vn-pipeline `run_vn.py` targets v3.0 input). |
| - Engineering note: on the 128-core training box, capping `OMP_NUM_THREADS` was essential. |
| Uncapped, `ToTensor` cost ~236 ms/image (thread-dispatch overhead) vs 0.27 ms capped (~1000×), |
| which otherwise starved the GPU. |
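One way to apply such a cap from Python is to set the environment variables before any numerical library is imported, since OpenMP typically reads them at load time. This is a minimal sketch: the helper name and the value 4 are illustrative assumptions, as the cap actually used is not stated above, and including `MKL_NUM_THREADS` is an extra precaution rather than something the note mentions.

```python
import os


def cap_cpu_threads(n_threads=4):
    """Cap per-process CPU thread pools via environment variables.

    Call this before importing numpy or torch; libraries that have already
    initialized their thread pools will not pick up the change.
    """
    capped = {}
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = str(n_threads)
        capped[var] = os.environ[var]
    return capped


settings = cap_cpu_threads(4)  # 4 is an illustrative value, not the one used
```

Setting `OMP_NUM_THREADS` in the shell before launching the training script has the same effect and also covers dataloader worker processes, which inherit the environment.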
|
|