# ACT: Pantheon YAM 'swap screwdriver head' (**Velocity-Normalized**)
ACT (Action Chunking with Transformers) policy for a bimanual YAM manipulation task, trained as
one arm of a **Velocity-Normalization (VN) ablation**. This checkpoint is the VN arm.
It was trained on **velocity-normalized** demonstrations: VN re-times each demo to a consistent speed profile (idle removed, fast motion slowed, cruising speed aligned).
> **Companion model:** [atharva-pantheon/act-pantheon-yam-screwdriver-naive](https://huggingface.co/atharva-pantheon/act-pantheon-yam-screwdriver-naive), the other arm
> of the ablation (same data, same config, uniform 30→10 Hz downsample).
---
## Research summary
**Question.** Does Velocity-Normalization (VN) preprocessing (re-timing teleop demos so a
policy sees a consistent end-effector speed distribution) change what an ACT policy learns,
holding the underlying demonstrations fixed?
**Setup.** A controlled ablation: two ACT policies, identical architecture, hyperparameters,
and seed; the **only** difference is how source frames are selected when building the 10 Hz
training set.
| | this model (VN) | companion |
|---|---|---|
| frame selection | velocity-normalized 30→10 Hz | uniform 30→10 Hz downsample |
### Data
- **Task:** *"swap the tool head on the screwdriver"* (task_index 16): **379 episodes, ~5.2 h**,
the task with the most episodes. Bimanual YAM, 14-D joints (zero-padded to 20), 3 cameras
(top / wrist-L / wrist-R), 30 fps.
- Both training sets built at **224×224, 10 Hz**, same writer, same action labeling. VN frame
selection is the sole variable.
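The 14-D to 20-D zero-padding mentioned above can be sketched as follows. This is a minimal illustration, not the dataset writer's code: it assumes the six padding dimensions are appended after the real joints (the card does not say where the zeros go), and `pad_joints` / `unpad_actions` are hypothetical helper names.

```python
import numpy as np

STATE_DIM = 20   # padded width stated in this card
VALID_DIMS = 14  # real YAM joint dimensions (valid_action_dims=14)


def pad_joints(joints: np.ndarray) -> np.ndarray:
    """Zero-pad a (..., 14) joint array to (..., 20).

    Assumption: padding is trailing, i.e. dims 14..19 are zeros.
    """
    if joints.shape[-1] != VALID_DIMS:
        raise ValueError(f"expected last dim {VALID_DIMS}, got {joints.shape[-1]}")
    # Pad only the last axis, and only on its right-hand side.
    pad_width = [(0, 0)] * (joints.ndim - 1) + [(0, STATE_DIM - VALID_DIMS)]
    return np.pad(joints, pad_width, mode="constant")


def unpad_actions(actions: np.ndarray) -> np.ndarray:
    """Keep only the valid joint dimensions of a padded action array."""
    return actions[..., :VALID_DIMS]
```

When driving the robot, only the first 14 dimensions of a predicted action would be sent to the arms under this assumption.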
### Velocity Normalization (VN)
Implementation: <https://github.com/vovw/vn-pipeline>. End-effector speed is computed via forward
kinematics (pinocchio, YAM URDF `link_6`); bimanual speed is the max over the two arms. VN runs in two stages:
- **Stage 1 (inter-episode):** align each episode's cruising speed (30th percentile) toward the median (clamp 0.75–1.5×).
- **Stage 2 (intra-episode):** a smooth monotonic speed map H(s) that slows the fast tail; gripper-event windows and trailing idle preserved.
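A rough sketch of the Stage 1 idea, under stated assumptions. It is not taken from vn-pipeline: the function names are hypothetical, and the factor convention (a factor above 1 stretches an episode in time, so a faster-than-median episode gets a factor above 1) is an assumption, since the card only gives the clamp range. The per-arm end-effector speeds are taken as given; the forward-kinematics step is omitted.

```python
import numpy as np


def bimanual_speed(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Per-frame bimanual speed: the max over the two arms."""
    return np.maximum(left, right)


def cruising_speed(speed: np.ndarray, pct: float = 30.0) -> float:
    """Cruising speed of one episode, taken as the 30th percentile."""
    return float(np.percentile(speed, pct))


def stage1_factors(episode_speeds, lo: float = 0.75, hi: float = 1.5) -> np.ndarray:
    """Per-episode time-stretch factors, clamped to [lo, hi].

    Assumed convention: factor = episode cruise / median cruise, so an
    episode that cruises faster than the median is stretched (slowed).
    """
    cruise = np.array([cruising_speed(s) for s in episode_speeds])
    target = np.median(cruise)
    return np.clip(cruise / target, lo, hi)
```

The clamp keeps any single episode from being re-timed too aggressively: an episode cruising at four times the median would still only receive a factor of 1.5 in this sketch.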
Applied to this task: breakpoints m = 0.024 m/s, M = 0.162 m/s; **566,204 source frames → 211,501 VN frames**
(1.12× duration ratio; 1.1% idle dropped; Stage-1 factor median 1.0×, range 0.75–1.5×). The naive
baseline is a plain uniform 3× downsample (≈189k frames, idle kept).
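The frame counts above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in this card; the naive frame count is approximate because a real per-episode downsample rounds each episode separately.

```python
SOURCE_FRAMES = 566_204  # 30 fps source frames for this task
VN_FRAMES = 211_501      # frames in the 10 Hz VN training set
SOURCE_FPS = 30
TARGET_FPS = 10

source_seconds = SOURCE_FRAMES / SOURCE_FPS
vn_seconds = VN_FRAMES / TARGET_FPS

# Uniform downsample keeps every third frame (30 Hz -> 10 Hz).
naive_frames = SOURCE_FRAMES / (SOURCE_FPS // TARGET_FPS)

# How much longer the VN data is, in wall-clock terms, than the source.
duration_ratio = vn_seconds / source_seconds

print(f"source duration: {source_seconds / 3600:.2f} h")
print(f"naive frames:    {naive_frames:,.0f}")
print(f"duration ratio:  {duration_ratio:.2f}x")
```

The results are consistent with the figures quoted above: roughly 5.2 h of source data, about 189k naive frames, and a 1.12× duration ratio.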
### Training
ACT, ResNet18 (ImageNet) vision backbone, `chunk_size=30`, `n_action_steps=30`, `n_obs_steps=1`,
224×224, batch 64, lr 1e-5, **10,000 steps** (≈6.7 epochs), seed 1000. A100 80 GB, both runs concurrent.
Logged to W&B project `vn-act-screwdriver`.
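To make the chunking settings concrete: with `chunk_size=30` and `n_action_steps=30`, every predicted action in a chunk is executed before the policy is queried again. The sketch below is a simplified stand-in for that loop, not LeRobot's implementation. It assumes the policy is stepped at the dataset's 10 Hz (so one chunk spans 3 s); `run_open_loop` and `predict_chunk` are hypothetical names.

```python
from collections import deque

CHUNK_SIZE = 30      # actions predicted per forward pass
N_ACTION_STEPS = 30  # actions executed before re-querying the policy
CONTROL_HZ = 10      # assumption: control rate matches the 10 Hz training data


def run_open_loop(predict_chunk, n_steps: int):
    """Execute n_steps control ticks, refilling the queue when it runs dry.

    predict_chunk: callable returning a sequence of CHUNK_SIZE actions.
    Returns the executed actions and the number of policy queries made.
    """
    queue = deque()
    executed = []
    n_queries = 0
    for _ in range(n_steps):
        if not queue:
            chunk = predict_chunk()
            # Only the first N_ACTION_STEPS actions of each chunk are used.
            queue.extend(chunk[:N_ACTION_STEPS])
            n_queries += 1
        executed.append(queue.popleft())
    return executed, n_queries


# Seconds of motion covered by one policy query under the 10 Hz assumption.
seconds_per_query = N_ACTION_STEPS / CONTROL_HZ
```

For example, `run_open_loop(lambda: list(range(CHUNK_SIZE)), 90)` queries the stand-in policy three times over 90 ticks, i.e. once every 3 s.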
### Result
| run | start loss | final loss (L1+KL) |
|---|---|---|
| VN | 5.27 | **0.294** |
| naive (no-VN) | 5.19 | **0.310** |
![loss comparison](comparison.png)
Both runs converge to nearly the same loss, as expected given identical demonstrations. **Training loss is not the verdict:** VN's intended
benefit is *consistent execution speed* at inference, which requires a robot/sim rollout to
evaluate. What this run establishes: both policies train cleanly to convergence on identical
data with VN frame-selection as the only difference.
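For scale, the gap between the two final losses can be computed directly from the table. This is plain arithmetic on the reported numbers; with one seed per arm it cannot be separated from run-to-run noise, and the helper names are illustrative only.

```python
# Loss values copied from the table in this card: (start, final).
LOSSES = {"VN": (5.27, 0.294), "naive": (5.19, 0.310)}


def reduction_factor(start: float, final: float) -> float:
    """How many times smaller the final loss is than the start loss."""
    return start / final


def relative_gap(lower: float, higher: float) -> float:
    """Gap between two losses as a fraction of the higher one."""
    return (higher - lower) / higher


gap = relative_gap(LOSSES["VN"][1], LOSSES["naive"][1])

for name, (start, final) in LOSSES.items():
    print(f"{name}: {reduction_factor(start, final):.1f}x reduction")
print(f"final-loss gap: {gap:.1%}")
```

The final losses differ by about 5%, while each run reduces its loss by more than an order of magnitude.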
## Files
- `model.safetensors`: ACT weights (~52 M params)
- `config.json`, `train_config.json`: policy + training config
- `policy_preprocessor*`, `policy_postprocessor*`: input/output normalization (required to run)
- `comparison.png`: VN vs no-VN training-loss curves
## Usage (LeRobot)
```python
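from lerobot.policies.act.modeling_act import ACTPolicy

# Load the policy config and weights from the Hub.
policy = ACTPolicy.from_pretrained("atharva-pantheon/act-pantheon-yam-screwdriver-vn")
policy.eval()  # inference mode (disables dropout)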
```
## Notes / provenance
- State/action are 20-D (14-D YAM joints zero-padded; `valid_action_dims=14`).
- Built via a v2.0→v3.0 adapter (the vn-pipeline `run_vn.py` targets v3.0 input).
- Engineering note: on the 128-core training box, capping `OMP_NUM_THREADS` was essential.
Uncapped, `ToTensor` cost ~236 ms/image (thread-dispatch overhead) vs 0.27 ms capped (roughly 900× faster);
the uncapped cost otherwise starved the GPU.
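A minimal sketch of the thread cap described in the engineering note. The cap value of 4 is hypothetical (the card does not state the value used), and the variable has to be set before NumPy or PyTorch is imported so the OpenMP runtime picks it up. Exporting `OMP_NUM_THREADS` in the shell, or calling `torch.set_num_threads` after import, are alternative ways to get a similar effect.

```python
import os

# Hypothetical cap; the card does not say which value was used.
THREAD_CAP = 4

# Set before importing numpy/torch. setdefault leaves an existing
# value from the shell environment untouched.
os.environ.setdefault("OMP_NUM_THREADS", str(THREAD_CAP))


def per_image_speedup(uncapped_ms: float, capped_ms: float) -> float:
    """Ratio of per-image preprocessing cost before and after capping."""
    return uncapped_ms / capped_ms


# Timings reported in this card for ToTensor on the 128-core box.
speedup = per_image_speedup(236.0, 0.27)
print(f"ToTensor speedup from capping threads: ~{speedup:.0f}x")
```

With data-loader workers each spawning their own OpenMP pool, a small per-process cap is usually the safer default on many-core machines, though the best value depends on the workload.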