	# ACT — Pantheon YAM 'swap screwdriver head' — Velocity-Normalized

	ACT (Action Chunking with Transformers) policy for a bimanual YAM manipulation task, trained as
	one arm of a Velocity-Normalization (VN) ablation. This checkpoint is the VN arm.

	Trained on velocity-normalized demonstrations (VN re-times each demo to a consistent speed profile: idle removed, fast motion slowed, cruising aligned).

	> Companion model: [atharva-pantheon/act-pantheon-yam-screwdriver-naive](https://huggingface.co/atharva-pantheon/act-pantheon-yam-screwdriver-naive) — the other arm
	> of the ablation (same data, same config, uniform 30→10 Hz downsample).

	---

	## Research summary

	Question. Does Velocity-Normalization (VN) preprocessing — re-timing teleop demos so a
	policy sees a consistent end-effector speed distribution — change what an ACT policy learns,
	holding the underlying demonstrations fixed?

	Setup. A controlled ablation: two ACT policies, identical architecture, hyperparameters,
	and seed; the only difference is how source frames are selected when building the 10 Hz
	training set.

	| | this model (VN) | companion |
	|---|---|---|
	| frame selection | velocity-normalized 30→10 Hz | uniform 30→10 Hz downsample |

	### Data
	- Task: "swap the tool head on the screwdriver" (task_index 16) — 379 episodes, ~5.2 h,
	the task with the most episodes. Bimanual YAM, 14-D joints (zero-padded to 20), 3 cameras
	(top / wrist-L / wrist-R), 30 fps.
	- Both training sets built at 224×224, 10 Hz, same writer, same action labeling. VN frame
	selection is the sole variable.
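The 14-D to 20-D zero-padding mentioned above can be illustrated with a minimal sketch. It assumes the six padding entries are appended after the 14 joint values; the card does not state where the zeros go, and `pad_joints` / `unpad_action` are hypothetical helper names, not part of the dataset writer.

```python
import numpy as np

VALID_DIMS = 14    # real YAM joint dimensions (from this card)
PADDED_DIMS = 20   # padded state/action width (from this card)

def pad_joints(x):
    """Zero-pad the last axis from 14 to 20 (assumption: trailing zeros)."""
    x = np.asarray(x, dtype=np.float32)
    pad = [(0, 0)] * (x.ndim - 1) + [(0, PADDED_DIMS - x.shape[-1])]
    return np.pad(x, pad)  # default mode pads with constant zeros

def unpad_action(a):
    """Keep only the valid joint dimensions of a padded action."""
    return np.asarray(a)[..., :VALID_DIMS]

joints = np.arange(14, dtype=np.float32)
padded = pad_joints(joints)
restored = unpad_action(padded)
```

Under that assumption, a predicted 20-D action would be sliced back to its first 14 entries before being sent to the arms.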

	### Velocity Normalization (VN)
	Implementation: <https://github.com/vovw/vn-pipeline>. End-effector speed via forward
	kinematics (pinocchio, YAM URDF `link_6`), bimanual speed = max over arms. Two stages:
	- Stage 1 (inter-episode): align each episode's cruising speed (30th percentile of speed) toward the median across episodes, with the scaling factor clamped to 0.75–1.5×.
	- Stage 2 (intra-episode): a smooth monotonic speed map H(s) that slows the fast tail; gripper-event windows and trailing idle preserved.
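To make the two stages concrete, here is a rough sketch of velocity-based frame selection. It is not the vn-pipeline code. The specific form of `speed_map` (identity up to M, log-compressed above it), the constant `k`, the direction of the Stage 1 ratio, and all function names are assumptions made for illustration. Idle removal and the gripper-event windows are not modeled, and end-effector positions are taken as given rather than computed with pinocchio.

```python
import numpy as np

SRC_HZ, OUT_HZ = 30.0, 10.0   # source and training frame rates from this card
M_BREAK = 0.162               # m/s, upper breakpoint M reported for this task

def bimanual_speed(ee_left, ee_right, hz=SRC_HZ):
    """Per-frame end-effector speed (m/s), taking the faster arm at each frame."""
    def arm_speed(p):
        step = np.linalg.norm(np.diff(np.asarray(p, dtype=float), axis=0), axis=1)
        return np.append(step, step[-1]) * hz   # repeat last value to keep length T
    return np.maximum(arm_speed(ee_left), arm_speed(ee_right))

def stage1_factor(cruise_speed, median_cruise, lo=0.75, hi=1.5):
    """Stage 1 (assumed form): playback-speed factor toward the median, clamped."""
    return float(np.clip(median_cruise / cruise_speed, lo, hi))

def speed_map(s, M=M_BREAK, k=0.05):
    """Hypothetical H(s): identity up to M, log-compressed above (smooth, monotonic)."""
    s = np.asarray(s, dtype=float)
    return np.minimum(s, M) + k * np.log1p(np.maximum(s - M, 0.0) / k)

def select_frames(speed, factor=1.0):
    """Indices of 30 Hz source frames to keep for one retimed 10 Hz episode."""
    speed = np.asarray(speed, dtype=float)
    target = speed_map(factor * speed)            # speed after both stages
    stretch = np.full_like(speed, 1.0 / factor)   # stationary frames: Stage 1 only
    moving = target > 0
    stretch[moving] = speed[moving] / target[moving]  # same distance, new speed
    tau = np.concatenate([[0.0], np.cumsum(stretch[:-1])]) / SRC_HZ  # retimed clock
    n_out = int(np.floor(tau[-1] * OUT_HZ + 1e-9)) + 1
    t_out = np.arange(n_out) / OUT_HZ
    # Last source frame at or before each 10 Hz tick of the retimed clock.
    return np.searchsorted(tau, t_out + 1e-9, side="right") - 1

slow = select_frames(np.full(90, 0.10))   # below M: reduces to every third frame
fast = select_frames(np.full(90, 0.50))   # above M: slowed, so more frames are kept
```

In this sketch, a segment that stays below M is sampled exactly like the uniform baseline, while a fast segment is stretched in time and therefore contributes more 10 Hz frames.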

	Applied to this task: breakpoints m=0.024, M=0.162 m/s; 566,204 source frames → 211,501 VN frames
	(1.12× duration ratio; 1.1% idle dropped; Stage-1 factor median 1.0×, range 0.75–1.5). The naive
	baseline is a plain uniform 3× downsample (≈189k frames, idle kept).
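The reported figures can be cross-checked with a few lines of arithmetic. This sketch assumes the duration ratio is the VN frame count divided by the uniform 3× count, and it ignores per-episode rounding across the 379 episodes, so the naive count is approximate.

```python
SOURCE_FRAMES = 566_204   # 30 fps source frames (from this card)
VN_FRAMES = 211_501       # frames after VN retiming (from this card)
SRC_HZ, OUT_HZ = 30, 10

source_hours = SOURCE_FRAMES / SRC_HZ / 3600
naive_frames = SOURCE_FRAMES / (SRC_HZ // OUT_HZ)   # plain 3x downsample
duration_ratio = VN_FRAMES / naive_frames

print(f"{source_hours:.2f} h, {naive_frames:,.0f} naive frames, ratio {duration_ratio:.2f}")
# prints: 5.24 h, 188,735 naive frames, ratio 1.12
```

The values agree with the ~5.2 h, ≈189k frames, and 1.12× figures quoted above.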

	### Training
	ACT, ResNet18 (ImageNet) vision backbone, `chunk_size=30`, `n_action_steps=30`, `n_obs_steps=1`,
	224×224, batch 64, lr 1e-5, 10,000 steps (≈6.7 epochs), seed 1000. A100 80 GB, both runs concurrent.
	Logged to W&B project `vn-act-screwdriver`.

	### Result
	| run | start loss | final loss (L1+KL) |
	|---|---|---|
	| VN | 5.27 | 0.294 |
	| naive (no-VN) | 5.19 | 0.310 |

	![loss comparison](comparison.png)

	Both converge tightly (same demonstrations). Training loss is not the verdict — VN's intended
	benefit is consistent execution speed at inference, which requires a robot/sim rollout to
	evaluate. What this run establishes: both policies train cleanly to convergence on identical
	data with VN frame-selection as the only difference.

	## Files
	- `model.safetensors` — ACT weights (~52 M params)
	- `config.json`, `train_config.json` — policy + training config
	- `policy_preprocessor/policy_postprocessor` — input/output normalization (required to run)
	- `comparison.png` — VN vs no-VN training-loss curves

	## Usage (LeRobot)
	```python
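	from lerobot.policies.act.modeling_act import ACTPolicy
	# Fetches the checkpoint from the Hugging Face Hub on first use, then loads the ACT weights.
	policy = ACTPolicy.from_pretrained("atharva-pantheon/act-pantheon-yam-screwdriver-vn")
	policy.eval()  # inference mode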
	```

	## Notes / provenance
	- State/action are 20-D (14-D YAM joints zero-padded; `valid_action_dims=14`).
	- Built via a v2.0→v3.0 adapter (the vn-pipeline `run_vn.py` targets v3.0 input).
	- Engineering note: on the 128-core training box, capping `OMP_NUM_THREADS` was essential.
	Uncapped, `ToTensor` cost ~236 ms/image (thread-dispatch overhead) versus 0.27 ms capped
	(roughly 870× faster); without the cap, data loading starved the GPU.
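One common way to apply such a cap is to set the environment variable at the very top of the training entry point, before any numerical library is imported. The value `4` below is purely illustrative; the card does not say which cap was used.

```python
import os

# Must run before numpy / torch / torchvision are imported, because OpenMP
# reads the variable when its runtime is initialized.
os.environ.setdefault("OMP_NUM_THREADS", "4")   # 4 is an assumed, illustrative cap
os.environ.setdefault("MKL_NUM_THREADS", os.environ["OMP_NUM_THREADS"])

omp_threads = int(os.environ["OMP_NUM_THREADS"])
```

Using `setdefault` leaves any value already exported in the shell untouched, so the cap can still be overridden per run.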