Publish Ropedia Xperience-10M task baseline cards

a8124a8 verified 26 days ago

4.08 kB

	# All-Modality Minimal Model

	Script:

	```text
	scripts/train_all_modalities_model.py
	```

	This extends the first minimal model by using every major sample modality in a lightweight way.

	## Modalities Used

	Dynamic sensor/action modalities:

	- `hand_mocap/left_joints_3d`
	- `hand_mocap/right_joints_3d`
	- `full_body_mocap/keypoints`
	- `full_body_mocap/contacts`
	- `slam/trans_xyz`
	- `slam/quat_wxyz` converted by the toolkit into camera rotation matrices
	- `imu/accel_xyz`
	- `imu/gyro_xyz`
	- `depth/depth`
	- `depth/confidence`
	- `fisheye_cam0.mp4`
	- `fisheye_cam1.mp4`
	- `fisheye_cam2.mp4`
	- `fisheye_cam3.mp4`
	- `stereo_left.mp4`
	- `stereo_right.mp4`
	- AAC audio decoded from `fisheye_cam0.mp4`

	Static/context modalities:

	- `slam/point_cloud`
	- `calibration/*`
	- caption objects
	- caption interaction text

	By default, the script does not include `action_label`, `Sub Task`, or action-description text as input, because those are too close to the prediction target. You can force that with `--include-label-text`, but that should be treated as a leakage/debug run, not a fair action-recognition experiment.

	## Feature Design

	The model is still intentionally small:

	```text
	raw modality -> per-frame or static handcrafted features -> window temporal statistics -> softmax classifier
	```

	For each 20-frame window:

	- Motion signals use mean/std/min/max/delta/velocity statistics.
	- Depth uses global depth stats plus a small normalized depth grid and confidence grid.
	- Each video stream uses color stats, color histograms, a small grayscale grid, and simple edge stats.
	- Audio uses per-frame waveform/spectral statistics and log-spaced spectral band energies.
	- Text uses a hashed bag-of-words vector from objects and interaction text.
	- Point cloud and calibration are included as static episode-level features.

	Current feature blocks:

	```text
	hand_left_joints: 441
	hand_right_joints: 441
	body_joints: 1092
	body_contacts: 147
	camera_translation: 21
	camera_rotation_matrix: 63
	imu_accel_gyro: 42
	depth_confidence: 980
	video_fisheye_cam0: 686
	video_fisheye_cam1: 686
	video_fisheye_cam2: 686
	video_fisheye_cam3: 686
	video_stereo_left: 686
	video_stereo_right: 686
	audio_fisheye_cam0_aac: 168
	caption_objects_interaction_text: 896
	slam_point_cloud: 22
	calibration: 117
	total: 8546
	```

	## Run Commands

	Action prediction:

	```bash
	cd /path/to/Ropedia
	source .venv/bin/activate
	python scripts/train_all_modalities_model.py
	```

	Subtask prediction:

	```bash
	python scripts/train_all_modalities_model.py --target subtask
	```

	The first run builds reusable caches in:

	```text
	outputs/feature_cache/
	```

	## Current Results

	Action-label model:

	```text
	outputs/min_all_modalities_action_model/
	accuracy: 0.9828
	balanced_accuracy: 0.9856
	macro_f1: 0.9829
	weighted_f1: 0.9863
	majority_baseline: 0.1375
	classes: 18
	feature_dim: 8546
	test_windows: 291
	```

	Subtask-label model:

	```text
	outputs/min_all_modalities_subtask_model/
	accuracy: 0.9828
	balanced_accuracy: 0.9505
	macro_f1: 0.9173
	weighted_f1: 0.9841
	majority_baseline: 0.1448
	classes: 14
	feature_dim: 8546
	test_windows: 290
	```

	## How To Interpret This

	This proves that the full sample can be converted into a complete supervised learning pipeline on this Mac.

	It does not prove real generalization, because the public sample is one episode and the split is random windows from that same episode. Neighboring windows are correlated.

	For a serious embodied-AI experiment:

	```text
	many episodes
	-> cache features per episode
	-> split by episode or task instance
	-> train on some episodes
	-> test on unseen episodes
	```

	The next useful upgrade is not a bigger classifier. It is a better split and more episodes.