All-Modality Minimal Model

Script:

scripts/train_all_modalities_model.py

This script extends the first minimal model to use every major modality in the sample, each through lightweight handcrafted features.

Modalities Used

Dynamic sensor/action modalities:

hand_mocap/left_joints_3d
hand_mocap/right_joints_3d
full_body_mocap/keypoints
full_body_mocap/contacts
slam/trans_xyz
slam/quat_wxyz (converted by the toolkit into camera rotation matrices)
imu/accel_xyz
imu/gyro_xyz
depth/depth
depth/confidence
fisheye_cam0.mp4
fisheye_cam1.mp4
fisheye_cam2.mp4
fisheye_cam3.mp4
stereo_left.mp4
stereo_right.mp4
AAC audio decoded from fisheye_cam0.mp4
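
For reference, the quaternion-to-rotation-matrix step listed for slam/quat_wxyz can be sketched as follows. This is the standard unit-quaternion formula, not the toolkit's own code; the function name quat_wxyz_to_rotmat is hypothetical, and the (w, x, y, z) component order is taken from the dataset key.

```python
import numpy as np

def quat_wxyz_to_rotmat(q):
    """Convert one quaternion in (w, x, y, z) order to a 3x3 rotation matrix."""
    q = np.asarray(q, dtype=np.float64)
    q = q / np.linalg.norm(q)  # guard against slightly non-unit input
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

# A 90-degree rotation about the z axis maps the x axis onto the y axis.
half = np.sqrt(0.5)
R = quat_wxyz_to_rotmat([half, 0.0, 0.0, half])
rotated_x = R @ np.array([1.0, 0.0, 0.0])
```

Flattening each matrix gives 9 values per frame, which is the form a per-frame feature extractor would consume.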

Static/context modalities:

slam/point_cloud
calibration/*
caption objects
caption interaction text

By default, the script does not feed action_label, Sub Task, or action-description text into the model, because those fields are too close to the prediction target. Passing --include-label-text adds them anyway, but such a run should be treated as a leakage/debug run, not a fair action-recognition experiment.
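
A minimal sketch of how such a leakage guard might look. Only the --include-label-text flag comes from the script; the record keys, the LABEL_FIELDS and CONTEXT_FIELDS tuples, and the select_text_fields helper are hypothetical and the real caption schema may differ.

```python
import argparse

# Hypothetical field names for one caption record (assumptions, not the real schema).
LABEL_FIELDS = ("action_label", "Sub Task", "action_description")
CONTEXT_FIELDS = ("objects", "interaction")

def select_text_fields(record, include_label_text=False):
    """Join the text fields that are allowed to reach the text featurizer."""
    fields = CONTEXT_FIELDS
    if include_label_text:
        fields = fields + LABEL_FIELDS  # leakage/debug mode only
    return " ".join(str(record[name]) for name in fields if name in record)

parser = argparse.ArgumentParser()
parser.add_argument("--include-label-text", action="store_true",
                    help="also feed label-like text to the model (leaks the target)")
args = parser.parse_args([])  # empty argv here; a real script would call parse_args()

record = {
    "objects": "cup kettle",
    "interaction": "right hand holds kettle",
    "action_label": "pour water",
}
safe_text = select_text_fields(record, args.include_label_text)
leaky_text = select_text_fields(record, include_label_text=True)
```

Keeping the label-like fields in one named tuple makes it easy to audit exactly what the flag lets through.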

Feature Design

The model is still intentionally small:

raw modality -> per-frame or static handcrafted features -> window temporal statistics -> softmax classifier
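
The last stage of that pipeline, a softmax classifier over fixed-length window features, can be sketched in plain NumPy. This is an illustrative implementation under assumed settings (full-batch gradient descent, small L2 penalty); the script's actual optimizer, regularization, and feature scaling are not stated here.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def train_softmax(X, y, num_classes, lr=0.5, epochs=200, l2=1e-3):
    """Fit multinomial logistic regression with full-batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot targets
    for _ in range(epochs):
        P = softmax(X @ W + b)
        grad = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad + l2 * W)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(X, W, b):
    return np.argmax(X @ W + b, axis=1)

# Toy data standing in for window feature vectors.
X_toy = np.array([[-2.0, 0.0], [-1.5, 0.0], [2.0, 0.0], [1.5, 0.0]])
y_toy = np.array([0, 0, 1, 1])
W, b = train_softmax(X_toy, y_toy, num_classes=2)
pred_toy = predict(X_toy, W, b)
```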

For each 20-frame window:

Motion signals use mean/std/min/max/delta/velocity statistics.
Depth uses global depth stats plus a small normalized depth grid and confidence grid.
Each video stream uses color stats, color histograms, a small grayscale grid, and simple edge stats.
Audio uses per-frame waveform/spectral statistics and log-spaced spectral band energies.
Text uses a hashed bag-of-words vector from objects and interaction text.
Point cloud and calibration are included as static episode-level features.
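
The motion-signal statistics above can be sketched as one function per window. The block sizes listed in this document are multiples of 7 (for example 21 for a 3-D translation), which suggests seven statistics per channel; the exact set used here (in particular splitting velocity into a mean speed and a spread) is an assumption, and window_stats is a hypothetical name.

```python
import numpy as np

WINDOW = 20  # frames per window, as stated in the text

def window_stats(frames):
    """Map a (WINDOW, channels) array to a (7 * channels,) feature vector."""
    frames = np.asarray(frames, dtype=np.float64)
    velocity = np.diff(frames, axis=0)  # frame-to-frame differences
    stats = [
        frames.mean(axis=0),
        frames.std(axis=0),
        frames.min(axis=0),
        frames.max(axis=0),
        frames[-1] - frames[0],         # delta across the window
        np.abs(velocity).mean(axis=0),  # mean speed (assumed statistic)
        velocity.std(axis=0),           # velocity spread (assumed statistic)
    ]
    return np.concatenate(stats)

# Toy 3-channel signal that increases by 3 every frame.
toy = np.arange(WINDOW * 3, dtype=np.float64).reshape(WINDOW, 3)
feats = window_stats(toy)
```

With 3 input channels the output has 21 values, matching the size of the camera_translation block.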

Current feature blocks:

hand_left_joints:                  441
hand_right_joints:                 441
body_joints:                      1092
body_contacts:                     147
camera_translation:                 21
camera_rotation_matrix:             63
imu_accel_gyro:                     42
depth_confidence:                  980
video_fisheye_cam0:                686
video_fisheye_cam1:                686
video_fisheye_cam2:                686
video_fisheye_cam3:                686
video_stereo_left:                 686
video_stereo_right:                686
audio_fisheye_cam0_aac:            168
caption_objects_interaction_text:  896
slam_point_cloud:                   22
calibration:                       117
total:                            8546
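
As a quick consistency check, the block sizes above can be summed and turned into column ranges. The sizes are copied from the table; the block_slices name and the assumption that blocks are concatenated in the listed order are illustrative, not confirmed by the script.

```python
# Block sizes copied from the table above.
FEATURE_BLOCKS = {
    "hand_left_joints": 441,
    "hand_right_joints": 441,
    "body_joints": 1092,
    "body_contacts": 147,
    "camera_translation": 21,
    "camera_rotation_matrix": 63,
    "imu_accel_gyro": 42,
    "depth_confidence": 980,
    "video_fisheye_cam0": 686,
    "video_fisheye_cam1": 686,
    "video_fisheye_cam2": 686,
    "video_fisheye_cam3": 686,
    "video_stereo_left": 686,
    "video_stereo_right": 686,
    "audio_fisheye_cam0_aac": 168,
    "caption_objects_interaction_text": 896,
    "slam_point_cloud": 22,
    "calibration": 117,
}

total = sum(FEATURE_BLOCKS.values())

# Column range of each block, assuming concatenation in the listed order.
block_slices = {}
start = 0
for name, width in FEATURE_BLOCKS.items():
    block_slices[name] = (start, start + width)
    start += width

print(total)
print(block_slices["calibration"])
```

Ranges like these make single-modality ablations straightforward: slice the feature matrix to one block and retrain.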

Run Commands

Action prediction:

cd /path/to/Ropedia
source .venv/bin/activate
python scripts/train_all_modalities_model.py

Subtask prediction:

python scripts/train_all_modalities_model.py --target subtask

The first run builds reusable caches in:

outputs/feature_cache/
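
A compute-once cache of this kind can be sketched as below. The on-disk layout (one .npy file per key) and the cached_features helper are assumptions for illustration; the script's real cache format may differ.

```python
from pathlib import Path
import numpy as np

def cached_features(cache_dir, key, compute_fn):
    """Return features for `key`, computing and saving them only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{key}.npy"
    if path.exists():
        return np.load(path)  # cache hit: skip the expensive extraction
    features = np.asarray(compute_fn())
    np.save(path, features)
    return features
```

A call such as cached_features("outputs/feature_cache", "episode0_imu", extract_imu) would run the hypothetical extract_imu only on the first run. A cache like this goes stale if the feature code changes, so deleting the directory after editing extractors is the safe default.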

Current Results

Action-label model:

outputs/min_all_modalities_action_model/
accuracy:          0.9828
balanced_accuracy: 0.9856
macro_f1:          0.9829
weighted_f1:       0.9863
majority_baseline: 0.1375
classes:           18
feature_dim:       8546
test_windows:      291

Subtask-label model:

outputs/min_all_modalities_subtask_model/
accuracy:          0.9828
balanced_accuracy: 0.9505
macro_f1:          0.9173
weighted_f1:       0.9841
majority_baseline: 0.1448
classes:           14
feature_dim:       8546
test_windows:      290
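
For readers unfamiliar with these metrics, here is a small NumPy sketch of how they are commonly defined. It is not the script's evaluation code; in particular, majority_baseline is computed here as the share of the most frequent class among the evaluated labels, which is an assumption. Note that an accuracy of 0.9828 corresponds to 286 of 291 correct windows for the action model and 285 of 290 for the subtask model.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy, macro/weighted F1, and a majority baseline."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    recalls, f1s, counts = [], [], []
    for c in classes:
        tp = np.sum((y_true == c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        fp = np.sum((y_true != c) & (y_pred == c))
        recalls.append(tp / (tp + fn))  # tp + fn > 0 because c occurs in y_true
        f1s.append(2 * tp / (2 * tp + fp + fn))
        counts.append(tp + fn)
    counts = np.array(counts)
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "balanced_accuracy": float(np.mean(recalls)),  # mean per-class recall
        "macro_f1": float(np.mean(f1s)),
        "weighted_f1": float(np.average(f1s, weights=counts)),
        "majority_baseline": float(counts.max() / counts.sum()),
    }

metrics = classification_metrics([0, 0, 0, 1], [0, 0, 1, 1])
```

Balanced accuracy and macro F1 weight every class equally, which is why they can diverge from plain accuracy when a few rare classes are misclassified.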

How To Interpret This

This proves that a complete supervised learning pipeline can be built on the full sample and run end to end on this Mac.

It does not prove real generalization, because the public sample is a single episode and the train/test split draws random windows from that same episode. Neighboring windows are correlated, so the scores above are likely optimistic.

For a serious embodied-AI experiment:

many episodes
-> cache features per episode
-> split by episode or task instance
-> train on some episodes
-> test on unseen episodes
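
The split-by-episode step can be sketched as follows, assuming each window carries an episode identifier. The split_by_episode name, the test_fraction default, and the episode IDs are hypothetical.

```python
import numpy as np

def split_by_episode(episode_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no episode appears on both sides."""
    episode_ids = np.asarray(episode_ids)
    unique = np.unique(episode_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique)
    n_test = max(1, int(round(len(unique) * test_fraction)))
    test_episodes = shuffled[:n_test]
    test_mask = np.isin(episode_ids, test_episodes)
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Five hypothetical episodes with three windows each.
ids = np.array(["a"] * 3 + ["b"] * 3 + ["c"] * 3 + ["d"] * 3 + ["e"] * 3)
train_idx, test_idx = split_by_episode(ids)
```

With only one episode, as in the public sample, this split leaves the training set empty, which is exactly why more episodes are needed before the numbers mean much.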

The next useful upgrade is not a bigger classifier. It is a better split and more episodes.