All-Modality Minimal Model
Script:
scripts/train_all_modalities_model.py
This extends the first minimal model by using every major sample modality in a lightweight way.
Modalities Used
Dynamic sensor/action modalities:
- hand_mocap/left_joints_3d
- hand_mocap/right_joints_3d
- full_body_mocap/keypoints
- full_body_mocap/contacts
- slam/trans_xyz
- slam/quat_wxyz, converted by the toolkit into camera rotation matrices
- imu/accel_xyz
- imu/gyro_xyz
- depth/depth
- depth/confidence
- fisheye_cam0.mp4
- fisheye_cam1.mp4
- fisheye_cam2.mp4
- fisheye_cam3.mp4
- stereo_left.mp4
- stereo_right.mp4
- AAC audio decoded from fisheye_cam0.mp4
Static/context modalities:
- slam/point_cloud
- calibration/*
- caption objects
- caption interaction text
By default, the script does not feed action_label, Sub Task, or action-description text to the model as input, because those fields are too close to the prediction target. You can force their inclusion with --include-label-text, but treat that as a leakage/debug run, not a fair action-recognition experiment.
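The gating described above could look roughly like the sketch below. Only the --target and --include-label-text flags and the action_label and Sub Task names come from this document; the caption keys, the action_description key, the choices list, and the helper names are hypothetical and may not match the real script.

```python
import argparse

# Field names: "action_label" and "Sub Task" appear in the text above.
# "action_description", "objects", and "interaction" are hypothetical keys.
LABEL_FIELDS = ["action_label", "Sub Task", "action_description"]
CONTEXT_FIELDS = ["objects", "interaction"]


def build_parser():
    """Minimal CLI with the two flags mentioned in this document."""
    parser = argparse.ArgumentParser()
    # Assumption: "action" is the default, since the action run uses no flags.
    parser.add_argument("--target", choices=["action", "subtask"], default="action")
    parser.add_argument("--include-label-text", action="store_true")
    return parser


def text_fields(args):
    """Return the caption fields allowed as model input for this run."""
    fields = list(CONTEXT_FIELDS)
    if args.include_label_text:
        # Leakage/debug mode: label-like text becomes an input feature.
        fields += LABEL_FIELDS
    return fields
```

Keeping the label-like fields in one list makes it easy to audit which inputs could leak the target.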
Feature Design
The model is still intentionally small:
raw modality -> per-frame or static handcrafted features -> window temporal statistics -> softmax classifier
For each 20-frame window:
- Motion signals use mean/std/min/max/delta/velocity statistics.
- Depth uses global depth stats plus a small normalized depth grid and confidence grid.
- Each video stream uses color stats, color histograms, a small grayscale grid, and simple edge stats.
- Audio uses per-frame waveform/spectral statistics and log-spaced spectral band energies.
- Text uses a hashed bag-of-words vector from objects and interaction text.
- Point cloud and calibration are included as static episode-level features.
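As a rough illustration of the window statistics for motion signals, the sketch below summarizes each channel of a per-frame signal over 20-frame windows. The function names, the non-overlapping windowing, and the exact definitions of delta and velocity are assumptions; the script's own statistics are not spelled out here.

```python
from statistics import fmean, pstdev

WINDOW = 20  # frames per window, as stated above


def window_stats(values):
    """Summarize one channel over one window as six numbers."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return [
        fmean(values),                  # mean
        pstdev(values),                 # population standard deviation
        min(values),
        max(values),
        values[-1] - values[0],         # delta: last frame minus first (assumed)
        fmean([abs(d) for d in diffs]) if diffs else 0.0,  # mean |frame-to-frame change| (assumed)
    ]


def featurize(frames, window=WINDOW):
    """Turn a list of frames (each a list of channel values) into window features.

    Windows are non-overlapping here (an assumption); a trailing partial
    window is dropped.
    """
    features = []
    for start in range(0, len(frames) - window + 1, window):
        chunk = frames[start:start + window]
        row = []
        for channel in range(len(chunk[0])):
            row.extend(window_stats([frame[channel] for frame in chunk]))
        features.append(row)
    return features
```

With three channels this yields 18 values per window. The camera_translation block listed below is 21 wide, which suggests the script computes a slightly different set of statistics than these six.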
Current feature blocks:
hand_left_joints: 441
hand_right_joints: 441
body_joints: 1092
body_contacts: 147
camera_translation: 21
camera_rotation_matrix: 63
imu_accel_gyro: 42
depth_confidence: 980
video_fisheye_cam0: 686
video_fisheye_cam1: 686
video_fisheye_cam2: 686
video_fisheye_cam3: 686
video_stereo_left: 686
video_stereo_right: 686
audio_fisheye_cam0_aac: 168
caption_objects_interaction_text: 896
slam_point_cloud: 22
calibration: 117
total: 8546
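The block widths above can be checked against the stated total, and turned into column ranges for inspecting one modality at a time. This is a small sketch; the assumption that blocks are concatenated in the listed order is mine and may not match the script.

```python
# Block widths copied from the list above.
FEATURE_BLOCKS = {
    "hand_left_joints": 441,
    "hand_right_joints": 441,
    "body_joints": 1092,
    "body_contacts": 147,
    "camera_translation": 21,
    "camera_rotation_matrix": 63,
    "imu_accel_gyro": 42,
    "depth_confidence": 980,
    "video_fisheye_cam0": 686,
    "video_fisheye_cam1": 686,
    "video_fisheye_cam2": 686,
    "video_fisheye_cam3": 686,
    "video_stereo_left": 686,
    "video_stereo_right": 686,
    "audio_fisheye_cam0_aac": 168,
    "caption_objects_interaction_text": 896,
    "slam_point_cloud": 22,
    "calibration": 117,
}

total = sum(FEATURE_BLOCKS.values())

# Map each block to a (start, stop) column range, assuming listed order.
offsets = {}
start = 0
for name, width in FEATURE_BLOCKS.items():
    offsets[name] = (start, start + width)
    start += width

print(total)  # 8546
```

Ranges like these make single-modality ablations straightforward: zero out or drop one slice and retrain.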
Run Commands
Action prediction:
cd /path/to/Ropedia
source .venv/bin/activate
python scripts/train_all_modalities_model.py
Subtask prediction:
python scripts/train_all_modalities_model.py --target subtask
The first run builds reusable caches in:
outputs/feature_cache/
Current Results
Action-label model:
outputs/min_all_modalities_action_model/
accuracy: 0.9828
balanced_accuracy: 0.9856
macro_f1: 0.9829
weighted_f1: 0.9863
majority_baseline: 0.1375
classes: 18
feature_dim: 8546
test_windows: 291
Subtask-label model:
outputs/min_all_modalities_subtask_model/
accuracy: 0.9828
balanced_accuracy: 0.9505
macro_f1: 0.9173
weighted_f1: 0.9841
majority_baseline: 0.1448
classes: 14
feature_dim: 8546
test_windows: 290
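For reference, the reported metrics can be computed from true and predicted labels as sketched below. This assumes the usual definitions (balanced accuracy as mean per-class recall, macro-F1 as the unweighted mean of per-class F1, and the majority baseline as the share of the most common test class); the script's own implementation is not shown here, and the function name is hypothetical.

```python
from collections import Counter


def summarize_predictions(y_true, y_pred):
    """Compute accuracy-style metrics from two equal-length label lists."""
    n = len(y_true)
    support = Counter(y_true)      # true count per class
    predicted = Counter(y_pred)    # predicted count per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    recalls, f1s, weighted = [], [], 0.0
    for cls in sorted(support):
        recall = hits[cls] / support[cls]
        precision = hits[cls] / predicted[cls] if predicted[cls] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        f1s.append(f1)
        weighted += f1 * support[cls]

    return {
        "accuracy": sum(hits.values()) / n,
        "balanced_accuracy": sum(recalls) / len(recalls),
        "macro_f1": sum(f1s) / len(f1s),
        "weighted_f1": weighted / n,
        "majority_baseline": max(support.values()) / n,
    }
```

Macro averages weight every class equally, so a few mistakes in a rare class lower balanced accuracy and macro-F1 much more than plain accuracy. That is one possible reading of the gap between accuracy and macro_f1 in the subtask results.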
How To Interpret This
This shows that the full sample can be run through a complete supervised learning pipeline on this Mac.
It does not show real generalization, because the public sample is a single episode and the train/test split assigns random windows from that same episode to both sides. Neighboring windows are strongly correlated, so most test windows have near-duplicates in the training set and the scores above are optimistic.
For a serious embodied-AI experiment:
many episodes
-> cache features per episode
-> split by episode or task instance
-> train on some episodes
-> test on unseen episodes
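A minimal sketch of the episode-level split in the flow above, assuming each cached window carries an episode identifier. The episode_id key and the function name are hypothetical and not taken from the script.

```python
import random


def split_by_episode(windows, test_fraction=0.2, seed=0):
    """Split windows so that no episode appears in both train and test.

    `windows` is a list of dicts with a hypothetical "episode_id" key.
    """
    episodes = sorted({w["episode_id"] for w in windows})
    if len(episodes) < 2:
        raise ValueError("need at least two episodes for an episode-level split")

    rng = random.Random(seed)  # seeded for a reproducible split
    rng.shuffle(episodes)

    # Hold out at least one episode, and keep at least one for training.
    n_test = max(1, round(len(episodes) * test_fraction))
    n_test = min(n_test, len(episodes) - 1)
    test_episodes = set(episodes[:n_test])

    train = [w for w in windows if w["episode_id"] not in test_episodes]
    test = [w for w in windows if w["episode_id"] in test_episodes]
    return train, test
```

With the current single-episode sample this raises an error by design, which reflects why the scores above cannot be read as generalization results.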
The next useful upgrade is not a bigger classifier. It is a better split and more episodes.