# All-Modality Minimal Model

Script:

```text
scripts/train_all_modalities_model.py
```

This extends the first minimal model by using every major sample modality in a lightweight way.

## Modalities Used

Dynamic sensor/action modalities:

- `hand_mocap/left_joints_3d`
- `hand_mocap/right_joints_3d`
- `full_body_mocap/keypoints`
- `full_body_mocap/contacts`
- `slam/trans_xyz`
- `slam/quat_wxyz` converted by the toolkit into camera rotation matrices
- `imu/accel_xyz`
- `imu/gyro_xyz`
- `depth/depth`
- `depth/confidence`
- `fisheye_cam0.mp4`
- `fisheye_cam1.mp4`
- `fisheye_cam2.mp4`
- `fisheye_cam3.mp4`
- `stereo_left.mp4`
- `stereo_right.mp4`
- AAC audio decoded from `fisheye_cam0.mp4`

Static/context modalities:

- `slam/point_cloud`
- `calibration/*`
- caption objects
- caption interaction text

By default, the script does **not** include `action_label`, `Sub Task`, or action-description text as input, because those are too close to the prediction target. You can include them with `--include-label-text`, but that should be treated as a leakage/debug run, not a fair action-recognition experiment.

## Feature Design

The model is still intentionally small:

```text
raw modality
  -> per-frame or static handcrafted features
  -> window temporal statistics
  -> softmax classifier
```

For each 20-frame window:

- Motion signals use mean/std/min/max/delta/velocity statistics.
- Depth uses global depth stats plus a small normalized depth grid and confidence grid.
- Each video stream uses color stats, color histograms, a small grayscale grid, and simple edge stats.
- Audio uses per-frame waveform/spectral statistics and log-spaced spectral band energies.
- Text uses a hashed bag-of-words vector from objects and interaction text.
- Point cloud and calibration are included as static episode-level features.
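
The window-statistics step for motion signals can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the script's actual code: the function name `window_stats`, the exact definitions of "delta" (last frame minus first frame) and "velocity" (mean absolute frame-to-frame change), and the output ordering are all assumptions, so it is not meant to reproduce the script's exact block sizes.

```python
from statistics import fmean, pstdev


def window_stats(window):
    """Summarize one window of per-frame features (hypothetical helper).

    window: list of T frames, each a list of D floats (T >= 2).
    Returns a flat list of 6 * D values, grouped per input dimension as
    [mean, std, min, max, delta, velocity].
    """
    if len(window) < 2:
        raise ValueError("a window needs at least two frames")
    features = []
    for column in zip(*window):  # one column per feature dimension
        steps = [b - a for a, b in zip(column, column[1:])]
        features.extend([
            fmean(column),
            pstdev(column),                  # population std over the window
            min(column),
            max(column),
            column[-1] - column[0],          # delta: net change (assumed definition)
            fmean([abs(s) for s in steps]),  # velocity: mean |step| (assumed definition)
        ])
    return features


# Toy window: 3 frames, 2 feature dimensions.
toy_window = [[0.0, 1.0], [1.0, 1.0], [3.0, 1.0]]
toy_features = window_stats(toy_window)
print(len(toy_features))  # 12
```

In the real pipeline the same kind of summary would be applied to each 20-frame window of every dynamic modality, and the resulting vectors concatenated with the static features before classification.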

Current feature blocks:

```text
hand_left_joints: 441
hand_right_joints: 441
body_joints: 1092
body_contacts: 147
camera_translation: 21
camera_rotation_matrix: 63
imu_accel_gyro: 42
depth_confidence: 980
video_fisheye_cam0: 686
video_fisheye_cam1: 686
video_fisheye_cam2: 686
video_fisheye_cam3: 686
video_stereo_left: 686
video_stereo_right: 686
audio_fisheye_cam0_aac: 168
caption_objects_interaction_text: 896
slam_point_cloud: 22
calibration: 117
total: 8546
```

## Run Commands

Action prediction:

```bash
cd /path/to/Ropedia
source .venv/bin/activate
python scripts/train_all_modalities_model.py
```

Subtask prediction:

```bash
python scripts/train_all_modalities_model.py --target subtask
```

The first run builds reusable caches in:

```text
outputs/feature_cache/
```

## Current Results

Action-label model:

```text
outputs/min_all_modalities_action_model/
accuracy: 0.9828
balanced_accuracy: 0.9856
macro_f1: 0.9829
weighted_f1: 0.9863
majority_baseline: 0.1375
classes: 18
feature_dim: 8546
test_windows: 291
```

Subtask-label model:

```text
outputs/min_all_modalities_subtask_model/
accuracy: 0.9828
balanced_accuracy: 0.9505
macro_f1: 0.9173
weighted_f1: 0.9841
majority_baseline: 0.1448
classes: 14
feature_dim: 8546
test_windows: 290
```

## How To Interpret This

This proves that the full sample can be converted into a complete supervised learning pipeline on this Mac.

It does **not** prove real generalization, because the public sample is one episode and the split is random windows from that same episode. Neighboring windows are correlated.

For a serious embodied-AI experiment:

```text
many episodes
  -> cache features per episode
  -> split by episode or task instance
  -> train on some episodes
  -> test on unseen episodes
```

The next useful upgrade is not a bigger classifier. It is a better split and more episodes.
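
An episode-level split along those lines could look like the sketch below. It assumes each window carries an episode identifier; the function name `split_by_episode`, the `test_fraction` and `seed` parameters, and the toy episode IDs are hypothetical and not part of the current script.

```python
import random
from collections import defaultdict


def split_by_episode(episode_ids, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so that no episode appears in both sets.

    episode_ids: one episode identifier per window (hypothetical input).
    """
    by_episode = defaultdict(list)
    for idx, episode in enumerate(episode_ids):
        by_episode[episode].append(idx)

    episodes = sorted(by_episode)
    if len(episodes) < 2:
        raise ValueError("an episode-level split needs at least two episodes")

    # Shuffle whole episodes, not individual windows.
    random.Random(seed).shuffle(episodes)
    n_test = max(1, round(len(episodes) * test_fraction))
    n_test = min(n_test, len(episodes) - 1)  # keep at least one training episode
    test_episodes = set(episodes[:n_test])

    train_idx = [i for i, ep in enumerate(episode_ids) if ep not in test_episodes]
    test_idx = [i for i, ep in enumerate(episode_ids) if ep in test_episodes]
    return train_idx, test_idx


# Toy example: 14 windows drawn from 4 made-up episodes.
demo_ids = ["ep_a"] * 4 + ["ep_b"] * 3 + ["ep_c"] * 5 + ["ep_d"] * 2
demo_train, demo_test = split_by_episode(demo_ids)
print(len(demo_train), len(demo_test))
```

Because whole episodes are held out, correlated neighboring windows can no longer land on both sides of the split. With the single public episode the function raises an error, which is the point: there is nothing to hold out yet. If scikit-learn is available, `sklearn.model_selection.GroupShuffleSplit` offers the same kind of grouped split.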