# All-Modality Minimal Model

Script:

```text
scripts/train_all_modalities_model.py
```

This script extends the first minimal model to use every major modality in the sample, each through lightweight handcrafted features.

## Modalities Used

Dynamic sensor/action modalities:

- `hand_mocap/left_joints_3d`
- `hand_mocap/right_joints_3d`
- `full_body_mocap/keypoints`
- `full_body_mocap/contacts`
- `slam/trans_xyz`
- `slam/quat_wxyz` (converted by the toolkit into camera rotation matrices)
- `imu/accel_xyz`
- `imu/gyro_xyz`
- `depth/depth`
- `depth/confidence`
- `fisheye_cam0.mp4`
- `fisheye_cam1.mp4`
- `fisheye_cam2.mp4`
- `fisheye_cam3.mp4`
- `stereo_left.mp4`
- `stereo_right.mp4`
- AAC audio decoded from `fisheye_cam0.mp4`

Static/context modalities:

- `slam/point_cloud`
- `calibration/*`
- caption objects
- caption interaction text

By default, the script does **not** use `action_label`, `Sub Task`, or action-description text as input, because those fields are too close to the prediction target. Passing `--include-label-text` adds them anyway, but treat such a run as a leakage/debug check, not a fair action-recognition experiment.

## Feature Design

The model is still intentionally small:

```text
raw modality -> per-frame or static handcrafted features -> window temporal statistics -> softmax classifier
```

For each 20-frame window:

- Motion signals use mean/std/min/max/delta/velocity statistics.
- Depth uses global depth stats plus a small normalized depth grid and confidence grid.
- Each video stream uses color stats, color histograms, a small grayscale grid, and simple edge stats.
- Audio uses per-frame waveform/spectral statistics and log-spaced spectral band energies.
- Text uses a hashed bag-of-words vector from objects and interaction text.
- Point cloud and calibration are included as static episode-level features.
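
The motion-signal statistics listed above can be sketched as follows. This is a minimal illustration, not the script's actual code: the name `window_stats`, the definition of delta as last frame minus first frame, and velocity as the mean absolute frame-to-frame difference are all assumptions, and the real script may compute a different set of statistics.

```python
import statistics

WINDOW = 20  # frames per window, as stated above


def window_stats(frames):
    """Summarize one window of per-frame feature vectors (hypothetical helper).

    frames: list of equal-length lists of floats, one list per frame.
    Returns a flat list with six values per feature dimension:
    mean, std, min, max, delta, velocity.
    """
    n_dims = len(frames[0])
    out = []
    for d in range(n_dims):
        series = [frame[d] for frame in frames]
        # Frame-to-frame differences; empty if the window has a single frame.
        steps = [b - a for a, b in zip(series, series[1:])]
        out.extend([
            statistics.fmean(series),
            statistics.pstdev(series),
            min(series),
            max(series),
            series[-1] - series[0],  # delta (assumed: last minus first)
            statistics.fmean([abs(s) for s in steps]) if steps else 0.0,  # velocity (assumed)
        ])
    return out


# Toy window: dimension 0 ramps from 0 to 19, dimension 1 is constant.
toy_frames = [[float(t), 2.0] for t in range(WINDOW)]
toy_features = window_stats(toy_frames)
```

Each input dimension expands into a fixed number of summary values, so the window feature size depends only on the per-frame dimensionality, not on the window length.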

Current feature blocks:

```text
hand_left_joints:                  441
hand_right_joints:                 441
body_joints:                      1092
body_contacts:                     147
camera_translation:                 21
camera_rotation_matrix:             63
imu_accel_gyro:                     42
depth_confidence:                  980
video_fisheye_cam0:                686
video_fisheye_cam1:                686
video_fisheye_cam2:                686
video_fisheye_cam3:                686
video_stereo_left:                 686
video_stereo_right:                686
audio_fisheye_cam0_aac:            168
caption_objects_interaction_text:  896
slam_point_cloud:                   22
calibration:                       117
total:                            8546
```
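
As a quick sanity check, the block widths above can be summed and turned into slice offsets. This sketch assumes the blocks are concatenated in the order listed, which this document does not state; `FEATURE_BLOCKS` and `block_slices` are illustrative names, not part of the script.

```python
# Block names and widths copied from the listing above.
FEATURE_BLOCKS = [
    ("hand_left_joints", 441),
    ("hand_right_joints", 441),
    ("body_joints", 1092),
    ("body_contacts", 147),
    ("camera_translation", 21),
    ("camera_rotation_matrix", 63),
    ("imu_accel_gyro", 42),
    ("depth_confidence", 980),
    ("video_fisheye_cam0", 686),
    ("video_fisheye_cam1", 686),
    ("video_fisheye_cam2", 686),
    ("video_fisheye_cam3", 686),
    ("video_stereo_left", 686),
    ("video_stereo_right", 686),
    ("audio_fisheye_cam0_aac", 168),
    ("caption_objects_interaction_text", 896),
    ("slam_point_cloud", 22),
    ("calibration", 117),
]


def block_slices(blocks):
    """Map each block name to its slice in the concatenated feature vector.

    Assumes concatenation in list order (an assumption, not confirmed here).
    """
    slices = {}
    offset = 0
    for name, width in blocks:
        slices[name] = slice(offset, offset + width)
        offset += width
    return slices, offset


slices, total = block_slices(FEATURE_BLOCKS)
assert total == 8546  # matches the `total` line and the reported feature_dim
```

With such a mapping, `vector[slices["imu_accel_gyro"]]` would pick out one modality's features from a full window vector, provided the ordering assumption holds.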

## Run Commands

Action prediction:

```bash
cd /path/to/Ropedia
source .venv/bin/activate
python scripts/train_all_modalities_model.py
```

Subtask prediction:

```bash
python scripts/train_all_modalities_model.py --target subtask
```

The first run builds reusable caches in:

```text
outputs/feature_cache/
```

## Current Results

Action-label model:

```text
outputs/min_all_modalities_action_model/
accuracy:          0.9828
balanced_accuracy: 0.9856
macro_f1:          0.9829
weighted_f1:       0.9863
majority_baseline: 0.1375
classes:           18
feature_dim:       8546
test_windows:      291
```

Subtask-label model:

```text
outputs/min_all_modalities_subtask_model/
accuracy:          0.9828
balanced_accuracy: 0.9505
macro_f1:          0.9173
weighted_f1:       0.9841
majority_baseline: 0.1448
classes:           14
feature_dim:       8546
test_windows:      290
```
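
For reference, the reported metrics can be computed from true and predicted labels as sketched below. This uses the standard definitions (balanced accuracy as mean per-class recall, macro F1 as the unweighted mean of per-class F1, weighted F1 as the support-weighted mean). Two details are assumptions: the majority baseline is taken to be the share of the most frequent label among the evaluated windows, and only classes present in `y_true` are averaged. The function name `summarize` is illustrative.

```python
from collections import Counter


def summarize(y_true, y_pred):
    """Compute accuracy, balanced accuracy, macro/weighted F1, majority baseline."""
    n = len(y_true)
    support = Counter(y_true)      # true windows per class
    predicted = Counter(y_pred)    # predicted windows per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    recalls, f1s, weighted_f1 = [], [], 0.0
    for label in sorted(support):
        recall = correct[label] / support[label]
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        f1s.append(f1)
        weighted_f1 += f1 * support[label] / n

    return {
        "accuracy": sum(correct.values()) / n,
        "balanced_accuracy": sum(recalls) / len(recalls),
        "macro_f1": sum(f1s) / len(f1s),
        "weighted_f1": weighted_f1,
        # Assumed definition: always predict the most frequent label.
        "majority_baseline": max(support.values()) / n,
    }


# Toy example with three classes and one mistake.
toy_true = ["a", "a", "a", "b", "b", "c"]
toy_pred = ["a", "a", "b", "b", "b", "c"]
toy_metrics = summarize(toy_true, toy_pred)
```

Balanced accuracy and macro F1 weight every class equally, which is why they can fall below plain accuracy when a rare class is predicted poorly, as in the subtask results above.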

## How To Interpret This

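These results show that the full sample can be run through a complete supervised learning pipeline, from raw modalities to a trained classifier, on this Mac.

They do **not** demonstrate real generalization: the public sample is a single episode, and the train/test split is over random windows from that same episode. Neighboring windows are strongly correlated, so test windows closely resemble training windows and the scores above are optimistic.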

For a serious embodied-AI experiment:

```text
many episodes
-> cache features per episode
-> split by episode or task instance
-> train on some episodes
-> test on unseen episodes
```
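
The "split by episode" step above could look roughly like this. It is a sketch under assumptions: each window carries an episode identifier, and `split_by_episode`, `test_fraction`, and `seed` are hypothetical names that do not come from the script.

```python
import random


def split_by_episode(episode_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no episode appears in both sets.

    episode_ids: one episode identifier per window, in window order.
    """
    episodes = sorted(set(episode_ids))
    if len(episodes) < 2:
        # With a single episode (the current public sample) no such split exists.
        raise ValueError("need at least two episodes for an episode-level split")

    rng = random.Random(seed)
    rng.shuffle(episodes)
    # Hold out at least one episode, but always leave at least one for training.
    n_test = min(len(episodes) - 1, max(1, round(len(episodes) * test_fraction)))
    test_episodes = set(episodes[:n_test])

    train_idx = [i for i, e in enumerate(episode_ids) if e not in test_episodes]
    test_idx = [i for i, e in enumerate(episode_ids) if e in test_episodes]
    return train_idx, test_idx


# Toy example: five episodes with three windows each.
toy_episode_ids = [f"ep{k}" for k in range(5) for _ in range(3)]
toy_train, toy_test = split_by_episode(toy_episode_ids)
```

Because whole episodes are held out, correlated neighboring windows can no longer straddle the train/test boundary.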

The next useful upgrade is not a bigger classifier. It is a better split and more episodes.