# All-Modality Minimal Model

Script:

```text
scripts/train_all_modalities_model.py
```

This script extends the first minimal model by using every major modality in the sample, each through lightweight handcrafted features.

## Modalities Used

Dynamic sensor/action modalities:

- `hand_mocap/left_joints_3d`
- `hand_mocap/right_joints_3d`
- `full_body_mocap/keypoints`
- `full_body_mocap/contacts`
- `slam/trans_xyz`
- `slam/quat_wxyz` converted by the toolkit into camera rotation matrices
- `imu/accel_xyz`
- `imu/gyro_xyz`
- `depth/depth`
- `depth/confidence`
- `fisheye_cam0.mp4`
- `fisheye_cam1.mp4`
- `fisheye_cam2.mp4`
- `fisheye_cam3.mp4`
- `stereo_left.mp4`
- `stereo_right.mp4`
- AAC audio decoded from `fisheye_cam0.mp4`

Static/context modalities:

- `slam/point_cloud`
- `calibration/*`
- caption objects
- caption interaction text

By default, the script does **not** feed `action_label`, `Sub Task`, or action-description text to the model as input, because those fields are too close to the prediction target. Passing `--include-label-text` adds them anyway; treat such a run as a leakage/debug check, not a fair action-recognition experiment.

## Feature Design

The model is still intentionally small:

```text
raw modality -> per-frame or static handcrafted features -> window temporal statistics -> softmax classifier
```

For each 20-frame window:

- Motion signals use mean/std/min/max/delta/velocity statistics.
- Depth uses global depth stats plus a small normalized depth grid and confidence grid.
- Each video stream uses color stats, color histograms, a small grayscale grid, and simple edge stats.
- Audio uses per-frame waveform/spectral statistics and log-spaced spectral band energies.
- Text uses a hashed bag-of-words vector from objects and interaction text.
- Point cloud and calibration are included as static episode-level features.
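
As a rough illustration of the motion-signal bullet above, here is a minimal sketch of per-window temporal statistics. The function name `window_stats` and the exact definitions of `delta` (last frame minus first) and `velocity` (mean absolute frame-to-frame change) are assumptions for illustration; the script may define them differently. The listed block sizes suggest seven values per channel (for example, 21 for the 3-channel `slam/trans_xyz`), so the script's exact statistic set differs from this six-value sketch.

```python
import math


def window_stats(frames):
    """Summarize a window of per-frame vectors into one flat feature vector.

    frames: list of equal-length lists of floats, one list per frame.
    Returns [mean, std, min, max, delta, velocity] per channel, concatenated.
    """
    n = len(frames)
    features = []
    for channel in zip(*frames):  # one tuple per channel, ordered by time
        mean = sum(channel) / n
        var = sum((x - mean) ** 2 for x in channel) / n  # population variance
        steps = [b - a for a, b in zip(channel, channel[1:])]
        delta = channel[-1] - channel[0]  # net change across the window (assumed definition)
        # Mean absolute frame-to-frame change (assumed definition of "velocity").
        velocity = sum(abs(s) for s in steps) / len(steps) if steps else 0.0
        features.extend([mean, math.sqrt(var), min(channel), max(channel), delta, velocity])
    return features


# A synthetic 20-frame window with two channels: a ramp and a constant.
window = [[float(t), 2.0] for t in range(20)]
feats = window_stats(window)
```

Each channel contributes a fixed number of values regardless of window content, which is what lets every modality be reduced to a fixed-width block before concatenation.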

Current feature blocks:

```text
hand_left_joints:                  441
hand_right_joints:                 441
body_joints:                      1092
body_contacts:                     147
camera_translation:                 21
camera_rotation_matrix:             63
imu_accel_gyro:                     42
depth_confidence:                  980
video_fisheye_cam0:                686
video_fisheye_cam1:                686
video_fisheye_cam2:                686
video_fisheye_cam3:                686
video_stereo_left:                 686
video_stereo_right:                686
audio_fisheye_cam0_aac:            168
caption_objects_interaction_text:  896
slam_point_cloud:                   22
calibration:                       117
total:                            8546
```
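
The block widths above can be turned into column ranges, which is handy for checking the total or for ablating one modality at a time. This is a minimal sketch that assumes the blocks are concatenated in the order listed; the document does not state the actual order, and `block_slices` is a hypothetical helper, not part of the script.

```python
# Block widths copied from the table above.
FEATURE_BLOCKS = {
    "hand_left_joints": 441,
    "hand_right_joints": 441,
    "body_joints": 1092,
    "body_contacts": 147,
    "camera_translation": 21,
    "camera_rotation_matrix": 63,
    "imu_accel_gyro": 42,
    "depth_confidence": 980,
    "video_fisheye_cam0": 686,
    "video_fisheye_cam1": 686,
    "video_fisheye_cam2": 686,
    "video_fisheye_cam3": 686,
    "video_stereo_left": 686,
    "video_stereo_right": 686,
    "audio_fisheye_cam0_aac": 168,
    "caption_objects_interaction_text": 896,
    "slam_point_cloud": 22,
    "calibration": 117,
}


def block_slices(blocks):
    """Map each block name to its (start, stop) column range.

    Assumes concatenation in dict order (an assumption, see above).
    Returns the mapping and the total feature dimension.
    """
    slices, start = {}, 0
    for name, width in blocks.items():
        slices[name] = (start, start + width)
        start += width
    return slices, start


slices, total = block_slices(FEATURE_BLOCKS)
assert total == 8546  # matches the reported feature_dim
```

With such a mapping, a single-modality ablation amounts to zeroing or dropping the columns in one range and retraining.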

## Run Commands

Action prediction:

```bash
cd /path/to/Ropedia
source .venv/bin/activate
python scripts/train_all_modalities_model.py
```

Subtask prediction:

```bash
python scripts/train_all_modalities_model.py --target subtask
```

The first run builds reusable caches in:

```text
outputs/feature_cache/
```

## Current Results

Action-label model:

```text
outputs/min_all_modalities_action_model/
accuracy:          0.9828
balanced_accuracy: 0.9856
macro_f1:          0.9829
weighted_f1:       0.9863
majority_baseline: 0.1375
classes:           18
feature_dim:       8546
test_windows:      291
```

Subtask-label model:

```text
outputs/min_all_modalities_subtask_model/
accuracy:          0.9828
balanced_accuracy: 0.9505
macro_f1:          0.9173
weighted_f1:       0.9841
majority_baseline: 0.1448
classes:           14
feature_dim:       8546
test_windows:      290
```
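
For readers comparing the numbers above, here is a minimal sketch of how these metrics are conventionally defined. It is not the script's evaluation code; `classification_summary` is a hypothetical name, and computing the majority baseline from the most frequent class among the evaluated labels is an assumption.

```python
from collections import Counter


def classification_summary(y_true, y_pred):
    """Compute accuracy, balanced accuracy, macro F1, and a majority baseline."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    recalls, f1s = [], []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        recall = tp / (tp + fn)  # tp + fn > 0 because label occurs in y_true
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        recalls.append(recall)
        f1s.append(f1)

    return {
        "accuracy": accuracy,
        "balanced_accuracy": sum(recalls) / len(recalls),  # mean per-class recall
        "macro_f1": sum(f1s) / len(f1s),  # unweighted mean of per-class F1
        # Accuracy of always predicting the most frequent true label (assumed definition).
        "majority_baseline": Counter(y_true).most_common(1)[0][1] / n,
    }


summary = classification_summary(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

Because balanced accuracy and macro F1 weight every class equally, a gap such as the subtask model's 0.9828 accuracy versus 0.9173 macro F1 usually indicates that the errors fall disproportionately on small classes.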

## How To Interpret This

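This shows that the full sample can be carried through a complete supervised learning pipeline, from raw modalities to an evaluated classifier, on this Mac.

It does **not** demonstrate generalization: the public sample is a single episode, and the train/test split draws random windows from that same episode. Neighboring windows are correlated, so test windows closely resemble windows seen during training, which likely inflates the scores above.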

For a serious embodied-AI experiment:

```text
many episodes
-> cache features per episode
-> split by episode or task instance
-> train on some episodes
-> test on unseen episodes
```
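
The split step in the outline above could look roughly like the following. This is a minimal sketch, not the script's code: `split_by_episode` and the `episode_ids` list (one episode identifier per window) are hypothetical names introduced for illustration.

```python
import random


def split_by_episode(episode_ids, test_fraction=0.2, seed=0):
    """Split window indices so that no episode appears in both train and test.

    episode_ids: one episode identifier per window, in window order.
    Returns (train_indices, test_indices).
    """
    episodes = sorted(set(episode_ids))
    if len(episodes) < 2:
        # With a single episode there is nothing unseen to test on.
        raise ValueError("need at least two episodes for an episode-level split")

    rng = random.Random(seed)  # seeded for a reproducible split
    rng.shuffle(episodes)
    n_test = max(1, round(len(episodes) * test_fraction))
    n_test = min(n_test, len(episodes) - 1)  # keep at least one training episode
    test_episodes = set(episodes[:n_test])

    train_idx = [i for i, e in enumerate(episode_ids) if e not in test_episodes]
    test_idx = [i for i, e in enumerate(episode_ids) if e in test_episodes]
    return train_idx, test_idx


# Five hypothetical episodes with three windows each.
ids = [f"ep{k}" for k in range(5) for _ in range(3)]
train_idx, test_idx = split_by_episode(ids)
```

The function deliberately refuses to split a single episode, which is the situation the current public sample is in.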

The next useful upgrade is not a bigger classifier. It is a better split and more episodes.