# All-Modality Minimal Model
Script:
```text
scripts/train_all_modalities_model.py
```
This extends the first minimal model to use every major modality in the sample, each reduced to lightweight handcrafted features.
## Modalities Used
Dynamic sensor/action modalities:
- `hand_mocap/left_joints_3d`
- `hand_mocap/right_joints_3d`
- `full_body_mocap/keypoints`
- `full_body_mocap/contacts`
- `slam/trans_xyz`
- `slam/quat_wxyz` (converted by the toolkit into camera rotation matrices)
- `imu/accel_xyz`
- `imu/gyro_xyz`
- `depth/depth`
- `depth/confidence`
- `fisheye_cam0.mp4`
- `fisheye_cam1.mp4`
- `fisheye_cam2.mp4`
- `fisheye_cam3.mp4`
- `stereo_left.mp4`
- `stereo_right.mp4`
Static/context modalities:
- `slam/point_cloud`
- `calibration/*`
- caption objects
- caption interaction text
By default, the script does **not** feed `action_label`, `Sub Task`, or action-description text to the model, because these fields are too close to the prediction target. Passing `--include-label-text` forces them in, but treat such a run as a leakage/debug check, not a fair action-recognition experiment.
## Feature Design
The model is still intentionally small:
```text
raw modality -> per-frame or static handcrafted features -> window temporal statistics -> softmax classifier
```
For each 20-frame window:
- Motion signals use mean/std/min/max/delta/velocity statistics.
- Depth uses global depth stats plus a small normalized depth grid and confidence grid.
- Each video stream uses color stats, color histograms, a small grayscale grid, and simple edge stats.
- Text uses a hashed bag-of-words vector from objects and interaction text.
- Point cloud and calibration are included as static episode-level features.
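The motion-signal step above can be sketched as follows. This is a minimal, stdlib-only illustration, not the code in `scripts/train_all_modalities_model.py`: the helper names `sliding_windows` and `window_stats` are hypothetical, the non-overlapping stride is an assumption, and so are the exact definitions of "delta" (last frame minus first frame) and "velocity" (mean absolute frame-to-frame change). The block sizes listed under "Current feature blocks" (for example 21 for a 3-channel translation) suggest the script keeps seven values per channel, so it likely computes one statistic more than the six named here.

```python
from statistics import fmean, pstdev


def sliding_windows(frames, size=20, stride=20):
    """Cut a frame sequence into fixed-size windows (stride is an assumption)."""
    return [frames[i:i + size] for i in range(0, len(frames) - size + 1, stride)]


def window_stats(window):
    """Summarize one window of per-frame rows into six statistics per channel.

    `window` is a list of frames; each frame is a list of floats of equal length.
    The result is a flat list laid out channel by channel.
    """
    n_channels = len(window[0])
    features = []
    for c in range(n_channels):
        series = [frame[c] for frame in window]
        diffs = [b - a for a, b in zip(series, series[1:])]
        features.extend([
            fmean(series),                                    # mean
            pstdev(series),                                   # population std
            min(series),
            max(series),
            series[-1] - series[0],                           # delta (assumed definition)
            fmean([abs(d) for d in diffs]) if diffs else 0.0, # velocity (assumed definition)
        ])
    return features


# Toy signal: 40 frames, 3 channels (rising, constant, falling).
frames = [[float(t), 2.0, -float(t)] for t in range(40)]
windows = sliding_windows(frames)
first = window_stats(windows[0])
print(len(windows), len(first))
```

Each window collapses to a fixed-length vector regardless of how the signal moves inside it, which is what lets a plain softmax classifier consume it.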
Current feature blocks:
```text
hand_left_joints: 441
hand_right_joints: 441
body_joints: 1092
body_contacts: 147
camera_translation: 21
camera_rotation_matrix: 63
imu_accel_gyro: 42
depth_confidence: 980
video_fisheye_cam0: 686
video_fisheye_cam1: 686
video_fisheye_cam2: 686
video_fisheye_cam3: 686
video_stereo_left: 686
video_stereo_right: 686
caption_objects_interaction_text: 896
slam_point_cloud: 22
calibration: 117
total: 8378
```
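The block sizes above add up to the stated total, and the same table can drive per-modality slicing for ablations. The sketch below is illustrative only: `block_slices` and `select_blocks` are hypothetical helpers, and it assumes the blocks are concatenated in the order listed, which the notes do not state.

```python
# Block sizes copied from the "Current feature blocks" listing.
FEATURE_BLOCKS = {
    "hand_left_joints": 441,
    "hand_right_joints": 441,
    "body_joints": 1092,
    "body_contacts": 147,
    "camera_translation": 21,
    "camera_rotation_matrix": 63,
    "imu_accel_gyro": 42,
    "depth_confidence": 980,
    "video_fisheye_cam0": 686,
    "video_fisheye_cam1": 686,
    "video_fisheye_cam2": 686,
    "video_fisheye_cam3": 686,
    "video_stereo_left": 686,
    "video_stereo_right": 686,
    "caption_objects_interaction_text": 896,
    "slam_point_cloud": 22,
    "calibration": 117,
}


def block_slices(blocks):
    """Map each block name to its slice in the concatenated vector.

    Assumes concatenation follows dict order (an assumption, not confirmed).
    Returns (slices, total_dim).
    """
    slices, start = {}, 0
    for name, size in blocks.items():
        slices[name] = slice(start, start + size)
        start += size
    return slices, start


def select_blocks(vector, names, slices):
    """Keep only the named blocks of one feature vector, in the given order."""
    kept = []
    for name in names:
        kept.extend(vector[slices[name]])
    return kept


SLICES, TOTAL = block_slices(FEATURE_BLOCKS)
assert TOTAL == 8378  # matches the total in the listing

video_dim = sum(v for k, v in FEATURE_BLOCKS.items() if k.startswith("video_"))
print(TOTAL, video_dim)
```

Under this layout, the six video streams account for 4116 of the 8378 dimensions, so a video-only or no-video ablation is a natural first check of which modalities carry the result.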
## Run Commands
Action prediction:
```bash
cd /path/to/Ropedia
source .venv/bin/activate
python scripts/train_all_modalities_model.py
```
Subtask prediction:
```bash
python scripts/train_all_modalities_model.py --target subtask
```
The first run builds reusable caches in:
```text
outputs/feature_cache/
```
## Current Results
Action-label model:
```text
outputs/min_all_modalities_action_model/
accuracy: 0.9828
balanced_accuracy: 0.9801
macro_f1: 0.9791
weighted_f1: 0.9828
majority_baseline: 0.1375
classes: 18
feature_dim: 8378
test_windows: 291
```
Subtask-label model:
```text
outputs/min_all_modalities_subtask_model/
accuracy: 0.9828
balanced_accuracy: 0.9505
macro_f1: 0.9308
weighted_f1: 0.9838
majority_baseline: 0.1448
classes: 14
feature_dim: 8378
test_windows: 290
```
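For readers who want to check how these metrics relate to each other, here is a stdlib-only sketch that computes them from true and predicted labels. It is not the script's evaluation code: `classification_metrics` is a hypothetical helper, and it assumes that `balanced_accuracy` is the mean per-class recall, that macro and weighted F1 are taken over the classes present in the true labels, and that `majority_baseline` is the share of test windows in the most frequent test class.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Compute accuracy-style metrics from two equal-length label lists."""
    n = len(y_true)
    support = Counter(y_true)      # windows per true class
    predicted = Counter(y_pred)    # windows per predicted class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    recalls, f1s, weighted_f1 = [], [], 0.0
    for label in sorted(support):
        recall = correct[label] / support[label]
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        f1s.append(f1)
        weighted_f1 += f1 * support[label] / n

    return {
        "accuracy": sum(correct.values()) / n,
        "balanced_accuracy": sum(recalls) / len(recalls),
        "macro_f1": sum(f1s) / len(f1s),
        "weighted_f1": weighted_f1,
        # Assumed definition: always predict the most frequent test class.
        "majority_baseline": max(support.values()) / n,
    }


# Toy example: a classifier that never predicts the rare class "b".
metrics = classification_metrics(["a", "a", "a", "b"], ["a", "a", "a", "a"])
print(metrics)
```

Because macro F1 and balanced accuracy weight every class equally, a few misclassified windows in a rare class lower them much more than they lower accuracy. That is one plausible reading of the gap between 0.9828 accuracy and 0.9308 macro F1 in the subtask run.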
## How To Interpret This
This shows that the full sample can be run through a complete supervised learning pipeline on this Mac.
It does **not** show real generalization: the public sample is a single episode, and the split draws random windows from that same episode. Neighboring windows are correlated, so test windows closely resemble windows seen in training.
For a serious embodied-AI experiment:
```text
many episodes
-> cache features per episode
-> split by episode or task instance
-> train on some episodes
-> test on unseen episodes
```
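The "split by episode" step in the outline above can be sketched as below. This is an illustration under assumptions, not part of the published scripts: `split_by_episode` is a hypothetical helper, and it assumes each window carries an episode identifier. It refuses to split a single-episode dataset, which is exactly the situation of the current public sample.

```python
import random


def split_by_episode(episode_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no episode lands on both sides.

    `episode_ids` holds one episode identifier per window.
    """
    episodes = sorted(set(episode_ids))
    if len(episodes) < 2:
        raise ValueError("need at least two episodes for an episode-level split")

    rng = random.Random(seed)  # local RNG keeps the split reproducible
    rng.shuffle(episodes)

    # At least one test episode, and at least one left for training.
    n_test = min(max(1, round(len(episodes) * test_fraction)), len(episodes) - 1)
    test_episodes = set(episodes[:n_test])

    train_idx = [i for i, e in enumerate(episode_ids) if e not in test_episodes]
    test_idx = [i for i, e in enumerate(episode_ids) if e in test_episodes]
    return train_idx, test_idx


# Toy data: 5 episodes with 4 windows each.
episode_ids = [f"ep{k}" for k in range(5) for _ in range(4)]
train_idx, test_idx = split_by_episode(episode_ids)
print(len(train_idx), len(test_idx))
```

Holding out whole episodes means correlated neighboring windows can no longer straddle the train/test boundary, so the resulting scores say something about unseen recordings.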
The next useful upgrade is not a bigger classifier. It is a better split and more episodes.