# All-Modality Minimal Model
Script:
```text
scripts/train_all_modalities_model.py
```
This extends the first minimal model by using every major sample modality in a lightweight way.
## Modalities Used
Dynamic sensor/action modalities:
- `hand_mocap/left_joints_3d`
- `hand_mocap/right_joints_3d`
- `full_body_mocap/keypoints`
- `full_body_mocap/contacts`
- `slam/trans_xyz`
- `slam/quat_wxyz`, converted by the toolkit into camera rotation matrices
- `imu/accel_xyz`
- `imu/gyro_xyz`
- `depth/depth`
- `depth/confidence`
- `fisheye_cam0.mp4`
- `fisheye_cam1.mp4`
- `fisheye_cam2.mp4`
- `fisheye_cam3.mp4`
- `stereo_left.mp4`
- `stereo_right.mp4`
- AAC audio decoded from `fisheye_cam0.mp4`
Static/context modalities:
- `slam/point_cloud`
- `calibration/*`
- caption objects
- caption interaction text
By default, the script does **not** include `action_label`, `Sub Task`, or action-description text as input, because these fields are too close to the prediction target. You can include them with `--include-label-text`, but treat that as a leakage/debug run, not a fair action-recognition experiment.
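One way to make this rule hard to break by accident is to filter label-derived fields before any text featurization. The sketch below is an illustration only: `LABEL_FIELDS`, the `action_description` key, the `caption_text_fields` helper, and the example caption are hypothetical, not the script's actual code or data.

```python
# `action_label` and `Sub Task` are named in the text above;
# `action_description` is an assumed key for the action-description text.
LABEL_FIELDS = {"action_label", "Sub Task", "action_description"}


def caption_text_fields(caption, include_label_text=False):
    """Return the caption fields that are allowed as model input."""
    if include_label_text:
        # Leakage/debug mode: every field passes through.
        return dict(caption)
    return {k: v for k, v in caption.items() if k not in LABEL_FIELDS}


# Made-up caption record, for illustration only.
caption = {
    "objects": "cup, kettle",
    "interaction": "right hand holds the kettle handle",
    "action_label": "pour water",
    "Sub Task": "make tea",
}
print(sorted(caption_text_fields(caption)))  # ['interaction', 'objects']
```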
## Feature Design
The model is still intentionally small:
```text
raw modality -> per-frame or static handcrafted features -> window temporal statistics -> softmax classifier
```
For each 20-frame window:
- Motion signals use mean/std/min/max/delta/velocity statistics.
- Depth uses global depth stats plus a small normalized depth grid and confidence grid.
- Each video stream uses color stats, color histograms, a small grayscale grid, and simple edge stats.
- Audio uses per-frame waveform/spectral statistics and log-spaced spectral band energies.
- Text uses a hashed bag-of-words vector from objects and interaction text.
- Point cloud and calibration are included as static episode-level features.
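Two of the steps above, window statistics and text hashing, can be sketched in a few lines. This is a minimal sketch under assumptions, not the script's implementation: the function names, the exact set of six statistics, the velocity definition (mean absolute frame-to-frame change), and the MD5-based bucket choice are all picked here for illustration.

```python
import hashlib
import math


def window_stats(frames):
    """Summarize a window of per-frame vectors with per-channel statistics.

    `frames` is a list of equal-length lists, one per frame. Returns
    [mean, std, min, max, delta, velocity] for each channel, flattened.
    """
    out = []
    for channel in zip(*frames):  # iterate over channels, not frames
        n = len(channel)
        mean = sum(channel) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in channel) / n)
        delta = channel[-1] - channel[0]  # net change across the window
        steps = [b - a for a, b in zip(channel, channel[1:])]
        # Assumed definition: mean absolute frame-to-frame change.
        velocity = sum(abs(s) for s in steps) / len(steps) if steps else 0.0
        out.extend([mean, std, min(channel), max(channel), delta, velocity])
    return out


def hashed_bow(text, dim=64):
    """Hashed bag-of-words: each token increments one of `dim` buckets."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # hashlib is stable across runs, unlike the built-in hash().
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "little") % dim] += 1.0
    return vec


# A 20-frame window of a synthetic 3-channel signal.
window = [[float(t), 2.0 * t, 1.0] for t in range(20)]
stats = window_stats(window)
print(len(stats))  # 3 channels x 6 statistics = 18
```

Hashing keeps the text block at a fixed width without storing a vocabulary, at the cost of occasional collisions between unrelated tokens.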
Current feature blocks:
```text
hand_left_joints: 441
hand_right_joints: 441
body_joints: 1092
body_contacts: 147
camera_translation: 21
camera_rotation_matrix: 63
imu_accel_gyro: 42
depth_confidence: 980
video_fisheye_cam0: 686
video_fisheye_cam1: 686
video_fisheye_cam2: 686
video_fisheye_cam3: 686
video_stereo_left: 686
video_stereo_right: 686
audio_fisheye_cam0_aac: 168
caption_objects_interaction_text: 896
slam_point_cloud: 22
calibration: 117
total: 8546
```
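For ablations it helps to know where each block sits in the concatenated vector. The sketch below copies the sizes from the list above and derives slice offsets; it assumes the blocks are concatenated in the listed order, which the script may or may not do, and `block_slices` is a hypothetical helper.

```python
# Block sizes copied from the list above, in the same order.
FEATURE_BLOCKS = [
    ("hand_left_joints", 441),
    ("hand_right_joints", 441),
    ("body_joints", 1092),
    ("body_contacts", 147),
    ("camera_translation", 21),
    ("camera_rotation_matrix", 63),
    ("imu_accel_gyro", 42),
    ("depth_confidence", 980),
    ("video_fisheye_cam0", 686),
    ("video_fisheye_cam1", 686),
    ("video_fisheye_cam2", 686),
    ("video_fisheye_cam3", 686),
    ("video_stereo_left", 686),
    ("video_stereo_right", 686),
    ("audio_fisheye_cam0_aac", 168),
    ("caption_objects_interaction_text", 896),
    ("slam_point_cloud", 22),
    ("calibration", 117),
]


def block_slices(blocks):
    """Map each block name to its slice in the concatenated feature vector."""
    slices, start = {}, 0
    for name, size in blocks:
        slices[name] = slice(start, start + size)
        start += size
    return slices, start


SLICES, TOTAL = block_slices(FEATURE_BLOCKS)
print(TOTAL)  # 8546, matching the total above
print(SLICES["imu_accel_gyro"])  # slice(2205, 2247, None)
```

With such a mapping, dropping one modality for an ablation run is a matter of deleting or zeroing `features[SLICES[name]]`.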
## Run Commands
Action prediction:
```bash
cd /path/to/Ropedia
source .venv/bin/activate
python scripts/train_all_modalities_model.py
```
Subtask prediction:
```bash
python scripts/train_all_modalities_model.py --target subtask
```
The first run builds reusable caches in:
```text
outputs/feature_cache/
```
## Current Results
Action-label model:
```text
outputs/min_all_modalities_action_model/
accuracy: 0.9828
balanced_accuracy: 0.9856
macro_f1: 0.9829
weighted_f1: 0.9863
majority_baseline: 0.1375
classes: 18
feature_dim: 8546
test_windows: 291
```
Subtask-label model:
```text
outputs/min_all_modalities_subtask_model/
accuracy: 0.9828
balanced_accuracy: 0.9505
macro_f1: 0.9173
weighted_f1: 0.9841
majority_baseline: 0.1448
classes: 14
feature_dim: 8546
test_windows: 290
```
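The gap between `accuracy` and `macro_f1` in the subtask run is what you would expect when the few errors fall on rare classes: accuracy weights every window equally, while balanced accuracy and macro F1 weight every class equally. The sketch below shows the standard definitions on a toy example. The helper names are hypothetical, and whether the script computes `majority_baseline` from the training or the test labels is not stated here, so the version below simply uses the labels it is given.

```python
from collections import Counter


def majority_baseline(y_true):
    """Accuracy of always predicting the most frequent class in y_true."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


def per_class_counts(y_true, y_pred):
    """Return {class: [tp, fp, fn]} over the classes present in y_true."""
    counts = {c: [0, 0, 0] for c in set(y_true)}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t][0] += 1
        else:
            counts[t][2] += 1  # missed a true `t`
            if p in counts:
                counts[p][1] += 1  # false alarm for `p`
    return counts


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    counts = per_class_counts(y_true, y_pred)
    return sum(tp / (tp + fn) for tp, _, fn in counts.values()) / len(counts)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1."""
    counts = per_class_counts(y_true, y_pred)
    f1s = [2 * tp / (2 * tp + fp + fn) for tp, fp, fn in counts.values()]
    return sum(f1s) / len(f1s)


# Toy example: one error, and it lands on the rare class "b".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]
print(majority_baseline(y_true))          # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.75
print(round(macro_f1(y_true, y_pred), 4))
```

In the toy example plain accuracy is 0.9, yet balanced accuracy drops to 0.75 because the single mistake halves the recall of the rare class.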
## How To Interpret This
This shows that the full sample can be turned into a complete supervised learning pipeline that runs end to end on a single Mac.
It does **not** show real generalization, because the public sample is one episode and the split is over random windows from that same episode. Neighboring windows are correlated, so the training set contains close neighbors of most test windows, which inflates the scores.
For a serious embodied-AI experiment:
```text
many episodes
-> cache features per episode
-> split by episode or task instance
-> train on some episodes
-> test on unseen episodes
```
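The episode-level split above can be sketched as follows. This is a minimal illustration, not part of the current script: `split_by_episode`, its parameters, and the episode IDs are hypothetical. It refuses to run on a single episode, which is exactly the situation with the public sample.

```python
import random


def split_by_episode(episode_ids, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so that no episode appears in both.

    `episode_ids` holds one episode ID per window, in window order.
    """
    episodes = sorted(set(episode_ids))
    if len(episodes) < 2:
        raise ValueError("need at least two episodes for an episode-level split")
    rng = random.Random(seed)  # local RNG keeps the split reproducible
    rng.shuffle(episodes)
    n_test = max(1, round(len(episodes) * test_fraction))
    test_episodes = set(episodes[:n_test])
    train_idx = [i for i, e in enumerate(episode_ids) if e not in test_episodes]
    test_idx = [i for i, e in enumerate(episode_ids) if e in test_episodes]
    return train_idx, test_idx


# Eight hypothetical episodes with five windows each.
episode_ids = [f"ep{e:02d}" for e in range(8) for _ in range(5)]
train_idx, test_idx = split_by_episode(episode_ids)
train_eps = {episode_ids[i] for i in train_idx}
test_eps = {episode_ids[i] for i in test_idx}
print(len(train_idx), len(test_idx))  # 30 10
print(train_eps.isdisjoint(test_eps))  # True
```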
The next useful upgrade is not a bigger classifier. It is a better split and more episodes.