Robotics
PyTorch
Cosmos
xperience10m_task_baseline_suite
embodied-ai
multimodal
xperience-10m
baseline
evaluation
qwen3-omni
# All-Modality Minimal Model

Script:

```text
scripts/train_all_modalities_model.py
```

This extends the first minimal model by using every major sample modality in a lightweight way.

## Modalities Used

Dynamic sensor/action modalities:

- `hand_mocap/left_joints_3d`
- `hand_mocap/right_joints_3d`
- `full_body_mocap/keypoints`
- `full_body_mocap/contacts`
- `slam/trans_xyz`
- `slam/quat_wxyz` converted by the toolkit into camera rotation matrices
- `imu/accel_xyz`
- `imu/gyro_xyz`
- `depth/depth`
- `depth/confidence`
- `fisheye_cam0.mp4`
- `fisheye_cam1.mp4`
- `fisheye_cam2.mp4`
- `fisheye_cam3.mp4`
- `stereo_left.mp4`
- `stereo_right.mp4`
- AAC audio decoded from `fisheye_cam0.mp4`
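
The `slam/quat_wxyz` conversion mentioned in the list can be illustrated with the standard unit-quaternion formula. This is a minimal sketch, not the toolkit's code: the function name `quat_wxyz_to_matrix` is hypothetical, and whether the toolkit treats the rotation as camera-to-world or world-to-camera is not stated here.

```python
import math


def quat_wxyz_to_matrix(q):
    """Convert a (w, x, y, z) quaternion to a 3x3 rotation matrix.

    Hypothetical helper for illustration; the toolkit's own conversion
    (and its frame convention) may differ.
    """
    w, x, y, z = q
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("zero-length quaternion")
    # Normalize so that slightly non-unit inputs still give a valid rotation.
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


# The identity quaternion maps to the identity matrix.
identity = quat_wxyz_to_matrix((1.0, 0.0, 0.0, 0.0))

# A 90-degree rotation about the z axis.
half = math.sqrt(0.5)
rot_z = quat_wxyz_to_matrix((half, 0.0, 0.0, half))
```

Flattening the matrix gives nine values per frame, which is one plausible way a rotation could feed per-frame features.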

Static/context modalities:

- `slam/point_cloud`
- `calibration/*`
- caption objects
- caption interaction text

By default, the script does **not** include `action_label`, `Sub Task`, or action-description text as input, because those are too close to the prediction target. You can include them with `--include-label-text`, but treat any such run as a leakage/debug run, not a fair action-recognition experiment.

## Feature Design

The model is still intentionally small:

```text
raw modality -> per-frame or static handcrafted features -> window temporal statistics -> softmax classifier
```
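
The last stage of that pipeline, a softmax classifier over fixed-length feature vectors, can be sketched in plain Python. This is an illustrative toy, not the script's implementation: the function names, learning rate, epoch count, and the tiny 2-D dataset are all assumptions, and the real model works on much larger vectors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scores(W, b, x):
    """Linear class scores: one dot product plus bias per class."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def train_softmax(X, y, n_classes, lr=0.5, epochs=500):
    """Fit weights with per-sample gradient descent on cross-entropy loss.

    Hyperparameters are illustrative assumptions, not the script's values.
    """
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            probs = softmax(scores(W, b, xi))
            for c in range(n_classes):
                # Gradient of cross-entropy w.r.t. the class-c logit.
                grad = probs[c] - (1.0 if c == yi else 0.0)
                b[c] -= lr * grad
                for j in range(n_features):
                    W[c][j] -= lr * grad * xi[j]
    return W, b


def predict(W, b, x):
    """Return the index of the highest-scoring class."""
    s = scores(W, b, x)
    return max(range(len(s)), key=s.__getitem__)


# Toy data: three well-separated clusters standing in for window features.
X = [[0.0, 0.0], [0.1, 0.1], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [0, 0, 1, 1, 2, 2]
W, b = train_softmax(X, y, n_classes=3)
predictions = [predict(W, b, x) for x in X]
```

Because the classifier is linear, most of the modeling work falls on the handcrafted features that feed it.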

For each 20-frame window:

- Motion signals use mean/std/min/max/delta/velocity statistics.
- Depth uses global depth stats plus a small normalized depth grid and confidence grid.
- Each video stream uses color stats, color histograms, a small grayscale grid, and simple edge stats.
- Audio uses per-frame waveform/spectral statistics and log-spaced spectral band energies.
- Text uses a hashed bag-of-words vector from objects and interaction text.
- Point cloud and calibration are included as static episode-level features.
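
The window statistics for motion signals can be sketched as follows. This is a hedged illustration only: the exact definitions the script uses are not shown here, so treating `delta` as last-minus-first, `velocity` as the mean absolute frame-to-frame difference, and windows as non-overlapping are all assumptions, as are the function names.

```python
import statistics

WINDOW = 20  # frames per window, as stated in the text


def window_stats(values):
    """Summarize one channel over one window of per-frame values.

    The definitions of "delta" and "velocity" are assumptions made for
    this sketch; the script may define them differently.
    """
    diffs = [later - earlier for earlier, later in zip(values, values[1:])]
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
        "delta": values[-1] - values[0],
        "velocity": statistics.fmean([abs(d) for d in diffs]) if diffs else 0.0,
    }


def windowed_features(frames, window=WINDOW):
    """Turn per-frame channel lists into one flat vector per window.

    Uses non-overlapping windows (an assumption) and drops any
    trailing partial window.
    """
    rows = []
    for start in range(0, len(frames) - window + 1, window):
        chunk = frames[start:start + window]
        row = []
        for channel in range(len(chunk[0])):
            stats = window_stats([frame[channel] for frame in chunk])
            row.extend(stats.values())
        rows.append(row)
    return rows


# Two channels over 40 frames: a ramp and a constant signal.
frames = [[float(i), 2.0] for i in range(40)]
features = windowed_features(frames)
```

Each window ends up with a fixed-length vector regardless of the signal's content, which is what lets a simple linear classifier consume it.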

Current feature blocks:

```text
hand_left_joints: 441
hand_right_joints: 441
body_joints: 1092
body_contacts: 147
camera_translation: 21
camera_rotation_matrix: 63
imu_accel_gyro: 42
depth_confidence: 980
video_fisheye_cam0: 686
video_fisheye_cam1: 686
video_fisheye_cam2: 686
video_fisheye_cam3: 686
video_stereo_left: 686
video_stereo_right: 686
audio_fisheye_cam0_aac: 168
caption_objects_interaction_text: 896
slam_point_cloud: 22
calibration: 117
total: 8546
```
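
The block widths can be checked against the stated total, and turned into index ranges for per-modality ablations. The sketch below assumes the blocks are concatenated in the order listed, which is not confirmed here; `block_slices` is a hypothetical helper.

```python
# Block widths copied from the listing above.
FEATURE_BLOCKS = {
    "hand_left_joints": 441,
    "hand_right_joints": 441,
    "body_joints": 1092,
    "body_contacts": 147,
    "camera_translation": 21,
    "camera_rotation_matrix": 63,
    "imu_accel_gyro": 42,
    "depth_confidence": 980,
    "video_fisheye_cam0": 686,
    "video_fisheye_cam1": 686,
    "video_fisheye_cam2": 686,
    "video_fisheye_cam3": 686,
    "video_stereo_left": 686,
    "video_stereo_right": 686,
    "audio_fisheye_cam0_aac": 168,
    "caption_objects_interaction_text": 896,
    "slam_point_cloud": 22,
    "calibration": 117,
}


def block_slices(blocks):
    """Map each block name to its slice in the concatenated vector.

    Assumes concatenation in dict order; the script's real layout
    may differ.
    """
    slices, start = {}, 0
    for name, width in blocks.items():
        slices[name] = slice(start, start + width)
        start += width
    return slices


total_dim = sum(FEATURE_BLOCKS.values())
slices = block_slices(FEATURE_BLOCKS)

# Share of the vector taken by the six video streams.
video_dim = sum(w for name, w in FEATURE_BLOCKS.items() if name.startswith("video_"))
video_share = video_dim / total_dim
```

With slices like these, zeroing or dropping one block at a time is a cheap way to see which modalities the classifier actually relies on.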

## Run Commands

Action prediction:

```bash
cd /path/to/Ropedia
source .venv/bin/activate
python scripts/train_all_modalities_model.py
```

Subtask prediction:

```bash
python scripts/train_all_modalities_model.py --target subtask
```

The first run builds reusable caches in:

```text
outputs/feature_cache/
```

## Current Results

Action-label model:

```text
outputs/min_all_modalities_action_model/
accuracy: 0.9828
balanced_accuracy: 0.9856
macro_f1: 0.9829
weighted_f1: 0.9863
majority_baseline: 0.1375
classes: 18
feature_dim: 8546
test_windows: 291
```

Subtask-label model:

```text
outputs/min_all_modalities_subtask_model/
accuracy: 0.9828
balanced_accuracy: 0.9505
macro_f1: 0.9173
weighted_f1: 0.9841
majority_baseline: 0.1448
classes: 14
feature_dim: 8546
test_windows: 290
```
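
For readers unfamiliar with the reported metrics, here is a minimal sketch of how they are commonly defined. It is not the script's evaluation code: the function name is hypothetical, classes are taken from the true labels only, and the majority baseline is computed on the evaluated labels, which may differ from how the numbers above were produced.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Compute common multi-class metrics from label lists.

    Assumption: classes are those present in y_true, and the majority
    baseline is the share of the most frequent true label.
    """
    n = len(y_true)
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    recalls, f1s, weighted_f1 = [], [], 0.0
    for label in sorted(support):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        recall = tp / support[label]
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        f1s.append(f1)
        weighted_f1 += f1 * support[label] / n
    return {
        "accuracy": sum(t == p for t, p in pairs) / n,
        # Balanced accuracy is the unweighted mean of per-class recall.
        "balanced_accuracy": sum(recalls) / len(recalls),
        "macro_f1": sum(f1s) / len(f1s),
        "weighted_f1": weighted_f1,
        "majority_baseline": max(support.values()) / n,
    }


# Tiny imbalanced example: three windows of class "a", one of class "b".
metrics = classification_metrics(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

Macro F1 and balanced accuracy weight every class equally, so a gap between them and plain accuracy, as in the subtask results, usually points to weaker performance on rare classes.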

## How To Interpret This

This shows that the full sample can be turned into a complete supervised learning pipeline that runs on a single Mac.

It does **not** show real generalization, because the public sample is one episode and the split is random windows from that same episode. Neighboring windows are highly correlated, so the test set contains near-duplicates of training windows and the scores overstate how the model would do on new data.

For a serious embodied-AI experiment:

```text
many episodes
-> cache features per episode
-> split by episode or task instance
-> train on some episodes
-> test on unseen episodes
```
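
The "split by episode" step can be sketched as a group-aware split in which every window of an episode lands on the same side. This is an illustrative helper under assumed names (`split_by_episode`, `episode_ids`), not part of the repository's scripts.

```python
import random


def split_by_episode(episode_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with no episode on both sides.

    episode_ids holds one episode identifier per window. Hypothetical
    helper for illustration only.
    """
    episodes = sorted(set(episode_ids))
    if len(episodes) < 2:
        # One episode cannot give an unseen-episode test set.
        raise ValueError("need at least two episodes to split by episode")
    rng = random.Random(seed)
    rng.shuffle(episodes)
    # Hold out at least one episode, but always leave one for training.
    n_test = min(max(1, round(len(episodes) * test_fraction)), len(episodes) - 1)
    test_episodes = set(episodes[:n_test])
    train_idx = [i for i, e in enumerate(episode_ids) if e not in test_episodes]
    test_idx = [i for i, e in enumerate(episode_ids) if e in test_episodes]
    return train_idx, test_idx


# Five made-up episodes with five windows each.
episode_ids = [f"ep{k}" for k in range(5) for _ in range(5)]
train_idx, test_idx = split_by_episode(episode_ids)
train_episodes = {episode_ids[i] for i in train_idx}
held_out_episodes = {episode_ids[i] for i in test_idx}
```

With only the single public episode, a helper like this raises an error instead of silently producing a leaky split.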

The next useful upgrade is not a bigger classifier. It is a better split and more episodes.