Robotics
PyTorch
Cosmos
xperience10m_task_baseline_suite
embodied-ai
multimodal
xperience-10m
baseline
evaluation
qwen3-omni
Unified 20-Task Suite
The public Xperience-10M sample exposes one unified set of 20 tasks. All 20 tasks share the same windowing, split, feature, baseline, and leakage-control contract.
Historical artifact paths containing tier2_task_suite are kept for stable
links, but they should be read as provenance directories inside the unified task suite, not
as a separate benchmark tier.
Shared Setup
- Episode scope: 1 public sample episode.
- Frames/windows: 5,821 frames and 1,161 aligned windows.
- Windowing: 20 frames per window, stride 5 frames.
- Feature vector: 8,546 dimensions from the shared feature manifest.
- Split: chronological 70/30 train/test by time within the sample episode.
- Baselines: minimal interpretable heads and compact neural MLP heads.
- Raw data: MP4/HDF5/RRD files are not redistributed.
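The windowing and split figures above can be sanity-checked with a few lines of Python. This is only a minimal sketch: the helper names `window_starts` and `chronological_split` are hypothetical, and it assumes windows start at frame 0, that only full windows are kept, and that the 70/30 cut is taken over the window index. The suite's actual split may be computed on timestamps instead, so the resulting train/test sizes should not be read as the official ones.

```python
def window_starts(num_frames: int, window: int = 20, stride: int = 5) -> list[int]:
    """Start frame of every full window that fits in the episode."""
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, stride))


def chronological_split(num_windows: int, train_fraction: float = 0.7) -> tuple[range, range]:
    """Earliest windows go to train, the rest to test; nothing is shuffled."""
    cut = int(num_windows * train_fraction)  # assumption: cut on window index
    return range(0, cut), range(cut, num_windows)


starts = window_starts(5821)
print(len(starts))  # 1161, matching the aligned-window count listed above

train_idx, test_idx = chronological_split(len(starts))
# Every train window starts before every test window.
print(starts[train_idx[-1]] < starts[test_idx[0]])
```

With a 20-frame window and a 5-frame stride, neighbouring windows share 15 frames, so windows on either side of the cut overlap in raw frames. How the suite's leakage controls treat that boundary is not specified in this section.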
Task Table
| # | Task | Artifact id | Input -> output | Primary metric | Minimal | Neural |
|---|---|---|---|---|---|---|
| 1 | Action Recognition | timeline_action | 20-frame multimodal window -> current action class | macro-F1 (higher better) | 0.0500 | 0.0148 |
| 2 | Procedure Step Recognition | timeline_subtask | 20-frame multimodal window -> current procedure step | macro-F1 (higher better) | 0.0506 | 0.0281 |
| 3 | Action Boundary Detection | transition_detection | current window with boundary target -> boundary or steady | macro-F1 (higher better) | 0.6118 | 0.5862 |
| 4 | Next-Action Prediction | next_action | current window at time t -> action at t+20 frames | macro-F1 (higher better) | 0.0593 | 0.0419 |
| 5 | Hand Trajectory Forecasting | hand_trajectory_forecast | current multimodal window -> future hand-joint trajectory | MPJPE (lower better) | 0.8647 | 0.1079 |
| 6 | Contact State Prediction | contact_prediction | non-contact, non-caption features -> contact or no contact | macro-F1 (higher better) | 1.0000 | 1.0000 |
| 7 | Object Relevance Prediction | object_relevance | non-caption multimodal features -> relevant object set | micro-F1 (higher better) | 0.1803 | 0.1679 |
| 8 | Language Grounding | caption_grounding | text-like query and candidate windows -> ranked matching moments | MRR (higher better) | 0.0160 | 0.0168 |
| 9 | Cross-Modal Retrieval | cross_modal_retrieval | motion/IMU/pose query; depth/video candidates -> ranked visual windows | MRR (higher better) | 0.2693 | 0.1300 |
| 10 | Cross-Modal Reconstruction | modality_reconstruction | motion, IMU, and camera/pose features -> reconstructed depth/video vector | R2 (higher better) | -0.0153 | -0.0102 |
| 11 | Temporal Order Verification | temporal_order | two adjacent windows plus difference vector -> correct or reversed | F1 (higher better) | 0.5400 | 0.8520 |
| 12 | Multimodal Synchronization Detection | misalignment_detection | motion-side and visual/depth-side feature groups -> aligned or shifted | F1 (higher better) | 0.5052 | 0.7153 |
| 13 | Long-Horizon Next-Action Forecasting | long_horizon_next_action | Current 20-frame non-caption multimodal window. -> Action label five seconds later. | macro-F1 (higher better) | 0.0750 | 0.0655 |
| 14 | Long-Horizon Next-Subtask Forecasting | next_subtask_forecast | Current 20-frame non-caption multimodal window. -> Procedure subtask label five seconds later. | macro-F1 (higher better) | 0.0455 | 0.0507 |
| 15 | Interaction Text Prediction | interaction_text_prediction | Current 20-frame sensor window with caption-text features removed. -> Raw annotation interaction phrase for the same window. | macro-F1 (higher better) | 0.0444 | 0.0381 |
| 16 | Action-Object Relation Prediction | action_object_relation | Current 20-frame sensor window with caption-text features removed. -> Joint action plus active object-set relation. | macro-F1 (higher better) | 0.0000 | 0.0000 |
| 17 | Future Object-Set Forecasting | object_set_forecast | Current 20-frame sensor window with caption-text features removed. -> Object set active five seconds later. | micro-F1 (higher better) | 0.1694 | 0.1972 |
| 18 | IMU-to-Hand Pose Reconstruction | imu_to_hand_pose | Current IMU acceleration/gyroscope feature block only. -> Current left/right hand joint feature blocks. | MAE (lower better) | 0.0420 | 0.0426 |
| 19 | Camera-View Synchronization Retrieval | camera_view_sync_retrieval | Fisheye camera-1 feature query projected into fisheye camera-3 feature space. -> The synchronized held-out camera-3 window. | MRR (higher better) | 0.4943 | 0.2409 |
| 20 | Time-to-Next-Transition Regression | time_to_transition | Current 20-frame non-caption multimodal window. -> Frames until the next action-label boundary, capped at 200 frames. | MAE frames (lower better) | 10.5374 | 10.5545 |
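
For readers unfamiliar with the primary metrics named in the table, the sketch below shows the textbook definitions of macro-F1, micro-F1 over label sets, and MRR in plain Python. It is illustrative only: the function names are hypothetical, the evaluation code that produced the scores above is not shown in this section, and conventions such as which classes enter the macro average (here, every label seen in either the targets or the predictions) are assumptions that may differ from the suite's.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (single-label classification)."""
    labels = set(y_true) | set(y_pred)  # assumption: union of seen labels
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def micro_f1_sets(true_sets, pred_sets):
    """F1 from counts pooled over all windows (multi-label object sets)."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_reciprocal_rank(rankings, targets):
    """Mean of 1/rank of the correct item; a missing target contributes 0."""
    total = 0.0
    for ranked, target in zip(rankings, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(targets)


# Toy inputs, not data from the suite.
toy_macro = macro_f1(["a", "a", "b", "c"], ["a", "b", "b", "b"])
toy_micro = micro_f1_sets([{"cup", "spoon"}, {"cup"}], [{"cup"}, {"cup", "bowl"}])
toy_mrr = mean_reciprocal_rank([[3, 1, 2], [1, 2, 3]], [1, 1])
print(round(toy_macro, 4), round(toy_micro, 4), toy_mrr)
```

Macro-F1 weights every class equally regardless of how often it occurs, whereas micro-F1 pools counts across all windows, so the two are not directly comparable across rows of the table.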
Machine-Readable Copy
The JSON mirror is docs/data/task_suite_20.json.