# Audio Ablation and Raw-Audio Upgrade

This report is generated from committed task-suite artifacts plus the local public-sample MP4 audio stream.
It measures whether audio changes each single-episode task under the same chronological split.
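
A chronological split keeps the earlier windows of an episode for training and the later ones for evaluation, so that no future frames leak into the fit. The sketch below is only an illustration of that idea, not the suite's code: the `start_s` field, the `chronological_split` helper, and the 70/30 ratio are assumptions, since the report does not say how the split is parameterised.

```python
def chronological_split(windows, train_frac=0.7):
    """Split time-ordered windows into (train, test) without shuffling.

    `windows` is a list of dicts with a hypothetical `start_s` timestamp.
    `train_frac` is an assumed ratio, not a value taken from the report.
    """
    ordered = sorted(windows, key=lambda w: w["start_s"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


# Toy episode: ten one-second windows with placeholder labels.
windows = [{"start_s": float(i), "label": i % 3} for i in range(10)]
train, test = chronological_split(windows)

# Every training window ends up strictly earlier than every test window.
latest_train = max(w["start_s"] for w in train)
earliest_test = min(w["start_s"] for w in test)
```

Reusing the same `train` and `test` index sets for the with-audio, no-audio, and raw-audio variants is what makes their metrics directly comparable.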

## Raw Audio Feature

- Source: `local_public_sample/fisheye_cam0.mp4`
- Has audio: `True`
- Sample rate: `16000` Hz
- Window feature dim: `588`
- Feature: per-window log-mel statistics computed from the STFT of the raw waveform, plus delta statistics and waveform-envelope statistics.
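
To make the feature description concrete, here is a minimal NumPy sketch of one way to build such a vector: log-mel frames from an STFT, first-difference deltas, and simple envelope statistics, each pooled over the window. All function names, the FFT size, hop, number of mel bands, and the choice of pooled statistics are assumptions made for illustration. The result has 100 dimensions, not the 588 reported above, so it does not reproduce the suite's exact recipe.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)  # FFT bin frequencies in Hz
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def log_mel_frames(wave, sr=16000, n_fft=400, hop=160, n_mels=24):
    """Hann-windowed STFT power -> mel bands -> log, shape (n_frames, n_mels)."""
    wave = np.asarray(wave, dtype=float)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)


def window_features(wave, sr=16000):
    """Pool log-mel, delta, and envelope statistics into one vector."""
    wave = np.asarray(wave, dtype=float)
    logmel = log_mel_frames(wave, sr)
    delta = np.diff(logmel, axis=0)  # simple frame-to-frame difference
    envelope = np.abs(wave)
    env_stats = np.array(
        [envelope.mean(), envelope.std(), envelope.max(), np.sqrt(np.mean(wave ** 2))]
    )
    return np.concatenate(
        [logmel.mean(axis=0), logmel.std(axis=0), delta.mean(axis=0), delta.std(axis=0), env_stats]
    )


# One second of a synthetic 440 Hz tone at the report's 16 kHz sample rate.
sr = 16000
t = np.arange(sr) / sr
feat = window_features(0.5 * np.sin(2 * np.pi * 440.0 * t), sr)
print(feat.shape)  # (100,) = 4 * 24 mel statistics + 4 envelope statistics
```

In a real pipeline the audio track would first be decoded from the MP4 and resampled to 16 kHz; that step is omitted here.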

## Task Deltas

| Task | Metric | Current audio | No audio | Current audio delta | Raw replaces audio | Raw replacement delta |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| Current Action Recognition | macro_f1 | 0.0091 | 0.0088 | 0.0003 | 0.0013 | -0.0077 |
| Current Subtask Recognition | macro_f1 | 0.0113 | 0.0112 | 0.0001 | 0.0008 | -0.0104 |
| Action Transition Detection | macro_f1 | 0.4621 | 0.4687 | -0.0066 | 0.4792 | 0.0171 |
| Next-Action Prediction | macro_f1 | 0.0106 | 0.0107 | -0.0001 | 0.0060 | -0.0046 |
| Future Hand Motion Forecasting | mae | 4.4664 | 4.3038 | -0.1626 | 4.3059 | 0.1605 |
| Contact State Prediction | macro_f1 | 1.0000 | 1.0000 | 0.0000 | 1.0000 | 0.0000 |
| Relevant Object Prediction | micro_f1 | 0.1581 | 0.1479 | 0.0102 | 0.1787 | 0.0206 |
| Language-to-Time Grounding | mrr | 0.0321 | 0.0272 | 0.0049 | 0.0248 | -0.0072 |
| Cross-Modal Window Retrieval | mrr | 0.3751 | 0.3892 | -0.0141 | 0.3275 | -0.0476 |
| Sensor-to-Visual Reconstruction | mae | 9.7942 | 10.4467 | 0.6524 | 8.8307 | 0.9635 |
| Temporal Order Verification | macro_f1 | 0.5172 | 0.4943 | 0.0230 | 0.5302 | 0.0129 |
| Cross-Modal Misalignment Detection | macro_f1 | 0.4173 | 0.4226 | -0.0052 | 0.4438 | 0.0264 |

## Aggregate

- Mean current-audio delta: `0.041849794979543296`
- Tasks where current handcrafted audio improves the primary metric: `6` of 12
- Mean raw-replacement delta vs current handcrafted audio: `0.09362598132150173`
- Tasks where raw log-mel replacement improves over current handcrafted audio: `6` of 12

A positive delta always means an improvement on the task's primary metric. For MAE tasks, where lower is better, the sign is flipped so that a reduction in MAE appears as a positive delta.
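
The sign convention and the aggregate figures can be checked directly from the table. The sketch below recomputes both delta columns from the four-decimal table entries; the `improvement` helper and the row layout are illustrative names, not part of the suite. Because the inputs are rounded, the means agree with the reported values only to about four decimal places.

```python
from statistics import mean

# Metrics where a lower value is better; every other metric is a score.
LOWER_IS_BETTER = {"mae"}


def improvement(metric, baseline, candidate):
    """Signed change from baseline to candidate; positive always means better."""
    diff = candidate - baseline
    return -diff if metric in LOWER_IS_BETTER else diff


# (task, metric, current audio, no audio, raw replaces audio), copied from the table.
ROWS = [
    ("Current Action Recognition", "macro_f1", 0.0091, 0.0088, 0.0013),
    ("Current Subtask Recognition", "macro_f1", 0.0113, 0.0112, 0.0008),
    ("Action Transition Detection", "macro_f1", 0.4621, 0.4687, 0.4792),
    ("Next-Action Prediction", "macro_f1", 0.0106, 0.0107, 0.0060),
    ("Future Hand Motion Forecasting", "mae", 4.4664, 4.3038, 4.3059),
    ("Contact State Prediction", "macro_f1", 1.0000, 1.0000, 1.0000),
    ("Relevant Object Prediction", "micro_f1", 0.1581, 0.1479, 0.1787),
    ("Language-to-Time Grounding", "mrr", 0.0321, 0.0272, 0.0248),
    ("Cross-Modal Window Retrieval", "mrr", 0.3751, 0.3892, 0.3275),
    ("Sensor-to-Visual Reconstruction", "mae", 9.7942, 10.4467, 8.8307),
    ("Temporal Order Verification", "macro_f1", 0.5172, 0.4943, 0.5302),
    ("Cross-Modal Misalignment Detection", "macro_f1", 0.4173, 0.4226, 0.4438),
]

# Current audio is compared against the no-audio baseline.
current_deltas = [improvement(m, no_audio, cur) for _, m, cur, no_audio, _ in ROWS]
# Raw log-mel audio is compared against the current handcrafted audio.
raw_deltas = [improvement(m, cur, raw) for _, m, cur, _, raw in ROWS]

mean_current = mean(current_deltas)
mean_raw = mean(raw_deltas)
wins_current = sum(d > 0 for d in current_deltas)
wins_raw = sum(d > 0 for d in raw_deltas)

print(f"current audio: mean delta {mean_current:.3f}, improved tasks {wins_current}")
print(f"raw log-mel:   mean delta {mean_raw:.3f}, improved tasks {wins_raw}")
```

Both means are unweighted averages over deltas on different scales (F1, MRR, and MAE), and in this table the two MAE rows carry by far the largest absolute deltas.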