# Research Takeaways
This generated note summarizes what the current public Xperience-10M sample
pipeline actually shows. It is built from committed metric artifacts, not
from hand-edited score text.
## Scope
- validated episodes: 1
- frames: 5,821
- aligned windows: 1,161
- current feature dimension: 8,546
- raw Xperience-10M data is not redistributed
- audio from the sample MP4 stream is represented in the current feature vector
## Takeaways
### One episode can become a real benchmark contract
The public sample is converted into 5,821 frames, 1,161 aligned 20-frame windows, and an 8,546-dimensional feature contract.
| Metric | Value |
| --- | ---: |
| `frames` | 5,821 |
| `windows` | 1,161 |
| `feature_dim` | 8,546 |
Source: `docs/data/summary_metrics.json`.
Current scope: This benchmark defines the task contract; cross-episode generalization is evaluated in the multi-episode stage.
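The frame and window counts above can be related with a small sliding-window sketch. The 5,821 frames and the 20-frame window length come from this note; the stride is not stated here, so `stride=5` is an assumption, chosen only because it reproduces the reported 1,161 windows under this formula. The function names are illustrative and not taken from the repository.

```python
def count_windows(num_frames: int, window: int, stride: int) -> int:
    """Number of full windows of length `window` taken every `stride` frames."""
    if num_frames < window:
        return 0
    return (num_frames - window) // stride + 1


def window_bounds(num_frames: int, window: int, stride: int):
    """Yield (start, end) frame indices, end exclusive, for each full window."""
    for start in range(0, num_frames - window + 1, stride):
        yield start, start + window


# Frames and window length are from the note; stride=5 is an assumption.
n_windows = count_windows(5821, window=20, stride=5)
print(n_windows)  # 1161
```

If the pipeline pads or drops the trailing partial window differently, the count would change, so treat this as a consistency check rather than a description of the actual windowing code.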
### Chronological splits expose action-class shift
Earlier all-feature action classifiers reach a high macro-F1 (0.9829) on their local split, but the core chronological action/subtask heads score far lower (about 0.05 macro-F1) because the later held-out windows include labels that never appear in the earlier training windows.
| Metric | Value |
| --- | ---: |
| `all_feature_action_macro_f1` | 0.9829 |
| `suite_action_macro_f1` | 0.0500 |
| `suite_subtask_macro_f1` | 0.0506 |
| `unseen_action_test_classes` | 4 |
Source: `results/episode_task_suite/summary_report.json`.
Current scope: This split is useful for studying label shift; broad action-recognition conclusions need held-out episodes.
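A minimal sketch of why a chronological split depresses macro-F1: classes that first appear late in the episode contribute an F1 of zero, because no head trained on earlier windows can predict them. The split fraction, the label names, and the helper functions are hypothetical; the repository's actual split logic is not shown in this note.

```python
def chronological_split(labels, train_fraction=0.8):
    """Split a time-ordered label sequence into earlier (train) and later (test)."""
    cut = int(len(labels) * train_fraction)
    return labels[:cut], labels[cut:]


def unseen_test_classes(train_labels, test_labels):
    """Classes that occur in the test portion but never in the train portion."""
    return sorted(set(test_labels) - set(train_labels))


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical time-ordered labels: "pour" only appears at the end.
labels = ["reach"] * 5 + ["grasp"] * 3 + ["grasp", "pour"]
train, test = chronological_split(labels, train_fraction=0.8)
print(unseen_test_classes(train, test))  # ['pour']

# A head that only knows train classes cannot emit "pour".
predictions = ["grasp", "grasp"]
print(round(macro_f1(test, predictions), 4))  # 0.3333
```

In the toy case the unseen class scores zero and drags the average down even though the seen class is predicted correctly.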
### Small neural heads help dynamic and temporal probes
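Relative to the minimal baselines, the MLP heads substantially improve hand trajectory forecasting (MPJPE falls from 0.8647 to 0.1079), temporal-order verification (F1 rises from 0.5400 to 0.8520), and motion/visual synchronization (misalignment F1 rises from 0.5052 to 0.7153).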
| Metric | Value |
| --- | ---: |
| `hand_mpjpe_minimal` | 0.8647 |
| `hand_mpjpe_neural` | 0.1079 |
| `hand_mpjpe_relative_improvement` | 0.8753 |
| `temporal_order_f1_minimal` | 0.5400 |
| `temporal_order_f1_neural` | 0.8520 |
| `misalignment_f1_minimal` | 0.5052 |
| `misalignment_f1_neural` | 0.7153 |
Source: `results/episode_task_suite/neural_mlp/*/metrics.json`.
Current scope: These gains are measured within one episode and are candidates for held-out-episode testing.
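For readers unfamiliar with the hand metric, the sketch below computes MPJPE (mean per-joint position error) as the mean Euclidean distance between predicted and target joints, and a relative improvement as the fractional reduction of a lower-is-better metric. Both definitions are assumptions about how the reported numbers were derived; the joint layout and function names are illustrative.

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance over every joint of every sample.

    `pred` and `target` are lists of samples; each sample is a list of
    joint coordinates such as (x, y, z).
    """
    total, count = 0.0, 0
    for pred_joints, target_joints in zip(pred, target):
        for p, t in zip(pred_joints, target_joints):
            total += math.dist(p, t)
            count += 1
    return total / count


def relative_improvement(baseline, improved):
    """Fractional reduction of a lower-is-better metric."""
    return (baseline - improved) / baseline


# Toy example: one sample, two joints, errors of 0 and 5.
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
target = [[(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]]
print(mpjpe(pred, target))  # 2.5

# Applied to the rounded table values (assumed definition).
print(relative_improvement(0.8647, 0.1079))
```

With the rounded table values this gives roughly 0.875, in line with the reported `hand_mpjpe_relative_improvement` up to rounding of the inputs.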
### Retrieval and reconstruction remain the harder multimodal problems
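Ridge/cosine retrieval remains stronger than the neural projection on this sample (MRR 0.2693 vs. 0.1300), and cross-modal reconstruction still has a slightly negative R² for both heads. Under the usual definition of R², a negative value means the predictions do not beat a constant mean-of-targets baseline.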
| Metric | Value |
| --- | ---: |
| `retrieval_mrr_minimal` | 0.2693 |
| `retrieval_mrr_neural` | 0.1300 |
| `retrieval_top5_minimal` | 0.3678 |
| `reconstruction_r2_minimal` | -0.0153 |
| `reconstruction_r2_neural` | -0.0102 |
Source: `results/episode_task_suite/cross_modal_retrieval/metrics.json`.
Current scope: The current reconstruction task predicts feature vectors; depth, mesh, NeRF, and Gaussian-splatting outputs are future task variants.
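The retrieval and reconstruction metrics in this table can be sketched as follows, assuming paired retrieval (query `i` should retrieve gallery item `i`) ranked by cosine similarity and the standard coefficient of determination. The pairing convention and all names are assumptions for illustration; the repository's evaluation code may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ranks_of_matches(queries, gallery):
    """1-based rank of gallery[i] when retrieving with queries[i]."""
    ranks = []
    for i, query in enumerate(queries):
        sims = [cosine(query, item) for item in gallery]
        # Rank is one plus the number of items strictly more similar than the match.
        ranks.append(1 + sum(1 for s in sims if s > sims[i]))
    return ranks


def mrr(ranks):
    """Mean reciprocal rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def top_k(ranks, k=5):
    """Fraction of queries whose true match is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def r2_score(y_true, y_pred):
    """Coefficient of determination; negative when worse than the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data: the second query's true match is outranked by the first gallery item.
queries = [(1.0, 0.0), (1.0, 0.0)]
gallery = [(1.0, 0.1), (1.0, 0.2)]
ranks = ranks_of_matches(queries, gallery)
print(ranks, mrr(ranks), top_k(ranks, k=1))  # [1, 2] 0.75 0.5

print(r2_score([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))  # -3.0
```

The last line shows how a reconstruction that is anti-correlated with the target yields a negative R², while predicting the target mean everywhere yields exactly zero.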
### Audio helps some tasks and hurts others on the public sample
Audio improves the primary metric on 6 walkthrough-backed task contracts, while raw log-mel replacement improves over the current handcrafted block on 6 of those contracts. The largest current-audio gain appears in feature reconstruction, not in action classification.
| Metric | Value |
| --- | ---: |
| `tasks_where_current_audio_improves` | 6 |
| `mean_current_audio_delta` | 0.0418 |
| `tasks_where_raw_replacement_improves` | 6 |
| `mean_raw_replacement_delta_vs_current` | 0.0936 |
| `reconstruction_current_audio_delta` | 0.6524 |
| `object_relevance_current_audio_delta` | 0.0102 |
Source: `results/audio_ablation/audio_ablation_summary.json`.
Current scope: This is a single-episode ablation over fixed ridge heads. It validates that audio is wired into the task suite and shows where it changes metrics; it does not prove cross-episode audio generalization.
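A hedged sketch of how per-task ablation deltas like those above could be aggregated. The row schema (`with_audio`, `without_audio`, `higher_is_better`), the task names, and the numbers are all hypothetical; they are not taken from `audio_ablation_summary.json`.

```python
def signed_delta(with_audio, without_audio, higher_is_better=True):
    """Positive means audio helped, regardless of the metric's direction."""
    delta = with_audio - without_audio
    return delta if higher_is_better else -delta


def summarize(rows):
    """Count improved tasks and average the direction-corrected deltas."""
    deltas = {
        row["task"]: signed_delta(
            row["with_audio"], row["without_audio"], row["higher_is_better"]
        )
        for row in rows
    }
    return {
        "tasks_improved": sum(1 for d in deltas.values() if d > 0),
        "mean_delta": sum(deltas.values()) / len(deltas),
        "deltas": deltas,
    }


# Hypothetical rows for illustration only.
rows = [
    {"task": "task_a", "with_audio": 0.60, "without_audio": 0.50, "higher_is_better": True},
    {"task": "task_b", "with_audio": 0.40, "without_audio": 0.45, "higher_is_better": True},
    {"task": "task_c", "with_audio": 0.20, "without_audio": 0.30, "higher_is_better": False},
]
summary = summarize(rows)
print(summary["tasks_improved"], round(summary["mean_delta"], 4))  # 2 0.05
```

Averaging deltas across tasks mixes metrics with different scales, so a mean delta of this kind is best read alongside the per-task values.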
### The next scientific unit is held-out episodes, not more adjacent windows
The selected Qwen3-Omni path now has a verified two-epoch held-out diagnostic result. It validates the cross-episode train/validation/eval loop end to end and meets the strict-JSON target, while weak action/subtask metrics remain the next modeling problem.
| Metric | Value |
| --- | ---: |
| `selected_episodes` | 128 |
| `held_out_test_windows` | n/a |
| `json_validity_rate` | n/a |
| `action_macro_f1` | n/a |
Source: `docs/data/omni_finetune_verified_result.json`.
Current scope: The selected-episode Qwen3-Omni diagnostic pilot is verified on the 96/16/16 split of the 128 selected episodes and now meets the 98% target for JSON validity. Action/subtask quality remains weak, so the current results are diagnostic baselines, not strong model-quality claims.
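One way a strict JSON validity rate could be measured is sketched below: a model output counts as valid only if the whole string parses as a JSON object and contains the expected keys. The required keys (`action`, `subtask`) and the rejection of `NaN`/`Infinity` are assumptions; only the 98% target comes from this note.

```python
import json

REQUIRED_KEYS = ("action", "subtask")  # hypothetical schema
TARGET_RATE = 0.98  # target stated in the note


def _reject_constant(name):
    """Reject NaN/Infinity, which json.loads accepts by default."""
    raise ValueError(f"non-standard JSON constant: {name}")


def is_strict_json(text, required_keys=REQUIRED_KEYS):
    """True if `text` is exactly one JSON object with all required keys."""
    try:
        obj = json.loads(text, parse_constant=_reject_constant)
    except (ValueError, TypeError):
        return False
    return isinstance(obj, dict) and all(key in obj for key in required_keys)


def json_validity_rate(outputs):
    """Fraction of outputs that pass the strict check."""
    if not outputs:
        return 0.0
    return sum(1 for text in outputs if is_strict_json(text)) / len(outputs)


# Hypothetical model outputs.
outputs = [
    '{"action": "pour", "subtask": "tilt cup"}',
    'Sure! {"action": "pour", "subtask": "tilt cup"}',  # extra prose fails
    '{"action": "pour"}',  # missing key fails
    '{"action": "grasp", "subtask": "close hand"}',
]
rate = json_validity_rate(outputs)
print(rate, rate >= TARGET_RATE)  # 0.5 False
```

Because `json.loads` must consume the entire string, outputs wrapped in prose or code fences fail the check, which is the behavior a strict target implies.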
## How To Read These Results
- High single-episode scores are useful pipeline checks for the current task contracts.
- Low chronological action/subtask scores are informative because they expose later-label shift.
- Neural gains on trajectory/order/alignment make those tasks good candidates for the next fine-tuning stage.
- Audio ablation is task-specific: audio representation choices help some probes and hurt others.
- Retrieval and reconstruction remain the main multimodal representation challenges.
- The next credible model-quality result needs held-out episodes.