Robotics
PyTorch
Cosmos
xperience10m_task_baseline_suite
embodied-ai
multimodal
xperience-10m
baseline
evaluation
qwen3-omni
Instructions to use cy0307/ropedia-xperience-10m-task-baselines with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Cosmos
How to use cy0307/ropedia-xperience-10m-task-baselines with Cosmos:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- Notebooks
- Google Colab
- Kaggle
| # Foundation Model Plan | |
| This plan extends the current Xperience-10M scale-up path beyond the prepared | |
| Qwen3-Omni LoRA pilot. It separates immediate trainable work from later | |
| world-model and robot-policy branches, so the project can choose a backbone | |
| without mixing different research goals. | |
| Current status: this remains the backbone-selection plan, but the repo now has | |
| verified held-out multi-episode foundation-model diagnostics: Qwen3-Omni LoRA | |
| for structured JSON tasks, Cosmos3-Nano for future-window compatibility, | |
| Cosmos3-Super Reasoner as a base-weight JSON-task evaluation, and Cosmos3-Super | |
| Forward-Dynamics LoRA as the first fine-tuned Super adapter branch. | |
| ## Three Pipeline Tracks | |
| The project can pursue the three presentation directions, but they should be | |
| treated as pipeline tracks with different maturity levels: | |
| | Track | Role | Current boundary | Next gate | | |
| | --- | --- | --- | --- | | |
| | Spatial intelligence models | Recover and reason over scene/object state from multiview RGB, depth, pose, calibration, object cues, and language. | Pipeline and evaluation contract, not a completed spatial model claim. | Raw depth/pose artifacts, spatial-memory exporter, and held-out spatial QA/object-memory metrics. | | |
| | Human-video world models | Predict future action, subtask, object set, contact transition, camera motion, or latent visual state from observed interaction windows. | Partially evidenced by future-task probes and Cosmos-style branch artifacts. | Stronger future-state metrics, qualitative future examples, and held-out episode breakdowns. | | |
| | Vision-language-action models | Convert egocentric video, language, hand/body motion, contacts, and objects into action chunks or policy-compatible targets. | Feasible but gated by action-space conversion. | Traceable action tokens, normalization, retargeting metadata, and policy/VLA held-out metrics. | | |
| The full pipeline-level contract is in | |
| [`THREE_FOUNDATION_PIPELINES.md`](THREE_FOUNDATION_PIPELINES.md), with a | |
| machine-readable copy at | |
| [`docs/data/three_foundation_pipelines.json`](docs/data/three_foundation_pipelines.json). | |
| ## Backbone Decision | |
| | Priority | Model family | Best role for this project | Why it fits Xperience-10M | Current decision | | |
| | --- | --- | --- | --- | --- | | |
| | 1 | Qwen3-Omni | Multimodal instruction model and JSON task predictor | Accepts video/audio/language directly; depth, pose, mocap, and IMU can enter through the existing sensor bridge | Keep as the first selected-episode LoRA pilot | | |
| | 2 | Cosmos 3 | Embodied world model, action generation, and synthetic future prediction | Designed for physical-world video generation, action-conditioned world modeling, and robot/world simulation style objectives | Add as the first world-model branch after the data gate | | |
| | 3 | NVIDIA GR00T | Humanoid/action-policy foundation model | Xperience-10M mocap, hand motion, contacts, and egocentric interaction can support retargeting and action-understanding probes | Track as a humanoid policy branch, not the first LoRA pilot | | |
| | 4 | OpenVLA / OpenVLA-OFT | Open vision-language-action policy baseline | Useful when windows are converted into visual observation plus action-token targets | Use after action-space design is explicit | | |
| | 5 | openpi pi0/pi0.5 | Open robot policy and action expert baseline | Useful for action chunking, policy fine-tuning, and embodiment transfer experiments | Candidate for policy branch once action labels are retargeted | | |
| | 6 | Gemini Robotics | Closed/API embodied reasoning reference | Strong candidate for qualitative reasoning and task interpretation, but not a local fine-tune target | Use only as an external comparison or annotation assistant | | |
| | 7 | Octo / SmolVLA-style lightweight policies | Smaller reproducible robot-policy baselines | Good for cheaper action-policy experiments, but less directly omni-modal | Optional baseline branch after selected-episode data preparation | | |
| | Future | Xperience Embodied Foundation Model | Xperience-native domain model pretrained from scratch on full-corpus embodied experience | Would learn a shared temporal representation across video, audio, depth, pose, mocap, IMU, and language | Long-term goal after smaller pilots prove value and full-corpus storage/compute are available | | |
| ## Why Qwen3-Omni Still Goes First | |
| The immediate pilot is about proving the full data path: | |
| - prepared multi-episode Xperience-10M data, | |
| - episode-level train/test separation, | |
| - window-level supervised examples, | |
| - multimodal prompt construction, | |
| - sensor bridge for depth, pose, mocap, and IMU, | |
| - LoRA training, | |
| - held-out predictions and metrics. | |
| Qwen3-Omni is the most direct first target because the existing scripts already | |
| prepare video/audio/language prompts and adapter inputs. It is also suitable for | |
| the unified 20 current task contracts, which mostly produce labels, structured JSON, or | |
| short task answers. | |
| The executable Qwen branch and future branch contracts are now represented as | |
| config files under `configs/omni_backbones/`. Validate them with: | |
| ```bash | |
| python scripts/omni/backbone_registry.py --validate --json | |
| ``` | |
| The shared extension rules are in | |
| [`OMNI_MODEL_EXTENSION_CONTRACT.md`](OMNI_MODEL_EXTENSION_CONTRACT.md). A new | |
| foundation branch should add a config first, then implement the exporter, | |
| trainer, evaluator, and launcher required by that config. | |
| ## Long-Term Native Pretraining Goal | |
| Qwen3-Omni, Cosmos 3, GR00T, OpenVLA, and openpi are backbone choices for the | |
| next experiments. The longer-term goal is different: train an | |
| **Xperience Embodied Foundation Model** that is native to the Xperience-10M | |
| modality structure. | |
| That model would not start as a general internet-scale omni model. It would be | |
| a domain model over synchronized embodied experience: multi-view egocentric | |
| video, audio, depth, pose/SLAM, hand and body mocap, IMU, calibration, and | |
| language annotations. Its pretraining should combine masked multimodal | |
| modeling, cross-modal contrastive alignment, future-state prediction, | |
| ego-motion and hand-motion forecasting, action/procedure prediction, language | |
| grounding, contact/affordance prediction, and optional policy-style targets | |
| after action conversion. | |
| This is not a current result in the repo. It becomes appropriate only after: | |
| - the selected multi-episode pipeline trains and evaluates cleanly, | |
| - scaling from 128 episodes to thousands of episodes shows measurable value, | |
| - raw-corpus storage and derived-shard capacity are available, | |
| - distributed training and checkpoint/restart infrastructure are reliable, | |
| - evaluation covers held-out episodes, sessions, activities, objects, and | |
| missing-modality robustness. | |
| The full plan is documented in | |
| [`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md). | |
| ## Why Cosmos 3 Should Be Added Next | |
| Cosmos 3 should not replace the Qwen3-Omni pilot. It should become the first | |
| world-model branch after the data gate. The reason is that the Xperience-10M | |
| modalities are unusually aligned with physical-world modeling: | |
| - video streams for visual state, | |
| - embedded audio for event cues, | |
| - depth and calibration for spatial structure, | |
| - pose/SLAM for camera motion, | |
| - hand/body mocap for embodied state, | |
| - IMU for inertial dynamics, | |
| - language annotations for task semantics. | |
| The practical Cosmos 3 branch should start with three targets: | |
| 1. **Future-window prediction:** condition on earlier video/sensor windows and | |
| predict future visual or latent state. | |
| 2. **Action-conditioned world modeling:** use mocap/action labels as controls | |
| and predict what changes in the scene. | |
| 3. **Synthetic data expansion:** generate or score candidate futures, then test | |
| whether synthetic windows improve downstream task heads. | |
| A Cosmos 3 branch is now represented by two public-safe verified packages: | |
| Cosmos3-Nano future-window compatibility and Cosmos3-Super forward-dynamics | |
| LoRA. The Super LoRA target is camera-pose-conditioned future vision velocity, | |
| so it should be analyzed as a world-model loss result rather than a JSON-task | |
| classifier. | |
| ## Policy-Model Branch | |
| OpenVLA, openpi, GR00T, Octo, and SmolVLA-style models should be treated as | |
| policy/action branches. They need a clear action target before training: | |
| - egocentric action class, | |
| - next subtask, | |
| - hand trajectory chunk, | |
| - contact state, | |
| - object-affordance target, | |
| - retargeted humanoid/body action, | |
| - or robot-compatible action tokens. | |
| The current public sample can prototype the data conversion, but policy quality | |
| requires multi-episode diversity. The first useful policy experiment should be a | |
| 64-128 episode run, not a one-sample demonstration. | |
| ## Evaluation Additions | |
| The foundation-model stage should add metrics beyond the current 20-task suite: | |
| | Evaluation target | Metric family | Applies to | | |
| | --- | --- | --- | | |
| | Structured task prediction | JSON validity, macro-F1, accuracy, micro-F1 | Qwen3-Omni, Gemini Robotics comparison | | |
| | Future state prediction | retrieval rank, temporal consistency, feature reconstruction, visual inspection | Cosmos 3 | | |
| | Action-conditioned dynamics | transition accuracy, contact accuracy, next-action accuracy | Cosmos 3, OpenVLA, openpi, GR00T | | |
| | Affordance and object interaction | object micro-F1, contact-object consistency, caption grounding | all branches | | |
| | Cross-episode generalization | held-out episode metrics, held-out session metrics, leakage checks | all trainable branches | | |
| ## Execution Order | |
| 1. Keep the selected 96/16/16 split as the comparison spine. | |
| 2. Treat the verified Qwen3-Omni LoRA package as the structured JSON baseline. | |
| 3. Treat Cosmos3-Nano compatibility and Cosmos3-Super Forward-Dynamics LoRA as separate world-model branches with different metrics. | |
| 4. Run a model-selection dry run on 3-8 episodes for any next backbone before scaling beyond the selected split. | |
| 5. Promote Cosmos 3 to larger world-model experiments if video/sensor | |
| preprocessing, storage, and loss metrics justify the extra cost. | |
| 6. Promote OpenVLA/openpi/GR00T only after action targets are explicit and | |
| retargeting artifacts are traceable. | |
| 7. Update public cards only when a branch has real manifests, predictions, | |
| metrics, and qualitative examples. | |
| 8. Start Xperience-native pretraining only after smaller scaling stages, | |
| full-corpus storage, multi-node compute, and held-out evaluation protocols | |
| are in place. | |
| ## Source Links | |
| - Qwen3-Omni: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct | |
| - NVIDIA Cosmos: https://www.nvidia.com/en-us/ai/cosmos/ | |
| - NVIDIA Isaac GR00T: https://developer.nvidia.com/isaac/gr00t | |
| - OpenVLA: https://openvla.github.io/ | |
| - openpi: https://github.com/Physical-Intelligence/openpi | |
| - Gemini Robotics: https://deepmind.google/discover/blog/gemini-robotics-brings-ai-into-the-physical-world/ | |
| - Octo: https://octo-models.github.io/ | |
| - LeRobot / SmolVLA: https://github.com/huggingface/lerobot | |
| - Xperience Embodied Foundation Model pretraining plan: | |
| `XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md` | |