| # Foundation Model Plan |
|
|
| This plan extends the current Xperience-10M scale-up path beyond the prepared |
| Qwen3-Omni LoRA pilot. It separates immediate trainable work from later |
| world-model and robot-policy branches, so the project can choose a backbone |
| without mixing different research goals. |
|
|
| Current status: this is a planning artifact. The public repo contains verified |
| single-episode task heads and setup-stage Qwen3-Omni scripts. No held-out |
| multi-episode foundation-model evaluation has been run yet. |
|
|
| ## Backbone Decision |
|
|
| | Priority | Model family | Best role for this project | Why it fits Xperience-10M | Current decision | |
| | --- | --- | --- | --- | --- | |
| | 1 | Qwen3-Omni | Multimodal instruction model and JSON task predictor | Accepts video/audio/language directly; depth, pose, mocap, and IMU can enter through the existing sensor bridge | Keep as the first selected-episode LoRA pilot | |
| | 2 | Cosmos 3 | Embodied world model, action generation, and synthetic future prediction | Designed for physical-world video generation, action-conditioned world modeling, and robot/world-simulation-style objectives | Add as the first world-model branch after the data gate | |
| | 3 | NVIDIA GR00T | Humanoid/action-policy foundation model | Xperience-10M mocap, hand motion, contacts, and egocentric interaction can support retargeting and action-understanding probes | Track as a humanoid policy branch, not the first LoRA pilot | |
| | 4 | OpenVLA / OpenVLA-OFT | Open vision-language-action policy baseline | Useful when windows are converted into visual observation plus action-token targets | Use after action-space design is explicit | |
| | 5 | openpi pi0/pi0.5 | Open robot policy and action expert baseline | Useful for action chunking, policy fine-tuning, and embodiment transfer experiments | Candidate for policy branch once action labels are retargeted | |
| | 6 | Gemini Robotics | Closed/API embodied reasoning reference | Strong candidate for qualitative reasoning and task interpretation, but not a local fine-tune target | Use only as an external comparison or annotation assistant | |
| | 7 | Octo / SmolVLA-style lightweight policies | Smaller reproducible robot-policy baselines | Good for cheaper action-policy experiments, but less directly omni-modal | Optional baseline branch after selected-episode data preparation | |
| | Future | Xperience Embodied Foundation Model | Xperience-native domain model pretrained from scratch on full-corpus embodied experience | Would learn a shared temporal representation across video, audio, depth, pose, mocap, IMU, and language | Long-term goal after smaller pilots prove value and full-corpus storage/compute are available | |
|
|
| ## Why Qwen3-Omni Still Goes First |
|
|
| The immediate pilot exists to prove the full data path end to end: |
|
|
| - prepared multi-episode Xperience-10M data, |
| - episode-level train/test separation, |
| - window-level supervised examples, |
| - multimodal prompt construction, |
| - sensor bridge for depth, pose, mocap, and IMU, |
| - LoRA training, |
| - held-out predictions and metrics. |
|
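Episode-level separation means that every window cut from one episode lands on the same side of the split. The sketch below is one way to do that deterministically. It assumes each window record is a dictionary with a hypothetical `episode_id` field; the repo's actual manifest schema may differ.

```python
import hashlib


def assign_split(episode_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map an episode ID to 'train' or 'test'."""
    digest = hashlib.sha256(episode_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and scale to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def split_windows(windows, test_fraction=0.2):
    """Group window records by split; all windows of one episode stay together."""
    splits = {"train": [], "test": []}
    for window in windows:
        side = assign_split(window["episode_id"], test_fraction)
        splits[side].append(window)
    return splits


def leaked_episodes(splits):
    """Return episode IDs that appear in both splits (should be empty)."""
    train_ids = {w["episode_id"] for w in splits["train"]}
    test_ids = {w["episode_id"] for w in splits["test"]}
    return train_ids & test_ids
```

Hashing the episode ID instead of shuffling keeps the assignment stable when new episodes are added later, so an episode does not move between splits as the corpus grows.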
|
| Qwen3-Omni is the most direct first target because the existing scripts already |
| prepare video/audio/language prompts and adapter inputs. It is also suitable for |
| the 12 current task contracts, which mostly produce labels, structured JSON, or |
| short task answers. |
|
|
| The executable Qwen branch and future branch contracts are now represented as |
| config files under `configs/omni_backbones/`. Validate them with: |
|
|
| ```bash |
| python scripts/omni/backbone_registry.py --validate --json |
| ``` |
|
|
| The shared extension rules are in |
| [`OMNI_MODEL_EXTENSION_CONTRACT.md`](OMNI_MODEL_EXTENSION_CONTRACT.md). A new |
| foundation branch should add a config first, then implement the exporter, |
| trainer, evaluator, and launcher required by that config. |
|
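To illustrate the config-first rule, the sketch below checks that a branch config names all four components before any of them is implemented. The `components` layout, the key names, and the example paths are assumptions made for illustration only; the real schema is defined by the files under `configs/omni_backbones/` and by `OMNI_MODEL_EXTENSION_CONTRACT.md`.

```python
REQUIRED_COMPONENTS = ("exporter", "trainer", "evaluator", "launcher")


def missing_components(config: dict) -> list:
    """Return required component keys that are absent or empty in a branch config."""
    components = config.get("components", {})
    return [name for name in REQUIRED_COMPONENTS if not components.get(name)]


# Hypothetical branch config; names and paths are placeholders, not repo files.
example_config = {
    "name": "example_world_model_branch",
    "components": {
        "exporter": "scripts/example/export.py",
        "trainer": "scripts/example/train.py",
        "evaluator": "",
    },
}

print(missing_components(example_config))  # ['evaluator', 'launcher']
```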
|
| ## Long-Term Native Pretraining Goal |
|
|
| Qwen3-Omni, Cosmos 3, GR00T, OpenVLA, and openpi are backbone choices for the |
| next experiments. The longer-term goal is different: train an |
| **Xperience Embodied Foundation Model** that is native to the Xperience-10M |
| modality structure. |
|
|
| That model would not start as a general internet-scale omni model. It would be |
| a domain model over synchronized embodied experience: multi-view egocentric |
| video, audio, depth, pose/SLAM, hand and body mocap, IMU, calibration, and |
| language annotations. Its pretraining should combine masked multimodal |
| modeling, cross-modal contrastive alignment, future-state prediction, |
| ego-motion and hand-motion forecasting, action/procedure prediction, language |
| grounding, contact/affordance prediction, and optional policy-style targets |
| after action conversion. |
|
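Of the objectives listed above, cross-modal contrastive alignment is the easiest to show in a few lines. The following is a minimal pure-Python sketch of a one-directional InfoNCE loss. It assumes time-aligned pairs of video and IMU embeddings as plain lists of floats; the modality pairing, function names, and temperature are illustrative choices, not the project's design.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(video_embs, imu_embs, temperature=0.1):
    """Mean InfoNCE loss; video_embs[i] and imu_embs[i] form the positive pair."""
    video = [l2_normalize(v) for v in video_embs]
    imu = [l2_normalize(v) for v in imu_embs]
    total = 0.0
    for i, anchor in enumerate(video):
        # Cosine similarity of this video window to every IMU window in the batch.
        logits = [dot(anchor, other) / temperature for other in imu]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(video)
```

The loss is small when each video window is most similar to its own IMU window and large when the pairing is shuffled. A symmetric version would average this with the IMU-to-video direction.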
|
| The repo contains no such model today. Native pretraining becomes appropriate only after: |
|
|
| - the selected multi-episode pipeline trains and evaluates cleanly, |
| - scaling from 128 episodes to thousands of episodes shows measurable value, |
| - raw-corpus storage and derived-shard capacity are available, |
| - distributed training and checkpoint/restart infrastructure are reliable, |
| - evaluation covers held-out episodes, sessions, activities, objects, and |
| missing-modality robustness. |
|
|
| The full plan is documented in |
| [`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md). |
|
|
| ## Why Cosmos 3 Should Be Added Next |
|
|
| Cosmos 3 should not replace the Qwen3-Omni pilot. It should become the first |
| world-model branch after the data gate, because the Xperience-10M modalities |
| map unusually well onto physical-world modeling: |
|
|
| - video streams for visual state, |
| - embedded audio for event cues, |
| - depth and calibration for spatial structure, |
| - pose/SLAM for camera motion, |
| - hand/body mocap for embodied state, |
| - IMU for inertial dynamics, |
| - language annotations for task semantics. |
|
|
| The practical Cosmos 3 branch should start with three targets: |
|
|
| 1. **Future-window prediction:** condition on earlier video/sensor windows and |
| predict future visual or latent state. |
| 2. **Action-conditioned world modeling:** use mocap/action labels as controls |
| and predict what changes in the scene. |
| 3. **Synthetic data expansion:** generate or score candidate futures, then test |
| whether synthetic windows improve downstream task heads. |
|
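For the first target, each episode has to be cut into (context, future) window pairs. The sketch below enumerates such pairs as frame-index ranges. The function name and parameters are hypothetical, and the window lengths used in the repo are not specified here.

```python
def future_window_pairs(num_frames, context_len, future_len, stride):
    """Return (context, future) index ranges over one episode's frame sequence."""
    if min(context_len, future_len, stride) <= 0:
        raise ValueError("context_len, future_len, and stride must be positive")
    pairs = []
    start = 0
    # Keep only pairs whose future window fits entirely inside the episode.
    while start + context_len + future_len <= num_frames:
        split = start + context_len
        context = range(start, split)
        future = range(split, split + future_len)
        pairs.append((context, future))
        start += stride
    return pairs


pairs = future_window_pairs(num_frames=10, context_len=4, future_len=2, stride=2)
print(len(pairs))            # 3
print(list(pairs[-1][1]))    # [8, 9]
```

Pairs should be generated within a single episode and only after the train/test split, so that a future window from a held-out episode never serves as context during training.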
|
| A Cosmos 3 branch is ready to publish only after committed manifests, generated |
| outputs, held-out metrics, and qualitative examples are available. |
|
|
| ## Policy-Model Branch |
|
|
| OpenVLA, openpi, GR00T, Octo, and SmolVLA-style models should be treated as |
| policy/action branches. They need a clear action target before training: |
|
|
| - egocentric action class, |
| - next subtask, |
| - hand trajectory chunk, |
| - contact state, |
| - object-affordance target, |
| - retargeted humanoid/body action, |
| - or robot-compatible action tokens. |
|
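As one concrete example of an explicit action target, a hand trajectory chunk can be expressed as fixed-length chunks of per-step position deltas. The sketch below assumes a hand track is a list of `(x, y, z)` positions sampled at a fixed rate. The representation and function name are illustrative and do not describe the repo's action space.

```python
def hand_delta_chunks(positions, chunk_len):
    """Turn (x, y, z) hand positions into fixed-length chunks of per-step deltas."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    deltas = [
        tuple(b - a for a, b in zip(prev, curr))
        for prev, curr in zip(positions, positions[1:])
    ]
    # Drop the trailing remainder so every chunk has exactly chunk_len steps.
    usable = len(deltas) - len(deltas) % chunk_len
    return [deltas[i:i + chunk_len] for i in range(0, usable, chunk_len)]


track = [(0, 0, 0), (1, 0, 0), (1, 2, 0), (1, 2, 3), (2, 2, 3), (2, 2, 3)]
print(hand_delta_chunks(track, chunk_len=2))
# [[(1, 0, 0), (0, 2, 0)], [(0, 0, 3), (1, 0, 0)]]
```

Deltas rather than absolute positions are a common choice for action chunks because they are less tied to where an episode starts, but whether that suits retargeting is a design decision for the policy branch.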
|
| The current public sample is enough to prototype the data conversion, but |
| policy quality depends on multi-episode diversity. The first useful policy |
| experiment should therefore cover 64-128 episodes, not a one-sample demonstration. |
|
|
| ## Evaluation Additions |
|
|
| The foundation-model stage should add metrics beyond the current 12-task suite: |
|
|
| | Evaluation target | Metric family | Applies to | |
| | --- | --- | --- | |
| | Structured task prediction | JSON validity, macro-F1, accuracy, micro-F1 | Qwen3-Omni, Gemini Robotics comparison | |
| | Future state prediction | retrieval rank, temporal consistency, feature reconstruction, visual inspection | Cosmos 3 | |
| | Action-conditioned dynamics | transition accuracy, contact accuracy, next-action accuracy | Cosmos 3, OpenVLA, openpi, GR00T | |
| | Affordance and object interaction | object micro-F1, contact-object consistency, caption grounding | all branches | |
| | Cross-episode generalization | held-out episode metrics, held-out session metrics, leakage checks | all trainable branches | |
|
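The structured-prediction row combines three different measurements. The sketch below computes them with the standard library only. It assumes raw model outputs are strings and task labels are single-label. The function names are hypothetical and this is not the repo's evaluator.

```python
import json


def json_validity(outputs):
    """Fraction of raw model outputs (strings) that parse as JSON."""
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(outputs)


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = sum(per_class) / len(per_class)  # every class weighted equally
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0  # pooled counts
    return macro, micro


macro, micro = f1_scores(["a", "a", "a", "b"], ["a", "a", "b", "b"])
print(round(macro, 4), round(micro, 4))  # 0.7333 0.75
```

Macro-F1 weights rare classes as heavily as common ones, which matters when held-out episodes contain activities that are scarce in training. For single-label tasks micro-F1 equals accuracy, so reporting both mainly helps in the multi-label case.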
|
| ## Execution Order |
|
|
| 1. Finish selected multi-episode pilot preparation. |
| 2. Run the Qwen3-Omni LoRA pilot exactly once as the first held-out baseline. |
| 3. Run a model-selection dry run on 3-8 episodes: Qwen3-Omni prompt-only, |
| Qwen3-Omni LoRA, Cosmos 3 world-model preprocessing, and one policy baseline. |
| 4. Promote Cosmos 3 to the first world-model experiment if video/sensor |
| preprocessing and storage fit. |
| 5. Promote OpenVLA/openpi/GR00T only after action targets are explicit and |
| retargeting artifacts are traceable. |
| 6. Update public cards only when a branch has real manifests, predictions, |
| metrics, and qualitative examples. |
| 7. Start Xperience-native pretraining only after smaller scaling stages, |
| full-corpus storage, multi-node compute, and held-out evaluation protocols |
| are in place. |
|
|
| ## Source Links |
|
|
| - Qwen3-Omni: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct |
| - NVIDIA Cosmos: https://www.nvidia.com/en-us/ai/cosmos/ |
| - NVIDIA Isaac GR00T: https://developer.nvidia.com/isaac/gr00t |
| - OpenVLA: https://openvla.github.io/ |
| - openpi: https://github.com/Physical-Intelligence/openpi |
| - Gemini Robotics: https://deepmind.google/discover/blog/gemini-robotics-brings-ai-into-the-physical-world/ |
| - Octo: https://octo-models.github.io/ |
| - LeRobot / SmolVLA: https://github.com/huggingface/lerobot |
| - Xperience Embodied Foundation Model pretraining plan: |
| [`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md) |
|
|