| # Foundation Model Plan |
|
|
| This plan extends the current Xperience-10M scale-up path beyond the prepared |
| Qwen3-Omni LoRA pilot. It separates immediate trainable work from later |
| world-model and robot-policy branches, so the project can choose a backbone |
| without mixing different research goals. |
|
|
| Current status: this remains the backbone-selection plan, but the repo now has |
| verified held-out multi-episode foundation-model diagnostics: Qwen3-Omni LoRA |
| for structured JSON tasks, Cosmos3-Nano for future-window compatibility, |
| Cosmos3-Super Reasoner as a base-weight JSON-task evaluation, and Cosmos3-Super |
| Forward-Dynamics LoRA as the first fine-tuned Super adapter branch. |
|
|
| ## Three Pipeline Tracks |
|
|
| The project can pursue the three presentation directions, but they should be |
| treated as pipeline tracks with different maturity levels: |
|
|
| | Track | Role | Current boundary | Next gate | |
| | --- | --- | --- | --- | |
| | Spatial intelligence models | Recover and reason over scene/object state from multiview RGB, depth, pose, calibration, object cues, and language. | Pipeline and evaluation contract, not a completed spatial model claim. | Raw depth/pose artifacts, spatial-memory exporter, and held-out spatial QA/object-memory metrics. | |
| | Human-video world models | Predict future action, subtask, object set, contact transition, camera motion, or latent visual state from observed interaction windows. | Partially evidenced by future-task probes and Cosmos-style branch artifacts. | Stronger future-state metrics, qualitative future examples, and held-out episode breakdowns. | |
| | Vision-language-action models | Convert egocentric video, language, hand/body motion, contacts, and objects into action chunks or policy-compatible targets. | Feasible but gated by action-space conversion. | Traceable action tokens, normalization, retargeting metadata, and policy/VLA held-out metrics. | |
|
|
| The full pipeline-level contract is in |
| [`THREE_FOUNDATION_PIPELINES.md`](THREE_FOUNDATION_PIPELINES.md), with a |
| machine-readable copy at |
| [`docs/data/three_foundation_pipelines.json`](docs/data/three_foundation_pipelines.json). |
|
|
| ## Backbone Decision |
|
|
| | Priority | Model family | Best role for this project | Why it fits Xperience-10M | Current decision | |
| | --- | --- | --- | --- | --- | |
| | 1 | Qwen3-Omni | Multimodal instruction model and JSON task predictor | Accepts video/audio/language directly; depth, pose, mocap, and IMU can enter through the existing sensor bridge | Keep as the first selected-episode LoRA pilot | |
| | 2 | Cosmos 3 | Embodied world model, action generation, and synthetic future prediction | Designed for physical-world video generation, action-conditioned world modeling, and robot/world simulation style objectives | Add as the first world-model track after the data gate | |
| | 3 | NVIDIA GR00T | Humanoid/action-policy foundation model | Xperience-10M mocap, hand motion, contacts, and egocentric interaction can support retargeting and action-understanding probes | Track as a humanoid policy branch, not the first LoRA pilot | |
| | 4 | OpenVLA / OpenVLA-OFT | Open vision-language-action policy baseline | Useful when windows are converted into visual observation plus action-token targets | Use after action-space design is explicit | |
| | 5 | openpi pi0/pi0.5 | Open robot policy and action expert baseline | Useful for action chunking, policy fine-tuning, and embodiment transfer experiments | Candidate for policy branch once action labels are retargeted | |
| | 6 | Gemini Robotics | Closed/API embodied reasoning reference | Strong candidate for qualitative reasoning and task interpretation, but not a local fine-tune target | Use only as an external comparison or annotation assistant | |
| | 7 | Octo / SmolVLA-style lightweight policies | Smaller reproducible robot-policy baselines | Good for cheaper action-policy experiments, but less directly omni-modal | Optional baseline branch after selected-episode data preparation | |
| | Future | Xperience Embodied Foundation Model | Xperience-native domain model pretrained from scratch on full-corpus embodied experience | Would learn a shared temporal representation across video, audio, depth, pose, mocap, IMU, and language | Long-term goal after smaller pilots prove value and full-corpus storage/compute are available | |
|
|
| ## Why Qwen3-Omni Still Goes First |
|
|
| The immediate pilot is about proving the full data path: |
|
|
| - prepared multi-episode Xperience-10M data, |
| - episode-level train/test separation, |
| - window-level supervised examples, |
| - multimodal prompt construction, |
| - sensor bridge for depth, pose, mocap, and IMU, |
| - LoRA training, |
| - held-out predictions and metrics. |
|
|
| Qwen3-Omni is the most direct first target because the existing scripts already |
| prepare video/audio/language prompts and adapter inputs. It is also suitable for |
| the unified 20 current task contracts, which mostly produce labels, structured JSON, or |
| short task answers. |
|
|
| The executable Qwen branch and future branch contracts are now represented as |
| config files under `configs/omni_backbones/`. Validate them with: |
|
|
| ```bash |
| python scripts/omni/backbone_registry.py --validate --json |
| ``` |
|
|
| The shared extension rules are in |
| [`OMNI_MODEL_EXTENSION_CONTRACT.md`](OMNI_MODEL_EXTENSION_CONTRACT.md). A new |
| foundation branch should add a config first, then implement the exporter, |
| trainer, evaluator, and launcher required by that config. |
|
|
| ## Long-Term Native Pretraining Goal |
|
|
| Qwen3-Omni, Cosmos 3, GR00T, OpenVLA, and openpi are backbone choices for the |
| next experiments. The longer-term goal is different: train an |
| **Xperience Embodied Foundation Model** that is native to the Xperience-10M |
| modality structure. |
|
|
| That model would not start as a general internet-scale omni model. It would be |
| a domain model over synchronized embodied experience: multi-view egocentric |
| video, audio, depth, pose/SLAM, hand and body mocap, IMU, calibration, and |
| language annotations. Its pretraining should combine masked multimodal |
| modeling, cross-modal contrastive alignment, future-state prediction, |
| ego-motion and hand-motion forecasting, action/procedure prediction, language |
| grounding, contact/affordance prediction, and optional policy-style targets |
| after action conversion. |
|
|
| This is not a current result in the repo. It becomes appropriate only after: |
|
|
| - the selected multi-episode pipeline trains and evaluates cleanly, |
| - scaling from 128 episodes to thousands of episodes shows measurable value, |
| - raw-corpus storage and derived-shard capacity are available, |
| - distributed training and checkpoint/restart infrastructure are reliable, |
| - evaluation covers held-out episodes, sessions, activities, objects, and |
| missing-modality robustness. |
|
|
| The full plan is documented in |
| [`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md). |
|
|
| ## Why Cosmos 3 Should Be Added Next |
|
|
| Cosmos 3 should not replace the Qwen3-Omni pilot. It should become the first |
| world-model track after the data gate. The reason is that the Xperience-10M |
| modalities are unusually aligned with physical-world modeling: |
|
|
| - video streams for visual state, |
| - embedded audio for event cues, |
| - depth and calibration for spatial structure, |
| - pose/SLAM for camera motion, |
| - hand/body mocap for embodied state, |
| - IMU for inertial dynamics, |
| - language annotations for task semantics. |
|
|
| The practical Cosmos 3 branch should start with three targets: |
|
|
| 1. **Future-window prediction:** condition on earlier video/sensor windows and |
| predict future visual or latent state. |
| 2. **Action-conditioned world modeling:** use mocap/action labels as controls |
| and predict what changes in the scene. |
| 3. **Synthetic data expansion:** generate or score candidate futures, then test |
| whether synthetic windows improve downstream task heads. |
|
|
| A Cosmos 3 branch is now represented by two public-safe verified packages: |
| Cosmos3-Nano future-window compatibility and Cosmos3-Super forward-dynamics |
| LoRA. The Super LoRA target is camera-pose-conditioned future vision velocity, |
| so it should be analyzed as a world-model loss result rather than a JSON-task |
| classifier. |
|
|
| ## Policy-Model Branch |
|
|
| OpenVLA, openpi, GR00T, Octo, and SmolVLA-style models should be treated as |
| policy/action branches. They need a clear action target before training: |
|
|
| - egocentric action class, |
| - next subtask, |
| - hand trajectory chunk, |
| - contact state, |
| - object-affordance target, |
| - retargeted humanoid/body action, |
| - or robot-compatible action tokens. |
|
|
| The current public sample can prototype the data conversion, but policy quality |
| requires multi-episode diversity. The first useful policy experiment should be a |
| 64-128 episode run, not a one-sample demonstration. |
|
|
| ## Evaluation Additions |
|
|
| The foundation-model stage should add metrics beyond the current 20-task suite: |
|
|
| | Evaluation target | Metric family | Applies to | |
| | --- | --- | --- | |
| | Structured task prediction | JSON validity, macro-F1, accuracy, micro-F1 | Qwen3-Omni, Gemini Robotics comparison | |
| | Future state prediction | retrieval rank, temporal consistency, feature reconstruction, visual inspection | Cosmos 3 | |
| | Action-conditioned dynamics | transition accuracy, contact accuracy, next-action accuracy | Cosmos 3, OpenVLA, openpi, GR00T | |
| | Affordance and object interaction | object micro-F1, contact-object consistency, caption grounding | all branches | |
| | Cross-episode generalization | held-out episode metrics, held-out session metrics, leakage checks | all trainable branches | |
|
|
| ## Execution Order |
|
|
| 1. Keep the selected 96/16/16 split as the comparison spine. |
| 2. Treat the verified Qwen3-Omni LoRA package as the structured JSON baseline. |
| 3. Treat Cosmos3-Nano compatibility and Cosmos3-Super Forward-Dynamics LoRA as separate Cosmos3 world-model artifacts with different metrics. |
| 4. Run a model-selection dry run on 3-8 episodes for any next backbone before scaling beyond the selected split. |
| 5. Promote Cosmos 3 to larger world-model experiments if video/sensor |
| preprocessing, storage, and loss metrics justify the extra cost. |
| 6. Promote OpenVLA/openpi/GR00T only after action targets are explicit and |
| retargeting artifacts are traceable. |
| 7. Update public cards only when a branch has real manifests, predictions, |
| metrics, and qualitative examples. |
| 8. Start Xperience-native pretraining only after smaller scaling stages, |
| full-corpus storage, multi-node compute, and held-out evaluation protocols |
| are in place. |
|
|
| ## Source Links |
|
|
| - Qwen3-Omni: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct |
| - NVIDIA Cosmos: https://www.nvidia.com/en-us/ai/cosmos/ |
| - NVIDIA Isaac GR00T: https://developer.nvidia.com/isaac/gr00t |
| - OpenVLA: https://openvla.github.io/ |
| - openpi: https://github.com/Physical-Intelligence/openpi |
| - Gemini Robotics: https://deepmind.google/discover/blog/gemini-robotics-brings-ai-into-the-physical-world/ |
| - Octo: https://octo-models.github.io/ |
| - LeRobot / SmolVLA: https://github.com/huggingface/lerobot |
| - Xperience Embodied Foundation Model pretraining plan: |
| `XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md` |
|
|