# Foundation Model Plan This plan extends the current Xperience-10M scale-up path beyond the prepared Qwen3-Omni LoRA pilot. It separates immediate trainable work from later world-model and robot-policy branches, so the project can choose a backbone without mixing different research goals. Current status: this is a planning artifact. The public repo has verified single-episode task heads and setup-stage Qwen3-Omni scripts. It has not yet run a held-out multi-episode foundation-model evaluation. ## Backbone Decision | Priority | Model family | Best role for this project | Why it fits Xperience-10M | Current decision | | --- | --- | --- | --- | --- | | 1 | Qwen3-Omni | Multimodal instruction model and JSON task predictor | Accepts video/audio/language directly; depth, pose, mocap, and IMU can enter through the existing sensor bridge | Keep as the first selected-episode LoRA pilot | | 2 | Cosmos 3 | Embodied world model, action generation, and synthetic future prediction | Designed for physical-world video generation, action-conditioned world modeling, and robot/world simulation style objectives | Add as the first world-model branch after the data gate | | 3 | NVIDIA GR00T | Humanoid/action-policy foundation model | Xperience-10M mocap, hand motion, contacts, and egocentric interaction can support retargeting and action-understanding probes | Track as a humanoid policy branch, not the first LoRA pilot | | 4 | OpenVLA / OpenVLA-OFT | Open vision-language-action policy baseline | Useful when windows are converted into visual observation plus action-token targets | Use after action-space design is explicit | | 5 | openpi pi0/pi0.5 | Open robot policy and action expert baseline | Useful for action chunking, policy fine-tuning, and embodiment transfer experiments | Candidate for policy branch once action labels are retargeted | | 6 | Gemini Robotics | Closed/API embodied reasoning reference | Strong candidate for qualitative reasoning and task interpretation, but not a local fine-tune target | Use only as an external comparison or annotation assistant | | 7 | Octo / SmolVLA-style lightweight policies | Smaller reproducible robot-policy baselines | Good for cheaper action-policy experiments, but less directly omni-modal | Optional baseline branch after selected-episode data staging | ## Why Qwen3-Omni Still Goes First The immediate pilot is about proving the full data path: - staged multi-episode Xperience-10M data, - episode-level train/test separation, - window-level supervised examples, - multimodal prompt construction, - sensor bridge for depth, pose, mocap, and IMU, - LoRA training, - held-out predictions and metrics. Qwen3-Omni is the most direct first target because the existing scripts already prepare video/audio/language prompts and adapter inputs. It is also suitable for the 12 current task contracts, which mostly produce labels, structured JSON, or short task answers. ## Why Cosmos 3 Should Be Added Next Cosmos 3 should not replace the Qwen3-Omni pilot. It should become the first world-model branch after the data gate. The reason is that the Xperience-10M modalities are unusually aligned with physical-world modeling: - video streams for visual state, - embedded audio for event cues, - depth and calibration for spatial structure, - pose/SLAM for camera motion, - hand/body mocap for embodied state, - IMU for inertial dynamics, - language annotations for task semantics. The practical Cosmos 3 branch should start with three targets: 1. **Future-window prediction:** condition on earlier video/sensor windows and predict future visual or latent state. 2. **Action-conditioned world modeling:** use mocap/action labels as controls and predict what changes in the scene. 3. **Synthetic data expansion:** generate or score candidate futures, then test whether synthetic windows improve downstream task heads. Do not claim a Cosmos 3 result until there are committed manifests, generated outputs, held-out metrics, and qualitative examples. ## Policy-Model Branch OpenVLA, openpi, GR00T, Octo, and SmolVLA-style models should be treated as policy/action branches. They need a clear action target before training: - egocentric action class, - next subtask, - hand trajectory chunk, - contact state, - object-affordance target, - retargeted humanoid/body action, - or robot-compatible action tokens. The current public sample can prototype the data conversion, but policy quality requires multi-episode diversity. The first useful policy experiment should be a 64-128 episode run, not a one-sample demonstration. ## Evaluation Additions The foundation-model stage should add metrics beyond the current 12-task suite: | Evaluation target | Metric family | Applies to | | --- | --- | --- | | Structured task prediction | JSON validity, macro-F1, accuracy, micro-F1 | Qwen3-Omni, Gemini Robotics comparison | | Future state prediction | retrieval rank, temporal consistency, feature reconstruction, visual inspection | Cosmos 3 | | Action-conditioned dynamics | transition accuracy, contact accuracy, next-action accuracy | Cosmos 3, OpenVLA, openpi, GR00T | | Affordance and object interaction | object micro-F1, contact-object consistency, caption grounding | all branches | | Cross-episode generalization | held-out episode metrics, held-out session metrics, leakage audit | all trainable branches | ## Execution Order 1. Finish multi-episode data staging for the selected relay. 2. Run the Qwen3-Omni LoRA pilot exactly once as the first held-out baseline. 3. Run a model-selection dry run on 3-8 episodes: Qwen3-Omni prompt-only, Qwen3-Omni LoRA, Cosmos 3 world-model preprocessing, and one policy baseline. 4. Promote Cosmos 3 to the first world-model experiment if video/sensor preprocessing and storage fit. 5. Promote OpenVLA/openpi/GR00T only after action targets are explicit and retargeting artifacts are traceable. 6. Update public cards only when a branch has real manifests, predictions, metrics, and qualitative examples. ## Source Links - Qwen3-Omni: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct - NVIDIA Cosmos: https://www.nvidia.com/en-us/ai/cosmos/ - NVIDIA Isaac GR00T: https://developer.nvidia.com/isaac/gr00t - OpenVLA: https://openvla.github.io/ - openpi: https://github.com/Physical-Intelligence/openpi - Gemini Robotics: https://deepmind.google/discover/blog/gemini-robotics-brings-ai-into-the-physical-world/ - Octo: https://octo-models.github.io/ - LeRobot / SmolVLA: https://github.com/huggingface/lerobot