# Foundation Model Plan
This plan extends the current Xperience-10M scale-up path beyond the prepared
Qwen3-Omni LoRA pilot. It separates immediate trainable work from later
world-model and robot-policy branches, so the project can choose a backbone
without mixing different research goals.
Current status: this remains the backbone-selection plan, but the repo now has
verified held-out multi-episode foundation-model diagnostics: Qwen3-Omni LoRA
for structured JSON tasks, Cosmos3-Nano for future-window compatibility,
Cosmos3-Super Reasoner as a base-weight JSON-task evaluation, and Cosmos3-Super
Forward-Dynamics LoRA as the first fine-tuned Super adapter branch.
## Three Pipeline Tracks
The project can pursue the three presentation directions, but they should be
treated as pipeline tracks with different maturity levels:
| Track | Role | Current boundary | Next gate |
| --- | --- | --- | --- |
| Spatial intelligence models | Recover and reason over scene/object state from multiview RGB, depth, pose, calibration, object cues, and language. | Pipeline and evaluation contract, not a completed spatial model claim. | Raw depth/pose artifacts, spatial-memory exporter, and held-out spatial QA/object-memory metrics. |
| Human-video world models | Predict future action, subtask, object set, contact transition, camera motion, or latent visual state from observed interaction windows. | Partially evidenced by future-task probes and Cosmos-style branch artifacts. | Stronger future-state metrics, qualitative future examples, and held-out episode breakdowns. |
| Vision-language-action models | Convert egocentric video, language, hand/body motion, contacts, and objects into action chunks or policy-compatible targets. | Feasible but gated by action-space conversion. | Traceable action tokens, normalization, retargeting metadata, and policy/VLA held-out metrics. |
The full pipeline-level contract is in
[`THREE_FOUNDATION_PIPELINES.md`](THREE_FOUNDATION_PIPELINES.md), with a
machine-readable copy at
[`docs/data/three_foundation_pipelines.json`](docs/data/three_foundation_pipelines.json).
## Backbone Decision
| Priority | Model family | Best role for this project | Why it fits Xperience-10M | Current decision |
| --- | --- | --- | --- | --- |
| 1 | Qwen3-Omni | Multimodal instruction model and JSON task predictor | Accepts video/audio/language directly; depth, pose, mocap, and IMU can enter through the existing sensor bridge | Keep as the first selected-episode LoRA pilot |
| 2 | Cosmos 3 | Embodied world model, action generation, and synthetic future prediction | Designed for physical-world video generation, action-conditioned world modeling, and robot/world simulation style objectives | Add as the first world-model track after the data gate |
| 3 | NVIDIA GR00T | Humanoid/action-policy foundation model | Xperience-10M mocap, hand motion, contacts, and egocentric interaction can support retargeting and action-understanding probes | Track as a humanoid policy branch, not the first LoRA pilot |
| 4 | OpenVLA / OpenVLA-OFT | Open vision-language-action policy baseline | Useful when windows are converted into visual observation plus action-token targets | Use after action-space design is explicit |
| 5 | openpi pi0/pi0.5 | Open robot policy and action expert baseline | Useful for action chunking, policy fine-tuning, and embodiment transfer experiments | Candidate for policy branch once action labels are retargeted |
| 6 | Gemini Robotics | Closed/API embodied reasoning reference | Strong candidate for qualitative reasoning and task interpretation, but not a local fine-tune target | Use only as an external comparison or annotation assistant |
| 7 | Octo / SmolVLA-style lightweight policies | Smaller reproducible robot-policy baselines | Good for cheaper action-policy experiments, but less directly omni-modal | Optional baseline branch after selected-episode data preparation |
| Future | Xperience Embodied Foundation Model | Xperience-native domain model pretrained from scratch on full-corpus embodied experience | Would learn a shared temporal representation across video, audio, depth, pose, mocap, IMU, and language | Long-term goal after smaller pilots prove value and full-corpus storage/compute are available |
## Why Qwen3-Omni Still Goes First
The immediate pilot is about proving the full data path:
- prepared multi-episode Xperience-10M data,
- episode-level train/test separation,
- window-level supervised examples,
- multimodal prompt construction,
- sensor bridge for depth, pose, mocap, and IMU,
- LoRA training,
- held-out predictions and metrics.
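
The episode-level separation in the list above can be sketched as follows. This is a minimal illustration, not the repo's implementation: the function names, the `episode_id` field, and the `train`/`val`/`test` labels are assumptions, and the default sizes simply mirror the 96/16/16 split named in the execution order.

```python
import random


def split_episodes(episode_ids, sizes=(96, 16, 16), seed=0):
    """Shuffle episode IDs once and cut them into disjoint groups."""
    ids = sorted(set(episode_ids))  # sort first so the shuffle is reproducible
    if sum(sizes) > len(ids):
        raise ValueError("not enough episodes for the requested split sizes")
    random.Random(seed).shuffle(ids)
    train_end = sizes[0]
    val_end = train_end + sizes[1]
    test_end = val_end + sizes[2]
    return {
        "train": ids[:train_end],
        "val": ids[train_end:val_end],
        "test": ids[val_end:test_end],
    }


def assign_windows(windows, split):
    """Route each window to the split of its parent episode, never by window."""
    episode_to_split = {ep: name for name, eps in split.items() for ep in eps}
    routed = {name: [] for name in split}
    for window in windows:
        name = episode_to_split.get(window["episode_id"])
        if name is not None:  # windows from unselected episodes are dropped
            routed[name].append(window)
    return routed


def leaked_episodes(routed):
    """Return episode IDs that appear in more than one split (should be empty)."""
    seen, leaked = {}, set()
    for name, windows in routed.items():
        for window in windows:
            owner = seen.setdefault(window["episode_id"], name)
            if owner != name:
                leaked.add(window["episode_id"])
    return leaked
```

Splitting by episode before any windows are cut means that two windows from the same recording cannot land on opposite sides of the train/test boundary.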
Qwen3-Omni is the most direct first target because the existing scripts already
prepare video/audio/language prompts and adapter inputs. It also suits the 20
current unified task contracts, most of which produce labels, structured JSON,
or short task answers.
The executable Qwen branch and future branch contracts are now represented as
config files under `configs/omni_backbones/`. Validate them with:
```bash
python scripts/omni/backbone_registry.py --validate --json
```
The shared extension rules are in
[`OMNI_MODEL_EXTENSION_CONTRACT.md`](OMNI_MODEL_EXTENSION_CONTRACT.md). A new
foundation branch should add a config first, then implement the exporter,
trainer, evaluator, and launcher required by that config.
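
The config-first rule can be illustrated with a small completeness check. This is a hypothetical sketch: the `components` key and its layout are assumptions made for illustration, not the schema used under `configs/omni_backbones/` or by `backbone_registry.py`.

```python
# Hypothetical schema: a branch config maps each required component to a path.
REQUIRED_COMPONENTS = ("exporter", "trainer", "evaluator", "launcher")


def missing_components(config):
    """Return the required component names a branch config does not declare."""
    components = config.get("components", {})
    return [name for name in REQUIRED_COMPONENTS if not components.get(name)]
```

A branch whose config returns a non-empty list here would still be a contract, not yet an executable branch.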
## Long-Term Native Pretraining Goal
Qwen3-Omni, Cosmos 3, GR00T, OpenVLA, and openpi are backbone choices for the
next experiments. The longer-term goal is different: train an
**Xperience Embodied Foundation Model** that is native to the Xperience-10M
modality structure.
That model would not start as a general internet-scale omni model. It would be
a domain model over synchronized embodied experience: multi-view egocentric
video, audio, depth, pose/SLAM, hand and body mocap, IMU, calibration, and
language annotations. Its pretraining should combine masked multimodal
modeling, cross-modal contrastive alignment, future-state prediction,
ego-motion and hand-motion forecasting, action/procedure prediction, language
grounding, contact/affordance prediction, and optional policy-style targets
after action conversion.
This is not a current result in the repo. It becomes appropriate only after:
- the selected multi-episode pipeline trains and evaluates cleanly,
- scaling from 128 episodes to thousands of episodes shows measurable value,
- raw-corpus storage and derived-shard capacity are available,
- distributed training and checkpoint/restart infrastructure are reliable,
- evaluation covers held-out episodes, sessions, activities, objects, and
missing-modality robustness.
The full plan is documented in
[`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md).
## Why Cosmos 3 Should Be Added Next
Cosmos 3 should not replace the Qwen3-Omni pilot. It should become the first
world-model track after the data gate. The reason is that the Xperience-10M
modalities are unusually aligned with physical-world modeling:
- video streams for visual state,
- embedded audio for event cues,
- depth and calibration for spatial structure,
- pose/SLAM for camera motion,
- hand/body mocap for embodied state,
- IMU for inertial dynamics,
- language annotations for task semantics.
The practical Cosmos 3 branch should start with three targets:
1. **Future-window prediction:** condition on earlier video/sensor windows and
predict future visual or latent state.
2. **Action-conditioned world modeling:** use mocap/action labels as controls
and predict what changes in the scene.
3. **Synthetic data expansion:** generate or score candidate futures, then test
whether synthetic windows improve downstream task heads.
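
For the first target, the pairing of observed and future windows might look like the sketch below. The function name and parameters are illustrative assumptions; it only produces frame-index ranges within one episode and says nothing about how Cosmos 3 consumes them.

```python
def make_future_window_pairs(num_frames, context_len, future_len, stride):
    """Return (context, future) frame-index ranges from a single episode."""
    pairs = []
    span = context_len + future_len
    # Stop early enough that the future window never runs past the episode end.
    for start in range(0, num_frames - span + 1, stride):
        context = range(start, start + context_len)
        future = range(start + context_len, start + span)
        pairs.append((context, future))
    return pairs
```

Because each future window starts exactly where its context ends, the model never sees target frames as input within a pair. Overlap between pairs is controlled by `stride`.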
A Cosmos 3 branch is now represented by two public-safe verified packages:
Cosmos3-Nano future-window compatibility and Cosmos3-Super forward-dynamics
LoRA. The Super LoRA target is camera-pose-conditioned future vision velocity,
so it should be analyzed as a world-model loss result rather than a JSON-task
classifier.
## Policy-Model Branch
OpenVLA, openpi, GR00T, Octo, and SmolVLA-style models should be treated as
policy/action branches. They need a clear action target before training:
- egocentric action class,
- next subtask,
- hand trajectory chunk,
- contact state,
- object-affordance target,
- retargeted humanoid/body action,
- or robot-compatible action tokens.
The current public sample can prototype the data conversion, but policy quality
requires multi-episode diversity. The first useful policy experiment should be a
64-128 episode run, not a one-sample demonstration.
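
As one example of an explicit action target, a hand trajectory can be turned into fixed-length chunks of per-step displacements. This is a minimal sketch under assumed conventions (trajectories as lists of coordinate lists, non-overlapping chunks, z-score normalization); the real action space, retargeting, and tokenization remain open design choices.

```python
def normalize(trajectory):
    """Scale each coordinate to zero mean and unit variance; return the stats too."""
    dims = len(trajectory[0])
    n = len(trajectory)
    means = [sum(p[d] for p in trajectory) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((p[d] - means[d]) ** 2 for p in trajectory) / n
        stds.append(var ** 0.5 or 1.0)  # guard against constant coordinates
    normed = [[(p[d] - means[d]) / stds[d] for d in range(dims)] for p in trajectory]
    return normed, {"mean": means, "std": stds}


def chunk_actions(trajectory, chunk_len):
    """Cut a trajectory into non-overlapping chunks of per-step deltas."""
    deltas = [
        [b - a for a, b in zip(prev, cur)]
        for prev, cur in zip(trajectory, trajectory[1:])
    ]
    # A trailing remainder shorter than chunk_len is dropped.
    return [
        deltas[i:i + chunk_len]
        for i in range(0, len(deltas) - chunk_len + 1, chunk_len)
    ]
```

Returning the normalization statistics alongside the values keeps the conversion traceable, so predicted chunks can be mapped back to the original units.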
## Evaluation Additions
The foundation-model stage should add metrics beyond the current 20-task suite:
| Evaluation target | Metric family | Applies to |
| --- | --- | --- |
| Structured task prediction | JSON validity, macro-F1, accuracy, micro-F1 | Qwen3-Omni, Gemini Robotics comparison |
| Future state prediction | retrieval rank, temporal consistency, feature reconstruction, visual inspection | Cosmos 3 |
| Action-conditioned dynamics | transition accuracy, contact accuracy, next-action accuracy | Cosmos 3, OpenVLA, openpi, GR00T |
| Affordance and object interaction | object micro-F1, contact-object consistency, caption grounding | all branches |
| Cross-episode generalization | held-out episode metrics, held-out session metrics, leakage checks | all trainable branches |
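
Two of the structured-prediction metrics in the table can be computed with the standard library alone. The sketch below is illustrative, not the repo's evaluator: it assumes raw model outputs are strings, that a valid output must parse to a JSON object, and that gold and predicted labels are aligned lists.

```python
import json


def json_validity(raw_outputs):
    """Fraction of model outputs that parse as a JSON object."""
    valid = 0
    for text in raw_outputs:
        try:
            valid += isinstance(json.loads(text), dict)
        except (json.JSONDecodeError, TypeError):
            pass  # unparseable or non-string output counts as invalid
    return valid / len(raw_outputs) if raw_outputs else 0.0


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every class seen in gold or pred."""
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Macro-F1 weights rare classes as heavily as frequent ones, which matters when a few action or subtask labels dominate the held-out windows.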
## Execution Order
1. Keep the selected 96/16/16 split as the comparison spine.
2. Treat the verified Qwen3-Omni LoRA package as the structured JSON baseline.
3. Treat Cosmos3-Nano compatibility and Cosmos3-Super Forward-Dynamics LoRA as separate Cosmos3 world-model artifacts with different metrics.
4. Run a model-selection dry run on 3-8 episodes for any next backbone before scaling beyond the selected split.
5. Promote Cosmos 3 to larger world-model experiments if video/sensor
preprocessing, storage, and loss metrics justify the extra cost.
6. Promote OpenVLA/openpi/GR00T only after action targets are explicit and
retargeting artifacts are traceable.
7. Update public cards only when a branch has real manifests, predictions,
metrics, and qualitative examples.
8. Start Xperience-native pretraining only after smaller scaling stages,
full-corpus storage, multi-node compute, and held-out evaluation protocols
are in place.
## Source Links
- Qwen3-Omni: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
- NVIDIA Cosmos: https://www.nvidia.com/en-us/ai/cosmos/
- NVIDIA Isaac GR00T: https://developer.nvidia.com/isaac/gr00t
- OpenVLA: https://openvla.github.io/
- openpi: https://github.com/Physical-Intelligence/openpi
- Gemini Robotics: https://deepmind.google/discover/blog/gemini-robotics-brings-ai-into-the-physical-world/
- Octo: https://octo-models.github.io/
- LeRobot / SmolVLA: https://github.com/huggingface/lerobot
- Xperience Embodied Foundation Model pretraining plan:
`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`