Foundation Model Plan

This plan extends the current Xperience-10M scale-up path beyond the prepared Qwen3-Omni LoRA pilot. It separates immediate trainable work from later world-model and robot-policy branches, so the project can choose a backbone without mixing different research goals.

Current status: this is a planning artifact. The public repo has verified single-episode task heads and setup-stage Qwen3-Omni scripts. It has not yet run a held-out multi-episode foundation-model evaluation.

Backbone Decision

Priority	Model family	Best role for this project	Why it fits Xperience-10M	Current decision
1	Qwen3-Omni	Multimodal instruction model and JSON task predictor	Accepts video/audio/language directly; depth, pose, mocap, and IMU can enter through the existing sensor bridge	Keep as the first selected-episode LoRA pilot
2	Cosmos 3	Embodied world model, action generation, and synthetic future prediction	Designed for physical-world video generation, action-conditioned world modeling, and robot- and world-simulation objectives	Add as the first world-model branch after the data gate
3	NVIDIA GR00T	Humanoid/action-policy foundation model	Xperience-10M mocap, hand motion, contacts, and egocentric interaction can support retargeting and action-understanding probes	Track as a humanoid policy branch, not the first LoRA pilot
4	OpenVLA / OpenVLA-OFT	Open vision-language-action policy baseline	Useful when windows are converted into visual observation plus action-token targets	Use after action-space design is explicit
5	openpi pi0/pi0.5	Open robot policy and action expert baseline	Useful for action chunking, policy fine-tuning, and embodiment transfer experiments	Candidate for policy branch once action labels are retargeted
6	Gemini Robotics	Closed/API embodied reasoning reference	Strong candidate for qualitative reasoning and task interpretation, but not a local fine-tune target	Use only as an external comparison or annotation assistant
7	Octo / SmolVLA-style lightweight policies	Smaller reproducible robot-policy baselines	Good for cheaper action-policy experiments, but less directly omni-modal	Optional baseline branch after selected-episode data staging

Why Qwen3-Omni Still Goes First

The immediate pilot is about proving the full data path:

staged multi-episode Xperience-10M data,
episode-level train/test separation,
window-level supervised examples,
multimodal prompt construction,
sensor bridge for depth, pose, mocap, and IMU,
LoRA training,
held-out predictions and metrics.
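As a minimal sketch of the episode-level train/test separation step, the split can be derived from a hash of the episode ID so that every window of an episode lands in the same split. The function names, the `episode_id` field, and the 20% test fraction are assumptions for illustration, not the repo's actual schema or settings.

```python
import hashlib

def assign_split(episode_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map an episode ID to 'train' or 'test'."""
    digest = hashlib.sha256(episode_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return "test" if bucket < test_fraction * 2**64 else "train"

def split_windows(windows, test_fraction: float = 0.2):
    """Group window records by the split of their parent episode.

    Each window is assumed to be a dict with an 'episode_id' key
    (hypothetical schema).
    """
    splits = {"train": [], "test": []}
    for window in windows:
        split = assign_split(window["episode_id"], test_fraction)
        splits[split].append(window)
    return splits
```

Because the assignment depends only on the episode ID, restaging or reordering the data does not move an episode between splits, and no window-level shuffle can place windows of one episode on both sides.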

Qwen3-Omni is the most direct first target because the existing scripts already prepare video/audio/language prompts and adapter inputs. It is also suitable for the 12 current task contracts, which mostly produce labels, structured JSON, or short task answers.
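For the task contracts that produce structured JSON, a strict parse-and-check step could look like the sketch below. The helper names and the idea of a per-task list of required keys are assumptions; the actual contract format is not specified here.

```python
import json

def parse_task_output(text, required_keys):
    """Return the parsed dict if `text` is a JSON object containing
    every key in `required_keys`; otherwise return None."""
    try:
        obj = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(obj, dict):
        return None
    if not all(key in obj for key in required_keys):
        return None
    return obj

def json_validity(outputs, required_keys):
    """Fraction of model outputs that pass parse_task_output."""
    if not outputs:
        return 0.0
    valid = sum(parse_task_output(o, required_keys) is not None for o in outputs)
    return valid / len(outputs)
```

This version is deliberately strict: a response that wraps valid JSON in extra prose counts as invalid. Whether to allow lenient extraction is a design choice that should be fixed before held-out evaluation so the metric stays comparable across runs.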

Why Cosmos 3 Should Be Added Next

Cosmos 3 should not replace the Qwen3-Omni pilot. It should become the first world-model branch after the data gate, because the Xperience-10M modalities map closely onto what a physical-world model needs:

video streams for visual state,
embedded audio for event cues,
depth and calibration for spatial structure,
pose/SLAM for camera motion,
hand/body mocap for embodied state,
IMU for inertial dynamics,
language annotations for task semantics.

The practical Cosmos 3 branch should start with three targets:

Future-window prediction: condition on earlier video/sensor windows and predict future visual or latent state.
Action-conditioned world modeling: use mocap/action labels as controls and predict what changes in the scene.
Synthetic data expansion: generate or score candidate futures, then test whether synthetic windows improve downstream task heads.
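The future-window prediction target can be prototyped before any model is chosen by pairing context and future index ranges within one episode. The sketch below assumes frame-indexed episodes; the function name and parameters are hypothetical, and it does not use any Cosmos API.

```python
def make_future_pairs(num_frames, context_len, future_len, stride):
    """Return (context_range, future_range) index pairs for one episode.

    The future range starts immediately after the context range, and
    pairs that would run past the end of the episode are dropped.
    """
    if min(context_len, future_len, stride) <= 0:
        raise ValueError("context_len, future_len, and stride must be positive")
    pairs = []
    start = 0
    while start + context_len + future_len <= num_frames:
        context = range(start, start + context_len)
        future = range(start + context_len, start + context_len + future_len)
        pairs.append((context, future))
        start += stride
    return pairs
```

Pairs are built per episode, so a context window never draws its future from a different episode, which keeps the episode-level split intact.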

Do not claim a Cosmos 3 result until there are committed manifests, generated outputs, held-out metrics, and qualitative examples.

Policy-Model Branch

OpenVLA, openpi, GR00T, Octo, and SmolVLA-style models should be treated as policy/action branches. They need a clear action target before training:

egocentric action class,
next subtask,
hand trajectory chunk,
contact state,
object-affordance target,
retargeted humanoid/body action,
or robot-compatible action tokens.

The current public sample can prototype the data conversion, but it cannot measure policy quality, which requires multi-episode diversity. The first useful policy experiment should be a run over 64-128 episodes, not a one-sample demonstration.
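Of the action targets listed above, the hand trajectory chunk is the simplest to prototype on the public sample. The sketch below pairs each timestep with the next `chunk_len` poses; the record layout and function name are assumptions, and real poses would come from the mocap stream rather than the placeholder values a caller passes in.

```python
def chunk_trajectory(poses, chunk_len):
    """Build (observation, target_chunk) examples from one pose sequence.

    For timestep t the target is poses[t + 1 : t + 1 + chunk_len].
    Timesteps without a full future chunk are dropped.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    examples = []
    for t in range(len(poses) - chunk_len):
        examples.append({
            "t": t,
            "observation": poses[t],
            "target_chunk": poses[t + 1 : t + 1 + chunk_len],
        })
    return examples
```

Dropping incomplete tails keeps every target the same length, which is the simpler choice; padding the tail is an alternative if the end of an episode carries important actions.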

Evaluation Additions

The foundation-model stage should add metrics beyond the current 12-task suite:

Evaluation target	Metric family	Applies to
Structured task prediction	JSON validity, macro-F1, accuracy, micro-F1	Qwen3-Omni, Gemini Robotics comparison
Future state prediction	retrieval rank, temporal consistency, feature reconstruction, visual inspection	Cosmos 3
Action-conditioned dynamics	transition accuracy, contact accuracy, next-action accuracy	Cosmos 3, OpenVLA, openpi, GR00T
Affordance and object interaction	object micro-F1, contact-object consistency, caption grounding	all branches
Cross-episode generalization	held-out episode metrics, held-out session metrics, leakage audit	all trainable branches
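Two of the rows above are easy to pin down in code: macro/micro F1 for single-label predictions and a leakage audit over split membership. This is a dependency-free sketch; the function names and the `episode_id` key are assumptions, and multi-label targets such as object sets would need a different F1 formulation.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Macro and micro F1 for single-label multiclass predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    def f1(tp_n, fp_n, fn_n):
        denom = 2 * tp_n + fp_n + fn_n
        return 2 * tp_n / denom if denom else 0.0

    labels = set(y_true) | set(y_pred)
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels) if labels else 0.0
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return {"macro_f1": macro, "micro_f1": micro}

def leakage_audit(train_windows, test_windows, key="episode_id"):
    """Return the sorted IDs that appear in both splits.

    An empty list means no overlap on `key`; run it again with a
    session-level key to audit held-out sessions.
    """
    train_ids = {w[key] for w in train_windows}
    test_ids = {w[key] for w in test_windows}
    return sorted(train_ids & test_ids)
```

In the single-label case micro-F1 equals accuracy, so reporting both adds information only for multi-label tasks. Macro-F1 is the number to watch when label frequencies are skewed.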

Execution Order

Finish multi-episode data staging for the selected relay.
Run the Qwen3-Omni LoRA pilot exactly once as the first held-out baseline.
Run a model-selection dry run on 3-8 episodes: Qwen3-Omni prompt-only, Qwen3-Omni LoRA, Cosmos 3 world-model preprocessing, and one policy baseline.
Promote Cosmos 3 to the first world-model experiment if video/sensor preprocessing and storage fit.
Promote OpenVLA/openpi/GR00T only after action targets are explicit and retargeting artifacts are traceable.
Update public cards only when a branch has real manifests, predictions, metrics, and qualitative examples.
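The last rule can be enforced mechanically before any card update. The sketch below assumes each branch is described by a dict whose keys name the four required artifact kinds; that record layout and the function names are hypothetical.

```python
REQUIRED_ARTIFACTS = ("manifests", "predictions", "metrics", "qualitative_examples")

def missing_artifacts(branch):
    """Return the required artifact kinds that are absent or empty
    in a branch record."""
    return [kind for kind in REQUIRED_ARTIFACTS if not branch.get(kind)]

def can_update_public_card(branch):
    """True only when every required artifact kind is present and non-empty."""
    return not missing_artifacts(branch)
```

A check like this only confirms that artifacts exist. Whether the metrics come from held-out episodes still has to be verified separately.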

Source Links

Qwen3-Omni: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
NVIDIA Cosmos: https://www.nvidia.com/en-us/ai/cosmos/
NVIDIA Isaac GR00T: https://developer.nvidia.com/isaac/gr00t
OpenVLA: https://openvla.github.io/
openpi: https://github.com/Physical-Intelligence/openpi
Gemini Robotics: https://deepmind.google/discover/blog/gemini-robotics-brings-ai-into-the-physical-world/
Octo: https://octo-models.github.io/
LeRobot / SmolVLA: https://github.com/huggingface/lerobot