	# Xperience Embodied Foundation Model Pretraining Goal

	This document describes a future research direction for the project: a
	domain-specific embodied foundation model pretrained on the full Xperience-10M
	corpus, if full-episode access, storage, and compute become available.

	Current status: this is a planning artifact. The public project contains a
	public-sample task suite, lightweight baselines, Qwen3-Omni LoRA preparation,
	and a smoke-test LoRA artifact. It does not yet contain a from-scratch
	Xperience foundation model or a full-corpus pretraining run.

	## Why This Is A Natural Long-Term Goal

	Xperience-10M is designed for physical-AI pretraining rather than only
	single-task supervised learning. The official dataset card describes 10 million
	experiences, 10,000 hours of synchronized first-person recordings, six video
	streams, audio, stereo depth, camera pose, hand and full-body mocap, IMU, and
	hierarchical language annotations. It also reports 2.88B RGB frames, 720M depth
	frames, 576M pose/mocap frames, 7.2B IMU frames, and about 1 PB of total data.
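As a quick sanity check on those figures, the reported frame counts can be divided by the reported recording time. The sketch below uses only the numbers quoted above; the resulting rates are aggregates summed over all streams of a kind, and the per-stream frame rates are not stated here, so treat the output as an implied average rather than a device specification.

```python
# Totals quoted from the dataset card figures cited above.
HOURS = 10_000
SECONDS = HOURS * 3600  # 36,000,000 seconds of recording

reported_frames = {
    "rgb": 2_880_000_000,
    "depth": 720_000_000,
    "pose_mocap": 576_000_000,
    "imu": 7_200_000_000,
}


def implied_rate(total_frames: int, seconds: int = SECONDS) -> float:
    """Aggregate frames per second of recording, summed over all streams."""
    return total_frames / seconds


implied = {name: implied_rate(count) for name, count in reported_frames.items()}
for name, rate in implied.items():
    print(f"{name}: {rate:g} frames per recorded second")
```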

	That scale and alignment make a specific Xperience-native model plausible:
	not a general web-scale omni model, but an embodied model specialized for
	egocentric perception, human-object interaction, temporal dynamics, physical
	state, and task intent.

	## Target Model

	The proposed model name is Xperience Embodied Foundation Model.

	The model should learn a shared temporal representation of embodied experience:
	what the wearer sees and hears, how the camera moves, how the body and hands
	move, what objects are involved, what geometry is present, and what task is
	being performed.

	Expected modules:

	| Module | Input | Role |
	| --- | --- | --- |
	| Multi-view video encoder | fisheye/stereo/RGB streams | visual state, egocentric context, object interaction |
	| Audio encoder | synchronized MP4 audio | event cues, contact-like sound, temporal grounding |
	| Depth and geometry encoder | depth, confidence, calibration | spatial structure and 3D/4D scene cues |
	| Pose/SLAM encoder | camera trajectory and orientation | ego-motion, viewpoint, scene traversal |
	| Mocap encoder | hand/body joints | human motion, hand-object interaction, affordance cues |
	| IMU encoder | accelerometer/gyroscope streams | inertial dynamics and wearable motion |
	| Language encoder/decoder | task/subtask/action/object annotations | semantic grounding and structured generation |
	| Temporal fusion transformer | aligned per-window modality tokens | shared embodied representation across time |
	| Task heads / decoders | fused representation | action, caption, future motion, retrieval, reconstruction, and world-state outputs |
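One way to picture the "aligned per-window modality tokens" that feed the temporal fusion transformer is a flat sequence in which every token carries a modality id and a time index. The sketch below is illustrative only: the modality names, the `Token` type, and `pack_window` are hypothetical and are not part of the current project code.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical modality names; the real encoder outputs are not defined yet.
MODALITIES = ("video", "audio", "depth", "pose", "mocap", "imu", "language")
MODALITY_ID = {name: index for index, name in enumerate(MODALITIES)}


@dataclass
class Token:
    modality_id: int       # which encoder produced the token
    time_index: int        # position of the token inside the window
    features: list[float]  # encoder output for this step


def pack_window(encoded: dict[str, list[list[float]]]) -> list[Token]:
    """Flatten per-modality token sequences into one fusion input sequence.

    Modalities missing from `encoded` are skipped, so a window with a
    dropped sensor still yields a valid (shorter) sequence.
    """
    tokens = []
    for name in MODALITIES:
        for step, features in enumerate(encoded.get(name, [])):
            tokens.append(Token(MODALITY_ID[name], step, features))
    return tokens


# Toy window: two video tokens and one IMU token, no other modalities.
window = {"video": [[0.1, 0.2], [0.3, 0.4]], "imu": [[1.0, 0.0]]}
tokens = pack_window(window)
print([(t.modality_id, t.time_index) for t in tokens])
```

Carrying the modality id and time index on each token is one design option; learned modality and position embeddings added to the features would serve the same purpose.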

	## Pretraining Objectives

	The model should not rely on one loss. It should combine complementary
	objectives so that every modality contributes to the shared representation.
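A minimal sketch of how several objectives could be combined per batch, assuming each objective returns a scalar loss or `None` when the modality it needs is absent. The objective names and weights are placeholders, not tuned values from this project.

```python
def combine_losses(losses, weights):
    """Weighted mean over the objectives that produced a loss for this batch.

    `losses` maps objective name -> float, or None when the objective could
    not be computed (for example, a required modality was missing).
    """
    active = {name: value for name, value in losses.items() if value is not None}
    if not active:
        raise ValueError("no objective produced a loss for this batch")
    total_weight = sum(weights[name] for name in active)
    return sum(weights[name] * value for name, value in active.items()) / total_weight


# Placeholder weights and losses for illustration only.
weights = {"masked": 1.0, "contrastive": 0.5, "forecast": 0.5}
batch_losses = {"masked": 2.0, "contrastive": 1.0, "forecast": None}
combined = combine_losses(batch_losses, weights)
print(round(combined, 4))
```

Renormalizing by the active weights keeps the loss scale comparable between batches with different modality coverage.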

	| Objective | What the model learns | Example output |
	| --- | --- | --- |
	| Masked multimodal modeling | recover hidden video/depth/sensor tokens from context | reconstructed latent patches or sensor features |
	| Cross-modal contrastive alignment | align video, motion, audio, geometry, and language from the same time window | matching score or retrieval embedding |
	| Future-state prediction | predict what changes after the current window | future visual/depth/motion latent |
	| Ego-motion and hand-motion forecasting | model wearer/body dynamics | future camera delta or hand trajectory |
	| Action and procedure prediction | connect physical state to task semantics | action, subtask, transition, next action |
	| Language grounding and captioning | connect temporal windows to natural language | caption, object/action grounding, structured JSON |
	| Contact and affordance prediction | learn interaction state from human-object motion | contact state, relevant object set |
	| Optional policy-style targets | learn action-like outputs after target conversion | action token, motion chunk, retargeted policy target |
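To make the cross-modal contrastive row concrete, here is a dependency-free sketch of a symmetric InfoNCE loss, a common choice for this kind of alignment. Nothing here has been selected for the project; the temperature value and function names are assumptions for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(vector):
    norm = math.sqrt(dot(vector, vector)) or 1.0  # avoid dividing by zero
    return [x / norm for x in vector]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching query i to key i against all keys."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, query in enumerate(q):
        logits = [dot(query, key) / temperature for key in k]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_info_nce(a, b, temperature=0.07):
    """Average of the a->b and b->a directions."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


# Toy embeddings: row i of each list comes from the same time window.
video = [[1.0, 0.0], [0.0, 1.0]]
motion_aligned = [[1.0, 0.0], [0.0, 1.0]]
motion_shuffled = motion_aligned[::-1]

loss_aligned = symmetric_info_nce(video, motion_aligned)
loss_shuffled = symmetric_info_nce(video, motion_shuffled)
print(loss_aligned < loss_shuffled)
```

In a real training loop this would run on batched tensors; the list-based version only shows the arithmetic.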

	## Staged Pretraining Plan

	### Stage 0: Data Contract And Quality Gate

	Use the existing public-sample task suite to define the data contract. Before
	pretraining, every episode must pass a strict manifest check:

	- `annotation.hdf5` exists and is readable,
	- video streams are present or missing views are explicitly recorded,
	- audio can be extracted or marked unavailable,
	- depth, pose, mocap, IMU, calibration, and language fields are indexed,
	- windows are aligned by timestamp or frame index,
	- train/val/test splits are made at the episode level, so windows from one episode never leak across splits,
	- raw data remains outside public repos and Hugging Face artifact mirrors.
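The checklist above could be expressed as a small gate over a per-episode manifest entry. The sketch below assumes a hypothetical manifest schema (`annotation_readable`, `expected_views`, `views_present`, `views_missing`, `audio`, `indexed_fields`, `alignment`); none of these keys come from the dataset itself. It also shows one way to get episode-level splits: hash the episode id so that every window inherits its episode's split.

```python
import hashlib

REQUIRED_FIELDS = ("depth", "pose", "mocap", "imu", "calibration", "language")


def check_manifest(entry):
    """Return a list of problems; an empty list means the episode passes."""
    problems = []
    if not entry.get("annotation_readable"):
        problems.append("annotation.hdf5 missing or unreadable")
    accounted = set(entry.get("views_present", [])) | set(entry.get("views_missing", []))
    unaccounted = set(entry.get("expected_views", [])) - accounted
    if unaccounted:
        problems.append(f"views neither present nor recorded missing: {sorted(unaccounted)}")
    if entry.get("audio") not in ("extracted", "unavailable"):
        problems.append("audio status not recorded")
    absent = [f for f in REQUIRED_FIELDS if f not in entry.get("indexed_fields", [])]
    if absent:
        problems.append(f"fields not indexed: {absent}")
    if entry.get("alignment") not in ("timestamp", "frame_index"):
        problems.append("window alignment key not recorded")
    return problems


def assign_split(episode_id, val_pct=5, test_pct=5):
    """Deterministic episode-level split; percentages are placeholders."""
    digest = hashlib.sha256(episode_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


good_entry = {
    "annotation_readable": True,
    "expected_views": ["left", "right"],
    "views_present": ["left"],
    "views_missing": ["right"],
    "audio": "unavailable",
    "indexed_fields": list(REQUIRED_FIELDS),
    "alignment": "timestamp",
}
print(check_manifest(good_entry), assign_split("episode-0001"))
```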

	### Stage 1: 128-1,000 Episode Representation Pilot

	Start with a smaller model and a selected subset. The goal is to test whether
	the multimodal objectives train stably and improve held-out task performance.

	Recommended scale:

	- 128 to 1,000 episodes,
	- frozen or lightly trainable video/audio encoders at first,
	- 0.3B-1B temporal fusion model,
	- all available sensor modalities represented as tokens,
	- evaluation on the unified 20-task suite, the 180-result matrix, and future-state/retrieval probes.
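For a rough feel of what a 0.3B-1B temporal fusion model means in layer terms, the standard approximation for a decoder-style transformer is about 12 × layers × width² parameters, ignoring embeddings, biases, and normalization layers. The two configurations below are hypothetical examples chosen to land at the ends of that range, not proposed architectures.

```python
def approx_transformer_params(num_layers: int, d_model: int) -> int:
    """Rough parameter count: 4*d^2 for attention plus 8*d^2 for a 4x MLP per layer."""
    return 12 * num_layers * d_model * d_model


# Hypothetical configurations for illustration only.
small = approx_transformer_params(num_layers=24, d_model=1024)
large = approx_transformer_params(num_layers=20, d_model=2048)
print(f"{small / 1e9:.2f}B, {large / 1e9:.2f}B")
```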

	### Stage 2: 10K Episode Domain Model

	Scale after the pilot proves value. This stage should train a stronger
	Xperience-specific representation model rather than only fine-tuning a general
	omni model.

	Recommended scale:

	- thousands to 10K episodes,
	- 1B-3B parameter multimodal temporal model,
	- mixed supervised, contrastive, and predictive objectives,
	- held-out sessions and held-out activities,
	- robustness to missing camera views and sensor dropout.
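Robustness to missing views and sensor dropout is often trained for by randomly removing whole modalities from a window. The sketch below shows that idea under stated assumptions: the window is a plain dict keyed by modality name, and `drop_prob` and `always_keep` are illustrative parameters rather than project settings.

```python
import random


def drop_modalities(window, drop_prob=0.3, always_keep=1, rng=None):
    """Randomly remove whole modalities, keeping at least `always_keep` of them."""
    if rng is None:
        rng = random.Random()
    names = list(window)
    kept = [name for name in names if rng.random() >= drop_prob]
    if len(kept) < always_keep:
        # Restore randomly chosen dropped modalities until the floor is met.
        dropped = [name for name in names if name not in kept]
        rng.shuffle(dropped)
        kept += dropped[: always_keep - len(kept)]
    # Preserve the original modality order in the output.
    return {name: window[name] for name in names if name in kept}


window = {"video": "v-tokens", "audio": "a-tokens", "imu": "i-tokens"}
augmented = drop_modalities(window, drop_prob=0.5, rng=random.Random(0))
print(sorted(augmented))
```

The same function can be reused at evaluation time with a fixed seed to measure how much performance degrades when specific sensors are removed.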

	### Stage 3: Full-Corpus Xperience Embodied Foundation Model

	Use this stage only if storage, data throughput, and multi-node compute are
	available. The goal is a domain foundation model over embodied human experience,
	not a general internet-scale language model.

	Recommended scale:

	- all available Xperience-10M episodes,
	- 3B-7B domain model as a realistic first full-corpus target,
	- larger models only after scaling curves justify the cost,
	- mixture of reconstruction, retrieval, forecasting, language, and world-model
	objectives,
	- downstream evaluation on held-out episodes, held-out sessions, unseen
	objects, unseen activities, and downstream robotics/world-model tasks.

	## Hardware Requirements

	These are planning ranges, not measurements from completed runs in this repo.

	| Training goal | Typical compute | Storage and data path | Practical use |
	| --- | --- | --- | --- |
	| 0.3B-1B pilot | 8-32 modern 80GB-class data-center GPUs | tens of TB plus fast local cache | prove objectives and data loaders |
	| 1B-3B domain model | 32-128 GPUs | 100TB-scale cache, high-throughput decoding | serious research-scale pretraining |
	| 3B-7B full-corpus domain model | 128-512 GPUs | PB-scale storage plus 100-400Gbps networking | first full Xperience-native foundation model |
	| 30B-class omni model from scratch | 512-2,000+ GPUs | PB-scale storage, multi-node orchestration, large checkpoint budget | lab-scale project, not the first target |
	| frontier general omni model | thousands of GPUs | data beyond Xperience-10M plus large infrastructure | out of scope for this project |

	For full-corpus work, storage is as important as GPU count:

	- raw corpus storage at roughly the official dataset scale (about 1 PB),
	- 1.5-3x extra capacity for derived shards, caches, checkpoints, and metadata,
	- fast NVMe cache for active shards,
	- parallel media decoding and feature extraction workers,
	- distributed training with reliable checkpoint/restart,
	- per-episode provenance and split manifests.
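The storage bullets translate into a simple budget. The sketch below reads "1.5-3x extra capacity" as capacity on top of the raw corpus, which is an interpretation, and takes the roughly 1 PB raw figure from the dataset card. The transfer-time estimate assumes decimal units and an ideal, fully used link, so real read times would be longer.

```python
RAW_PB = 1.0  # "about 1 PB" reported on the dataset card


def storage_budget_pb(raw_pb: float, extra_factor: float) -> float:
    """Raw corpus plus derived shards, caches, checkpoints, and metadata."""
    return raw_pb * (1.0 + extra_factor)


def full_read_hours(size_pb: float, link_gbps: float) -> float:
    """Hours to stream `size_pb` once over one link at ideal line rate."""
    bits = size_pb * 1e15 * 8
    return bits / (link_gbps * 1e9) / 3600


low_pb = storage_budget_pb(RAW_PB, 1.5)
high_pb = storage_budget_pb(RAW_PB, 3.0)
hours_100g = full_read_hours(RAW_PB, 100)
hours_400g = full_read_hours(RAW_PB, 400)
print(f"total storage: {low_pb}-{high_pb} PB")
print(f"one pass over raw data: {hours_100g:.1f} h at 100 Gbps, {hours_400g:.1f} h at 400 Gbps")
```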

	## Evaluation Protocol

	The model should not be judged only by training loss. Evaluation should include:

	- JSON validity and structured task metrics from the current task suite,
	- action/subtask/contact/object metrics on held-out episodes,
	- text-to-window and window-to-text retrieval,
	- future ego-motion and hand-motion forecasting,
	- cross-modal reconstruction and missing-modality robustness,
	- held-out object/activity/session generalization,
	- qualitative inspection of retrieved or generated future states,
	- downstream transfer to Qwen3-Omni, Cosmos-style world modeling, and
	policy/action branches.
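Two of the metrics in this list are simple enough to sketch directly: JSON validity of generated outputs, and recall@k for text-to-window or window-to-text retrieval. The function names are hypothetical, and the retrieval sketch assumes the correct candidate for query `i` sits at index `i` of a similarity matrix.

```python
import json


def json_validity_rate(outputs):
    """Fraction of model output strings that parse as JSON."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(outputs)


def recall_at_k(similarity, k):
    """similarity[i][j] scores query i against candidate j; the match is j == i."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy inputs for illustration only.
validity = json_validity_rate(['{"action": "pour"}', "not json"])
similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.1, 0.7],
    [0.1, 0.2, 0.3],
]
r_at_1 = recall_at_k(similarity, 1)
r_at_3 = recall_at_k(similarity, 3)
print(validity, round(r_at_1, 3), r_at_3)
```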

	## Relationship To Existing Public Work

	The current public project is the harness for this future model:

	- the unified 20-task suite defines concrete input/output contracts,
	- minimal and neural baselines provide initial supervised targets,
	- audio/modality diagnostics show which signals contribute,
	- Qwen3-Omni LoRA provides the first trainable multi-episode adapter path,
	- Cosmos and policy branches define downstream model families,
	- the pretraining goal unifies these into a long-term representation-learning
	direction.

	The next practical step is still selected multi-episode preparation and
	held-out Qwen3-Omni LoRA evaluation. Full-corpus pretraining should come after
	the smaller scaling stages show measurable value.

	## Source Links

	- Official Xperience-10M dataset: https://huggingface.co/datasets/ropedia-ai/xperience-10m
	- Ropedia Xperience-10M release page: https://ropedia.com/blog/20260316_xperience_10m
	- Ropedia physical-AI data infrastructure page: https://ropedia-dev.com/