# Xperience Embodied Foundation Model Pretraining Goal

This document describes a future research direction for the project: a
domain-specific embodied foundation model pretrained on the full Xperience-10M
corpus, if full-episode access, storage, and compute become available.

Current status: this is a planning artifact. The public project currently
contains a public-sample task suite, lightweight baselines, Qwen3-Omni LoRA
preparation, and a smoke LoRA artifact. It does not currently contain a
from-scratch Xperience foundation model or full-corpus pretraining run.

## Why This Is A Natural Long-Term Goal

Xperience-10M is designed for physical-AI pretraining rather than only
single-task supervised learning. The official dataset card describes 10 million
experiences, 10,000 hours of synchronized first-person recordings, six video
streams, audio, stereo depth, camera pose, hand and full-body mocap, IMU, and
hierarchical language annotations. It also reports 2.88B RGB frames, 720M depth
frames, 576M pose/mocap frames, 7.2B IMU frames, and about 1 PB of total data.

That scale and alignment make a specific Xperience-native model plausible:
not a general web-scale omni model, but an embodied model specialized for
egocentric perception, human-object interaction, temporal dynamics, physical
state, and task intent.

## Target Model

The proposed model name is **Xperience Embodied Foundation Model**.

The model should learn a shared temporal representation of embodied experience:
what the wearer sees and hears, how the camera moves, how the body and hands
move, what objects are involved, what geometry is present, and what task is
being performed.

Expected modules:

| Module | Input | Role |
| --- | --- | --- |
| Multi-view video encoder | fisheye/stereo/RGB streams | visual state, egocentric context, object interaction |
| Audio encoder | synchronized MP4 audio | event cues, contact-like sound, temporal grounding |
| Depth and geometry encoder | depth, confidence, calibration | spatial structure and 3D/4D scene cues |
| Pose/SLAM encoder | camera trajectory and orientation | ego-motion, viewpoint, scene traversal |
| Mocap encoder | hand/body joints | human motion, hand-object interaction, affordance cues |
| IMU encoder | accelerometer/gyroscope streams | inertial dynamics and wearable motion |
| Language encoder/decoder | task/subtask/action/object annotations | semantic grounding and structured generation |
| Temporal fusion transformer | aligned per-window modality tokens | shared embodied representation across time |
| Task heads / decoders | fused representation | action, caption, future motion, retrieval, reconstruction, and world-state outputs |

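
The temporal fusion row assumes that every stream has already been sliced onto a common window grid. The sketch below shows one way that slicing could look. It assumes each stream is a sorted list of integer timestamps in milliseconds; the function name `window_slices`, the 20 Hz and 200 Hz sample rates, and the 1-second window are illustrative assumptions, not values taken from the dataset card or this project.

```python
from bisect import bisect_left


def window_slices(timestamps_ms, start_ms, end_ms, window_ms):
    """Return one (lo, hi) index range per window into a sorted timestamp list.

    Window k covers [start_ms + k * window_ms, start_ms + (k + 1) * window_ms).
    An empty range (lo == hi) marks a window in which the stream has no samples.
    """
    slices = []
    t0 = start_ms
    while t0 < end_ms:
        t1 = min(t0 + window_ms, end_ms)
        lo = bisect_left(timestamps_ms, t0)
        hi = bisect_left(timestamps_ms, t1)
        slices.append((lo, hi))
        t0 += window_ms
    return slices


# Hypothetical streams over two seconds: 20 Hz video and 200 Hz IMU.
video_ts = [i * 50 for i in range(40)]
imu_ts = [i * 5 for i in range(400)]

video_windows = window_slices(video_ts, 0, 2000, 1000)
imu_windows = window_slices(imu_ts, 0, 2000, 1000)
```

Because every stream is indexed against the same window boundaries, window `k` of the video and window `k` of the IMU refer to the same time span, and a missing stream simply yields empty ranges instead of breaking the alignment.
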
## Pretraining Objectives

The model should not rely on one loss. It should combine complementary
objectives so that every modality contributes to the shared representation.

| Objective | What the model learns | Example output |
| --- | --- | --- |
| Masked multimodal modeling | recover hidden video/depth/sensor tokens from context | reconstructed latent patches or sensor features |
| Cross-modal contrastive alignment | align video, motion, audio, geometry, and language from the same time window | matching score or retrieval embedding |
| Future-state prediction | predict what changes after the current window | future visual/depth/motion latent |
| Ego-motion and hand-motion forecasting | model wearer/body dynamics | future camera delta or hand trajectory |
| Action and procedure prediction | connect physical state to task semantics | action, subtask, transition, next action |
| Language grounding and captioning | connect temporal windows to natural language | caption, object/action grounding, structured JSON |
| Contact and affordance prediction | learn interaction state from human-object motion | contact state, relevant object set |
| Optional policy-style targets | learn action-like outputs after target conversion | action token, motion chunk, retargeted policy target |

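
As one concrete instance of the cross-modal contrastive objective, the sketch below implements a symmetric InfoNCE loss over a small similarity matrix, plus a weighted sum for combining several objectives. It is a dependency-free illustration, not the project's training code: the temperature of 0.07, the function names, and the loss weights are all assumptions.

```python
import math


def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity between window i in modality A and window j
    in modality B. Matching pairs sit on the diagonal.
    """
    n = len(sim)

    def directional_loss(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]  # negative log-softmax of the match
        return total / n

    columns = [list(col) for col in zip(*sim)]
    return 0.5 * (directional_loss(sim) + directional_loss(columns))


def combine_losses(losses, weights):
    """Weighted sum over whichever objectives are present in this batch."""
    return sum(weights[name] * value for name, value in losses.items())


aligned = [[1.0, 0.0], [0.0, 1.0]]   # matching windows score highest
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # matching windows score lowest
assert info_nce(aligned) < info_nce(shuffled)
```

Summing only over the objectives present in `losses` lets a batch that lacks, say, audio skip the audio-related terms without special-casing the training loop.
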
## Staged Pretraining Plan

### Stage 0: Data Contract And Quality Gate

Use the existing public-sample task suite to define the data contract. Before
pretraining, every episode must pass a strict manifest check:

- `annotation.hdf5` exists and is readable,
- video streams are present or missing views are explicitly recorded,
- audio can be extracted or marked unavailable,
- depth, pose, mocap, IMU, calibration, and language fields are indexed,
- windows are aligned by timestamp or frame index,
- train/val/test splits are made at the episode level, so windows from one
  episode never leak across splits,
- raw data remains outside public repos and Hugging Face artifact mirrors.

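
Two of the checks above can be sketched with the standard library alone. The first function records a manifest entry for one episode directory; the second assigns an episode-level split from a hash of the episode ID, so every window of an episode lands in the same split. Only the file name `annotation.hdf5` comes from this document. The video file names, the record fields, and the 5% validation and test fractions are hypothetical. The signature test is a cheap readability probe; a full check would open the file with an HDF5 library.

```python
import hashlib
from pathlib import Path

# First 8 bytes of an HDF5 file that has no user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def check_episode(episode_dir, expected_videos):
    """Return a manifest record for one episode directory.

    `expected_videos` is a hypothetical list of video file names. Missing
    views are recorded explicitly rather than silently skipped.
    """
    episode_dir = Path(episode_dir)
    annotation = episode_dir / "annotation.hdf5"
    annotation_ok = False
    if annotation.is_file():
        with annotation.open("rb") as fh:
            annotation_ok = fh.read(8) == HDF5_SIGNATURE
    missing = [v for v in expected_videos if not (episode_dir / v).is_file()]
    return {
        "episode_id": episode_dir.name,
        "annotation_ok": annotation_ok,
        "missing_views": missing,
    }


def assign_split(episode_id, val_pct=5, test_pct=5):
    """Deterministic episode-level split from a hash of the episode ID."""
    digest = hashlib.sha256(episode_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"
```

Hashing the episode ID, rather than shuffling windows, keeps the split stable when new episodes are added and makes it reproducible from the manifest alone.
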
### Stage 1: 128-1,000 Episode Representation Pilot

Start with a smaller model and a selected subset. The goal is to test whether
the multimodal objectives train stably and improve held-out task performance.

Recommended scale:

- 128 to 1,000 episodes,
- frozen or lightly trainable video/audio encoders at first,
- 0.3B-1B temporal fusion model,
- all available sensor modalities represented as tokens,
- evaluation on the unified 20-task suite, the 180-result matrix, and future-state/retrieval probes.

### Stage 2: 10K Episode Domain Model

Scale after the pilot proves value. This stage should train a stronger
Xperience-specific representation model rather than only fine-tuning a general
omni model.

Recommended scale:

- thousands to 10K episodes,
- 1B-3B parameter multimodal temporal model,
- mixed supervised, contrastive, and predictive objectives,
- held-out sessions and held-out activities,
- robustness to missing camera views and sensor dropout.

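
One common way to train for robustness to missing views and sensor dropout is to randomly mask whole modalities during training. The sketch below illustrates that idea. It is an assumed technique, not something the project has implemented, and the function name, the 0.3 drop probability, and the dictionary layout are hypothetical.

```python
import random


def drop_modalities(window, rng, p_drop=0.3, keep_at_least=1):
    """Randomly replace modality entries with None to simulate sensor dropout.

    `window` maps a modality name to its tokens. Modalities that are already
    None stay None. At least `keep_at_least` present modalities survive.
    """
    present = [name for name, tokens in window.items() if tokens is not None]
    dropped = [name for name in present if rng.random() < p_drop]
    # Restore randomly chosen modalities if too many were dropped.
    while len(present) - len(dropped) < min(keep_at_least, len(present)):
        dropped.remove(rng.choice(dropped))
    return {
        name: (None if name in dropped else tokens)
        for name, tokens in window.items()
    }


rng = random.Random(0)  # seeded so a run is reproducible
window = {"video": [1, 2], "audio": [3], "imu": [4, 5], "depth": None}
augmented = drop_modalities(window, rng)
```

Passing the random generator in explicitly keeps the augmentation reproducible per worker, which matters when comparing runs with and without dropout.
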
### Stage 3: Full-Corpus Xperience Embodied Foundation Model

Use this stage only if storage, data throughput, and multi-node compute are
available. The goal is a domain foundation model over embodied human experience,
not a general internet-scale language model.

Recommended scale:

- all available Xperience-10M episodes,
- 3B-7B domain model as a realistic first full-corpus target,
- larger models only after scaling curves justify the cost,
- mixture of reconstruction, retrieval, forecasting, language, and world-model
  objectives,
- downstream evaluation on held-out episodes, held-out sessions, unseen
  objects, unseen activities, and downstream robotics/world-model tasks.

## Hardware Requirements

These are planning ranges, not completed run measurements from this repo.

| Training goal | Typical compute | Storage and data path | Practical use |
| --- | --- | --- | --- |
| 0.3B-1B pilot | 8-32 modern 80 GB-class data-center GPUs | tens of TB plus fast local cache | prove objectives and data loaders |
| 1B-3B domain model | 32-128 GPUs | 100 TB-scale cache, high-throughput decoding | serious research-scale pretraining |
| 3B-7B full-corpus domain model | 128-512 GPUs | PB-scale storage plus 100-400 Gbps networking | first full Xperience-native foundation model |
| 30B-class omni model from scratch | 512-2,000+ GPUs | PB-scale storage, multi-node orchestration, large checkpoint budget | lab-scale project, not the first target |
| frontier general omni model | thousands of GPUs | data beyond Xperience-10M plus large infrastructure | out of scope for this project |

For full-corpus work, storage is as important as GPU count:

- raw corpus storage around the official dataset scale,
- 1.5-3x extra capacity for derived shards, caches, checkpoints, and metadata,
- fast NVMe cache for active shards,
- parallel media decoding and feature extraction workers,
- distributed training with reliable checkpoint/restart,
- per-episode provenance and split manifests.

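
The storage guidance can be turned into rough numbers. The sketch below uses the dataset card figures quoted earlier in this document (about 1 PB, 10,000 hours) and makes two labeled assumptions: that "1.5-3x extra capacity" means extra space equal to 1.5 to 3 times the raw corpus, and that a network link sustains its full line rate. Real throughput will be lower.

```python
PB = 10**15  # decimal petabyte, in bytes
GB = 10**9   # decimal gigabyte, in bytes

raw_bytes = 1 * PB        # "about 1 PB of total data" per the dataset card
recorded_hours = 10_000   # "10,000 hours" per the dataset card

# Average raw data volume per recorded hour.
gb_per_hour = raw_bytes / recorded_hours / GB

# Assumption: extra capacity = factor * raw size, for factor in [1.5, 3.0].
total_low_pb = (raw_bytes + 1.5 * raw_bytes) / PB
total_high_pb = (raw_bytes + 3.0 * raw_bytes) / PB


def hours_to_stream(num_bytes, gbps):
    """Hours needed to move num_bytes once at an ideal sustained line rate."""
    return num_bytes * 8 / (gbps * 1e9) / 3600


print(f"{gb_per_hour:.0f} GB per recorded hour")
print(f"{total_low_pb:.1f}-{total_high_pb:.1f} PB total provisioned")
print(f"{hours_to_stream(raw_bytes, 100):.1f} h per pass at 100 Gbps")
print(f"{hours_to_stream(raw_bytes, 400):.1f} h per pass at 400 Gbps")
```

Under these assumptions the raw corpus averages about 100 GB per recorded hour, total provisioned capacity lands between 2.5 and 4 PB, and one full pass over the raw data takes roughly 22 hours at 100 Gbps or under 6 hours at 400 Gbps.
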
## Evaluation Protocol

The model should not be judged only by training loss. Evaluation should include:

- JSON validity and structured task metrics from the current task suite,
- action/subtask/contact/object metrics on held-out episodes,
- text-to-window and window-to-text retrieval,
- future ego-motion and hand-motion forecasting,
- cross-modal reconstruction and missing-modality robustness,
- held-out object/activity/session generalization,
- qualitative inspection of retrieved or generated future states,
- downstream transfer to Qwen3-Omni, Cosmos-style world modeling, and
  policy/action branches.

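
Two of the listed metrics are simple enough to sketch directly. The functions below compute a JSON validity rate and recall@k for retrieval. They assume that a valid task output is a JSON object and that each query has exactly one correct window ID; both are assumptions for illustration, and the task suite's actual metric definitions may differ.

```python
import json


def json_validity_rate(outputs):
    """Fraction of model outputs that parse as a JSON object."""
    def is_valid(text):
        try:
            return isinstance(json.loads(text), dict)
        except (json.JSONDecodeError, TypeError):
            return False

    if not outputs:
        return 0.0
    return sum(is_valid(o) for o in outputs) / len(outputs)


def recall_at_k(ranked_ids, target_ids, k):
    """Fraction of queries whose target appears among the top-k results."""
    if not target_ids:
        return 0.0
    hits = sum(t in ranked[:k] for ranked, t in zip(ranked_ids, target_ids))
    return hits / len(target_ids)


# Hypothetical outputs and rankings for two text-to-window queries.
outputs = ['{"action": "pour"}', "not json", "[1, 2]", '{"a":']
rankings = [["w1", "w2", "w3"], ["w3", "w1", "w2"]]
targets = ["w2", "w2"]
```

With the sample data, one of the four outputs is a valid JSON object, and the target window is retrieved in the top two results for the first query only.
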
## Relationship To Existing Public Work

The current public project is the harness for this future model:

- the unified 20-task suite defines concrete input/output contracts,
- minimal and neural baselines provide initial supervised targets,
- audio/modality diagnostics show which signals contribute,
- Qwen3-Omni LoRA provides the first trainable multi-episode adapter path,
- Cosmos and policy branches define downstream model families,
- the pretraining goal unifies these into a long-term representation-learning
  direction.

The next practical step is still selected multi-episode preparation and
held-out Qwen3-Omni LoRA evaluation. Full-corpus pretraining should come after
the smaller scaling stages show measurable value.

## Source Links

- Official Xperience-10M dataset: https://huggingface.co/datasets/ropedia-ai/xperience-10m
- Ropedia Xperience-10M release page: https://ropedia.com/blog/20260316_xperience_10m
- Ropedia physical-AI data infrastructure page: https://ropedia-dev.com/