# Three Foundation Pipeline Tracks

Xperience-10M can support the three directions shown in the presentation, but they should be positioned as **pipeline tracks**, not as three already solved model claims. The same dataset can feed all three tracks because it combines egocentric and multiview video, audio, depth, camera pose, hand/body motion, inertial signals, object/contact annotations, and language captions.

## Track Summary

| Track | Question | Core inputs | First public pipeline | Current maturity |
| --- | --- | --- | --- | --- |
| Spatial intelligence models | Can the model recover and reason over space from video? | Multiview RGB, egocentric video, depth, camera pose, calibration, object cues, language questions. | Window or episode exporter that builds scene/object memory targets, then evaluates spatial QA, object permanence, counting, retrieval, and pose-aware consistency. | Ready as a pipeline and evaluation contract; strong claims require raw depth/pose access and held-out multi-episode evaluation. |
| Human-video world models | Can the model predict what happens next? | Observed video/audio/sensor windows, hand/body motion, object/contact state, action/subtask labels, future windows. | Future-state and future-action probes over the existing split, then Cosmos-style or latent world-model training with separate dynamics metrics. | Partially evidenced through current future-task probes and Cosmos-style branch artifacts; still needs stronger visual/latent future metrics. |
| Vision-language-action models | Can the model turn what it sees and reads into action? | Egocentric video, language captions, hand/body motion, contacts, objects, procedure/subtask labels. | Observation-language-to-action target conversion, action-chunk scoring, policy-token baselines, then VLA/policy model fine-tuning. | Feasible but gated by action-target conversion; do not claim policy quality until action tokens, normalization, and held-out policy metrics exist. |

## Published Direction Figures

The repo and public mirrors include three high-resolution direction images from the original direction slides. Spatial intelligence and human-video world modeling use the clean high-resolution slide PNGs supplied for publication and are exported as 2560-pixel public assets. The 2026-06-19 refresh verified the Spatial, Human-video, and VLA clean PNGs as committed source-slide assets. They are communication assets, not evidence of completed model-quality training. The exact technical scope remains the text and JSON contract in this document and `docs/data/three_foundation_pipelines.json`.

| Track | Enhanced public asset | Committed source |
| --- | --- | --- |
| Spatial intelligence models | `docs/assets/foundation-pipelines/spatial-intelligence-pipeline.png` | `docs/assets/foundation-pipelines/source-slides/spatial-intelligence-slide.png` |
| Human-video world models | `docs/assets/foundation-pipelines/human-video-world-model-pipeline.png` | `docs/assets/foundation-pipelines/source-slides/human-video-world-model-slide.png` |
| Vision-language-action models | `docs/assets/foundation-pipelines/vision-language-action-pipeline.png` | `docs/assets/foundation-pipelines/source-slides/vision-language-action-slide.png` |

The deterministic restoration script is `scripts/render_foundation_pipeline_diagrams.py`; it uses the three clean slide PNGs directly and keeps the original presentation-photo sources only as provenance.

## 1. Spatial Intelligence Pipeline

Purpose: train and evaluate models that turn flat video into spatial state and spatial reasoning.

Data contract:

- Inputs: multiview RGB, egocentric RGB, depth, camera pose, calibration, object labels, contact labels, optional language queries.
- Intermediate artifacts: synchronized camera window manifest, pose/depth availability report, scene/object memory records, object permanence targets, spatial relation targets, and spatial QA prompts.
- Outputs: object count, object persistence, relative location, 3D geometry consistency, multiview retrieval, camera-motion-aware scene memory, and language answers grounded in the scene.

First practical implementation:

1. Build a spatial-memory exporter over the same episode split discipline.
2. Start with metric depth and pose consistency tasks before adding heavier reconstruction.
3. Evaluate with retrieval rank, count accuracy, relation accuracy, and object-memory consistency.
4. Add qualitative scene-memory examples only when they are backed by saved target files and metrics.

What to avoid claiming now: full neural rendering, full 3D reconstruction, or general spatial intelligence unless those artifacts and held-out metrics exist.

## 2. Human-Video World Model Pipeline

Purpose: train and evaluate models that predict future state from human interaction video.

Data contract:

- Inputs: observed video/audio/sensor windows, hand/body motion, camera pose, object/contact state, action/subtask labels, and optional language context.
- Intermediate artifacts: observed/future window pairs, future label targets, action-conditioned target records, visual or latent reconstruction targets, and temporal consistency metadata.
- Outputs: next action, next subtask, future object set, future state embedding, camera-motion delta, contact transition, and future-window quality metrics.

First practical implementation:

1. Keep Qwen-style structured future probes for task-level interpretability.
2. Keep Cosmos-style branches separate because they answer dynamics and visual future questions, not JSON task classification.
3. Add latent or feature-reconstruction metrics before claiming world-model quality.
4. Compare future-task metrics by held-out episode, task family, and visible object/action family.

What to avoid claiming now: a strong world model from low structured scores alone.
The public result should say that the pipeline and first probes exist, while stronger future-state training remains the next step.

## 3. Vision-Language-Action Pipeline

Purpose: train and evaluate models that map visual-language context to action chunks or policy-compatible targets.

Data contract:

- Inputs: egocentric video, language captions, hand/body motion, object/contact state, action/subtask labels, and optional retargeting metadata.
- Intermediate artifacts: action-token vocabulary, action-chunk windows, normalization stats, retargeting report, leakage audit, and action-space model card.
- Outputs: next action, action chunk, object-conditioned action, contact state, subtask transition, and policy/VLA held-out metrics.

First practical implementation:

1. Define the action space before training any policy model.
2. Start with next-action, next-subtask, contact, and object-conditioned action tasks using the existing 20-task surface.
3. Add hand-trajectory or robot-compatible action chunks only after the conversion is traceable.
4. Treat OpenVLA, openpi, GR00T, Octo, and SmolVLA-style models as policy branches that inherit the same split, manifest, and package rules.

What to avoid claiming now: robot policy quality. The current project can build the conversion and scoring pipeline; policy quality needs action-space evidence and held-out model results.

## Shared Pipeline Discipline

All three tracks should reuse the same public discipline:

- episode-level train/validation/test split,
- manifest-first exporters,
- no target leakage from future labels or captions into inputs unless the task explicitly asks for them,
- task-specific metrics and saved predictions,
- public-safe packages that exclude raw private data and heavyweight base model weights,
- website and model cards updated only after validators pass.

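
The episode-level split and target-leakage rules in the list above can be sketched as follows. This is a minimal illustration, not the project's exporter: the `episode_id` field, the `inputs`/`targets` record layout, the 80/10/10 ratio, and every function name here are assumptions introduced for the example.

```python
import hashlib
from collections import defaultdict


def assign_split(episode_id, val_pct=10, test_pct=10):
    """Map an episode ID to a split using a stable hash (assumed 80/10/10 ratio)."""
    digest = hashlib.sha256(episode_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def split_windows(windows):
    """Group window records by the split of their parent episode.

    Each window is assumed to be a dict with a hypothetical "episode_id" key.
    """
    splits = defaultdict(list)
    for window in windows:
        splits[assign_split(window["episode_id"])].append(window)
    return dict(splits)


def episodes_in_multiple_splits(splits):
    """Return episode IDs that appear in more than one split (should be empty)."""
    seen = defaultdict(set)
    for split_name, windows in splits.items():
        for window in windows:
            seen[window["episode_id"]].add(split_name)
    return sorted(ep for ep, names in seen.items() if len(names) > 1)


def leaked_target_fields(record):
    """Return target field names that also appear among the inputs.

    The "inputs"/"targets" layout is a hypothetical record shape for this sketch.
    """
    return sorted(set(record["inputs"]) & set(record["targets"]))
```

Hashing the episode ID rather than the window ID keeps every window from one episode in the same split, which is the property the episode-level rule asks for. A validator could run `episodes_in_multiple_splits` and `leaked_target_fields` over each exported manifest and fail when either returns a non-empty list.
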
This framing lets the project pursue all three directions at once while keeping the claims precise: spatial intelligence is the geometry/reasoning pipeline, world modeling is the future-state pipeline, and VLA is the action-conversion and policy pipeline.
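
One way to keep the claims tied to evidence is a small gate that checks for saved artifacts before a track's claim is published. The sketch below is illustrative only: the track keys, the artifact file names, and the function names are hypothetical and do not describe the contents of `docs/data/three_foundation_pipelines.json` or any existing validator.

```python
from pathlib import Path
from typing import List

# Hypothetical evidence files per track, loosely following the
# "what to avoid claiming now" notes; the names are assumptions.
REQUIRED_EVIDENCE = {
    "spatial_intelligence": [
        "pose_depth_availability_report.json",
        "heldout_multi_episode_metrics.json",
    ],
    "human_video_world_model": [
        "latent_future_metrics.json",
        "heldout_future_task_metrics.json",
    ],
    "vision_language_action": [
        "action_token_vocabulary.json",
        "normalization_stats.json",
        "heldout_policy_metrics.json",
    ],
}


def missing_evidence(track: str, artifact_dir: Path) -> List[str]:
    """List the required evidence files for a track that are not on disk."""
    return [
        name
        for name in REQUIRED_EVIDENCE[track]
        if not (artifact_dir / name).is_file()
    ]


def claim_allowed(track: str, artifact_dir: Path) -> bool:
    """A model-quality claim is allowed only when no evidence file is missing."""
    return not missing_evidence(track, artifact_dir)
```

For example, `claim_allowed("vision_language_action", Path("artifacts/vla"))` would stay `False` until the action vocabulary, normalization stats, and held-out policy metrics have all been saved to that (hypothetical) directory.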