1 sample episode: task lab
One public episode becomes aligned windows, task targets, Minimal heads, and Neural MLP heads.
Inspect the sample files, task targets, and local baseline runs.
Selected-128 comparison rows and held-out model behavior.
The public suite has two evidence lines. Line 1 uses one public sample episode to make the 20-task lab inspectable and reproducible. Line 2 uses 128 selected episodes to compare aligned baselines, Qwen3-Omni v6 LoRA, Cosmos3-Super Reasoner, and Cosmos3-Nano Future Window. The public matrix is complete at 180/180 scored method-task records, with six compact-proxy cells explicitly marked.
Seven methods share the selected-episode surface and the same 20 task axes.
Compare same-split baselines, Qwen3-Omni v6, and Cosmos3 rows.
Direct raw-target metrics for the 6 proxy-marked cells.
This page is a guided entry point for one inspectable Xperience-10M sample line and one selected-128 comparison line. Use the detailed tables after you know which evidence line and task layer you are reading.
Each click answers one reader question before the archive-level evidence begins.
Use this landing section to understand the project before opening the archive-level evidence. First identify the two evidence lines, then the 20-task suite, then the result table and public mirrors.
The mark identifies the shared public package across the GitHub repository, GitHub Pages dashboard, Hugging Face Space, artifact dataset, model mirrors, and social preview. Use this area as the project identity checkpoint before reading the 1-episode and selected-128 evidence lines.
A public Xperience-10M task-suite package: one inspectable sample line plus a selected-128 comparison line, both organized around the same 20 task contracts.
Open the 20 tasks
Use the 1-episode line for reproducible task-head baselines. Use the 128-episode line for metadata/raw baselines, Qwen3-Omni v6, and Cosmos3 diagnostics.
Open the result table
Direct scores, compact-proxy scores, normalized radar values, adapters, and public mirrors stay labeled separately so the evidence type is visible before interpretation.
Check the glossary
The page keeps every public artifact, but the fastest path is: understand the 20 tasks, inspect the sample files, compare the two result lines, then open the exact reproducibility surface you need.
Read the project in that order. The 20 tasks are scored contracts; the four directions group what those scores study; the three pipelines define training recipes; the unified embodied model is the long-term integration target.
Every metric, radar axis, and method row uses these same 20 task contracts.
Human motion, 3D/4D reconstruction, egocentric interaction, and world modeling are groupings over the same tasks.
Spatial intelligence, human-video world models, and vision-language-action models reuse the same files with different input-output recipes.
The pipeline outputs converge toward perception, 3D memory, language, action, and planning in one model family.
Task cards, radars, and the 180-record table all use the same numbered task IDs.
Directions A-D group the same 20 tasks. They are not extra tasks or a second tier.
Spatial, world-model, and VLA pipelines are recipes for future scale-up and task-grounded model training.
This is the expandable integration goal, not an extra scored task axis in the 180-result matrix.
The generated image uses one node per task; this list gives the exact public task names.
Tasks can support more than one direction. This layer answers what the scores study; it is not a separate benchmark or task tier.
These are scale-up recipes for model training. They answer how the same files can be turned into larger model inputs and targets.
This target integrates perception, 3D memory, language, action, and planning. It is a research direction for scale-up, not a tenth method or twenty-first task.
Action, subtask, object, language, motion, synchronization, retrieval, and forecasting targets evaluated by each public method row.
Human Modeling & Motion Understanding; 3D/4D Reconstruction & Neural Rendering; Egocentric Vision & Interaction; Scene Reconstruction & World Modeling.
Spatial intelligence models; Human-video world models; Vision-language-action models.
A long-term model target where perception, spatial memory, language, action, and planning are trained against the same evidence structure.
| Line | Data unit | Score statement | Best use | Read separately from | Start here |
|---|---|---|---|---|---|
| 1 sample episode | One public Xperience-10M sample episode; 5,821 frames; 1,161 aligned 20-frame windows; 8,546-dimensional feature contract. | 40/40 direct scores from Minimal and Neural MLP heads. | Raw sample inspection, file organization, task definitions, local reproduction, and controlled Minimal-vs-Neural baseline behavior. | The selected-128 comparison rows and broader held-out model behavior. | Raw browser; 1-episode radar data; result summary data; evidence-line data |
| 128 selected episodes | Selected held-out 96/16/16 split; 34,269 exported windows; public-safe metadata/raw-feature artifacts linked to official gated episode paths. | 140/140 selected-128 scores: 134 direct + 6 compact-proxy. | Same-split comparison across metadata/raw baselines, Qwen3-Omni v6 LoRA, Cosmos3-Super, Cosmos3-Nano, and scale-up decisions. | Direct raw-target interpretation for the proxy-marked cells. | 128-episode radar data; source and feature index; HF selected-128 windows; result summary data; evidence-line note |
| Evidence line | Method block | Methods | Score statement | Read as |
|---|---|---|---|---|
| 1 sample episode | Task-head baselines | Minimal; Neural MLP | 40/40 direct scores. | Task-lab reproducibility and simple-vs-neural behavior. |
| 128 selected episodes | Aligned baseline heads | Metadata simple/NN; raw-feature simple/NN | 80/80 scores: 74 direct + 6 compact-proxy. | Same-split metadata/raw-feature baseline comparison. |
| 128 selected episodes | Qwen3-Omni series | Qwen3-Omni v6 LoRA | 20/20 direct scores from verified selected-128 Qwen3-Omni LoRA and task-specific probes. | Trainable Qwen3-Omni diagnostic baseline on the selected-128 surface. |
| 128 selected episodes | Cosmos3 series | Cosmos3-Super Reasoner; Cosmos3-Nano Future Window | 40/40 direct scores from verified public-safe reasoner and future-window artifacts. | Cosmos3 reasoner and future-window diagnostics on the selected-128 surface. |
Cosmos3-Super Forward-Dynamics LoRA is published as a separate fine-tuned adapter with weights/results; it is not counted as a 20-task matrix method row.
| Qwen run | Purpose | Main change | Eval signal | Use now |
|---|---|---|---|---|
| v1 | Prove the selected-128 LoRA/eval/package loop. | First verified 96/16/16 selected-episode Qwen3-Omni LoRA run. | 448 eval; JSON 0.8750; contact 0.6451. | Lineage only. |
| v2 | Make answers schema-checked. | Structured-JSON contract with full-8-GPU LoRA on the same split. | 448 eval; JSON 0.9978; contact 0.7188. | Structured-output ablation. |
| v3 | Separate prompt/eval effects from training. | Strict-label prompt/eval over the v2 adapter; no new adapter training. | 448 eval; JSON 1.0000; contact 0.7210. | Prompt/eval ablation. |
| v4 | Test longer structured-JSON LoRA training. | New four-epoch full-8-GPU adapter on the same selected split. | 448 eval; JSON 1.0000; contact 0.7299. | Overfit/metric-tradeoff evidence. |
| v5 | Move to denser multiscale evaluation. | Multiscale cap96 export with 4,032 held-out predictions. | 4,032 eval; JSON 1.0000; contact 0.7865. | Pinned prior release; stronger on several non-contact metrics. |
| v6 | Publish the current Qwen 20-task row. | Rank64/lr5e-5 multiscale LoRA plus verified task-specific probes. | 4,032 eval; JSON 0.9990; contact 0.8177. | Current public 20-task Qwen3-Omni row. |
Qwen v1-v6 are run-lineage labels inside the selected-128 evidence line, not project evidence lines. Use v6 for the public 20-task Qwen3-Omni row; keep v5 as the pinned prior multiscale comparator; read v1-v4 as pipeline-hardening and ablation evidence. Full details are available in the Qwen lineage data and Qwen lineage note.
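The JSON column in the lineage table reports the share of held-out outputs that parse as JSON. The sketch below shows one way such a rate could be computed. It is an illustration rather than the project's checker: the function name `json_validity` and the rule that an output must decode to a JSON object are assumptions, and the real validator may also enforce a schema.

```python
import json

def json_validity(outputs):
    """Fraction of model outputs that decode to a JSON object.

    Assumption: "valid" means json.loads succeeds and yields a dict.
    A schema check, if the project uses one, is not modeled here.
    """
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            continue  # malformed or non-string output counts as invalid
        if isinstance(parsed, dict):
            valid += 1
    return valid / len(outputs)

# Hypothetical outputs, for illustration only.
sample_outputs = [
    '{"action": "pour", "contact": true}',
    '{"action": "stir"',   # truncated, so invalid
    '["not", "an", "object"]',
    '{"action": "wipe", "contact": false}',
]
print(json_validity(sample_outputs))  # 0.5
```

If the reported rate is a plain valid/total ratio over the 4,032 held-out predictions, 0.9990 is consistent with 4,028 parseable outputs.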
Use this route if you need the project story, what is public, and how to read each result family.
Read overview
Use the 20-task suite, radar, matrix, and source audit to compare methods without losing metric provenance.
Open task suite
Use scripts, validators, mirrors, and checks when you want to rerun or trust the public package.
Open reproduce path
Use directions and scale-up resources for spatial, world-model, VLA, Qwen3-Omni, and Cosmos3 follow-up work.
Open directions
The project keeps source code, visual explanation, derived artifacts, model outputs, and release checks on different public surfaces. This map shows what each surface is responsible for before you dive into the full file set.
This panel now does one job: explain which public surface owns which type of evidence. Use the dedicated reading path for step-by-step onboarding, the glossary for terminology, and the artifact library for detailed file-level links.
Xperience-10M is much larger than the public sample. This project focuses on the sample available now, turns it into clear task contracts and baseline artifacts, and keeps the same data contract ready for held-out multi-episode training when more episodes are prepared.
A research-development lab for understanding synchronized egocentric multimodal data, defining embodied-AI tasks, and testing small baselines before omni-model fine-tuning.
The next model-quality stage is stronger action/subtask modeling on the same held-out split, using dense/multiscale windows before requiring more raw episodes.
Maps one public episode into synchronized windows across video, audio, depth, pose/SLAM, mocap, IMU, calibration, and language-derived signals.
Defines embodied-AI inputs, process modules, outputs, metrics, and case-study walkthroughs instead of treating the sample as a generic classification file.
Keeps chronological splits, predictions, confusion matrices, leakage notes, and single-episode limits visible before moving to broader model-quality reads.
Connects the same data contract to 128-episode baselines, a no-new-episode enhancement pack, Qwen3-Omni LoRA, Cosmos-style world modeling, policy/VLA tracks, and the later Xperience-native pretraining goal.
The overview keeps only the headline comparison so the same radar visuals are not repeated. Open the task-suite section for the enlarged unified radar, the 1-episode split radar, the 128-episode split radar, and the source-linked 180-record table.
Minimal and Neural MLP baselines cover 40/40 method-task records on the original public-sample episode.
Metadata, raw-feature, Qwen3-Omni, and Cosmos3 methods cover 140/140 rows with direct/proxy evidence notes preserved.
The task-suite section holds the large readable radars, normalized-score notes, and the downloadable chart data.
Use this as the front door for the project: it links the unified 20 tasks, four research tracks, current sample evidence, and the multi-episode Qwen3-Omni scale-up path.
One Xperience-10M public sample episode is converted into aligned windows and a documented feature contract.
The unified task suite has minimal and neural baseline evidence across one 20-axis task surface with shared windows, splits, and label discipline.
The public description is aligned to the official gated Xperience-10M dataset card, including modalities, scale, access, and current project coverage. The source snapshot records 31.9 TB on the HF surface, a full-scale storage statement of about 1 PB, 12,103 episode folders (upstream metadata, not a local data inventory), the public sample license cc-by-nc-4.0, HOMIE Toolkit and Rerun 0.29.0 as source tooling, and the official limited-diversity note. See the source-alignment record for exact provenance.
Metrics, figures, walkthroughs, baseline weights, Qwen3-Omni results, and Cosmos3 public-safe packages are staged across GitHub, GitHub Pages, and Hugging Face.
The first selected-episode LoRA pilot is packaged with real held-out predictions and metrics. It proves the pipeline, while the weak scores make it a baseline for error analysis.
Shows how the current selected split can be stressed without more episodes: dense windows, hierarchical labels, raw-feature shards, and `multiscale_20s10_40s20_80s40` as the next export target.
Raw MP4/HDF5/RRD files, private gated Xperience-10M data, and full Qwen weights are excluded from the public repo and HF mirrors.
The project path moves from the current public-sample task lab to the latest verified Qwen3-Omni diagnostic run, same-split 128-episode baseline alignment, a no-new-episode enhancement pack, action/subtask error analysis, robustness runs, world/policy tracks, and the future Xperience Embodied Foundation Model pretraining goal.
The main page keeps the roadmap readable: stage cards show what has shipped, what is active, and what evidence supports each step. The focused detail page shows the same track, task, scale-up, and evidence content in a single linear reader path.
One public episode is converted into aligned windows, task contracts, minimal baselines, neural heads, walkthroughs, and figures.
Prepare official gated episodes while preserving episode-level separation and recording missing-view coverage. The first selected split is available for Qwen3-Omni diagnostics.
Train lightweight adapters on selected prepared episodes and evaluate on held-out episodes with committed predictions, metrics, and run reports.
Align simple metadata/text baselines, raw-feature proxies, and neural MLP baselines to the same selected 96/16/16 split and the unified 20-task axes used by the public result matrix.
Use the same selected split, estimate dense/multiscale window exports, define hierarchical action/subtask targets, and prioritize raw-feature shards for tasks that metadata baselines cannot cover.
Keep the 96/16/16 split, tighten JSON decoding or target formatting, and analyze action/subtask failures before presenting stronger model-quality numbers.
Keep Qwen3-Omni as the first trainable held-out pilot, use Cosmos 3 for world modeling and forward-dynamics trainer development, and stage policy candidates after robot-compatible action targets are explicit.
Test whether pilot conclusions survive broader sessions, missing modalities, and stronger ablations.
Extend toward future-window prediction, action-conditioned world modeling, synthetic-data tests, policy-style next action, and affordance reasoning.
Pretrain an Xperience-native domain model over synchronized video, audio, depth, pose, mocap, IMU, and language after smaller scaling stages prove value.
The roadmap has three public surfaces: this concise planning view, a focused detail page for the full track and task map, and structured mirrors used by validators and HF/GitHub publication scripts.
Read the four research directions, linked task heads, stage gates, current evidence, and next steps in one clean sequence without nested controls.
Spatial intelligence, human-video world modeling, and VLA training use different inputs and outputs while sharing the Xperience-10M sample structure.
Use these notes to understand which selected-episode artifacts are public-safe, which raw files remain gated, and how the 128-episode comparison is staged.
The long-range goal is an Xperience-native embodied foundation model after smaller selected-episode stages prove value and infrastructure is ready.
These are not the best first reading path. They are kept here so the public site, README mirrors, HF bundles, and local validators point to the same source records.
This is the 3-pipeline layer of the public structure. It is separate from the four research directions: directions organize what the 20 task scores study, while these pipelines describe how Xperience-10M files can train larger spatial, world-model, and vision-language-action models.
Train spatial-memory models from multiview RGB, egocentric video, depth, pose, calibration, object/contact cues, and language prompts; evaluate spatial QA, object permanence, counting, retrieval, and pose-aware consistency.
Use windows.csv and shared_windows.npz to slice each 20-frame window, then join six MP4 RGB streams with annotation.hdf5 depth, camera pose, SLAM/calibration, object cues, contacts, and optional language questions.
Build targets such as camera-view match, object relevance, object-set memory, depth/pose reconstruction proxy, caption-grounded retrieval, and spatial QA answers derived from the same public annotation timeline.
Train future-prediction models from observed interaction windows to score next action, next subtask, future object set, contact transition, camera-motion delta, and latent future state, with Qwen-style probes and Cosmos-style dynamics kept separate.
Take the current 20-frame observed window at time t from shared_windows.npz: RGB/audio/sensor summaries, hand/body motion, camera pose, current object/contact state, and current action/subtask context only.
Shift the same episode timeline forward to produce next-action, next-subtask, future object-set, contact-transition, time-to-transition, camera-motion delta, or latent/future-feature targets. Future labels stay out of the input.
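The target construction above, where inputs come from the window at time t and labels come from a later point on the same timeline, can be sketched as follows. This is a minimal illustration under assumptions: the function names, the one-window horizon, and the action labels are hypothetical, and the project's exporter may define horizons differently.

```python
def next_action_targets(actions, horizon=1):
    """Pair each window index t with the action label at t + horizon.

    The last `horizon` windows have no future label and are dropped, so a
    future label never has to be invented or read from the input side.
    """
    return [(t, actions[t + horizon]) for t in range(len(actions) - horizon)]

def time_to_transition(actions):
    """Windows until the action label next changes; None if it never does."""
    out = [None] * len(actions)
    next_change = None
    for t in range(len(actions) - 2, -1, -1):
        if actions[t + 1] != actions[t]:
            next_change = t + 1
        out[t] = None if next_change is None else next_change - t
    return out

# Hypothetical per-window action labels from one episode timeline.
actions = ["reach", "reach", "grasp", "grasp", "place"]
print(next_action_targets(actions))
# [(0, 'reach'), (1, 'grasp'), (2, 'grasp'), (3, 'place')]
print(time_to_transition(actions))
# [2, 1, 2, 1, None]
```

Keeping the shift inside one episode matters: pairing the last window of one episode with the first label of another would create a transition that never happened.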
Train VLA or policy-compatible heads only after converting egocentric video, captions, hand/body motion, contacts, objects, and procedures into traceable action tokens, chunks, and object-conditioned action targets.
Use egocentric/fisheye video windows, caption/object context from annotation.hdf5, hand/body mocap, contact state, and current subtask text as the observation-language side of each training pair.
For the one-sample suite, output action-token proxies: current/next action, object-conditioned action relation, contact state, interaction-text class, subtask transition, or hand-trajectory/action-chunk proxy. Robot action chunks need a later retargeting converter.
Build an episode atlas, category tags, balance report, and split builder across activities, objects, scenes, sessions, people, and missing modalities.
Open development plan
Version train/val/test manifests, task cards, leakage checks, metric scripts, and reference baselines so future model scores are comparable.
Open benchmark protocol note
Train contrastive and masked-prediction encoders over synchronized video, audio, depth, pose, mocap, IMU, and language windows.
Open representation-learning plan
Mine action steps, transitions, preconditions, effects, and temporal graphs that connect egocentric perception to planning.
current task map
Add contact, reachable-object, tool-use, and next-affordance tasks using hands, mocap, objects, contacts, video, and language.
task walkthroughs
Fuse depth, pose/SLAM, multiview video, and object cues into persistent scene/object maps for spatial reasoning and object permanence.
model tracks
Track timestamp drift, missing streams, calibration consistency, corrupted files, and degraded-mode manifests before large training runs.
evidence contract
Convert mocap, hand trajectories, contacts, and object states into action tokens, robot-compatible targets, and imitation-learning examples.
foundation plan
The protocol is generated from committed metric artifacts so readers can see the exact data unit, split, task targets, leakage controls, and current limitations before comparing scores.
One 20-frame aligned window from the public sample episode, stride 5 frames, 1,161 windows total, represented by 8,546 synchronized multimodal dimensions.
Single-episode chronological 70/30 train/test split. This avoids random future-window mixing; cross-episode generalization is measured in the later multi-episode pilot.
protocol document
All 20 tasks list input, target, primary metric, baseline score, and source artifact path in the unified suite file.
task contract data
Scalers fit on train windows only; future labels, target-side signals, caption/object labels, and contact labels stay on the target side unless explicitly queried.
builder script
Audio and no-audio variants are evaluated across the walkthrough-backed task contracts under the same chronological split.
audio summary
Qwen3-Omni is the first trainable baseline, Cosmos 3 is the world-model track with a camera-pose proxy forward-dynamics contract ready for trainer work, policy models wait for robot-compatible action targets, and Xperience-native pretraining remains a later full-corpus goal.
backbone plan
This public-sample run covers single-episode task development. The selected multi-episode Qwen3-Omni final diagnostic result is verified and meets the JSON-validity target; Cosmos3-Nano has a verified future-window compatibility package; and Cosmos3-Super has a verified base-weight JSON-task evaluation plus a fine-tuned forward-dynamics LoRA branch. The next stage is action/subtask error analysis, stronger held-out metrics, and policy-target conversion.
result comparison
Before adding episodes, the suite should try `multiscale_20s10_40s20_80s40`, hierarchical action/subtask targets, label-normalized scoring, and compact raw-feature shards for unsupported tasks.
enhancement data
Future Omni, Cosmos, and policy tracks use the same episode split discipline, training metadata, held-out predictions, metrics, run report, and public-safe package gate.
scale-up status
The project shows the completed public-sample task suite and the first verified multi-episode Qwen3-Omni diagnostic pilot, then lays out the next quality-improvement and model-extension steps.
5,821 frames become 1,161 synchronized 20-frame windows with an 8,546-dimensional representation, 20 task contracts, and the current 180-record public result matrix.
Audio variants improve the primary metric on 6 walkthrough-backed task contracts, while the feature manifest and raw browser keep each modality source inspectable.
The Ropedia directions are labeled as direct, proxy, or diagnostic coverage and connected to spatial intelligence, human-video world-model, and vision-language-action training paths.
Qwen3-Omni is the trainable held-out LoRA track; Cosmos3 covers world-model diagnostics; OpenVLA/openpi/GR00T become policy candidates after robot-compatible action conversion; Xperience-native pretraining remains the later full-corpus goal.
The selected 96/16/16 split has a verified Qwen3-Omni v6 package with 4,032 held-out predictions and 99.90% JSON validity. Cosmos3-Nano, Cosmos3-Super Reasoner, and Cosmos3-Super Forward-Dynamics LoRA remain separate public-safe diagnostic branches.
The current selected-128 line has a source/feature index and can be pushed through dense/multiscale windows without changing the held-out episode split; the recommended scenario is `multiscale_20s10_40s20_80s40`.
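The page does not spell out how the scenario name `multiscale_20s10_40s20_80s40` is encoded. One plausible reading, treated here strictly as an assumption, is a list of window-length/stride pairs in frames (20/10, 40/20, 80/40). The sketch below parses the name under that reading and counts full windows per scale for a hypothetical 1,000-frame episode. The helper names are illustrative and not part of the project's export scripts.

```python
import re

def parse_multiscale(name):
    """Parse 'multiscale_<W>s<S>_...' into (window, stride) pairs.

    Assumption: each '<W>s<S>' token means window length W and stride S,
    both in frames. The project may define the tokens differently.
    """
    prefix = "multiscale_"
    if not name.startswith(prefix):
        raise ValueError(f"not a multiscale scenario name: {name!r}")
    pairs = []
    for token in name[len(prefix):].split("_"):
        match = re.fullmatch(r"(\d+)s(\d+)", token)
        if match is None:
            raise ValueError(f"unrecognized token: {token!r}")
        pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs

def count_windows(num_frames, window, stride):
    """Number of full windows; a trailing partial window is dropped."""
    if num_frames < window:
        return 0
    return (num_frames - window) // stride + 1

scales = parse_multiscale("multiscale_20s10_40s20_80s40")
print(scales)  # [(20, 10), (40, 20), (80, 40)]

# Hypothetical episode length, for illustration only.
print([count_windows(1000, w, s) for w, s in scales])  # [99, 49, 24]
```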
The website, GitHub repo, HF Space, artifact dataset, baseline model repo, consolidated weights/results repo, collection, official dataset, and HOMIE toolkit point to the same research package.
The logo, raw-sample stream thumbnails, task-suite figure, unified radar, model-architecture figures, and LoRA training-flow figure are indexed and reused across public surfaces.
The project shares derived task artifacts, figures, reports, lightweight baselines, reproduction commands, validators, and audit records. Raw videos, raw annotations, RRD visualizations, gated data, and full Qwen weights stay outside the repo.
A newcomer should be able to move from the dataset sample to the task design, model baselines, current limitations, and scale-up plan without reading every file first.
Start with the project brief, status, dataset context, task results, roadmap, and Qwen3-Omni scale-up notes. They separate implemented single-episode work from the prepared multi-episode stage.
Use the window table and feature manifest to see the aligned sample unit, modality sources, and leakage controls.
Every task has a small interpretable baseline and a matching neural MLP head over the same feature contract and chronological split.
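The chronological 70/30 split and the train-only scaler described in the protocol can be sketched in a few lines. This is a minimal sketch using only the standard library. The function names are illustrative, and the toy feature rows stand in for the 8,546-dimensional window features, which are not reproduced here.

```python
from statistics import mean, pstdev

def chronological_split(n_windows, train_fraction=0.7):
    """Earliest windows train, latest windows test; no shuffling."""
    cut = int(n_windows * train_fraction)
    return list(range(cut)), list(range(cut, n_windows))

def fit_scaler(train_rows):
    """Per-column mean and standard deviation from train rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return means, stds

def apply_scaler(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy stand-in features: 10 windows, 2 columns (hypothetical values).
features = [[float(i), 2.0] for i in range(10)]
train_idx, test_idx = chronological_split(len(features))

means, stds = fit_scaler([features[i] for i in train_idx])
train_scaled = apply_scaler([features[i] for i in train_idx], means, stds)
test_scaled = apply_scaler([features[i] for i in test_idx], means, stds)

print(train_idx, test_idx)  # [0, 1, 2, 3, 4, 5, 6] [7, 8, 9]
```

Because the test rows are scaled with statistics computed on the train rows, nothing about the later part of the episode leaks into preprocessing. With 20-frame windows at a 5-frame stride, windows on either side of the cut still share frames; whether the project inserts a gap at the boundary is not stated here.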
The multi-episode Qwen3-Omni path now has a final verified diagnostic package and public LoRA adapter. The native-pretraining plan shows how this can grow into a full-corpus research direction after action/subtask improvements and stronger task metrics.
Use the no-new-episode enhancement pack before requesting more storage: it records dense-window estimates, `multiscale_20s10_40s20_80s40`, hierarchical labels, and raw-feature shard priorities.
The official Xperience-10M card describes a gated, large-scale 4D egocentric multimodal dataset. This project records that full upstream scope while focusing the implemented artifacts on one public sample episode. The source-alignment record keeps the upstream facts visible on the public site: 31.9 TB on the HF surface, about 1 PB at full scale, 12,103 episode folders (upstream metadata, not a local data inventory), the cc-by-nc-4.0 public sample license, HOMIE Toolkit, Rerun 0.29.0, and the limited-diversity note.
Xperience-10M is a gated large-scale egocentric multimodal dataset for embodied AI, robotics, spatial intelligence, and world modeling.
The one-episode line builds the inspectable 20-task lab. Use Line 2 for selected-128 held-out comparison.
sample dataset
One public sample episode is exposed through the raw browser and modality atlas: synchronized video, embedded audio, HDF5 annotation groups, depth, pose/SLAM, mocap, IMU, calibration, language-derived signals, and source links in one place.
open raw browser
raw manifest
stream metadata
The selected 128-episode Qwen3-Omni LoRA v6 diagnostic run is verified with 4,032 held-out test predictions and 99.90% JSON validity. Action/subtask metrics are still weak, so this remains a baseline for error analysis.
LoRA adapter
v5/v6 comparison
Raw MP4, HDF5, and RRD files are streamed from the official public sample source when opened here; private gated data and full Qwen weights are not redistributed in this project.
data notice
The public task lab uses one sample episode with 5,821 frames, 1,161 aligned windows, 8,546-dimensional task inputs, and 20 integrated task contracts.
task suite
task contract data
Action/subtask labels, next-action prediction, temporal diagnostics, hand trajectory, contact, object relevance, caption grounding, retrieval, reconstruction, misalignment, long-horizon forecasting, interaction text, action-object relation, sensor bridging, camera sync, and transition timing.
summary metrics
This project is for research exploration and excludes identity recognition, surveillance, biometric profiling, sensitive-attribute inference, and safety-critical deployment.
use notes
Full audio-visual learning, caption generation, depth-pixel prediction, SLAM estimation, neural rendering, policy learning, cross-episode generalization, held-out Qwen3-Omni evaluation, and future Xperience-native pretraining.
native pretraining
Open each official Xperience-10M sample file from the project page. Video and audio use compact browser previews derived from the official MP4 files, with direct links beside them for the full raw Hugging Face sources. HDF5 and RRD files are shown with their role, size, organization, and direct source links.
Fisheye camera 0 stream and the public sample audio source. This file can be played as video and as the embedded audio track.
Playing a 12-second fast-start preview derived from the official raw MP4. Use the source link for the complete file.
Video features feed visual tasks; the embedded audio stream feeds audio ablation and acoustic feature blocks.
The official public sample is one episode folder. The task suite reads the HDF5 annotations and six synchronized MP4 streams, then writes 20-frame windows with a 5-frame stride.
xperience-10m-sample/
  annotation.hdf5
  fisheye_cam0.mp4
  fisheye_cam1.mp4
  fisheye_cam2.mp4
  fisheye_cam3.mp4
  stereo_left.mp4
  stereo_right.mp4
  visualization.rrd
The raw HDF5 is a binary container, so the browser shows its organization rather than loading the whole file into memory.
The source streams are summarized once here, next to the playable files and HDF5 map.
Small derived modality thumbnails remain in the modality atlas data; raw MP4, HDF5, and RRD files are not redistributed.
Task map, radar comparisons, task cards, and the 180-result table are kept in one reading flow. Start with the map, inspect the score surfaces, then open each task card for its input, process, output, metric, and all nine public method scores.
Each axis below has a task card with nine raw method scores, normalized radar values, source artifacts, and matching method-row entries in the 180-result matrix.
The unified radar keeps all nine methods in one comparison board, but groups them into small-multiple panels so each method family can be read directly. The split radars separate the 1-episode Minimal/NN baseline comparison from the 128-episode metadata/raw, Qwen3-Omni v6 LoRA, and Cosmos3-Super/Nano comparison.
Higher-is-better metrics are normalized to 0-1; lower-is-better metrics are converted to best/value within the task. The SVG uses sqrt(normalized score) only for visual radius, while raw values, linear normalized scores, status reasons, sources, and compact proxy notes remain in the data mirrors.
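The radar transform described above can be sketched as follows. The lower-is-better rule (best/value within the task) and the square-root radius come from the description. The treatment of higher-is-better metrics is an assumption: the sketch clips them to 0-1 as if they were already on that scale, which may differ from the project's script. Function and method names are hypothetical.

```python
import math

def normalized_score(value, task_values, higher_is_better):
    """Linear 0-1 score for one method on one task."""
    if higher_is_better:
        # Assumption: the metric is already on a 0-1 scale; clip to be safe.
        return min(max(value, 0.0), 1.0)
    best = min(task_values)  # lowest error across methods on this task
    if value <= 0:
        return 1.0  # a zero error cannot be beaten
    return min(best / value, 1.0)

def radar_radius(score):
    """Visual radius only; the linear score is what the data mirrors keep."""
    return math.sqrt(score)

# Hypothetical lower-is-better errors for three methods on one task.
errors = {"method_a": 0.5, "method_b": 1.0, "method_c": 2.0}
for name, err in errors.items():
    score = normalized_score(err, errors.values(), higher_is_better=False)
    print(name, round(score, 3), round(radar_radius(score), 3))
# method_a 1.0 1.0
# method_b 0.5 0.707
# method_c 0.25 0.5
```

The square root stretches small scores outward, so weak axes stay visible on the chart. That is why the radius is a plotting aid and not a value to cite.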
The matrix has 180/180 scored method-task records: 174 direct scores and 6 compact-proxy scores. The audit records the source artifact, metric key, and proxy reason for each marked cell.
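The record counts quoted on this page fit together arithmetically. The short check below recomputes them from the per-line numbers stated above (2 and 7 methods, 20 tasks, and the direct/proxy splits). The dictionary layout is illustrative and does not mirror any published file.

```python
TASKS = 20

# Per-line counts as stated on this page.
evidence_lines = {
    "1 sample episode": {"methods": 2, "direct": 40, "proxy": 0},
    "128 selected episodes": {"methods": 7, "direct": 134, "proxy": 6},
}

for name, row in evidence_lines.items():
    scored = row["direct"] + row["proxy"]
    # Every method is scored on every task, so methods x tasks = records.
    assert row["methods"] * TASKS == scored, name

total_methods = sum(row["methods"] for row in evidence_lines.values())
total_direct = sum(row["direct"] for row in evidence_lines.values())
total_proxy = sum(row["proxy"] for row in evidence_lines.values())

print(total_methods, total_direct + total_proxy, total_direct, total_proxy)
# 9 180 174 6
```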
Minimal and Neural MLP are both scored on all 20 public-sample task contracts in one enlarged panel without 128-episode methods competing for attention.
Seven aligned 128-episode methods cover all 20 axes across metadata/text, raw-feature, and foundation-model panels. Proxy axes stay labeled in the chart and source data.
Each card uses its assigned icon and shows the task name, input sources, process, output target, metric, and compact nine-method score panel. Use the filters for scanning; the cards stay tied to the task map, radar axes, and 180-record matrix.
The overall generated atlas keeps the icon family visible, while each task card below uses its own crisp assigned SVG for reliable loading and public mirrors.
Every script works from one data contract: aligned multimodal windows, explicit labels, cached feature extraction, and a manifest that makes omitted modalities visible.
This is the shortest way to understand the method section: start from official synchronized files, build aligned windows, create task targets, run method heads or model probes, then publish source-linked metrics.
RGB/fisheye/stereo video, audio, HDF5 annotations, pose, depth, calibration, IMU, mocap, and Rerun visualization sources.
20-frame windows with a 5-frame stride keep the input unit consistent across task heads and selected-128 exports.
Each window maps to one of the 20 contracts: classification, forecasting, retrieval, reconstruction, synchronization, or regression.
Minimal, Neural MLP, selected-128 metadata/raw baselines, Qwen3-Omni v6, Cosmos3-Super, and Cosmos3-Nano produce the public method rows.
Metrics, predictions, matrices, radar data, validators, and mirror parity checks are published with direct/proxy status preserved.
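The windowing step in this flow can be sketched directly from the stated numbers: 20-frame windows, a 5-frame stride, and 5,821 frames. The sketch assumes a trailing partial window is dropped. That assumption reproduces the published 1,161-window count, but the function name and the indexing convention are illustrative, not the project's builder.

```python
def window_starts(num_frames, window=20, stride=5):
    """Start indices of full windows; a trailing partial window is dropped."""
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, stride))

starts = window_starts(5821)
print(len(starts))                  # 1161
print(starts[:3])                   # [0, 5, 10]
print(starts[-1], starts[-1] + 20)  # 5800 5820 (end index is exclusive)
```

Consecutive windows overlap by 15 frames, which is one reason the single-episode split is chronological and not random.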
Raw valid episodes move through split validation, parallel export, video/audio/text formatting, sensor-bridge features, LoRA training, and sealed held-out evaluation.
It documents the selected 128-episode final diagnostic result and the action/subtask improvement path for stronger held-out metrics.
It demonstrates the full development loop: reading Xperience-10M sample data, aligning modalities, converting them into model-ready windows, defining meaningful tasks, producing metrics, and packaging artifacts for continued research.
General embodied-intelligence model quality requires many episodes and held-out episode splits; the public sample is the development harness for that next stage.
Read results in this order: choose the line, open the matching radar, inspect the matrix row, then check proxy flags before interpreting totals.
The score printed in each table cell is the value emitted by the runner or verified package. Use this value in text, tables, and comparisons.
The radar converts mixed metrics to a 0-1 plotting scale. It helps pattern recognition, but it is not a replacement for the raw metric.
Direct scores use the task target. Compact-proxy scores are bounded substitutes where a raw public target is unavailable, and stay marked before comparison.
| Reader need | Use this value | Why | Where it appears |
|---|---|---|---|
| Write a paper or README number | Raw metric value | This is emitted by the runner or verified result package and keeps the original metric scale. | 180-result table cells, result JSON, method summaries. |
| Compare shapes across many tasks | Normalized radar value | This is a 0-1 plotting transform for visual comparison only; cite the raw metric beside it. | Unified radar and split radar charts. |
| Interpret a low-availability target | Compact-proxy score plus its audit note | The task target is represented by a bounded substitute, so the proxy label and reason travel with the value. | Proxy audit, matrix badges, selected-128 baseline rows. |
Use 1 episode for the task lab. Use 128 episodes for the selected comparison surface.
The single-episode radar shows Minimal vs Neural MLP. The 128-episode radar shows metadata/raw baselines, Qwen3-Omni v6, Cosmos3-Super, and Cosmos3-Nano.
Each score keeps method, task, metric key, source artifact, and status.
Six selected-128 scores are compact proxies and stay marked in the audit.
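The record fields named above (method, task, metric key, source artifact, status) can be pictured as one small typed row. The following is a minimal Python sketch, not the package's actual schema: the class name `ScoreRecord`, the field names, the task ids, the artifact paths, and the `0.0` proxy value are hypothetical placeholders. Only the two action macro-F1 values and the method names appear on this page.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRecord:
    """One method evaluated on one task (hypothetical field names)."""
    method: str
    task: str
    metric_key: str
    value: float
    source_artifact: str
    status: str  # "direct" or "compact_proxy"


records = [
    ScoreRecord("Minimal", "action", "macro_f1", 0.0500,
                "hypothetical/minimal.json", "direct"),
    ScoreRecord("Neural MLP", "action", "macro_f1", 0.0148,
                "hypothetical/neural.json", "direct"),
    # Placeholder proxy row: the value and names are NOT real results.
    ScoreRecord("128ep Raw NN", "hypothetical_task", "hypothetical_metric", 0.0,
                "hypothetical/raw_nn.json", "compact_proxy"),
]


def status_counts(rows):
    """Count records per status so proxy cells stay visible in any summary."""
    return Counter(row.status for row in rows)


def direct_only(rows):
    """Keep only the records scored against the task target directly."""
    return [row for row in rows if row.status == "direct"]


print(status_counts(records))
print([row.method for row in direct_only(records)])
```

Keeping `status` on every row means a total can always be split into direct and proxy parts instead of blending them.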
Minimal and Neural MLP heads are both scored on all 20 public-sample task contracts. All 40 scores are direct task-target metrics.
A reproducible public task suite and baseline behavior check.
Metadata/raw baselines, Qwen3-Omni v6 LoRA, Cosmos3-Super Reasoner, and Cosmos3-Nano Future Window use the aligned 128-episode surface, which has 134 direct scores plus 6 compact-proxy scores.
A same-split comparison table with explicit source and proxy status.
| Line | Methods | Tasks | Scored records | Direct scores | Proxy scores | Machine-readable source |
|---|---|---|---|---|---|---|
| 1 sample episode | 2 | 20 | 40/40 | 40 | 0 | single-episode radar data |
| 128 selected episodes | 7 | 20 | 140/140 | 134 | 6 (each source-linked and reasoned) | 128-episode radar data |
| Total public matrix | 9 | 20 | 180/180 | 174 | 6 | two-line result summary data |
| Line | Block | Methods | Records | Evidence type | Primary artifact |
|---|---|---|---|---|---|
| 1 sample episode | Task-head baselines | Minimal; Neural MLP | 40 direct | Direct target metrics on the public sample windows. | single-episode radar data |
| 128 selected episodes | Aligned baseline heads | Metadata simple/NN; raw-feature simple/NN | 74 direct + 6 compact-proxy | Processed-target metrics where available; proxy cells remain source-linked. | score/proxy audit |
| 128 selected episodes | Qwen3-Omni series | Qwen3-Omni v6 LoRA | 20 direct | Verified selected-128 LoRA and task-specific probe artifacts. | model comparison data |
| 128 selected episodes | Cosmos3 series | Cosmos3-Super Reasoner; Cosmos3-Nano Future Window | 40 direct | Verified reasoner and future-window public-safe artifacts; forward-dynamics LoRA is a separate adapter artifact outside the 20-task method rows. | model comparison data |
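The counts in the two tables above can be cross-checked with a few lines of arithmetic. This sketch only re-adds the numbers printed in the tables; the tuple layout and function names are illustrative and do not come from the package's validators.

```python
TASKS = 20

# (line, block, methods, direct records, proxy records), copied from the tables.
BLOCKS = [
    ("1 sample episode", "Task-head baselines", 2, 40, 0),
    ("128 selected episodes", "Aligned baseline heads", 4, 74, 6),
    ("128 selected episodes", "Qwen3-Omni series", 1, 20, 0),
    ("128 selected episodes", "Cosmos3 series", 2, 40, 0),
]


def line_totals(blocks):
    """Sum methods, direct, and proxy counts per evidence line."""
    totals = {}
    for line, _block, methods, direct, proxy in blocks:
        entry = totals.setdefault(line, {"methods": 0, "direct": 0, "proxy": 0})
        entry["methods"] += methods
        entry["direct"] += direct
        entry["proxy"] += proxy
    return totals


def check_blocks(blocks):
    """Every block must hold exactly methods x tasks records."""
    return all(methods * TASKS == direct + proxy
               for _line, _block, methods, direct, proxy in blocks)


totals = line_totals(BLOCKS)
grand_total = sum(t["direct"] + t["proxy"] for t in totals.values())
print(check_blocks(BLOCKS), totals, grand_total)
```

With the table values, the 128-episode line sums to 134 direct plus 6 proxy records, and both lines together give the 180 records of the public matrix.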
Each cell shows the raw metric value to cite, the normalized radar value, the metric key, and a direct/proxy badge. The table is generated from the same structured matrix data used by the radar, so values stay aligned across GitHub, the website, and Hugging Face mirrors.
The result summary table has six columns: Method, Line, Records, Direct, Proxy, and Scope. The first five carry these definitions:
| Column | Meaning |
|---|---|
| Method | One named method family in the matrix, such as Minimal, 128ep Raw NN, Qwen3-Omni v6, or Cosmos3-Super. |
| Line | A reading lane for a group of results: Line 1 is one public sample episode; Line 2 is selected-128 held-out comparison. |
| Records | One method evaluated on one task. 9 methods x 20 tasks gives 180 public result records. |
| Direct | A metric computed against the task target directly. This is the preferred score type in the 20-task matrix. |
| Proxy | A bounded proxy metric when a direct raw target is not publicly available. It stays explicit so readers do not over-read it. |
Best-practice reading rule: compare methods within the same evidence line first, then use the proxy badges before interpreting cross-method totals. Six compact-proxy cells are intentionally visible rather than blended into direct raw-target scores.
Motion-only and all-feature classifiers use lightweight heads so the comparison stays readable on a laptop and easy to inspect. They now sit beside the result matrix instead of opening a separate results page.
The neural baseline keeps the same windows, splits, and leakage filters, then swaps in small PyTorch MLP heads. Read it as a nonlinear-head check before heavier Qwen3/Cosmos model branches.
These charts keep the main lesson visible: within-episode labels can be easy under some splits, while retrieval, grounding, forecasting, and alignment remain the useful probes.
Open the single-episode explorer to inspect window-level labels, predictions, modality statistics, object labels, and diagnostic scores. The audio ablation summary records the task-by-task audio contribution.
The public sample is converted into 5,821 frames, 1,161 aligned 20-frame windows, and an 8,546-dimensional representation for repeatable task evaluation.
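The frame and window counts above are consistent with a plain sliding window that drops the incomplete tail. The sketch below assumes exactly that rule; the function names are illustrative, and the suite's own windowing code may handle boundaries differently.

```python
def count_windows(num_frames: int, window: int = 20, stride: int = 5) -> int:
    """Number of full windows that fit when sliding by `stride` frames."""
    if num_frames < window:
        return 0
    return (num_frames - window) // stride + 1


def window_bounds(num_frames: int, window: int = 20, stride: int = 5):
    """Yield (start, end_exclusive, center) frame indices for each full window."""
    for start in range(0, num_frames - window + 1, stride):
        yield start, start + window, start + window // 2


bounds = list(window_bounds(5821))
print(count_windows(5821))    # 1161
print(bounds[0], bounds[-1])  # (0, 20, 10) (5800, 5820, 5810)
```

Under this rule, 5,821 frames with a 20-frame window and a 5-frame stride give (5821 - 20) // 5 + 1 = 1,161 windows, which matches the stated count.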
All-feature action reaches 0.9829 macro-F1 on its local split, while the chronological action head in the core task suite is 0.0500 macro-F1 with four unseen later action labels.
takeaways
Hand MPJPE improves from 0.8647 to 0.1079; temporal-order F1 rises from 0.5400 to 0.8520; misalignment F1 rises from 0.5052 to 0.7153.
metrics
Ridge/cosine retrieval remains stronger than the neural projection here, and cross-modal feature reconstruction still has negative R².
retrieval metrics
The next credible model-quality unit is a held-out multi-episode pilot across different sessions, not more adjacent windows from one sample.
scale-up status
This is the 4-direction layer of the public structure. Each direction groups task evidence by research question; it does not create a separate task set. Each task is mapped as direct, proxy, or diagnostic evidence using the same minimal and Neural MLP baseline contracts.
Direct evidence comes from hand trajectory forecasting and contact prediction; action and object relevance are supporting proxies.
Cross-modal retrieval, modality reconstruction, and misalignment detection check reconstruction prerequisites, not full geometry.
Action, subtask, transition, next-action, object, caption, order, and alignment tasks directly stress egocentric understanding.
Current probes cover task state, object relevance, retrieval, reconstruction, temporal order, and alignment, but there is no persistent map yet.
Softmax, logistic, ridge, and retrieval heads keep every input/output contract readable. They are the first sanity check for whether a task is well-posed.
Small PyTorch MLP classifiers/regressors reuse the same features and splits. They test nonlinear gains before heavier Omni fine-tuning.
All 20 tasks live in the same task table, task-card grid, radar, and 180-record result matrix. Historical result paths are retained only for exact provenance links.
The public task package has one structured 20-task contract, per-task metrics, prediction/rank files, reader summaries, radar charts, and the 180-record method-task matrix.
Open task contract data · Open 180-record matrix · Open unified radar
Every task uses the same 20-frame window unit, 5-frame stride, 8,546-dimensional feature manifest, chronological split discipline, and minimal/neural comparison pattern unless a task-specific leakage rule removes target-side features.
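A task-specific leakage rule of this kind can be sketched as a filter over named feature groups that records what it dropped. Everything in the example is hypothetical: the group names, the tiny group sizes, the task ids, and the function name are placeholders, and the real manifest has 8,546 dimensions.

```python
# Hypothetical feature groups and sizes (illustration only).
FEATURE_GROUPS = {
    "video": 8,
    "depth": 4,
    "imu": 6,
    "camera_translation": 3,
    "caption": 5,
}

# Hypothetical task ids mapped to the target-side groups they must not see.
TASK_EXCLUDES = {
    "camera_motion_forecast": {"camera_translation", "caption"},
    "action_progress": {"caption"},
}


def build_task_manifest(task: str) -> dict:
    """Return the used column ranges and the omitted groups for one task."""
    excluded = TASK_EXCLUDES.get(task, set())
    used, omitted, offset = [], [], 0
    for group, size in FEATURE_GROUPS.items():
        span = (group, offset, offset + size)  # half-open column range
        offset += size
        if group in excluded:
            omitted.append(group)  # stays listed so the omission is auditable
        else:
            used.append(span)
    return {
        "task": task,
        "used": used,
        "omitted": omitted,
        "input_dim": sum(stop - start for _, start, stop in used),
    }


manifest = build_task_manifest("camera_motion_forecast")
print(manifest["input_dim"], manifest["omitted"])
```

Listing the omitted groups next to the used ones is what lets a reader audit the leakage rule instead of trusting it.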
Historical provenance data and historical provenance chart remain available for exact source tracing.
Case: classify fast reach/pour windows as high motion and steady holding windows as low motion.
Input: non-mocap video, depth, pose, IMU, SLAM, calibration, and language features.
Output: high_motion or low_motion.
Case: retrieve the synchronized stereo-left window from a fisheye-camera query.
Input: fisheye_cam0 video features against stereo_left candidate features.
Output: ranked synchronized view candidates.
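A ranked-candidate output like this one can be illustrated with plain cosine similarity, where the synchronized candidate shares the query's window index. This is a sketch under that assumption with made-up 3-dimensional toy vectors; it is not the suite's retrieval head, and the function names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(query, candidates):
    """Return candidate indices sorted by descending similarity to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def recall_at_k(queries, candidates, k=1):
    """Fraction of queries whose same-index candidate lands in the top k."""
    hits = sum(1 for i, q in enumerate(queries)
               if i in rank_candidates(q, candidates)[:k])
    return hits / len(queries)


# Toy features: row i of each list stands for the same synchronized window.
fisheye = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
stereo_left = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]

print(rank_candidates(fisheye[0], stereo_left))
print(recall_at_k(fisheye, stereo_left, k=1))
```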
Case: estimate whether a Pour coffee window is near the start, middle, or end of its action segment.
Input: non-caption multimodal features.
Output: 0-to-1 progress inside the current action.
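One plausible way to derive a 0-to-1 progress target is the linear position of the window's center frame inside its action segment. The sketch assumes that definition; the function names, the frame numbers, and the start/middle/end thirds are illustrative and are not taken from the task contract.

```python
def action_progress(center_frame: int, seg_start: int, seg_end: int) -> float:
    """Linear position of center_frame inside [seg_start, seg_end], clipped to 0..1."""
    if seg_end <= seg_start:
        raise ValueError("segment must span at least one frame")
    fraction = (center_frame - seg_start) / (seg_end - seg_start)
    return min(1.0, max(0.0, fraction))


def progress_bucket(progress: float) -> str:
    """Coarse start/middle/end reading of a progress value (thirds assumed)."""
    if progress < 1 / 3:
        return "start"
    if progress < 2 / 3:
        return "middle"
    return "end"


# Hypothetical segment: an action labeled from frame 100 to frame 200.
p = action_progress(center_frame=150, seg_start=100, seg_end=200)
print(p, progress_bucket(p))  # 0.5 middle
```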
Case: predict how the camera translation changes over the next 20 frames.
Input: current sensors excluding camera translation and captions.
Output: future camera-translation delta vector.
The four research directions now have coded extension probes, prediction/rank CSVs, structured metrics, a reader summary, and a website chart generated from real sample-window features.
A full research result still needs many Xperience-10M episodes, held-out episode splits, stronger encoders, and direction-specific models such as body priors, renderers, or persistent scene graphs.
The diagram separates the shared episode-window representation from the task-specific heads, so the task contracts stay readable before scaling to larger models.
Each task uses a common research name and a concrete case study, then opens into the input, middle modules, output, modality evidence, metric, and current limitation.
Input: inspect the 20-frame multimodal window before choosing the target.
In the coffee-making sample, a pouring window maps to the current action label.
Metric: macro-F1. Minimal 0.0500; Neural MLP 0.0148.
Current limitation: single-episode chronological split.
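The limitation can be made concrete with a toy example: when one episode is split by time, actions that only occur late never appear in training, and macro-F1 drops because those classes score zero. The sketch uses hypothetical labels and a 70% cut, and it averages F1 over every label seen in either the targets or the predictions; the suite's split and metric code may differ.

```python
def chronological_split(labels, train_pct=70):
    """Split a time-ordered label list at train_pct percent, with no shuffling."""
    cut = (len(labels) * train_pct) // 100
    return labels[:cut], labels[cut:]


def unseen_labels(train, test):
    """Labels that occur in the test part but never in the training part."""
    return sorted(set(test) - set(train))


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over labels in targets or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical time-ordered window labels for one episode.
timeline = ["grind"] * 4 + ["pour"] * 3 + ["stir"] * 3
train, test = chronological_split(timeline)

# A head trained only on `train` can never emit "stir"; assume it guesses "pour".
predictions = ["pour"] * len(test)

print(unseen_labels(train, test))
print(macro_f1(test, predictions))
```

A shuffled split of the same windows would place every label on both sides, which is one reason a local-split score and a chronological score can differ so sharply.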
Each task reads an aligned window, not a black-box feature blob. The cards below show the path from public-sample streams to window rows and feature groups, then the chart shows the size of each input block.
Six synchronized MP4 camera streams, embedded audio, HDF5 annotations, depth, pose/SLAM, mocap, IMU, calibration, and language-derived fields remain tied to the public sample episode.
Open sample browser
The task suite builds 20-frame windows with a 5-frame stride. Each row records episode id, start frame, end frame, center frame, timestamps, and split assignment.
Open window manifest
The feature contract groups the 8,546 dimensions by modality and derived signal. Omitted or target-side fields stay visible so leakage checks can be audited.
Open feature contract
Where the signal comes from: video, audio, depth, pose/SLAM, mocap, IMU, calibration, or language-derived annotation fields.
The named column group used by Minimal baselines, Neural MLP heads, and public-sample task cards.
Bar length means number of input dimensions. It is not a model score and it is not a task ranking.
Read this as provenance. A task card may say “20-frame multimodal window”; this section shows which source streams and derived feature groups make up that window before any task-specific label is attached.
Back to task cards
This glossary covers the overloaded project terms plus adjacent technical terms from embodied AI, egocentric multimodal data, spatial geometry, world models, VLA/policy learning, training, evaluation, and public artifact reading.
Episode, window, modality, fisheye, depth, IMU, calibration, and synchronization terms are grouped with the project data boundary terms.
Camera pose, SLAM, point clouds, rollouts, forward dynamics, long-horizon forecasting, and temporal leakage are defined next to the relevant tasks.
Action chunks, policies, imitation learning, behavior cloning, end effectors, dexterity, contact, and language grounding are included for extension readers.
Direct/proxy scores, raw metrics, radar values, held-out evaluation, Qwen/Cosmos branches, adapters, and HF mirrors remain explicitly separated.
| Term | Category | Meaning here | Use it for | Do not confuse with |
|---|---|---|---|---|
| Egocentric video | multimodal sensing | Video captured from a first-person or body-mounted viewpoint. | The sample streams are egocentric views of human interaction and are the visual basis for many tasks. | Third-person robot-camera footage. |
| Camera pose | spatial geometry | The camera position and orientation at a time step. | Supports spatial-intelligence tasks, view synchronization, and geometry diagnostics. | The human body pose. |
| Forward dynamics | temporal and world models | Predicting the next state from the current state and action/context. | The Cosmos3-Super LoRA branch uses a forward-dynamics-style diagnostic contract. | Reverse inference from result back to cause. |
| Vision-language-action model | robotics and VLA | A model that maps visual context and language into actions. | The VLA direction is a future path after action targets are converted into robot-compatible chunks. | A vision-language model that only answers text. |
| Direct score | tasks and metrics | A metric computed against the task target directly. | The preferred score type in the 20-task matrix. | Compact-proxy score. |
| Held-out evaluation | training and evaluation | Testing on examples not used for training. | Required before promoting Qwen/Cosmos results to public evidence. | Training-set loss. |
If you want a public surface, a score check, a reproducible run, or the next model-training package, start with the route cards below. The full artifact library remains available after that first choice. Raw Xperience-10M data and Qwen weights are not redistributed.
Open GitHub, HF Space, artifact dataset, baseline models, or consolidated weights/results without guessing.
Open public surfaces
Use validators, source alignment, mirror parity, and live URL/hash checks before trusting a number.
Open checks
Start from scripts, windows, feature manifests, task contracts, and minimal/neural result outputs.
Open commands
Use Qwen3-Omni v6, Cosmos3-Super/Nano packages, the 128-episode feature index, and foundation-model plans for the next runs.
Open scale-up
Start with the files that define the sample windows, modality inputs, task contracts, metrics, walkthroughs, and research-direction mapping.
Every task definition, split detail, feature dimension, and minimal/neural metric in one project output.
Window start/end frames and aligned action/subtask labels for the public sample episode.
window table
Source map for the current modality inputs used by the task suite.
feature inputs
Per-task PyTorch MLP metrics, predictions, histories, and checkpoints for the unified task contracts, with historical result-bundle paths retained for provenance.
neural MLP outputs
Maps the walkthrough-backed task contracts to the four research tracks: human modeling, 3D/4D reconstruction, egocentric interaction, and world modeling.
research direction outputs
Four coded probes, one per research direction, with minimal and neural metrics plus prediction/rank CSVs.
extension probe outputs
Case studies for the walkthrough-backed task contracts, including input, middle process modules, output, metric, limitation, and task-player data.
walkthrough outputs
All 72 task/variant rows comparing current audio, no audio, raw audio, replacement, and combined-input settings.
audio ablation outputs
Interactive window-level view of labels, predictions, modality statistics, object labels, and diagnostics.
open explorer
The strongest self-supervised signal from the single episode.
retrieval metrics
Use these files to navigate the whole project, open the published mirrors, or reproduce the public-sample pipeline.
Single navigation view for GitHub, GitHub Pages, HF Space, artifact dataset, baseline model repo, Qwen3-Omni/Cosmos3 repos, and result-reading lanes.
Definitions for project-specific terms plus broader embodied-AI, egocentric multimodal data, spatial geometry, world-model, VLA, training, evaluation, adapter, and public mirror terms.
Human-readable map from project scope to data contract, task evidence, platform mirrors, and scale-up status.
Training, visualization, taxonomy, walkthrough, validator, and omni-readiness scripts.
scripts/
The dashboard packaged as a public static Space.
HF Space
Static dashboard container published to GitHub Container Registry for local browsing with Docker, without raw data or model weights.
GHCR package
Metrics, predictions, docs, and lightweight derived files without raw data redistribution.
artifact collection
Minimal NumPy softmax, ridge baselines, and neural task-head model files.
model repo
Consolidated public-safe baseline weights, Qwen3-Omni and Cosmos3 adapters/packages, verified results, analysis files, and manifest.
weights/results repo
Space, artifacts, baseline models, Qwen3-Omni v6 LoRA, Cosmos3-Super, and Cosmos3-Nano repos grouped into one public project collection.
collection
Classifier metrics, predictions, confusion matrix, and model weights.
model metrics
Compact route through the project for readers who want the shortest path from scope to results after choosing a surface.
project packet
The multi-episode Qwen3-Omni path is documented, scripted, and verified as a validation-monitored held-out pilot. The next stronger metrics come from structured-output and error-analysis improvements.
Groups Line 1 task-head baselines and Line 2 selected-128 methods: metadata/raw baselines, Qwen3-Omni v6 LoRA, Cosmos3-Nano Future Window, and Cosmos3-Super Reasoner.
Maps every selected official Xperience-10M episode id to its gated source tree and the public-safe processed features: Qwen v6 multiscale windows, dense multiscale rows, and metadata matrices.
Canonical no-new-episode plan for denser supervision: `multiscale_20s10_40s20_80s40`, hierarchical action/subtask labels, stronger scoring slices, and raw-feature shard priorities.
enhancement data
Backbone selection matrix covering Qwen3-Omni, Cosmos 3, GR00T, OpenVLA/openpi, Gemini Robotics, Octo, SmolVLA-style policy candidates, and the future Xperience-native pretraining goal.
foundation model plan
Public data-access path, selected 128-episode pilot plan, and preparation requirements.
data access
Separates the 1-episode sensor-adapter smoke test from Qwen run v1-v6. v6 is the current 20-task matrix row, while v5 remains the pinned prior release.
Qwen v1-v6 lineage · Qwen group
Shows the verified Nano future-window compatibility package, the Super base-weight Reasoner JSON-task evaluation, and the Super fine-tuned forward-dynamics LoRA artifact with separate loss metrics.
Cosmos groups
Future runs need validation tracking, held-out predictions, quality-target reporting, and the same public-safe package gate.
training requirements
Future plan for a domain-specific embodied foundation model trained from scratch over full-corpus video, audio, geometry, motion, inertial, and language streams.
pretraining plan
This tab now does one job: show the audit files that prove the public pages, mirrors, and package contents are internally consistent.
Checks the expected public package files, docs, figures, structured data, and artifact boundaries before a release is trusted.
Validates local website links, referenced data files, asset paths, and generated dashboard dependencies.
website integrity
Compares GitHub, HF Space, artifact dataset, baseline model repo, and weights/results mirror snapshots.
mirror parity
Records public-surface readiness, reader-map links, live status pointers, and cross-repo publication checks.
surface QA
Tracks live URLs and hash checks used after publishing to GitHub Pages and Hugging Face surfaces.
live status
Separates official Xperience-10M source facts from local/public project inventory and derived artifacts.
source alignment
Checks task-count, task-contract, and result-matrix consistency across generated public data files.
task surface
Summarizes build, validation, mirror, and publication gates that should pass before readers rely on the release.
quality gates
The selected pilot uses 128 source-balanced episodes across 128 different session UUIDs. The latest v6 held-out package is verified, and its weak metrics define the next structured-output and error-analysis pass.
128 complete episodes selected from 128 unique top-level sessions, balanced across episode-size bands and split 96/16/16 for train/val/test.
source/feature index
Download raw episodes only from official gated sources, exclude visualization.rrd, validate files, then stage them for training.
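The 96/16/16 split over 128 unique sessions described above can be sketched as a seeded, band-stratified shuffle. This is an illustration only: the session ids, the band names, the four equal bands of 32 episodes, and the function name are assumptions, and the pilot's actual selection and split procedure is not shown on this page.

```python
import random
from collections import defaultdict


def split_by_band(episodes, seed=0):
    """Split (session_id, band) pairs 6:1:1 into train/val/test inside each band."""
    session_ids = [session_id for session_id, _band in episodes]
    if len(set(session_ids)) != len(session_ids):
        raise ValueError("expected one episode per unique session")

    by_band = defaultdict(list)
    for session_id, band in episodes:
        by_band[band].append(session_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {"train": [], "val": [], "test": []}
    for band in sorted(by_band):
        ids = sorted(by_band[band])
        rng.shuffle(ids)
        n_val = n_test = len(ids) // 8
        n_train = len(ids) - n_val - n_test
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]
    return splits


# Hypothetical inventory: 128 sessions spread evenly over four size bands.
episodes = [(f"session-{i:03d}", f"band-{i % 4}") for i in range(128)]
splits = split_by_band(episodes)
print({name: len(ids) for name, ids in splits.items()})
```

Splitting by session id rather than by window keeps every window of an episode on one side of the split, which is the property a held-out comparison relies on.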
The current Qwen3-Omni LoRA artifact is the verified v6 selected 128-episode diagnostic adapter. The v5 row remains pinned as the prior release, and the 1-episode Qwen entry is only a sensor-adapter smoke test.
model groups
The next suite push does not need more episodes first: use `multiscale_20s10_40s20_80s40`, hierarchical action/subtask targets, and raw-feature shards while keeping the held-out split fixed.
enhancement data
Qwen3-Omni uses a separate LoRA model repo; Cosmos3-Nano remains a compatibility package; Cosmos3-Super now has a verified forward-dynamics LoRA artifact with weights in a dedicated model repo.
Cosmos3-Super weights
The long-term goal is a full-corpus Xperience Embodied Foundation Model trained on synchronized perception, geometry, motion, inertial, audio, and language streams after smaller scaling stages validate the approach.
pretraining plan
Raw Xperience-10M data is not redistributed here. The reproduction guide states the commands, expected outputs, exact-match reproduction record, and multi-episode requirements.
Human-readable commands, expected artifacts, and current scope for the public single-episode pipeline.
reproducibility guide
Machine-readable command matrix covering sample download, baselines, the unified 20-task suite, figures, and validation.
reproducibility matrix
The last metric rebuild reproduced the public-sample outputs from a fresh cache and matched the committed metrics.
reproduction audit
The website organizes the dataset sample, tasks, methods, results, directions, and scale-up path in one tabbed reader flow.
project materials
The comparison data groups selected-128 baselines, Qwen3-Omni v6 LoRA, Cosmos3-Nano Future Window, and Cosmos3-Super Reasoner. Full Qwen v1-v6 detail stays in a separate lineage audit.
comparison · Qwen v1-v6
Minimal path: install the toolkit dependencies, download the official sample, run the task suite with neural heads, regenerate the historical provenance bundle, build the unified 20-task index, regenerate visualizations, then rebuild the supporting project reports.
git clone https://github.com/Ropedia/HOMIE-toolkit.git
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r HOMIE-toolkit/requirements.txt huggingface_hub hf_xet
git clone https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite.git
pip install -r ropedia-xperience-10m-task-suite/requirements.txt
pip install torch
hf download ropedia-ai/xperience-10m-sample \
  --repo-type dataset \
  --local-dir data/sample/xperience-10m-sample
cd ropedia-xperience-10m-task-suite
export WORKSPACE=/path/to/workspace
python scripts/episode_task_suite.py --workspace "$WORKSPACE" --include-neural
python scripts/research_direction_extension_tasks.py
python scripts/tier2_task_suite.py --workspace "$WORKSPACE"
python scripts/build_unified_task_suite.py
python scripts/task_walkthroughs.py
python scripts/build_evaluation_protocol.py
python scripts/generate_visualizations.py
python scripts/render_overview_figures.py
python scripts/render_task_suite_infographic.py
python scripts/export_modality_atlas_assets.py
python scripts/validate_website_integrity.py
python scripts/validate_scope_claims.py
python scripts/build_artifact_index.py
python scripts/validate_mirror_parity.py
python scripts/validate_publication_package.py