Use this route if you need the project story, what is public, and what claims are safe.
Project overview and contributions.
The page is organized like a compact research project: motivation and scope, dataset sample, task suite, method, baselines, research directions, interactive walkthroughs, and resources for continuing the work. The public sample is used as a real but bounded research system, not as a final full-dataset benchmark.
Use the 20-task suite, radar, matrix, and source audit to compare methods without losing metric provenance.
Use scripts, validators, mirrors, and checks when you want to rerun or trust the public package.
Use directions and scale-up resources for spatial, world-model, VLA, Qwen3, and Cosmos follow-up work.
Choose the right entry point without losing the evidence trail.
The project keeps source code, visual explanation, derived artifacts, model outputs, and release checks on different public surfaces. This map shows what each surface is responsible for before you dive into the full file set.
Start with the brief and status files, then use the dashboard for the visual story.
Use the task contract, protocol, walkthroughs, and radar matrix to follow each scored axis.
Open the sample explorer, raw-file manifest, and feature manifest before reading model scores.
Single-episode baselines, 128-episode aligned baselines, Qwen3, and Cosmos branches stay separated by evidence type.
Spatial intelligence, human-video world modeling, and vision-language-action are documented as trainable directions with task mappings.
Publication checks validate source alignment, package contents, mirror parity, and live URL/hash status.
From one public episode to an extensible embodied-AI task lab.
Xperience-10M is much larger than the public sample. This project focuses on the sample available now, turns it into clear task contracts and baseline artifacts, and keeps the same data contract ready for held-out multi-episode training when more episodes are prepared.
A research-development lab for understanding synchronized egocentric multimodal data, defining embodied-AI tasks, and testing small baselines before omni-model fine-tuning.
- 1,161 aligned windows from one public sample episode
- 20 unified task contracts with minimal and neural evidence
- Tasks 13-20 aligned to the same setup as tasks 1-12
- Four research-direction maps and extension probes
The next model-quality stage is stronger action/subtask modeling on the same held-out split, using dense/multiscale windows before requiring more raw episodes.
Maps one public episode into synchronized windows across video, audio, depth, pose/SLAM, mocap, IMU, calibration, and language-derived signals.
Defines embodied-AI inputs, process modules, outputs, metrics, and case-study walkthroughs instead of treating the sample as a generic classification file.
Keeps chronological splits, predictions, confusion matrices, leakage notes, and single-episode limitations explicit before claiming broader model quality.
Connects the same data contract to 128-episode baselines, a no-new-episode enhancement pack, Qwen3-Omni LoRA, Cosmos-style world modeling, policy-model branches, and the later Xperience-native pretraining goal.
1-Episode 20-Task Radar
Minimal and Neural MLP baselines over the original public-sample episode, with 40/40 scored method-task records.
128-Episode 20-Task Radar
Metadata, raw-feature, Qwen3-Omni, and Cosmos3 branches on the aligned 128-episode surface, with all 140 rows scored and proxy/evidence notes kept explicit.
Interactive research roadmap
Use this as the front door for the project: it links the unified 20 tasks, four research tracks, current sample evidence, and the multi-episode Qwen3-Omni scale-up path.
Multimodal episode pipeline
One Xperience-10M public sample episode is converted into aligned windows and a documented feature contract.
Task suite and baseline heads
The unified task suite has minimal baseline evidence, and the original task cards plus tasks 13-20 share the same windows, splits, and label discipline.
Dataset source alignment
The public description is aligned to the official gated Xperience-10M dataset card, including modalities, scale, access, and current project coverage. The source snapshot records 31.9 TB on the HF surface, a stated full-scale storage of about 1 PB, 12,103 episode folders (upstream metadata, not a local data inventory), the public sample license cc-by-nc-4.0, HOMIE Toolkit and Rerun 0.29.0 as source tooling, and the official limited-diversity note. See data/source_alignment_audit.json.
Public research artifacts
Metrics, figures, walkthroughs, baseline weights, Qwen3-Omni results, and Cosmos3 public-safe packages are staged across GitHub, GitHub Pages, and Hugging Face.
Qwen3-Omni held-out pilot
The first selected-episode LoRA pilot is packaged with real held-out predictions and metrics. It proves the pipeline, while the weak scores make it a baseline for error analysis.
128-Episode Task Suite Enhancement Pack
Shows how the current selected split can be stressed without more episodes: dense windows, hierarchical labels, raw-feature shards, and `multiscale_20s10_40s20_80s40` as the next export target.
Data governance
Raw MP4/HDF5/RRD files, private gated Xperience-10M data, and full Qwen weights are excluded from the public repo and HF mirrors.
Research roadmap.
The project path moves from the current public-sample task lab to the latest verified Qwen3-Omni diagnostic branch, same-split 128-episode baseline alignment, a no-new-episode enhancement pack, action/subtask error analysis, robustness runs, world/policy branches, and the future Xperience Embodied Foundation Model pretraining goal.
Public-Sample Task Lab
One public episode is converted into aligned windows, task contracts, minimal baselines, neural heads, walkthroughs, and figures.
Multi-Episode Data Preparation
Prepare official gated episodes while preserving episode-level separation and recording missing-view coverage. The first selected split is available for Qwen3-Omni diagnostics.
Qwen3-Omni LoRA Latest Diagnostic Branch
Train lightweight adapters on selected prepared episodes and evaluate on held-out episodes with committed predictions, metrics, and run reports.
128-Episode Same-Split Simple/NN Baselines
Align simple metadata/text baselines, raw-feature proxies, and neural MLP baselines to the same selected 96/16/16 split and the unified 20-task axes used by the public result matrix.
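A minimal sketch of an episode-level 96/16/16 split, assuming episodes are identified by string IDs. The function names, the seeded shuffle, and the ID format are illustrative assumptions; the page does not say how the selected split was actually drawn.

```python
import random


def split_episodes(episode_ids, sizes=(96, 16, 16), seed=0):
    """Assign whole episodes to train/val/test (hypothetical helper)."""
    ids = sorted(set(episode_ids))
    if len(ids) != sum(sizes):
        raise ValueError(f"expected {sum(sizes)} unique episodes, got {len(ids)}")
    # Assumption: a seeded shuffle; the real selection rule is not stated.
    random.Random(seed).shuffle(ids)
    n_train, n_val, _ = sizes
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


def assert_disjoint(split):
    """Raise if any episode appears in more than one partition."""
    seen = set()
    for name, ids in split.items():
        overlap = seen & set(ids)
        if overlap:
            raise ValueError(f"{name} shares episodes with another split: {sorted(overlap)}")
        seen |= set(ids)


episodes = [f"episode_{i:03d}" for i in range(128)]  # placeholder IDs
split = split_episodes(episodes)
assert_disjoint(split)
print({name: len(ids) for name, ids in split.items()})
```

Splitting by episode rather than by window keeps every window of a held-out episode out of training, which is the separation the same-split baselines rely on.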
128-Episode Task Suite Enhancement Pack
Use the same selected split, estimate dense/multiscale window exports, define hierarchical action/subtask targets, and prioritize raw-feature shards for tasks that metadata baselines cannot cover.
Action/Subtask Error-Analysis Pass
Keep the 96/16/16 split, tighten JSON decoding or target formatting, and analyze action/subtask failures before larger model-quality claims.
Foundation-Model Selection Matrix
Keep Qwen3-Omni as the first trainable held-out pilot, use Cosmos 3 for world modeling and forward-dynamics trainer development, and stage policy candidates after robot-compatible action targets are explicit.
64-128 Episode Robustness Run
Test whether pilot conclusions survive broader sessions, missing modalities, and stronger ablations.
Cosmos 3 and Policy-Model Extensions
Extend toward future-window prediction, action-conditioned world modeling, synthetic-data tests, policy-style next action, and affordance reasoning.
Xperience Embodied Foundation Model Pretraining
Pretrain an Xperience-native domain model over synchronized video, audio, depth, pose, mocap, IMU, and language after smaller scaling stages prove value.
Additional development directions.
Beyond the current task heads, Qwen3-Omni fine-tuning path, Cosmos/world-model branch, and future native pretraining goal, Xperience-10M can support three foundation pipeline tracks plus several concrete research-development tracks.
Spatial intelligence models
Train spatial-memory models from multiview RGB, egocentric video, depth, pose, calibration, object/contact cues, and language prompts; evaluate spatial QA, object permanence, counting, retrieval, and pose-aware consistency.
Human-video world models
Train future-prediction models from observed interaction windows to score next action, next subtask, future object set, contact transition, camera-motion delta, and latent future state, with Qwen-style probes and Cosmos-style dynamics kept separate.
Vision-language-action models
Train VLA or policy-compatible heads only after converting egocentric video, captions, hand/body motion, contacts, objects, and procedures into traceable action tokens, chunks, and object-conditioned action targets.
Episode taxonomy and data engine
Build an episode atlas, category tags, balance report, and split builder across activities, objects, scenes, sessions, people, and missing modalities.
Standardized benchmark protocol
Version train/val/test manifests, task cards, leakage checks, metric scripts, and reference baselines so future model scores are comparable.
Multimodal representation learning
Train contrastive and masked-prediction encoders over synchronized video, audio, depth, pose, mocap, IMU, and language windows.
Skill and procedure graphs
Mine action steps, transitions, preconditions, effects, and temporal graphs that connect egocentric perception to planning.
Human-object affordances
Add contact, reachable-object, tool-use, and next-affordance tasks using hands, mocap, objects, contacts, video, and language.
3D/4D scene and object memory
Fuse depth, pose/SLAM, multiview video, and object cues into persistent scene/object maps for spatial reasoning and object permanence.
Quality and sync diagnostics
Track timestamp drift, missing streams, calibration consistency, corrupted files, and degraded-mode manifests before large training runs.
Policy and simulation transfer
Convert mocap, hand trajectories, contacts, and object states into action tokens, robot-compatible targets, and imitation-learning examples.
Evaluation protocol is explicit.
The protocol is generated from committed metric artifacts so readers can see the exact data unit, split, task targets, leakage controls, and current limitations before comparing scores.
Data unit
One 20-frame aligned window from the public sample episode, stride 5 frames, 1,161 windows total, represented by 8,546 synchronized multimodal dimensions.
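The window count follows from the frame count, window length, and stride. The sketch below assumes 0-indexed frames and that a trailing partial window is dropped; the function name is illustrative.

```python
def window_starts(num_frames, window=20, stride=5):
    """Start indices of all full windows; partial trailing windows are dropped."""
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, stride))


starts = window_starts(5821)
print(len(starts))                    # 1161
print(starts[-1], starts[-1] + 19)    # 5800 5819 (first and last frame of the final window)
```

Under these assumptions, 5,821 frames give floor((5821 - 20) / 5) + 1 = 1,161 windows, which matches the figure reported for the public sample.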
Split policy
Single-episode chronological 70/30 train/test split. This avoids random future-window mixing; cross-episode generalization is measured in the later multi-episode pilot.
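A chronological split can be sketched as a single cut over the window index. The `gap` parameter is an assumption added for illustration: the page does not say whether boundary windows that share frames with the training side are dropped.

```python
def chronological_split(num_windows, train_fraction=0.7, gap=0):
    """Earlier windows go to train, later windows to test (no shuffling)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(num_windows * train_fraction)
    train = list(range(cut))
    # Optional gap (hypothetical): skip the first `gap` test-side windows.
    test = list(range(min(cut + gap, num_windows), num_windows))
    return train, test


train_idx, test_idx = chronological_split(1161)
print(len(train_idx), len(test_idx))  # 812 349
```

With 20-frame windows and a 5-frame stride, windows up to three index steps apart share frames, so `gap=3` would remove the test windows that overlap the last training window.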
Metric contract
All 20 tasks list input, target, primary metric, baseline score, and source artifact path in the unified suite file, task_suite_20.json.
Leakage controls
Scalers fit on train windows only; future labels, target-side signals, caption/object labels, and contact labels stay on the target side unless explicitly queried.
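The train-only scaler rule can be illustrated with a small standard-library sketch. The helper names are hypothetical, and the project's builder script may use a different scaler; the point is only that statistics come from training rows and are then reused unchanged on test rows.

```python
from statistics import fmean, pstdev


def fit_scaler(train_rows):
    """Per-feature mean and standard deviation from training windows only."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    # Constant features get a unit scale to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply already-fitted statistics; never refit on test rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train = [[0.0, 5.0], [2.0, 5.0]]
test = [[4.0, 6.0]]
means, stds = fit_scaler(train)
print(transform(test, means, stds))  # [[3.0, 1.0]]
```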
Audio ablation
Audio and no-audio variants are evaluated across the original task contracts under the same chronological split.
Foundation branch selection
Qwen3-Omni is the first trainable baseline, Cosmos 3 becomes the world-model branch with a camera-pose proxy forward-dynamics contract ready for trainer work, policy models wait for robot-compatible action targets, and Xperience-native pretraining remains a later full-corpus goal.
Next evaluation stage
This public-sample run covers single-episode task development. The selected multi-episode Qwen3-Omni final diagnostic result is verified and meets the JSON-validity target; Cosmos3-Nano has a verified future-window compatibility package; and Cosmos3-Super has a verified base-weight JSON-task evaluation plus a fine-tuned forward-dynamics LoRA branch. The next stage is action/subtask error analysis, stronger model-quality runs, and policy-target conversion.
128-Episode Task Suite Enhancement Pack
Before adding episodes, the suite should try `multiscale_20s10_40s20_80s40`, hierarchical action/subtask targets, label-normalized scoring, and compact raw-feature shards for unsupported tasks. The plan is recorded in task_suite_enhancement_128.json.
Scale-up requirement
Future Omni, Cosmos, and policy branches use the same episode split discipline, training metadata, held-out predictions, metrics, run report, and public-safe package gate.
Current experiments and next milestones.
The project shows the completed public-sample task suite and the first verified multi-episode Qwen3-Omni diagnostic pilot, then lays out the next quality-improvement and model-extension steps.
Aligned Xperience-10M sample windows
5,821 frames become 1,161 synchronized 20-frame windows with an 8,546-dimensional representation.
20 task contracts + 180 public results
The current release reports nine method families over the unified 20-task axes, with minimal, neural, 128-episode, Qwen3, Cosmos3, and proxy-scored rows kept source-linked.
Audio contribution is measured task by task
Audio variants improve the primary metric on 6 of the original task contracts in this single-episode setting.
Four research directions are mapped by evidence type
The Ropedia directions are labeled as direct, proxy, or diagnostic coverage, with one coded extension probe per direction.
Foundation backbones are separated by role
Qwen3-Omni stays first for held-out LoRA; Cosmos 3 is the world-model branch with camera-pose proxy forward-dynamics targets ready for trainer work; OpenVLA/openpi/GR00T are policy candidates after robot-compatible action conversion; Xperience-native pretraining is the later full-corpus goal.
Qwen3-Omni and Cosmos3 branches
The selected 96/16/16 episode split now has a verified Qwen3-Omni v6 package with 4,032 held-out test predictions and 99.90% JSON validity. Cosmos3-Nano has 378 held-out future-window predictions, Cosmos3-Super Reasoner has 448 held-out base-weight JSON-task predictions, and Cosmos3-Super Forward-Dynamics LoRA has 448 held-out loss records.
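A JSON-validity rate can be computed as the share of model outputs that parse. This sketch assumes "valid" means `json.loads` succeeds on the raw string; the project may also require specific keys or an object at the top level, which is not stated here.

```python
import json


def json_validity(outputs):
    """Fraction of output strings that parse as JSON."""
    if not outputs:
        raise ValueError("no outputs to score")
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
        except (json.JSONDecodeError, TypeError):
            continue
        valid += 1
    return valid / len(outputs)


# Toy outputs, not project predictions.
samples = ['{"action": "pour"}', '{"action": "stir"}', '{"action": ', "not json"]
print(f"{json_validity(samples):.2%}")  # 50.00%
```

For scale, 4,028 parseable outputs out of 4,032 would print as 99.90% under this definition; the exact valid count is not given on this page.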
128-Episode Task Suite Enhancement Pack
The current 3,808-window export can be expanded through dense/multiscale windows without changing the held-out episode split; the recommended scenario is `multiscale_20s10_40s20_80s40`.
Multi-episode pilot status is explicit
The Qwen3-Omni notes separate earlier diagnostic packages, the final 128-episode LoRA result, and the next action/subtask error-analysis pass.
Public pages are connected
The website, GitHub repo, Hugging Face Space, artifact dataset, baseline model repo, consolidated weights/results repo, and collection point to the same research project.
Figures are indexed
The visual set includes the logo, modality atlas, task-suite figure, unified 20-task model radar, model-architecture figure, tasks 13-20 chart, and Qwen3-Omni LoRA training-flow figure.
Brand assets are packaged consistently
The project logo is used consistently in the website header, favicon, README/HF cards, and social preview.
Raw dataset files are not redistributed
The public project shares derived task artifacts, figures, reports, and lightweight baseline files. Raw Xperience-10M videos, HDF5 annotations, RRD visualizations, gated data, and full Qwen weights stay outside the repo.
The dashboard is designed as the visual entry point
Tabs organize the sample data, 20 tasks, model method, results, research directions, and next-stage resources.
Reproduction path is documented
The reproduction guide lists the public sample setup, task-suite rebuild, neural heads, figure generation, and expected outputs.
Official dataset source is linked
The project keeps the official Xperience-10M dataset, public sample, dataset website, and HOMIE toolkit visible so readers can trace the data source.
Research reading path.
A newcomer should be able to move from the dataset sample to the task design, model baselines, current limitations, and scale-up plan without reading every file first.
Understand the current scope
Start with the project brief, status, dataset context, task results, roadmap, and Qwen3-Omni scale-up notes. They separate implemented single-episode work from the prepared multi-episode stage.
Inspect one model input
Use the window table and feature manifest to see the aligned sample unit, modality sources, and leakage controls.
Compare minimal vs neural heads
Every task has a small interpretable baseline and a matching neural MLP head over the same feature contract and chronological split.
Check the scale-up gate
The multi-episode Qwen3-Omni path now has a final verified diagnostic package and public LoRA adapter. The native-pretraining plan shows how this can grow into a full-corpus research direction after action/subtask improvements and stronger task metrics.
Push the current 128 episodes harder
Use the no-new-episode enhancement pack before requesting more storage: it records dense-window estimates, `multiscale_20s10_40s20_80s40`, hierarchical labels, and raw-feature shard priorities.
Aligned with the official dataset card.
The official Xperience-10M card describes a gated, large-scale 4D egocentric multimodal dataset. This project records that full upstream scope while focusing the implemented artifacts on one public sample episode. The source-alignment record, data/source_alignment_audit.json, keeps the following visible on the public site: 31.9 TB on the HF surface, a full-scale storage statement of about 1 PB, 12,103 episode folders (upstream metadata, not a local data inventory), the cc-by-nc-4.0 public sample license, HOMIE Toolkit and Rerun 0.29.0 source tooling, and the official limited-diversity note.
Official dataset
Xperience-10M is a gated large-scale egocentric multimodal dataset for embodied AI, robotics, spatial intelligence, and world modeling.
Public sample
The current unified 20-task suite is built from one public sample episode, not from the entire gated dataset.
Modalities
The sample exposes synchronized video, audio, depth, pose/SLAM, motion capture, inertial signals, calibration, and language annotations.
Multi-episode pilot
The selected 128-episode Qwen3-Omni LoRA v6 diagnostic branch is verified with 4,032 held-out test predictions and 99.90% JSON validity. Action/subtask metrics are still weak, so this remains a baseline for error analysis.
Raw sample browser
The Data tab now exposes the official public sample files directly, including playable MP4 video streams and the audio track embedded in fisheye_cam0.mp4.
Data boundary
Raw MP4, HDF5, and RRD files are streamed from the official public sample source when opened here; private gated data and full Qwen weights are not redistributed in this project.
Current project subset
One public sample episode, 5,821 frames, 1,161 aligned windows, 8,546-dimensional task inputs, plus direct links to the official raw sample files.
Covered now
Action/subtask labels, next-action prediction, temporal diagnostics, hand trajectory, contact, object relevance, caption grounding, retrieval, reconstruction, misalignment, long-horizon forecasting, interaction text, action-object relation, sensor bridging, camera sync, and transition timing.
Responsible use
This project is for research exploration and excludes identity recognition, surveillance, biometric profiling, sensitive-attribute inference, and safety-critical deployment.
Later milestones
Full audio-visual learning, caption generation, depth-pixel prediction, SLAM estimation, neural rendering, policy learning, cross-episode generalization, held-out Qwen3-Omni evaluation, and future Xperience-native pretraining.
Raw public sample browser.
Open each official Xperience-10M sample file from the project page. Video and audio use compact browser previews derived from the official MP4 files, with direct links beside them for the full raw Hugging Face sources. HDF5 and RRD files are shown with their role, size, organization, and direct source links.
fisheye_cam0.mp4
Fisheye camera 0 stream and the public sample audio source. This file can be played as video and as the embedded audio track.
Playing a 12-second fast-start preview derived from the official raw MP4. Use the source link for the complete file.
Video features feed visual tasks; the embedded audio stream feeds audio ablation and acoustic feature blocks.
Sample folder organization
The official public sample is one episode folder. The task suite reads the HDF5 annotations and six synchronized MP4 streams, then writes 20-frame windows with a 5-frame stride.
xperience-10m-sample/
  annotation.hdf5
  fisheye_cam0.mp4
  fisheye_cam1.mp4
  fisheye_cam2.mp4
  fisheye_cam3.mp4
  stereo_left.mp4
  stereo_right.mp4
  visualization.rrd
annotation.hdf5 group map
The raw HDF5 is a binary container, so the browser shows its organization rather than loading the whole file into memory.
Ropedia Xperience-10M Unified 20-Task Suite.
The suite connects synchronized multimodal windows to 20 task contracts. The large map visualizes the original task families, while tasks 13-20 are listed as the aligned continuation under the same setup.
Unified plus split radars
The unified radar keeps all 9 methods in one view. The two split radars separate the clean 1-episode Minimal/NN baseline comparison from the 128-episode metadata/raw/Qwen/Cosmos comparison.
Metric normalization
Higher-is-better metrics are plotted directly on 0-1 axes. Lower-is-better metrics are converted to best/value within the task, while raw values, status reasons, sources, and the two raw128 compact proxy notes remain in the JSON mirrors.
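The normalization rule above can be written as a short function. This is a sketch under stated assumptions: lower-is-better values are strictly positive, and the method keys are placeholders rather than names from the published matrix.

```python
def radar_scores(values, higher_is_better):
    """Map raw per-method metric values for one task onto radar axes.

    Higher-is-better values are used as-is; lower-is-better values become
    best / value, so the best method in the task scores 1.0.
    """
    if higher_is_better:
        return dict(values)
    if any(v <= 0 for v in values.values()):
        raise ValueError("lower-is-better values must be positive for best/value")
    best = min(values.values())
    return {method: best / v for method, v in values.items()}


# Hand MPJPE values quoted on this page (lower is better).
print(radar_scores({"minimal": 0.8647, "neural": 0.1079}, higher_is_better=False))
```

Because best/value is relative to the best method within a task, these axes show rank and ratio rather than absolute error, which is why the raw values are kept in the JSON mirrors.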
Score gap audit
The matrix has 180 method-task records and 180 numeric scores. The gap audit remains published as the evidence ledger for which artifacts support each score, including documented proxy axes where raw targets are absent.
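A coverage check of this kind can be sketched as follows. The record fields (`method`, `task`, `score`) and the placeholder method and task names are assumptions for illustration, not the schema of the published matrix.

```python
from itertools import product


def audit_matrix(records, methods, tasks):
    """Count expected, present, and numerically scored method-task pairs."""
    seen = {(r["method"], r["task"]): r.get("score") for r in records}
    missing = [pair for pair in product(methods, tasks) if pair not in seen]
    numeric = sum(
        isinstance(s, (int, float)) and not isinstance(s, bool)
        for s in seen.values()
    )
    return {
        "expected": len(methods) * len(tasks),
        "records": len(seen),
        "numeric": numeric,
        "missing": missing,
    }


methods = [f"method_{i}" for i in range(9)]   # placeholder names
tasks = [f"task_{i:02d}" for i in range(1, 21)]
records = [{"method": m, "task": t, "score": 0.5} for m, t in product(methods, tasks)]
report = audit_matrix(records, methods, tasks)
print(report["expected"], report["records"], report["numeric"])  # 180 180 180
```

Nine methods over 20 tasks give the 180 records cited above; the same arithmetic gives 40 for the two 1-episode baselines and 140 for the seven remaining methods.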
1-Episode 20-Task Radar
Minimal and Neural MLP are both scored on all 20 public-sample task contracts, shown as two filled polygons without 128-episode overlays.
128-Episode 20-Task Radar
Raw128 Simple and Raw128 NN score all 20 axes; metadata, Qwen3, and Cosmos branches keep 20 records but only plot evaluated numeric targets.
Readable modality atlas.
Each Xperience-10M stream gets a large thumbnail, a plain sample-content line, and the exact current-baseline use. These are small derived images only; no raw MP4, HDF5, or RRD data is redistributed.
Video
6 synchronized camera MP4 streams
RGB/fisheye/stereo frame statistics
Audio
Audio stream embedded in MP4
Acoustic signal
Depth
Depth map + confidence channel
Spatial geometry signal
Pose / SLAM
Trajectory + sparse SLAM map
Position + orientation features
Motion Capture
Body + hand joint tracks
3D mocap feature statistics
Inertial
Accelerometer + gyroscope
Wearable motion statistics
Language
Object tags + action captions
Task labels + semantic targets
The atlas redistributes only small derived thumbnails and metadata. Raw MP4, HDF5, and RRD files remain excluded from this repo and the Hugging Face mirrors.
From raw episode to research artifacts.
Every script works from one data contract: aligned multimodal windows, explicit labels, cached feature extraction, and a manifest that makes omitted modalities visible.
Qwen3-Omni LoRA training flow
Raw valid episodes move through split validation, parallel export, video/audio/text formatting, sensor-bridge features, LoRA training, and sealed held-out evaluation.
What the figure represents
It documents the selected 128-episode final diagnostic result and the action/subtask improvement path needed for stronger model-quality numbers.
What this project enables
It demonstrates the full development loop: reading Xperience-10M sample data, aligning modalities, converting them into model-ready windows, defining meaningful tasks, producing metrics, and packaging artifacts for continued research.
What still needs more data
General embodied-intelligence model quality requires many episodes and held-out episode splits; the public sample is the development harness for that next stage.
What the current results actually say.
A generated takeaways layer reads the committed metrics, summarizes useful research signals, and identifies what still needs held-out episodes.
One episode becomes a benchmark contract
The public sample is converted into 5,821 frames, 1,161 aligned 20-frame windows, and an 8,546-dimensional representation for repeatable task evaluation.
Chronological split exposes class shift
All-feature action reaches 0.9829 macro-F1 on its local split, while the chronological action head in the core task suite is 0.0500 macro-F1 with four unseen later action labels.
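One way to see why unseen later labels pull macro-F1 down is a toy example: a class that never appears in training is never predicted, so its F1 is 0 and it still counts in the macro average. The labels below are made up, and the label set is taken as the union of true and predicted labels, which is an assumption about the scoring convention.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in truth or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy test segment: "grind" and "wipe" only occur late, so the head never predicts them.
y_true = ["pour", "pour", "grind", "wipe"]
y_pred = ["pour", "pour", "pour", "stir"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

Here accuracy is 0.5 while macro-F1 is about 0.2, because three of the four classes contribute an F1 of 0.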
Neural heads help dynamics
Hand MPJPE improves from 0.8647 to 0.1079; temporal-order F1 rises from 0.5400 to 0.8520; misalignment F1 rises from 0.5052 to 0.7153.
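MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and target joints. The sketch below assumes joints are given as coordinate tuples; the units and any normalization behind the scores quoted above are not stated on this page.

```python
import math


def mpjpe(pred_joints, target_joints):
    """Mean Euclidean distance between matching predicted and target joints."""
    if len(pred_joints) != len(target_joints) or not pred_joints:
        raise ValueError("need two non-empty joint lists of equal length")
    distances = [math.dist(p, t) for p, t in zip(pred_joints, target_joints)]
    return sum(distances) / len(distances)


pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
print(mpjpe(pred, target))  # 3.5
```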
Retrieval and reconstruction remain open
Ridge/cosine retrieval remains stronger than the neural projection here, and cross-modal feature reconstruction still has negative R2.
Scale means held-out episodes
The next credible model-quality unit is a held-out multi-episode pilot across different sessions, not more adjacent windows from one sample.
Small baselines, no hidden machinery.
Motion-only and current all-feature classifiers use lightweight heads so the comparison stays readable on a laptop and easy to inspect. The neural run keeps the same features and splits, then swaps in PyTorch MLP heads.
Motion-only action: 0.9688
Current all-feature action: 0.9829
Motion-only subtask: 0.9528
Current all-feature subtask: 0.9173
Neural MLP heads, same task contracts.
The neural baseline uses small PyTorch MLP classifiers/regressors on the same 8,546-dimensional windows, chronological splits, and leakage filters. This isolates the value of a nonlinear head before moving to heavier Qwen/Omni experiments.
Neural hand forecast: 0.1079
Neural temporal order: 0.8520
Neural misalignment: 0.7153
Neural cross-modal retrieval: 0.1300
The original tasks organized into four research directions.
Each task is mapped as direct, proxy, or diagnostic evidence for the Ropedia research tracks. The mapping uses two current baselines: minimal interpretable heads and neural MLP heads over the same feature contract.
A. Human Modeling & Motion Understanding
Direct evidence comes from hand trajectory forecasting and contact prediction; action and object relevance are supporting proxies.
B. 3D/4D Reconstruction & Neural Rendering
Cross-modal retrieval, modality reconstruction, and misalignment detection check reconstruction prerequisites, not full geometry.
C. Egocentric Vision & Interaction
Action, subtask, transition, next-action, object, caption, order, and alignment tasks directly stress egocentric understanding.
D. Scene Reconstruction & World Modeling
Current probes cover task state, object relevance, retrieval, reconstruction, temporal order, and alignment, but no persistent map yet.
Baseline 1: minimal heads
Softmax, logistic, ridge, and retrieval heads keep every input/output contract readable. They are the first sanity check for whether a task is well-posed.
Baseline 2: neural MLP heads
Small PyTorch MLP classifiers/regressors reuse the same features and splits. They test nonlinear gains before heavier Omni fine-tuning.
Tasks 13-20 complete the unified 20-task suite.
The original four direction probes remain as focused examples. Tasks 13-20 add eight sample-supported baselines using the same windows, feature manifest, chronological split, and minimal/neural head pattern as tasks 1-12.
Long-Horizon Next-Action Forecasting
Input: current non-caption multimodal window.
Output: action label five seconds later.
Long-Horizon Next-Subtask Forecasting
Input: current non-caption multimodal window.
Output: procedure subtask five seconds later.
Interaction Text Prediction
Input: current sensor window with caption features removed.
Output: raw annotation interaction phrase.
Action-Object Relation Prediction
Input: current sensor window with caption features removed.
Output: joint action plus active object-set label.
Future Object-Set Forecasting
Input: current sensor window with caption features removed.
Output: object set active five seconds later.
IMU-to-Hand Pose Reconstruction
Input: IMU acceleration and gyroscope features only.
Output: current left/right hand joint feature blocks.
Camera-View Synchronization Retrieval
Input: fisheye camera-1 feature query.
Output: synchronized fisheye camera-3 window rank.
Time-to-Next-Transition Regression
Input: current non-caption multimodal window.
Output: capped frames until the next action boundary.
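For the Time-to-Next-Transition target, a capped regression label can be derived from a per-frame action sequence. This is a sketch: the cap value, the toy labels, and the choice to assign the cap when no later boundary exists are assumptions, since the page does not give those details.

```python
def frames_to_next_transition(labels, cap):
    """For each frame, frames until the next label change, clipped at `cap`."""
    n = len(labels)
    targets = [cap] * n          # frames with no later boundary keep the cap
    next_boundary = None
    for i in range(n - 1, -1, -1):
        if i + 1 < n and labels[i + 1] != labels[i]:
            next_boundary = i + 1
        if next_boundary is not None:
            targets[i] = min(next_boundary - i, cap)
    return targets


labels = ["pour", "pour", "pour", "stir", "stir"]  # toy sequence
print(frames_to_next_transition(labels, cap=10))   # [3, 2, 1, 10, 10]
print(frames_to_next_transition(labels, cap=2))    # [2, 2, 1, 2, 2]
```

Capping keeps long steady segments from dominating the regression loss, at the cost of making all distant boundaries look the same.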
Tasks 13-20 artifact package
The eight-task package has JSON metrics, prediction/rank files, a Markdown summary, and a chart generated from the local public-sample annotation and committed shared-window tensor.
Setup alignment
Tasks 13-20 use the same 20-frame windows, 5-frame stride, 8,546-dimensional feature manifest, chronological split, and minimal/neural comparison pattern as tasks 1-12.
Body and Hand Motion Intensity
Case: classify fast reach/pour windows as high motion and steady holding windows as low motion.
Input: non-mocap video, depth, pose, IMU, SLAM, calibration, and language features.
Output: high_motion or low_motion.
Multi-View Consistency Retrieval
Case: retrieve the synchronized stereo-left window from a fisheye-camera query.
Input: fisheye_cam0 video features against stereo_left candidate features.
Output: ranked synchronized view candidates.
Action Phase Progress Estimation
Case: estimate whether a Pour coffee window is near the start, middle, or end of its action segment.
Input: non-caption multimodal features.
Output: 0-to-1 progress inside the current action.
Short-Horizon Ego-Motion Forecasting
Case: predict how the camera translation changes over the next 20 frames.
Input: current sensors excluding camera translation and captions.
Output: future camera-translation delta vector.
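The retrieval probes above (Multi-View Consistency Retrieval, and the similar Camera-View Synchronization Retrieval in tasks 13-20) rank candidate windows against a query. A minimal cosine-similarity version is sketched below with toy vectors; the function names are illustrative, and the project's own retrieval head may apply a learned projection before ranking.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_of_match(query, candidates, true_index):
    """1-based rank of the synchronized candidate among all candidates."""
    sims = [cosine(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: sims[i], reverse=True)
    return order.index(true_index) + 1


query = [1.0, 0.0]                                  # toy query-view feature
candidates = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]  # toy candidate-view features
print(rank_of_match(query, candidates, true_index=1))  # 1
```

Averaging `rank == 1` over all queries gives recall@1, and averaging `1 / rank` gives mean reciprocal rank.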
What changed
The four research directions now have coded extension probes, prediction/rank CSVs, JSON metrics, a Markdown summary, and a website chart generated from real sample-window features.
What still needs scale
A full research result still needs many Xperience-10M episodes, held-out episode splits, stronger encoders, and direction-specific models such as body priors, renderers, or persistent scene graphs.
The original task heads share four head families.
The diagram separates the shared episode-window representation from the task-specific heads, so the task contracts stay readable before scaling to larger models.
Interactive task walkthrough.
Each task uses a common research name and a concrete case study, then opens into the input, middle modules, output, modality evidence, metric, and current limitation.
Input: inspect the 20-frame multimodal window before choosing the target.
Action Recognition
In the coffee-making sample, a pouring window maps to the current action label.
Metric: macro-F1. Minimal 0.0500; neural MLP 0.0148.
Current limitation: single-episode chronological split.
Task cards and metrics.
The original task cards use readable research names, representative modality thumbnails, explicit input-process-output contracts, and verified minimal versus neural scores. The unified 20-task index adds tasks 13-20 in the same suite.
Every model input has a source.
The point is not hidden complexity. Every input group maps back to a source modality and a manifest entry.
Diagnostics separate memorization from signal.
The charts make the main lesson visible: within-episode supervised labels are easy under some splits, while retrieval, grounding, forecasting, and alignment remain the useful probes.
Open the single-episode explorer to inspect window-level labels, predictions, modality statistics, object labels, and diagnostic scores. The audio ablation summary records the task-by-task audio contribution.
Research artifacts for the next experiments.
Metrics, predictions, manifests, lightweight model weights, and derived window artifacts are organized so the project can be inspected, extended, and scaled before rerunning the full pipeline. Raw Xperience-10M data and Qwen weights are not redistributed.
Open GitHub, HF Space, artifact dataset, baseline models, or consolidated weights/results without guessing.
Open public surfaces
Use validators, source alignment, mirror parity, and live URL/hash checks before trusting a number.
Open checks
Start from scripts, windows, feature manifests, task contracts, and minimal/neural result outputs.
Open commands
Use Qwen3/Cosmos packages, 128-episode feature index, and foundation-model plans for the next runs.
Open scale-up
From one episode to task heads
Start with the files that define the sample windows, modality inputs, task contracts, metrics, walkthroughs, and research-direction mapping.
Task results
Every task definition, split detail, feature dimension, and minimal/neural metric in one project output.
Windows table
Window start/end frames and aligned action/subtask labels for the public sample episode.
window table
Feature inputs
Source map for the current modality inputs used by the task suite.
feature inputs
Neural MLP task results
Per-task PyTorch MLP metrics, predictions, histories, and checkpoints for the original task contracts, with tasks 13-20 published in the aligned result bundle.
neural MLP outputs
Four-direction taxonomy
Maps the original tasks to the four research tracks: human modeling, 3D/4D reconstruction, egocentric interaction, and world modeling.
research direction outputs
Direction extension probes
Four coded probes, one per research direction, with minimal and neural metrics plus prediction/rank CSVs.
extension probe outputs
Task walkthroughs
Case studies for the original tasks, including input, middle process modules, output, metric, limitation, and task-player data.
walkthrough outputs
Audio ablation and raw upgrade
All 72 task/variant rows comparing current audio, no audio, raw audio, replacement, and combined-input settings.
audio ablation outputs
Single-episode explorer
Interactive window-level view of labels, predictions, modality statistics, object labels, and diagnostics.
open explorer
Cross-modal retrieval
The strongest self-supervised signal from the single episode.
retrieval metrics
Project map, mirrors, and runnable code
Use these files to navigate the whole project, open the published mirrors, or reproduce the public-sample pipeline.
Public reader map
Single navigation layer for GitHub, GitHub Pages, HF Space, artifact dataset, baseline model repo, model-branch repos, and public claim boundaries.
Artifact guide
Human-readable map from project scope to data contract, task evidence, platform mirrors, and scale-up status.
Reproduction scripts
Training, visualization, taxonomy, walkthrough, validator, and omni-readiness scripts.
scripts/
Hugging Face Space
The dashboard packaged as a public static Space.
HF Space
GitHub Package
Static dashboard container published to GitHub Container Registry for local browsing with Docker, without raw data or model weights.
GHCR package
Derived HF artifacts
Metrics, predictions, docs, and lightweight derived files without raw data redistribution.
artifact collection
HF baseline models
Minimal NumPy softmax, ridge baselines, and neural task-head model files.
model repo
HF weights + results
Consolidated public-safe baseline weights, Qwen3/Cosmos adapters, verified results, analysis files, and manifest.
weights/results repo
HF collection
Space, artifacts, baseline models, and verified Qwen3/Cosmos3 adapter repos grouped into one public project collection.
collection
Current all-feature action model
Classifier metrics, predictions, confusion matrix, and model weights.
model metrics
Project packet
Compact route through the project for readers who want the shortest path from scope to results after choosing a surface.
project packet
Verified diagnostic pilot
The multi-episode Qwen3-Omni path is documented, scripted, and verified as a validation-monitored diagnostic held-out pilot. Stronger model-quality metrics require structured-output and error-analysis improvements.
Model-family comparison
Compares the three result layers and also groups 1-episode and 128-episode entries by model family: task heads, Qwen3-Omni LoRA, Cosmos3-Nano, and Cosmos3-Super.
128-episode source + features
Maps every selected official Xperience-10M episode id to its gated source tree and the public-safe processed features: Qwen v6 multiscale windows, dense multiscale rows, and metadata matrices.
128-Episode Task Suite Enhancement Pack
No-new-episode plan for denser supervision: `multiscale_20s10_40s20_80s40`, hierarchical action/subtask labels, stronger scoring slices, and raw-feature shard priorities.
task_suite_enhancement_128.json
Foundation-model plan
Backbone selection matrix covering Qwen3-Omni, Cosmos 3, GR00T, OpenVLA/openpi, Gemini Robotics, Octo, SmolVLA-style policy candidates, and the future Xperience-native pretraining goal.
foundation model plan
Multi-episode data access
Public data-access path, selected 128-episode pilot plan, and preparation requirements.
data access
Qwen3-Omni LoRA group
Separates the 1-episode sensor-adapter smoke test from the current 128-episode LoRA adapter package and older diagnostics.
Qwen group
Cosmos3 groups
Shows the verified Nano future-window compatibility package, the Super base-weight Reasoner JSON-task evaluation, and the Super fine-tuned forward-dynamics LoRA branch with separate loss metrics.
Cosmos groups
Scale-up requirement
Future runs need validation tracking, held-out predictions, quality-target reporting, and the same public-safe package gate.
training requirements
Xperience-native pretraining
Future plan for a domain-specific embodied foundation model trained from scratch over full-corpus video, audio, geometry, motion, inertial, and language streams.
pretraining plan
Project files behind the research site
These resources are useful after the first pass: they collect the project brief, task evidence, visuals, dataset notes, reproduction path, and public pages.
Project brief
The fastest written overview of the dataset sample, tasks, baselines, and scale-up plan.
brief
Task walkthroughs
Human-readable case studies for the original tasks, including input, process modules, output, metric, and limitation.
walkthroughs
Task results
Minimal and neural-head metrics for the same sample windows and chronological split.
metrics
Visual figures
Task-suite map, modality atlas, pipeline diagram, model architecture figure, and Qwen3-Omni LoRA training-flow figure.
task-suite figure
Dataset notes
Official dataset links, public sample source, modalities, access boundary, and current project subset.
dataset notes
Reproducibility
Commands and expected outputs for rebuilding the public-sample task suite and visual artifacts.
reproduce
Qwen3-Omni status
Data requirements and evaluation boundary for the selected multi-episode LoRA pilot.
training status
Foundation-model plan
Qwen3-Omni, Cosmos 3, GR00T, OpenVLA/openpi, Gemini Robotics, Octo, SmolVLA-style branches, and the Xperience-native pretraining goal by role.
model plan
Hub artifacts
Derived CSV/JSON/Markdown/figure artifacts without redistributing raw Xperience-10M data.
artifact dataset
Baseline models
Lightweight minimal and neural task-head model files for the task contracts.
model repo
Qwen3-Omni diagnostic branch is verified.
The selected pilot uses 128 source-balanced episodes across 128 different session UUIDs. The latest v6 held-out package is verified, and its weak metrics define the next structured-output and error-analysis pass.
Selection
128 complete episodes selected from 128 unique top-level sessions, balanced across episode-size bands and split 96/16/16 for train/val/test.
source/feature index
Transfer
Download raw episodes only from official gated sources, exclude visualization.rrd, validate files, then stage them for training.
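The transfer step above can be pictured as a filter plus a checksum pass over each downloaded episode. The sketch below is a minimal illustration under assumptions, not the project's staging script. The helper names, the non-empty check, the SHA-256 manifest format, and the demo file names are hypothetical. Only the `visualization.rrd` exclusion comes from the text.

```python
import hashlib
import tempfile
from pathlib import Path

EXCLUDED_NAMES = {"visualization.rrd"}  # excluded from staging, per the transfer rule


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large raw files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_stage_manifest(episode_dir):
    """Map relative path -> SHA-256 for every file that should be staged."""
    root = Path(episode_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.name in EXCLUDED_NAMES:
            continue
        if path.stat().st_size == 0:
            # Assumed validation rule: an empty file is treated as a failed download.
            raise ValueError(f"empty file: {path}")
        manifest[path.relative_to(root).as_posix()] = sha256_file(path)
    return manifest


# Demo on a throwaway directory; the file names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    episode = Path(tmp) / "episode_demo"
    (episode / "video").mkdir(parents=True)
    (episode / "video" / "frames.bin").write_bytes(b"demo")
    (episode / "visualization.rrd").write_bytes(b"viewer recording")
    manifest = build_stage_manifest(episode)
print(sorted(manifest))
```

Keeping the manifest as relative paths plus hashes means the same check can be rerun after staging to confirm that nothing changed in transit.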
Current LoRA artifact
The current Qwen3-Omni LoRA artifact is the verified v6 selected 128-episode diagnostic adapter. The v5 row remains pinned as the prior release, and the 1-episode Qwen entry is only a sensor-adapter smoke test.
model groups
128-Episode Task Suite Enhancement Pack
The next suite push does not need more episodes first: use `multiscale_20s10_40s20_80s40`, hierarchical action/subtask targets, and raw-feature shards while keeping the held-out split fixed.
task_suite_enhancement_128.json
Backbone branches
Qwen3-Omni uses a separate LoRA model repo; Cosmos3-Nano remains a compatibility package; Cosmos3-Super now has a verified forward-dynamics LoRA branch with weights in a dedicated model repo.
Cosmos3-Super weights
Native foundation model
The long-term goal is a full-corpus Xperience Embodied Foundation Model trained on synchronized perception, geometry, motion, inertial, audio, and language streams after smaller scaling stages validate the approach.
pretraining plan
Reproduce the suite.
Raw Xperience-10M data is not redistributed here. The reproduction guide states the commands, expected outputs, exact-match reproduction record, and multi-episode requirements.
Reproducibility guide
Human-readable commands, expected artifacts, and current scope for the public single-episode pipeline.
reproducibility guide
Reproducibility matrix
Machine-readable command matrix covering sample download, baselines, the unified 20-task suite, figures, and validation.
reproducibility matrix
Exact-match reproduction record
The last metric rebuild reproduced the public-sample outputs from a fresh cache and matched the committed metrics.
reproduction audit
Project dashboard
The website organizes the dataset sample, tasks, methods, results, directions, and scale-up path in one tabbed reader flow.
project materials
Multi-episode pilot status
The comparison JSON now supports both the three-version reading and model-family grouping, with Qwen3 v5/v6 detail kept as a separate machine-readable audit.
comparison
Qwen v5/v6
Minimal path: install the toolkit dependencies, download the official sample, run the task suite with neural heads, regenerate tasks 13-20, build the unified 20-task index, regenerate visualizations, then rebuild the supporting project reports.
git clone https://github.com/Ropedia/HOMIE-toolkit.git
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r HOMIE-toolkit/requirements.txt huggingface_hub hf_xet
git clone https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite.git
pip install -r ropedia-xperience-10m-task-suite/requirements.txt
pip install torch
hf download ropedia-ai/xperience-10m-sample \
  --repo-type dataset \
  --local-dir data/sample/xperience-10m-sample
cd ropedia-xperience-10m-task-suite
export WORKSPACE=/path/to/workspace
python scripts/episode_task_suite.py --workspace "$WORKSPACE" --include-neural
python scripts/research_direction_extension_tasks.py
python scripts/tier2_task_suite.py --workspace "$WORKSPACE"
python scripts/build_unified_task_suite.py
python scripts/task_walkthroughs.py
python scripts/build_evaluation_protocol.py
python scripts/generate_visualizations.py
python scripts/render_overview_figures.py
python scripts/render_task_suite_infographic.py
python scripts/export_modality_atlas_assets.py
python scripts/validate_website_integrity.py
python scripts/validate_scope_claims.py
python scripts/build_artifact_index.py
python scripts/validate_mirror_parity.py
python scripts/validate_publication_package.py
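After the commands above finish, an exact-match check of rebuilt metrics against committed ones could look like the sketch below. It is an illustration only and does not replace the project's validator scripts. The nested `task -> metric -> value` layout, the function name `diff_metrics`, and the task key `action_recognition` are assumptions. The 0.0500 value is the minimal macro-F1 quoted in the Action Recognition walkthrough.

```python
def diff_metrics(committed, rebuilt, tol=0.0):
    """Return (task, metric, committed_value, rebuilt_value) for every mismatch.

    A metric that is missing on either side counts as a mismatch.
    """
    mismatches = []
    for task in sorted(set(committed) | set(rebuilt)):
        old = committed.get(task, {})
        new = rebuilt.get(task, {})
        for name in sorted(set(old) | set(new)):
            a, b = old.get(name), new.get(name)
            if a is None or b is None or abs(a - b) > tol:
                mismatches.append((task, name, a, b))
    return mismatches


# Hypothetical metric tables, in practice loaded with json.load from two files.
committed = {"action_recognition": {"macro_f1": 0.0500}}
same = {"action_recognition": {"macro_f1": 0.0500}}
drifted = {"action_recognition": {"macro_f1": 0.0501}}

clean_report = diff_metrics(committed, same)
drift_report = diff_metrics(committed, drifted)
print(clean_report, drift_report)
```

With `tol=0.0` the check is strict. A small tolerance may be more practical when rebuilds run on different hardware, at the cost of no longer being an exact-match record.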