public sample episode / multimodal task lab

Ropedia Xperience-10M Research Task Lab.

This project uses the public Xperience-10M sample from Ropedia to explore embodied-AI task design, multimodal feature construction, lightweight baselines, and future Omni-model fine-tuning. It starts from the sample episode available now, then keeps the same data contracts ready for held-out multi-episode training when more Xperience-10M data is staged.

5,821 frames in sample episode
1,161 windows of 20 frames
8,378 current feature dimensions
12 + 12 + 4 core, neural, and extension probes
Current feature allocation in the window vector:
  • mocap: 2,121
  • camera+imu: 126
  • depth: 980
  • video: 4,116
  • language: 896
  • static: 139

Project overview and contributions.

The page is organized like a compact research project: motivation and scope, dataset sample, task suite, method, baselines, research directions, interactive walkthroughs, and resources for continuing the work.

Project brief

From one public episode to an extensible embodied-AI task lab.

Xperience-10M is much larger than the public sample. This project focuses on the sample available now, turns it into clear task contracts and baseline artifacts, and keeps the same data contract ready for held-out multi-episode training when more episodes are staged.

What this is

A research-development lab for understanding synchronized egocentric multimodal data, defining embodied-AI tasks, and testing small baselines before omni-model fine-tuning.

What is implemented
  • 1,161 aligned windows from one public sample episode
  • 12 task contracts with minimal and neural heads
  • Four research-direction maps and extension probes
What comes next

The next model-quality stage is not another single-sample score. It is a held-out episode pilot with at least 32 valid episodes, no train/test episode leakage, and a completed omni-model evaluation report.

verified

Multimodal episode pipeline

One Xperience-10M public sample episode is converted into aligned windows and a documented feature contract.

frames: 5,821 · windows: 1,161 · features: 8,378
verified

Task suite and baseline heads

Every core task has a minimal baseline and a compact PyTorch MLP head over the same windows, splits, and labels.

core tasks: 12 · neural heads: 12 · extension probes: 4
verified

Dataset source alignment

The public description is aligned to the official gated Xperience-10M dataset card, including modalities, scale, access, and current project coverage.

full dataset: gated · sample scope: 1 episode · raw data mirrored: no
verified

Public research artifacts

Metrics, figures, walkthrough data, baseline weights, and project metadata are packaged across GitHub, GitHub Pages, and Hugging Face.

site integrity: pass · mirror parity: pass · live status: pass
data-gated

Omni-model scale-up path

The 32-episode LoRA path is prepared; full training results require gated data access, held-out splits, training, and evaluation.

current stage: setup checked · target gate: 32 episodes · held-out eval: pending
not redistributed

Data governance

Raw MP4/HDF5/RRD files, private gated Xperience-10M data, and full Qwen weights are excluded from the public repo and HF mirrors.

raw Xperience-10M: excluded · full Qwen weights: excluded · derived artifacts: included

Research roadmap.

The project path is staged from the current public-sample task lab to multi-episode data staging, held-out Qwen3-Omni fine-tuning, robustness runs, and larger foundation/world-model extensions.

implemented

Public-Sample Task Lab

One public episode is converted into aligned windows, task contracts, minimal baselines, neural heads, walkthroughs, and figures.

Entry

Public Xperience-10M sample episode available.

Evidence

Status, protocol, takeaways, summary metrics, and episode-task outputs.

active

Multi-Episode Data Staging

Stage official gated episodes while preserving episode-level separation and recording missing-view coverage.

Entry

Gated data access and enough storage for selected episodes.

Evidence

Data access status, source discovery, and selected episode manifests.

next

32-Episode Qwen3-Omni LoRA Pilot

Train lightweight adapters and evaluate on held-out episodes with committed predictions, metrics, and run reports.

Entry

At least 32 valid staged episodes with no train/test episode leakage.

Evidence

Dataset manifest, training metadata, progress logs, metrics, and predictions.

planned

64-128 Episode Robustness Run

Test whether pilot conclusions survive broader sessions, missing modalities, and stronger ablations.

Entry

32-episode pilot trains and evaluates cleanly.

Evidence

Metrics by session, task, modality, ablation, and failure type.

planned

Foundation and World-Model Extensions

Extend toward audio-visual alignment, reconstruction, SLAM/world modeling, policy-style next action, and affordance reasoning.

Entry

Enough multi-episode data and compute budget for larger multimodal objectives.

Evidence

Task-specific held-out evaluations, qualitative inspection, and updated model cards.

Evaluation protocol is explicit.

The protocol is generated from committed metric artifacts so readers can see the exact data unit, split, task targets, leakage controls, and current limitations before comparing scores.

Data unit

One 20-frame aligned window from the public sample episode, stride 5 frames, 1,161 windows total, represented by the current 8,378-d feature vector.

protocol JSON
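
The window count follows from the frame count, window length, and stride. The sketch below is a minimal illustration, assuming windows start at frame 0 and that trailing partial windows are dropped; neither detail is stated here, but the arithmetic agrees with the reported 1,161 windows. The function name `window_starts` is hypothetical.

```python
def window_starts(num_frames: int, window: int = 20, stride: int = 5) -> list[int]:
    """Start indices of every full window that fits inside the episode."""
    if num_frames < window:
        return []
    # The last valid start is num_frames - window, hence the +1 on the range bound.
    return list(range(0, num_frames - window + 1, stride))


starts = window_starts(5821)
print(len(starts))   # 1161
print(starts[-1])    # 5800, so the last window covers frames 5800..5819
```

Under these assumptions the final frame (index 5820) falls outside the last full window.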

Split policy

Single-episode chronological 70/30 train/test split. Every training window precedes every test window, so later windows are never mixed into training as they would be under a random split; cross-episode generalization is measured in the later multi-episode pilot.

protocol doc
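
A chronological split can be sketched in a few lines. This is an illustration only: the rounding rule (`int` truncation) and the function name `chronological_split` are assumptions, so the project's exact train/test counts may differ by a window.

```python
def chronological_split(num_windows: int, train_fraction: float = 0.7):
    """Return (train_indices, test_indices) with all train windows earlier in time."""
    cut = int(num_windows * train_fraction)
    indices = list(range(num_windows))
    return indices[:cut], indices[cut:]


train_idx, test_idx = chronological_split(1161)
# Every training window index is smaller than every test window index.
assert max(train_idx) < min(test_idx)
print(len(train_idx), len(test_idx))
```

With 20-frame windows at stride 5, neighbouring windows share 15 frames, so a stricter variant could also drop the few windows that straddle the cut. Whether the project does this is not stated here.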

Metric contract

All 12 tasks list input, target, primary metric, minimal baseline score, and neural MLP score from committed result files.

summary metrics

Leakage controls

Scalers fit on train windows only; future labels, target feature blocks, caption/object labels, and contact labels stay on the target side unless explicitly queried.

builder script
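
The train-only scaler rule can be illustrated with a small standard-library sketch. It is not the project's builder script; the helper names `fit_scaler` and `transform` and the toy rows are hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(train_rows):
    """Per-column mean and standard deviation computed from training rows only."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    # A constant column has std 0; fall back to 1.0 so the division stays safe.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply already-fitted statistics; never refit on the rows passed in."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


train = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
test = [[6.0, 12.0]]
means, stds = fit_scaler(train)           # statistics come from train windows only
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)  # test reuses the train statistics
```

Fitting on train windows only means test values can fall outside the training range after scaling, which is expected and is the point of the control.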

Next evaluation stage

This public-sample run covers single-episode task development. Cross-episode generalization, audio-visual learning, pixel-depth reconstruction, neural rendering, and full 32-episode Qwen3-Omni training move to the multi-episode stage.

pilot status

Scale-up requirement

The Omni pilot requires at least 32 valid episodes, held-out episode splits, no train/test episode leakage, training metadata, predictions, metrics, and a run report.

data status
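
The "no train/test episode leakage" requirement amounts to splitting by episode rather than by window. A minimal sketch follows, assuming hypothetical episode IDs, a 25% held-out fraction, and a fixed seed; none of these values come from the project.

```python
import random


def split_by_episode(episode_ids, test_fraction=0.25, seed=0):
    """Assign whole episodes to train or test so no episode lands in both."""
    unique = sorted(set(episode_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    return set(unique[n_test:]), set(unique[:n_test])


def assert_no_leakage(train_eps, test_eps):
    """Raise if any episode ID appears on both sides of the split."""
    overlap = train_eps & test_eps
    if overlap:
        raise ValueError(f"episodes in both splits: {sorted(overlap)}")


episodes = [f"episode_{i:02d}" for i in range(32)]  # hypothetical IDs
train_eps, test_eps = split_by_episode(episodes)
assert_no_leakage(train_eps, test_eps)
print(len(train_eps), len(test_eps))
```

If several episodes share a session, splitting on the session ID instead would be the stricter choice.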

Current experiments and next milestones.

The project shows the completed public-sample task suite, then lays out the data requirements for the Qwen3-Omni scale-up path.

verified

12 minimal heads + 12 neural MLP heads

Every task has a minimal interpretable head and a matching neural MLP run over the same windows, splits, and task contract.

verified

Four research directions are mapped by evidence type

The Ropedia directions are labeled as direct, proxy, or diagnostic coverage, plus one coded extension probe per direction.

data-gated

Qwen3-Omni pilot setup

The current Qwen3-Omni artifacts use one episode and 128 train windows. The 32-episode evaluation is still pending.

verified

Multi-episode pilot status is explicit

The pilot status report records setup-stage 32ep paths separately from completed held-out-episode metrics.

verified

Prepared mirrors stay synchronized

The parity report compares critical JSON, figure, and validator files across the repo, HF Space bundle, artifact dataset bundle, and model bundle before upload.
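
A parity check of this kind can be sketched by hashing the same relative paths under each bundle root. This is an illustration, not the project's `validate_mirror_parity.py`; the function names and the temporary demo directories are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def parity_report(roots: dict[str, Path], critical_files: list[str]) -> dict[str, bool]:
    """Map each relative path to True when every root holds an identical copy."""
    report = {}
    for rel in critical_files:
        digests = set()
        for root in roots.values():
            candidate = root / rel
            # A missing file is recorded as None so it can never count as a match.
            digests.add(sha256_of(candidate) if candidate.is_file() else None)
        report[rel] = len(digests) == 1 and None not in digests
    return report


# Demo with two throwaway directories standing in for the repo and one mirror.
with tempfile.TemporaryDirectory() as tmp:
    repo, mirror = Path(tmp, "repo"), Path(tmp, "mirror")
    for root in (repo, mirror):
        root.mkdir()
        (root / "metrics.json").write_text('{"windows": 1161}')
    (repo / "only_in_repo.json").write_text("{}")
    demo = parity_report({"repo": repo, "mirror": mirror},
                         ["metrics.json", "only_in_repo.json"])
print(demo)
```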

verified

Brand assets are packaged consistently

The generated logo system is packaged into the website header, favicon, README/HF cards, Open Graph preview, and brand-asset manifest.

verified

Public bundles stay lightweight

The bundle report covers required assets, raw-data exclusion, Python cache exclusion, heavy archive exclusion, accidental HF token strings, and public-card figure freshness across GitHub and the HF bundles.

verified

Website references are validated

The site validator checks local links, anchors, JSON bundles, and referenced image dimensions before publishing.

verified

Official dataset card is aligned

The source-alignment note mirrors the public Hugging Face dataset-card facts, sample-card facts, and API metadata: gated access, sample license/tooling, modality coverage, episode layout, intended uses, and current project coverage.

Research reading path.

A newcomer should be able to move from the dataset sample to the task design, model baselines, current limitations, and scale-up plan without reading every file first.

02

Inspect one model input

Use the window table and feature manifest to see the exact aligned sample unit, feature blocks, dimensions, and omitted audio feature status.

03

Compare minimal vs neural heads

Every task has a small interpretable baseline and a matching neural MLP head over the same feature contract and chronological split.

04

Check the scale-up gate

The multi-episode Qwen3-Omni path is prepared. The 32-episode result will be added after the data gate and held-out evaluation pass.

Verified now: One public episode, 5,821 frames, 1,161 windows, 8,378 current features, 12 minimal heads, 12 neural heads, and 4 direction-extension probes.
Next, multi-episode: A 32-episode held-out Qwen3-Omni LoRA pilot is gated on Xperience-10M access and must pass manifest, training, and evaluation checks.
Not redistributed: Raw videos, raw annotations, full Qwen weights, and private gated Xperience-10M data are not included in the public repo or HF bundles.

Aligned with the official dataset card.

The official Xperience-10M card describes a gated, large-scale 4D egocentric multimodal dataset. This project records that full upstream scope while focusing the implemented artifacts on one public sample episode.

Official scale

About 10M experience units and 10,000 hours, with RGB video, audio, depth, camera pose/SLAM, hand/body mocap, IMU, captions, metadata, and calibration.

alignment JSON

HF file-size display

The live Hugging Face page/API currently shows 31.9 TB hosted. This is recorded separately from the card's about-1PB full-scale storage statement.

source JSON

HF access path

The source dataset is manually gated for approved non-commercial use, with an external agreement step noted by the public HF metadata.

official HF dataset

API listing snapshot

HF API metadata observed 803 session folders and 12,103 episode folders with annotation.hdf5. This snapshot supports planning; it is not a local data inventory.

metadata JSON

Public sample card

The sample repo lists cc-by-nc-4.0, HOMIE Toolkit for videos/annotations, and Rerun 0.29.0 for .rrd visualization.

sample dataset

Source notes

The source notes summarize full-dataset facts, public sample-card facts, API-listing notes, and project coverage across the repo, website, and HF cards.

alignment report

Episode layout

Expected folders contain six MP4 streams and annotation.hdf5; visualization.rrd is treated as a viewer artifact and excluded from training downloads.

alignment note
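
The expected layout can be checked from a plain file listing before any training download. The sketch below assumes hypothetical stream file names (`stream_0.mp4` and so on); only `annotation.hdf5`, `visualization.rrd`, and the count of six MP4 streams come from the description above.

```python
from pathlib import PurePosixPath


def check_episode_layout(file_names, expected_mp4=6):
    """Return (ok, training_files) for one episode folder listing."""
    names = [PurePosixPath(name).name for name in file_names]
    mp4_count = sum(name.endswith(".mp4") for name in names)
    has_annotation = "annotation.hdf5" in names
    # visualization.rrd is a viewer artifact, so it is left out of training files.
    training_files = [name for name in names if name != "visualization.rrd"]
    return mp4_count == expected_mp4 and has_annotation, training_files


listing = [f"episode/stream_{i}.mp4" for i in range(6)]  # hypothetical names
listing += ["episode/annotation.hdf5", "episode/visualization.rrd"]
ok, training_files = check_episode_layout(listing)
print(ok, len(training_files))
```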

Current project subset

One public sample episode, 5,821 frames, 1,161 windows, 8,378 current features, audio documented but not yet featurized, and no raw-data redistribution.

modality atlas

Covered now

Action/subtask labels, next-action prediction, temporal diagnostics, hand trajectory, contact, object relevance, caption grounding, retrieval, reconstruction, and misalignment.

summary metrics

Responsible use

The official card notes limited diversity and showcase/production quality. This project excludes identity, surveillance, biometric, sensitive-attribute, and safety-critical uses.

use notes

Later milestones

Full audio-visual learning, caption generation, depth-pixel prediction, SLAM estimation, neural rendering, policy learning, cross-episode generalization, and 32-episode Qwen3-Omni evaluation.

data status

Ropedia Xperience-10M 12-task suite.

The task map connects synchronized multimodal windows to 12 research task heads, then the modality atlas shows the sample streams used to build those contracts. Audio is present in the sample MP4 stream, but the current 8,378-d baseline manifest does not featurize it.

Infographic showing all 12 Ropedia Xperience-10M tasks with enlarged full-width modality cards

Readable modality atlas.

Each Xperience-10M stream gets a large thumbnail, a plain sample-content line, and the exact current-baseline use. These are small derived images only; no raw MP4, HDF5, or RRD data is redistributed.

modality_atlas.json
01

Video

visual stream
Public sample fisheye and stereo camera thumbnails
sample contains

6 synchronized camera MP4 streams

current baseline use

RGB/fisheye/stereo frame statistics

02

Audio

acoustic stream
AAC waveform thumbnail from the public sample MP4 stream
sample contains

AAC stream embedded in MP4

current baseline use

Documented, not featurized in the 8,378-d vector

03

Depth

geometry map
Public sample depth and confidence thumbnails
sample contains

Depth map + confidence channel

current baseline use

Spatial geometry feature block

04

Pose / SLAM

camera pose
Public sample camera trajectory and sparse SLAM map thumbnail
sample contains

Trajectory + sparse SLAM map

current baseline use

Position + orientation features

05

Motion Capture

human motion
Public sample body and hand motion capture thumbnail
sample contains

Body + hand joint tracks

current baseline use

3D mocap feature statistics

06

Inertial

wearable sensor
Public sample accelerometer and gyroscope time-series thumbnail
sample contains

Accelerometer + gyroscope

current baseline use

Wearable motion statistics

07

Language

semantic annotation
Public sample object tags and action caption thumbnail
sample contains

Object tags + action captions

current baseline use

Task labels + semantic targets

The atlas redistributes only small derived thumbnails and metadata. Raw MP4, HDF5, and RRD files remain excluded from this repo and the Hugging Face mirrors.

From raw episode to research artifacts.

Every script works from one data contract: aligned multimodal windows, explicit labels, cached feature extraction, and a manifest that makes omitted modalities visible.

Verified Xperience-10M multimodal pipeline diagram

What this project enables

It demonstrates the full development loop: reading Xperience-10M sample data, aligning modalities, converting them into model-ready windows, defining meaningful tasks, producing metrics, and packaging artifacts for continued research.

What still needs more data

General embodied-intelligence model quality requires many episodes and held-out episode splits; the public sample is the development harness for that next stage.

What the current results actually say.

A generated takeaways layer reads the committed metrics, summarizes useful research signals, and identifies what still needs held-out episodes.

One episode becomes a benchmark contract

The public sample is converted into 5,821 frames, 1,161 aligned 20-frame windows, and an 8,378-dimensional feature contract.

research_takeaways.json

Chronological split exposes class shift

All-feature action reaches 0.9791 macro-F1 on its local split, while the 12-task chronological action head scores 0.0500 macro-F1 on a test portion that contains four action labels never seen in the earlier training windows.

takeaways
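
Macro-F1 averages per-class F1 without weighting by class size, so every test label the classifier never saw contributes a zero. The toy example below illustrates that effect; the labels are invented, and averaging over the union of true and predicted labels is an assumption about the metric definition.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label in truth or prediction."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical test labels: "wipe" and "carry" never appeared during training,
# so the classifier can only answer "pour" or "stir".
y_true = ["pour", "pour", "stir", "wipe", "wipe", "carry"]
y_pred = ["pour", "pour", "stir", "stir", "pour", "pour"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

Here half the windows are classified correctly, yet macro-F1 is one third because two of the four classes score zero.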

Neural heads help dynamics

Hand MPJPE improves from 0.8223 to 0.1116; temporal-order F1 rises from 0.5487 to 0.8718; misalignment F1 rises from 0.4866 to 0.7335.

metrics
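
MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and target joints. A minimal sketch with made-up coordinates follows; the units and joint set used by the project are not specified here.

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance over all joints in all frames.

    pred, target: lists of frames, each frame a list of (x, y, z) joints.
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)
            count += 1
    return total / count


target = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 1.0), (1.0, 3.0, 4.0)]]
print(mpjpe(pred, target))  # joint errors are 1.0 and 5.0, so the mean is 3.0
```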

Retrieval and reconstruction remain open

Ridge/cosine retrieval remains stronger than the neural projection here, and cross-modal feature reconstruction still has negative R².

retrieval metrics

Scale means held-out episodes

The next credible model-quality unit is a 32-episode held-out pilot across 32 sessions, not more adjacent windows from one sample.

scale-up status

Small baselines, no hidden machinery.

Motion-only and current all-feature classifiers use lightweight heads so the comparison stays readable on a laptop and easy to inspect. The neural run keeps the same features and splits, then swaps in PyTorch MLP heads.

Motion-only action

0.9688 macro-F1, 18 classes

Current all-feature action

0.9791 macro-F1, 8,378 features

Motion-only subtask

0.9528 macro-F1, 14 classes

Current all-feature subtask

0.9308 macro-F1, chronological caveats
Macro-F1 comparison chart

Neural MLP heads, same task contracts.

The neural baseline uses small PyTorch MLP classifiers/regressors on the same 8,378-d window features, chronological splits, and leakage filters. This isolates the value of a nonlinear head before moving to heavier Qwen/Omni experiments.

Neural hand forecast

0.1116 MPJPE, down from 0.8223 minimal

Neural temporal order

0.8718 F1, adjacent-window diagnostic

Neural misalignment

0.7335 F1, shifted motion/visual pairs

Neural cross-modal retrieval

0.1530 MRR; ridge remains stronger here
Neural MLP episode task score chart
Minimal versus neural MLP episode task score chart

The 12 tasks organized into four research directions.

Each task is mapped as direct, proxy, or diagnostic evidence for the Ropedia research tracks. The mapping uses two current baselines: minimal interpretable heads and neural MLP heads over the same feature contract.

partially implemented

A. Human Modeling & Motion Understanding

Direct evidence comes from hand trajectory forecasting and contact prediction; action and object relevance are supporting proxies.

2 direct · 2 proxy · 0 diagnostic
proxy tasks only

B. 3D/4D Reconstruction & Neural Rendering

Cross-modal retrieval, modality reconstruction, and misalignment detection check reconstruction prerequisites, not full geometry.

0 direct · 2 proxy · 1 diagnostic
strongest implemented

C. Egocentric Vision & Interaction

Action, subtask, transition, next-action, object, caption, order, and alignment tasks directly stress egocentric understanding.

6 direct · 2 proxy · 3 diagnostic
early proxy tasks

D. Scene Reconstruction & World Modeling

Current probes cover task state, object relevance, retrieval, reconstruction, temporal order, and alignment but no persistent map yet.

0 direct · 6 proxy · 3 diagnostic
Coverage of the 12 Xperience-10M tasks across four research directions

Baseline 1: minimal heads

Softmax, logistic, ridge, and retrieval heads keep every input/output contract readable. They are the first sanity check for whether a task is well-posed.

Baseline 2: neural MLP heads

Small PyTorch MLP classifiers/regressors reuse the same features and splits. They test nonlinear gains before heavier Omni fine-tuning.

Four extra probes make the directions actionable.

These are new data-backed extension tasks computed from the same single-episode feature tensor. They add one concrete input, process, output, and metric for each research direction, while keeping the single-episode limitation explicit.

Four Xperience-10M research-direction extension probes with minimal and neural metrics
A / motion

Body and Hand Motion Intensity

Case: classify fast reach/pour windows as high motion and steady holding windows as low motion.

Input: non-mocap video, depth, pose, IMU, SLAM, calibration, and language features.

Output: high_motion or low_motion.

0.7827 minimal macro-F1 · 0.7986 neural macro-F1
B / views

Multi-View Consistency Retrieval

Case: retrieve the synchronized stereo-left window from a fisheye-camera query.

Input: fisheye_cam0 video features against stereo_left candidate features.

Output: ranked synchronized view candidates.

0.5534 minimal MRR · 0.3469 neural MRR
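
Mean reciprocal rank (MRR) for synchronized-view retrieval can be sketched with cosine similarity over paired feature vectors. The two-dimensional vectors below are invented for illustration, and the variable names `fisheye` and `stereo` only echo the camera streams named above; the real probe works on the project's video feature blocks.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_reciprocal_rank(queries, candidates):
    """queries[i] should retrieve candidates[i]; rank 1 is the best match."""
    total = 0.0
    for i, query in enumerate(queries):
        sims = [cosine(query, cand) for cand in candidates]
        # Rank = 1 + number of candidates scoring strictly above the true match.
        rank = 1 + sum(s > sims[i] for s in sims)
        total += 1.0 / rank
    return total / len(queries)


fisheye = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy query features
stereo = [[0.9, 0.1], [0.8, 0.6], [0.1, 1.0]]    # toy candidate features
mrr = mean_reciprocal_rank(fisheye, stereo)
print(round(mrr, 4))
```

In this toy case the true matches land at ranks 1, 2, and 3, giving an MRR of (1 + 1/2 + 1/3) / 3.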
C / phase

Action Phase Progress Estimation

Case: estimate whether a Pour coffee window is near the start, middle, or end of its action segment.

Input: non-caption multimodal features.

Output: 0-to-1 progress inside the current action.

0.3416 minimal MAE · 0.3038 neural MAE
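
One way to build a 0-to-1 progress target is to normalize each window's position inside its contiguous action segment. The sketch below assumes that definition, which is not confirmed here; the "Stir" label is invented, and single-window segments are pinned to 0.0 as an arbitrary choice.

```python
def phase_progress(labels):
    """Progress in [0, 1] of each window inside its contiguous same-label segment."""
    progress = [0.0] * len(labels)
    start = 0
    for i in range(1, len(labels) + 1):
        # A segment ends at the end of the list or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            length = i - start
            for j in range(start, i):
                progress[j] = (j - start) / (length - 1) if length > 1 else 0.0
            start = i
    return progress


def mean_absolute_error(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


labels = ["Pour coffee"] * 5 + ["Stir"] * 3
targets = phase_progress(labels)
constant_guess = [0.5] * len(targets)
print(targets)
print(mean_absolute_error(constant_guess, targets))
```

Under this target definition, always guessing 0.5 gives an MAE near 0.25 on long segments, which is a useful reference point when reading MAE scores for this kind of probe.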
D / world

Short-Horizon Ego-Motion Forecasting

Case: predict how the camera translation changes over the next 20 frames.

Input: current sensors excluding camera translation and captions.

Output: future camera-translation delta vector.

0.1989 minimal MAE · 0.0989 neural MAE
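
The forecasting target can be pictured as the camera translation 20 frames ahead minus the current translation. This is a sketch under that assumption, with an invented straight-line trajectory; the project's actual anchoring (per frame or per window) and units are not stated here.

```python
def translation_deltas(translations, horizon=20):
    """Future-minus-current translation for every frame that has a full horizon."""
    deltas = []
    for i in range(len(translations) - horizon):
        now, future = translations[i], translations[i + horizon]
        deltas.append(tuple(f - n for f, n in zip(future, now)))
    return deltas


def vector_mae(pred, target):
    """Mean absolute error over every component of every vector."""
    errors = [abs(p - t) for pv, tv in zip(pred, target) for p, t in zip(pv, tv)]
    return sum(errors) / len(errors)


# Hypothetical trajectory: 0.01 units along x per frame, no motion in y or z.
trajectory = [(0.01 * i, 0.0, 0.0) for i in range(50)]
targets = translation_deltas(trajectory)
zero_motion = [(0.0, 0.0, 0.0)] * len(targets)  # "camera stays put" baseline
print(len(targets), vector_mae(zero_motion, targets))
```

A zero-motion baseline like the one above is a cheap reference to report next to any learned forecaster.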

What changed

The four research directions now have coded extension probes, prediction/rank CSVs, JSON metrics, a Markdown summary, and a website chart generated from real sample-window features.

What still needs scale

A full research result still needs many Xperience-10M episodes, held-out episode splits, stronger encoders, and direction-specific models such as body priors, renderers, or persistent scene graphs.

The 12 tasks share four head families.

The diagram separates the shared episode-window feature pipeline from the task-specific heads, and notes that audio remains dataset context rather than a current baseline feature block.

Verified minimal and neural architecture diagram for all 12 Ropedia Xperience-10M tasks

Interactive task walkthrough.

Each task uses a common research name and a concrete case study, then opens into the input, middle modules, output, modality evidence, metric, and current limitation.

Representative sample modality for the selected task
Step 1 / 4 · Input
Action Recognition: Egocentric Action Recognition

Input: inspect the 20-frame multimodal window before choosing the target.

01 / 12
supervised multiclass classifier

Action Recognition

In the coffee-making sample, a pouring window maps to the current action label.

    Metric: macro-F1. Minimal 0.0500; neural MLP 0.0263.

    Current limitation: single-episode chronological split.

    Task cards and metrics.

    The 12 task cards use readable research names, representative modality thumbnails, explicit input-process-output contracts, and verified minimal versus neural scores from the committed result files.

    Every feature block has a source.

    The point is not hidden complexity. Every block has a source modality, a dimensional footprint, and a manifest entry.

    All modality feature block chart

    Diagnostics separate memorization from signal.

    The charts make the main lesson visible: within-episode supervised labels are easy under some splits, while retrieval, grounding, forecasting, and alignment remain the useful probes.

    Episode task suite score chart
    Cross-modal retrieval chart
    Neural MLP task score chart
    Minimal versus neural score chart

    Open the single-episode explorer to inspect window-level labels, predictions, feature-block statistics, object labels, and diagnostic scores.

    Research artifacts for the next experiments.

    Metrics, predictions, manifests, lightweight model weights, and derived window artifacts are organized so the project can be inspected, extended, and scaled before rerunning the full pipeline. Raw Xperience-10M data and Qwen weights are not redistributed.

    Research artifacts

    From one episode to task heads

    Start with the files that define the sample windows, feature blocks, task contracts, metrics, walkthroughs, and research-direction mapping.

    Task-suite report

    One JSON file with every task definition, split detail, feature dimension, and minimal/neural metric.

    summary_report.json

    Windows table

    Window start/end frames and aligned action/subtask labels for the public sample episode.

    windows.csv

    Feature manifest

    Start/end index and dimension for every current feature block in the 8,378-d window vector.

    feature_manifest.json
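
The block dimensions reported on this page sum to the 8,378-d window vector, and a manifest of start/end indices follows directly from them. The sketch below is illustrative: the block order, the half-open `[start, end)` convention, and the key names are assumptions, so the real `feature_manifest.json` schema may differ.

```python
BLOCK_DIMS = {
    "mocap": 2121,
    "camera+imu": 126,
    "depth": 980,
    "video": 4116,
    "language": 896,
    "static": 139,
}


def build_manifest(block_dims):
    """Assign contiguous [start, end) index ranges to each block in dict order."""
    manifest, offset = [], 0
    for name, dim in block_dims.items():
        manifest.append({"block": name, "start": offset, "end": offset + dim, "dim": dim})
        offset += dim
    return manifest


manifest = build_manifest(BLOCK_DIMS)
total = manifest[-1]["end"]
print(total)  # 8378
```

A check like `total == 8378` is a cheap guard against a block being added or resized without the manifest being updated.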

    Neural MLP task results

    Per-task PyTorch MLP metrics, predictions, histories, and checkpoints for the same 12 task contracts.

    neural_mlp/

    Four-direction taxonomy

    Generated JSON, CSV, Markdown, and website data mapping all 12 tasks to the four research tracks.

    research_directions/

    Direction extension probes

    Four coded probes, one per research direction, with minimal and neural metrics plus prediction/rank CSVs.

    research_direction_extensions/

    Task walkthroughs

    Case studies for all 12 tasks, including input, middle process modules, output, metric, limitation, and task-player data.

    task_walkthroughs/

    Single-episode explorer

    Interactive window-level view of labels, predictions, feature-block statistics, object labels, and diagnostics.

    single_episode_explorer.html

    Cross-modal retrieval

    The strongest self-supervised signal from the single episode.

    metrics.json

    Qwen3-Omni pilot is approval-ready.

    The full Xperience-10M Hugging Face dataset is gated. While access is pending, the public plan has selected a 32-episode pilot across 32 different session UUIDs.

    Selection

    Stratified round-robin over 64 top-level sessions; 680 complete candidates scanned; 32 sessions selected.

    Transfer

    Download raw episodes only from official gated sources, exclude visualization.rrd, validate files, then stage them for training.

    Current LoRA artifact

    The current LoRA artifact uses the locally available sample data. The 32-episode result begins after gated data is staged and held-out evaluation runs.

    Reproduce the suite.

    Raw Xperience-10M data is not redistributed here. The reproduction guide states the commands, expected outputs, exact-match reproduction record, and multi-episode requirements.

    Reproducibility guide

    Human-readable commands, expected artifacts, and current scope for the public single-episode pipeline.

    REPRODUCIBILITY.md

    Reproducibility matrix

    Machine-readable command matrix covering sample download, baselines, 12 tasks, figures, and validation.

    reproducibility_matrix.json

    Exact-match reproduction record

    The last metric rebuild reproduced the public-sample outputs from a fresh cache and matched the committed metrics.

    reproducibility_audit.md

    Website reference report

    Local HTML references, anchors, JSON bundles, and image dimensions are validated before publishing.

    website_integrity.json

    32-Episode pilot status

    The 32-episode Qwen3-Omni pilot is prepared at the code and selection-plan level; final metrics follow gated data access and held-out evaluation.

    DATA_ACCESS_STATUS.md

    Minimal path: install the toolkit dependencies, download the official sample, run the 12-task suite with neural heads, regenerate visualizations, then run the artifact index and publication validator.

    # Create a virtual environment and install the toolkit and task-suite dependencies.
    git clone https://github.com/Ropedia/HOMIE-toolkit.git
    python3.12 -m venv .venv
    source .venv/bin/activate
    pip install -r HOMIE-toolkit/requirements.txt huggingface_hub hf_xet
    git clone https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite.git
    pip install -r ropedia-xperience-10m-task-suite/requirements.txt
    pip install torch
    
    # Download the official public sample episode.
    hf download ropedia-ai/xperience-10m-sample \
      --repo-type dataset \
      --local-dir data/sample/xperience-10m-sample
    
    # Run the 12-task suite with neural heads, then the extension probes and walkthroughs.
    # /path/to/workspace is a placeholder; replace it with your own workspace directory.
    cd ropedia-xperience-10m-task-suite
    export WORKSPACE=/path/to/workspace
    python scripts/episode_task_suite.py --workspace "$WORKSPACE" --include-neural
    python scripts/research_direction_extension_tasks.py
    python scripts/task_walkthroughs.py
    # Regenerate visualizations, figures, and modality-atlas assets.
    python scripts/generate_visualizations.py
    python scripts/render_overview_figures.py
    python scripts/render_task_suite_infographic.py
    python scripts/export_modality_atlas_assets.py
    # Validate the site and scope claims, build the artifact index, then validate mirrors and the publication package.
    python scripts/validate_website_integrity.py
    python scripts/validate_scope_claims.py
    python scripts/build_artifact_index.py
    python scripts/validate_mirror_parity.py
    python scripts/validate_publication_package.py