---
license: mit
library_name: pytorch
tags:
- embodied-ai
- robotics
- multimodal
- xperience-10m
- baseline
- evaluation
- qwen3-omni
- cosmos
datasets:
- ropedia-ai/xperience-10m-sample
- ropedia-ai/xperience-10m
metrics:
- accuracy
- f1
- precision
- recall
---

# Ropedia Xperience-10M Task Suite

A multilingual public research surface for Xperience-10M: sample data, 20 embodied-AI tasks, baselines, Qwen3-Omni and Cosmos3 diagnostics, and foundation-model training directions.

English · 中文 · Español · Français · Deutsch · 日本語 · 한국어 · Português

**Ropedia Xperience-10M Task Suite** has two public evidence lines. **Line 1** is the 1-sample task lab for raw-file inspection, task construction, and reproducibility. **Line 2** is the selected-128 comparison surface for aligned metadata/raw baselines, Qwen3-Omni v6 LoRA, Cosmos3-Super Reasoner, and Cosmos3-Nano Future Window. Every score points to a source artifact and keeps direct-vs-proxy status visible.

**Updated:** 2026-06-21.

**Scope:** Line 1 uses one public sample episode. Line 2 uses selected 128-episode public-safe artifacts linked back to official gated episode paths. Raw Xperience-10M MP4/HDF5/RRD files, Qwen3 base weights, Cosmos3 base weights, and gated data are not redistributed here.

## Contents

- [How To Read This Project](#how-to-read-this-project)
- [At A Glance](#at-a-glance)
- [Two Evidence Lines](#two-evidence-lines)
- [Fast Reader Map](#fast-reader-map)
- [Why This Project Exists](#why-this-project-exists)
- [Start Here](#start-here)
- [Glossary](#glossary)
- [Current Research Scope](#current-research-scope)
- [Evaluation Protocol](#evaluation-protocol)
- [Dataset Context](#dataset-context)
- [Reproducibility](#reproducibility)
- [Citation](#citation)

## How To Read This Project

Use the two evidence lines first, then choose the artifact that answers your question. The dashboard is the best visual overview; the GitHub repo is the source of truth for scripts and generated JSON; Hugging Face mirrors contain public-safe cards, metrics, figures, and model artifacts.

Quick rule: use **Line 1** for “can I inspect and reproduce the task?” Use **Line 2** for “how do aligned baselines and model diagnostics compare on the selected 128 episodes?”

The multilingual README files are reader guides. The canonical technical evidence is still the committed task contracts, result matrices, validation JSON, and public-safe result packages.

## At A Glance

| Signal | Current public state |
| --- | --- |
| Project identity | The same logo mark is used across the GitHub README, GitHub Pages dashboard, Hugging Face Space, artifact dataset, model mirrors, favicon, and social preview. Reusable assets: logo mark and social card. |
| Two-line contract | Line 1: 1 sample episode for task construction and reproducibility. Line 2: 128 selected episodes for same-split metadata/raw baselines, Qwen3-Omni v6, and Cosmos3 diagnostics. |
| 180 method-task records | 9 methods x 20 tasks = 180/180 scored records. The ledger separates 174 direct scores from 6 compact-proxy scores. |
| 20 task contracts | Action, procedure, transition, trajectory, contact, objects, language, retrieval, reconstruction, order, sync, long-horizon forecasting, interaction text, action-object binding, sensor bridging, camera sync, and transition timing. |
| Line 1 methods | Minimal and Neural MLP baselines cover all 20 tasks on the one public sample episode: 40/40 direct scores. |
| Line 2 methods | Metadata simple/NN, raw-feature simple/NN, Qwen3-Omni v6 LoRA, Cosmos3-Super Reasoner, and Cosmos3-Nano Future Window cover all 20 selected-128 task axes: 140/140 scores. |
| Foundation directions | Spatial intelligence, human-video world modeling, and vision-language-action pipelines are documented as trainable directions with task mappings and model-evidence requirements. |
| Public mirrors | GitHub, GitHub Pages, HF Space, HF artifact dataset, HF baseline model repo, Qwen3-Omni and Cosmos3 model repos, and HF collection. |

## Two Evidence Lines

The public suite is organized around two evidence lines. Keep them separate when reading metrics.

Two evidence-line map: 1 sample episode and 128 selected episodes combine into 180 scored method-task records

| Line | Data unit | Score statement | Best use | Read separately from |
| --- | --- | --- | --- | --- |
| 1 sample episode | One public Xperience-10M sample episode: 5,821 frames, 1,161 aligned 20-frame windows, 8,546 feature dimensions. | 40/40 direct scores from Minimal and Neural MLP heads. | Inspect the raw sample, understand file organization, reproduce the 20 task targets, and compare Minimal vs Neural MLP behavior inside one episode. | The selected-128 comparison rows and any broader held-out model behavior. |
| 128 selected episodes | Selected held-out 96/16/16 split: 34,269 exported windows with public-safe processed features linked to official gated episode paths. The Hugging Face artifact dataset exposes these rows separately as `selected_128_windows/selected_128`; it is not mixed with the one-sample `episode_sample/public_sample` viewer. | 140/140 selected-128 scores: 134 direct + 6 compact-proxy. | Compare same-split metadata/raw baselines, Qwen3-Omni v6, Cosmos3-Super, and Cosmos3-Nano while keeping the 6 compact-proxy cells visible. | Direct raw-target measurements for the proxy-marked cells. |
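
The Line 1 data unit relates frame count to window count through a sliding-window rule. The sketch below shows that arithmetic. The 20-frame window length comes from the table above; the stride of 5 frames is an assumption, chosen only because it reproduces the reported 1,161 windows from 5,821 frames. The authoritative stride is the one recorded in the evaluation protocol. The function names `count_windows` and `window_bounds` are hypothetical.

```python
def count_windows(num_frames: int, window: int = 20, stride: int = 5) -> int:
    """Number of full sliding windows that fit in num_frames frames.

    stride=5 is an assumption, not a value stated in this README.
    """
    if num_frames < window:
        return 0
    return (num_frames - window) // stride + 1


def window_bounds(num_frames: int, window: int = 20, stride: int = 5):
    """Return (start, end) frame indices, end exclusive, for every full window."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


if __name__ == "__main__":
    # Frame count of the public sample episode, as reported above.
    print(count_windows(5821))       # 1161 under the assumed stride of 5
    print(window_bounds(5821)[:2])   # [(0, 20), (5, 25)]
```

Only full windows are counted here, so trailing frames that do not fill a complete window are dropped rather than padded.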

### Result Ledger

| Line | Methods | Tasks | Scored records | Direct scores | Proxy scores |
| --- | --- | --- | --- | --- | --- |
| 1 sample episode | 2 | 20 | 40/40 | 40 | 0 |
| 128 selected episodes | 7 | 20 | 140/140 | 134 | 6 compact-proxy scores, each source-linked and reasoned. |
| Total public matrix | 9 | 20 | 180/180 | 174 | 6 |
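
The ledger totals can be checked mechanically. The following is a minimal sketch of such a consistency check, using only the numbers in the table above; the `LedgerRow` class is a hypothetical helper for illustration and is not one of the project's validators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerRow:
    """One evidence line of the result ledger (hypothetical helper type)."""
    line: str
    methods: int
    tasks: int
    direct: int
    proxy: int

    @property
    def scored(self) -> int:
        # Every scored record is either a direct or a compact-proxy score.
        return self.direct + self.proxy

    @property
    def expected(self) -> int:
        # A full matrix has one record per method-task pair.
        return self.methods * self.tasks


rows = [
    LedgerRow("1 sample episode", methods=2, tasks=20, direct=40, proxy=0),
    LedgerRow("128 selected episodes", methods=7, tasks=20, direct=134, proxy=6),
]

for row in rows:
    assert row.scored == row.expected, row.line

total_scored = sum(r.scored for r in rows)
total_direct = sum(r.direct for r in rows)
total_proxy = sum(r.proxy for r in rows)
print(total_scored, total_direct, total_proxy)  # 180 174 6
```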

### Method Blocks

| Evidence line | Method block | Methods | Score statement | Read as |
| --- | --- | --- | --- | --- |
| 1 sample episode | Task-head baselines | Minimal; Neural MLP | 40/40 direct scores. | Task-lab reproducibility and simple-vs-neural behavior. |
| 128 selected episodes | Aligned baseline heads | Metadata simple/NN; raw-feature simple/NN | 80/80 scores: 74 direct + 6 compact-proxy. | Same-split metadata/raw-feature baseline comparison. |
| 128 selected episodes | Qwen3-Omni series | Qwen3-Omni v6 LoRA | 20/20 direct scores from verified selected-128 Qwen3-Omni LoRA and task-specific probes. | Trainable Qwen3-Omni diagnostic baseline on the selected-128 surface. |
| 128 selected episodes | Cosmos3 series | Cosmos3-Super Reasoner; Cosmos3-Nano Future Window | 40/40 direct scores from verified public-safe reasoner and future-window artifacts. | Cosmos3 reasoner and future-window diagnostics on the selected-128 surface. |

Cosmos3-Super Forward-Dynamics LoRA is published as a separate fine-tuned adapter artifact with weights/results; it is not counted as a 20-task matrix method row.

### Qwen3-Omni Run Versions

These are Qwen3-Omni run versions inside **Line 2: selected 128 episodes**. They are not the project evidence lines. The 20-task matrix uses **Qwen3-Omni v6 LoRA**; **v5** remains the pinned prior multiscale release; **v1-v4** are lineage and ablation evidence.

| Run | Purpose | Main change | Eval signal | Use now |
| --- | --- | --- | --- | --- |
| v1 | Prove the selected-128 LoRA/eval/package loop. | First verified 96/16/16 selected-episode Qwen3-Omni LoRA run. | 448 eval; JSON 0.8750; contact 0.6451. | Lineage only. |
| v2 | Make answers schema-checked. | Structured-JSON contract with full-8-GPU LoRA on the same split. | 448 eval; JSON 0.9978; contact 0.7188. | Structured-output ablation. |
| v3 | Separate prompt/eval effects from training. | Strict-label prompt/eval over the v2 adapter; no new adapter training. | 448 eval; JSON 1.0000; contact 0.7210. | Prompt/eval ablation. |
| v4 | Test longer structured-JSON LoRA training. | New four-epoch full-8-GPU adapter on the same selected split. | 448 eval; JSON 1.0000; contact 0.7299. | Overfit/metric-tradeoff evidence. |
| v5 | Move to denser multiscale evaluation. | Multiscale cap96 export with 4,032 held-out predictions. | 4,032 eval; JSON 1.0000; contact 0.7865. | Pinned prior release; stronger on several non-contact metrics. |
| v6 | Publish the current Qwen 20-task row. | Rank64/lr5e-5 multiscale LoRA plus verified task-specific probes. | 4,032 eval; JSON 0.9990; contact 0.8177. | Current public 20-task Qwen3-Omni row. |
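
The "JSON" column above reports how often model outputs satisfy the structured-output contract. As a rough illustration of that kind of metric, the sketch below counts an output as valid when it parses as a JSON object and contains a set of required keys. The exact schema check used by the project is not shown in this README, so the function `json_validity_rate`, the key names, and the sample outputs are all hypothetical.

```python
import json


def json_validity_rate(outputs, required_keys=()):
    """Fraction of outputs that parse as a JSON object with all required keys.

    This is an illustrative rule, not the project's actual validator.
    """
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            continue  # not parseable as JSON at all
        if isinstance(obj, dict) and all(key in obj for key in required_keys):
            valid += 1
    return valid / len(outputs)


# Hypothetical model outputs for four evaluation windows.
sample_outputs = [
    '{"action": "pour", "contact": true}',  # valid
    '{"action": "pour"}',                   # missing a required key
    'action: pour',                         # not JSON
    '["pour", true]',                       # JSON, but not an object
]
rate = json_validity_rate(sample_outputs, required_keys=("action", "contact"))
print(f"{rate:.4f}")  # 0.2500
```

As pure arithmetic, 447 valid outputs out of 448 would round to 0.9978; whether that is the actual count behind the v2 row is not stated here.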

Detailed lineage: [`QWEN3_OMNI_RUN_LINEAGE.md`](QWEN3_OMNI_RUN_LINEAGE.md) and [`qwen3_omni_run_lineage.json`](docs/data/qwen3_omni_run_lineage.json).

Result entry points: [`TWO_EVIDENCE_LINES.md`](TWO_EVIDENCE_LINES.md), [`two_evidence_lines.json`](docs/data/two_evidence_lines.json), [`TWO_EVIDENCE_LINE_RESULT_SUMMARY.md`](TWO_EVIDENCE_LINE_RESULT_SUMMARY.md), [`two_evidence_line_result_summary.json`](docs/data/two_evidence_line_result_summary.json), [`single_episode_task_model_radar.json`](docs/data/single_episode_task_model_radar.json), [`episode128_task_model_radar.json`](docs/data/episode128_task_model_radar.json), [`task_method_20_result_matrix.json`](docs/data/task_method_20_result_matrix.json), and [`xperience10m_128_episode_feature_index.json`](docs/data/xperience10m_128_episode_feature_index.json).

## Fast Reader Map

| Reader goal | Start here | Then inspect |
| --- | --- | --- |
| Understand quickly | Project brief, Project status | Dashboard |
| Choose the public surface | Public reader map | public_reader_map.json |
| Decode project terms | Glossary | glossary.json |
| Inspect the 20 tasks | TASK_SUITE_20.md | task_suite_20.json, task walkthroughs |
| Compare results | Research takeaways | two-line result summary, 20-result matrix, radar JSON, score/proxy audit |
| Understand one sample | Single-episode explorer | raw sample file map, feature manifest |
| Read foundation directions | Three foundation pipelines | three_foundation_pipelines.json, foundation model plan |
| Reproduce or audit | Reproducibility, Evidence contract | quality gates, publication audit, mirror parity |

## Why This Project Exists

This project is organized as a compact research artifact around Xperience-10M: start from a real public episode, make every modality and label path inspectable, turn the data into concrete embodied-AI tasks, and keep the evaluation boundary clear while preparing the next multi-episode experiments. The emphasis is on research judgment as much as implementation: what the sample can show, where the selected-128 comparison begins, and what evidence should exist before presenting stronger model quality.

The work is designed to demonstrate four capabilities that matter for embodied-AI research infrastructure:

| Capability | What this project shows |
| --- | --- |
| Multimodal data understanding | Parses the public sample into synchronized windows across video, audio, depth, pose/SLAM, mocap, IMU, calibration, and language-derived signals. |
| Task design | Defines 20 human-readable tasks in one unified public-sample suite, plus four direction-extension probes with inputs, outputs, process modules, metrics, and case-study walkthroughs. |
| Model and evaluation discipline | Runs minimal and compact neural baselines, records predictions/metrics, keeps chronological split boundaries explicit, and separates the sample readout from held-out comparison rows. |
| Scale-up planning | Connects the public-sample pipeline to 32/128-episode held-out pilots, Qwen3-Omni LoRA, Cosmos-style world-model tracks, policy/VLA tracks, and the future Xperience-native foundation-model pretraining goal. |

## Start Here

The public release is split across GitHub, the website, and Hugging Face. Use [`PUBLIC_READER_MAP.md`](PUBLIC_READER_MAP.md) first if you want the shortest route through those surfaces, or use the machine-readable companion [`docs/data/public_reader_map.json`](docs/data/public_reader_map.json). For the one-page project summary, use [`PROJECT_BRIEF.md`](PROJECT_BRIEF.md) and [`docs/data/project_brief.json`](docs/data/project_brief.json).

| Reader goal | Best entry point |
| --- | --- |
| Choose the right public surface | PUBLIC_READER_MAP.md, public_reader_map.json |
| Resolve confusing terms and abbreviations | GLOSSARY.md, glossary.json |
| Understand the whole project quickly | PROJECT_BRIEF.md |
| See the visual research dashboard | GitHub Pages dashboard |
| Navigate the unified 20 tasks, four tracks, and scale-up plan | Interactive research roadmap, TASK_SUITE_20.md, task_suite_20.json, research_roadmap_interactive.json |
| Compare current task metrics | RESEARCH_TAKEAWAYS.md, summary_metrics.json |
| Compare possible foundation backbones | FOUNDATION_MODEL_PLAN.md, foundation_model_plan.json |
| Understand the future native pretraining goal | XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md |
| See additional concrete project directions | ADDITIONAL_DEVELOPMENT_DIRECTIONS.md, additional_development_directions.json |
| Understand one model input | feature_manifest.json, windows.csv |
| Check multi-episode data status | DATA_ACCESS_STATUS.md |

## Glossary

Use [`GLOSSARY.md`](GLOSSARY.md) when a term such as evidence line, 20-frame window, direct score, compact-proxy score, raw metric value, normalized radar value, minimal/minimum baseline, simple baseline, Qwen v1-v6, Cosmos3-Super, LoRA adapter, or HF artifact dataset is unclear. The same definitions are mirrored as [`docs/data/glossary.json`](docs/data/glossary.json) for the website and Hugging Face repos.

## Public Surface Map

| Surface | What it is for |
| --- | --- |
| GitHub repo | Source of truth for docs, scripts, generated JSON, validators, and commit history. |
| GitHub Pages dashboard | Best visual overview of the sample, 20 tasks, radar results, foundation directions, and resources. |
| Hugging Face Space | Hub-hosted copy of the dashboard and static app assets. |
| HF artifact dataset | Public-safe metrics, reports, website JSON, result packages, and derived evidence files. |
| HF baseline model repo | Minimal/neural baseline weights, figures, metrics, and mirrored task artifacts. |
| Qwen3-Omni and Cosmos3 model repos | Adapter-specific public weights or package cards when Qwen3-Omni v6, Cosmos3-Super, or Cosmos3-Nano runs are verified and publishable. |

Public release checks are exposed as JSON for mirrors and dashboards: [`docs/data/website_integrity.json`](docs/data/website_integrity.json), [`docs/data/rendered_site_check.json`](docs/data/rendered_site_check.json), [`docs/data/task_surface_integrity.json`](docs/data/task_surface_integrity.json), [`docs/data/publication_audit.json`](docs/data/publication_audit.json), [`docs/data/mirror_parity.json`](docs/data/mirror_parity.json), [`docs/data/public_surface_qa.json`](docs/data/public_surface_qa.json), and [`docs/data/research_roadmap.json`](docs/data/research_roadmap.json).

## Research Project Overview

| Theme | Current implementation |
| --- | --- |
| Dataset slice | One public Xperience-10M sample episode, 5,821 frames, 1,161 windows, and an 8,546-dimensional representation. |
| Modalities | Video, audio, depth, camera pose/SLAM, hand/body mocap, IMU, calibration, and language annotations. |
| Task suite | 20 human-readable tasks form one embodied-AI public-sample suite with shared windowing, split discipline, leakage controls, and minimal/neural head pattern. |
| Baselines | Minimal linear/ridge/logistic heads plus compact PyTorch MLP task heads over the same chronological split; companion simple/NN metadata baselines are also aligned to the selected 128-episode 96/16/16 split. |
| Research directions | Task mapping and extension probes for human modeling, 3D/4D reconstruction, egocentric interaction, and world modeling. |
| Scale-up path | The selected-episode Qwen3-Omni LoRA v6 diagnostic package is verified on the 96/16/16 split with 34,269 exported windows and 4,032 held-out test predictions. v6 improves action macro-F1/contact accuracy versus v5; v5 remains a pinned prior-release row where it is stronger on other metrics. Same-split simple/NN metadata and raw-feature baselines are now reported on the unified 20-task axes, with compact-proxy notes retained where a target is derived from public-safe processed artifacts. The Qwen result proves the multi-episode export/train/eval/package loop and meets the strict-JSON target, but weak action/subtask metrics make it a baseline for error analysis rather than a strong model. Cosmos3 has three verified diagnostics: Nano future-window compatibility, Super base-weight Reasoner evaluation, and Super forward-dynamics LoRA fine-tuning over camera-pose proxy targets. |
| Public surfaces | GitHub repo, GitHub Pages dashboard, GHCR static-site package, HF Space, HF artifact dataset, HF baseline-model repo, and HF collection. |
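
The scale-up row above compares runs on action macro-F1 and contact accuracy. The two metrics can disagree sharply under class imbalance, which is why both are tracked. The sketch below is a plain-Python illustration of the standard definitions; the action labels are hypothetical and the helper names are not taken from the project's scripts.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the target."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over labels seen in targets or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: a head that always predicts the majority action.
y_true = ["reach", "reach", "reach", "pour"]
y_pred = ["reach", "reach", "reach", "reach"]

print(accuracy(y_true, y_pred))            # 0.75
print(round(macro_f1(y_true, y_pred), 4))  # 0.4286
```

Accuracy rewards the majority-class guess, while macro-F1 averages a per-class score and so drops when a minority class is never predicted.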

For the fastest interpretation of the current metrics, start with [`RESEARCH_TAKEAWAYS.md`](RESEARCH_TAKEAWAYS.md) and [`docs/data/research_takeaways.json`](docs/data/research_takeaways.json). They summarize what the public sample results actually show: class shift under chronological splits, neural gains on dynamics/order/alignment, harder retrieval/reconstruction probes, and why the next model-quality step needs held-out episodes.

Current contributions:

- manifested sliding-window features over the currently extracted modalities,
- motion-only and current all-feature baseline models,
- 20 end-to-end episode-level task contracts,
- one shared 20-frame window and chronological split contract across the public-sample task suite,
- lightweight neural MLP heads for the same task contracts,
- a generated four-direction research taxonomy matching the Ropedia job tracks,
- four additional direction-extension probes with minimal and neural baselines,
- human-readable research task cards and an interactive scrub/play walkthrough storyboard for every task,
- an interactive research roadmap connecting 20 tasks, four research tracks, current sample evidence, the Qwen3-Omni scale-up path, and foundation-model track selection,
- a next-milestone track for Qwen3-Omni fine-tuning, Cosmos 3 world modeling, and sensor-bridge evaluation,
- a future pretraining plan for an Xperience Embodied Foundation Model over the full corpus after smaller multi-episode stages prove value,
- metrics, predictions, model weights, manifests, charts, and a two-level tabbed static research website,
- a clear explanation of what is implemented now and what moves to the multi-episode stage.

## Current Research Scope

This project is best read as a staged embodied-AI research study:

| Layer | Current scope | Where to start |
| --- | --- | --- |
| Data understanding | One public Xperience-10M sample episode is converted into 5,821 frames, 1,161 aligned windows, and an 8,546-dimensional multimodal representation. | PROJECT_BRIEF.md, PROJECT_STATUS.md |
| Task suite | Twenty human-readable tasks cover recognition, prediction, retrieval, reconstruction, synchronization, long-horizon forecasting, interaction text, action-object binding, sensor bridging, camera sync, and transition timing. Historical `tier2_task_suite` artifact paths are kept for link stability, but they are provenance paths inside the same suite. | TASK_SUITE_20.md, task_suite_20.json, RESEARCH_TAKEAWAYS.md, summary_report.json, TIER2_TASK_BASELINES.md |
| Baselines | Minimal heads and compact PyTorch MLP heads provide a controlled single-episode comparison on the same chronological split. The selected 128-episode setup adds same-split metadata simple/NN baselines for JSON-supported tasks and raw-feature simple/NN baselines on all 20 task axes. Tasks 15 and 19 are explicitly marked as compact-proxy completions. | neural_mlp/, BASELINE_ALIGNMENT_REPORT.md, raw20 run summary |
| Diagnostics | Audio contribution, modality ablations, timeline overlays, object labels, and alignment stress tests show which signals are useful and which tasks remain hard. | AUDIO_ABLATION_SUMMARY.md, single_episode_explorer.html |
| Scale-up | Qwen3-Omni LoRA v6 is verified on the selected 96/16/16 split with 34,269 exported windows and 4,032 held-out test predictions. v6 improves action macro-F1/contact accuracy versus v5; v5 remains a pinned prior-release row because it is stronger on several other metrics. Same-split simple/NN metadata baselines are published for JSON-supported axes, and the raw-feature run adds simple/NN baselines on 20/20 task axes. Tasks 15 and 19 are documented compact proxies because raw interaction strings and paired video-view embeddings are absent from the 128 export. Cosmos3-Nano has a verified future-window compatibility package; Cosmos3-Super has a 448-window base-weight JSON-task Reasoner evaluation. Cosmos3-Super also has a fine-tuned forward-dynamics LoRA package over camera-pose proxy targets with 2,848 train rows, 512 validation rows, and 448 test rows. The 128-episode enhancement pack records dense-window sizing, hierarchical action/subtask targets, task bottlenecks, and next experiment cards without overwriting existing results. | RESEARCH_ROADMAP.md, FOUNDATION_MODEL_PLAN.md, XPERIENCE10M_128_EPISODE_FEATURE_INDEX.md, xperience10m_128_episode_feature_index.json, TASK_SUITE_ENHANCEMENT_128.md, task_suite_enhancement_128.json, omni_model_comparison.json, omni_finetune_verified_result.json, qwen3_v5_v6_comparison.json, QWEN3_V5_V6_COMPARISON_20260614.md, OMNI_MODEL_COMPARISON.md, verified_public/, task_suite_enhancement_128_v1_20260608/ |

Detailed dataset notes, reproduction checks, and generated JSON reports are included for readers who want to inspect the implementation, but they are supporting materials rather than the main reading path. Use [`ARTIFACT_GUIDE.md`](ARTIFACT_GUIDE.md) when you want the full file map.

Source alignment is tracked in [`SOURCE_ALIGNMENT_AUDIT.md`](SOURCE_ALIGNMENT_AUDIT.md) and [`docs/data/source_alignment_audit.json`](docs/data/source_alignment_audit.json). The official gated `ropedia-ai/xperience-10m` card reports `31.9 TB` on the live HF surface and an `about-1PB` full-scale storage statement. The committed API-listing snapshot records `12,103 episode folders` as upstream `metadata only`: for this project they are listing metadata, not a local raw-data inventory. The public sample remains `ropedia-ai/xperience-10m-sample` under `cc-by-nc-4.0`, with the `HOMIE Toolkit` and `Rerun 0.29.0` noted as source tooling. The official responsible-use note that the data is `limited in diversity` is preserved.

## Project Status

If you only have one minute, use [`PROJECT_STATUS.md`](PROJECT_STATUS.md) and [`docs/data/project_status.json`](docs/data/project_status.json). They give the current research state in one compact table:

| Area | Current decision |
| --- | --- |
| Public-sample pipeline | Verified on one public sample episode: 5,821 frames, 1,161 windows, 8,546 dimensions. |
| 20-task suite | Verified minimal baselines with committed metrics, predictions, and manifests. |
| Neural heads | Verified compact PyTorch MLP heads over the same task contracts and chronological splits. |
| Dataset context | Official Xperience-10M links, sample-vs-gated-data boundary, modality coverage, and redistribution policy are documented. |
| Evaluation protocol | Verified generated protocol for windowing, split policy, leakage controls, and per-task metrics. |
| Website and Hub pages | Public dashboard, Hugging Face Space, artifact dataset, baseline model repo, and collection use the same project framing and links. |
| Qwen3-Omni multi-episode pilot | Final verified diagnostic result package exists for the selected 96/16/16 episode split; JSON validity meets the target, while action/subtask metrics remain weak. |
| Raw data / full Qwen weights | Raw Xperience-10M data and full Qwen weights are not redistributed. |

## 90-Second Research Project Path

If you are reading the project cold, open these in order:

| Step | Question | Primary artifacts | What should be true |
| --- | --- | --- | --- |
| 1 | What is this project? | PROJECT_BRIEF.md, PROJECT_STATUS.md, Dashboard | A public-sample Xperience-10M research project with 20 tasks, baselines, and a scale-up plan. |
| 2 | What data is used? | Dataset-card alignment, Official HF dataset, Sample HF dataset | The implemented suite uses one public sample episode; the gated dataset is reserved for selected multi-episode training. |
| 3 | What does one model input contain? | windows.csv, feature_manifest.json, available_modalities.json | Each window is an aligned multimodal unit with video, audio, depth, pose/SLAM, mocap, IMU, calibration, and language-derived signals. |
| 4 | What are the 20 tasks? | TASK_SUITE_20.md, task_suite_20.json, task walkthroughs, task_walkthroughs.json | Every task has a human-readable name, input, output, metric, baseline scores, and an explicit artifact path. |
| 5 | How are tasks evaluated? | EVALUATION_PROTOCOL.md, evaluation_protocol.json | The window unit, chronological split, leakage controls, task metrics, and current limitations are explicit. |
| 6 | What do current results mean? | RESEARCH_TAKEAWAYS.md, research_takeaways.json, summary_metrics.json | Current metrics describe sample-level task behavior and identify which signals need larger held-out experiments. |
| 7 | Which models are implemented? | summary_report.json, neural_mlp/, HF baseline repo | Each task has minimal and neural-head evidence over the same feature windows. |
| 8 | What research directions does this support? | RESEARCH_ROADMAP.md, research_directions.json, research_direction_extensions.json, task_suite_20.json | The unified tasks are mapped to human modeling, 3D/4D reconstruction, egocentric interaction, and world modeling. |
| 9 | Which foundation model comes next? | FOUNDATION_MODEL_PLAN.md, foundation_model_plan.json, Native pretraining plan | Qwen3-Omni is the first held-out LoRA baseline; Cosmos 3 has Nano compatibility and Super forward-dynamics LoRA; policy models wait for robot-compatible action targets. |
| 10 | How can the 128-episode suite be pushed without more data? | TASK_SUITE_ENHANCEMENT_128.md, task_suite_enhancement_128.json | The enhancement pack proposes dense windows, hierarchical action/subtask labels, raw-feature shard priorities, and `multiscale_20s10_40s20_80s40` as the next export target. |
| 11 | How do I reproduce it? | REPRODUCIBILITY.md, reproducibility_audit.md | Public commands and expected outputs are documented for the sample-episode task suite. |
| 12 | What is still pending? | omni_finetune_verified_result.json, DATA_ACCESS_STATUS.md, MULTI_EPISODE_ACCESS_STATUS.md | The final held-out diagnostic Qwen pass is verified and JSON-validity target is met; strong action/subtask model quality remains pending. |
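
Step 5 in the path above points to the evaluation protocol, which uses a chronological 70/30 single-episode split and lists train-only normalization among its leakage controls. The sketch below illustrates both ideas on toy data. It is not the project's split code: the function names, the rounding rule at the cut point, and the toy feature rows are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def chronological_split(rows, train_fraction=0.7):
    """Split time-ordered rows without shuffling: earliest rows train, latest test."""
    cut = int(len(rows) * train_fraction)  # rounding rule is an assumption
    return rows[:cut], rows[cut:]


def fit_standardizer(train_rows):
    """Per-feature mean and standard deviation computed from training rows only."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard constant features
    return means, stds


def apply_standardizer(rows, means, stds):
    """Standardize rows with statistics that were fitted elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy time-ordered windows with two features each (hypothetical values).
windows = [[float(i), float(2 * i)] for i in range(10)]

train, test = chronological_split(windows)
means, stds = fit_standardizer(train)               # test rows never touch the statistics
test_scaled = apply_standardizer(test, means, stds)

print(len(train), len(test))  # 7 3
```

Fitting the statistics on the training portion alone keeps information about later windows out of the features seen during training, at the cost of test values that may fall outside the training range.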

A compact reader-path summary is available at [`docs/data/project_packet.json`](docs/data/project_packet.json).

## Supporting Files

[`ARTIFACT_GUIDE.md`](ARTIFACT_GUIDE.md) is the human-readable map for readers who want to inspect the project files after the first pass. It groups the main briefs, task outputs, baseline results, visual assets, data notes, and scale-up documents. [`docs/data/artifact_index.json`](docs/data/artifact_index.json) is the compact machine-readable companion used by the website and Hugging Face artifact dataset.

## Evaluation Protocol

[`EVALUATION_PROTOCOL.md`](EVALUATION_PROTOCOL.md) and [`docs/data/evaluation_protocol.json`](docs/data/evaluation_protocol.json) are generated from committed metric artifacts. They define:

- the 20-frame window unit, stride, feature dimension, and raw-data policy,
- the chronological 70/30 single-episode split and its generalization limit,
- the per-task input, target, primary metric, minimal score, and neural score,
- leakage controls for future labels, target-side signals, caption/object labels, and train-only normalization,
- current limitations, including cross-episode generalization, audio-visual learning, pixel-depth reconstruction, and real held-out multi-episode Qwen3-Omni quality.

## Dataset Context

The official [`ropedia-ai/xperience-10m`](https://huggingface.co/datasets/ropedia-ai/xperience-10m) dataset is a gated large-scale egocentric multimodal dataset for embodied AI, robotics, spatial intelligence, and world modeling. The public [`ropedia-ai/xperience-10m-sample`](https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample) repo provides the sample episode used for the implemented task suite here.

This project keeps two evidence lines separate. Line 1 uses the public sample for raw-file inspection, task construction, and local reproducibility. Line 2 uses selected 128-episode public-safe artifacts for same-split method comparison, Qwen3-Omni v6 diagnostics, and Cosmos3 diagnostics.
Raw Xperience-10M MP4/HDF5/RRD files are not redistributed in this repo or in the Hugging Face mirrors.

The current verified public-sample subset is:

- one public sample episode, 5,821 frames, and 1,161 aligned windows,
- raw sample files with six MP4 video streams and audio streams,
- `annotation.hdf5` carrying depth, SLAM/camera pose, hand/body mocap, IMU, language/caption annotations, calibration, metadata, and timing records,
- an 8,546-dimensional baseline representation using video, audio, depth, pose/SLAM, mocap, IMU, calibration, and language-derived signals.

Detailed dataset notes are available in [`XPERIENCE10M_DATASET_CARD_ALIGNMENT.md`](XPERIENCE10M_DATASET_CARD_ALIGNMENT.md) and [`docs/data/xperience10m_dataset_card_alignment.json`](docs/data/xperience10m_dataset_card_alignment.json) for readers who need the full upstream-card and access-term context. The practical reading rule is simple: Line 1 is the task lab, Line 2 is the selected-128 comparison surface, and compact-proxy cells stay explicitly marked where direct raw targets are missing.

Start with the visual dashboard: **[chaoyue0307.github.io/ropedia-xperience-10m-task-suite](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/)**

Hugging Face Space app: **[cy0307-ropedia-xperience-10m-task-suite.hf.space](https://cy0307-ropedia-xperience-10m-task-suite.hf.space/)**

## Read This Project By Evidence View

| View | What to inspect | Why it matters |
| --- | --- | --- |
| Project status | PROJECT_STATUS.md, project_status.json | Gives a one-table current project summary before reading the full artifact trail. |
| Data contract | windows.csv, feature_manifest.json, modality manifests | Confirms what each sample window contains before modeling. |
| Dataset context | XPERIENCE10M_DATASET_CARD_ALIGNMENT.md, official dataset links | Explains the official dataset, public sample, modalities, access boundary, and what this repo uses. |
| Visual assets | FIGURE_INDEX.md, docs/assets/ | Shows the task-suite graphic, modality thumbnails, pipeline diagrams, charts, and logo assets. |
| Evaluation protocol | EVALUATION_PROTOCOL.md, evaluation_protocol.json | Defines the task unit, split, metrics, leakage controls, and current limitations. |
| Research roadmap | RESEARCH_ROADMAP.md, research_roadmap.json | Shows the path from sample-level task development to multi-episode work, larger model tracks, and the future native-pretraining goal. |
| Additional development directions | ADDITIONAL_DEVELOPMENT_DIRECTIONS.md, additional_development_directions.json | Records concrete non-backbone tracks: taxonomy, benchmark protocol, representation learning, skill graphs, affordances, 3D/4D memory, QA, and policy transfer. |
| Xperience Embodied Foundation Model plan | XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md | Describes the long-term full-corpus pretraining goal, target modules, objectives, staged scale-up, hardware ranges, and evaluation protocol. |
| Minimal heads | softmax, ridge projection/regression, multi-label logistic heads | Keeps every input/output contract visible and inspectable. |
| Neural heads | PyTorch MLP classifiers/regressors under neural_mlp/ | Checks whether nonlinear heads improve each task without changing features. |
| Evidence | metrics, predictions, confusion matrices, diagrams, dashboard | Makes the single-episode task development inspectable without rerunning first. |
| Artifact guide | ARTIFACT_GUIDE.md | Groups the public evidence into reader-facing views after the first-pass overview. |
| Reproducibility contract | REPRODUCIBILITY.md, reproducibility_matrix.json | States public commands, expected outputs, exact-match reproduction evidence, and non-reproducible boundaries. |
| Citation metadata | CITATION.cff, codemeta.json, LICENSE | Makes the repo easier to cite, index, and reuse without confusing code license and dataset terms. |

## Links

| Resource | Link |
| --- | --- |
| This GitHub repo | github.com/ChaoYue0307/ropedia-xperience-10m-task-suite |
| This project website | chaoyue0307.github.io/ropedia-xperience-10m-task-suite |
| This Hugging Face Space | huggingface.co/spaces/cy0307/ropedia-xperience-10m-task-suite |
| Live Hugging Face app | cy0307-ropedia-xperience-10m-task-suite.hf.space |
| GitHub Container package | ghcr.io/chaoyue0307/ropedia-xperience-10m-task-suite |
| Derived artifacts on Hugging Face | huggingface.co/datasets/cy0307/ropedia-xperience-10m-task-suite-artifacts |
| Minimal and neural task baselines on Hugging Face | huggingface.co/cy0307/ropedia-xperience-10m-task-baselines |
| Consolidated weights, results, and analysis package | huggingface.co/cy0307/ropedia-xperience-10m-weights-results |
| Qwen3-Omni 128-episode LoRA adapter | huggingface.co/cy0307/ropedia-qwen3-omni-lora-128ep |
| Cosmos3-Super forward-dynamics LoRA adapter | huggingface.co/cy0307/ropedia-cosmos3-super-forward-dynamics-lora-128ep |
| Hugging Face collection | huggingface.co/collections/cy0307/ropedia-xperience-10m-task-suite |
| Xperience-10M dataset website | ropedia.com/dataset |
| Xperience-10M release page | ropedia.com/blog/20260316_xperience_10m |
| Ropedia GitHub organization | github.com/Ropedia |
| HOMIE Toolkit | github.com/Ropedia/HOMIE-toolkit |
| Xperience-10M Hugging Face dataset | huggingface.co/datasets/ropedia-ai/xperience-10m |
| Xperience-10M sample on Hugging Face | huggingface.co/datasets/ropedia-ai/xperience-10m-sample |
| Ropedia Hugging Face organization | huggingface.co/ropedia-ai |

## Citation, License, And Metadata

Use [`CITATION.cff`](CITATION.cff) when citing this project. The repository also includes [`codemeta.json`](codemeta.json) for machine-readable software metadata and [`docs/data/project_manifest.json`](docs/data/project_manifest.json) for website/Hugging Face surface metadata.

The code files are MIT-licensed. Raw Xperience-10M data is not redistributed here, and dataset use remains governed by the official Ropedia/Xperience-10M terms. See [`LICENSE`](LICENSE) and [`DATA_NOTICE.md`](DATA_NOTICE.md).

![Ropedia Xperience-10M task-suite infographic](docs/assets/task_suite_infographic.png?v=xperience10m-taskfirst-v14-modality-compact)

The infographic uses a custom text-free research background and puts the shared processing contract plus all 20 unified task families in one figure. Public sample stream thumbnails remain available through the raw sample browser and derived modality assets instead of a separate repeated atlas panel. The task names, input/output summaries, and metrics are overlaid from [`results/episode_task_suite/summary_report.json`](results/episode_task_suite/summary_report.json) with [`scripts/render_task_suite_infographic.py`](scripts/render_task_suite_infographic.py), so the published PNG is a presentation graphic with verified labels and metrics, not a hallucinated metric sheet.

The complete unified task list is documented in [`TASK_SUITE_20.md`](TASK_SUITE_20.md) and [`docs/data/task_suite_20.json`](docs/data/task_suite_20.json). Historical `tier2_task_suite` paths remain only as stable provenance links inside the same suite.

![Unified 20-task model radar](docs/assets/charts/unified_task_model_radar.svg)

The unified radar is now a grouped small-multiple comparison board instead of a nine-method overlay. It keeps all 20 task axes and all 9 method rows visible, but separates the methods into single-episode, 128-episode metadata/text, 128-episode raw-feature, and foundation-model panels.
Every method has 20 explicit result records in the public matrix. Tasks 15 and 19 are marked as compact-proxy completions where the 128 export lacks raw interaction strings or paired video-view embeddings; those six proxy cells stay explicitly marked instead of being blended into direct-target metrics. The SVG uses `sqrt(normalized_score)` only for visual radius so small but real differences are readable; raw metrics and exact linear normalized scores remain in JSON and the table. Cosmos3-Super forward-dynamics LoRA remains a separate artifact card because its camera-pose proxy MSE is not one of the 20 task metrics.

The machine-readable copies are [`docs/data/unified_task_model_radar.json`](docs/data/unified_task_model_radar.json) and [`docs/data/task_method_20_result_matrix.json`](docs/data/task_method_20_result_matrix.json); the explicit score/proxy ledger is [`docs/data/task_method_20_gap_audit.json`](docs/data/task_method_20_gap_audit.json) and [`TASK_METHOD_20_GAP_AUDIT.md`](TASK_METHOD_20_GAP_AUDIT.md); the reader-facing matrix is [`TASK_METHOD_20_RESULT_MATRIX.md`](TASK_METHOD_20_RESULT_MATRIX.md). The website Results section also renders the same 180 cells as a wide, source-linked table with raw values, normalized radar values, metric keys, and direct/proxy badges.

For easier reading, the same source data is also split into two focused radars:

![Single-episode 20-task model radar](docs/assets/charts/single_episode_task_model_radar.svg)

![128-episode 20-task model radar](docs/assets/charts/episode128_task_model_radar.svg)

The single-episode radar uses one enlarged panel for Minimal vs Neural MLP, both with 20/20 scored public-sample axes.
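As a rough illustration of the `sqrt(normalized_score)` radius mapping mentioned above, the sketch below shows how a square root spreads out small scores for display while the stored value stays linear. The function name and the clipping to `[0, 1]` are assumptions made for this example; they are not taken from the repository's radar builder.

```python
import math

METHODS = 9   # method rows in the public matrix
TASKS = 20    # task axes per method
TOTAL_CELLS = METHODS * TASKS  # the 180 cells referenced in the text


def radar_radius(normalized_score: float) -> float:
    """Map a linear normalized score to a display radius via sqrt.

    Clipping to [0, 1] is an assumption of this sketch.
    """
    clipped = min(max(normalized_score, 0.0), 1.0)
    return math.sqrt(clipped)


# Small linear scores become visibly larger radii; 1.0 stays at 1.0.
for score in (0.01, 0.04, 0.25, 1.0):
    print(f"score={score:.2f} -> radius={radar_radius(score):.2f}")
```

Under this mapping a linear score of 0.04 is drawn at radius 0.2, which is why the raw metrics and linear normalized scores in the JSON files, not the plotted radius, are the values to compare.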
The 128-episode radar uses three grouped panels for metadata/text baselines, raw-feature baselines, and foundation-model rows: metadata and raw-feature simple/NN baselines are now complete 20/20 multi-episode records, and Qwen3-Omni v6 LoRA, Cosmos3-Super Reasoner, and Cosmos3-Nano Future Window each carry 20 scored task records. The current matrix has 180/180 scored method-task records.

The website raw sample browser includes a concise stream-to-feature ledger backed by [`docs/data/modality_atlas.json`](docs/data/modality_atlas.json) and [`docs/assets/modalities/`](docs/assets/modalities/). Those assets are small derived thumbnails from the public sample, not raw Xperience-10M files.

![Verified Pipeline](docs/assets/pipeline_diagram.png?v=xperience10m-nn)

![Qwen3-Omni LoRA training pipeline](docs/assets/qwen3_omni_lora_pipeline.png?v=qwen3-lora-v1)

![Minimal and neural task model architectures](docs/assets/task_architectures.png?v=xperience10m-nn)

The pipeline and architecture figures use the same pattern: text-free visual backgrounds carry the composition, while [`scripts/render_overview_figures.py`](scripts/render_overview_figures.py) overlays exact labels, dimensions, and metrics from the committed result files.

## Scope

This is a learning, inspection, and pipeline-validation repo with two public evidence lines. Line 1 is built from one public sample episode. Line 2 uses a selected 96/16/16 split over 128 episode paths, public-safe processed features, and verified Qwen3-Omni/Cosmos3 diagnostic artifacts.
## What Is Inside

```text
scripts/
  train_min_action_model.py  # motion/IMU baseline
  train_all_modalities_model.py  # current all-feature lightweight baseline
  episode_task_suite.py  # public-sample task definitions
  neural_task_models.py  # optional PyTorch MLP heads for task contracts
  research_direction_taxonomy.py  # maps walkthrough-backed tasks to the four research tracks
  research_direction_extension_tasks.py  # one extra data-backed probe per track
  tier2_task_suite.py  # historical-name provenance builder for unified task rows
  build_unified_task_suite.py  # builds TASK_SUITE_20.md and task_suite_20.json
  build_unified_task_model_radar.py  # builds grouped 20-axis model comparison radars
  build_task_method_20_gap_audit.py  # builds the explicit 180/180 scored-cell ledger
  task_walkthroughs.py  # human-readable task-card and walkthrough-storyboard metadata
  generate_visualizations.py  # refreshes SVG charts + summary JSON
  render_task_suite_infographic.py  # renders the task-suite presentation PNG
  export_modality_atlas_assets.py  # exports responsive modality-card assets
  render_overview_figures.py  # renders polished pipeline/architecture PNGs
  build_brand_assets.py  # derives logo sizes, favicon, social card
  build_artifact_index.py  # builds the compact artifact guide data
  build_quality_gates.py  # builds release checks
  validate_mirror_parity.py  # checks prepared GitHub/HF mirror file parity
  validate_scope_claims.py  # separates setup artifacts from completed model metrics
  validate_task_surface.py  # checks readable task cards and interactive storyboard wiring
  validate_website_integrity.py  # checks local site links, anchors, and images
  validate_publication_package.py  # checks public repo + HF bundle contents
  publish_hf_bundles.py  # uploads prepared HF Space/artifact/model bundles
  omni/
    download_sample_modelscope.py  # ModelScope sample download helper
    build_episode_manifest.py  # metadata-only multi-episode scanner
    plan_finetune_sample_budget.py  # storage/sample-count planner
    qwen3_omni_adapter_smoke.py  # real-data Qwen3-Omni adapter setup check
    score_existing_model_output_task_probes.py  # scores task targets already present in verified model outputs
    collect_qwen3_v4_release_artifacts.py  # pulls verified v4 results after remote eval
results/
  min_action_model/  # motion-only action baseline artifacts
  min_subtask_model/  # motion-only subtask baseline artifacts
  min_all_modalities_action_model/  # current all-feature action artifacts
  min_all_modalities_subtask_model/  # current all-feature subtask artifacts
  episode_task_suite/  # task-suite metrics and predictions
    neural_mlp/  # optional neural baseline artifacts per task
    research_directions/  # four-track taxonomy, CSV, and summary
    research_direction_extensions/  # four extra direction probes + predictions
    tier2_task_suite/  # provenance baseline tasks + predictions; historical path
    task_walkthroughs/  # case-study walkthroughs for walkthrough-backed tasks
  omni_exploration/  # ModelScope readiness-check artifacts
  omni_finetune/model_output_task_probes_20260616/  # task-13/task-16 probes derived from verified model JSON
docs/
  index.html  # GitHub Pages dashboard
  data/additional_development_directions.json  # concrete non-backbone project directions
  data/summary_metrics.json  # website-readable metrics bundle
  data/task_suite_20.json  # unified 20-task suite bundle
  data/unified_task_model_radar.json  # 20-task radar values, groups, and sources
  data/single_episode_task_model_radar.json  # 1-episode grouped radar values
  data/episode128_task_model_radar.json  # 128-episode grouped radar values
  data/task_method_20_result_matrix.json  # 9-method x 20-task result matrix
  data/task_method_20_gap_audit.json  # explicit 180/180 scored-cell ledger
  data/task_icon_manifest.json  # assigned icon asset map for all 20 tasks
  data/evidence_contract.json  # machine-readable project scope
  data/artifact_index.json  # compact project-artifact catalog
  data/live_publication_status.json  # live GitHub/HF publication verification
  data/quality_gates.json  # machine-readable release checks
  data/task_suite_enhancement_128.json  # no-new-episode 128-suite enhancement pack
  data/task_surface_integrity.json  # machine-readable task-card/storyboard integrity check
  data/project_manifest.json  # machine-readable public-surface metadata
  data/project_packet.json  # compact project path and scope summary
  data/research_roadmap.json  # multi-episode and omni-model roadmap
  data/research_directions.json  # four-track website data bundle
  data/research_direction_extensions.json  # four extra probe data bundle
  assets/task-icons/*.svg  # one crisp assigned icon per task
  assets/task-icons/task-icon-atlas.png  # generated overview atlas for the 20-task visual language
  data/tier2_task_suite.json  # provenance baseline bundle; historical path
  data/task_walkthroughs.json  # human-readable task-card and walkthrough-storyboard data
  data/modality_atlas.json  # responsive modality-card data
  assets/brand/*.png  # project logo, favicon, social card
  assets/task_suite_infographic.png  # task-suite presentation graphic
  assets/modalities/  # public-sample derived modality thumbnails
  assets/pipeline_diagram.png  # verified episode pipeline graphic
  assets/qwen3_omni_lora_pipeline.png  # Qwen3-Omni LoRA training-flow figure
  assets/task_architectures.png  # verified task-head architecture map
  assets/charts/unified_task_model_radar.svg  # 9-method grouped small-multiple radar board
  assets/charts/single_episode_task_model_radar.svg  # 1-episode enlarged radar panel
  assets/charts/episode128_task_model_radar.svg  # 128-episode grouped radar panels
  assets/charts/*.svg  # regenerated visualizations
notes/
  min_action_model.md
  all_modalities_model.md
  episode_task_suite.md
```

Raw Xperience-10M data is **not** committed. Download it from the official Ropedia distribution and follow the dataset terms.

## GitHub Package

The public dashboard is packaged as a static-site container on GitHub Container Registry.
It contains the `docs/` site plus the main reader documents; it does not include raw Xperience-10M videos, raw annotations, gated data, or model weights.

```bash
docker pull ghcr.io/chaoyue0307/ropedia-xperience-10m-task-suite:latest
docker run --rm -p 8080:80 ghcr.io/chaoyue0307/ropedia-xperience-10m-task-suite:latest
```

Then open `http://localhost:8080`.

## Data Expected

The scripts expect a workspace with the Ropedia HOMIE toolkit and the Xperience-10M sample episode:

```text
/
  HOMIE-toolkit/
  data/sample/xperience-10m-sample/
    annotation.hdf5
    fisheye_cam0.mp4
    fisheye_cam1.mp4
    fisheye_cam2.mp4
    fisheye_cam3.mp4
    stereo_left.mp4
    stereo_right.mp4
```

The public website also includes a Raw Sample Browser that lists every official sample file, plays compact browser-preview clips derived from the official MP4 streams, exposes the audio track embedded in `fisheye_cam0.mp4`, links the full raw Hugging Face source for each MP4/HDF5/RRD file, and describes the `annotation.hdf5` group organization without copying large raw files into this repository.

The public sample dataset identifier is:

```text
ropedia-ai/xperience-10m-sample
```

Hugging Face URL:

```text
https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample
```

## Quickstart

From a workspace folder:

```bash
git clone https://github.com/Ropedia/HOMIE-toolkit.git
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r HOMIE-toolkit/requirements.txt huggingface_hub hf_xet
```

Download the sample:

```bash
hf download ropedia-ai/xperience-10m-sample \
  --repo-type dataset \
  --local-dir data/sample/xperience-10m-sample
```

If Hugging Face access is unavailable in your environment, use ModelScope:

```bash
python scripts/omni/download_sample_modelscope.py \
  --output-dir data/sample/xperience-10m-sample \
  --mode minimal
```

`--mode minimal` downloads `annotation.hdf5`, `README.md`, and `fisheye_cam0.mp4`. Use `--mode all-training` to add all six MP4 streams while still skipping `visualization.rrd`.
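Before running any script, it can help to confirm that the download produced the layout described under Data Expected. The following is a minimal sketch rather than a repository tool: the helper name `missing_sample_files` is hypothetical, and the two file lists simply restate the files named in this README for the full layout and for `--mode minimal`.

```python
from pathlib import Path

# Files listed under "Data Expected" for the full sample layout.
EXPECTED_FILES = (
    "annotation.hdf5",
    "fisheye_cam0.mp4",
    "fisheye_cam1.mp4",
    "fisheye_cam2.mp4",
    "fisheye_cam3.mp4",
    "stereo_left.mp4",
    "stereo_right.mp4",
)

# Files that `--mode minimal` is described as downloading.
MINIMAL_FILES = ("annotation.hdf5", "README.md", "fisheye_cam0.mp4")


def missing_sample_files(workspace, expected=EXPECTED_FILES):
    """Return the expected sample files that are absent (hypothetical helper)."""
    sample_dir = Path(workspace) / "data" / "sample" / "xperience-10m-sample"
    return [name for name in expected if not (sample_dir / name).is_file()]


# Example: report what is missing under a placeholder workspace path.
print(missing_sample_files("/path/to/workspace"))
```

An empty list means every file in the chosen list is present; passing `MINIMAL_FILES` checks only the subset fetched by the minimal ModelScope mode.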
Clone and run this repo:

```bash
git clone https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite.git
cd ropedia-xperience-10m-task-suite
python scripts/episode_task_suite.py --workspace /path/to/workspace
```

Run the public-sample task definitions with lightweight neural heads:

```bash
pip install torch
python scripts/episode_task_suite.py \
  --workspace /path/to/workspace \
  --include-neural
```

Then rebuild the unified 20-task index after the historical provenance bundle is regenerated:

```bash
python scripts/tier2_task_suite.py --workspace /path/to/workspace
python scripts/build_unified_task_suite.py
python scripts/build_evaluation_protocol.py
```

Run the smaller baselines:

```bash
python scripts/train_min_action_model.py --workspace /path/to/workspace
python scripts/train_all_modalities_model.py --workspace /path/to/workspace
```

## Xperience-10M Fine-Tuning Exploration

This repo includes a first Qwen3-Omni fine-tuning path over Xperience-10M. The repository separates public-sample evidence from multi-episode fine-tuning artifacts. The selected-episode held-out package is now verified as a diagnostic result, not a strong final action/subtask model.

The useful distinction is:

- direct Qwen3-Omni inputs: RGB/fisheye video, embedded MP4 audio, and language prompts,
- adapter-required Xperience-10M sensor inputs: depth, pose/SLAM, hand/body mocap, contacts, and IMU.

![Xperience-10M to Qwen3-Omni LoRA training flow](docs/assets/qwen3_omni_lora_pipeline.png?v=qwen3-lora-v1)

The figure shows the intended end-to-end training flow: raw valid episodes enter episode-level split validation, parallel media/sensor export creates Qwen-style JSONL records, Qwen3-Omni receives video/audio/text directly, the sensor bridge adds depth/pose/mocap/IMU features, LoRA adapters are trained on prepared train/val episodes, and sealed held-out test evaluation produces predictions, metrics, run reports, and upload-ready adapter artifacts.
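The direct-versus-adapter distinction above can be pictured as a simple routing step. The sketch below is illustrative only: the modality keys and the `route_modalities` function are names assumed for this example and do not describe the repository's exporter or sensor bridge.

```python
# Modalities that Qwen3-Omni is described as consuming directly.
DIRECT_INPUTS = {"video", "audio", "text"}

# Sensor streams that are described as needing an adapter (sensor bridge).
ADAPTER_INPUTS = {"depth", "pose", "mocap", "contacts", "imu"}


def route_modalities(available):
    """Split available modality names into direct and adapter groups.

    Hypothetical helper; unknown names raise so mistakes are not silently dropped.
    """
    unknown = sorted(set(available) - DIRECT_INPUTS - ADAPTER_INPUTS)
    if unknown:
        raise ValueError(f"unrecognized modalities: {unknown}")
    return {
        "direct": sorted(m for m in set(available) if m in DIRECT_INPUTS),
        "sensor_adapter": sorted(m for m in set(available) if m in ADAPTER_INPUTS),
    }


print(route_modalities(["video", "audio", "text", "depth", "imu"]))
```

Keeping the two groups explicit makes it easier to see which streams reach the backbone unchanged and which depend on adapter quality.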
The scale-up path requires valid prepared episodes, held-out episode splits, training metadata, predictions, metrics, and a run report. A result is ready for public README, website, or Hugging Face updates only after the validator passes and `scripts/omni/package_verified_omni_result.py` creates a public-safe derived-artifact package. The current verified package is listed in [`docs/data/omni_finetune_verified_result.json`](docs/data/omni_finetune_verified_result.json).

The current cross-version comparison is generated at [`docs/data/omni_model_comparison.json`](docs/data/omni_model_comparison.json) and [`results/omni_finetune/OMNI_MODEL_COMPARISON.md`](results/omni_finetune/OMNI_MODEL_COMPARISON.md); it separates the single-episode task suite, 128-episode aligned simple/NN baselines, Qwen3-Omni v6 LoRA, Cosmos3-Super Reasoner, and Cosmos3-Nano Future Window packages. The same generated files also include `model_groups`: a model-first view that pairs 1-episode and 128-episode entries for the same family. Use that section when comparing task heads against task heads, Qwen3-Omni smoke/LoRA against Qwen3-Omni LoRA, or Cosmos3-Nano compatibility against future Cosmos weight releases.

For Qwen3-Omni specifically, read `QWEN3_OMNI_RUN_LINEAGE.md`: v1-v4 are pipeline-hardening and ablation evidence, v5 is the pinned prior multiscale release, and v6 is the current public 20-task Qwen row.

The no-new-episode enhancement plan is recorded in [`docs/data/task_suite_enhancement_128.json`](docs/data/task_suite_enhancement_128.json) and [`TASK_SUITE_ENHANCEMENT_128.md`](TASK_SUITE_ENHANCEMENT_128.md). It keeps the current Qwen3-Omni v6 and Cosmos3 packages as baselines, then defines dense-window scenarios, hierarchical action/subtask targets, task bottlenecks, and experiment cards for stronger selected-128 runs without overwriting earlier results.

### Sample Count Decision

Do not treat "10M" as a reason to start with the entire dataset.
The engineering unit that matters first is diverse held-out episodes, not adjacent windows from one session.

| Phase | Episodes/samples | Approx windows at stride 5 | Purpose |
| --- | ---: | ---: | --- |
| Readiness | 1-3 | 1k-3k | Verify loaders, token alignment, and task heads |
| Pilot | 16-32 | 18k-37k | First held-out-episode evaluation |
| Useful LoRA run | 64-128 | 74k-149k | Train sensor adapters plus selected Qwen3-Omni LoRA |
| Storage-heavy run | 256+ | 297k+ | Only after download layout and checkpoint size are stable |

Use the budget helper before downloading:

```bash
python scripts/omni/plan_finetune_sample_budget.py \
  --storage-root /path/to/storage \
  --target-free-after-download-gb 800 \
  --all-training-per-episode-gb 2.4 \
  --full-preview-per-episode-gb 5.1
```

### Multi-Episode Readiness Gate

```bash
python scripts/omni/discover_xperience10m_sources.py \
  --workspace /path/to/ropedia-xperience-10m-task-suite \
  --data-root /path/to/xperience10m_data \
  --output results/omni_finetune/source_discovery.json
```

Current status in this repo:

- public_sample_valid_episodes: 1 (degraded-valid: annotation + fisheye_cam0.mp4)
- gated_metadata_audit: 12,102 complete visible episodes across 802 complete sessions
- selected_episode_plan: 128 source-balanced episodes, 96/16/16 train/val/test
- selected_download_size: 277.71 GiB excluding `visualization.rrd`
- selected_source_feature_index: `XPERIENCE10M_128_EPISODE_FEATURE_INDEX.md` and `docs/data/xperience10m_128_episode_feature_index.json`
- processed_128_feature_artifacts: 34,269 Qwen3-Omni v6 multiscale windows, 106,095 dense multiscale compact rows, and 34,269 x 394 metadata/text matrix rows, all linked back to official gated `ropedia-ai/xperience-10m` episode paths
- verified_final_diagnostic_package: true
- selected_split: 96 train / 16 validation / 16 held-out test episodes
- exported_windows: 2,848 train / 512 validation / 448 test
- validation_samples_used: 512
- held_out_eval: 448 test windows from 14 exported test episodes
- final_train_loss / final_val_loss: 0.0277 / 0.0278
- current_quality_target: strict-label JSON validity 100.00%, meeting the 98% target; action/subtask quality remains weak
- qwen3_lora_adapter_repo: https://huggingface.co/cy0307/ropedia-qwen3-omni-lora-128ep
- cosmos3_super_lora_adapter_repo: https://huggingface.co/cy0307/ropedia-cosmos3-super-forward-dynamics-lora-128ep
- 128_aligned_baselines: unified 20-task axes for simple and neural baselines, including metadata/text rows and public-safe compact-proxy rows where raw-feature targets are required
- cosmos3_nano: verified Cosmos3-Nano future-window compatibility package, 378 held-out future-window predictions from 14 test episodes
- cosmos3_super_reasoner: verified Cosmos3-Super Reasoner base-weight JSON-task evaluation, 448 held-out predictions from 14 test episodes; JSON validity 51.12%, action macro-F1 0.0008, contact accuracy 32.14%, transition accuracy 36.83%
- cosmos3_super_forward_dynamics_lora: verified 8-GPU FSDP LoRA artifact over camera-pose proxy targets; 2,848 train rows, 512 val rows, 448 test rows, 26.2M adapter parameters, val MSE 4.0082, test MSE 3.6853; public package excludes safetensors
- gated dataset: available for selected multi-episode data preparation
- source_discovery: `results/omni_finetune/source_discovery.json`
- data_status: `results/omni_finetune/DATA_ACCESS_STATUS.md`
- access_status: `results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md`

Use this gate before scheduling any full fine-tune run. The pilot should use balanced held-out selection, not the first paths in repository order. The current 128-episode selection filters for complete leaf episodes, excludes `visualization.rrd`, balances episode-size bands, and preserves one selected episode per top-level session UUID.
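The selection rules in the paragraph above (complete episodes only, one episode per session, balanced size bands, then a fixed 96/16/16 split) can be sketched as follows. This is a simplified illustration under assumed record fields (`session`, `path`, `size_band`, `complete`); it is not the repository's selection script, and the round-robin balancing is one plausible choice among several.

```python
import random
from collections import defaultdict


def select_episodes(episodes, n_select, seed=0):
    """Pick up to n_select complete episodes, one per session, balanced by size band.

    `episodes` is a list of dicts with hypothetical keys:
    'session', 'path', 'size_band', 'complete'.
    """
    rng = random.Random(seed)

    # Step 1: keep only complete episodes and choose one per session.
    by_session = defaultdict(list)
    for ep in episodes:
        if ep["complete"]:
            by_session[ep["session"]].append(ep)
    candidates = [
        rng.choice(sorted(eps, key=lambda e: e["path"]))
        for _, eps in sorted(by_session.items())
    ]

    # Step 2: round-robin over size bands so no band dominates the selection.
    by_band = defaultdict(list)
    for ep in candidates:
        by_band[ep["size_band"]].append(ep)
    for eps in by_band.values():
        rng.shuffle(eps)

    bands = sorted(by_band)
    selected = []
    while len(selected) < n_select and any(by_band[b] for b in bands):
        for band in bands:
            if by_band[band] and len(selected) < n_select:
                selected.append(by_band[band].pop())
    return selected


def split_episodes(selected, n_train=96, n_val=16, n_test=16):
    """Cut an ordered selection into fixed train/val/test episode groups."""
    if len(selected) != n_train + n_val + n_test:
        raise ValueError("selection size does not match the requested split")
    return {
        "train": selected[:n_train],
        "val": selected[n_train:n_train + n_val],
        "test": selected[n_train + n_val:],
    }
```

Because each session contributes at most one episode, no session can appear on both sides of the train/test boundary, which is the property that matters for held-out evaluation.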
### Progressive Train/Validation Pilot

The selected 128-episode plan can be used before every episode has arrived by training only on prepared `train` episodes and monitoring prepared `val` episodes. The final `test` episodes stay sealed until the end, so early development does not contaminate held-out evaluation.

```bash
python scripts/omni/build_selection_episode_manifest.py \
  --workspace /path/to/ropedia-xperience-10m-task-suite \
  --data-root /path/to/xperience10m_128 \
  --selection-json results/omni_finetune/xperience10m_128_episode_selection.json \
  --output results/omni_finetune/trainval_progressive/episode_manifest_trainval.json \
  --include-split train \
  --include-split val
```

`scripts/omni/run_trainval_progressive_128.sh` wraps the same guard, exports a train/val-only Qwen3-Omni JSONL dataset, and launches LoRA training without running final test evaluation. The exporter uses session-qualified episode IDs and path-based split matching so repeated folder names such as `ep1` cannot collide across different sessions.

For larger prepared subsets, `scripts/omni/run_trainval_parallel_export_8gpu.sh` uses the same split guard, exports episodes in parallel CPU shards, skips and reports episodes that contain no labeled windows under the configured label rule, then launches Qwen3-Omni LoRA with `NUM_PROCESSES=8`.

### Full 128-Episode Held-Out Pilot

Once all selected episodes are complete, use the fixed selected-episode split:

- 96 train episodes,
- 16 validation episodes,
- 16 held-out test episodes.
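The collision problem that session-qualified episode IDs address can be shown with a small sketch. It assumes a directory layout in which each episode folder sits somewhere below a per-session folder under the data root; that layout, the function names, and the mapping format are assumptions for illustration, not the exporter's actual implementation.

```python
from pathlib import PurePosixPath


def session_qualified_id(episode_path, data_root):
    """Build an ID from the path relative to the data root.

    Assumes a layout like <data_root>/<session>/.../<episode>, so two
    folders both named 'ep1' in different sessions get different IDs.
    """
    rel = PurePosixPath(episode_path).relative_to(PurePosixPath(data_root))
    if len(rel.parts) < 2:
        raise ValueError(f"expected <session>/<episode> below root, got {rel}")
    return "/".join(rel.parts)


def filter_by_split(episode_paths, split_by_id, data_root, include=("train", "val")):
    """Keep only episodes whose session-qualified ID maps to an included split."""
    kept = []
    for path in episode_paths:
        episode_id = session_qualified_id(path, data_root)
        split = split_by_id.get(episode_id)
        if split in include:
            kept.append((episode_id, split))
    return kept


paths = ["/data/sessA/ep1", "/data/sessB/ep1"]
splits = {"sessA/ep1": "train", "sessB/ep1": "test"}
# Only the train episode survives; the sealed test episode is filtered out.
print(filter_by_split(paths, splits, "/data"))
```

Matching on the bare folder name `ep1` would have merged the train and test episodes in this example, which is exactly the leak the qualified ID avoids.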
The clean full-run launcher validates the selected split, exports all splits in parallel, trains Qwen3-Omni LoRA on train episodes while optionally monitoring validation loss, then evaluates on the held-out test split:

```bash
RUN_ID=xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \
DATA_ROOT=/path/to/xperience10m_128 \
SELECTION_JSON=results/omni_finetune/xperience10m_128_episode_selection.json \
MODEL_DIR=/path/to/Qwen__Qwen3-Omni-30B-A3B-Instruct \
NUM_PROCESSES=8 \
TRAIN_VAL_SPLIT=val \
MAX_VAL_SAMPLES=512 \
scripts/omni/run_128_fullsplit_parallel_export_8gpu.sh
```

The latest verified diagnostic package uses the same selected split and 8-GPU training path, includes the full held-out evaluation with 4,032 predictions and 99.90% JSON validity, and keeps raw data plus full Qwen weights out of the public repos. The next pass should keep this package contract while improving action/subtask target quality and error analysis.

Monitor the run with:

```bash
python scripts/omni/monitor_omni_progress.py \
  --run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu
```

The monitor reads training `progress.jsonl`, new evaluator partial-prediction progress, and legacy generation logs, so long held-out evals can still expose sample-level progress even before final metrics are written.
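A JSON-validity figure like the one quoted above is, in essence, the share of model outputs that parse into the expected structure. The sketch below shows one way to compute such a rate; the required keys (`action`, `subtask`) and the function names are assumptions for illustration, since the evaluator's exact schema is not described here.

```python
import json

# Hypothetical required keys; the real evaluator's schema may differ.
REQUIRED_KEYS = ("action", "subtask")


def is_valid_prediction(raw_text, required_keys=REQUIRED_KEYS):
    """True if raw_text parses as a JSON object containing all required keys."""
    try:
        obj = json.loads(raw_text)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(obj, dict) and all(key in obj for key in required_keys)


def json_validity(raw_predictions, required_keys=REQUIRED_KEYS):
    """Fraction of predictions that are valid; 0.0 for an empty list."""
    if not raw_predictions:
        return 0.0
    valid = sum(is_valid_prediction(text, required_keys) for text in raw_predictions)
    return valid / len(raw_predictions)


outputs = [
    '{"action": "pick", "subtask": "grasp cup"}',
    '{"action": "place", "subtask": "set down"}',
    '{"action": "pour"}',          # missing key
    "action: pick",                # not JSON
]
print(f"JSON validity: {json_validity(outputs):.2%}")
```

A high validity rate only says the outputs are well-formed; it says nothing about whether the predicted labels are correct, which is why the text treats validity and action/subtask quality separately.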
Validate the run artifacts stage by stage:

```bash
python scripts/omni/validate_omni_finetune_run.py \
  --run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \
  --require-stage manifest

python scripts/omni/validate_omni_finetune_run.py \
  --run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \
  --require-stage eval \
  --min-json-validity 0.98
```

After the eval validator passes, create the public-safe result package:

```bash
python scripts/omni/package_verified_omni_result.py \
  --dataset-run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \
  --train-run-id \
  --eval-run-id
```

For long-running remote jobs, the packaging step can be watched automatically:

```bash
python scripts/omni/watch_verified_omni_package.py \
  --dataset-run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \
  --train-run-id \
  --eval-run-id
```

While waiting, the watcher can append `eval_progress_observed` events from partial prediction files or legacy generation logs. This keeps the package status file useful during long held-out evaluations.

The package copies only small derived artifacts such as metrics, predictions, confusion matrices, run reports, manifests, validation summaries, and training metadata. The exact required eval files and primary metrics come from the selected backbone contract in `configs/omni_backbones`, so Qwen3-Omni, Cosmos-style world models, and VLA/policy tracks can share the same verified publication gate once their model-specific evaluators exist. The package excludes raw Xperience-10M files, base-model weights, adapter or checkpoint weights, full checkpoints, and large archives.
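The copy-versus-exclude rule described above can be approximated with a suffix and size filter. This is a sketch under stated assumptions: the forbidden suffix list, the 50 MiB size cap, and the `plan_package` helper are all hypothetical and are not read from `configs/omni_backbones` or the packaging script.

```python
from pathlib import Path

# Hypothetical suffixes standing in for raw media, annotations, weights, and archives.
FORBIDDEN_SUFFIXES = {".mp4", ".hdf5", ".rrd", ".safetensors", ".pt", ".bin", ".tar", ".zip"}

# Hypothetical size cap for "small derived artifacts".
MAX_BYTES = 50 * 1024 * 1024


def is_public_safe(path, max_bytes=MAX_BYTES):
    """True for small files whose suffix is not on the forbidden list."""
    if path.suffix.lower() in FORBIDDEN_SUFFIXES:
        return False
    return path.stat().st_size <= max_bytes


def plan_package(run_dir, max_bytes=MAX_BYTES):
    """Return (to_copy, to_skip) lists of paths relative to run_dir."""
    run_dir = Path(run_dir)
    to_copy, to_skip = [], []
    for path in sorted(run_dir.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(run_dir).as_posix()
        (to_copy if is_public_safe(path, max_bytes) else to_skip).append(rel)
    return to_copy, to_skip
```

Returning the skipped list alongside the copy list makes the exclusion auditable, which fits the document's emphasis on keeping the publication gate inspectable.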
For hardware setups that can run multiple eval workers, the Qwen evaluator also supports deterministic sample shards:

```bash
CUDA_DEVICE_GROUPS="0,1 2,3 4,5 6,7" \
SHARDS=4 \
RUN_ID= \
scripts/omni/run_qwen3_omni_lora_eval_sharded.sh
```

Only the merged eval directory should be validated and reported publicly, because the merger checks coverage and recomputes the metrics from all held-out predictions.

After dataset export, a model-neutral window index can be created for future backbones:

```bash
python scripts/omni/export_model_neutral_window_index.py \
  --dataset-jsonl results/omni_finetune/xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu_dataset/dataset.jsonl
```

This produces `window_index.jsonl` and `window_index_manifest.json` so Cosmos-style world models and VLA/policy tracks can reuse the same split-checked windows without depending on Qwen chat-message records.

### Uploading Qwen3-Omni LoRA artifacts

The public-safe verified package intentionally excludes raw data, base Qwen weights, LoRA weights, and full checkpoints. Adapter upload is a separate step: use it only when the intended adapter directory is present and the model card clearly distinguishes older smoke weights from the final selected-episode diagnostic run.

Keep weight-bearing repositories model-specific: the final 128-episode Qwen3-Omni adapter belongs in `cy0307/ropedia-qwen3-omni-lora-128ep`, older Qwen smoke material remains historical. Cosmos3-Nano remains an artifacts-only compatibility result; Cosmos3-Super Forward-Dynamics now has a separate weight-bearing model repo at `cy0307/ropedia-cosmos3-super-forward-dynamics-lora-128ep`. Metrics, predictions, audits, and reports stay in the artifact dataset.
```bash
python3 scripts/omni/upload_qwen3_omni_lora_to_hf.py \
  --repo-id cy0307/ropedia-qwen3-omni-lora-128ep \
  --source-dir /path/to/adapter_upload_package \
  --message "Upload Xperience-10M Qwen3-Omni LoRA pilot"
```

This script requires a valid Hugging Face token via `HF_TOKEN` or `--token`. Network availability to `huggingface.co` is required.

### Foundation Backbone Plan

The next modeling plan tracks several foundation-model tracks instead of assuming one backbone solves every Xperience-10M objective.

| Branch | Current role | When to use it |
| --- | --- | --- |
| Qwen3-Omni | First trainable multimodal LoRA pilot | Use for the selected 128-episode held-out baseline over video/audio/language plus sensor-bridge features. |
| Cosmos 3 | First world-model/action-generation track | Use now for future-window compatibility analysis and the verified Cosmos3-Super forward-dynamics LoRA artifact; compare its loss metrics separately from Qwen JSON-task accuracy. |
| GR00T | Humanoid/action-policy track | Use after mocap/contact retargeting creates well-defined humanoid action targets. |
| OpenVLA / openpi | Open VLA/policy baselines | Use after the project defines robot-compatible or action-token targets. |
| Gemini Robotics | External reasoning reference | Use only for qualitative comparison or annotation support unless local trainable access exists. |
| Xperience Embodied Foundation Model | Future Xperience-native pretraining goal | Use only after multi-episode pilots, full-corpus storage, distributed training infrastructure, and scaling evidence justify a from-scratch domain model. |

See [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md) and [`docs/data/foundation_model_plan.json`](docs/data/foundation_model_plan.json) for the full selection matrix, source links, and model-specific evaluation additions. See [`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md) for the long-term full-corpus pretraining plan.
The three headline foundation directions are also separated as pipeline tracks so each track is easy to read without mixing current results and future work:

| Pipeline track | First concrete pipeline | Claim boundary |
| --- | --- | --- |
| Spatial intelligence models | Build scene/object memory targets from multiview RGB, depth, pose, calibration, object cues, and language prompts. | Ready as a geometry/reasoning pipeline; the next readout is held-out spatial QA, pose consistency, counting, and scene-memory metrics. |
| Human-video world models | Predict next action, next subtask, future object set, contact transition, and future state from observed interaction windows. | Partially evidenced by future-task probes and Cosmos-style artifacts; visual/latent future quality still needs stronger metrics. |
| Vision-language-action models | Convert egocentric video, captions, hand/body motion, contacts, and objects into action chunks or policy-compatible targets. | Feasible, but gated by action-token conversion, normalization, retargeting evidence, and held-out policy metrics. |

For the single public sample, each direction is now shown as an explicit training-pair recipe:

| Direction | One-sample input | One-sample output target |
| --- | --- | --- |
| Spatial intelligence | 20-frame windows from `windows.csv` / `shared_windows.npz`, joined with six MP4 camera streams plus `annotation.hdf5` depth, pose, SLAM/calibration, object/contact cues, and optional language questions. | Camera-view match, object relevance, object-set memory, depth/pose reconstruction proxy, caption-grounded retrieval, and spatial QA targets. |
| Human-video world model | Current observed window at time `t`: RGB/audio/sensor summaries, hand/body motion, camera pose, current object/contact state, and current action/subtask context only. | Shifted future targets: next action, next subtask, future object set, contact transition, time-to-transition, camera-motion delta, or latent/future feature. |
| Vision-language-action | Egocentric/fisheye video, caption/object context, hand/body mocap, contact state, and current subtask text as observation-language input. | Action-token proxies: current/next action, object-conditioned action relation, contact state, interaction-text class, subtask transition, or hand-trajectory/action-chunk proxy. |

High-resolution slide diagrams for the three tracks are published in [`docs/assets/foundation-pipelines`](docs/assets/foundation-pipelines). Spatial intelligence and human-video world modeling use the clean slide PNGs supplied for publication and are exported as 2560-pixel public images. The 2026-06-19 refresh verified that the latest uploaded Spatial and Human-video PNGs are byte-identical to the committed clean source cache. The VLA card now uses the clean VLA slide PNG supplied afterward and is exported through the same 2560-pixel public path. These images are communication assets, not completed model-quality evidence; the exact task, training, and evaluation contracts remain in the Markdown and JSON files.

**Spatial intelligence models**

![High-resolution slide diagram for the Spatial intelligence models direction](docs/assets/foundation-pipelines/spatial-intelligence-pipeline.png)

**Human-video world models**

![High-resolution slide diagram for the Human-video world models direction](docs/assets/foundation-pipelines/human-video-world-model-pipeline.png)

**Vision-language-action models**

![High-resolution slide diagram for the Vision-language-action models direction](docs/assets/foundation-pipelines/vision-language-action-pipeline.png)

See [`THREE_FOUNDATION_PIPELINES.md`](THREE_FOUNDATION_PIPELINES.md) and [`docs/data/three_foundation_pipelines.json`](docs/data/three_foundation_pipelines.json).

Backbone-specific contracts now live in [`configs/omni_backbones`](configs/omni_backbones).
The extension contract is documented in [`OMNI_MODEL_EXTENSION_CONTRACT.md`](OMNI_MODEL_EXTENSION_CONTRACT.md), and the registry can be checked with:

```bash
python scripts/omni/backbone_registry.py --validate --json
```

Verify that every configured backbone can pass the public-safe packaging contract on synthetic derived artifacts:

```bash
python scripts/omni/smoke_test_backbone_packaging.py
```

After a real held-out package is created, audit it before updating README, website, or Hugging Face pages:

```bash
python scripts/omni/audit_verified_omni_package.py \
  --package-dir results/omni_finetune/verified_public/
```

Create a new planned backbone track from an existing contract template with:

```bash
python scripts/omni/scaffold_omni_backbone.py \
  --template-backbone policy_vla_branch \
  --id new_policy_branch \
  --display-name "New Policy Branch" \
  --model-family "Model family name" \
  --dataset-contract xperience10m_observation_action_v1 \
  --training-objective observation_to_action_policy \
  --checkpoint-gate policy_checkpoint_action_space_and_normalizer \
  --dry-run
```

Each backbone config declares the checkpoint gate, required train/eval files, allowed public artifacts, and forbidden private or heavyweight artifacts. This keeps Qwen3-Omni, Cosmos-style world models, and policy/VLA tracks on the same split, validation, and publication discipline even though their training targets are different.

## Additional Development Directions

Beyond backbone selection and fine-tuning, Xperience-10M supports several concrete research-development tracks:

| Direction | First useful artifact | Role in the project |
| --- | --- | --- |
| Episode taxonomy and data engine | Episode atlas, balance report, and split builder | Select representative data before training. |
| Standardized benchmark protocol | Versioned train/val/test manifests and metric scripts | Make future model results comparable. |
| Multimodal representation learning | Contrastive and masked-window encoder objectives | Learn reusable video/audio/depth/pose/mocap/IMU/language features. |
| Skill and procedure graph mining | Step graph, transitions, preconditions, and effects | Connect perception to planning and long-horizon reasoning. |
| Human-object affordance modeling | Contact, reachable-object, tool-use, and next-affordance tasks | Model what actions the scene makes possible. |
| 3D/4D scene and object memory | Persistent scene/object maps from depth, pose, multiview video, and objects | Track world state beyond single frames. |
| Data-quality and synchronization diagnostics | Per-episode QA for drift, missing streams, calibration, and corrupted files | Keep large multimodal training trustworthy. |
| Policy, retargeting, and simulation transfer | Action-token conversion and robot-compatible imitation examples | Bridge human egocentric experience to robot policy work. |

See [`ADDITIONAL_DEVELOPMENT_DIRECTIONS.md`](ADDITIONAL_DEVELOPMENT_DIRECTIONS.md) and [`docs/data/additional_development_directions.json`](docs/data/additional_development_directions.json).

## Four Research Directions

The walkthrough-backed task contracts are organized against the four Ropedia research directions in a generated artifact, not only in prose:

- [`research_direction_taxonomy.json`](results/episode_task_suite/research_directions/research_direction_taxonomy.json)
- [`research_direction_task_map.csv`](results/episode_task_suite/research_directions/research_direction_task_map.csv)
- [`research_direction_summary.md`](results/episode_task_suite/research_directions/research_direction_summary.md)
- [`docs/data/research_directions.json`](docs/data/research_directions.json)

The taxonomy uses two current baselines for every task:

| Baseline | Role |
| --- | --- |
| Minimal interpretable heads | Softmax, logistic, ridge, and retrieval heads over the 8,546-dimensional multimodal representation. These expose the input/output contract cleanly. |
| Neural MLP heads | Small PyTorch MLP classifiers/regressors on the same features and splits. These check whether nonlinear heads help before moving to Qwen/Omni fine-tuning. |

Current direction-level coverage:

| Direction | Current status | Covered task evidence | What is not solved yet |
| --- | --- | --- | --- |
| A. Human Modeling & Motion Understanding | Partially implemented | Hand Trajectory Forecasting and Contact State Prediction are direct; Action Recognition and Object Relevance Prediction are proxies. Neural MLP improves hand forecasting from `0.8647` to `0.1079` MPJPE. | No full body/shape model, SMPL/MANO target, deformation prior, or multi-episode motion-generation evaluation yet. |
| B. 3D/4D Reconstruction & Neural Rendering | Proxy tasks only | Cross-Modal Retrieval, Cross-Modal Reconstruction, and Multimodal Synchronization Detection test alignment/reconstruction prerequisites. | No NeRF, Gaussian Splatting, TSDF, mesh, novel-view synthesis, or calibrated 4D reconstruction model yet. |
| C. Egocentric Vision & Interaction | Strongest implemented track | 6 direct tasks: action, subtask, transition, next-action, object relevance, and caption grounding, plus alignment/order diagnostics and audio ablation. | Single-episode chronological split limits generalization; stronger audio and video-language backbones still need multi-episode testing. |
| D. Scene Reconstruction & World Modeling | Early proxy tasks | Procedure Step Recognition, Next-Action Prediction, Object Relevance Prediction, Cross-Modal Retrieval, Cross-Modal Reconstruction, Temporal Order Verification, and Multimodal Synchronization Detection provide state/world-model probes. | No persistent scene graph, object permanence task, long-term map, or held-out-episode world model yet. |

The important interpretation is that all four directions can be **started** from the Xperience-10M sample modalities, but only direction C is strongly represented by the current task evidence. Directions A, B, and D need additional targets and multi-episode training before they become full research deliverables.

## Four Direction Probes

Alongside the unified 20-task suite, the repo includes one data-backed probe for each research direction. These probes are computed from the same `shared_windows.npz`, `windows.csv`, and `feature_manifest.json` artifacts, so every reported number traces back to sample-derived features and saved metric artifacts.

- [`research_direction_extension_results.json`](results/episode_task_suite/research_direction_extensions/research_direction_extension_results.json)
- [`research_direction_extension_summary.md`](results/episode_task_suite/research_direction_extensions/research_direction_extension_summary.md)
- [`docs/data/research_direction_extensions.json`](docs/data/research_direction_extensions.json)
- [`research_direction_extension_tasks.svg`](docs/assets/charts/research_direction_extension_tasks.svg)

![Four direction extension probes](docs/assets/charts/research_direction_extension_tasks.svg)

| Direction | New extension task | Input | Output | Minimal | Neural MLP | Why it matters |
| --- | --- | --- | --- | ---: | ---: | --- |
| A. Human Modeling & Motion Understanding | Body and Hand Motion Intensity | non-mocap video/depth/pose/IMU/SLAM/language features | high vs low body/hand motion | `0.7827` macro-F1 | `0.7986` macro-F1 | Starts a human-motion-energy target without leaking mocap input. |
| B. 3D/4D Reconstruction & Neural Rendering | Multi-View Consistency Retrieval | fisheye camera feature query | synchronized stereo-left view rank | `0.5534` MRR | `0.3469` MRR | Tests whether multi-view features preserve synchronized 4D scene identity. |
| C. Egocentric Vision & Interaction | Action Phase Progress Estimation | non-caption multimodal window | progress inside current action segment | `0.3416` MAE | `0.3038` MAE | Adds a task-structure/intent-style target beyond class labels. |
| D. Scene Reconstruction & World Modeling | Short-Horizon Ego-Motion Forecasting | current sensors excluding camera translation and captions | future camera-translation delta | `0.1989` MAE | `0.0989` MAE | Starts a short-horizon world-model target over wearer motion. |

Run:

```bash
python scripts/research_direction_extension_tasks.py
```

These four probes make the four-direction mapping more concrete, but they are still single-episode extension baselines. Full research conclusions still require multi-episode training, held-out episode evaluation, and stronger task-specific models.

## Unified 20-Task Suite

The sample task surface is presented as 20 tasks in one suite. All task rows share the same 20-frame window unit, 5-frame stride, chronological split, and minimal/neural comparison style, with task-specific leakage rules when a target would otherwise leak through caption, object, contact, or future features.

The historical `tier2_task_suite` file and directory names remain only for stable artifact links. They should be read as provenance bundles inside the unified 20-task suite, not as a separate benchmark tier.
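The shared window unit of this suite (20-frame windows, 5-frame stride, chronological split) can be sketched as follows. This is a minimal illustration, not the repository's windowing code: `window_starts` and `chronological_split` are hypothetical helper names, the 70/30 ratio is the one quoted in this README's shared setup, and dropping a trailing partial window is an assumption.

```python
def window_starts(num_frames, window=20, stride=5):
    """Start index of every full window; a trailing partial window is dropped (assumption)."""
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, stride))


def chronological_split(items, train_fraction=0.7):
    """Earliest part of the timeline for training, the rest for testing; no shuffling."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


# Toy episode length chosen for illustration only.
starts = window_starts(num_frames=100)
train, test = chronological_split(starts)
print(f"{len(starts)} windows, first starts {starts[:3]}, last start {starts[-1]}")
print(f"{len(train)} train windows, {len(test)} test windows")
```

Because the stride is shorter than the window, the last training window and the first test window share frames in this sketch. Whether the repository leaves a gap at the cut is not stated here.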
- [`TASK_SUITE_20.md`](TASK_SUITE_20.md)
- [`docs/data/task_suite_20.json`](docs/data/task_suite_20.json)
- [`docs/data/unified_task_model_radar.json`](docs/data/unified_task_model_radar.json)
- [`docs/data/single_episode_task_model_radar.json`](docs/data/single_episode_task_model_radar.json)
- [`docs/data/episode128_task_model_radar.json`](docs/data/episode128_task_model_radar.json)
- [`docs/data/task_method_20_result_matrix.json`](docs/data/task_method_20_result_matrix.json)
- [`docs/data/task_method_20_gap_audit.json`](docs/data/task_method_20_gap_audit.json)
- [`TASK_METHOD_20_GAP_AUDIT.md`](TASK_METHOD_20_GAP_AUDIT.md)
- [`TIER2_TASK_BASELINES.md`](results/episode_task_suite/tier2_task_suite/TIER2_TASK_BASELINES.md)
- [`tier2_task_suite_results.json`](results/episode_task_suite/tier2_task_suite/tier2_task_suite_results.json)
- [`docs/data/tier2_task_suite.json`](docs/data/tier2_task_suite.json)
- [`unified_task_model_radar.svg`](docs/assets/charts/unified_task_model_radar.svg)
- [`single_episode_task_model_radar.svg`](docs/assets/charts/single_episode_task_model_radar.svg)
- [`episode128_task_model_radar.svg`](docs/assets/charts/episode128_task_model_radar.svg)
- [`tier2_task_suite.svg`](docs/assets/charts/tier2_task_suite.svg)

![Unified 20-task model radar](docs/assets/charts/unified_task_model_radar.svg)

![Single-episode 20-task model radar](docs/assets/charts/single_episode_task_model_radar.svg)

![128-episode 20-task model radar](docs/assets/charts/episode128_task_model_radar.svg)

The all-task table, including every input/output contract and minimal/neural metric, is in [`TASK_SUITE_20.md`](TASK_SUITE_20.md). Historical provenance links remain listed above for exact source tracing, but the public task surface should be read as one integrated 20-task suite.
Run:

```bash
/path/to/python-with-h5py scripts/tier2_task_suite.py
```

Regeneration needs either `HOMIE-toolkit` or an environment with `h5py` because the interaction/object targets come from the raw public-sample `annotation.hdf5`. The raw HDF5 and MP4 files remain excluded from the public repo and Hugging Face mirrors.

## Task Walkthroughs For Juniors

Every task now has a beginner-facing explanation with:

- a concrete coffee-episode case study,
- exact input contract,
- middle process modules,
- output contract,
- minimal and neural metric,
- one important limitation.

Primary files:

- [`TASK_WALKTHROUGHS.md`](results/episode_task_suite/task_walkthroughs/TASK_WALKTHROUGHS.md)
- [`task_walkthroughs.json`](results/episode_task_suite/task_walkthroughs/task_walkthroughs.json)
- [`docs/data/task_walkthroughs.json`](docs/data/task_walkthroughs.json)
- [`docs/data/task_surface_integrity.json`](docs/data/task_surface_integrity.json)

Compact map:

| Task | Case study | Input -> process -> output |
| --- | --- | --- |
| Action Recognition | A pouring window should be named as the current action. | all-modality window -> action label builder + classifier -> action class |
| Procedure Step Recognition | A fine action is grouped into a broader drink-preparation stage. | all-modality window -> subtask label builder + classifier -> subtask label |
| Action Boundary Detection | Detect the change from preparing to pouring. | window -> boundary builder + binary classifier -> boundary/steady |
| Next-Action Prediction | A preparing window predicts what happens 20 frames later. | current window -> future-label shift + classifier -> next action |
| Hand Trajectory Forecasting | A hand moving toward a cup becomes a future 3D hand path. | current window -> future mocap target + regressor -> hand trajectory |
| Contact State Prediction | Decide whether hand/body contact is happening. | non-contact features -> contact target + binary classifier -> contact label |
| Object Relevance Prediction | Infer milk, cup, coffee, or related objects during pouring. | non-caption features -> multi-hot object target + sigmoid heads -> object set |
| Language Grounding | Query "Pour milk into coffee" and retrieve the matching moment. | text-like query + candidates -> projection + cosine ranker -> ranked windows |
| Cross-Modal Retrieval | Motion/IMU from pouring retrieves matching depth/video. | motion/IMU/camera -> projection + candidate index -> ranked depth/video windows |
| Cross-Modal Reconstruction | Infer depth/video features from motion, IMU, and camera pose. | source modalities -> scaler + regressor -> target modality vector |
| Temporal Order Verification | Tell whether reaching then pouring was reversed. | adjacent window pair -> pair combiner + binary classifier -> correct/reversed |
| Multimodal Synchronization Detection | Catch motion paired with visual/depth features shifted in time. | motion side + visual side -> aligned/shifted pair builder + classifier -> aligned/shifted |

## Core Architecture Families in the 20-Task Suite

These are deliberately minimal baselines. They are useful because every input/output contract is explicit, not because they are strong embodied-AI models.
Shared setup:

```text
raw episode
  -> 20-frame windows, stride 5
  -> 8,546-dimensional multimodal representation

chronological split: first 70% train, last 30% test
scalers are fit on train windows only
```

There are four reusable head families:

| Head family | Used by | What it means |
| --- | --- | --- |
| Linear softmax classifier | Action Recognition, Procedure Step Recognition, Action Boundary Detection, Next-Action Prediction, Contact State Prediction, Temporal Order Verification, Multimodal Synchronization Detection | z-score features, then `XW+b`, softmax, cross-entropy, L2 |
| Dual ridge regression/projection | Hand Trajectory Forecasting, Cross-Modal Reconstruction | z-score input/target, solve ridge regression with L2=10 |
| Ridge + cosine ranking | Language Grounding, Cross-Modal Retrieval | project one modality into another feature space, then rank candidates by cosine |
| Multi-label logistic regression | Object Relevance Prediction | z-score non-caption features, sigmoid object heads, threshold at 0.5 |

The optional neural run keeps the same window representation, leakage filters, chronological splits, and metrics, but replaces the task heads with small PyTorch MLP classifiers or regressors. Its outputs live under [`results/episode_task_suite/neural_mlp/`](results/episode_task_suite/neural_mlp/), and the rollup is stored in the `neural_tasks` section of [`results/episode_task_suite/summary_report.json`](results/episode_task_suite/summary_report.json).
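To illustrate the shared setup and the "Ridge + cosine ranking" head family above, the sketch below z-scores features with train-only statistics, fits a ridge projection with L2=10, and ranks test candidates by cosine similarity. It is a sketch under stated assumptions rather than the repository's implementation: all function names are hypothetical, the features are synthetic random data, and the primal closed form is used for brevity (the table's "dual" label suggests the repository may solve the equivalent dual system, which yields the same weights).

```python
import numpy as np


def zscore_fit(x):
    """Mean and std from training rows only; constant columns get std 1."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0
    return mean, std


def ridge_fit(x, y, l2=10.0):
    """Closed-form ridge weights: W = (X^T X + l2 * I)^-1 X^T Y."""
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)


def cosine_rank(queries, candidates):
    """For each query row, candidate indices sorted from most to least similar."""
    q = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + 1e-12)
    c = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(q @ c.T), axis=1)


def mean_reciprocal_rank(ranking):
    """MRR when the correct candidate for query i is candidate i."""
    hits = ranking == np.arange(ranking.shape[0])[:, None]
    ranks = hits.argmax(axis=1) + 1
    return float((1.0 / ranks).mean())


# Synthetic stand-in features: the real pipeline uses episode windows, not random data.
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 16))            # stands in for motion/IMU/camera features
target = source @ rng.normal(size=(16, 8))     # stands in for depth/video features
target += 0.1 * rng.normal(size=target.shape)

cut = int(0.7 * len(source))                   # chronological split, no shuffling
x_mean, x_std = zscore_fit(source[:cut])       # scalers fit on train windows only
y_mean, y_std = zscore_fit(target[:cut])
x_train, x_test = (source[:cut] - x_mean) / x_std, (source[cut:] - x_mean) / x_std
y_train, y_test = (target[:cut] - y_mean) / y_std, (target[cut:] - y_mean) / y_std

weights = ridge_fit(x_train, y_train, l2=10.0)
ranking = cosine_rank(x_test @ weights, y_test)
mrr = mean_reciprocal_rank(ranking)
print(f"MRR on synthetic data: {mrr:.3f}")
```

The synthetic target is an almost linear function of the source, so the score here says nothing about the real episode; the sketch only shows where the train-only scaler, the L2 term, and the cosine ranker sit relative to each other.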
The walkthrough-backed task heads are:

| Task | Input | Minimal head | Output |
| --- | --- | --- | --- |
| Action Recognition | all featurized modalities | linear softmax | current action class |
| Procedure Step Recognition | all featurized modalities | linear softmax | current subtask class |
| Action Boundary Detection | all featurized modalities | linear softmax | steady vs action boundary |
| Next-Action Prediction | all featurized modalities at `t` | linear softmax | action at `t+20` frames |
| Hand Trajectory Forecasting | all featurized modalities at `t` | ridge regression | future 10-frame left/right hand joints |
| Contact State Prediction | non-contact and non-caption signals | linear softmax | any body contact |
| Object Relevance Prediction | non-caption signals | multi-label logistic | relevant object set |
| Language Grounding | sensor windows projected to text space | ridge projection + cosine ranking | matching time window for text query |
| Cross-Modal Retrieval | motion/IMU/camera projected to visual space | ridge projection + cosine ranking | matching depth/video window |
| Cross-Modal Reconstruction | motion/IMU/camera | ridge regression | compressed depth/video target |
| Temporal Order Verification | `[x_t, x_{t+1}, x_{t+1} - x_t]` | binary linear softmax | correct vs reversed order |
| Multimodal Synchronization Detection | motion plus visual pair | binary linear softmax | aligned vs shifted by 8 windows |

## Key Results

| Experiment | Main score | Accuracy | Notes |
| --- | ---: | ---: | --- |
| Motion-only action | 0.9688 macro-F1 | 0.9828 | Uses motion/IMU features only |
| Current all-feature action | 0.9829 macro-F1 | 0.9863 | 8,546-dimensional multimodal representation |
| Motion-only subtask | 0.9528 macro-F1 | 0.9759 | Strong within-episode subtask signal |
| Current all-feature subtask | 0.9173 macro-F1 | 0.9828 | High accuracy, lower class-balanced score |
| Cross-modal retrieval | 0.3678 top-5 | n/a | Motion/IMU/camera/audio retrieves matching depth/video |
| Transition detection | 0.6118 macro-F1 | 0.9080 | Boundary F1 is 0.1250 |
| Hand trajectory forecast | 0.8647 MPJPE | n/a | Predicts future hand-joint trajectory |
| Neural MLP hand forecast | 0.1079 MPJPE | n/a | Same features/split, nonlinear regression head |
| Neural MLP temporal order | 0.8520 F1 | 0.8578 | Strong improvement on adjacent-window ordering |
| Neural MLP misalignment | 0.7153 F1 | 0.7009 | Detects shifted motion/visual/audio pairs better than the linear head |
| Audio ablation | +0.0418 mean delta | n/a | Current audio variant improves the primary metric on 6 walkthrough-backed task contracts |
| Alternate audio representation | +0.0936 mean delta | n/a | Alternate audio-window representation improves over the baseline audio variant on 6 walkthrough-backed task contracts |

## Audio Contribution Study

The audio ablation keeps the same windows and task labels, then compares input variants under the same chronological split. The script [`scripts/audio_ablation_and_raw_upgrade.py`](scripts/audio_ablation_and_raw_upgrade.py) reuses the real task-suite windows and evaluates six variants for every task: current inputs, no audio, audio-only, alternate audio-only, audio representation replacement, and all inputs plus the alternate audio representation.
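One plausible reading of a "mean delta" readout is a per-task difference in the primary metric between two input variants, averaged over tasks. The sketch below shows that arithmetic only. The variant keys, task names, and scores are placeholders invented for illustration and are not taken from the results files, and the code assumes higher is better for every metric, which would not hold for error metrics such as MPJPE or MAE unless their sign is flipped first.

```python
def audio_deltas(scores, with_audio="current", without_audio="no_audio"):
    """Per-task metric difference between two input variants (positive = audio helps)."""
    return {
        task: variants[with_audio] - variants[without_audio]
        for task, variants in scores.items()
    }


def summarize(deltas):
    """Number of tasks with a positive delta and the mean delta across all tasks."""
    improved = sum(1 for d in deltas.values() if d > 0)
    mean_delta = sum(deltas.values()) / len(deltas)
    return improved, mean_delta


# Placeholder numbers for illustration only; they are not results from this project.
toy_scores = {
    "task_a": {"current": 0.60, "no_audio": 0.50},
    "task_b": {"current": 0.40, "no_audio": 0.45},
    "task_c": {"current": 0.70, "no_audio": 0.70},
}

deltas = audio_deltas(toy_scores)
improved, mean_delta = summarize(deltas)
print(f"{improved} / {len(deltas)} tasks improved, mean delta {mean_delta:+.4f}")
```

A mean delta can be positive even when only some tasks improve, which is why the count of improved tasks is worth reading alongside it.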
The measured single-episode result is task-specific:

| Readout | Value |
| --- | ---: |
| Tasks where current audio improves the primary metric | 6 / 12 original contracts |
| Mean current-audio delta | +0.0418 |
| Tasks where alternate audio representation improves over baseline audio | 6 / 12 original contracts |
| Mean alternate-representation delta vs baseline audio | +0.0936 |

Full files:

- [`results/audio_ablation/AUDIO_ABLATION_SUMMARY.md`](results/audio_ablation/AUDIO_ABLATION_SUMMARY.md)
- [`results/audio_ablation/audio_ablation_metrics.csv`](results/audio_ablation/audio_ablation_metrics.csv)
- [`results/audio_ablation/audio_delta_summary.csv`](results/audio_ablation/audio_delta_summary.csv)
- [`docs/data/audio_ablation_summary.json`](docs/data/audio_ablation_summary.json)
- [`docs/assets/charts/audio_ablation_delta.svg`](docs/assets/charts/audio_ablation_delta.svg)

## Neural MLP Results

The neural baseline was run locally with `--include-neural` for the original core task contracts using 80 epochs, hidden size 128, batch size 128, and CPU execution. It is not a foundation model result; it is a controlled nonlinear-head comparison over the same 8,546-dimensional multimodal representation.
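For orientation, a small PyTorch classifier head with the stated hyperparameters (80 epochs, hidden size 128, batch size 128, CPU) might look like the sketch below. Only those four settings come from the text above. The single hidden layer, ReLU activation, Adam optimizer, learning rate, and the helper names `make_mlp_head` and `train_head` are assumptions for illustration, and the data is synthetic.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def make_mlp_head(in_dim, num_classes, hidden=128):
    """One hidden layer with ReLU is an assumption; only the hidden size is stated."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, num_classes),
    )


def train_head(model, features, labels, epochs=80, batch_size=128, lr=1e-3):
    """Mini-batch training on CPU; Adam and the learning rate are assumptions."""
    # Shuffling reorders training windows only; it does not mix train and test.
    loader = DataLoader(TensorDataset(features, labels), batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for batch_x, batch_y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(batch_x), batch_y)
            loss.backward()
            optimizer.step()
    return model


# Synthetic stand-in: 256 windows with 32 features (the real representation has 8,546).
torch.manual_seed(0)
features = torch.randn(256, 32)
labels = (features[:, 0] > 0).long()           # toy binary target

model = train_head(make_mlp_head(in_dim=32, num_classes=2), features, labels)
model.eval()
with torch.no_grad():
    train_accuracy = (model(features).argmax(dim=1) == labels).float().mean().item()
print(f"train accuracy on synthetic data: {train_accuracy:.3f}")
```

In the real comparison the head would consume the z-scored window features and be scored on the held-out chronological test segment, not on its own training rows as in this toy check.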
| Task | Neural metric | Minimal metric | Readout |
| --- | ---: | ---: | --- |
| Action Recognition | 0.0148 macro-F1 | 0.0500 macro-F1 | Still blocked by unseen future classes |
| Procedure Step Recognition | 0.0281 macro-F1 | 0.0506 macro-F1 | Same single-episode split limitation |
| Action Boundary Detection | 0.5862 macro-F1 | 0.6118 macro-F1 | Similar to the linear baseline |
| Next-Action Prediction | 0.0419 macro-F1 | 0.0593 macro-F1 | Same unseen-label issue |
| Hand Trajectory Forecasting | 0.1079 MPJPE | 0.8647 MPJPE | Neural regression improves this target |
| Contact State Prediction | 1.0000 macro-F1 | 1.0000 macro-F1 | Degenerate one-class sample |
| Object Relevance Prediction | 0.1679 micro-F1 | 0.1803 micro-F1 | Similar weak object signal |
| Language Grounding | 0.0168 MRR | 0.0160 MRR | Similar ranking behavior |
| Cross-Modal Retrieval | 0.1300 MRR | 0.2693 MRR | Linear ridge remains stronger here |
| Cross-Modal Reconstruction | -0.0102 R2 | -0.0153 R2 | Small improvement but still weak |
| Temporal Order Verification | 0.8520 F1 | 0.5400 F1 | Neural head captures local temporal structure |
| Multimodal Synchronization Detection | 0.7153 F1 | 0.5052 F1 | Neural head improves alignment detection |

The strongest single-episode self-supervised signal is cross-modal retrieval: motion/IMU/camera/audio features retrieve matching depth/video windows substantially better than random.

## Single-Episode Diagnostics and Explorer

While waiting for broader Xperience-10M access, the repo now includes an artifact-driven diagnostics pass over the public sample episode:

- `results/single_episode_diagnostics/object_labels/window_object_labels.csv` exports 1,161 real window-level object-label sets from `annotation.hdf5`.
- `results/single_episode_diagnostics/modality_ablation/ablation_metrics.csv` recomputes all 96 task/modality cells, including object relevance.
- `results/single_episode_diagnostics/timeline_overlay/timeline_overlay.csv` aligns 2,079 existing prediction rows back to the episode timeline.
- `results/single_episode_diagnostics/alignment_stress/alignment_shift_metrics.csv` evaluates cross-modal retrieval under explicit time shifts.
- `docs/single_episode_explorer.html` is a static interactive page for inspecting window labels, objects, predictions, modality statistics, and diagnostic scores.

These are single-episode research diagnostics. They are useful for studying task definitions, feature behavior, and model errors before scaling to more episodes; they are not reported as multi-episode benchmark results.

## Reproducibility Check

I re-ran the full pipeline from the local raw public sample into a temporary local workspace and compared regenerated metrics with the committed artifacts. The baseline metrics, task metrics, feature manifest, and available modality manifest matched exactly after float normalization.

See [`notes/reproducibility_audit.md`](notes/reproducibility_audit.md) for the commands and verification evidence.

## Why Some Scores Are Low

The task suite intentionally uses a chronological split:

```text
first 70% of the episode -> train
last 30% of the episode -> test
```

The test segment contains some action/subtask labels never seen during training. Timeline and next-action classifiers therefore expose the core limitation of single-episode learning instead of hiding it behind random splits.

## Modalities Used

The current public-sample pipeline uses:

- hand/body mocap joints and contact labels,
- camera translation and rotation,
- IMU acceleration and gyroscope traces,
- depth confidence features,
- six video streams,
- audio from the sample MP4 stream,
- caption/object/interaction text features,
- SLAM point-cloud summary features,
- calibration parameters.
The full technical source manifest is stored in [`results/episode_task_suite/feature_manifest.json`](results/episode_task_suite/feature_manifest.json).

## Data Notice

Xperience-10M data belongs to its original authors and is subject to the official Ropedia dataset license and access terms. This repo contains code and derived experiment artifacts only; it does not redistribute the raw videos or raw annotation dataset.