	# Ropedia Xperience-10M Task Suite

	[![Website](https://img.shields.io/badge/site-GitHub%20Pages-1f63e9)](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/)
	[![HF Space](https://img.shields.io/badge/Hugging%20Face-Space-ffb000)](https://huggingface.co/spaces/cy0307/ropedia-xperience-10m-task-suite)
	[![Dataset](https://img.shields.io/badge/dataset-Xperience--10M%20by%20Ropedia-008b9a)](https://huggingface.co/datasets/ropedia-ai/xperience-10m)
	[![GitHub Package](https://img.shields.io/badge/package-GHCR-2496ed)](https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite/pkgs/container/ropedia-xperience-10m-task-suite)
	[![Scope](https://img.shields.io/badge/scope-single%20public%20sample-b65b04)](#scope)
	[![Citation](https://img.shields.io/badge/citation-CFF-7ae5c3)](CITATION.cff)
	[![License](https://img.shields.io/badge/license-code%20MIT%20%2B%20data%20terms-ccffa0)](LICENSE)

	<p align="center">
	<img src="docs/assets/brand/xperience10m-logo-social-card.png" alt="Ropedia Xperience-10M Task Suite logo card" width="760">
	</p>

	This is a research-development project built on the public Xperience-10M sample episode
	released by Ropedia. The goal is to make one richly multimodal egocentric
	episode understandable, turn it into concrete embodied-AI task definitions, and
	prepare the same pipeline for future held-out multi-episode training.

	The central research questions are:

	- What can be learned from one aligned Xperience-10M episode while separating
	sample-specific observations from later multi-episode questions?
	- Which input/output tasks are meaningful for embodied AI when video, depth,
	pose, mocap, IMU, and language annotations are synchronized?
	- What baseline models and evaluation files should exist before scaling to
	Qwen3-Omni or other multimodal foundation-model fine-tuning?

	## Why This Project Exists

	This project is organized as a compact research artifact around Xperience-10M:
	start from a real public episode, make every modality and label path inspectable,
	turn the data into concrete embodied-AI tasks, and keep the evaluation boundary
	clear while preparing the next multi-episode experiments. The emphasis is on
	research judgment as much as implementation: what the sample can show, what it
	cannot show, and what evidence should exist before claiming model quality.

	The work is designed to demonstrate four capabilities that matter for
	embodied-AI research infrastructure:

	\| Capability \| What this project shows \|
	\| --- \| --- \|
	\| Multimodal data understanding \| Parses the public sample into synchronized windows across video, audio, depth, pose/SLAM, mocap, IMU, calibration, and language-derived signals \|
	\| Task design \| Defines 12 human-readable tasks plus four direction-extension probes with inputs, outputs, process modules, metrics, and case-study walkthroughs \|
	\| Model and evaluation discipline \| Runs minimal and compact neural baselines, records predictions/metrics, keeps chronological split boundaries explicit, and separates sample evidence from held-out claims \|
	\| Scale-up planning \| Connects the public-sample pipeline to 32/128-episode held-out pilots, Qwen3-Omni LoRA, Cosmos-style world-model branches, policy-model branches, and the future Xperience-native foundation-model pretraining goal \|

	## Start Here

	For a first pass, use [`PROJECT_BRIEF.md`](PROJECT_BRIEF.md) or the
	machine-readable [`docs/data/project_brief.json`](docs/data/project_brief.json).
	They give the project shape in one page: what exists now, what the public
	sample can support, where the 12 tasks and baselines live, and what must happen
	before the multi-episode omni-model stage becomes a real held-out evaluation.

	\| Reader goal \| Best entry point \|
	\| --- \| --- \|
	\| Understand the whole project quickly \| [`PROJECT_BRIEF.md`](PROJECT_BRIEF.md) \|
	\| See the visual research dashboard \| [GitHub Pages dashboard](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/) \|
	\| Navigate the 12 tasks, four tracks, and scale-up plan \| [Interactive research roadmap](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/research_roadmap.html), [`docs/data/research_roadmap_interactive.json`](docs/data/research_roadmap_interactive.json) \|
	\| Compare current task metrics \| [`RESEARCH_TAKEAWAYS.md`](RESEARCH_TAKEAWAYS.md), [`docs/data/summary_metrics.json`](docs/data/summary_metrics.json) \|
	\| Compare possible foundation backbones \| [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md), [`docs/data/foundation_model_plan.json`](docs/data/foundation_model_plan.json) \|
	\| Understand the future native pretraining goal \| [`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md) \|
	\| See additional concrete project directions \| [`ADDITIONAL_DEVELOPMENT_DIRECTIONS.md`](ADDITIONAL_DEVELOPMENT_DIRECTIONS.md), [`docs/data/additional_development_directions.json`](docs/data/additional_development_directions.json) \|
	\| Understand one model input \| [`results/episode_task_suite/feature_manifest.json`](results/episode_task_suite/feature_manifest.json), [`results/episode_task_suite/windows.csv`](results/episode_task_suite/windows.csv) \|
	\| Check multi-episode data status \| [`results/omni_finetune/DATA_ACCESS_STATUS.md`](results/omni_finetune/DATA_ACCESS_STATUS.md) \|

	## Research Project Overview

	\| Theme \| Current implementation \|
	\| --- \| --- \|
	\| Dataset slice \| One public Xperience-10M sample episode, 5,821 frames, 1,161 windows, and an 8,546-dimensional representation \|
	\| Modalities \| Video, audio, depth, camera pose/SLAM, hand/body mocap, IMU, calibration, and language annotations \|
	\| Task suite \| 12 human-readable embodied-AI task contracts with input, process, output, metrics, predictions, and case-study walkthroughs \|
	\| Baselines \| Minimal linear/ridge/logistic heads plus compact PyTorch MLP task heads over the same chronological split \|
	\| Research directions \| Task mapping and extension probes for human modeling, 3D/4D reconstruction, egocentric interaction, and world modeling \|
	\| Scale-up path \| A first selected-episode Qwen3-Omni LoRA diagnostic pilot has completed on the 96/16/16 split; it proves the multi-episode export/train/eval/package loop, but the weak held-out metrics make it a baseline for error analysis rather than a strong model. Cosmos 3/world-model and VLA/policy branches reuse the same split and package contract after their targets are implemented. \|
	\| Public surfaces \| GitHub repo, GitHub Pages dashboard, GHCR static-site package, HF Space, HF artifact dataset, HF baseline-model repo, and HF collection \|

	For the fastest interpretation of the current metrics, start with
	[`RESEARCH_TAKEAWAYS.md`](RESEARCH_TAKEAWAYS.md) and
	[`docs/data/research_takeaways.json`](docs/data/research_takeaways.json).
	They summarize what the public sample results actually show: class shift under
	chronological splits, neural gains on dynamics/order/alignment, harder
	retrieval/reconstruction probes, and why the next model-quality step needs
	held-out episodes.

	Current contributions:

	- manifested sliding-window features over the currently extracted modalities,
	- motion-only and current all-feature baseline models,
	- 12 end-to-end episode-level tasks,
	- lightweight neural MLP heads for the same 12 task contracts,
	- a generated four-direction research taxonomy matching the Ropedia job tracks,
	- four additional direction-extension probes with minimal and neural baselines,
	- human-readable research task cards and an interactive scrub/play walkthrough storyboard for every task,
	- an interactive research roadmap connecting 12 tasks, four research tracks, current sample evidence, the Qwen3-Omni scale-up path, and foundation-model branch selection,
	- a next-milestone track for Qwen3-Omni fine-tuning, Cosmos 3 world modeling, and sensor-bridge evaluation,
	- a future pretraining plan for an Xperience Embodied Foundation Model over the full corpus after smaller multi-episode stages prove value,
	- metrics, predictions, model weights, manifests, charts, and a two-level
	tabbed static research website,
	- a clear explanation of what is implemented now and what moves to the multi-episode stage.

	## Current Research Scope

	This project is best read as a staged embodied-AI research study:

	\| Layer \| Current scope \| Where to start \|
	\| --- \| --- \| --- \|
	\| Data understanding \| One public Xperience-10M sample episode is converted into 5,821 frames, 1,161 aligned windows, and an 8,546-dimensional multimodal representation. \| [`PROJECT_BRIEF.md`](PROJECT_BRIEF.md), [`PROJECT_STATUS.md`](PROJECT_STATUS.md) \|
	\| Task suite \| Twelve human-readable tasks cover action, procedure, contact, object, language, retrieval, reconstruction, order, and synchronization questions. \| [`RESEARCH_TAKEAWAYS.md`](RESEARCH_TAKEAWAYS.md), [`results/episode_task_suite/summary_report.json`](results/episode_task_suite/summary_report.json) \|
	\| Baselines \| Minimal heads and compact PyTorch MLP heads provide a first controlled comparison on the same chronological split. \| [`results/episode_task_suite/neural_mlp/`](results/episode_task_suite/neural_mlp/) \|
	\| Diagnostics \| Audio contribution, modality ablations, timeline overlays, object labels, and alignment stress tests show which signals are useful and which tasks remain hard. \| [`results/audio_ablation/AUDIO_ABLATION_SUMMARY.md`](results/audio_ablation/AUDIO_ABLATION_SUMMARY.md), [`docs/single_episode_explorer.html`](docs/single_episode_explorer.html) \|
	\| Scale-up \| The selected 128-episode Qwen3-Omni LoRA diagnostic pilot has a verified validation-aware held-out package: 96/16/16 selected episodes, 3,808 exported windows, 512 validation windows, 448 held-out test windows, and public-safe metrics/predictions. JSON validity is 87.50%, below the 98% target, so the next pass focuses on structured-output reliability and task-quality error analysis. \| [`RESEARCH_ROADMAP.md`](RESEARCH_ROADMAP.md), [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md), [`docs/data/omni_finetune_verified_result.json`](docs/data/omni_finetune_verified_result.json), [`results/omni_finetune/verified_public/`](results/omni_finetune/verified_public/) \|
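
The JSON-validity figure in the scale-up row can be read as a parse rate over model outputs. The sketch below is a minimal illustration of such a check, assuming "valid" means "parses as a top-level JSON object". The function name and the example strings are hypothetical, and the pilot's actual validity definition is not specified in this README.

```python
import json


def json_validity(outputs):
    """Return the fraction of model outputs that parse as a JSON object."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            continue
        # Assumption: only top-level objects count as structured task answers.
        if isinstance(parsed, dict):
            valid += 1
    return valid / len(outputs)


# Made-up outputs for illustration only (not taken from the pilot predictions).
example_outputs = [
    '{"action": "pour"}',
    '{"action": "stir", "contact": true}',
    '{"action": "cut"}',
    '{"subtask": "prepare"}',
    '{"action": "wipe"}',
    '{"action": "open"}',
    '{"action": "close"}',
    '{"action": "pour"',  # truncated generation, fails to parse
]
rate = json_validity(example_outputs)
print(f"JSON validity: {rate:.2%} (target 98%)")
```

Counting only top-level objects is a stricter choice than "any parseable JSON". A bare string or number parses, but would not be usable as a structured task answer.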

	Detailed dataset notes, reproduction checks, and generated JSON reports are
	included for readers who want to inspect the implementation, but they are
	supporting materials rather than the main reading path. Use
	[`ARTIFACT_GUIDE.md`](ARTIFACT_GUIDE.md) when you want the full file map.

	## Project Status

	If you only have one minute, use
	[`PROJECT_STATUS.md`](PROJECT_STATUS.md) and
	[`docs/data/project_status.json`](docs/data/project_status.json).
	They give the current research state in one compact table:

	\| Area \| Current decision \|
	\| --- \| --- \|
	\| Public-sample pipeline \| Verified on one public sample episode: 5,821 frames, 1,161 windows, 8,546 dimensions \|
	\| 12-task suite \| Verified minimal baselines with committed metrics, predictions, and manifests \|
	\| Neural heads \| Verified compact PyTorch MLP heads over the same task contracts and chronological splits \|
	\| Dataset context \| Official Xperience-10M links, sample-vs-gated-data boundary, modality coverage, and redistribution policy are documented \|
	\| Evaluation protocol \| Verified generated protocol for windowing, split policy, leakage controls, and per-task metrics \|
	\| Website and Hub pages \| Public dashboard, Hugging Face Space, artifact dataset, baseline model repo, and collection use the same project framing and links \|
	\| Qwen3-Omni multi-episode pilot \| Verified diagnostic result package exists for the selected 96/16/16 episode split; current held-out metrics are weak, and JSON validity (87.50%) is below the 98% target \|
	\| Raw Xperience-10M data / full Qwen weights \| Not redistributed \|

	## 90-Second Research Project Path

	If you are reading the project cold, open these in order:

	\| Step \| Question \| Primary artifacts \| What should be true \|
	\| --- \| --- \| --- \| --- \|
	\| 1 \| What is this project? \| [`PROJECT_BRIEF.md`](PROJECT_BRIEF.md), [`PROJECT_STATUS.md`](PROJECT_STATUS.md), [dashboard](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/) \| A public-sample Xperience-10M research project with 12 tasks, baselines, and a scale-up plan. \|
	\| 2 \| What data is used? \| [`XPERIENCE10M_DATASET_CARD_ALIGNMENT.md`](XPERIENCE10M_DATASET_CARD_ALIGNMENT.md), [official HF dataset](https://huggingface.co/datasets/ropedia-ai/xperience-10m), [sample HF dataset](https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample) \| The implemented suite uses one public sample episode; the gated dataset is reserved for selected multi-episode training. \|
	\| 3 \| What does one model input contain? \| [`windows.csv`](results/episode_task_suite/windows.csv), [`feature_manifest.json`](results/episode_task_suite/feature_manifest.json), [`available_modalities.json`](results/episode_task_suite/available_modalities.json) \| Each window is an aligned multimodal unit with video, audio, depth, pose/SLAM, mocap, IMU, calibration, and language-derived signals. \|
	\| 4 \| What are the 12 tasks? \| [`results/episode_task_suite/task_walkthroughs/`](results/episode_task_suite/task_walkthroughs/), [`docs/data/task_walkthroughs.json`](docs/data/task_walkthroughs.json) \| Every task has a human-readable name, case study, input, process modules, output, metric, and limitation. \|
	\| 5 \| How are tasks evaluated? \| [`EVALUATION_PROTOCOL.md`](EVALUATION_PROTOCOL.md), [`docs/data/evaluation_protocol.json`](docs/data/evaluation_protocol.json) \| The window unit, chronological split, leakage controls, task metrics, and current limitations are explicit. \|
	\| 6 \| What do the current results mean? \| [`RESEARCH_TAKEAWAYS.md`](RESEARCH_TAKEAWAYS.md), [`docs/data/research_takeaways.json`](docs/data/research_takeaways.json), [`docs/data/summary_metrics.json`](docs/data/summary_metrics.json) \| Current metrics describe sample-level task behavior and identify which signals need larger held-out experiments. \|
	\| 7 \| Which models are implemented? \| [`results/episode_task_suite/summary_report.json`](results/episode_task_suite/summary_report.json), [`results/episode_task_suite/neural_mlp/`](results/episode_task_suite/neural_mlp/), [HF baseline repo](https://huggingface.co/cy0307/ropedia-xperience-10m-task-baselines) \| Each task has minimal and neural-head evidence over the same feature windows. \|
	\| 8 \| What research directions does this support? \| [`RESEARCH_ROADMAP.md`](RESEARCH_ROADMAP.md), [`docs/data/research_directions.json`](docs/data/research_directions.json), [`docs/data/research_direction_extensions.json`](docs/data/research_direction_extensions.json) \| The tasks are mapped to human modeling, 3D/4D reconstruction, egocentric interaction, and world modeling. \|
	\| 9 \| Which foundation model comes next? \| [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md), [`docs/data/foundation_model_plan.json`](docs/data/foundation_model_plan.json), [`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md) \| Qwen3-Omni is the first held-out LoRA baseline; Cosmos 3 is the first world-model branch; policy models wait for explicit action targets; Xperience-native pretraining is the full-corpus future goal. \|
	\| 10 \| How do I reproduce it? \| [`REPRODUCIBILITY.md`](REPRODUCIBILITY.md), [`notes/reproducibility_audit.md`](notes/reproducibility_audit.md) \| Public commands and expected outputs are documented for the sample-episode task suite. \|
	\| 11 \| What is still pending? \| [`docs/data/omni_finetune_verified_result.json`](docs/data/omni_finetune_verified_result.json), [`DATA_ACCESS_STATUS.md`](results/omni_finetune/DATA_ACCESS_STATUS.md), [`MULTI_EPISODE_ACCESS_STATUS.md`](results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md) \| The first held-out diagnostic pilot is verified; strong model quality remains pending because JSON validity is 87.50% and action/subtask metrics remain weak. \|

	A compact reader-path summary is available at
	[`docs/data/project_packet.json`](docs/data/project_packet.json).

	## Supporting Files

	[`ARTIFACT_GUIDE.md`](ARTIFACT_GUIDE.md) is the human-readable map for readers
	who want to inspect the project files after the first pass. It groups the main
	briefs, task outputs, baseline results, visual assets, data notes, and
	scale-up documents.

	[`docs/data/artifact_index.json`](docs/data/artifact_index.json) is the compact
	machine-readable companion used by the website and Hugging Face artifact
	dataset.

	## Evaluation Protocol

	[`EVALUATION_PROTOCOL.md`](EVALUATION_PROTOCOL.md) and
	[`docs/data/evaluation_protocol.json`](docs/data/evaluation_protocol.json) are
	generated from committed metric artifacts. They define:

	- the 20-frame window unit, stride, feature dimension, and raw-data policy,
	- the chronological 70/30 single-episode split and its generalization limit,
	- the per-task input, target, primary metric, minimal score, and neural score,
	- leakage controls for future labels, target-side signals, caption/object
	labels, and train-only normalization,
	- current limitations, including cross-episode generalization,
	audio-visual learning, pixel-depth reconstruction, and real held-out
	multi-episode Qwen3-Omni quality.
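
As a rough illustration of two items in this list, the chronological 70/30 split and train-only normalization, the sketch below splits time-ordered windows without shuffling and fits normalization statistics on the training part only. It is a minimal standard-library sketch under stated assumptions, not the code in `scripts/episode_task_suite.py`. The function names and the toy feature rows are hypothetical.

```python
from statistics import mean, pstdev


def chronological_split(windows, train_percent=70):
    """Split time-ordered windows without shuffling: early part trains, late part tests."""
    cut = len(windows) * train_percent // 100
    return windows[:cut], windows[cut:]


def fit_normalizer(train_rows):
    """Compute per-feature mean and std from training rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # Guard against constant features so the division below is always defined.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_normalizer(rows, means, stds):
    """Apply train-derived statistics to any split."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy two-feature windows in time order (illustrative values only).
windows = [[float(i), float(i % 3)] for i in range(10)]
train, test = chronological_split(windows)
means, stds = fit_normalizer(train)
train_norm = apply_normalizer(train, means, stds)
test_norm = apply_normalizer(test, means, stds)
print(len(train), len(test), means[0])
```

Fitting the statistics on `train` and reusing them for `test` keeps test-set information out of preprocessing, which is the point of the train-only normalization control.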

	## Dataset Context

	The official [`ropedia-ai/xperience-10m`](https://huggingface.co/datasets/ropedia-ai/xperience-10m)
	dataset is a gated large-scale egocentric multimodal dataset for embodied AI,
	robotics, spatial intelligence, and world modeling. The public
	[`ropedia-ai/xperience-10m-sample`](https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample)
	repo provides the sample episode used for the implemented task suite here.

	This project keeps those layers separate: the public sample supports the
	current 12-task study, while the gated full dataset is used only for the
	selected multi-episode Qwen3-Omni pilot. Raw Xperience-10M MP4/HDF5/RRD files
	are not redistributed in this repo or in the Hugging Face mirrors.
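
A redistribution boundary like this can be checked mechanically before publishing a bundle. The sketch below scans a directory for the raw file types named above. It is a hypothetical helper for illustration, not the repo's `validate_publication_package.py`, and it assumes raw files can be recognized by extension alone.

```python
import tempfile
from pathlib import Path

# Extensions of the raw file types named in this section.
RAW_SUFFIXES = {".mp4", ".hdf5", ".rrd"}


def find_raw_files(bundle_dir):
    """Return relative paths of files whose extension marks them as raw data."""
    root = Path(bundle_dir)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in RAW_SUFFIXES
    )


# Demo on a throwaway bundle: one site file and one stray annotation file.
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    (bundle / "docs").mkdir()
    (bundle / "docs" / "index.html").write_text("<html></html>")
    (bundle / "annotation.hdf5").write_bytes(b"")
    offenders = find_raw_files(bundle)
print(offenders)
```

An empty result means the bundle contains none of the listed raw file types. It says nothing about derived files that might still embed raw content.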

	The current verified public-sample subset is:

	- one public sample episode, 5,821 frames, and 1,161 aligned windows,
	- raw sample files: six MP4 video streams with embedded audio,
	- `annotation.hdf5` carrying depth, SLAM/camera pose, hand/body mocap, IMU,
	language/caption annotations, calibration, metadata, and timing records,
	- an 8,546-dimensional baseline representation using video, audio, depth,
	pose/SLAM, mocap, IMU, calibration, and language-derived signals.
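
The frame and window counts above are consistent with simple sliding-window arithmetic. The sketch below assumes the 20-frame window stated in the evaluation protocol plus a stride of 5 frames with no padding. The stride value is an assumption rather than a figure quoted in this README, although under these assumptions it is the only integer stride from 1 to 20 that yields 1,161 windows from 5,821 frames.

```python
def count_windows(num_frames, window=20, stride=5):
    """Number of full windows that fit when sliding without padding."""
    if num_frames < window:
        return 0
    return (num_frames - window) // stride + 1


def window_bounds(index, window=20, stride=5):
    """Half-open frame range [start, end) covered by the window at `index`."""
    start = index * stride
    return start, start + window


n = count_windows(5821)
print(n)                     # 1161
print(window_bounds(n - 1))  # (5800, 5820)
```

With these settings the last window ends at frame index 5819, so the final frame of the episode falls outside every full window.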

	Detailed dataset notes are available in
	[`XPERIENCE10M_DATASET_CARD_ALIGNMENT.md`](XPERIENCE10M_DATASET_CARD_ALIGNMENT.md)
	for readers who need the full upstream-card and access-term context. The
	practical boundary is simple: current task-suite results come from the public
	sample, and the first multi-episode Qwen3-Omni diagnostic pilot is verified as a
	working pipeline but does not yet show strong model quality.

	Start with the visual dashboard:

	[chaoyue0307.github.io/ropedia-xperience-10m-task-suite](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/)

	Hugging Face Space app:

	[cy0307-ropedia-xperience-10m-task-suite.static.hf.space](https://cy0307-ropedia-xperience-10m-task-suite.static.hf.space/)

	## Read This Project In Layers

	\| Layer \| What to inspect \| Why it matters \|
	\| --- \| --- \| --- \|
	\| Project status \| `PROJECT_STATUS.md`, `docs/data/project_status.json` \| Gives a one-table current project summary before reading the full artifact trail \|
	\| Data contract \| `windows.csv`, `feature_manifest.json`, modality manifests \| Confirms what each sample window contains before modeling \|
	\| Dataset context \| `XPERIENCE10M_DATASET_CARD_ALIGNMENT.md`, official dataset links \| Explains the official dataset, public sample, modalities, access boundary, and what this repo uses \|
	\| Visual assets \| `FIGURE_INDEX.md`, `docs/assets/` \| Shows the task-suite graphic, modality thumbnails, pipeline diagrams, charts, and logo assets \|
	\| Evaluation protocol \| `EVALUATION_PROTOCOL.md`, `docs/data/evaluation_protocol.json` \| Defines the task unit, split, metrics, leakage controls, and current limitations \|
	\| Research roadmap \| `RESEARCH_ROADMAP.md`, `docs/data/research_roadmap.json` \| Shows the path from sample-level task development to multi-episode work, larger model branches, and the future native-pretraining goal \|
	\| Additional development directions \| `ADDITIONAL_DEVELOPMENT_DIRECTIONS.md`, `docs/data/additional_development_directions.json` \| Records concrete non-backbone tracks: taxonomy, benchmark protocol, representation learning, skill graphs, affordances, 3D/4D memory, QA, and policy transfer \|
	\| Xperience Embodied Foundation Model plan \| `XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md` \| Describes the long-term full-corpus pretraining goal, target modules, objectives, staged scale-up, hardware ranges, and evaluation protocol \|
	\| Minimal heads \| softmax, ridge projection/regression, multi-label logistic heads \| Keeps every input/output contract visible and inspectable \|
	\| Neural heads \| PyTorch MLP classifiers/regressors under `neural_mlp/` \| Checks whether nonlinear heads improve each task without changing features \|
	\| Evidence \| metrics, predictions, confusion matrices, diagrams, dashboard \| Makes the single-episode task development inspectable without rerunning first \|
	\| Artifact guide \| `ARTIFACT_GUIDE.md` \| Groups the public evidence into research-project layers after the first-pass overview \|
	\| Reproducibility contract \| `REPRODUCIBILITY.md`, `docs/data/reproducibility_matrix.json` \| States public commands, expected outputs, exact-match reproduction evidence, and non-reproducible boundaries \|
	\| Citation metadata \| `CITATION.cff`, `codemeta.json`, `LICENSE` \| Makes the repo easier to cite, index, and reuse without confusing code license and dataset terms \|

	## Links

	\| Resource \| Link \|
	\| --- \| --- \|
	\| This GitHub repo \| [github.com/ChaoYue0307/ropedia-xperience-10m-task-suite](https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite) \|
	\| This project website \| [chaoyue0307.github.io/ropedia-xperience-10m-task-suite](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/) \|
	\| This Hugging Face Space \| [huggingface.co/spaces/cy0307/ropedia-xperience-10m-task-suite](https://huggingface.co/spaces/cy0307/ropedia-xperience-10m-task-suite) \|
	\| Live Hugging Face static app \| [cy0307-ropedia-xperience-10m-task-suite.static.hf.space](https://cy0307-ropedia-xperience-10m-task-suite.static.hf.space/) \|
	\| GitHub Container package \| [ghcr.io/chaoyue0307/ropedia-xperience-10m-task-suite](https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite/pkgs/container/ropedia-xperience-10m-task-suite) \|
	\| Derived artifacts on Hugging Face \| [huggingface.co/datasets/cy0307/ropedia-xperience-10m-task-suite-artifacts](https://huggingface.co/datasets/cy0307/ropedia-xperience-10m-task-suite-artifacts) \|
	\| Minimal and neural task baselines on Hugging Face \| [huggingface.co/cy0307/ropedia-xperience-10m-task-baselines](https://huggingface.co/cy0307/ropedia-xperience-10m-task-baselines) \|
	\| Hugging Face collection \| [huggingface.co/collections/cy0307/ropedia-xperience-10m-task-suite](https://huggingface.co/collections/cy0307/ropedia-xperience-10m-task-suite) \|
	\| Xperience-10M dataset website \| [ropedia.com/dataset](https://ropedia.com/dataset) \|
	\| Xperience-10M release page \| [ropedia.com/blog/20260316_xperience_10m](https://ropedia.com/blog/20260316_xperience_10m) \|
	\| Ropedia GitHub organization \| [github.com/Ropedia](https://github.com/Ropedia) \|
	\| HOMIE Toolkit \| [github.com/Ropedia/HOMIE-toolkit](https://github.com/Ropedia/HOMIE-toolkit) \|
	\| Xperience-10M Hugging Face dataset \| [huggingface.co/datasets/ropedia-ai/xperience-10m](https://huggingface.co/datasets/ropedia-ai/xperience-10m) \|
	\| Xperience-10M sample on Hugging Face \| [huggingface.co/datasets/ropedia-ai/xperience-10m-sample](https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample) \|
	\| Ropedia Hugging Face organization \| [huggingface.co/ropedia-ai](https://huggingface.co/ropedia-ai) \|

	## Citation, License, And Metadata

	Use [`CITATION.cff`](CITATION.cff) when citing this project. The repository
	also includes [`codemeta.json`](codemeta.json) for machine-readable software
	metadata and [`docs/data/project_manifest.json`](docs/data/project_manifest.json)
	for website/Hugging Face surface metadata.

	The code files are MIT-licensed. Raw Xperience-10M data is not redistributed
	here, and dataset use remains governed by the official Ropedia/Xperience-10M
	terms. See [`LICENSE`](LICENSE) and [`DATA_NOTICE.md`](DATA_NOTICE.md).

	![Ropedia Xperience-10M 12-task infographic](docs/assets/task_suite_infographic.png?v=xperience10m-taskfirst-v13-modality-xl)

	The infographic uses a custom text-free research background and puts the shared
	processing contract plus all 12 task families before the modality atlas.
	Public-sample modality thumbnails remain enlarged below the task map. The task
	names, input/output summaries, and metrics are overlaid from
	[`results/episode_task_suite/summary_report.json`](results/episode_task_suite/summary_report.json)
	with [`scripts/render_task_suite_infographic.py`](scripts/render_task_suite_infographic.py),
	so the published PNG is a presentation graphic with verified labels and metrics,
	not a hallucinated metric sheet.

	The website also includes a responsive native modality atlas backed by
	[`docs/data/modality_atlas.json`](docs/data/modality_atlas.json) and
	[`docs/assets/modalities/`](docs/assets/modalities/). Those assets are small
	derived thumbnails from the public sample, not raw Xperience-10M files.

	![Verified Pipeline](docs/assets/pipeline_diagram.png?v=xperience10m-nn)

	![Qwen3-Omni LoRA training pipeline](docs/assets/qwen3_omni_lora_pipeline.png?v=qwen3-lora-v1)

	![Minimal and neural 12-task model architectures](docs/assets/task_architectures.png?v=xperience10m-nn)

	The pipeline and architecture figures use the same pattern: text-free visual
	backgrounds carry the composition, while
	[`scripts/render_overview_figures.py`](scripts/render_overview_figures.py)
	overlays exact labels, dimensions, and metrics from the committed result files.

	## Scope

	This is a learning, inspection, and pipeline-validation repo built from one
	public sample episode. The next model-quality stage is to run the same suite
	over many episodes and split train/test by held-out episode.
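
Splitting by held-out episode means every window from a given episode lands in exactly one partition. The sketch below shows one way to do this with a seeded shuffle, using the 96/16/16 sizes mentioned elsewhere in this README as defaults. The function names, the `episode_id` field, and the episode ids are hypothetical, and the pilot's actual episode-selection procedure is not described here.

```python
import random


def split_episodes(episode_ids, n_train=96, n_val=16, n_test=16, seed=0):
    """Assign whole episodes to train/val/test with a reproducible shuffle."""
    ids = sorted(set(episode_ids))
    if len(ids) < n_train + n_val + n_test:
        raise ValueError("not enough episodes for the requested split")
    random.Random(seed).shuffle(ids)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:n_train + n_val + n_test],
    }


def assign_windows(windows, split):
    """Route each window to the partition that owns its episode."""
    owner = {ep: name for name, eps in split.items() for ep in eps}
    routed = {"train": [], "val": [], "test": []}
    for window in windows:
        name = owner.get(window["episode_id"])
        if name is not None:  # windows from unselected episodes are dropped
            routed[name].append(window)
    return routed


# Hypothetical episode ids and two toy windows per episode.
episode_ids = [f"ep_{i:03d}" for i in range(128)]
split = split_episodes(episode_ids)
windows = [{"episode_id": ep, "index": k} for ep in episode_ids for k in range(2)]
routed = assign_windows(windows, split)
print({name: len(items) for name, items in routed.items()})
```

Because the assignment happens at the episode level, no episode contributes windows to both training and evaluation, which a window-level random split would not guarantee.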

	## What Is Inside

	```text
	scripts/
	train_min_action_model.py # motion/IMU baseline
	train_all_modalities_model.py # current all-feature lightweight baseline
	episode_task_suite.py # 12 end-to-end task definitions
	neural_task_models.py # optional PyTorch MLP heads for all 12 tasks
	research_direction_taxonomy.py # maps 12 tasks to the four research tracks
	research_direction_extension_tasks.py # one extra data-backed probe per track
	task_walkthroughs.py # human-readable task-card and walkthrough-storyboard metadata
	generate_visualizations.py # refreshes SVG charts + summary JSON
	render_task_suite_infographic.py # renders the task-suite presentation PNG
	export_modality_atlas_assets.py # exports responsive modality-card assets
	render_overview_figures.py # renders polished pipeline/architecture PNGs
	build_brand_assets.py # derives logo sizes, favicon, social card
	build_artifact_index.py # builds the compact artifact guide data
	build_quality_gates.py # builds release checks
	validate_mirror_parity.py # checks prepared GitHub/HF mirror file parity
	validate_scope_claims.py # separates setup artifacts from completed model metrics
	validate_task_surface.py # checks readable task cards and interactive storyboard wiring
	validate_website_integrity.py # checks local site links, anchors, and images
	validate_publication_package.py # checks public repo + HF bundle contents
	publish_hf_bundles.py # uploads prepared HF Space/artifact/model bundles
	omni/
	download_sample_modelscope.py # ModelScope sample download helper
	build_episode_manifest.py # metadata-only multi-episode scanner
	plan_finetune_sample_budget.py # storage/sample-count planner
	qwen3_omni_adapter_smoke.py # real-data Qwen3-Omni adapter setup check

	results/
	min_action_model/ # motion-only action baseline artifacts
	min_subtask_model/ # motion-only subtask baseline artifacts
	min_all_modalities_action_model/ # current all-feature action artifacts
	min_all_modalities_subtask_model/ # current all-feature subtask artifacts
	episode_task_suite/ # 12-task suite metrics and predictions
	neural_mlp/ # optional neural baseline artifacts per task
	research_directions/ # four-track taxonomy, CSV, and summary
	research_direction_extensions/ # four extra direction probes + predictions
	task_walkthroughs/ # case-study walkthroughs for all 12 tasks
	omni_exploration/ # ModelScope readiness-check artifacts

	docs/
	index.html # GitHub Pages dashboard
	data/additional_development_directions.json # concrete non-backbone project directions
	data/summary_metrics.json # website-readable metrics bundle
	data/evidence_contract.json # machine-readable project scope
	data/artifact_index.json # compact project-artifact catalog
	data/live_publication_status.json # live GitHub/HF publication verification
	data/quality_gates.json # machine-readable release checks
	data/task_surface_integrity.json # machine-readable task-card/storyboard integrity check
	data/project_manifest.json # machine-readable public-surface metadata
	data/project_packet.json # compact project path and scope summary
	data/research_roadmap.json # multi-episode and omni-model roadmap
	data/research_directions.json # four-track website data bundle
	data/research_direction_extensions.json # four extra probe data bundle
	data/task_walkthroughs.json # human-readable task-card and walkthrough-storyboard data
	data/modality_atlas.json # responsive modality-card data
	assets/brand/*.png # project logo, favicon, social card
	assets/task_suite_infographic.png # 12-task presentation graphic
	assets/modalities/ # public-sample derived modality thumbnails
	assets/pipeline_diagram.png # verified episode pipeline graphic
	assets/qwen3_omni_lora_pipeline.png # Qwen3-Omni LoRA training-flow figure
	assets/task_architectures.png # verified 12-task minimal architecture map
	assets/charts/*.svg # regenerated visualizations

	notes/
	min_action_model.md
	all_modalities_model.md
	episode_task_suite.md
	```

	Raw Xperience-10M data is not committed. Download it from the official
	Ropedia distribution and follow the dataset terms.

	## GitHub Package

	The public dashboard is packaged as a static-site container on GitHub Container
	Registry. It contains the `docs/` site plus the main reader documents; it does
	not include raw Xperience-10M videos, raw annotations, gated data, or model
	weights.

	```bash
	docker pull ghcr.io/chaoyue0307/ropedia-xperience-10m-task-suite:latest
	docker run --rm -p 8080:80 ghcr.io/chaoyue0307/ropedia-xperience-10m-task-suite:latest
	```

	Then open `http://localhost:8080`.

	## Data Expected

	The scripts expect a workspace with the Ropedia HOMIE toolkit and the
	Xperience-10M sample episode:

	```text
	<workspace>/
	  HOMIE-toolkit/
	  data/sample/xperience-10m-sample/
	    annotation.hdf5
	    fisheye_cam0.mp4
	    fisheye_cam1.mp4
	    fisheye_cam2.mp4
	    fisheye_cam3.mp4
	    stereo_left.mp4
	    stereo_right.mp4
	```
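
Before running the scripts, it can help to confirm that a workspace matches this layout. The sketch below is a hypothetical pre-flight check, not part of the repo. It assumes all seven listed files are wanted, which may be stricter than a given script needs.

```python
import tempfile
from pathlib import Path

SAMPLE_SUBDIR = Path("data") / "sample" / "xperience-10m-sample"
EXPECTED_FILES = [
    "annotation.hdf5",
    "fisheye_cam0.mp4",
    "fisheye_cam1.mp4",
    "fisheye_cam2.mp4",
    "fisheye_cam3.mp4",
    "stereo_left.mp4",
    "stereo_right.mp4",
]


def missing_workspace_items(workspace):
    """List expected workspace entries that are absent (empty list means ready)."""
    root = Path(workspace)
    missing = []
    if not (root / "HOMIE-toolkit").is_dir():
        missing.append("HOMIE-toolkit/")
    for name in EXPECTED_FILES:
        if not (root / SAMPLE_SUBDIR / name).is_file():
            missing.append((SAMPLE_SUBDIR / name).as_posix())
    return missing


# Demo on a throwaway workspace that only contains the annotation file.
with tempfile.TemporaryDirectory() as tmp:
    sample_dir = Path(tmp) / SAMPLE_SUBDIR
    sample_dir.mkdir(parents=True)
    (sample_dir / "annotation.hdf5").write_bytes(b"")
    report = missing_workspace_items(tmp)
print(report)
```

In the demo, the report lists the missing toolkit directory and the six missing video files. A real workspace prepared as described should produce an empty list.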

	The public sample dataset identifier is:

	```text
	ropedia-ai/xperience-10m-sample
	```

	Hugging Face URL:

	```text
	https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample
	```

	## Quickstart

	From a workspace folder:

	```bash
	git clone https://github.com/Ropedia/HOMIE-toolkit.git
	python3.12 -m venv .venv
	source .venv/bin/activate
	pip install -r HOMIE-toolkit/requirements.txt huggingface_hub hf_xet
	```

	Download the sample:

	```bash
	hf download ropedia-ai/xperience-10m-sample \
	--repo-type dataset \
	--local-dir data/sample/xperience-10m-sample
	```

	If Hugging Face access is unavailable in your environment, use ModelScope:

	```bash
	python scripts/omni/download_sample_modelscope.py \
	--output-dir data/sample/xperience-10m-sample \
	--mode minimal
	```

	`--mode minimal` downloads `annotation.hdf5`, `README.md`, and
	`fisheye_cam0.mp4`. Use `--mode all-training` to add all six MP4 streams while
	still skipping `visualization.rrd`.

	Clone and run this repo:

	```bash
	git clone https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite.git
	cd ropedia-xperience-10m-task-suite
	python scripts/episode_task_suite.py --workspace /path/to/workspace
	```

	Run the same 12-task suite with lightweight neural heads:

	```bash
	pip install torch
	python scripts/episode_task_suite.py \
	--workspace /path/to/workspace \
	--include-neural
	```

	Run the smaller baselines:

	```bash
	python scripts/train_min_action_model.py --workspace /path/to/workspace
	python scripts/train_all_modalities_model.py --workspace /path/to/workspace
	```

	## Xperience-10M Fine-Tuning Exploration

	This repo includes a first Qwen3-Omni fine-tuning path over Xperience-10M and
	keeps public-sample evidence separate from multi-episode fine-tuning
	artifacts. The validation-aware, selected-episode held-out package is verified
	as a diagnostic pilot, not as a strong final model.
	Inputs fall into two groups:

	- direct Qwen3-Omni inputs: RGB/fisheye video, embedded MP4 audio, and language
	prompts,
	- adapter-required Xperience-10M sensor inputs: depth, pose/SLAM, hand/body
	mocap, contacts, and IMU.

	![Xperience-10M to Qwen3-Omni LoRA training flow](docs/assets/qwen3_omni_lora_pipeline.png?v=qwen3-lora-v1)

	The figure shows the intended end-to-end training flow: raw valid episodes enter
	episode-level split validation, parallel media/sensor export creates Qwen-style
	JSONL records, Qwen3-Omni receives video/audio/text directly, the sensor bridge
	adds depth/pose/mocap/IMU features, LoRA adapters are trained on prepared
	train/val episodes, and sealed held-out test evaluation produces predictions,
	metrics, run reports, and upload-ready adapter artifacts.

	The scale-up path requires valid prepared episodes, held-out episode splits,
	training metadata, predictions, metrics, and a run report. A result is ready
	for public README, website, or Hugging Face updates only after the validator
	passes and `scripts/omni/package_verified_omni_result.py` creates a
	public-safe derived-artifact package. The current verified package is listed in
	[`docs/data/omni_finetune_verified_result.json`](docs/data/omni_finetune_verified_result.json).

	### Sample Count Decision

	Do not treat "10M" as a reason to start with the entire dataset. The engineering
	unit that matters first is diverse held-out episodes, not adjacent windows from
	one session.

	\| Phase \| Episodes/samples \| Approx windows at stride 5 \| Purpose \|
	\| --- \| ---: \| ---: \| --- \|
	\| Readiness \| 1-3 \| 1k-3k \| Verify loaders, token alignment, and task heads \|
	\| Pilot \| 16-32 \| 18k-37k \| First held-out-episode evaluation \|
	\| Useful LoRA run \| 64-128 \| 74k-149k \| Train sensor adapters plus selected Qwen3-Omni LoRA \|
	\| Storage-heavy run \| 256+ \| 297k+ \| Only after download layout and checkpoint size are stable \|
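The window estimates in the table line up with about 1,161 windows per episode, the number of window-level label sets this README reports for the public sample episode. A minimal sketch of that arithmetic, assuming every episode is roughly as long as the sample (the constant and function name are illustrative, not part of the repo):

```python
# Assumption: every episode yields about as many stride-5 windows as the
# public sample episode (1,161 window-level label sets are reported for it).
WINDOWS_PER_EPISODE = 1161


def approx_windows(episodes: int, per_episode: int = WINDOWS_PER_EPISODE) -> int:
    """Rough window count for a given number of episodes."""
    return episodes * per_episode


for label, low, high in [
    ("Readiness", 1, 3),
    ("Pilot", 16, 32),
    ("Useful LoRA run", 64, 128),
]:
    print(f"{label}: {approx_windows(low):,} to {approx_windows(high):,} windows")
print(f"Storage-heavy run: {approx_windows(256):,}+ windows")
```

The exported window counts listed under the Multi-Episode Readiness Gate are far smaller than these estimates, so read the figures as a planning ceiling, not a prediction of export size.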

	Use the budget helper before downloading:

	```bash
	python scripts/omni/plan_finetune_sample_budget.py \
	--storage-root /path/to/storage \
	--target-free-after-download-gb 800 \
	--all-training-per-episode-gb 2.4 \
	--full-preview-per-episode-gb 5.1
	```
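As a rough cross-check of the helper's inputs, the episode budget is the free space minus the reserve, divided by the per-episode size. The sketch below is not the helper's implementation: `max_episodes` and the 1,200 GB free-space figure are hypothetical, while the 800, 2.4, and 5.1 values mirror the flags above.

```python
import math


def max_episodes(free_gb: float, target_free_gb: float, per_episode_gb: float) -> int:
    """Largest episode count that still leaves `target_free_gb` free."""
    usable = free_gb - target_free_gb
    if usable <= 0 or per_episode_gb <= 0:
        return 0
    return math.floor(usable / per_episode_gb)


free_gb = 1200.0  # hypothetical free space on the storage root
print("all-training layout:", max_episodes(free_gb, 800, 2.4))  # 166
print("full-preview layout:", max_episodes(free_gb, 800, 5.1))  # 78
```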

	### Multi-Episode Readiness Gate

	```bash
	python scripts/omni/discover_xperience10m_sources.py \
	--workspace /path/to/ropedia-xperience-10m-task-suite \
	--data-root /path/to/xperience10m_data \
	--output results/omni_finetune/source_discovery.json
	```

	Current status in this repo:

	- public_sample_valid_episodes: 1 (degraded-valid: annotation + fisheye_cam0.mp4)
	- gated_metadata_audit: 12,102 complete visible episodes across 802 complete sessions
	- selected_episode_plan: 128 source-balanced episodes, 96/16/16 train/val/test
	- selected_download_size: 277.71 GiB excluding `visualization.rrd`
	- verified_validation_aware_diagnostic_package: true
	- selected_split: 96 train / 16 validation / 16 held-out test episodes
	- exported_windows: 2,848 train / 512 validation / 448 test
	- validation_samples_used: 512
	- held_out_eval: 448 test windows from 14 exported test episodes
	- train_loss / val_loss: 0.4130 / 0.0331
	- current_quality_target: JSON validity 87.50%, below the 98% target
	- gated_dataset: available for selected multi-episode data preparation
	- source_discovery: `results/omni_finetune/source_discovery.json`
	- data_status: `results/omni_finetune/DATA_ACCESS_STATUS.md`
	- access_status: `results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md`

	Use this gate before scheduling any full fine-tune run. The pilot should use
	balanced held-out selection, not the first paths in repository order. The
	current 128-episode selection filters for complete leaf episodes, excludes
	`visualization.rrd`, balances episode-size bands, and preserves one selected
	episode per top-level session UUID.
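The reason for keeping one episode per session is that a whole session then lands in exactly one split. A minimal sketch of a deterministic session-level 96/16/16 assignment follows; it is an illustration under stated assumptions, not the repo's selection code. The salted-hash ordering, the `seed` string, and the session IDs are hypothetical, and the size-band balancing described above is left out.

```python
import hashlib


def split_by_session(session_ids, n_val=16, n_test=16, seed="pilot-v1"):
    """Assign whole sessions to train/val/test with a stable hash ordering."""
    unique = sorted(set(session_ids))
    # Sort by a salted hash so the order is deterministic but not path order.
    ordered = sorted(
        unique,
        key=lambda s: hashlib.sha256(f"{seed}:{s}".encode()).hexdigest(),
    )
    test = ordered[:n_test]
    val = ordered[n_test:n_test + n_val]
    train = ordered[n_test + n_val:]
    return {"train": train, "val": val, "test": test}


sessions = [f"session-{i:03d}" for i in range(128)]  # hypothetical IDs
split = split_by_session(sessions)
print({name: len(ids) for name, ids in split.items()})
```

Because the assignment depends only on the session ID and the seed, rerunning it after more episodes arrive does not move a session between splits as long as the set of sessions is unchanged.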

	### Progressive Train/Validation Pilot

	The selected 128-episode plan can be used before every episode has arrived by
	training only on prepared `train` episodes and monitoring prepared `val` episodes.
	The final `test` episodes stay sealed until the end, so early development does
	not contaminate held-out evaluation.

	```bash
	python scripts/omni/build_selection_episode_manifest.py \
	--workspace /path/to/ropedia-xperience-10m-task-suite \
	--data-root /path/to/xperience10m_128 \
	--selection-json results/omni_finetune/xperience10m_128_episode_selection.json \
	--output results/omni_finetune/trainval_progressive/episode_manifest_trainval.json \
	--include-split train \
	--include-split val
	```

	`scripts/omni/run_trainval_progressive_128.sh` wraps the same split guard, exports a
	train/val-only Qwen3-Omni JSONL dataset, and launches LoRA training without
	running final test evaluation. The exporter uses session-qualified episode IDs
	and path-based split matching so repeated folder names such as `ep1` cannot
	collide across different sessions.
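To see why session-qualified IDs avoid collisions, consider two sessions that each contain a folder named `ep1`. The sketch below builds an ID from every path component under the data root. The `__` separator, the function name, and the example paths are assumptions for illustration and may differ from the exporter's actual format.

```python
from pathlib import PurePosixPath


def session_qualified_id(episode_path: str, data_root: str) -> str:
    """Join every path component below the data root into one episode ID."""
    relative = PurePosixPath(episode_path).relative_to(PurePosixPath(data_root))
    return "__".join(relative.parts)


root = "/data/xperience10m_128"  # hypothetical data root
print(session_qualified_id(f"{root}/session-a/ep1", root))  # session-a__ep1
print(session_qualified_id(f"{root}/session-b/ep1", root))  # session-b__ep1
```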

	For larger prepared subsets, `scripts/omni/run_trainval_parallel_export_8gpu.sh`
	uses the same split guard, exports episodes in parallel CPU shards, skips and
	reports episodes that contain no labeled windows under the configured label
	rule, then launches Qwen3-Omni LoRA with `NUM_PROCESSES=8`.

	### Full 128-Episode Held-Out Pilot

	Once all selected episodes are complete, use the fixed selected-episode split:

	- 96 train episodes,
	- 16 validation episodes,
	- 16 held-out test episodes.

	The clean full-run launcher validates the selected split, exports all splits in
	parallel, trains Qwen3-Omni LoRA on train episodes while optionally monitoring
	validation loss, then evaluates on the held-out test split:

	```bash
	RUN_ID=xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \
	DATA_ROOT=/path/to/xperience10m_128 \
	SELECTION_JSON=results/omni_finetune/xperience10m_128_episode_selection.json \
	MODEL_DIR=/path/to/Qwen__Qwen3-Omni-30B-A3B-Instruct \
	NUM_PROCESSES=8 \
	TRAIN_VAL_SPLIT=val \
	MAX_VAL_SAMPLES=512 \
	scripts/omni/run_128_fullsplit_parallel_export_8gpu.sh
	```

	The current verified diagnostic package uses the same selected split and 8-GPU
	training path, records validation loss over 512 validation windows, and keeps
	the held-out test split sealed for final evaluation. The next pass should keep
	this package contract while tightening JSON decoding, target formatting, and
	action/subtask error analysis.

	Monitor the run with:

	```bash
	python scripts/omni/monitor_omni_progress.py \
	--run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu
	```

	The monitor reads training `progress.jsonl`, new evaluator partial-prediction
	progress, and legacy generation logs, so long held-out evals can still expose
	sample-level progress even before final metrics are written.
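For a quick manual look at a `progress.jsonl` file without the monitor, the last complete record can be read as below. This is a generic JSONL sketch, not the monitor's code; the record fields are not documented here, so the function returns the parsed object as-is.

```python
import json
from pathlib import Path


def last_jsonl_record(path):
    """Return the last line that parses as JSON, or None if there is none."""
    last = None
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                last = json.loads(line)
            except json.JSONDecodeError:
                continue  # a partially written final line is skipped
    return last
```

Skipping unparseable lines matters while a job is still writing, because the final line may be incomplete at read time.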

	Validate the run artifacts stage by stage:

	```bash
	python scripts/omni/validate_omni_finetune_run.py \
	--run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \
	--require-stage manifest

	python scripts/omni/validate_omni_finetune_run.py \
	--run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \
	--require-stage eval \
	--min-json-validity 0.98
	```
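The `--min-json-validity 0.98` gate can be read as a minimum fraction of held-out predictions that parse as JSON. The sketch below assumes that definition (the validator may apply a stricter schema check), and the function name and toy outputs are hypothetical.

```python
import json


def json_validity(predictions: list[str]) -> float:
    """Fraction of raw model outputs that parse as a JSON object."""
    if not predictions:
        return 0.0
    valid = 0
    for text in predictions:
        try:
            parsed = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            continue
        if isinstance(parsed, dict):
            valid += 1
    return valid / len(predictions)


# Toy outputs: seven well-formed objects and one truncated generation.
outputs = ['{"action": "pour"}'] * 7 + ['{"action": "pour"']
rate = json_validity(outputs)
print(f"{rate:.2%}", "pass" if rate >= 0.98 else "below the 0.98 gate")
```

Under this definition, a run with one truncated output in eight sits at 87.50% and fails the gate, which is the kind of failure that tighter decoding and target formatting are meant to remove.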

	After the eval validator passes, create the public-safe result package:

	```bash
	python scripts/omni/package_verified_omni_result.py \
	--dataset-run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \
	--train-run-id <train_run_id> \
	--eval-run-id <eval_run_id>
	```

	For long-running remote jobs, the packaging step can be watched automatically:

	```bash
	python scripts/omni/watch_verified_omni_package.py \
	--dataset-run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \
	--train-run-id <train_run_id> \
	--eval-run-id <eval_run_id>
	```

	While waiting, the watcher can append `eval_progress_observed` events from
	partial prediction files or legacy generation logs. This keeps the package
	status file useful during long held-out evaluations.

	The package copies only small derived artifacts such as metrics, predictions,
	confusion matrices, run reports, manifests, validation summaries, and training
	metadata. The exact required eval files and primary metrics come from the
	selected backbone contract in `configs/omni_backbones`, so Qwen3-Omni,
	Cosmos-style world models, and VLA/policy branches can share the same verified
	publication gate once their model-specific evaluators exist. The package
	excludes raw Xperience-10M files, base-model weights, adapter or checkpoint
	weights, full checkpoints, and large archives.

	For hardware setups that can run multiple eval workers, the Qwen evaluator also
	supports deterministic sample shards:

	```bash
	python scripts/omni/eval_qwen3_omni_lora.py \
	--dataset-jsonl results/omni_finetune/xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu_dataset/dataset.jsonl \
	--adapter-dir checkpoints/<train_run_id>/adapter_lora \
	--run-id <eval_shard_0> \
	--eval-split test \
	--sample-offset 0 \
	--sample-stride 4

	python scripts/omni/merge_qwen3_omni_eval_shards.py \
	--dataset-jsonl results/omni_finetune/xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu_dataset/dataset.jsonl \
	--output-dir results/omni_finetune/<merged_eval_run_id> \
	--shard-dir results/omni_finetune/<eval_shard_0> \
	--shard-dir results/omni_finetune/<eval_shard_1> \
	--shard-dir results/omni_finetune/<eval_shard_2> \
	--shard-dir results/omni_finetune/<eval_shard_3>
	```

	Only the merged eval directory should be validated and reported publicly,
	because the merger checks coverage and recomputes the metrics from all
	held-out predictions.
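The shard flags can be pictured as strided slicing over the ordered test samples. The sketch below assumes `--sample-offset k --sample-stride n` selects positions `k, k+n, k+2n, ...`; that reading and both function names are assumptions, not the evaluator's or merger's code.

```python
def shard_indices(n_samples: int, offset: int, stride: int) -> list[int]:
    """Sample positions handled by one worker: offset, offset+stride, ..."""
    return list(range(offset, n_samples, stride))


def check_coverage(n_samples: int, shards: list[list[int]]) -> None:
    """Raise if any sample is missing or evaluated more than once."""
    seen = sorted(i for shard in shards for i in shard)
    if seen != list(range(n_samples)):
        raise ValueError("shards do not cover every sample exactly once")


shards = [shard_indices(448, offset, 4) for offset in range(4)]
check_coverage(448, shards)
print([len(s) for s in shards])  # [112, 112, 112, 112]
```

A coverage check of this kind is why a single shard should not be reported on its own: each shard sees only every fourth sample.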

	After dataset export, a model-neutral window index can be created for future
	backbones:

	```bash
	python scripts/omni/export_model_neutral_window_index.py \
	--dataset-jsonl results/omni_finetune/xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu_dataset/dataset.jsonl
	```

	This produces `window_index.jsonl` and `window_index_manifest.json` so
	Cosmos-style world models and VLA/policy branches can reuse the same
	split-checked windows without depending on Qwen chat-message records.

	### Uploading Qwen3-Omni LoRA Artifacts

	The public-safe verified package intentionally excludes raw data, base Qwen
	weights, LoRA weights, and full checkpoints. Adapter upload is a separate step:
	use it only when the intended adapter directory is present and the model card
	clearly distinguishes older smoke weights from the selected-episode diagnostic
	or validation-aware run.

	```bash
	python3 scripts/omni/upload_qwen3_omni_lora_to_hf.py \
	--repo-id cy0307/ropedia-qwen3-omni-lora-smoke \
	--source-dir /path/to/adapter_upload_package \
	--message "Upload Xperience-10M Qwen3-Omni LoRA pilot"
	```

	This script requires a valid Hugging Face token, supplied via `HF_TOKEN` or
	`--token`, and network access to `huggingface.co`.

	### Foundation Backbone Plan

	The next modeling plan tracks several foundation-model branches instead of
	assuming one backbone solves every Xperience-10M objective.

	\| Branch \| Current role \| When to use it \|
	\| --- \| --- \| --- \|
	\| Qwen3-Omni \| First trainable multimodal LoRA pilot \| Use for the selected 128-episode held-out baseline over video/audio/language plus sensor-bridge features. \|
	\| Cosmos 3 \| First world-model/action-generation branch \| Use after data preparation for future-window prediction, action-conditioned world modeling, and synthetic-data usefulness tests. \|
	\| GR00T \| Humanoid/action-policy branch \| Use after mocap/contact retargeting creates well-defined humanoid action targets. \|
	\| OpenVLA / openpi \| Open VLA/policy baselines \| Use after the project defines robot-compatible or action-token targets. \|
	\| Gemini Robotics \| External reasoning reference \| Use only for qualitative comparison or annotation support unless local trainable access exists. \|
	\| Xperience Embodied Foundation Model \| Future Xperience-native pretraining goal \| Use only after multi-episode pilots, full-corpus storage, distributed training infrastructure, and scaling evidence justify a from-scratch domain model. \|

	See [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md) and
	[`docs/data/foundation_model_plan.json`](docs/data/foundation_model_plan.json)
	for the full selection matrix, source links, and model-specific evaluation
	additions. See
	[`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md)
	for the long-term full-corpus pretraining plan.

	Backbone-specific contracts now live in [`configs/omni_backbones`](configs/omni_backbones).
	The extension contract is documented in
	[`OMNI_MODEL_EXTENSION_CONTRACT.md`](OMNI_MODEL_EXTENSION_CONTRACT.md), and the
	registry can be checked with:

	```bash
	python scripts/omni/backbone_registry.py --validate --json
	```

	Verify that every configured backbone can pass the public-safe packaging
	contract on synthetic derived artifacts:

	```bash
	python scripts/omni/smoke_test_backbone_packaging.py
	```

	After a real held-out package is created, audit it before updating README,
	website, or Hugging Face pages:

	```bash
	python scripts/omni/audit_verified_omni_package.py \
	--package-dir results/omni_finetune/verified_public/<eval_run_id>
	```

	Create a new planned backbone branch from an existing contract template with:

	```bash
	python scripts/omni/scaffold_omni_backbone.py \
	--template-backbone policy_vla_branch \
	--id new_policy_branch \
	--display-name "New Policy Branch" \
	--model-family "Model family name" \
	--dataset-contract xperience10m_observation_action_v1 \
	--training-objective observation_to_action_policy \
	--checkpoint-gate policy_checkpoint_action_space_and_normalizer \
	--dry-run
	```

	Each backbone config declares the checkpoint gate, required train/eval files,
	allowed public artifacts, and forbidden private or heavyweight artifacts. This
	keeps Qwen3-Omni, Cosmos-style world models, and policy/VLA branches on the same
	split, validation, and publication discipline even though their training targets
	are different.

	## Additional Development Directions

	Beyond backbone selection and fine-tuning, Xperience-10M supports several
	concrete research-development tracks:

	\| Direction \| First useful artifact \| Role in the project \|
	\| --- \| --- \| --- \|
	\| Episode taxonomy and data engine \| Episode atlas, balance report, and split builder \| Select representative data before training. \|
	\| Standardized benchmark protocol \| Versioned train/val/test manifests and metric scripts \| Make future model results comparable. \|
	\| Multimodal representation learning \| Contrastive and masked-window encoder objectives \| Learn reusable video/audio/depth/pose/mocap/IMU/language features. \|
	\| Skill and procedure graph mining \| Step graph, transitions, preconditions, and effects \| Connect perception to planning and long-horizon reasoning. \|
	\| Human-object affordance modeling \| Contact, reachable-object, tool-use, and next-affordance tasks \| Model what actions the scene makes possible. \|
	\| 3D/4D scene and object memory \| Persistent scene/object maps from depth, pose, multiview video, and objects \| Track world state beyond single frames. \|
	\| Data-quality and synchronization diagnostics \| Per-episode QA for drift, missing streams, calibration, and corrupted files \| Keep large multimodal training trustworthy. \|
	\| Policy, retargeting, and simulation transfer \| Action-token conversion and robot-compatible imitation examples \| Bridge human egocentric experience to robot policy work. \|

	See [`ADDITIONAL_DEVELOPMENT_DIRECTIONS.md`](ADDITIONAL_DEVELOPMENT_DIRECTIONS.md)
	and [`docs/data/additional_development_directions.json`](docs/data/additional_development_directions.json).

	## Four Research Directions

	The 12 tasks are now organized against the four Ropedia research directions in
	a generated artifact, not only in prose:

	- [`research_direction_taxonomy.json`](results/episode_task_suite/research_directions/research_direction_taxonomy.json)
	- [`research_direction_task_map.csv`](results/episode_task_suite/research_directions/research_direction_task_map.csv)
	- [`research_direction_summary.md`](results/episode_task_suite/research_directions/research_direction_summary.md)
	- [`docs/data/research_directions.json`](docs/data/research_directions.json)

	The taxonomy uses two current baselines for every task:

	\| Baseline \| Role \|
	\| --- \| --- \|
	\| Minimal interpretable heads \| Softmax, logistic, ridge, and retrieval heads over the 8,546-dimensional multimodal representation. These expose the input/output contract cleanly. \|
	\| Neural MLP heads \| Small PyTorch MLP classifiers/regressors on the same features and splits. These check whether nonlinear heads help before moving to Qwen/Omni fine-tuning. \|

	Current direction-level coverage:

	\| Direction \| Current status \| Covered task evidence \| What is not solved yet \|
	\| --- \| --- \| --- \| --- \|
	\| A. Human Modeling & Motion Understanding \| Partially implemented \| Hand Trajectory Forecasting and Contact State Prediction are direct; Action Recognition and Object Relevance Prediction are proxies. Neural MLP improves hand forecasting from `0.8647` to `0.1079` MPJPE. \| No full body/shape model, SMPL/MANO target, deformation prior, or multi-episode motion-generation evaluation yet. \|
	\| B. 3D/4D Reconstruction & Neural Rendering \| Proxy tasks only \| Cross-Modal Retrieval, Cross-Modal Reconstruction, and Multimodal Synchronization Detection test alignment/reconstruction prerequisites. \| No NeRF, Gaussian Splatting, TSDF, mesh, novel-view synthesis, or calibrated 4D reconstruction model yet. \|
	\| C. Egocentric Vision & Interaction \| Strongest implemented track \| 6 direct tasks: action, subtask, transition, next-action, object relevance, and caption grounding, plus alignment/order diagnostics and audio ablation. \| Single-episode chronological split limits generalization; stronger audio and video-language backbones still need multi-episode testing. \|
	\| D. Scene Reconstruction & World Modeling \| Early proxy tasks \| Procedure Step Recognition, Next-Action Prediction, Object Relevance Prediction, Cross-Modal Retrieval, Cross-Modal Reconstruction, Temporal Order Verification, and Multimodal Synchronization Detection provide state/world-model probes. \| No persistent scene graph, object permanence task, long-term map, or held-out-episode world model yet. \|

	The important interpretation is that all four directions can be started from
	the Xperience-10M sample modalities, but only direction C is strongly represented
	by the current 12-task suite. Directions A, B, and D need additional targets and
	multi-episode training before they become full research deliverables.

	## Four Direction-Extension Probes

	Beyond the original 12 core tasks, the repo now includes one extra data-backed
	probe for each research direction. These probes are computed from the same
	`shared_windows.npz`, `windows.csv`, and `feature_manifest.json` artifacts, so
	every reported number traces back to sample-derived features and saved metric artifacts.

	- [`research_direction_extension_results.json`](results/episode_task_suite/research_direction_extensions/research_direction_extension_results.json)
	- [`research_direction_extension_summary.md`](results/episode_task_suite/research_direction_extensions/research_direction_extension_summary.md)
	- [`docs/data/research_direction_extensions.json`](docs/data/research_direction_extensions.json)
	- [`research_direction_extension_tasks.svg`](docs/assets/charts/research_direction_extension_tasks.svg)

	![Four direction extension probes](docs/assets/charts/research_direction_extension_tasks.svg)

	\| Direction \| New extension task \| Input \| Output \| Minimal \| Neural MLP \| Why it matters \|
	\| --- \| --- \| --- \| --- \| ---: \| ---: \| --- \|
	\| A. Human Modeling & Motion Understanding \| Body and Hand Motion Intensity \| non-mocap video/depth/pose/IMU/SLAM/language features \| high vs low body/hand motion \| `0.7827` macro-F1 \| `0.7986` macro-F1 \| Starts a human-motion-energy target without leaking mocap input. \|
	\| B. 3D/4D Reconstruction & Neural Rendering \| Multi-View Consistency Retrieval \| fisheye camera feature query \| synchronized stereo-left view rank \| `0.5534` MRR \| `0.3469` MRR \| Tests whether multi-view features preserve synchronized 4D scene identity. \|
	\| C. Egocentric Vision & Interaction \| Action Phase Progress Estimation \| non-caption multimodal window \| progress inside current action segment \| `0.3416` MAE \| `0.3038` MAE \| Adds a task-structure/intent-style target beyond class labels. \|
	\| D. Scene Reconstruction & World Modeling \| Short-Horizon Ego-Motion Forecasting \| current sensors excluding camera translation and captions \| future camera-translation delta \| `0.1989` MAE \| `0.0989` MAE \| Starts a short-horizon world-model target over wearer motion. \|

	Run:

	```bash
	python scripts/research_direction_extension_tasks.py
	```

	These four probes make the four-direction mapping more concrete, but they are
	still single-episode extension baselines. Full research conclusions still require
	multi-episode training, held-out episode evaluation, and stronger task-specific
	models.

	## Task Walkthroughs For Juniors

	Every task now has a beginner-facing explanation with:

	- a concrete coffee-episode case study,
	- exact input contract,
	- middle process modules,
	- output contract,
	- minimal and neural metrics,
	- one important limitation.

	Primary files:

	- [`TASK_WALKTHROUGHS.md`](results/episode_task_suite/task_walkthroughs/TASK_WALKTHROUGHS.md)
	- [`task_walkthroughs.json`](results/episode_task_suite/task_walkthroughs/task_walkthroughs.json)
	- [`docs/data/task_walkthroughs.json`](docs/data/task_walkthroughs.json)
	- [`docs/data/task_surface_integrity.json`](docs/data/task_surface_integrity.json)

	Compact map:

	\| Task \| Case study \| Input -> process -> output \|
	\| --- \| --- \| --- \|
	\| Action Recognition \| A pouring window should be named as the current action. \| all-modality window -> action label builder + classifier -> action class \|
	\| Procedure Step Recognition \| A fine action is grouped into a broader drink-preparation stage. \| all-modality window -> subtask label builder + classifier -> subtask label \|
	\| Action Boundary Detection \| Detect the change from preparing to pouring. \| window -> boundary builder + binary classifier -> boundary/steady \|
	\| Next-Action Prediction \| A preparing window predicts what happens 20 frames later. \| current window -> future-label shift + classifier -> next action \|
	\| Hand Trajectory Forecasting \| A hand moving toward a cup becomes a future 3D hand path. \| current window -> future mocap target + regressor -> hand trajectory \|
	\| Contact State Prediction \| Decide whether hand/body contact is happening. \| non-contact features -> contact target + binary classifier -> contact label \|
	\| Object Relevance Prediction \| Infer milk, cup, coffee, or related objects during pouring. \| non-caption features -> multi-hot object target + sigmoid heads -> object set \|
	\| Language Grounding \| Query "Pour milk into coffee" and retrieve the matching moment. \| text-like query + candidates -> projection + cosine ranker -> ranked windows \|
	\| Cross-Modal Retrieval \| Motion/IMU from pouring retrieves matching depth/video. \| motion/IMU/camera -> projection + candidate index -> ranked depth/video windows \|
	\| Cross-Modal Reconstruction \| Infer depth/video features from motion, IMU, and camera pose. \| source modalities -> scaler + regressor -> target modality vector \|
	\| Temporal Order Verification \| Tell whether reaching then pouring was reversed. \| adjacent window pair -> pair combiner + binary classifier -> correct/reversed \|
	\| Multimodal Synchronization Detection \| Catch motion paired with visual/depth features shifted in time. \| motion side + visual side -> aligned/shifted pair builder + classifier -> aligned/shifted \|

	## Minimal 12-Task Architectures

	These are deliberately minimal baselines. They are useful because every
	input/output contract is explicit, not because they are strong embodied-AI
	models.

	Shared setup:

	```text
	raw episode -> 20-frame windows, stride 5 -> 8,546-dimensional multimodal representation
	chronological split: first 70% train, last 30% test
	scalers are fit on train windows only
	```
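The three lines of shared setup can be sketched in plain Python. This is an illustration of the windowing, chronological split, and train-only scaling described above, not the suite's implementation; the 120-frame length and the scalar toy feature are made up, and the real representation has 8,546 dimensions per window.

```python
from statistics import mean, pstdev


def window_starts(n_frames: int, length: int = 20, stride: int = 5) -> list[int]:
    """Start frame of every full window."""
    return list(range(0, n_frames - length + 1, stride))


def chronological_split(items: list, train_fraction: float = 0.7):
    """First part of the timeline is train, the remainder is test."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


def fit_scaler(train_values: list[float]):
    """Return a z-score function whose statistics come from train only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # avoid division by zero
    return lambda x: (x - mu) / sigma


starts = window_starts(n_frames=120)        # 120 frames is a toy length
features = [float(s) for s in starts]       # toy scalar feature per window
train, test = chronological_split(features)
scale = fit_scaler(train)
test_scaled = [scale(x) for x in test]      # test reuses train statistics
print(len(starts), len(train), len(test))   # 21 14 7
```

Fitting the scaler on train windows only means test windows never influence the normalization statistics, which is the leakage the shared setup is guarding against.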

	There are four reusable head families:

	\| Head family \| Used by \| What it means \|
	\| --- \| --- \| --- \|
	\| Linear softmax classifier \| Action Recognition, Procedure Step Recognition, Action Boundary Detection, Next-Action Prediction, Contact State Prediction, Temporal Order Verification, Multimodal Synchronization Detection \| z-score features, then `XW+b`, softmax, cross-entropy, L2 \|
	\| Dual ridge regression/projection \| Hand Trajectory Forecasting, Cross-Modal Reconstruction \| z-score input/target, solve ridge regression with L2=10 \|
	\| Ridge + cosine ranking \| Language Grounding, Cross-Modal Retrieval \| project one modality into another feature space, then rank candidates by cosine \|
	\| Multi-label logistic regression \| Object Relevance Prediction \| z-score non-caption features, sigmoid object heads, threshold at 0.5 \|
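The ranking half of the "Ridge + cosine ranking" family, together with the MRR metric reported elsewhere in this README, can be sketched as follows. The ridge projection is omitted and the queries are assumed to be already projected into the candidate space; vectors and function names are toy values for illustration, not the suite's code.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_of_match(query, candidates, match_index: int) -> int:
    """1-based rank of the true candidate when sorted by cosine similarity."""
    scores = [cosine(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order.index(match_index) + 1


def mean_reciprocal_rank(ranks: list[int]) -> float:
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranks = [
    rank_of_match([0.9, 0.1], candidates, match_index=0),  # ranked first
    rank_of_match([0.6, 0.5], candidates, match_index=1),  # ranked third
]
print(ranks, f"MRR={mean_reciprocal_rank(ranks):.4f}")  # [1, 3] MRR=0.6667
```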

	The optional neural run keeps the same window representation, leakage filters,
	chronological splits, and metrics, but replaces the task heads with small
	PyTorch MLP classifiers or regressors. Its outputs live under
	[`results/episode_task_suite/neural_mlp/`](results/episode_task_suite/neural_mlp/),
	and the rollup is stored in the `neural_tasks` section of
	[`results/episode_task_suite/summary_report.json`](results/episode_task_suite/summary_report.json).

	The task-specific heads are:

	\| Task \| Input \| Minimal head \| Output \|
	\| --- \| --- \| --- \| --- \|
	\| Action Recognition \| all featurized modalities \| linear softmax \| current action class \|
	\| Procedure Step Recognition \| all featurized modalities \| linear softmax \| current subtask class \|
	\| Action Boundary Detection \| all featurized modalities \| linear softmax \| steady vs action boundary \|
	\| Next-Action Prediction \| all featurized modalities at `t` \| linear softmax \| action at `t+20` frames \|
	\| Hand Trajectory Forecasting \| all featurized modalities at `t` \| ridge regression \| future 10-frame left/right hand joints \|
	\| Contact State Prediction \| non-contact and non-caption signals \| linear softmax \| any body contact \|
	\| Object Relevance Prediction \| non-caption signals \| multi-label logistic \| relevant object set \|
	\| Language Grounding \| sensor windows projected to text space \| ridge projection + cosine ranking \| matching time window for text query \|
	\| Cross-Modal Retrieval \| motion/IMU/camera projected to visual space \| ridge projection + cosine ranking \| matching depth/video window \|
	\| Cross-Modal Reconstruction \| motion/IMU/camera \| ridge regression \| compressed depth/video target \|
	\| Temporal Order Verification \| `[x_t, x_{t+1}, x_{t+1} - x_t]` \| binary linear softmax \| correct vs reversed order \|
	\| Multimodal Synchronization Detection \| motion plus visual pair \| binary linear softmax \| aligned vs shifted by 8 windows \|
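The Temporal Order Verification input concatenates two adjacent windows with their difference. A minimal sketch of building correct and reversed pairs is shown below; how the suite constructs its negative pairs is an assumption here, and the 2-D toy windows stand in for the full feature vectors.

```python
def pair_features(first: list[float], second: list[float]) -> list[float]:
    """Concatenate [first, second, second - first] for one window pair."""
    diff = [b - a for a, b in zip(first, second)]
    return first + second + diff


def order_examples(windows: list[list[float]]):
    """Yield (features, label): 1 for the true order, 0 for the reversed pair."""
    for x_t, x_next in zip(windows, windows[1:]):
        yield pair_features(x_t, x_next), 1
        yield pair_features(x_next, x_t), 0


toy = [[0.0, 1.0], [0.5, 1.5], [2.0, 1.0]]  # three toy 2-D windows
for features, label in order_examples(toy):
    print(label, features)
```

The difference term flips sign when the pair is reversed, which gives even a linear head a direct cue about direction.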

	## Key Results

	\| Experiment \| Main score \| Accuracy \| Notes \|
	\| --- \| ---: \| ---: \| --- \|
	\| Motion-only action \| 0.9688 macro-F1 \| 0.9828 \| Uses motion/IMU features only \|
	\| Current all-feature action \| 0.9829 macro-F1 \| 0.9863 \| 8,546-dimensional multimodal representation \|
	\| Motion-only subtask \| 0.9528 macro-F1 \| 0.9759 \| Strong within-episode subtask signal \|
	\| Current all-feature subtask \| 0.9173 macro-F1 \| 0.9828 \| High accuracy, lower class-balanced score \|
	\| Cross-modal retrieval \| 0.3678 top-5 \| n/a \| Motion/IMU/camera/audio retrieves matching depth/video \|
	\| Transition detection \| 0.6118 macro-F1 \| 0.9080 \| Boundary F1 is 0.1250 \|
	\| Hand trajectory forecast \| 0.8647 MPJPE \| n/a \| Predicts future hand-joint trajectory \|
	\| Neural MLP hand forecast \| 0.1079 MPJPE \| n/a \| Same features/split, nonlinear regression head \|
	\| Neural MLP temporal order \| 0.8520 F1 \| 0.8578 \| Strong improvement on adjacent-window ordering \|
	\| Neural MLP misalignment \| 0.7153 F1 \| 0.7009 \| Detects shifted motion/visual/audio pairs better than the linear head \|
	\| Audio ablation \| +0.0418 mean delta \| n/a \| Current audio variant improves the primary metric on 6 of 12 task contracts \|
	\| Alternate audio representation \| +0.0936 mean delta \| n/a \| Alternate audio-window representation improves over the baseline audio variant on 6 of 12 task contracts \|
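Rows such as "Current all-feature subtask" pair high accuracy with a lower macro-F1. The toy sketch below shows how that gap arises when a rare class is partly missed. It averages F1 over the classes present in the labels, which is one common convention and may differ from the suite's metric code; the labels are invented.

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Share of windows whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = ["pour"] * 18 + ["stir"] * 2
y_pred = ["pour"] * 18 + ["pour", "stir"]   # one of two rare windows missed
print(f"accuracy={accuracy(y_true, y_pred):.4f} "
      f"macro_f1={macro_f1(y_true, y_pred):.4f}")
# accuracy=0.9500 macro_f1=0.8198
```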

	## Audio Contribution Study

	The audio ablation keeps the same windows and task labels, then compares input
	variants under the same chronological split. The script
	[`scripts/audio_ablation_and_raw_upgrade.py`](scripts/audio_ablation_and_raw_upgrade.py)
	reuses the real task-suite windows and evaluates six variants for
	every task: current inputs, no audio, audio-only, alternate audio-only, audio
	representation replacement, and all inputs plus the alternate audio representation.

	The measured single-episode result is task-specific:

	\| Readout \| Value \|
	\| --- \| ---: \|
	\| Tasks where current audio improves the primary metric \| 6 / 12 \|
	\| Mean current-audio delta \| +0.0418 \|
	\| Tasks where alternate audio representation improves over baseline audio \| 6 / 12 \|
	\| Mean alternate-representation delta vs baseline audio \| +0.0936 \|
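The two readouts per comparison (tasks improved and mean delta) can be derived from per-task metric pairs as sketched below. Whether the study flips the sign for error metrics such as MPJPE is an assumption here, and the task names and values are toy numbers, not results from this repo.

```python
def signed_delta(with_audio: float, without_audio: float, higher_is_better: bool) -> float:
    """Positive means the audio variant helped, whatever the metric direction."""
    raw = with_audio - without_audio
    return raw if higher_is_better else -raw


def summarize(rows):
    """Return (number of improved tasks, mean signed delta)."""
    deltas = [signed_delta(w, wo, hib) for (_, w, wo, hib) in rows]
    improved = sum(d > 0 for d in deltas)
    return improved, sum(deltas) / len(deltas)


rows = [  # (task, with audio, without audio, higher_is_better), toy values
    ("toy_classification", 0.62, 0.58, True),
    ("toy_forecast_error", 0.30, 0.35, False),
    ("toy_retrieval", 0.20, 0.24, True),
]
improved, mean_delta = summarize(rows)
print(f"{improved} / {len(rows)} improved, mean delta {mean_delta:+.4f}")
```

A positive mean delta with only half the tasks improving, as in the table above, indicates that a few tasks gain a lot while others lose a little, so the per-task CSV is more informative than the mean.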

	Full files:

	- [`results/audio_ablation/AUDIO_ABLATION_SUMMARY.md`](results/audio_ablation/AUDIO_ABLATION_SUMMARY.md)
	- [`results/audio_ablation/audio_ablation_metrics.csv`](results/audio_ablation/audio_ablation_metrics.csv)
	- [`results/audio_ablation/audio_delta_summary.csv`](results/audio_ablation/audio_delta_summary.csv)
	- [`docs/data/audio_ablation_summary.json`](docs/data/audio_ablation_summary.json)
	- [`docs/assets/charts/audio_ablation_delta.svg`](docs/assets/charts/audio_ablation_delta.svg)

	## Neural MLP Results

	The neural baseline was run locally with `--include-neural` for all 12 tasks
	using 80 epochs, hidden size 128, batch size 128, and CPU execution. It is not a
	foundation model result; it is a controlled nonlinear-head comparison over the
	same 8,546-dimensional multimodal representation.

	\| Task \| Neural metric \| Minimal metric \| Readout \|
	\| --- \| ---: \| ---: \| --- \|
	\| Action Recognition \| 0.0148 macro-F1 \| 0.0500 macro-F1 \| Still blocked by unseen future classes \|
	\| Procedure Step Recognition \| 0.0281 macro-F1 \| 0.0506 macro-F1 \| Same single-episode split limitation \|
	\| Action Boundary Detection \| 0.5862 macro-F1 \| 0.6118 macro-F1 \| Similar to the linear baseline \|
	\| Next-Action Prediction \| 0.0419 macro-F1 \| 0.0593 macro-F1 \| Same unseen-label issue \|
	\| Hand Trajectory Forecasting \| 0.1079 MPJPE \| 0.8647 MPJPE \| Neural regression improves this target (lower MPJPE is better) \|
	\| Contact State Prediction \| 1.0000 macro-F1 \| 1.0000 macro-F1 \| Degenerate one-class sample \|
	\| Object Relevance Prediction \| 0.1679 micro-F1 \| 0.1803 micro-F1 \| Similar weak object signal \|
	\| Language Grounding \| 0.0168 MRR \| 0.0160 MRR \| Similar ranking behavior \|
	\| Cross-Modal Retrieval \| 0.1300 MRR \| 0.2693 MRR \| Linear ridge remains stronger here \|
	\| Cross-Modal Reconstruction \| -0.0102 R2 \| -0.0153 R2 \| Small improvement but still weak \|
	\| Temporal Order Verification \| 0.8520 F1 \| 0.5400 F1 \| Neural head captures local temporal structure \|
	\| Multimodal Synchronization Detection \| 0.7153 F1 \| 0.5052 F1 \| Neural head improves alignment detection \|

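	The strongest single-episode self-supervised signal is cross-modal retrieval:
	motion/IMU/camera/audio features retrieve matching depth/video windows
	substantially better than random. The linear ridge baseline reaches 0.2693 MRR
	on this task, compared with 0.1300 MRR for the neural head.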

	## Single-Episode Diagnostics and Explorer

	While waiting for broader Xperience-10M access, the repo now includes an
	artifact-driven diagnostics pass over the public sample episode:

	- `results/single_episode_diagnostics/object_labels/window_object_labels.csv`
	exports 1,161 real window-level object-label sets from `annotation.hdf5`.
	- `results/single_episode_diagnostics/modality_ablation/ablation_metrics.csv`
	recomputes all 96 task/modality cells, including object relevance.
	- `results/single_episode_diagnostics/timeline_overlay/timeline_overlay.csv`
	aligns 2,079 existing prediction rows back to the episode timeline.
	- `results/single_episode_diagnostics/alignment_stress/alignment_shift_metrics.csv`
	evaluates cross-modal retrieval under explicit time shifts.
	- `docs/single_episode_explorer.html` is a static interactive page for
	inspecting window labels, objects, predictions, modality statistics, and
	diagnostic scores.
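
The alignment stress test in the list above evaluates retrieval under explicit time shifts. A minimal sketch of that idea, using mean reciprocal rank (MRR) over cosine similarity, is shown here. It assumes shifts are measured in whole windows and that window `i` on the query side should retrieve window `i` on the candidate side; the repo's actual shift units, similarity function, and function names are not stated in this README, so everything named below is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_reciprocal_rank(queries, candidates, targets):
    """Average of 1/rank of the correct candidate for each query."""
    total = 0.0
    for query, target in zip(queries, targets):
        scores = [cosine(query, c) for c in candidates]
        # Rank is 1 plus the number of candidates scoring strictly higher.
        rank = 1 + sum(1 for s in scores if s > scores[target])
        total += 1.0 / rank
    return total / len(queries)


def mrr_under_shift(query_feats, candidate_feats, shift):
    """MRR when the query stream is misaligned by `shift` windows.

    Window i is represented by the query features of window i + shift, while
    the correct retrieval target stays at candidate i.
    """
    n = len(query_feats)
    pairs = [(i, i + shift) for i in range(n) if 0 <= i + shift < n]
    queries = [query_feats[j] for _, j in pairs]
    targets = [i for i, _ in pairs]
    return mean_reciprocal_rank(queries, candidate_feats, targets)


# Toy features: each window has a distinct one-hot signature.
toy = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(mrr_under_shift(toy, toy, shift=0), mrr_under_shift(toy, toy, shift=1))
```

If the features carry real temporal alignment information, MRR should drop as the absolute shift grows; a flat curve would suggest the retrieval score is driven by something other than synchronization.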

	These are single-episode research diagnostics. They are useful for studying
	task definitions, feature behavior, and model errors before scaling to more
	episodes; they are not reported as multi-episode benchmark results.

	## Reproducibility Check

	The full pipeline was re-run from the local raw public sample into a temporary
	local workspace, and the regenerated metrics were compared with the committed
	artifacts. The baseline metrics, 12 task metrics, feature manifest, and
	available modality manifest matched exactly after float normalization.

	See [`notes/reproducibility_audit.md`](notes/reproducibility_audit.md) for the
	commands and verification evidence.
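
A comparison "after float normalization" can be sketched as follows. This is an illustration of the general technique, not the audit script: the helper names and the rounding precision of six decimal places are assumptions, and the JSON snippets are toy data.

```python
import json


def normalize(value, ndigits=6):
    """Recursively round floats so tiny representation noise is ignored."""
    if isinstance(value, float):
        return round(value, ndigits)
    if isinstance(value, dict):
        return {key: normalize(item, ndigits) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize(item, ndigits) for item in value]
    return value


def metrics_match(committed, regenerated, ndigits=6):
    """True when both structures are equal after rounding every float."""
    return normalize(committed, ndigits) == normalize(regenerated, ndigits)


# Toy payloads; in practice these would be loaded from the committed and
# regenerated metric files.
committed = json.loads('{"task": {"score": 0.25, "n": 12}}')
regenerated = json.loads('{"task": {"score": 0.2500000000001, "n": 12}}')
print(metrics_match(committed, regenerated))
```

Rounding keeps the check exact and easy to diff, at the cost of occasional false mismatches for values that sit right on a rounding boundary; a tolerance-based comparison such as `math.isclose` is an alternative when that matters.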

	## Why Some Scores Are Low

	The task suite intentionally uses a chronological split:

	```text
	first 70% of the episode -> train
	last 30% of the episode -> test
	```

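	The test segment contains some action/subtask labels never seen during training.
	A classifier cannot predict a class it has never observed, so the timeline and
	next-action classifiers score low here (for example, 0.0500 macro-F1 on Action
	Recognition and 0.0593 on Next-Action Prediction with the minimal baseline).
	This exposes the core limitation of single-episode learning instead of hiding
	it behind random splits.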
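
The split and the resulting unseen-label problem can be sketched in a few lines. This is a minimal illustration under assumptions: the function names, the `(window_index, label)` layout, and the toy labels are hypothetical and do not come from the task-suite code.

```python
def chronological_split(windows, train_fraction=0.7):
    """Split time-ordered windows into an early train part and a late test part."""
    cut = round(len(windows) * train_fraction)
    return windows[:cut], windows[cut:]


def unseen_test_labels(train, test):
    """Labels that occur in the test segment but never in the train segment."""
    train_labels = {label for _, label in train}
    return sorted({label for _, label in test} - train_labels)


# Toy episode of (window_index, action_label) pairs; labels are made up.
labels = ["open", "open", "pour", "pour", "pour", "stir", "stir", "wipe", "wipe", "stir"]
episode = list(enumerate(labels))

train, test = chronological_split(episode)
print(len(train), len(test), unseen_test_labels(train, test))
```

In the toy episode the label `wipe` first appears after the cut, so no model trained on the first 70% can ever predict it. A random split would scatter `wipe` windows into training and hide this effect.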

	## Modalities Used

	The current public-sample pipeline uses:

	- hand/body mocap joints and contact labels,
	- camera translation and rotation,
	- IMU acceleration and gyroscope traces,
	- depth confidence features,
	- six video streams,
	- audio from the sample MP4 stream,
	- caption/object/interaction text features,
	- SLAM point-cloud summary features,
	- calibration parameters.

	The full technical source manifest is stored in
	[`results/episode_task_suite/feature_manifest.json`](results/episode_task_suite/feature_manifest.json).

	## Data Notice

	Xperience-10M data belongs to its original authors and is subject to the
	official Ropedia dataset license and access terms. This repo contains code and
	derived experiment artifacts only; it does not redistribute the raw videos or
	raw annotation dataset.