| # Ropedia Xperience-10M Task Suite |
|
|
|
|
| <p align="center"> |
| <img src="docs/assets/brand/xperience10m-logo-social-card.png" alt="Ropedia Xperience-10M Task Suite logo card" width="760"> |
| </p> |
|
|
| This research and development project builds on the public Xperience-10M sample |
| episode released by Ropedia. The goal is to make one richly multimodal egocentric |
| episode understandable, turn it into concrete embodied-AI task definitions, and |
| prepare the same pipeline for future held-out multi-episode training. |
|
|
| The central research questions are: |
|
|
| - What can be learned from one aligned Xperience-10M episode while separating |
| sample-specific observations from later multi-episode questions? |
| - Which input/output tasks are meaningful for embodied AI when video, depth, |
| pose, mocap, IMU, and language annotations are synchronized? |
| - What baseline models and evaluation files should exist before scaling to |
| Qwen3-Omni or other multimodal foundation-model fine-tuning? |
|
|
| ## Why This Project Exists |
|
|
| This project is organized as a compact research artifact around Xperience-10M: |
| start from a real public episode, make every modality and label path inspectable, |
| turn the data into concrete embodied-AI tasks, and keep the evaluation boundary |
| clear while preparing the next multi-episode experiments. The emphasis is on |
| research judgment as much as implementation: what the sample can show, what it |
| cannot show, and what evidence should exist before claiming model quality. |
|
|
| The work is designed to demonstrate four capabilities that matter for |
| embodied-AI research infrastructure: |
|
|
| | Capability | What this project shows | |
| | --- | --- | |
| | Multimodal data understanding | Parses the public sample into synchronized windows across video, audio, depth, pose/SLAM, mocap, IMU, calibration, and language-derived signals | |
| | Task design | Defines 12 human-readable tasks plus four direction-extension probes with inputs, outputs, process modules, metrics, and case-study walkthroughs | |
| | Model and evaluation discipline | Runs minimal and compact neural baselines, records predictions/metrics, keeps chronological split boundaries explicit, and separates sample evidence from held-out claims | |
| | Scale-up planning | Connects the public-sample pipeline to 32/128-episode held-out pilots, Qwen3-Omni LoRA, Cosmos-style world-model branches, policy-model branches, and the future Xperience-native foundation-model pretraining goal | |
|
|
| ## Start Here |
|
|
| For a first pass, use [`PROJECT_BRIEF.md`](PROJECT_BRIEF.md) or the |
| machine-readable [`docs/data/project_brief.json`](docs/data/project_brief.json). |
| They give the project shape in one page: what exists now, what the public |
| sample can support, where the 12 tasks and baselines live, and what must happen |
| before the multi-episode omni-model stage becomes a real held-out evaluation. |
|
|
| | Reader goal | Best entry point | |
| | --- | --- | |
| | Understand the whole project quickly | [`PROJECT_BRIEF.md`](PROJECT_BRIEF.md) | |
| | See the visual research dashboard | [GitHub Pages dashboard](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/) | |
| | Navigate the 12 tasks, four tracks, and scale-up plan | [Interactive research roadmap](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/research_roadmap.html), [`docs/data/research_roadmap_interactive.json`](docs/data/research_roadmap_interactive.json) | |
| | Compare current task metrics | [`RESEARCH_TAKEAWAYS.md`](RESEARCH_TAKEAWAYS.md), [`docs/data/summary_metrics.json`](docs/data/summary_metrics.json) | |
| | Compare possible foundation backbones | [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md), [`docs/data/foundation_model_plan.json`](docs/data/foundation_model_plan.json) | |
| | Understand the future native pretraining goal | [`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md) | |
| | See additional concrete project directions | [`ADDITIONAL_DEVELOPMENT_DIRECTIONS.md`](ADDITIONAL_DEVELOPMENT_DIRECTIONS.md), [`docs/data/additional_development_directions.json`](docs/data/additional_development_directions.json) | |
| | Understand one model input | [`results/episode_task_suite/feature_manifest.json`](results/episode_task_suite/feature_manifest.json), [`results/episode_task_suite/windows.csv`](results/episode_task_suite/windows.csv) | |
| | Check multi-episode data status | [`results/omni_finetune/DATA_ACCESS_STATUS.md`](results/omni_finetune/DATA_ACCESS_STATUS.md) | |
|
|
| ## Research Project Overview |
|
|
| | Theme | Current implementation | |
| | --- | --- | |
| | Dataset slice | One public Xperience-10M sample episode, 5,821 frames, 1,161 windows, and an 8,546-dimensional representation | |
| | Modalities | Video, audio, depth, camera pose/SLAM, hand/body mocap, IMU, calibration, and language annotations | |
| | Task suite | 12 human-readable embodied-AI task contracts with input, process, output, metrics, predictions, and case-study walkthroughs | |
| | Baselines | Minimal linear/ridge/logistic heads plus compact PyTorch MLP task heads over the same chronological split | |
| | Research directions | Task mapping and extension probes for human modeling, 3D/4D reconstruction, egocentric interaction, and world modeling | |
| | Scale-up path | A first selected-episode Qwen3-Omni LoRA diagnostic pilot has completed on the 96/16/16 split; it proves the multi-episode export/train/eval/package loop, but the weak held-out metrics make it a baseline for error analysis rather than a strong model. Cosmos 3/world-model and VLA/policy branches reuse the same split and package contract after their targets are implemented. | |
| | Public surfaces | GitHub repo, GitHub Pages dashboard, GHCR static-site package, HF Space, HF artifact dataset, HF baseline-model repo, and HF collection | |
|
|
| For the fastest interpretation of the current metrics, start with |
| [`RESEARCH_TAKEAWAYS.md`](RESEARCH_TAKEAWAYS.md) and |
| [`docs/data/research_takeaways.json`](docs/data/research_takeaways.json). |
| They summarize what the public sample results actually show: class shift under |
| chronological splits, neural gains on dynamics/order/alignment, harder |
| retrieval/reconstruction probes, and why the next model-quality step needs |
| held-out episodes. |
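
One way to quantify the class shift mentioned above is to compare label frequencies on either side of a chronological cut. The sketch below uses total variation distance on a toy label sequence; the label names, the sequence, and the helper functions are illustrative assumptions rather than values or code from this repo.

```python
from collections import Counter


def label_distribution(labels):
    """Return each label's relative frequency in a non-empty sequence."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


# Toy time-ordered labels (hypothetical): the activity changes late in the episode.
labels = ["reach"] * 5 + ["pour"] * 5
cut = len(labels) * 7 // 10            # chronological 70/30 boundary, integer arithmetic
train, test = labels[:cut], labels[cut:]

shift = total_variation(label_distribution(train), label_distribution(test))
print(f"train={dict(Counter(train))} test={dict(Counter(test))} TV distance={shift:.3f}")
```

A distance near 0 means the early and late parts share a label mix; a value near 1 means the test part is dominated by labels the training part rarely saw.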
|
|
| Current contributions: |
|
|
| - manifested sliding-window features over the currently extracted modalities, |
| - motion-only and current all-feature baseline models, |
| - 12 end-to-end episode-level tasks, |
| - lightweight neural MLP heads for the same 12 task contracts, |
| - a generated four-direction research taxonomy matching the Ropedia job tracks, |
| - four additional direction-extension probes with minimal and neural baselines, |
| - human-readable research task cards and an interactive scrub/play walkthrough storyboard for every task, |
| - an interactive research roadmap connecting 12 tasks, four research tracks, current sample evidence, the Qwen3-Omni scale-up path, and foundation-model branch selection, |
| - a next-milestone track for Qwen3-Omni fine-tuning, Cosmos 3 world modeling, and sensor-bridge evaluation, |
| - a future pretraining plan for an Xperience Embodied Foundation Model over the full corpus after smaller multi-episode stages prove value, |
| - metrics, predictions, model weights, manifests, charts, and a two-level |
| tabbed static research website, |
| - a clear explanation of what is implemented now and what moves to the multi-episode stage. |
|
|
| ## Current Research Scope |
|
|
| This project is best read as a staged embodied-AI research study: |
|
|
| | Layer | Current scope | Where to start | |
| | --- | --- | --- | |
| | Data understanding | One public Xperience-10M sample episode is converted into 5,821 frames, 1,161 aligned windows, and an 8,546-dimensional multimodal representation. | [`PROJECT_BRIEF.md`](PROJECT_BRIEF.md), [`PROJECT_STATUS.md`](PROJECT_STATUS.md) | |
| | Task suite | Twelve human-readable tasks cover action, procedure, contact, object, language, retrieval, reconstruction, order, and synchronization questions. | [`RESEARCH_TAKEAWAYS.md`](RESEARCH_TAKEAWAYS.md), [`results/episode_task_suite/summary_report.json`](results/episode_task_suite/summary_report.json) | |
| | Baselines | Minimal heads and compact PyTorch MLP heads provide a first controlled comparison on the same chronological split. | [`results/episode_task_suite/neural_mlp/`](results/episode_task_suite/neural_mlp/) | |
| | Diagnostics | Audio contribution, modality ablations, timeline overlays, object labels, and alignment stress tests show which signals are useful and which tasks remain hard. | [`results/audio_ablation/AUDIO_ABLATION_SUMMARY.md`](results/audio_ablation/AUDIO_ABLATION_SUMMARY.md), [`docs/single_episode_explorer.html`](docs/single_episode_explorer.html) | |
| | Scale-up | The selected 128-episode Qwen3-Omni LoRA diagnostic pilot has a verified validation-aware held-out package: 96/16/16 selected episodes, 3,808 exported windows, 512 validation windows, 448 held-out test windows, and public-safe metrics/predictions. JSON validity is 87.50%, below the 98% target, so the next pass focuses on structured-output reliability and task-quality error analysis. | [`RESEARCH_ROADMAP.md`](RESEARCH_ROADMAP.md), [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md), [`docs/data/omni_finetune_verified_result.json`](docs/data/omni_finetune_verified_result.json), [`results/omni_finetune/verified_public/`](results/omni_finetune/verified_public/) | |
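
The JSON-validity figure above can be read as the share of generations that parse as a JSON object. The following is a minimal sketch under that assumption; the project's actual validity check and output schema are not shown in this README, and the toy outputs are invented so that the rate happens to land on the same percentage.

```python
import json


def is_valid_json_object(text):
    """True if `text` parses as JSON and the top-level value is an object."""
    try:
        return isinstance(json.loads(text), dict)
    except (json.JSONDecodeError, TypeError):
        return False


def json_validity_rate(outputs):
    """Fraction of model outputs that are valid JSON objects."""
    if not outputs:
        raise ValueError("need at least one output")
    return sum(is_valid_json_object(o) for o in outputs) / len(outputs)


TARGET = 0.98  # quality target quoted in the table above

# Hypothetical raw generations: 7 well-formed objects and 1 truncated one.
outputs = ['{"action": "pour"}'] * 7 + ['{"action": "po']
rate = json_validity_rate(outputs)
print(f"JSON validity = {rate:.2%}, meets target: {rate >= TARGET}")
```

Separating parse failures from wrong-but-parseable answers keeps structured-output reliability and task quality as two distinct error-analysis questions.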
|
|
| Detailed dataset notes, reproduction checks, and generated JSON reports are |
| included for readers who want to inspect the implementation, but they are |
| supporting materials rather than the main reading path. Use |
| [`ARTIFACT_GUIDE.md`](ARTIFACT_GUIDE.md) when you want the full file map. |
|
|
| ## Project Status |
|
|
| If you only have one minute, use |
| [`PROJECT_STATUS.md`](PROJECT_STATUS.md) and |
| [`docs/data/project_status.json`](docs/data/project_status.json). |
| They give the current research state in one compact table: |
|
|
| | Area | Current decision | |
| | --- | --- | |
| | Public-sample pipeline | Verified on one public sample episode: 5,821 frames, 1,161 windows, 8,546 dimensions | |
| | 12-task suite | Verified minimal baselines with committed metrics, predictions, and manifests | |
| | Neural heads | Verified compact PyTorch MLP heads over the same task contracts and chronological splits | |
| | Dataset context | Official Xperience-10M links, sample-vs-gated-data boundary, modality coverage, and redistribution policy are documented | |
| | Evaluation protocol | Verified generated protocol for windowing, split policy, leakage controls, and per-task metrics | |
| | Website and Hub pages | Public dashboard, Hugging Face Space, artifact dataset, baseline model repo, and collection use the same project framing and links | |
| Qwen3-Omni multi-episode pilot | Verified diagnostic result package exists for the selected 96/16/16 episode split; held-out metrics are weak, and JSON validity (87.50%) is below the 98% target | |
| | Raw Xperience-10M data / full Qwen weights | Not redistributed | |
|
|
| ## 90-Second Research Project Path |
|
|
| If you are reading the project cold, open these in order: |
|
|
| | Step | Question | Primary artifacts | What should be true | |
| | --- | --- | --- | --- | |
| | 1 | What is this project? | [`PROJECT_BRIEF.md`](PROJECT_BRIEF.md), [`PROJECT_STATUS.md`](PROJECT_STATUS.md), [dashboard](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/) | A public-sample Xperience-10M research project with 12 tasks, baselines, and a scale-up plan. | |
| | 2 | What data is used? | [`XPERIENCE10M_DATASET_CARD_ALIGNMENT.md`](XPERIENCE10M_DATASET_CARD_ALIGNMENT.md), [official HF dataset](https://huggingface.co/datasets/ropedia-ai/xperience-10m), [sample HF dataset](https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample) | The implemented suite uses one public sample episode; the gated dataset is reserved for selected multi-episode training. | |
| | 3 | What does one model input contain? | [`windows.csv`](results/episode_task_suite/windows.csv), [`feature_manifest.json`](results/episode_task_suite/feature_manifest.json), [`available_modalities.json`](results/episode_task_suite/available_modalities.json) | Each window is an aligned multimodal unit with video, audio, depth, pose/SLAM, mocap, IMU, calibration, and language-derived signals. | |
| | 4 | What are the 12 tasks? | [`results/episode_task_suite/task_walkthroughs/`](results/episode_task_suite/task_walkthroughs/), [`docs/data/task_walkthroughs.json`](docs/data/task_walkthroughs.json) | Every task has a human-readable name, case study, input, process modules, output, metric, and limitation. | |
| | 5 | How are tasks evaluated? | [`EVALUATION_PROTOCOL.md`](EVALUATION_PROTOCOL.md), [`docs/data/evaluation_protocol.json`](docs/data/evaluation_protocol.json) | The window unit, chronological split, leakage controls, task metrics, and current limitations are explicit. | |
| | 6 | What do the current results mean? | [`RESEARCH_TAKEAWAYS.md`](RESEARCH_TAKEAWAYS.md), [`docs/data/research_takeaways.json`](docs/data/research_takeaways.json), [`docs/data/summary_metrics.json`](docs/data/summary_metrics.json) | Current metrics describe sample-level task behavior and identify which signals need larger held-out experiments. | |
| | 7 | Which models are implemented? | [`results/episode_task_suite/summary_report.json`](results/episode_task_suite/summary_report.json), [`results/episode_task_suite/neural_mlp/`](results/episode_task_suite/neural_mlp/), [HF baseline repo](https://huggingface.co/cy0307/ropedia-xperience-10m-task-baselines) | Each task has minimal and neural-head evidence over the same feature windows. | |
| | 8 | What research directions does this support? | [`RESEARCH_ROADMAP.md`](RESEARCH_ROADMAP.md), [`docs/data/research_directions.json`](docs/data/research_directions.json), [`docs/data/research_direction_extensions.json`](docs/data/research_direction_extensions.json) | The tasks are mapped to human modeling, 3D/4D reconstruction, egocentric interaction, and world modeling. | |
| | 9 | Which foundation model comes next? | [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md), [`docs/data/foundation_model_plan.json`](docs/data/foundation_model_plan.json), [`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md) | Qwen3-Omni is the first held-out LoRA baseline; Cosmos 3 is the first world-model branch; policy models wait for explicit action targets; Xperience-native pretraining is the full-corpus future goal. | |
| | 10 | How do I reproduce it? | [`REPRODUCIBILITY.md`](REPRODUCIBILITY.md), [`notes/reproducibility_audit.md`](notes/reproducibility_audit.md) | Public commands and expected outputs are documented for the sample-episode task suite. | |
| | 11 | What is still pending? | [`docs/data/omni_finetune_verified_result.json`](docs/data/omni_finetune_verified_result.json), [`DATA_ACCESS_STATUS.md`](results/omni_finetune/DATA_ACCESS_STATUS.md), [`MULTI_EPISODE_ACCESS_STATUS.md`](results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md) | The first held-out diagnostic pilot is verified; strong model quality remains pending because JSON validity is 87.50% and action/subtask metrics remain weak. | |
|
|
| A compact reader-path summary is available at |
| [`docs/data/project_packet.json`](docs/data/project_packet.json). |
|
|
| ## Supporting Files |
|
|
| [`ARTIFACT_GUIDE.md`](ARTIFACT_GUIDE.md) is the human-readable map for readers |
| who want to inspect the project files after the first pass. It groups the main |
| briefs, task outputs, baseline results, visual assets, data notes, and |
| scale-up documents. |
|
|
| [`docs/data/artifact_index.json`](docs/data/artifact_index.json) is the compact |
| machine-readable companion used by the website and Hugging Face artifact |
| dataset. |
|
|
| ## Evaluation Protocol |
|
|
| [`EVALUATION_PROTOCOL.md`](EVALUATION_PROTOCOL.md) and |
| [`docs/data/evaluation_protocol.json`](docs/data/evaluation_protocol.json) are |
| generated from committed metric artifacts. They define: |
|
|
| - the 20-frame window unit, stride, feature dimension, and raw-data policy, |
| - the chronological 70/30 single-episode split and its generalization limit, |
| - the per-task input, target, primary metric, minimal score, and neural score, |
| - leakage controls for future labels, target-side signals, caption/object |
| labels, and train-only normalization, |
| - current limitations, including cross-episode generalization, |
| audio-visual learning, pixel-depth reconstruction, and real held-out |
| multi-episode Qwen3-Omni quality. |
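
The chronological split and train-only normalization listed above can be sketched as follows. This is a minimal illustration on a toy one-dimensional feature, not the repo's implementation; the function names and the rounding rule at the 70% boundary are assumptions.

```python
from statistics import fmean, pstdev


def chronological_split(n_windows, train_fraction=0.7):
    """Split time-ordered window indices into an early train part and a late test part."""
    cut = int(n_windows * train_fraction)
    return list(range(cut)), list(range(cut, n_windows))


def fit_standardizer(train_values):
    """Estimate mean and standard deviation from training values only."""
    mean = fmean(train_values)
    std = pstdev(train_values) or 1.0  # guard against constant features
    return mean, std


def standardize(values, mean, std):
    """Apply previously fitted statistics to any split."""
    return [(v - mean) / std for v in values]


# Toy 1-D feature over 10 time-ordered windows (hypothetical values).
feature = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
train_idx, test_idx = chronological_split(len(feature))

# Test rows never contribute to the normalization statistics.
mean, std = fit_standardizer([feature[i] for i in train_idx])
train_z = standardize([feature[i] for i in train_idx], mean, std)
test_z = standardize([feature[i] for i in test_idx], mean, std)
print(train_idx, test_idx, test_z)
```

Because the test part is strictly later in time, its standardized values can drift outside the training range, which is one visible symptom of the generalization limit noted above. For 1,161 windows the 70% boundary falls between windows 812 and 813; the exact index depends on a rounding rule this README does not state.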
|
|
| ## Dataset Context |
|
|
| The official [`ropedia-ai/xperience-10m`](https://huggingface.co/datasets/ropedia-ai/xperience-10m) |
| dataset is a gated large-scale egocentric multimodal dataset for embodied AI, |
| robotics, spatial intelligence, and world modeling. The public |
| [`ropedia-ai/xperience-10m-sample`](https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample) |
| repo provides the sample episode used for the implemented task suite here. |
|
|
| This project keeps those layers separate: the public sample supports the |
| current 12-task study, while the gated full dataset is used only for the |
| selected multi-episode Qwen3-Omni pilot. Raw Xperience-10M MP4/HDF5/RRD files |
| are not redistributed in this repo or in the Hugging Face mirrors. |
|
|
| The current verified public-sample subset is: |
|
|
| - one public sample episode, 5,821 frames, and 1,161 aligned windows, |
| - raw sample files with six MP4 video streams and audio streams, |
| - `annotation.hdf5` carrying depth, SLAM/camera pose, hand/body mocap, IMU, |
| language/caption annotations, calibration, metadata, and timing records, |
| - an 8,546-dimensional baseline representation using video, audio, depth, |
| pose/SLAM, mocap, IMU, calibration, and language-derived signals. |
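
The frame and window counts above are consistent with a simple sliding-window calculation. The sketch below assumes full windows only (no padding) and the 20-frame window from the evaluation protocol; the stride of 5 is an inference, not a value stated in this README. It is the only integer stride that yields 1,161 windows from 5,821 frames under those assumptions.

```python
def count_windows(n_frames, window, stride):
    """Number of full sliding windows that fit in n_frames."""
    if n_frames < window:
        return 0
    return (n_frames - window) // stride + 1


def window_bounds(index, window, stride):
    """Half-open [start, end) frame range covered by the window at `index`."""
    start = index * stride
    return start, start + window


N_FRAMES = 5821  # frames in the public sample episode
WINDOW = 20      # frames per window, from the evaluation protocol
STRIDE = 5       # ASSUMPTION: inferred because it reproduces the 1,161-window count

n = count_windows(N_FRAMES, WINDOW, STRIDE)
print(n, window_bounds(0, WINDOW, STRIDE), window_bounds(n - 1, WINDOW, STRIDE))
```

Under these assumptions the last window covers frames 5800 to 5819, so the final frame of the episode is not part of any window.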
|
|
| Detailed dataset notes are available in |
| [`XPERIENCE10M_DATASET_CARD_ALIGNMENT.md`](XPERIENCE10M_DATASET_CARD_ALIGNMENT.md) |
| for readers who need the full upstream-card and access-term context. The |
| practical boundary is simple: current task-suite results come from the public |
| sample, and the first multi-episode Qwen3-Omni diagnostic pilot is verified but |
| does not yet demonstrate strong model quality. |
|
|
| Start with the visual dashboard: |
|
|
| **[chaoyue0307.github.io/ropedia-xperience-10m-task-suite](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/)** |
|
|
| Hugging Face Space app: |
|
|
| **[cy0307-ropedia-xperience-10m-task-suite.static.hf.space](https://cy0307-ropedia-xperience-10m-task-suite.static.hf.space/)** |
|
|
| ## Read This Project In Layers |
|
|
| | Layer | What to inspect | Why it matters | |
| | --- | --- | --- | |
| | Project status | `PROJECT_STATUS.md`, `docs/data/project_status.json` | Gives a one-table current project summary before reading the full artifact trail | |
| | Data contract | `windows.csv`, `feature_manifest.json`, modality manifests | Confirms what each sample window contains before modeling | |
| | Dataset context | `XPERIENCE10M_DATASET_CARD_ALIGNMENT.md`, official dataset links | Explains the official dataset, public sample, modalities, access boundary, and what this repo uses | |
| | Visual assets | `FIGURE_INDEX.md`, `docs/assets/` | Shows the task-suite graphic, modality thumbnails, pipeline diagrams, charts, and logo assets | |
| | Evaluation protocol | `EVALUATION_PROTOCOL.md`, `docs/data/evaluation_protocol.json` | Defines the task unit, split, metrics, leakage controls, and current limitations | |
| | Research roadmap | `RESEARCH_ROADMAP.md`, `docs/data/research_roadmap.json` | Shows the path from sample-level task development to multi-episode work, larger model branches, and the future native-pretraining goal | |
| | Additional development directions | `ADDITIONAL_DEVELOPMENT_DIRECTIONS.md`, `docs/data/additional_development_directions.json` | Records concrete non-backbone tracks: taxonomy, benchmark protocol, representation learning, skill graphs, affordances, 3D/4D memory, QA, and policy transfer | |
| | Xperience Embodied Foundation Model plan | `XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md` | Describes the long-term full-corpus pretraining goal, target modules, objectives, staged scale-up, hardware ranges, and evaluation protocol | |
| | Minimal heads | softmax, ridge projection/regression, multi-label logistic heads | Keeps every input/output contract visible and inspectable | |
| | Neural heads | PyTorch MLP classifiers/regressors under `neural_mlp/` | Checks whether nonlinear heads improve each task without changing features | |
| | Evidence | metrics, predictions, confusion matrices, diagrams, dashboard | Makes the single-episode task development inspectable without rerunning first | |
| | Artifact guide | `ARTIFACT_GUIDE.md` | Groups the public evidence into research-project layers after the first-pass overview | |
| | Reproducibility contract | `REPRODUCIBILITY.md`, `docs/data/reproducibility_matrix.json` | States public commands, expected outputs, exact-match reproduction evidence, and non-reproducible boundaries | |
| | Citation metadata | `CITATION.cff`, `codemeta.json`, `LICENSE` | Makes the repo easier to cite, index, and reuse without confusing code license and dataset terms | |
|
|
| ## Links |
|
|
| | Resource | Link | |
| | --- | --- | |
| | This GitHub repo | [github.com/ChaoYue0307/ropedia-xperience-10m-task-suite](https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite) | |
| | This project website | [chaoyue0307.github.io/ropedia-xperience-10m-task-suite](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/) | |
| | This Hugging Face Space | [huggingface.co/spaces/cy0307/ropedia-xperience-10m-task-suite](https://huggingface.co/spaces/cy0307/ropedia-xperience-10m-task-suite) | |
| | Live Hugging Face static app | [cy0307-ropedia-xperience-10m-task-suite.static.hf.space](https://cy0307-ropedia-xperience-10m-task-suite.static.hf.space/) | |
| | GitHub Container package | [ghcr.io/chaoyue0307/ropedia-xperience-10m-task-suite](https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite/pkgs/container/ropedia-xperience-10m-task-suite) | |
| | Derived artifacts on Hugging Face | [huggingface.co/datasets/cy0307/ropedia-xperience-10m-task-suite-artifacts](https://huggingface.co/datasets/cy0307/ropedia-xperience-10m-task-suite-artifacts) | |
| | Minimal and neural task baselines on Hugging Face | [huggingface.co/cy0307/ropedia-xperience-10m-task-baselines](https://huggingface.co/cy0307/ropedia-xperience-10m-task-baselines) | |
| | Hugging Face collection | [huggingface.co/collections/cy0307/ropedia-xperience-10m-task-suite](https://huggingface.co/collections/cy0307/ropedia-xperience-10m-task-suite) | |
| | Xperience-10M dataset website | [ropedia.com/dataset](https://ropedia.com/dataset) | |
| | Xperience-10M release page | [ropedia.com/blog/20260316_xperience_10m](https://ropedia.com/blog/20260316_xperience_10m) | |
| | Ropedia GitHub organization | [github.com/Ropedia](https://github.com/Ropedia) | |
| | HOMIE Toolkit | [github.com/Ropedia/HOMIE-toolkit](https://github.com/Ropedia/HOMIE-toolkit) | |
| | Xperience-10M Hugging Face dataset | [huggingface.co/datasets/ropedia-ai/xperience-10m](https://huggingface.co/datasets/ropedia-ai/xperience-10m) | |
| | Xperience-10M sample on Hugging Face | [huggingface.co/datasets/ropedia-ai/xperience-10m-sample](https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample) | |
| | Ropedia Hugging Face organization | [huggingface.co/ropedia-ai](https://huggingface.co/ropedia-ai) | |
|
|
| ## Citation, License, And Metadata |
|
|
| Use [`CITATION.cff`](CITATION.cff) when citing this project. The repository |
| also includes [`codemeta.json`](codemeta.json) for machine-readable software |
| metadata and [`docs/data/project_manifest.json`](docs/data/project_manifest.json) |
| for website/Hugging Face surface metadata. |
|
|
| The code files are MIT-licensed. Raw Xperience-10M data is not redistributed |
| here, and dataset use remains governed by the official Ropedia/Xperience-10M |
| terms. See [`LICENSE`](LICENSE) and [`DATA_NOTICE.md`](DATA_NOTICE.md). |
|
|
| The infographic uses a custom text-free research background and places the |
| shared processing contract and all 12 task families above the modality atlas, |
| with enlarged public-sample modality thumbnails below the task map. The task |
| names, input/output summaries, and metrics are overlaid from |
| [`results/episode_task_suite/summary_report.json`](results/episode_task_suite/summary_report.json) |
| by [`scripts/render_task_suite_infographic.py`](scripts/render_task_suite_infographic.py), |
| so the published PNG is a presentation graphic with verified labels and metrics, |
| not an illustration with invented numbers. |
|
|
| The website also includes a responsive native modality atlas backed by |
| [`docs/data/modality_atlas.json`](docs/data/modality_atlas.json) and |
| [`docs/assets/modalities/`](docs/assets/modalities/). Those assets are small |
| derived thumbnails from the public sample, not raw Xperience-10M files. |
|
|
| The pipeline and architecture figures use the same pattern: text-free visual |
| backgrounds carry the composition, while |
| [`scripts/render_overview_figures.py`](scripts/render_overview_figures.py) |
| overlays exact labels, dimensions, and metrics from the committed result files. |
|
|
| ## Scope |
|
|
| This is a learning, inspection, and pipeline-validation repo built from one |
| public sample episode. The next model-quality stage is to run the same suite |
| over many episodes and split train/test by held-out episode. |
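
Splitting by held-out episode means every window from a given episode lands in exactly one split. A minimal sketch of that idea follows, using the 96/16/16 proportions quoted elsewhere in this README; the episode IDs, the seed, and the function name are hypothetical, and the project's actual episode-selection procedure is not shown here.

```python
import random


def split_by_episode(episode_ids, n_val, n_test, seed=0):
    """Assign whole episodes to train/val/test so no episode spans two splits."""
    ids = sorted(set(episode_ids))
    if n_val + n_test >= len(ids):
        raise ValueError("validation + test must leave at least one training episode")
    random.Random(seed).shuffle(ids)  # seeded, so the assignment is reproducible
    return {
        "test": ids[:n_test],
        "val": ids[n_test:n_test + n_val],
        "train": ids[n_test + n_val:],
    }


# Hypothetical episode IDs; real IDs would come from the gated dataset.
episodes = [f"episode_{i:03d}" for i in range(128)]
splits = split_by_episode(episodes, n_val=16, n_test=16)
print({name: len(ids) for name, ids in splits.items()})
```

Windows are then exported per episode after the assignment, which avoids the leakage that a window-level random split would introduce between overlapping windows.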
|
|
| ## What Is Inside |
|
|
| ```text |
| scripts/ |
| train_min_action_model.py # motion/IMU baseline |
| train_all_modalities_model.py # current all-feature lightweight baseline |
| episode_task_suite.py # 12 end-to-end task definitions |
| neural_task_models.py # optional PyTorch MLP heads for all 12 tasks |
| research_direction_taxonomy.py # maps 12 tasks to the four research tracks |
| research_direction_extension_tasks.py # one extra data-backed probe per track |
| task_walkthroughs.py # human-readable task-card and walkthrough-storyboard metadata |
| generate_visualizations.py # refreshes SVG charts + summary JSON |
| render_task_suite_infographic.py # renders the task-suite presentation PNG |
| export_modality_atlas_assets.py # exports responsive modality-card assets |
| render_overview_figures.py # renders polished pipeline/architecture PNGs |
| build_brand_assets.py # derives logo sizes, favicon, social card |
| build_artifact_index.py # builds the compact artifact guide data |
| build_quality_gates.py # builds release checks |
| validate_mirror_parity.py # checks prepared GitHub/HF mirror file parity |
| validate_scope_claims.py # separates setup artifacts from completed model metrics |
| validate_task_surface.py # checks readable task cards and interactive storyboard wiring |
| validate_website_integrity.py # checks local site links, anchors, and images |
| validate_publication_package.py # checks public repo + HF bundle contents |
| publish_hf_bundles.py # uploads prepared HF Space/artifact/model bundles |
| omni/ |
| download_sample_modelscope.py # ModelScope sample download helper |
| build_episode_manifest.py # metadata-only multi-episode scanner |
| plan_finetune_sample_budget.py # storage/sample-count planner |
| qwen3_omni_adapter_smoke.py # real-data Qwen3-Omni adapter setup check |
| |
| results/ |
| min_action_model/ # motion-only action baseline artifacts |
| min_subtask_model/ # motion-only subtask baseline artifacts |
| min_all_modalities_action_model/ # current all-feature action artifacts |
| min_all_modalities_subtask_model/ # current all-feature subtask artifacts |
| episode_task_suite/ # 12-task suite metrics and predictions |
| neural_mlp/ # optional neural baseline artifacts per task |
| research_directions/ # four-track taxonomy, CSV, and summary |
| research_direction_extensions/ # four extra direction probes + predictions |
| task_walkthroughs/ # case-study walkthroughs for all 12 tasks |
| omni_exploration/ # ModelScope readiness-check artifacts |
| |
| docs/ |
| index.html # GitHub Pages dashboard |
| data/additional_development_directions.json # concrete non-backbone project directions |
| data/summary_metrics.json # website-readable metrics bundle |
| data/evidence_contract.json # machine-readable project scope |
| data/artifact_index.json # compact project-artifact catalog |
| data/live_publication_status.json # live GitHub/HF publication verification |
| data/quality_gates.json # machine-readable release checks |
| data/task_surface_integrity.json # machine-readable task-card/storyboard integrity check |
| data/project_manifest.json # machine-readable public-surface metadata |
| data/project_packet.json # compact project path and scope summary |
| data/research_roadmap.json # multi-episode and omni-model roadmap |
| data/research_directions.json # four-track website data bundle |
| data/research_direction_extensions.json # four extra probe data bundle |
| data/task_walkthroughs.json # human-readable task-card and walkthrough-storyboard data |
| data/modality_atlas.json # responsive modality-card data |
| assets/brand/*.png # project logo, favicon, social card |
| assets/task_suite_infographic.png # 12-task presentation graphic |
| assets/modalities/ # public-sample derived modality thumbnails |
| assets/pipeline_diagram.png # verified episode pipeline graphic |
| assets/qwen3_omni_lora_pipeline.png # Qwen3-Omni LoRA training-flow figure |
| assets/task_architectures.png # verified 12-task minimal architecture map |
| assets/charts/*.svg # regenerated visualizations |
| |
| notes/ |
| min_action_model.md |
| all_modalities_model.md |
| episode_task_suite.md |
| ``` |
|
|
| Raw Xperience-10M data is **not** committed. Download it from the official |
| Ropedia distribution and follow the dataset terms. |
|
|
| ## GitHub Package |
|
|
| The public dashboard is packaged as a static-site container on GitHub Container |
| Registry. It contains the `docs/` site plus the main reader documents; it does |
| not include raw Xperience-10M videos, raw annotations, gated data, or model |
| weights. |
|
|
| ```bash |
| docker pull ghcr.io/chaoyue0307/ropedia-xperience-10m-task-suite:latest |
| docker run --rm -p 8080:80 ghcr.io/chaoyue0307/ropedia-xperience-10m-task-suite:latest |
| ``` |
|
|
| Then open `http://localhost:8080`. |
|
|
| ## Data Expected |
|
|
| The scripts expect a workspace with the Ropedia HOMIE toolkit and the |
| Xperience-10M sample episode: |
|
|
| ```text |
| <workspace>/ |
|   HOMIE-toolkit/ |
|   data/sample/xperience-10m-sample/ |
|     annotation.hdf5 |
|     fisheye_cam0.mp4 |
|     fisheye_cam1.mp4 |
|     fisheye_cam2.mp4 |
|     fisheye_cam3.mp4 |
|     stereo_left.mp4 |
|     stereo_right.mp4 |
| ``` |
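
Before running the scripts, it can help to confirm that the layout above is in place. The helper below is a hypothetical convenience check and not part of this repo; it only tests for the file names listed above.

```python
from pathlib import Path

# File names taken from the expected layout above.
EXPECTED_FILES = [
    "annotation.hdf5",
    "fisheye_cam0.mp4",
    "fisheye_cam1.mp4",
    "fisheye_cam2.mp4",
    "fisheye_cam3.mp4",
    "stereo_left.mp4",
    "stereo_right.mp4",
]
SAMPLE_SUBDIR = Path("data") / "sample" / "xperience-10m-sample"


def missing_sample_files(workspace):
    """Return the expected sample files that are absent under `workspace`."""
    sample_dir = Path(workspace) / SAMPLE_SUBDIR
    return [name for name in EXPECTED_FILES if not (sample_dir / name).is_file()]


# Check the current directory as the workspace; pass another path as needed.
missing = missing_sample_files(".")
print("sample looks complete" if not missing else f"missing: {missing}")
```

A `--mode minimal` ModelScope download will report the other five MP4 streams as missing, which is expected for that mode.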
|
|
| The public sample dataset identifier is: |
|
|
| ```text |
| ropedia-ai/xperience-10m-sample |
| ``` |
|
|
| Hugging Face URL: |
|
|
| ```text |
| https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample |
| ``` |
|
|
| ## Quickstart |
|
|
| From a workspace folder: |
|
|
| ```bash |
| git clone https://github.com/Ropedia/HOMIE-toolkit.git |
| python3.12 -m venv .venv |
| source .venv/bin/activate |
| pip install -r HOMIE-toolkit/requirements.txt huggingface_hub hf_xet |
| ``` |
|
|
| Download the sample: |
|
|
| ```bash |
| hf download ropedia-ai/xperience-10m-sample \ |
| --repo-type dataset \ |
| --local-dir data/sample/xperience-10m-sample |
| ``` |
|
|
| If Hugging Face access is unavailable in your environment, use ModelScope: |
|
|
| ```bash |
| python scripts/omni/download_sample_modelscope.py \ |
| --output-dir data/sample/xperience-10m-sample \ |
| --mode minimal |
| ``` |
|
|
| `--mode minimal` downloads `annotation.hdf5`, `README.md`, and |
| `fisheye_cam0.mp4`. Use `--mode all-training` to add all six MP4 streams while |
| still skipping `visualization.rrd`. |
|
|
| Clone and run this repo: |
|
|
| ```bash |
| git clone https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite.git |
| cd ropedia-xperience-10m-task-suite |
| python scripts/episode_task_suite.py --workspace /path/to/workspace |
| ``` |
|
|
| Run the same 12-task suite with lightweight neural heads: |
|
|
| ```bash |
| pip install torch |
| python scripts/episode_task_suite.py \ |
| --workspace /path/to/workspace \ |
| --include-neural |
| ``` |
|
|
| Run the smaller baselines: |
|
|
| ```bash |
| python scripts/train_min_action_model.py --workspace /path/to/workspace |
| python scripts/train_all_modalities_model.py --workspace /path/to/workspace |
| ``` |
|
|
| ## Xperience-10M Fine-Tuning Exploration |
|
|
| This repo includes a first Qwen3-Omni fine-tuning path over Xperience-10M and |
| keeps public-sample evidence separate from multi-episode fine-tuning artifacts. |
| The validation-aware selected-episode held-out package is verified as a |
| diagnostic pilot, not as a strong final model. The inputs fall into two groups: |
|
|
| - direct Qwen3-Omni inputs: RGB/fisheye video, embedded MP4 audio, and language |
| prompts, |
| - adapter-required Xperience-10M sensor inputs: depth, pose/SLAM, hand/body |
| mocap, contacts, and IMU. |
|
|
|  |
|
|
| The figure shows the intended end-to-end training flow: raw valid episodes enter |
| episode-level split validation, parallel media/sensor export creates Qwen-style |
| JSONL records, Qwen3-Omni receives video/audio/text directly, the sensor bridge |
| adds depth/pose/mocap/IMU features, LoRA adapters are trained on prepared |
| train/val episodes, and sealed held-out test evaluation produces predictions, |
| metrics, run reports, and upload-ready adapter artifacts. |
|
|
| The scale-up path requires valid prepared episodes, held-out episode splits, |
| training metadata, predictions, metrics, and a run report. A result is ready |
| for public README, website, or Hugging Face updates only after the validator |
| passes and `scripts/omni/package_verified_omni_result.py` creates a |
| public-safe derived-artifact package. The current verified package is listed in |
| [`docs/data/omni_finetune_verified_result.json`](docs/data/omni_finetune_verified_result.json). |
|
|
| ### Sample Count Decision |
|
|
| Do not treat "10M" as a reason to start with the entire dataset. The engineering |
| unit that matters first is diverse held-out episodes, not adjacent windows from |
| one session. |
|
|
| | Phase | Episodes/samples | Approx windows at stride 5 | Purpose | |
| | --- | ---: | ---: | --- | |
| | Readiness | 1-3 | 1k-3k | Verify loaders, token alignment, and task heads | |
| | Pilot | 16-32 | 18k-37k | First held-out-episode evaluation | |
| | Useful LoRA run | 64-128 | 74k-149k | Train sensor adapters plus selected Qwen3-Omni LoRA | |
| | Storage-heavy run | 256+ | 297k+ | Only after download layout and checkpoint size are stable | |
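
The window counts in the table follow from plain sliding-window arithmetic. The sketch below is illustrative only: `count_windows` is a hypothetical helper, the 20-frame window length is borrowed from the task-suite setup described later in this README (the fine-tuning exporter may use a different length), and the 5,815-frame episode is an assumed round figure, not a measured value.

```python
def count_windows(num_frames: int, window: int = 20, stride: int = 5) -> int:
    """Number of full sliding windows that fit into an episode."""
    if num_frames < window:
        return 0
    return (num_frames - window) // stride + 1


# Hypothetical episode length; window and stride are assumptions (see above).
frames_per_episode = 5815
windows_per_episode = count_windows(frames_per_episode)
print(windows_per_episode)       # 1160
print(16 * windows_per_episode)  # 18560
```

At that rate, 16 episodes give on the order of 18k windows, which is the scale of the Pilot row.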
|
|
| Use the budget helper before downloading: |
|
|
| ```bash |
| python scripts/omni/plan_finetune_sample_budget.py \ |
| --storage-root /path/to/storage \ |
| --target-free-after-download-gb 800 \ |
| --all-training-per-episode-gb 2.4 \ |
| --full-preview-per-episode-gb 5.1 |
| ``` |
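
The core arithmetic behind such a budget is small enough to check by hand. The sketch below is not the helper's implementation: `max_episodes` and the 1,200 GB free-space figure are hypothetical, while 800, 2.4, and 5.1 mirror the flag values in the command above.

```python
import math


def max_episodes(free_gb: float, target_free_after_gb: float, per_episode_gb: float) -> int:
    """Whole episodes that fit while still leaving the target free space."""
    usable_gb = free_gb - target_free_after_gb
    if usable_gb <= 0 or per_episode_gb <= 0:
        return 0
    return math.floor(usable_gb / per_episode_gb)


# Hypothetical volume with 1,200 GB currently free.
free_gb = 1200.0
all_training = max_episodes(free_gb, 800.0, 2.4)  # all MP4 streams, no visualization.rrd
full_preview = max_episodes(free_gb, 800.0, 5.1)  # larger per-episode footprint
print(all_training, full_preview)
```

Under these assumed numbers the all-training layout fits roughly twice as many episodes as the full-preview layout, which is why the per-episode size matters more than the headline dataset size.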
|
|
| ### Multi-Episode Readiness Gate |
|
|
| ```bash |
| python scripts/omni/discover_xperience10m_sources.py \ |
| --workspace /path/to/ropedia-xperience-10m-task-suite \ |
| --data-root /path/to/xperience10m_data \ |
| --output results/omni_finetune/source_discovery.json |
| ``` |
|
|
| Current status in this repo: |
|
|
| - public_sample_valid_episodes: 1 (degraded-valid: annotation + fisheye_cam0.mp4) |
| - gated_metadata_audit: 12,102 complete visible episodes across 802 complete sessions |
| - selected_episode_plan: 128 source-balanced episodes, 96/16/16 train/val/test |
| - selected_download_size: 277.71 GiB excluding `visualization.rrd` |
| - verified_validation_aware_diagnostic_package: true |
| - selected_split: 96 train / 16 validation / 16 held-out test episodes |
| - exported_windows: 2,848 train / 512 validation / 448 test |
| - validation_samples_used: 512 |
| - held_out_eval: 448 test windows from 14 exported test episodes (of 16 selected) |
| - train_loss / val_loss: 0.4130 / 0.0331 |
| - current_quality_target: JSON validity 87.50%, below the 98% target |
| - gated dataset: available for selected multi-episode data preparation |
| - source_discovery: `results/omni_finetune/source_discovery.json` |
| - data_status: `results/omni_finetune/DATA_ACCESS_STATUS.md` |
| - access_status: `results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md` |
| |
| Use this gate before scheduling any full fine-tune run. The pilot should use |
| balanced held-out selection, not the first paths in repository order. The |
| current 128-episode selection filters for complete leaf episodes, excludes |
| `visualization.rrd`, balances episode-size bands, and preserves one selected |
| episode per top-level session UUID. |
| |
| ### Progressive Train/Validation Pilot |
| |
| The selected 128-episode plan can be used before every episode has arrived by |
| training only on prepared `train` episodes and monitoring prepared `val` episodes. |
| The final `test` episodes stay sealed until the end, so early development does |
| not contaminate held-out evaluation. |
| |
| ```bash |
| python scripts/omni/build_selection_episode_manifest.py \ |
| --workspace /path/to/ropedia-xperience-10m-task-suite \ |
| --data-root /path/to/xperience10m_128 \ |
| --selection-json results/omni_finetune/xperience10m_128_episode_selection.json \ |
| --output results/omni_finetune/trainval_progressive/episode_manifest_trainval.json \ |
| --include-split train \ |
| --include-split val |
| ``` |
| |
| `scripts/omni/run_trainval_progressive_128.sh` wraps the same split guard, |
| exports a train/val-only Qwen3-Omni JSONL dataset, and launches LoRA training |
| without running final test evaluation. The exporter uses session-qualified |
| episode IDs and path-based split matching so repeated folder names such as |
| `ep1` cannot collide across different sessions. |
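
To see why session qualification matters, consider a minimal sketch. It assumes a hypothetical layout of `<data-root>/<session>/<episode>`; the function name and the `__` separator are illustrative and are not taken from the exporter.

```python
from pathlib import PurePosixPath


def session_qualified_id(episode_dir: str) -> str:
    """Build '<session>__<episode>' from the last two path components."""
    parts = PurePosixPath(episode_dir).parts
    if len(parts) < 2:
        raise ValueError(f"expected <session>/<episode>, got {episode_dir!r}")
    return f"{parts[-2]}__{parts[-1]}"


# Two hypothetical sessions that both contain a folder named ep1.
first = session_qualified_id("/data/session-a/ep1")
second = session_qualified_id("/data/session-b/ep1")
print(first, second)
```

A bare folder name would map both episodes to `ep1`, so a split lookup keyed on it could silently place a test episode in the training set.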
|
|
| For larger prepared subsets, `scripts/omni/run_trainval_parallel_export_8gpu.sh` |
| uses the same split guard, exports episodes in parallel CPU shards, skips and |
| reports episodes that contain no labeled windows under the configured label |
| rule, then launches Qwen3-Omni LoRA with `NUM_PROCESSES=8`. |
|
|
| ### Full 128-Episode Held-Out Pilot |
|
|
| Once all selected episodes are complete, use the fixed selected-episode split: |
|
|
| - 96 train episodes, |
| - 16 validation episodes, |
| - 16 held-out test episodes. |
|
|
| The clean full-run launcher validates the selected split, exports all splits in |
| parallel, trains Qwen3-Omni LoRA on train episodes while optionally monitoring |
| validation loss, then evaluates on the held-out test split: |
|
|
| ```bash |
| RUN_ID=xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \ |
| DATA_ROOT=/path/to/xperience10m_128 \ |
| SELECTION_JSON=results/omni_finetune/xperience10m_128_episode_selection.json \ |
| MODEL_DIR=/path/to/Qwen__Qwen3-Omni-30B-A3B-Instruct \ |
| NUM_PROCESSES=8 \ |
| TRAIN_VAL_SPLIT=val \ |
| MAX_VAL_SAMPLES=512 \ |
| scripts/omni/run_128_fullsplit_parallel_export_8gpu.sh |
| ``` |
|
|
| The current verified diagnostic package uses the same selected split and 8-GPU |
| training path, records validation loss over 512 validation windows, and keeps |
| the held-out test split sealed for final evaluation. The next pass should keep |
| this package contract while tightening JSON decoding, target formatting, and |
| action/subtask error analysis. |
|
|
| Monitor the run with: |
|
|
| ```bash |
| python scripts/omni/monitor_omni_progress.py \ |
| --run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu |
| ``` |
|
|
| The monitor reads training `progress.jsonl`, new evaluator partial-prediction |
| progress, and legacy generation logs, so long held-out evals can still expose |
| sample-level progress even before final metrics are written. |
|
|
| Validate the run artifacts stage by stage: |
|
|
| ```bash |
| python scripts/omni/validate_omni_finetune_run.py \ |
| --run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \ |
| --require-stage manifest |
| |
| python scripts/omni/validate_omni_finetune_run.py \ |
| --run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \ |
| --require-stage eval \ |
| --min-json-validity 0.98 |
| ``` |
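
JSON validity is the fraction of generated outputs that parse as JSON. The sketch below shows one way to compute it. It assumes each prediction is a raw string and that a valid output must decode to a JSON object; both are assumptions, and the validator's exact rule may differ.

```python
import json


def json_validity(predictions: list[str]) -> float:
    """Fraction of predictions that decode to a JSON object."""
    if not predictions:
        return 0.0
    valid = 0
    for text in predictions:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            valid += 1
    return valid / len(predictions)


# Toy outputs: seven well-formed objects and one truncated generation.
outputs = ['{"action": "pour"}'] * 7 + ['{"action": "pour"']
rate = json_validity(outputs)
passes_gate = rate >= 0.98
print(rate, passes_gate)
```

Truncated generations like the last toy output are a typical way for this rate to fall below a strict threshold.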
|
|
| After the eval validator passes, create the public-safe result package: |
|
|
| ```bash |
| python scripts/omni/package_verified_omni_result.py \ |
| --dataset-run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \ |
| --train-run-id <train_run_id> \ |
| --eval-run-id <eval_run_id> |
| ``` |
|
|
| For long-running remote jobs, the packaging step can be watched automatically: |
|
|
| ```bash |
| python scripts/omni/watch_verified_omni_package.py \ |
| --dataset-run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \ |
| --train-run-id <train_run_id> \ |
| --eval-run-id <eval_run_id> |
| ``` |
|
|
| While waiting, the watcher can append `eval_progress_observed` events from |
| partial prediction files or legacy generation logs. This keeps the package |
| status file useful during long held-out evaluations. |
|
|
| The package copies only small derived artifacts such as metrics, predictions, |
| confusion matrices, run reports, manifests, validation summaries, and training |
| metadata. The exact required eval files and primary metrics come from the |
| selected backbone contract in `configs/omni_backbones`, so Qwen3-Omni, |
| Cosmos-style world models, and VLA/policy branches can share the same verified |
| publication gate once their model-specific evaluators exist. The package |
| excludes raw Xperience-10M files, base-model weights, adapter or checkpoint |
| weights, full checkpoints, and large archives. |
|
|
| For hardware setups that can run multiple eval workers, the Qwen evaluator also |
| supports deterministic sample shards: |
|
|
| ```bash |
| python scripts/omni/eval_qwen3_omni_lora.py \ |
| --dataset-jsonl results/omni_finetune/xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu_dataset/dataset.jsonl \ |
| --adapter-dir checkpoints/<train_run_id>/adapter_lora \ |
| --run-id <eval_shard_0> \ |
| --eval-split test \ |
| --sample-offset 0 \ |
| --sample-stride 4 |
| |
| python scripts/omni/merge_qwen3_omni_eval_shards.py \ |
| --dataset-jsonl results/omni_finetune/xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu_dataset/dataset.jsonl \ |
| --output-dir results/omni_finetune/<merged_eval_run_id> \ |
| --shard-dir results/omni_finetune/<eval_shard_0> \ |
| --shard-dir results/omni_finetune/<eval_shard_1> \ |
| --shard-dir results/omni_finetune/<eval_shard_2> \ |
| --shard-dir results/omni_finetune/<eval_shard_3> |
| ``` |
|
|
| Only the merged eval directory should be validated and reported publicly, |
| because the merger checks coverage and recomputes the metrics from all |
| held-out predictions. |
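
The coverage check is easy to picture with strided indexing. The sketch below assumes that `--sample-offset k --sample-stride n` selects every sample whose index `i` satisfies `i % n == k`; that reading of the flags is an assumption, and `shard_indices` is a hypothetical helper.

```python
def shard_indices(num_samples: int, offset: int, stride: int) -> list[int]:
    """Indices handled by one worker under offset/stride sharding."""
    return list(range(offset, num_samples, stride))


# 448 held-out windows split across 4 workers.
num_samples, num_shards = 448, 4
shards = [shard_indices(num_samples, k, num_shards) for k in range(num_shards)]

# A merger can verify that the shards are disjoint and cover every sample.
merged = sorted(i for shard in shards for i in shard)
complete = merged == list(range(num_samples))
print([len(s) for s in shards], complete)
```

If one shard is missing or run twice, `merged` no longer equals the full index range, which is the kind of failure a merge step should reject before metrics are recomputed.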
|
|
| After dataset export, a model-neutral window index can be created for future |
| backbones: |
|
|
| ```bash |
| python scripts/omni/export_model_neutral_window_index.py \ |
| --dataset-jsonl results/omni_finetune/xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu_dataset/dataset.jsonl |
| ``` |
|
|
| This produces `window_index.jsonl` and `window_index_manifest.json` so Cosmos- |
| style world models and VLA/policy branches can reuse the same split-checked |
| windows without depending on Qwen chat-message records. |
|
|
| ### Uploading Qwen3-Omni LoRA Artifacts |
|
|
| The public-safe verified package intentionally excludes raw data, base Qwen |
| weights, LoRA weights, and full checkpoints. Adapter upload is a separate step: |
| use it only when the intended adapter directory is present and the model card |
| clearly distinguishes older smoke weights from the selected-episode diagnostic |
| or validation-aware run. |
|
|
| ```bash |
| python3 scripts/omni/upload_qwen3_omni_lora_to_hf.py \ |
| --repo-id cy0307/ropedia-qwen3-omni-lora-smoke \ |
| --source-dir /path/to/adapter_upload_package \ |
| --message "Upload Xperience-10M Qwen3-Omni LoRA pilot" |
| ``` |
|
|
| This script requires a valid Hugging Face token via `HF_TOKEN` or `--token`. |
| Network availability to `huggingface.co` is required. |
|
|
| ### Foundation Backbone Plan |
|
|
| The next modeling plan tracks several foundation-model branches instead of |
| assuming one backbone solves every Xperience-10M objective. |
|
|
| | Branch | Current role | When to use it | |
| | --- | --- | --- | |
| | Qwen3-Omni | First trainable multimodal LoRA pilot | Use for the selected 128-episode held-out baseline over video/audio/language plus sensor-bridge features. | |
| | Cosmos 3 | First world-model/action-generation branch | Use after data preparation for future-window prediction, action-conditioned world modeling, and synthetic-data usefulness tests. | |
| | GR00T | Humanoid/action-policy branch | Use after mocap/contact retargeting creates well-defined humanoid action targets. | |
| | OpenVLA / openpi | Open VLA/policy baselines | Use after the project defines robot-compatible or action-token targets. | |
| | Gemini Robotics | External reasoning reference | Use only for qualitative comparison or annotation support unless local trainable access exists. | |
| | Xperience Embodied Foundation Model | Future Xperience-native pretraining goal | Use only after multi-episode pilots, full-corpus storage, distributed training infrastructure, and scaling evidence justify a from-scratch domain model. | |
|
|
| See [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md) and |
| [`docs/data/foundation_model_plan.json`](docs/data/foundation_model_plan.json) |
| for the full selection matrix, source links, and model-specific evaluation |
| additions. See |
| [`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md) |
| for the long-term full-corpus pretraining plan. |
|
|
| Backbone-specific contracts now live in [`configs/omni_backbones`](configs/omni_backbones). |
| The extension contract is documented in |
| [`OMNI_MODEL_EXTENSION_CONTRACT.md`](OMNI_MODEL_EXTENSION_CONTRACT.md), and the |
| registry can be checked with: |
|
|
| ```bash |
| python scripts/omni/backbone_registry.py --validate --json |
| ``` |
|
|
| Verify that every configured backbone can pass the public-safe packaging |
| contract on synthetic derived artifacts: |
|
|
| ```bash |
| python scripts/omni/smoke_test_backbone_packaging.py |
| ``` |
|
|
| After a real held-out package is created, audit it before updating README, |
| website, or Hugging Face pages: |
|
|
| ```bash |
| python scripts/omni/audit_verified_omni_package.py \ |
| --package-dir results/omni_finetune/verified_public/<eval_run_id> |
| ``` |
|
|
| Create a new planned backbone branch from an existing contract template with: |
|
|
| ```bash |
| python scripts/omni/scaffold_omni_backbone.py \ |
| --template-backbone policy_vla_branch \ |
| --id new_policy_branch \ |
| --display-name "New Policy Branch" \ |
| --model-family "Model family name" \ |
| --dataset-contract xperience10m_observation_action_v1 \ |
| --training-objective observation_to_action_policy \ |
| --checkpoint-gate policy_checkpoint_action_space_and_normalizer \ |
| --dry-run |
| ``` |
|
|
| Each backbone config declares the checkpoint gate, required train/eval files, |
| allowed public artifacts, and forbidden private or heavyweight artifacts. This |
| keeps Qwen3-Omni, Cosmos-style world models, and policy/VLA branches on the same |
| split, validation, and publication discipline even though their training targets |
| are different. |
|
|
| ## Additional Development Directions |
|
|
| Beyond backbone selection and fine-tuning, Xperience-10M supports several |
| concrete research-development tracks: |
|
|
| | Direction | First useful artifact | Role in the project | |
| | --- | --- | --- | |
| | Episode taxonomy and data engine | Episode atlas, balance report, and split builder | Select representative data before training. | |
| | Standardized benchmark protocol | Versioned train/val/test manifests and metric scripts | Make future model results comparable. | |
| | Multimodal representation learning | Contrastive and masked-window encoder objectives | Learn reusable video/audio/depth/pose/mocap/IMU/language features. | |
| | Skill and procedure graph mining | Step graph, transitions, preconditions, and effects | Connect perception to planning and long-horizon reasoning. | |
| | Human-object affordance modeling | Contact, reachable-object, tool-use, and next-affordance tasks | Model what actions the scene makes possible. | |
| | 3D/4D scene and object memory | Persistent scene/object maps from depth, pose, multiview video, and objects | Track world state beyond single frames. | |
| | Data-quality and synchronization diagnostics | Per-episode QA for drift, missing streams, calibration, and corrupted files | Keep large multimodal training trustworthy. | |
| | Policy, retargeting, and simulation transfer | Action-token conversion and robot-compatible imitation examples | Bridge human egocentric experience to robot policy work. | |
|
|
| See [`ADDITIONAL_DEVELOPMENT_DIRECTIONS.md`](ADDITIONAL_DEVELOPMENT_DIRECTIONS.md) |
| and [`docs/data/additional_development_directions.json`](docs/data/additional_development_directions.json). |
|
|
| ## Four Research Directions |
|
|
| The 12 tasks are now organized against the four Ropedia research directions in |
| a generated artifact, not only in prose: |
|
|
| - [`research_direction_taxonomy.json`](results/episode_task_suite/research_directions/research_direction_taxonomy.json) |
| - [`research_direction_task_map.csv`](results/episode_task_suite/research_directions/research_direction_task_map.csv) |
| - [`research_direction_summary.md`](results/episode_task_suite/research_directions/research_direction_summary.md) |
| - [`docs/data/research_directions.json`](docs/data/research_directions.json) |
|
|
| The taxonomy uses two current baselines for every task: |
|
|
| | Baseline | Role | |
| | --- | --- | |
| | Minimal interpretable heads | Softmax, logistic, ridge, and retrieval heads over the 8,546-dimensional multimodal representation. These expose the input/output contract cleanly. | |
| | Neural MLP heads | Small PyTorch MLP classifiers/regressors on the same features and splits. These check whether nonlinear heads help before moving to Qwen/Omni fine-tuning. | |
|
|
| Current direction-level coverage: |
|
|
| | Direction | Current status | Covered task evidence | What is not solved yet | |
| | --- | --- | --- | --- | |
| | A. Human Modeling & Motion Understanding | Partially implemented | Hand Trajectory Forecasting and Contact State Prediction are direct; Action Recognition and Object Relevance Prediction are proxies. Neural MLP improves hand forecasting from `0.8647` to `0.1079` MPJPE. | No full body/shape model, SMPL/MANO target, deformation prior, or multi-episode motion-generation evaluation yet. | |
| | B. 3D/4D Reconstruction & Neural Rendering | Proxy tasks only | Cross-Modal Retrieval, Cross-Modal Reconstruction, and Multimodal Synchronization Detection test alignment/reconstruction prerequisites. | No NeRF, Gaussian Splatting, TSDF, mesh, novel-view synthesis, or calibrated 4D reconstruction model yet. | |
| | C. Egocentric Vision & Interaction | Strongest implemented track | 6 direct tasks: action, subtask, transition, next-action, object relevance, and caption grounding, plus alignment/order diagnostics and audio ablation. | Single-episode chronological split limits generalization; stronger audio and video-language backbones still need multi-episode testing. | |
| | D. Scene Reconstruction & World Modeling | Early proxy tasks | Procedure Step Recognition, Next-Action Prediction, Object Relevance Prediction, Cross-Modal Retrieval, Cross-Modal Reconstruction, Temporal Order Verification, and Multimodal Synchronization Detection provide state/world-model probes. | No persistent scene graph, object permanence task, long-term map, or held-out-episode world model yet. | |
|
|
| The important interpretation is that all four directions can be **started** from |
| the Xperience-10M sample modalities, but only direction C is strongly represented |
| by the current 12-task suite. Directions A, B, and D need additional targets and |
| multi-episode training before they become full research deliverables. |
|
|
| ## Four Direction-Extension Probes |
|
|
| Beyond the original 12 core tasks, the repo now includes one extra data-backed |
| probe for each research direction. These probes are computed from the same |
| `shared_windows.npz`, `windows.csv`, and `feature_manifest.json` artifacts, so |
| every reported number traces back to sample-derived features and saved metric |
| artifacts. |
|
|
| - [`research_direction_extension_results.json`](results/episode_task_suite/research_direction_extensions/research_direction_extension_results.json) |
| - [`research_direction_extension_summary.md`](results/episode_task_suite/research_direction_extensions/research_direction_extension_summary.md) |
| - [`docs/data/research_direction_extensions.json`](docs/data/research_direction_extensions.json) |
| - [`research_direction_extension_tasks.svg`](docs/assets/charts/research_direction_extension_tasks.svg) |
|
|
|  |
|
|
| | Direction | New extension task | Input | Output | Minimal | Neural MLP | Why it matters | |
| | --- | --- | --- | --- | ---: | ---: | --- | |
| | A. Human Modeling & Motion Understanding | Body and Hand Motion Intensity | non-mocap video/depth/pose/IMU/SLAM/language features | high vs low body/hand motion | `0.7827` macro-F1 | `0.7986` macro-F1 | Starts a human-motion-energy target without leaking mocap input. | |
| | B. 3D/4D Reconstruction & Neural Rendering | Multi-View Consistency Retrieval | fisheye camera feature query | synchronized stereo-left view rank | `0.5534` MRR | `0.3469` MRR | Tests whether multi-view features preserve synchronized 4D scene identity. | |
| | C. Egocentric Vision & Interaction | Action Phase Progress Estimation | non-caption multimodal window | progress inside current action segment | `0.3416` MAE | `0.3038` MAE | Adds a task-structure/intent-style target beyond class labels. | |
| | D. Scene Reconstruction & World Modeling | Short-Horizon Ego-Motion Forecasting | current sensors excluding camera translation and captions | future camera-translation delta | `0.1989` MAE | `0.0989` MAE | Starts a short-horizon world-model target over wearer motion. | |
|
|
| Run: |
|
|
| ```bash |
| python scripts/research_direction_extension_tasks.py |
| ``` |
|
|
| These four probes make the four-direction mapping more concrete, but they are |
| still single-episode extension baselines. Full research conclusions still require |
| multi-episode training, held-out episode evaluation, and stronger task-specific |
| models. |
|
|
| ## Task Walkthroughs For Juniors |
|
|
| Every task now has a beginner-facing explanation with: |
|
|
| - a concrete coffee-episode case study, |
| - the exact input contract, |
| - the intermediate processing modules, |
| - the output contract, |
| - the minimal and neural metrics, |
| - one important limitation. |
|
|
| Primary files: |
|
|
| - [`TASK_WALKTHROUGHS.md`](results/episode_task_suite/task_walkthroughs/TASK_WALKTHROUGHS.md) |
| - [`task_walkthroughs.json`](results/episode_task_suite/task_walkthroughs/task_walkthroughs.json) |
| - [`docs/data/task_walkthroughs.json`](docs/data/task_walkthroughs.json) |
| - [`docs/data/task_surface_integrity.json`](docs/data/task_surface_integrity.json) |
|
|
| Compact map: |
|
|
| | Task | Case study | Input -> process -> output | |
| | --- | --- | --- | |
| | Action Recognition | A pouring window should be named as the current action. | all-modality window -> action label builder + classifier -> action class | |
| | Procedure Step Recognition | A fine action is grouped into a broader drink-preparation stage. | all-modality window -> subtask label builder + classifier -> subtask label | |
| | Action Boundary Detection | Detect the change from preparing to pouring. | window -> boundary builder + binary classifier -> boundary/steady | |
| | Next-Action Prediction | A preparing window predicts what happens 20 frames later. | current window -> future-label shift + classifier -> next action | |
| | Hand Trajectory Forecasting | A hand moving toward a cup becomes a future 3D hand path. | current window -> future mocap target + regressor -> hand trajectory | |
| | Contact State Prediction | Decide whether hand/body contact is happening. | non-contact features -> contact target + binary classifier -> contact label | |
| | Object Relevance Prediction | Infer milk, cup, coffee, or related objects during pouring. | non-caption features -> multi-hot object target + sigmoid heads -> object set | |
| | Language Grounding | Query "Pour milk into coffee" and retrieve the matching moment. | text-like query + candidates -> projection + cosine ranker -> ranked windows | |
| | Cross-Modal Retrieval | Motion/IMU from pouring retrieves matching depth/video. | motion/IMU/camera -> projection + candidate index -> ranked depth/video windows | |
| | Cross-Modal Reconstruction | Infer depth/video features from motion, IMU, and camera pose. | source modalities -> scaler + regressor -> target modality vector | |
| | Temporal Order Verification | Tell whether reaching then pouring was reversed. | adjacent window pair -> pair combiner + binary classifier -> correct/reversed | |
| | Multimodal Synchronization Detection | Catch motion paired with visual/depth features shifted in time. | motion side + visual side -> aligned/shifted pair builder + classifier -> aligned/shifted | |
|
|
| ## Minimal 12-Task Architectures |
|
|
| These are deliberately minimal baselines. They are useful because every |
| input/output contract is explicit, not because they are strong embodied-AI |
| models. |
|
|
| Shared setup: |
|
|
| ```text |
| raw episode -> 20-frame windows, stride 5 -> 8,546-dimensional multimodal representation |
| chronological split: first 70% train, last 30% test |
| scalers are fit on train windows only |
| ``` |
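
The split and scaling rules above can be sketched in a few lines. This is a minimal stdlib illustration, not the suite's code: the function names and the toy two-feature windows are hypothetical, and zero-variance columns are given a unit scale as a simplifying assumption.

```python
from statistics import fmean, pstdev


def chronological_split(windows, train_percent=70):
    """Keep time order: the first train_percent of windows train, the rest test."""
    cut = len(windows) * train_percent // 100
    return windows[:cut], windows[cut:]


def fit_scaler(train_rows):
    """Per-feature mean and standard deviation from train windows only."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero
    return means, stds


def apply_scaler(rows, means, stds):
    """Z-score rows with statistics that were fit elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Toy stand-in for the window representation: 10 windows, 2 features.
windows = [[float(i), 2.0] for i in range(10)]
train, test = chronological_split(windows)
means, stds = fit_scaler(train)
train_scaled = apply_scaler(train, means, stds)
test_scaled = apply_scaler(test, means, stds)
print(len(train), len(test), test_scaled[0])
```

Fitting the scaler on train windows only means test statistics never influence the features a head sees during training.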
|
|
| There are four reusable head families: |
|
|
| | Head family | Used by | What it means | |
| | --- | --- | --- | |
| | Linear softmax classifier | Action Recognition, Procedure Step Recognition, Action Boundary Detection, Next-Action Prediction, Contact State Prediction, Temporal Order Verification, Multimodal Synchronization Detection | z-score features, then `XW+b`, softmax, cross-entropy, L2 | |
| | Dual ridge regression/projection | Hand Trajectory Forecasting, Cross-Modal Reconstruction | z-score input/target, solve ridge regression with L2=10 | |
| | Ridge + cosine ranking | Language Grounding, Cross-Modal Retrieval | project one modality into another feature space, then rank candidates by cosine | |
| | Multi-label logistic regression | Object Relevance Prediction | z-score non-caption features, sigmoid object heads, threshold at 0.5 | |
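
The ranking half of the "Ridge + cosine ranking" family, together with the MRR metric reported for the retrieval tasks, can be sketched as follows. The sketch assumes the ridge projection has already mapped each query into the candidate feature space and that query `i` matches candidate `i`; all names and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_of_match(query, candidates, match_index):
    """1-based rank of the correct candidate under cosine similarity."""
    scores = [cosine(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order.index(match_index) + 1


def mean_reciprocal_rank(queries, candidates):
    """Average of 1/rank, assuming query i matches candidate i."""
    ranks = [rank_of_match(q, candidates, i) for i, q in enumerate(queries)]
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy projected queries and candidate windows in a 2-D feature space.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]
mrr = mean_reciprocal_rank(queries, candidates)
print(round(mrr, 4))
```

In the toy data the third query ranks its match second, so the MRR falls below 1.0 even though two of three queries are retrieved perfectly.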
|
|
| The optional neural run keeps the same window representation, leakage filters, |
| chronological splits, and metrics, but replaces the task heads with small |
| PyTorch MLP classifiers or regressors. Its outputs live under |
| [`results/episode_task_suite/neural_mlp/`](results/episode_task_suite/neural_mlp/), |
| and the rollup is stored in the `neural_tasks` section of |
| [`results/episode_task_suite/summary_report.json`](results/episode_task_suite/summary_report.json). |
|
|
| The task-specific heads are: |
|
|
| | Task | Input | Minimal head | Output | |
| | --- | --- | --- | --- | |
| | Action Recognition | all featurized modalities | linear softmax | current action class | |
| | Procedure Step Recognition | all featurized modalities | linear softmax | current subtask class | |
| | Action Boundary Detection | all featurized modalities | linear softmax | steady vs action boundary | |
| | Next-Action Prediction | all featurized modalities at `t` | linear softmax | action at `t+20` frames | |
| | Hand Trajectory Forecasting | all featurized modalities at `t` | ridge regression | future 10-frame left/right hand joints | |
| | Contact State Prediction | non-contact and non-caption signals | linear softmax | any body contact | |
| | Object Relevance Prediction | non-caption signals | multi-label logistic | relevant object set | |
| | Language Grounding | sensor windows projected to text space | ridge projection + cosine ranking | matching time window for text query | |
| | Cross-Modal Retrieval | motion/IMU/camera projected to visual space | ridge projection + cosine ranking | matching depth/video window | |
| | Cross-Modal Reconstruction | motion/IMU/camera | ridge regression | compressed depth/video target | |
| | Temporal Order Verification | `[x_t, x_{t+1}, x_{t+1} - x_t]` | binary linear softmax | correct vs reversed order | |
| | Multimodal Synchronization Detection | motion plus visual pair | binary linear softmax | aligned vs shifted by 8 windows | |
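
The Temporal Order Verification input concatenates window `t`, window `t+1`, and their difference. A minimal sketch of such a pair builder follows. It assumes that reversed negatives are produced by swapping the two windows, which is an illustrative choice rather than a confirmed detail of the suite; the function names are hypothetical.

```python
def pair_features(first, second):
    """Concatenate first, second, and (second - first) into one vector."""
    diff = [b - a for a, b in zip(first, second)]
    return list(first) + list(second) + diff


def build_order_pairs(windows):
    """Emit (features, label) with label 1 for correct order, 0 for reversed."""
    examples = []
    for t in range(len(windows) - 1):
        examples.append((pair_features(windows[t], windows[t + 1]), 1))
        examples.append((pair_features(windows[t + 1], windows[t]), 0))
    return examples


# Two toy adjacent windows with two features each.
toy_windows = [[1.0, 2.0], [3.0, 5.0]]
pairs = build_order_pairs(toy_windows)
print(pairs)
```

The difference term flips sign between the correct and reversed example, which gives even a linear head a direct cue about direction.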
|
|
| ## Key Results |
|
|
| | Experiment | Main score | Accuracy | Notes | |
| | --- | ---: | ---: | --- | |
| | Motion-only action | 0.9688 macro-F1 | 0.9828 | Uses motion/IMU features only | |
| | Current all-feature action | 0.9829 macro-F1 | 0.9863 | 8,546-dimensional multimodal representation | |
| | Motion-only subtask | 0.9528 macro-F1 | 0.9759 | Strong within-episode subtask signal | |
| | Current all-feature subtask | 0.9173 macro-F1 | 0.9828 | High accuracy, lower class-balanced score | |
| | Cross-modal retrieval | 0.3678 top-5 | n/a | Motion/IMU/camera/audio retrieves matching depth/video | |
| | Transition detection | 0.6118 macro-F1 | 0.9080 | Boundary F1 is 0.1250 | |
| | Hand trajectory forecast | 0.8647 MPJPE | n/a | Predicts future hand-joint trajectory | |
| | Neural MLP hand forecast | 0.1079 MPJPE | n/a | Same features/split, nonlinear regression head | |
| | Neural MLP temporal order | 0.8520 F1 | 0.8578 | Strong improvement on adjacent-window ordering | |
| | Neural MLP misalignment | 0.7153 F1 | 0.7009 | Detects shifted motion/visual/audio pairs better than the linear head | |
| | Audio ablation | +0.0418 mean delta | n/a | Current audio variant improves the primary metric on 6 of 12 task contracts | |
| | Alternate audio representation | +0.0936 mean delta | n/a | Alternate audio-window representation improves over the baseline audio variant on 6 of 12 task contracts | |
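
Rows such as "Current all-feature subtask" pair a high accuracy with a lower macro-F1. The gap arises because macro-F1 averages per-class F1 with equal weight, so mistakes on a rare class cost far more than they do under accuracy. The sketch below illustrates the effect on made-up labels; it is not the suite's metric code.

```python
def accuracy(y_true, y_pred):
    """Share of windows whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: nine "pour" windows, one rare "stir" window that is missed.
y_true = ["pour"] * 9 + ["stir"]
y_pred = ["pour"] * 10
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(acc, round(f1, 4))
```

Here one missed rare-class window leaves accuracy at 0.9 but cuts macro-F1 to under 0.5, which is why both columns are worth reading together.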
|
|
| ## Audio Contribution Study |
|
|
| The audio ablation keeps the same windows and task labels, then compares input |
| variants under the same chronological split. The script |
| [`scripts/audio_ablation_and_raw_upgrade.py`](scripts/audio_ablation_and_raw_upgrade.py) |
| reuses the real task-suite windows and evaluates six variants for |
| every task: current inputs, no audio, audio-only, alternate audio-only, audio |
| representation replacement, and all inputs plus the alternate audio representation. |
|
|
| The measured single-episode result is task-specific: |
|
|
| | Readout | Value | |
| | --- | ---: | |
| | Tasks where current audio improves the primary metric | 6 / 12 | |
| | Mean current-audio delta | +0.0418 | |
| | Tasks where alternate audio representation improves over baseline audio | 6 / 12 | |
| | Mean alternate-representation delta vs baseline audio | +0.0936 | |
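|
|
| The two readouts per comparison (win count and mean delta) could be derived |
| from per-task primary metrics roughly as follows. This is a sketch under |
| assumptions: the task names and scores are made up, and the sign flip for |
| error metrics such as MPJPE is one plausible convention that the repo may or |
| may not use. |
|

```python
from statistics import mean


def summarize_deltas(with_audio, without_audio, higher_is_better):
    """Return (wins, total, mean_delta) with deltas oriented so positive = better."""
    deltas = {}
    for task, score in with_audio.items():
        raw = score - without_audio[task]
        # Flip the sign for error metrics so that an improvement is always positive.
        deltas[task] = raw if higher_is_better[task] else -raw
    wins = sum(1 for d in deltas.values() if d > 0)
    return wins, len(deltas), mean(list(deltas.values()))


# Toy numbers for illustration only; these are not results from the repo.
with_audio = {"task_a": 0.60, "task_b": 0.50, "task_c": 0.20}
without_audio = {"task_a": 0.50, "task_b": 0.55, "task_c": 0.30}
higher_is_better = {"task_a": True, "task_b": True, "task_c": False}

wins, total, mean_delta = summarize_deltas(with_audio, without_audio, higher_is_better)
```

|
| A positive mean delta alongside a 6 / 12 win count means the gains on some |
| tasks outweigh the losses on others, which is why the per-task CSV matters |
| more than the mean. |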
|
|
| Full files: |
|
|
| - [`results/audio_ablation/AUDIO_ABLATION_SUMMARY.md`](results/audio_ablation/AUDIO_ABLATION_SUMMARY.md) |
| - [`results/audio_ablation/audio_ablation_metrics.csv`](results/audio_ablation/audio_ablation_metrics.csv) |
| - [`results/audio_ablation/audio_delta_summary.csv`](results/audio_ablation/audio_delta_summary.csv) |
| - [`docs/data/audio_ablation_summary.json`](docs/data/audio_ablation_summary.json) |
| - [`docs/assets/charts/audio_ablation_delta.svg`](docs/assets/charts/audio_ablation_delta.svg) |
|
|
| ## Neural MLP Results |
|
|
| The neural baseline was run locally with `--include-neural` on all 12 tasks, |
| using 80 epochs, hidden size 128, batch size 128, and CPU execution. It is not a |
| foundation-model result; it is a controlled comparison that swaps the linear head |
| for a nonlinear one over the same 8,546-dimensional multimodal representation. |
|
|
| | Task | Neural metric | Minimal metric | Readout | |
| | --- | ---: | ---: | --- | |
| | Action Recognition | 0.0148 macro-F1 | 0.0500 macro-F1 | Still blocked by unseen future classes | |
| | Procedure Step Recognition | 0.0281 macro-F1 | 0.0506 macro-F1 | Same single-episode split limitation | |
| | Action Boundary Detection | 0.5862 macro-F1 | 0.6118 macro-F1 | Similar to the linear baseline | |
| | Next-Action Prediction | 0.0419 macro-F1 | 0.0593 macro-F1 | Same unseen-label issue | |
| | Hand Trajectory Forecasting | 0.1079 MPJPE | 0.8647 MPJPE | Neural regression cuts the error roughly 8x (lower is better) | |
| | Contact State Prediction | 1.0000 macro-F1 | 1.0000 macro-F1 | Degenerate one-class sample | |
| | Object Relevance Prediction | 0.1679 micro-F1 | 0.1803 micro-F1 | Similar weak object signal | |
| | Language Grounding | 0.0168 MRR | 0.0160 MRR | Similar ranking behavior | |
| | Cross-Modal Retrieval | 0.1300 MRR | 0.2693 MRR | Linear ridge remains stronger here | |
| | Cross-Modal Reconstruction | -0.0102 R2 | -0.0153 R2 | Slightly less negative, but both stay below a predict-the-mean baseline (R2 = 0) | |
| | Temporal Order Verification | 0.8520 F1 | 0.5400 F1 | Neural head captures local temporal structure | |
| | Multimodal Synchronization Detection | 0.7153 F1 | 0.5052 F1 | Neural head improves alignment detection | |
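|
|
| For readers unfamiliar with MPJPE (mean per-joint position error), here is a |
| minimal sketch of the standard definition: the Euclidean distance between |
| predicted and target joints, averaged over all joints and frames. The data |
| layout is an assumption, and the units of the values in the table depend on |
| how the repo scales joint coordinates, which this sketch does not address. |
|

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance over every joint in every frame.

    pred and target are lists of frames; each frame is a list of (x, y, z) joints.
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)
            count += 1
    return total / count


# One frame with two joints: the first is exact, the second is off by 5 units.
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
target = [[(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]]
error = mpjpe(pred, target)  # (0 + 5) / 2
```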
|
|
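| With the minimal linear head, the strongest single-episode self-supervised signal |
| is cross-modal retrieval (0.2693 MRR): motion/IMU/camera/audio features retrieve |
| matching depth/video windows substantially better than random. With the neural |
| head, retrieval drops to 0.1300 MRR, while temporal order verification |
| (0.8520 F1) and synchronization detection (0.7153 F1) improve. |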
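|
|
| The retrieval scores quoted in this README are MRR and top-5. A minimal sketch |
| of both, assuming a square similarity matrix in which the correct candidate |
| for query `i` sits at column `i` (the layout and the tie handling are |
| assumptions, not the repo's code): |
|

```python
def retrieval_metrics(similarity, k=5):
    """Return (MRR, top-k accuracy) when the true match of query i is candidate i."""
    ranks = []
    for i, row in enumerate(similarity):
        # Rank is 1 plus the number of candidates scored strictly above the true match.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    top_k = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, top_k


# Toy 3x3 similarity matrix: the true matches land at ranks 1, 2, and 3.
similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.1],
    [0.3, 0.2, 0.1],
]
mrr, top2 = retrieval_metrics(similarity, k=2)
```

|
| With N candidate windows, a uniformly random ranker reaches a top-k accuracy |
| of about k/N, which is the reference point for "better than random". |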
|
|
| ## Single-Episode Diagnostics and Explorer |
|
|
| While waiting for broader Xperience-10M access, the repo now includes an |
| artifact-driven diagnostics pass over the public sample episode: |
|
|
| - `results/single_episode_diagnostics/object_labels/window_object_labels.csv` |
| exports 1,161 real window-level object-label sets from `annotation.hdf5`. |
| - `results/single_episode_diagnostics/modality_ablation/ablation_metrics.csv` |
| recomputes all 96 task/modality cells, including object relevance. |
| - `results/single_episode_diagnostics/timeline_overlay/timeline_overlay.csv` |
| aligns 2,079 existing prediction rows back to the episode timeline. |
| - `results/single_episode_diagnostics/alignment_stress/alignment_shift_metrics.csv` |
| evaluates cross-modal retrieval under explicit time shifts. |
| - `docs/single_episode_explorer.html` is a static interactive page for |
| inspecting window labels, objects, predictions, modality statistics, and |
| diagnostic scores. |
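|
|
| One way to read the alignment stress test is as retrieval re-scored after one |
| stream is offset in time. The sketch below is an assumption about the general |
| idea, not the repo's implementation: it shifts the candidate stream by a |
| non-negative number of windows and measures nearest-neighbour top-1 accuracy |
| on toy one-dimensional features. |
|

```python
def top1_under_shift(queries, candidates, shift):
    """Top-1 accuracy after the candidate stream is advanced by `shift` windows."""
    if shift < 0:
        raise ValueError("this sketch only handles non-negative shifts")
    shifted = candidates[shift:]      # position i now holds original window i + shift
    kept = queries[: len(shifted)]    # drop queries that no longer have a partner
    if not kept:
        raise ValueError("shift leaves no overlapping windows")
    hits = 0
    for i, query in enumerate(kept):
        # Squared Euclidean distance from this query to every shifted candidate.
        dists = [sum((a - b) ** 2 for a, b in zip(query, cand)) for cand in shifted]
        best = min(range(len(dists)), key=dists.__getitem__)
        hits += int(best == i)
    return hits / len(kept)


# Toy features that drift steadily over time, one value per window.
stream = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
aligned = top1_under_shift(stream, stream, shift=0)
misaligned = top1_under_shift(stream, stream, shift=1)
```

|
| If accuracy falls as the shift grows, the features carry time-specific |
| information; if it stays flat, retrieval is likely keying on episode-level |
| statistics instead. |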
|
|
| These are single-episode research diagnostics. They are useful for studying |
| task definitions, feature behavior, and model errors before scaling to more |
| episodes; they are not reported as multi-episode benchmark results. |
|
|
| ## Reproducibility Check |
|
|
| I re-ran the full pipeline from the local raw public sample into a temporary |
| local workspace and compared regenerated metrics with the committed |
| artifacts. The baseline metrics, 12 task metrics, feature manifest, and |
| available modality manifest matched exactly after float normalization. |
|
|
| See [`notes/reproducibility_audit.md`](notes/reproducibility_audit.md) for the |
| commands and verification evidence. |
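|
|
| A comparison "after float normalization" can be sketched as rounding every |
| float in two nested metric structures before testing equality. The rounding |
| precision, the helper names, and the example keys are assumptions for |
| illustration; the audit note records what was actually run. |
|

```python
def normalize(value, digits=6):
    """Round every float inside nested dicts/lists so tiny numeric noise is ignored."""
    if isinstance(value, bool):
        return value  # bool is handled explicitly so it is never treated as a number
    if isinstance(value, float):
        return round(value, digits)
    if isinstance(value, dict):
        return {key: normalize(item, digits) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize(item, digits) for item in value]
    return value


def metrics_match(committed, regenerated, digits=6):
    """True when both structures are equal after float normalization."""
    return normalize(committed, digits) == normalize(regenerated, digits)


# Hypothetical metric files loaded as dicts; the key names are illustrative.
committed = {"action": {"macro_f1": 0.9829, "classes": [1, 2, 3]}}
regenerated = {"action": {"macro_f1": 0.98290000000001, "classes": [1, 2, 3]}}
same = metrics_match(committed, regenerated)
```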
|
|
| ## Why Some Scores Are Low |
|
|
| The task suite intentionally uses a chronological split: |
|
|
| ```text |
| first 70% of the episode -> train |
| last 30% of the episode -> test |
| ``` |
|
|
| The test segment contains some action/subtask labels never seen during training, |
| so a classifier fit on the first 70% cannot predict them. The timeline and |
| next-action classifiers therefore score at or below about 0.06 macro-F1, which |
| exposes the core limitation of single-episode learning instead of hiding it |
| behind random splits. |
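|
|
| The effect of the chronological split can be checked directly by listing the |
| test labels that never occur in the training segment. This is a minimal |
| sketch: the `(timestamp, label)` window layout and the label names are |
| assumptions, and integer arithmetic is used for the 70% cut to avoid |
| floating-point rounding. |
|

```python
def chronological_split(windows, train_percent=70):
    """Split time-ordered windows into an early train part and a late test part."""
    cut = len(windows) * train_percent // 100
    return windows[:cut], windows[cut:]


def unseen_test_labels(train, test):
    """Labels that appear in the test segment but never in the train segment."""
    seen = {label for _, label in train}
    return sorted({label for _, label in test} - seen)


# Toy episode: ten windows in time order with hypothetical action labels.
labels = ["reach"] * 4 + ["grasp"] * 3 + ["pour"] * 3
windows = [(t, label) for t, label in enumerate(labels)]

train, test = chronological_split(windows)
unseen = unseen_test_labels(train, test)
```

|
| Here every test window carries a label the model never trained on, so no |
| classifier can score above zero on it; a random split would hide this by |
| mixing late windows into training. |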
|
|
| ## Modalities Used |
|
|
| The current public-sample pipeline uses: |
|
|
| - hand/body mocap joints and contact labels, |
| - camera translation and rotation, |
| - IMU acceleration and gyroscope traces, |
| - depth confidence features, |
| - six video streams, |
| - audio from the sample MP4 stream, |
| - caption/object/interaction text features, |
| - SLAM point-cloud summary features, |
| - calibration parameters. |
|
|
| The full technical source manifest is stored in |
| [`results/episode_task_suite/feature_manifest.json`](results/episode_task_suite/feature_manifest.json). |
|
|
| ## Data Notice |
|
|
| Xperience-10M data belongs to its original authors and is subject to the |
| official Ropedia dataset license and access terms. This repo contains code and |
| derived experiment artifacts only; it does not redistribute the raw videos or |
| raw annotation dataset. |
|
|