| --- |
| license: mit |
| library_name: pytorch |
| tags: |
| - embodied-ai |
| - robotics |
| - multimodal |
| - xperience-10m |
| - baseline |
| - evaluation |
| - qwen3-omni |
| - cosmos |
| datasets: |
| - ropedia-ai/xperience-10m-sample |
| - ropedia-ai/xperience-10m |
| metrics: |
| - accuracy |
| - f1 |
| - precision |
| - recall |
| --- |
| |
| # Ropedia Xperience-10M Task Suite |
|
|
|
|
| <p align="center"> |
| <img src="docs/assets/brand/xperience10m-logo-social-card.png" alt="Ropedia Xperience-10M Task Suite logo card" width="760"> |
| </p> |
|
|
| This is a research and development project built on the public Xperience-10M |
| sample episode released by Ropedia. The goal is to make one richly multimodal |
| egocentric episode understandable, turn it into concrete embodied-AI task |
| definitions, and prepare the same pipeline for future held-out multi-episode |
| training. |
|
|
| The central research questions are: |
|
|
| - What can be learned from one aligned Xperience-10M episode while separating |
| sample-specific observations from later multi-episode questions? |
| - Which input/output tasks are meaningful for embodied AI when video, depth, |
| pose, mocap, IMU, and language annotations are synchronized? |
| - What baseline models and evaluation files should exist before scaling to |
| Qwen3-Omni or other multimodal foundation-model fine-tuning? |
|
|
| ## Why This Project Exists |
|
|
| This project is organized as a compact research artifact around Xperience-10M: |
| start from a real public episode, make every modality and label path inspectable, |
| turn the data into concrete embodied-AI tasks, and keep the evaluation boundary |
| clear while preparing the next multi-episode experiments. The emphasis is on |
| research judgment as much as implementation: what the sample can show, what it |
| cannot show, and what evidence should exist before claiming model quality. |
|
|
| The work is designed to demonstrate four capabilities that matter for |
| embodied-AI research infrastructure: |
|
|
| | Capability | What this project shows | |
| | --- | --- | |
| | Multimodal data understanding | Parses the public sample into synchronized windows across video, audio, depth, pose/SLAM, mocap, IMU, calibration, and language-derived signals | |
| | Task design | Defines 20 human-readable tasks in one unified public-sample suite, plus four direction-extension probes with inputs, outputs, process modules, metrics, and case-study walkthroughs | |
| | Model and evaluation discipline | Runs minimal and compact neural baselines, records predictions/metrics, keeps chronological split boundaries explicit, and separates sample evidence from held-out claims | |
| | Scale-up planning | Connects the public-sample pipeline to 32/128-episode held-out pilots, Qwen3-Omni LoRA, Cosmos-style world-model branches, policy-model branches, and the future Xperience-native foundation-model pretraining goal | |
|
|
| ## Start Here |
|
|
| For a first pass, use [`PROJECT_BRIEF.md`](PROJECT_BRIEF.md) or the |
| machine-readable [`docs/data/project_brief.json`](docs/data/project_brief.json). |
| They give the project shape in one page: what exists now, what the public |
| sample can support, where the 20 tasks and baselines live, and how the verified |
| 128-episode baseline, Qwen3-Omni, Cosmos3-Nano, and Cosmos3-Super branches |
| should be compared. |
|
|
| | Reader goal | Best entry point | |
| | --- | --- | |
| | Understand the whole project quickly | [`PROJECT_BRIEF.md`](PROJECT_BRIEF.md) | |
| | See the visual research dashboard | [GitHub Pages dashboard](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/) | |
| | Navigate the unified 20 tasks, four tracks, and scale-up plan | [Interactive research roadmap](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/research_roadmap.html), [`TASK_SUITE_20.md`](TASK_SUITE_20.md), [`docs/data/task_suite_20.json`](docs/data/task_suite_20.json), [`docs/data/research_roadmap_interactive.json`](docs/data/research_roadmap_interactive.json) | |
| | Compare current task metrics | [`RESEARCH_TAKEAWAYS.md`](RESEARCH_TAKEAWAYS.md), [`docs/data/summary_metrics.json`](docs/data/summary_metrics.json) | |
| | Compare possible foundation backbones | [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md), [`docs/data/foundation_model_plan.json`](docs/data/foundation_model_plan.json) | |
| | Understand the future native pretraining goal | [`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md) | |
| | See additional concrete project directions | [`ADDITIONAL_DEVELOPMENT_DIRECTIONS.md`](ADDITIONAL_DEVELOPMENT_DIRECTIONS.md), [`docs/data/additional_development_directions.json`](docs/data/additional_development_directions.json) | |
| | Understand one model input | [`results/episode_task_suite/feature_manifest.json`](results/episode_task_suite/feature_manifest.json), [`results/episode_task_suite/windows.csv`](results/episode_task_suite/windows.csv) | |
| | Check multi-episode data status | [`results/omni_finetune/DATA_ACCESS_STATUS.md`](results/omni_finetune/DATA_ACCESS_STATUS.md) | |
|
|
| Public release checks are exposed as JSON for mirrors and dashboards: |
| [`docs/data/website_integrity.json`](docs/data/website_integrity.json), |
| [`docs/data/rendered_site_check.json`](docs/data/rendered_site_check.json), |
| [`docs/data/task_surface_integrity.json`](docs/data/task_surface_integrity.json), |
| [`docs/data/publication_audit.json`](docs/data/publication_audit.json), |
| [`docs/data/mirror_parity.json`](docs/data/mirror_parity.json), |
| [`docs/data/public_surface_qa.json`](docs/data/public_surface_qa.json), and |
| [`docs/data/research_roadmap.json`](docs/data/research_roadmap.json). |
|
|
| ## Research Project Overview |
|
|
| | Theme | Current implementation | |
| | --- | --- | |
| | Dataset slice | One public Xperience-10M sample episode, 5,821 frames, 1,161 windows, and an 8,546-dimensional representation | |
| | Modalities | Video, audio, depth, camera pose/SLAM, hand/body mocap, IMU, calibration, and language annotations | |
| | Task suite | 20 human-readable tasks form one embodied-AI public-sample suite; tasks 1-12 are the original contracts and tasks 13-20 reuse the same windows, split discipline, and minimal/neural head pattern | |
| | Baselines | Minimal linear/ridge/logistic heads plus compact PyTorch MLP task heads over the same chronological split; companion simple/NN metadata baselines are also aligned to the selected 128-episode 96/16/16 split | |
| | Research directions | Task mapping and extension probes for human modeling, 3D/4D reconstruction, egocentric interaction, and world modeling | |
| | Scale-up path | The selected-episode Qwen3-Omni LoRA final diagnostic result is verified on the 96/16/16 split; same-split simple/NN metadata baselines now cover the 12 task ids as a companion comparison. The Qwen result proves the multi-episode export/train/eval/package loop and meets the strict-JSON target, but weak action/subtask metrics make it a baseline for error analysis rather than a strong model. Cosmos3 now has three verified diagnostics: Nano future-window compatibility, Super base-weight Reasoner evaluation, and Super forward-dynamics LoRA fine-tuning over camera-pose proxy targets. | |
| | Public surfaces | GitHub repo, GitHub Pages dashboard, GHCR static-site package, HF Space, HF artifact dataset, HF baseline-model repo, and HF collection | |
|
|
| For the fastest interpretation of the current metrics, start with |
| [`RESEARCH_TAKEAWAYS.md`](RESEARCH_TAKEAWAYS.md) and |
| [`docs/data/research_takeaways.json`](docs/data/research_takeaways.json). |
| They summarize what the public sample results actually show: class shift under |
| chronological splits, neural gains on dynamics/order/alignment, harder |
| retrieval/reconstruction probes, and why the next model-quality step needs |
| held-out episodes. |
|
|
| Current contributions: |
|
|
| - manifested sliding-window features over the currently extracted modalities, |
| - motion-only and current all-feature baseline models, |
| - 20 end-to-end episode-level task contracts, |
| - tasks 13-20 aligned to the same 20-frame windows and chronological split as tasks 1-12, |
| - lightweight neural MLP heads for the same task contracts, |
| - a generated four-direction research taxonomy matching the Ropedia job tracks, |
| - four additional direction-extension probes with minimal and neural baselines, |
| - human-readable research task cards and an interactive scrub/play walkthrough storyboard for every task, |
| - an interactive research roadmap connecting 20 tasks, four research tracks, current sample evidence, the Qwen3-Omni scale-up path, and foundation-model branch selection, |
| - a next-milestone track for Qwen3-Omni fine-tuning, Cosmos 3 world modeling, and sensor-bridge evaluation, |
| - a future pretraining plan for an Xperience Embodied Foundation Model over the full corpus after smaller multi-episode stages prove value, |
| - metrics, predictions, model weights, manifests, charts, and a two-level |
| tabbed static research website, |
| - a clear explanation of what is implemented now and what moves to the multi-episode stage. |
|
|
| ## Current Research Scope |
|
|
| This project is best read as a staged embodied-AI research study: |
|
|
| | Layer | Current scope | Where to start | |
| | --- | --- | --- | |
| | Data understanding | One public Xperience-10M sample episode is converted into 5,821 frames, 1,161 aligned windows, and an 8,546-dimensional multimodal representation. | [`PROJECT_BRIEF.md`](PROJECT_BRIEF.md), [`PROJECT_STATUS.md`](PROJECT_STATUS.md) | |
| | Task suite | Twenty human-readable tasks cover action, procedure, contact, object, language, retrieval, reconstruction, order, synchronization, long-horizon forecasting, interaction text, action-object binding, sensor bridging, camera sync, and transition timing. Tasks 13-20 live under the historical `tier2_task_suite` artifact path for link stability, but they are part of the same suite. | [`TASK_SUITE_20.md`](TASK_SUITE_20.md), [`docs/data/task_suite_20.json`](docs/data/task_suite_20.json), [`RESEARCH_TAKEAWAYS.md`](RESEARCH_TAKEAWAYS.md), [`results/episode_task_suite/summary_report.json`](results/episode_task_suite/summary_report.json), [`results/episode_task_suite/tier2_task_suite/TIER2_TASK_BASELINES.md`](results/episode_task_suite/tier2_task_suite/TIER2_TASK_BASELINES.md) | |
| | Baselines | Minimal heads and compact PyTorch MLP heads provide a first controlled comparison on the same chronological split; the selected 128-episode setup also has same-split simple/NN metadata baselines for JSON-supported tasks and raw-feature simple/NN baselines on all 20 task axes, with tasks 15 and 19 explicitly marked as compact-proxy completions. | [`results/episode_task_suite/neural_mlp/`](results/episode_task_suite/neural_mlp/), [`results/omni_finetune/multi_episode_128_task_baselines/BASELINE_ALIGNMENT_REPORT.md`](results/omni_finetune/multi_episode_128_task_baselines/BASELINE_ALIGNMENT_REPORT.md), [`results/omni_finetune/a100_128_raw20_task_baselines_complete20_proxy_20260616T091500Z/run_summary_all.json`](results/omni_finetune/a100_128_raw20_task_baselines_complete20_proxy_20260616T091500Z/run_summary_all.json) | |
| | Diagnostics | Audio contribution, modality ablations, timeline overlays, object labels, and alignment stress tests show which signals are useful and which tasks remain hard. | [`results/audio_ablation/AUDIO_ABLATION_SUMMARY.md`](results/audio_ablation/AUDIO_ABLATION_SUMMARY.md), [`docs/single_episode_explorer.html`](docs/single_episode_explorer.html) | |
| | Scale-up | The selected 128-episode Qwen3-Omni LoRA diagnostic path now has v6 as its latest verified held-out package: 96/16/16 selected episodes, 34,269 exported windows, 4,032 held-out test predictions, and public-safe metrics/predictions. v6 improves action macro-F1/contact accuracy versus v5, while v5 remains the pinned prior release row because it is stronger on several other metrics. Same-split simple/NN metadata baselines are published for JSON-supported axes, and the raw-feature run now adds simple/NN baselines on 20/20 task axes; tasks 15 and 19 are documented compact proxies because raw interaction strings and paired video-view embeddings are absent from the 128 export. Cosmos3-Nano has a verified future-window compatibility package. Cosmos3-Super now has two verified branches: a 448-window base-weight JSON-task Reasoner evaluation and a fine-tuned forward-dynamics LoRA package over camera-pose proxy targets with 2,848 train rows, 512 val rows, and 448 test rows. The 128-episode enhancement pack records the no-new-episode path: dense-window sizing, hierarchical action/subtask targets, task bottlenecks, and experiment cards for the next Qwen/Cosmos/policy pushes without overwriting existing results. | [`RESEARCH_ROADMAP.md`](RESEARCH_ROADMAP.md), [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md), [`TASK_SUITE_ENHANCEMENT_128.md`](TASK_SUITE_ENHANCEMENT_128.md), [`docs/data/task_suite_enhancement_128.json`](docs/data/task_suite_enhancement_128.json), [`docs/data/omni_model_comparison.json`](docs/data/omni_model_comparison.json), [`docs/data/omni_finetune_verified_result.json`](docs/data/omni_finetune_verified_result.json), [`docs/data/qwen3_v5_v6_comparison.json`](docs/data/qwen3_v5_v6_comparison.json), [`results/omni_finetune/QWEN3_V5_V6_COMPARISON_20260614.md`](results/omni_finetune/QWEN3_V5_V6_COMPARISON_20260614.md), [`results/omni_finetune/OMNI_MODEL_COMPARISON.md`](results/omni_finetune/OMNI_MODEL_COMPARISON.md), [`results/omni_finetune/verified_public/`](results/omni_finetune/verified_public/), [`results/omni_finetune/task_suite_enhancement_128_v1_20260608/`](results/omni_finetune/task_suite_enhancement_128_v1_20260608/) | |
|
|
| Detailed dataset notes, reproduction checks, and generated JSON reports are |
| included for readers who want to inspect the implementation, but they are |
| supporting materials rather than the main reading path. Use |
| [`ARTIFACT_GUIDE.md`](ARTIFACT_GUIDE.md) when you want the full file map. |
|
|
| Source alignment is tracked in [`SOURCE_ALIGNMENT_AUDIT.md`](SOURCE_ALIGNMENT_AUDIT.md) |
| and [`docs/data/source_alignment_audit.json`](docs/data/source_alignment_audit.json). |
| The official gated `ropedia-ai/xperience-10m` card reports `31.9 TB` on the |
| live HF surface and states an `about-1PB` full-scale storage size. The |
| committed API-listing snapshot records `12,103 episode folders` as upstream |
| `metadata only`: it lists what exists upstream and is not an inventory of raw |
| data held locally by this project. The public sample remains |
| `ropedia-ai/xperience-10m-sample` under `cc-by-nc-4.0`, with the `HOMIE Toolkit` |
| and `Rerun 0.29.0` noted as source tooling. The official responsible-use note |
| that the data is `limited in diversity` is preserved. |
|
|
| ## Project Status |
|
|
| If you only have one minute, use |
| [`PROJECT_STATUS.md`](PROJECT_STATUS.md) and |
| [`docs/data/project_status.json`](docs/data/project_status.json). |
| They give the current research state in one compact table: |
|
|
| | Area | Current decision | |
| | --- | --- | |
| | Public-sample pipeline | Verified on one public sample episode: 5,821 frames, 1,161 windows, 8,546 dimensions | |
| | 20-task suite | Verified minimal baselines with committed metrics, predictions, and manifests | |
| | Neural heads | Verified compact PyTorch MLP heads over the same task contracts and chronological splits | |
| | Dataset context | Official Xperience-10M links, sample-vs-gated-data boundary, modality coverage, and redistribution policy are documented | |
| | Evaluation protocol | Verified generated protocol for windowing, split policy, leakage controls, and per-task metrics | |
| | Website and Hub pages | Public dashboard, Hugging Face Space, artifact dataset, baseline model repo, and collection use the same project framing and links | |
| | Qwen3-Omni multi-episode pilot | Final verified diagnostic result package exists for the selected 96/16/16 episode split; JSON validity meets the target, while action/subtask metrics remain weak | |
| | Raw Xperience-10M data / full Qwen weights | Not redistributed | |
|
|
| ## 90-Second Research Project Path |
|
|
| If you are reading the project cold, open these in order: |
|
|
| | Step | Question | Primary artifacts | What should be true | |
| | --- | --- | --- | --- | |
| | 1 | What is this project? | [`PROJECT_BRIEF.md`](PROJECT_BRIEF.md), [`PROJECT_STATUS.md`](PROJECT_STATUS.md), [dashboard](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/) | A public-sample Xperience-10M research project with 20 tasks, baselines, and a scale-up plan. | |
| | 2 | What data is used? | [`XPERIENCE10M_DATASET_CARD_ALIGNMENT.md`](XPERIENCE10M_DATASET_CARD_ALIGNMENT.md), [official HF dataset](https://huggingface.co/datasets/ropedia-ai/xperience-10m), [sample HF dataset](https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample) | The implemented suite uses one public sample episode; the gated dataset is reserved for selected multi-episode training. | |
| | 3 | What does one model input contain? | [`windows.csv`](results/episode_task_suite/windows.csv), [`feature_manifest.json`](results/episode_task_suite/feature_manifest.json), [`available_modalities.json`](results/episode_task_suite/available_modalities.json) | Each window is an aligned multimodal unit with video, audio, depth, pose/SLAM, mocap, IMU, calibration, and language-derived signals. | |
| | 4 | What are the 20 tasks? | [`TASK_SUITE_20.md`](TASK_SUITE_20.md), [`docs/data/task_suite_20.json`](docs/data/task_suite_20.json), [`results/episode_task_suite/task_walkthroughs/`](results/episode_task_suite/task_walkthroughs/), [`docs/data/task_walkthroughs.json`](docs/data/task_walkthroughs.json) | Every task has a human-readable name, input, output, metric, baseline scores, and an explicit artifact path. | |
| | 5 | How are tasks evaluated? | [`EVALUATION_PROTOCOL.md`](EVALUATION_PROTOCOL.md), [`docs/data/evaluation_protocol.json`](docs/data/evaluation_protocol.json) | The window unit, chronological split, leakage controls, task metrics, and current limitations are explicit. | |
| | 6 | What do the current results mean? | [`RESEARCH_TAKEAWAYS.md`](RESEARCH_TAKEAWAYS.md), [`docs/data/research_takeaways.json`](docs/data/research_takeaways.json), [`docs/data/summary_metrics.json`](docs/data/summary_metrics.json) | Current metrics describe sample-level task behavior and identify which signals need larger held-out experiments. | |
| | 7 | Which models are implemented? | [`results/episode_task_suite/summary_report.json`](results/episode_task_suite/summary_report.json), [`results/episode_task_suite/neural_mlp/`](results/episode_task_suite/neural_mlp/), [HF baseline repo](https://huggingface.co/cy0307/ropedia-xperience-10m-task-baselines) | Each task has minimal and neural-head evidence over the same feature windows. | |
| | 8 | What research directions does this support? | [`RESEARCH_ROADMAP.md`](RESEARCH_ROADMAP.md), [`docs/data/research_directions.json`](docs/data/research_directions.json), [`docs/data/research_direction_extensions.json`](docs/data/research_direction_extensions.json), [`docs/data/task_suite_20.json`](docs/data/task_suite_20.json) | The unified tasks are mapped to human modeling, 3D/4D reconstruction, egocentric interaction, and world modeling. | |
| | 9 | Which foundation model comes next? | [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md), [`docs/data/foundation_model_plan.json`](docs/data/foundation_model_plan.json), [`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md) | Qwen3-Omni is the first held-out LoRA baseline; Cosmos 3 is now represented by Nano future-window compatibility and Super forward-dynamics LoRA; policy models wait for robot-compatible action targets; Xperience-native pretraining is the full-corpus future goal. | |
| | 10 | How can the 128-episode suite be pushed without more data? | [`TASK_SUITE_ENHANCEMENT_128.md`](TASK_SUITE_ENHANCEMENT_128.md), [`docs/data/task_suite_enhancement_128.json`](docs/data/task_suite_enhancement_128.json) | The enhancement pack proposes dense windows, hierarchical action/subtask labels, raw-feature shard priorities, and `multiscale_20s10_40s20_80s40` as the next export target. | |
| | 11 | How do I reproduce it? | [`REPRODUCIBILITY.md`](REPRODUCIBILITY.md), [`notes/reproducibility_audit.md`](notes/reproducibility_audit.md) | Public commands and expected outputs are documented for the sample-episode task suite. | |
| | 12 | What is still pending? | [`docs/data/omni_finetune_verified_result.json`](docs/data/omni_finetune_verified_result.json), [`DATA_ACCESS_STATUS.md`](results/omni_finetune/DATA_ACCESS_STATUS.md), [`MULTI_EPISODE_ACCESS_STATUS.md`](results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md) | The final held-out diagnostic Qwen pass is verified and the JSON-validity target is met; strong action/subtask model quality remains pending. | |
|
|
| A compact reader-path summary is available at |
| [`docs/data/project_packet.json`](docs/data/project_packet.json). |
|
|
| ## Supporting Files |
|
|
| [`ARTIFACT_GUIDE.md`](ARTIFACT_GUIDE.md) is the human-readable map for readers |
| who want to inspect the project files after the first pass. It groups the main |
| briefs, task outputs, baseline results, visual assets, data notes, and |
| scale-up documents. |
|
|
| [`docs/data/artifact_index.json`](docs/data/artifact_index.json) is the compact |
| machine-readable companion used by the website and Hugging Face artifact |
| dataset. |
|
|
| ## Evaluation Protocol |
|
|
| [`EVALUATION_PROTOCOL.md`](EVALUATION_PROTOCOL.md) and |
| [`docs/data/evaluation_protocol.json`](docs/data/evaluation_protocol.json) are |
| generated from committed metric artifacts. They define: |
|
|
| - the 20-frame window unit, stride, feature dimension, and raw-data policy, |
| - the chronological 70/30 single-episode split and its generalization limit, |
| - the per-task input, target, primary metric, minimal score, and neural score, |
| - leakage controls for future labels, target-side signals, caption/object |
| labels, and train-only normalization, |
| - current limitations, including cross-episode generalization, |
| audio-visual learning, pixel-depth reconstruction, and real held-out |
| multi-episode Qwen3-Omni quality. |
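
The window unit, chronological split, and train-only normalization listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the stride of 5 frames is not given in this README and is chosen here only because it reproduces the reported 1,161 windows from 5,821 frames; the function names are hypothetical and the protocol files remain the authoritative definition.

```python
from statistics import mean, pstdev


def window_starts(n_frames, window=20, stride=5):
    """Start indices of all full sliding windows (stride=5 is an assumption)."""
    if n_frames < window:
        return []
    return list(range(0, n_frames - window + 1, stride))


def chronological_split(n_items, train_fraction=0.7):
    """Earlier items go to train, later items to test; no shuffling."""
    cut = int(n_items * train_fraction)
    return list(range(cut)), list(range(cut, n_items))


def fit_standardizer(train_values):
    """Estimate mean and standard deviation from training values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against constant features
    return mu, sigma


def standardize(values, mu, sigma):
    """Apply statistics fitted on train to any partition."""
    return [(v - mu) / sigma for v in values]


starts = window_starts(5821)
train_idx, test_idx = chronological_split(len(starts))
print(len(starts), len(train_idx), len(test_idx))  # 1161 812 349
```

With overlapping windows, the last training windows and the first test windows share frames under this simple cut. Dropping the windows that straddle the boundary is one way to tighten the split, at the cost of a few samples.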
|
|
| ## Dataset Context |
|
|
| The official [`ropedia-ai/xperience-10m`](https://huggingface.co/datasets/ropedia-ai/xperience-10m) |
| dataset is a gated large-scale egocentric multimodal dataset for embodied AI, |
| robotics, spatial intelligence, and world modeling. The public |
| [`ropedia-ai/xperience-10m-sample`](https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample) |
| repo provides the sample episode used for the implemented task suite here. |
|
|
| This project keeps those layers separate: the public sample supports the |
| current 20-task study, while the gated full dataset is used only for the |
| selected multi-episode Qwen3-Omni pilot. Raw Xperience-10M MP4/HDF5/RRD files |
| are not redistributed in this repo or in the Hugging Face mirrors. |
|
|
| The current verified public-sample subset is: |
|
|
| - one public sample episode, 5,821 frames, and 1,161 aligned windows, |
| - raw sample files with six MP4 video streams and audio streams, |
| - `annotation.hdf5` carrying depth, SLAM/camera pose, hand/body mocap, IMU, |
| language/caption annotations, calibration, metadata, and timing records, |
| - an 8,546-dimensional baseline representation using video, audio, depth, |
| pose/SLAM, mocap, IMU, calibration, and language-derived signals. |
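
Before modeling, it can help to look at the committed window table and feature manifest directly. The sketch below makes no assumption about their schema: it only prints the CSV header and the top-level JSON structure. The helper names `csv_header` and `json_top_level` are hypothetical, and the paths assume the script runs from the repository root.

```python
import csv
import json
from pathlib import Path


def csv_header(path):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


def json_top_level(path):
    """Return sorted keys for a JSON object, or the length of a JSON array."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if isinstance(data, dict):
        return sorted(data)
    if isinstance(data, list):
        return len(data)
    return type(data).__name__


root = Path("results/episode_task_suite")
for name, describe in (
    ("windows.csv", csv_header),
    ("feature_manifest.json", json_top_level),
):
    path = root / name
    if path.exists():
        print(name, "->", describe(path))
    else:
        print(name, "not found; run from the repository root")
```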
|
|
| Detailed dataset notes are available in |
| [`XPERIENCE10M_DATASET_CARD_ALIGNMENT.md`](XPERIENCE10M_DATASET_CARD_ALIGNMENT.md) |
| and [`docs/data/xperience10m_dataset_card_alignment.json`](docs/data/xperience10m_dataset_card_alignment.json) |
| for readers who need the full upstream-card and access-term context. The |
| practical boundary is simple: current task-suite results come from the public |
| sample, and the first multi-episode Qwen3-Omni diagnostic pilot is verified but |
| does not yet demonstrate strong model quality. |
|
|
| Start with the visual dashboard: |
|
|
| **[chaoyue0307.github.io/ropedia-xperience-10m-task-suite](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/)** |
|
|
| Hugging Face Space app: |
|
|
| **[cy0307-ropedia-xperience-10m-task-suite.static.hf.space](https://cy0307-ropedia-xperience-10m-task-suite.static.hf.space/)** |
|
|
| ## Read This Project In Three Layers |
|
|
| | Layer | What to inspect | Why it matters | |
| | --- | --- | --- | |
| | Project status | `PROJECT_STATUS.md`, `docs/data/project_status.json` | Gives a one-table current project summary before reading the full artifact trail | |
| | Data contract | `windows.csv`, `feature_manifest.json`, modality manifests | Confirms what each sample window contains before modeling | |
| | Dataset context | `XPERIENCE10M_DATASET_CARD_ALIGNMENT.md`, official dataset links | Explains the official dataset, public sample, modalities, access boundary, and what this repo uses | |
| | Visual assets | `FIGURE_INDEX.md`, `docs/assets/` | Shows the task-suite graphic, modality thumbnails, pipeline diagrams, charts, and logo assets | |
| | Evaluation protocol | `EVALUATION_PROTOCOL.md`, `docs/data/evaluation_protocol.json` | Defines the task unit, split, metrics, leakage controls, and current limitations | |
| | Research roadmap | `RESEARCH_ROADMAP.md`, `docs/data/research_roadmap.json` | Shows the path from sample-level task development to multi-episode work, larger model branches, and the future native-pretraining goal | |
| | Additional development directions | `ADDITIONAL_DEVELOPMENT_DIRECTIONS.md`, `docs/data/additional_development_directions.json` | Records concrete non-backbone tracks: taxonomy, benchmark protocol, representation learning, skill graphs, affordances, 3D/4D memory, QA, and policy transfer | |
| | Xperience Embodied Foundation Model plan | `XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md` | Describes the long-term full-corpus pretraining goal, target modules, objectives, staged scale-up, hardware ranges, and evaluation protocol | |
| | Minimal heads | softmax, ridge projection/regression, multi-label logistic heads | Keeps every input/output contract visible and inspectable | |
| | Neural heads | PyTorch MLP classifiers/regressors under `neural_mlp/` | Checks whether nonlinear heads improve each task without changing features | |
| | Evidence | metrics, predictions, confusion matrices, diagrams, dashboard | Makes the single-episode task development inspectable without rerunning first | |
| | Artifact guide | `ARTIFACT_GUIDE.md` | Groups the public evidence into research-project layers after the first-pass overview | |
| | Reproducibility contract | `REPRODUCIBILITY.md`, `docs/data/reproducibility_matrix.json` | States public commands, expected outputs, exact-match reproduction evidence, and non-reproducible boundaries | |
| | Citation metadata | `CITATION.cff`, `codemeta.json`, `LICENSE` | Makes the repo easier to cite, index, and reuse without confusing code license and dataset terms | |
|
|
| ## Links |
|
|
| | Resource | Link | |
| | --- | --- | |
| | This GitHub repo | [github.com/ChaoYue0307/ropedia-xperience-10m-task-suite](https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite) | |
| | This project website | [chaoyue0307.github.io/ropedia-xperience-10m-task-suite](https://chaoyue0307.github.io/ropedia-xperience-10m-task-suite/) | |
| | This Hugging Face Space | [huggingface.co/spaces/cy0307/ropedia-xperience-10m-task-suite](https://huggingface.co/spaces/cy0307/ropedia-xperience-10m-task-suite) | |
| | Live Hugging Face static app | [cy0307-ropedia-xperience-10m-task-suite.static.hf.space](https://cy0307-ropedia-xperience-10m-task-suite.static.hf.space/) | |
| | GitHub Container package | [ghcr.io/chaoyue0307/ropedia-xperience-10m-task-suite](https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite/pkgs/container/ropedia-xperience-10m-task-suite) | |
| | Derived artifacts on Hugging Face | [huggingface.co/datasets/cy0307/ropedia-xperience-10m-task-suite-artifacts](https://huggingface.co/datasets/cy0307/ropedia-xperience-10m-task-suite-artifacts) | |
| | Minimal and neural task baselines on Hugging Face | [huggingface.co/cy0307/ropedia-xperience-10m-task-baselines](https://huggingface.co/cy0307/ropedia-xperience-10m-task-baselines) | |
| | Qwen3-Omni 128-episode LoRA adapter | [huggingface.co/cy0307/ropedia-qwen3-omni-lora-128ep](https://huggingface.co/cy0307/ropedia-qwen3-omni-lora-128ep) | |
| | Cosmos3-Super forward-dynamics LoRA adapter | [huggingface.co/cy0307/ropedia-cosmos3-super-forward-dynamics-lora-128ep](https://huggingface.co/cy0307/ropedia-cosmos3-super-forward-dynamics-lora-128ep) | |
| | Hugging Face collection | [huggingface.co/collections/cy0307/ropedia-xperience-10m-task-suite](https://huggingface.co/collections/cy0307/ropedia-xperience-10m-task-suite) | |
| | Xperience-10M dataset website | [ropedia.com/dataset](https://ropedia.com/dataset) | |
| | Xperience-10M release page | [ropedia.com/blog/20260316_xperience_10m](https://ropedia.com/blog/20260316_xperience_10m) | |
| | Ropedia GitHub organization | [github.com/Ropedia](https://github.com/Ropedia) | |
| | HOMIE Toolkit | [github.com/Ropedia/HOMIE-toolkit](https://github.com/Ropedia/HOMIE-toolkit) | |
| | Xperience-10M Hugging Face dataset | [huggingface.co/datasets/ropedia-ai/xperience-10m](https://huggingface.co/datasets/ropedia-ai/xperience-10m) | |
| | Xperience-10M sample on Hugging Face | [huggingface.co/datasets/ropedia-ai/xperience-10m-sample](https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample) | |
| | Ropedia Hugging Face organization | [huggingface.co/ropedia-ai](https://huggingface.co/ropedia-ai) | |
|
|
| ## Citation, License, And Metadata |
|
|
| Use [`CITATION.cff`](CITATION.cff) when citing this project. The repository |
| also includes [`codemeta.json`](codemeta.json) for machine-readable software |
| metadata and [`docs/data/project_manifest.json`](docs/data/project_manifest.json) |
| for website/Hugging Face surface metadata. |
|
|
| The code files are MIT-licensed. Raw Xperience-10M data is not redistributed |
| here, and dataset use remains governed by the official Ropedia/Xperience-10M |
| terms. See [`LICENSE`](LICENSE) and [`DATA_NOTICE.md`](DATA_NOTICE.md). |
|
|
|  |
|
|
| The infographic uses a custom text-free research background and puts the shared |
| processing contract plus all 12 task families before the modality atlas. |
| Public-sample modality thumbnails remain enlarged below the task map. The task |
| names, input/output summaries, and metrics are overlaid from |
| [`results/episode_task_suite/summary_report.json`](results/episode_task_suite/summary_report.json) |
| with [`scripts/render_task_suite_infographic.py`](scripts/render_task_suite_infographic.py), |
| so the published PNG is a presentation graphic with verified labels and metrics, |
| not a hallucinated metric sheet. |
|
|
| The complete unified task list is now documented in [`TASK_SUITE_20.md`](TASK_SUITE_20.md) |
| and [`docs/data/task_suite_20.json`](docs/data/task_suite_20.json). Tasks 13-20 |
| also have a compact chart and result bundle under the historical |
| `tier2_task_suite` path for stable public links. |
|
|
|  |
|
|
| The unified radar compares all 20 task axes with two filled colors for the |
| minimal and neural MLP baselines. Every method now has 20 explicit result |
| records in the public matrix; numeric points appear only where the runner or |
| verified package produced that task target. The 128-episode raw-feature |
| simple/NN overlays are plotted on all 20 axes backed by the exported |
| 4,430-dimensional sensor NPZ blocks. Tasks 15 and 19 are marked as compact-proxy
| completions because the 128 export lacks raw interaction strings and paired |
| video-view embeddings. Qwen3-Omni and Cosmos3-Super also include a task-16 |
| action/object relation score derived from existing verified held-out JSON |
| outputs; Cosmos3-Nano and the metadata-only baselines keep scoreless records for |
| unsupported or not-evaluated targets instead of hiding those cells. |
| Cosmos3-Super forward-dynamics LoRA remains a branch card because its
| camera-pose proxy MSE is not one of the 20 task metrics.
|
| The published copies are:
|
| - machine-readable radar and matrix:
|   [`docs/data/unified_task_model_radar.json`](docs/data/unified_task_model_radar.json)
|   and
|   [`docs/data/task_method_20_result_matrix.json`](docs/data/task_method_20_result_matrix.json)
| - explicit score-gap ledger:
|   [`docs/data/task_method_20_gap_audit.json`](docs/data/task_method_20_gap_audit.json)
|   and [`TASK_METHOD_20_GAP_AUDIT.md`](TASK_METHOD_20_GAP_AUDIT.md)
| - reader-facing matrix:
|   [`TASK_METHOD_20_RESULT_MATRIX.md`](TASK_METHOD_20_RESULT_MATRIX.md)
|
|
| For easier reading, the same source data is also split into two focused radars: |
|
|
|  |
|
|
|  |
|
|
| The single-episode radar isolates Minimal vs Neural MLP, both with 20/20 scored |
| public-sample axes. The 128-episode radar isolates metadata/raw baselines and |
| Qwen3/Cosmos branches: raw-feature simple/NN baselines are the current complete |
| 20/20 scored multi-episode results, while metadata and foundation-model rows |
| retain explicit scoreless records where no public target was evaluated. The |
| current matrix has 113 numeric method-task scores out of 180 records. |
|
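As a rough illustration of how a figure such as 113/180 can be recomputed from a result matrix, the sketch below counts numeric cells in a flat list of method-task records. The record fields (`method`, `task`, `score`) and the use of `None` for scoreless cells are assumptions made for this example, not the documented schema of `task_method_20_result_matrix.json`.

```python
def score_coverage(records: list[dict]) -> tuple[int, int]:
    """Return (numeric_scores, total_records) for a flat method-task matrix."""
    numeric = 0
    for record in records:
        score = record.get("score")
        # Count real numbers only; booleans and None are treated as scoreless.
        if isinstance(score, (int, float)) and not isinstance(score, bool):
            numeric += 1
    return numeric, len(records)


# Synthetic example: 2 methods x 3 tasks, with two scoreless cells.
records = [
    {"method": "minimal", "task": 1, "score": 0.71},
    {"method": "minimal", "task": 2, "score": 0.40},
    {"method": "minimal", "task": 3, "score": 0.55},
    {"method": "metadata_only", "task": 1, "score": 0.33},
    {"method": "metadata_only", "task": 2, "score": None},
    {"method": "metadata_only", "task": 3, "score": None},
]

numeric, total = score_coverage(records)
print(f"{numeric}/{total} scored cells")
```

Keeping scoreless records in the list, rather than dropping them, is what lets the denominator stay at methods times tasks.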
|
| The website also includes a responsive native modality atlas backed by |
| [`docs/data/modality_atlas.json`](docs/data/modality_atlas.json) and |
| [`docs/assets/modalities/`](docs/assets/modalities/). Those assets are small |
| derived thumbnails from the public sample, not raw Xperience-10M files. |
|
|
|  |
|
|
|  |
|
|
|  |
|
|
| The pipeline and architecture figures use the same pattern: text-free visual |
| backgrounds carry the composition, while |
| [`scripts/render_overview_figures.py`](scripts/render_overview_figures.py) |
| overlays exact labels, dimensions, and metrics from the committed result files. |
|
|
| ## Scope |
|
|
| This is a learning, inspection, and pipeline-validation repo built from one |
| public sample episode. The next model-quality stage is to run the same suite |
| over many episodes and split train/test by held-out episode. |
|
|
| ## What Is Inside |
|
|
| ```text |
| scripts/ |
| train_min_action_model.py # motion/IMU baseline |
| train_all_modalities_model.py # current all-feature lightweight baseline |
| episode_task_suite.py # original end-to-end task definitions |
| neural_task_models.py # optional PyTorch MLP heads for task contracts |
| research_direction_taxonomy.py # maps original tasks to the four research tracks |
| research_direction_extension_tasks.py # one extra data-backed probe per track |
| tier2_task_suite.py # historical-name builder for tasks 13-20 |
| build_unified_task_suite.py # builds TASK_SUITE_20.md and task_suite_20.json |
| build_unified_task_model_radar.py # builds the unified 20-axis model comparison chart |
| build_task_method_20_gap_audit.py # builds the explicit 113/180 scored-cell gap ledger |
| task_walkthroughs.py # human-readable task-card and walkthrough-storyboard metadata |
| generate_visualizations.py # refreshes SVG charts + summary JSON |
| render_task_suite_infographic.py # renders the task-suite presentation PNG |
| export_modality_atlas_assets.py # exports responsive modality-card assets |
| render_overview_figures.py # renders polished pipeline/architecture PNGs |
| build_brand_assets.py # derives logo sizes, favicon, social card |
| build_artifact_index.py # builds the compact artifact guide data |
| build_quality_gates.py # builds release checks |
| validate_mirror_parity.py # checks prepared GitHub/HF mirror file parity |
| validate_scope_claims.py # separates setup artifacts from completed model metrics |
| validate_task_surface.py # checks readable task cards and interactive storyboard wiring |
| validate_website_integrity.py # checks local site links, anchors, and images |
| validate_publication_package.py # checks public repo + HF bundle contents |
| publish_hf_bundles.py # uploads prepared HF Space/artifact/model bundles |
| omni/ |
| download_sample_modelscope.py # ModelScope sample download helper |
| build_episode_manifest.py # metadata-only multi-episode scanner |
| plan_finetune_sample_budget.py # storage/sample-count planner |
| qwen3_omni_adapter_smoke.py # real-data Qwen3-Omni adapter setup check |
| score_existing_model_output_task_probes.py # scores task targets already present in verified model outputs |
| collect_qwen3_v4_release_artifacts.py # pulls verified v4 results after remote eval |
| |
| results/ |
| min_action_model/ # motion-only action baseline artifacts |
| min_subtask_model/ # motion-only subtask baseline artifacts |
| min_all_modalities_action_model/ # current all-feature action artifacts |
| min_all_modalities_subtask_model/ # current all-feature subtask artifacts |
| episode_task_suite/ # task-suite metrics and predictions |
| neural_mlp/ # optional neural baseline artifacts per task |
| research_directions/ # four-track taxonomy, CSV, and summary |
| research_direction_extensions/ # four extra direction probes + predictions |
| tier2_task_suite/ # tasks 13-20 baseline tasks + predictions; historical path |
| task_walkthroughs/ # case-study walkthroughs for original tasks |
| omni_exploration/ # ModelScope readiness-check artifacts |
| omni_finetune/model_output_task_probes_20260616/ # task-16 probe derived from verified model JSON |
| |
| docs/ |
| index.html # GitHub Pages dashboard |
| data/additional_development_directions.json # concrete non-backbone project directions |
| data/summary_metrics.json # website-readable metrics bundle |
| data/task_suite_20.json # unified 20-task suite bundle |
| data/unified_task_model_radar.json # 20-task radar values and model-branch overlays |
| data/single_episode_task_model_radar.json # 1-episode split radar values |
| data/episode128_task_model_radar.json # 128-episode split radar values |
| data/task_method_20_result_matrix.json # 9-method x 20-task result matrix |
| data/task_method_20_gap_audit.json # explicit 113/180 scored-cell gap ledger |
| data/evidence_contract.json # machine-readable project scope |
| data/artifact_index.json # compact project-artifact catalog |
| data/live_publication_status.json # live GitHub/HF publication verification |
| data/quality_gates.json # machine-readable release checks |
| data/task_suite_enhancement_128.json # no-new-episode 128-suite enhancement pack |
| data/task_surface_integrity.json # machine-readable task-card/storyboard integrity check |
| data/project_manifest.json # machine-readable public-surface metadata |
| data/project_packet.json # compact project path and scope summary |
| data/research_roadmap.json # multi-episode and omni-model roadmap |
| data/research_directions.json # four-track website data bundle |
| data/research_direction_extensions.json # four extra probe data bundle |
| data/tier2_task_suite.json # tasks 13-20 baseline bundle; historical path |
| data/task_walkthroughs.json # human-readable task-card and walkthrough-storyboard data |
| data/modality_atlas.json # responsive modality-card data |
| assets/brand/*.png # project logo, favicon, social card |
| assets/task_suite_infographic.png # task-suite presentation graphic |
| assets/modalities/ # public-sample derived modality thumbnails |
| assets/pipeline_diagram.png # verified episode pipeline graphic |
| assets/qwen3_omni_lora_pipeline.png # Qwen3-Omni LoRA training-flow figure |
| assets/task_architectures.png # verified task-head architecture map |
| assets/charts/unified_task_model_radar.svg # 20-task minimal/NN/Qwen/Cosmos radar |
| assets/charts/single_episode_task_model_radar.svg # 1-episode split radar |
| assets/charts/episode128_task_model_radar.svg # 128-episode split radar |
| assets/charts/*.svg # regenerated visualizations |
| |
| notes/ |
| min_action_model.md |
| all_modalities_model.md |
| episode_task_suite.md |
| ``` |
|
|
| Raw Xperience-10M data is **not** committed. Download it from the official |
| Ropedia distribution and follow the dataset terms. |
|
|
| ## GitHub Package |
|
|
| The public dashboard is packaged as a static-site container on GitHub Container |
| Registry. It contains the `docs/` site plus the main reader documents; it does |
| not include raw Xperience-10M videos, raw annotations, gated data, or model |
| weights. |
|
|
| ```bash |
| docker pull ghcr.io/chaoyue0307/ropedia-xperience-10m-task-suite:latest |
| docker run --rm -p 8080:80 ghcr.io/chaoyue0307/ropedia-xperience-10m-task-suite:latest |
| ``` |
|
|
| Then open `http://localhost:8080`. |
|
|
| ## Data Expected |
|
|
| The scripts expect a workspace with the Ropedia HOMIE toolkit and the |
| Xperience-10M sample episode: |
|
|
| ```text |
| <workspace>/ |
|   HOMIE-toolkit/
|   data/sample/xperience-10m-sample/
|     annotation.hdf5
|     fisheye_cam0.mp4
|     fisheye_cam1.mp4
|     fisheye_cam2.mp4
|     fisheye_cam3.mp4
|     stereo_left.mp4
|     stereo_right.mp4
| ``` |
|
|
| The public website also includes a Raw Sample Browser that lists every official |
| sample file, plays compact browser-preview clips derived from the official MP4 |
| streams, exposes the audio track embedded in `fisheye_cam0.mp4`, links the full |
| raw Hugging Face source for each MP4/HDF5/RRD file, and describes the |
| `annotation.hdf5` group organization without copying large raw files into this |
| repository. |
|
|
| The public sample dataset identifier is: |
|
|
| ```text |
| ropedia-ai/xperience-10m-sample |
| ``` |
|
|
| Hugging Face URL: |
|
|
| ```text |
| https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample |
| ``` |
|
|
| ## Quickstart |
|
|
| From a workspace folder: |
|
|
| ```bash |
| git clone https://github.com/Ropedia/HOMIE-toolkit.git |
| python3.12 -m venv .venv |
| source .venv/bin/activate |
| pip install -r HOMIE-toolkit/requirements.txt huggingface_hub hf_xet |
| ``` |
|
|
| Download the sample: |
|
|
| ```bash |
| hf download ropedia-ai/xperience-10m-sample \ |
| --repo-type dataset \ |
| --local-dir data/sample/xperience-10m-sample |
| ``` |
|
|
| If Hugging Face access is unavailable in your environment, use ModelScope: |
|
|
| ```bash |
| python scripts/omni/download_sample_modelscope.py \ |
| --output-dir data/sample/xperience-10m-sample \ |
| --mode minimal |
| ``` |
|
|
| `--mode minimal` downloads `annotation.hdf5`, `README.md`, and |
| `fisheye_cam0.mp4`. Use `--mode all-training` to add all six MP4 streams while |
| still skipping `visualization.rrd`. |
|
|
| Clone and run this repo: |
|
|
| ```bash |
| git clone https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite.git |
| cd ropedia-xperience-10m-task-suite |
| python scripts/episode_task_suite.py --workspace /path/to/workspace |
| ``` |
|
|
| Run the original task definitions with lightweight neural heads: |
|
|
| ```bash |
| pip install torch |
| python scripts/episode_task_suite.py \ |
| --workspace /path/to/workspace \ |
| --include-neural |
| ``` |
|
|
| Then rebuild the unified 20-task index after tasks 13-20 are generated: |
|
|
| ```bash |
| python scripts/tier2_task_suite.py --workspace /path/to/workspace |
| python scripts/build_unified_task_suite.py |
| python scripts/build_evaluation_protocol.py |
| ``` |
|
|
| Run the smaller baselines: |
|
|
| ```bash |
| python scripts/train_min_action_model.py --workspace /path/to/workspace |
| python scripts/train_all_modalities_model.py --workspace /path/to/workspace |
| ``` |
|
|
| ## Xperience-10M Fine-Tuning Exploration |
|
|
| This repo includes a first Qwen3-Omni fine-tuning path over Xperience-10M and
| keeps public-sample evidence separate from multi-episode fine-tuning artifacts.
| The selected-episode held-out package is now verified as a diagnostic result,
| not a strong final action/subtask model. The useful distinction is between two
| input classes:
|
|
| - direct Qwen3-Omni inputs: RGB/fisheye video, embedded MP4 audio, and language |
| prompts, |
| - adapter-required Xperience-10M sensor inputs: depth, pose/SLAM, hand/body |
| mocap, contacts, and IMU. |
|
|
|  |
|
|
| The figure shows the intended end-to-end training flow: raw valid episodes enter |
| episode-level split validation, parallel media/sensor export creates Qwen-style |
| JSONL records, Qwen3-Omni receives video/audio/text directly, the sensor bridge |
| adds depth/pose/mocap/IMU features, LoRA adapters are trained on prepared |
| train/val episodes, and sealed held-out test evaluation produces predictions, |
| metrics, run reports, and upload-ready adapter artifacts. |
|
|
| The scale-up path requires valid prepared episodes, held-out episode splits, |
| training metadata, predictions, metrics, and a run report. A result is ready |
| for public README, website, or Hugging Face updates only after the validator |
| passes and `scripts/omni/package_verified_omni_result.py` creates a |
| public-safe derived-artifact package. The current verified package is listed in |
| [`docs/data/omni_finetune_verified_result.json`](docs/data/omni_finetune_verified_result.json). |
| The current cross-version comparison is generated at |
| [`docs/data/omni_model_comparison.json`](docs/data/omni_model_comparison.json) |
| and [`results/omni_finetune/OMNI_MODEL_COMPARISON.md`](results/omni_finetune/OMNI_MODEL_COMPARISON.md); |
| it separates the single-episode task suite, 128-episode aligned simple/NN |
| baselines, and verified Qwen3/Cosmos model-branch packages. The same generated |
| files also include `model_groups`: a model-first view that pairs 1-episode and |
| 128-episode entries for the same family. Use that section when comparing task |
| heads against task heads, Qwen3-Omni smoke/LoRA against Qwen3-Omni LoRA, or |
| Cosmos3-Nano compatibility against future Cosmos weight releases. |
|
|
| The no-new-episode enhancement layer is recorded in |
| [`docs/data/task_suite_enhancement_128.json`](docs/data/task_suite_enhancement_128.json) |
| and [`TASK_SUITE_ENHANCEMENT_128.md`](TASK_SUITE_ENHANCEMENT_128.md). It keeps |
| the current Qwen/Cosmos packages as baselines, then defines dense-window |
| scenarios, hierarchical action/subtask targets, task bottlenecks, and experiment |
| cards for a stronger 128-episode v5 run without overwriting earlier results. |
|
|
| ### Sample Count Decision |
|
|
| Do not treat "10M" as a reason to start with the entire dataset. The engineering |
| unit that matters first is diverse held-out episodes, not adjacent windows from |
| one session. |
|
|
| | Phase | Episodes/samples | Approx windows at stride 5 | Purpose | |
| | --- | ---: | ---: | --- | |
| | Readiness | 1-3 | 1k-3k | Verify loaders, token alignment, and task heads | |
| | Pilot | 16-32 | 18k-37k | First held-out-episode evaluation | |
| | Useful LoRA run | 64-128 | 74k-149k | Train sensor adapters plus selected Qwen3-Omni LoRA | |
| | Storage-heavy run | 256+ | 297k+ | Only after download layout and checkpoint size are stable | |
|
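The window column in the table scales roughly linearly with episode count. A minimal sketch of that arithmetic follows; the constant of about 1,160 windows per episode is an assumption inferred by dividing the table's window counts by its episode counts, not a documented dataset property, and `approx_windows` is a hypothetical helper.

```python
# Assumption: ~1,160 windows per episode at stride 5, inferred from the table.
WINDOWS_PER_EPISODE_STRIDE5 = 1160


def approx_windows(episodes: int, per_episode: int = WINDOWS_PER_EPISODE_STRIDE5) -> int:
    """Estimate the total window count for a given number of episodes."""
    if episodes < 0:
        raise ValueError("episodes must be non-negative")
    return episodes * per_episode


# Episode counts taken from the phase table.
for phase, episodes in [("Readiness", 3), ("Pilot", 32), ("Useful LoRA run", 128)]:
    print(f"{phase}: {episodes} episodes -> about {approx_windows(episodes):,} windows")
```

These are dense stride-5 estimates; the labeled windows actually exported for a run can be far fewer.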
|
| Use the budget helper before downloading: |
|
|
| ```bash |
| python scripts/omni/plan_finetune_sample_budget.py \ |
| --storage-root /path/to/storage \ |
| --target-free-after-download-gb 800 \ |
| --all-training-per-episode-gb 2.4 \ |
| --full-preview-per-episode-gb 5.1 |
| ``` |
|
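The storage arithmetic behind those flags can be sketched as follows. This illustrates the budget idea only and is not the implementation of `plan_finetune_sample_budget.py`; the function name and the `free_gb` input are hypothetical.

```python
import math


def max_episodes(free_gb: float, reserve_gb: float, per_episode_gb: float) -> int:
    """Episodes that fit while leaving `reserve_gb` free after download."""
    if per_episode_gb <= 0:
        raise ValueError("per_episode_gb must be positive")
    usable_gb = free_gb - reserve_gb
    # Never report a negative budget when the reserve already exceeds free space.
    return max(0, math.floor(usable_gb / per_episode_gb))


# Example: 1,200 GB free, keep 800 GB free after download.
free_gb = 1200.0
for label, per_episode_gb in [("all-training", 2.4), ("full-preview", 5.1)]:
    count = max_episodes(free_gb, 800.0, per_episode_gb)
    print(f"{label}: up to {count} episodes")
```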
|
| ### Multi-Episode Readiness Gate |
|
|
| ```bash |
| python scripts/omni/discover_xperience10m_sources.py \ |
| --workspace /path/to/ropedia-xperience-10m-task-suite \ |
| --data-root /path/to/xperience10m_data \ |
| --output results/omni_finetune/source_discovery.json |
| ``` |
|
|
| Current status in this repo: |
|
|
| - public_sample_valid_episodes: 1 (degraded-valid: annotation + fisheye_cam0.mp4) |
| - gated_metadata_audit: 12,102 complete visible episodes across 802 complete sessions |
| - selected_episode_plan: 128 source-balanced episodes, 96/16/16 train/val/test |
| - selected_download_size: 277.71 GiB excluding `visualization.rrd` |
| - verified_final_diagnostic_package: true |
| - selected_split: 96 train / 16 validation / 16 held-out test episodes |
| - exported_windows: 2,848 train / 512 validation / 448 test |
| - validation_samples_used: 512 |
| - held_out_eval: 448 test windows from 14 exported test episodes |
| - final_train_loss / final_val_loss: 0.0277 / 0.0278 |
| - current_quality_target: strict-label JSON validity 100.00%, meeting the 98% target; action/subtask quality remains weak |
| - qwen3_lora_adapter_repo: https://huggingface.co/cy0307/ropedia-qwen3-omni-lora-128ep |
| - cosmos3_super_lora_adapter_repo: https://huggingface.co/cy0307/ropedia-cosmos3-super-forward-dynamics-lora-128ep |
| - 128_aligned_baselines: 12 task ids, 8 simple metadata/text baselines, 6 neural metadata/text baselines |
| - cosmos3_nano_branch: verified Cosmos3-Nano future-window compatibility package, 378 held-out future-window predictions from 14 test episodes |
| - cosmos3_super_branch: verified Cosmos3-Super Reasoner base-weight JSON-task evaluation, 448 held-out predictions from 14 test episodes; JSON validity 51.12%, action macro-F1 0.0008, contact accuracy 32.14%, transition accuracy 36.83% |
| - cosmos3_super_forward_dynamics_lora: verified 8-GPU FSDP LoRA branch over camera-pose proxy targets; 2,848 train rows, 512 val rows, 448 test rows, 26.2M adapter parameters, val MSE 4.0082, test MSE 3.6853; public package excludes safetensors |
| - gated_dataset: available for selected multi-episode data preparation
| - source_discovery: `results/omni_finetune/source_discovery.json` |
| - data_status: `results/omni_finetune/DATA_ACCESS_STATUS.md` |
| - access_status: `results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md` |
| |
| Use this gate before scheduling any full fine-tune run. The pilot should use |
| balanced held-out selection, not the first paths in repository order. The |
| current 128-episode selection filters for complete leaf episodes, excludes |
| `visualization.rrd`, balances episode-size bands, and preserves one selected |
| episode per top-level session UUID. |
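|
The one-episode-per-session rule and the fixed 96/16/16 split can be sketched as below. This is a simplified illustration under stated assumptions: it omits the completeness filter and the size-band balancing described above, and the function names and `(session_id, episode_id)` tuples are hypothetical rather than the repo's selection code.

```python
import random


def select_one_per_session(episodes: list[tuple[str, str]], seed: int = 0) -> list[tuple[str, str]]:
    """Keep exactly one (session_id, episode_id) pair per session."""
    by_session: dict[str, list[str]] = {}
    for session_id, episode_id in episodes:
        by_session.setdefault(session_id, []).append(episode_id)
    rng = random.Random(seed)
    # Sort first so the choice does not depend on input order.
    return [(s, rng.choice(sorted(eps))) for s, eps in sorted(by_session.items())]


def split_episodes(selected: list[tuple[str, str]], counts=(96, 16, 16), seed: int = 0) -> dict:
    """Shuffle deterministically, then cut into train/val/test by episode."""
    if len(selected) != sum(counts):
        raise ValueError("selection size must equal the sum of split counts")
    shuffled = list(selected)
    random.Random(seed).shuffle(shuffled)
    n_train, n_val, _ = counts
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


# Synthetic input: 128 sessions with two episodes each.
episodes = [(f"session-{i:03d}", f"ep{j}") for i in range(128) for j in (1, 2)]
selected = select_one_per_session(episodes)
splits = split_episodes(selected)
print({name: len(items) for name, items in splits.items()})
```

Because each session contributes a single episode, no session can appear in more than one split.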
| |
| ### Progressive Train/Validation Pilot |
| |
| The selected 128-episode plan can be used before every episode has arrived by |
| training only on prepared `train` episodes and monitoring prepared `val` episodes. |
| The final `test` episodes stay sealed until the end, so early development does |
| not contaminate held-out evaluation. |
| |
| ```bash |
| python scripts/omni/build_selection_episode_manifest.py \ |
| --workspace /path/to/ropedia-xperience-10m-task-suite \ |
| --data-root /path/to/xperience10m_128 \ |
| --selection-json results/omni_finetune/xperience10m_128_episode_selection.json \ |
| --output results/omni_finetune/trainval_progressive/episode_manifest_trainval.json \ |
| --include-split train \ |
| --include-split val |
| ``` |
| |
| `scripts/omni/run_trainval_progressive_128.sh` wraps the same guard, exports a |
| train/val-only Qwen3-Omni JSONL dataset, and launches LoRA training without |
| running final test evaluation. The exporter uses session-qualified episode IDs |
| and path-based split matching so repeated folder names such as `ep1` cannot |
| collide across different sessions. |
|
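A minimal sketch of a session-qualified episode ID, assuming each episode folder sits below a session folder under the data root. The `__` separator and the function name are hypothetical; the exporter's actual ID format may differ.

```python
from pathlib import PurePosixPath


def session_qualified_id(episode_path: str, data_root: str) -> str:
    """Build an ID from every path component below the data root."""
    relative = PurePosixPath(episode_path).relative_to(PurePosixPath(data_root))
    # Joining all parts keeps the session in the ID, so two `ep1` folders differ.
    return "__".join(relative.parts)


# Two different sessions that both contain a folder named `ep1`.
first = session_qualified_id("/data/xperience10m_128/1a2b/ep1", "/data/xperience10m_128")
second = session_qualified_id("/data/xperience10m_128/9f8e/ep1", "/data/xperience10m_128")
print(first, second)
```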
|
| For larger prepared subsets, `scripts/omni/run_trainval_parallel_export_8gpu.sh` |
| uses the same split guard, exports episodes in parallel CPU shards, skips and |
| reports episodes that contain no labeled windows under the configured label |
| rule, then launches Qwen3-Omni LoRA with `NUM_PROCESSES=8`. |
|
|
| ### Full 128-Episode Held-Out Pilot |
|
|
| Once all selected episodes are complete, use the fixed selected-episode split: |
|
|
| - 96 train episodes, |
| - 16 validation episodes, |
| - 16 held-out test episodes. |
|
|
| The clean full-run launcher validates the selected split, exports all splits in |
| parallel, trains Qwen3-Omni LoRA on train episodes while optionally monitoring |
| validation loss, then evaluates on the held-out test split: |
|
|
| ```bash |
| RUN_ID=xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \ |
| DATA_ROOT=/path/to/xperience10m_128 \ |
| SELECTION_JSON=results/omni_finetune/xperience10m_128_episode_selection.json \ |
| MODEL_DIR=/path/to/Qwen__Qwen3-Omni-30B-A3B-Instruct \ |
| NUM_PROCESSES=8 \ |
| TRAIN_VAL_SPLIT=val \ |
| MAX_VAL_SAMPLES=512 \ |
| scripts/omni/run_128_fullsplit_parallel_export_8gpu.sh |
| ``` |
|
|
| The latest verified diagnostic package uses the same selected split and 8-GPU |
| training path, includes the full held-out evaluation with 4,032 predictions and |
| 99.90% JSON validity, and keeps raw data plus full Qwen weights out of the |
| public repos. The next pass should keep this package contract while improving |
| action/subtask target quality and error analysis. |
|
|
| Monitor the run with: |
|
|
| ```bash |
| python scripts/omni/monitor_omni_progress.py \ |
| --run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu |
| ``` |
|
|
| The monitor reads training `progress.jsonl`, new evaluator partial-prediction |
| progress, and legacy generation logs, so long held-out evals can still expose |
| sample-level progress even before final metrics are written. |
|
|
| Validate the run artifacts stage by stage: |
|
|
| ```bash |
| python scripts/omni/validate_omni_finetune_run.py \ |
| --run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \ |
| --require-stage manifest |
| |
| python scripts/omni/validate_omni_finetune_run.py \ |
| --run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \ |
| --require-stage eval \ |
| --min-json-validity 0.98 |
| ``` |
|
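One way to read the `--min-json-validity 0.98` gate is as a floor on the fraction of model outputs that parse as a JSON object with the expected fields. The sketch below shows that reading; the required keys `action` and `subtask` are assumptions for illustration, and the validator's actual strict-label rule may check more.

```python
import json


def json_validity(outputs: list, required_keys=("action", "subtask")) -> float:
    """Fraction of outputs that parse as a JSON object with all required keys."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            continue  # Not parseable as JSON at all.
        if isinstance(parsed, dict) and all(key in parsed for key in required_keys):
            valid += 1
    return valid / len(outputs)


# Synthetic outputs: two valid objects, one truncated string, one non-object.
outputs = [
    '{"action": "pour", "subtask": "fill cup"}',
    '{"action": "grasp", "subtask": "pick up kettle"}',
    '{"action": "pour", "subtask": ',
    '["pour", "fill cup"]',
]
validity = json_validity(outputs)
print(f"JSON validity: {validity:.2%}, passes 0.98 gate: {validity >= 0.98}")
```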
|
| After the eval validator passes, create the public-safe result package: |
|
|
| ```bash |
| python scripts/omni/package_verified_omni_result.py \ |
| --dataset-run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \ |
| --train-run-id <train_run_id> \ |
| --eval-run-id <eval_run_id> |
| ``` |
|
|
| For long-running remote jobs, the packaging step can be watched automatically: |
|
|
| ```bash |
| python scripts/omni/watch_verified_omni_package.py \ |
| --dataset-run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \ |
| --train-run-id <train_run_id> \ |
| --eval-run-id <eval_run_id> |
| ``` |
|
|
| While waiting, the watcher can append `eval_progress_observed` events from |
| partial prediction files or legacy generation logs. This keeps the package |
| status file useful during long held-out evaluations. |
|
|
| The package copies only small derived artifacts such as metrics, predictions, |
| confusion matrices, run reports, manifests, validation summaries, and training |
| metadata. The exact required eval files and primary metrics come from the |
| selected backbone contract in `configs/omni_backbones`, so Qwen3-Omni, |
| Cosmos-style world models, and VLA/policy branches can share the same verified |
| publication gate once their model-specific evaluators exist. The package |
| excludes raw Xperience-10M files, base-model weights, adapter or checkpoint |
| weights, full checkpoints, and large archives. |
|
|
| For hardware setups that can run multiple eval workers, the Qwen evaluator also |
| supports deterministic sample shards: |
|
|
| ```bash |
| CUDA_DEVICE_GROUPS="0,1 2,3 4,5 6,7" \ |
| SHARDS=4 \ |
| RUN_ID=<merged_eval_run_id> \ |
| scripts/omni/run_qwen3_omni_lora_eval_sharded.sh |
| ``` |
|
|
| Only the merged eval directory should be validated and reported publicly, |
| because the merger checks coverage and recomputes the metrics from all |
| held-out predictions. |
|
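Deterministic sharding with a coverage check at merge time can be sketched as follows. The round-robin assignment is one common choice and is an assumption here, not necessarily how `run_qwen3_omni_lora_eval_sharded.sh` assigns samples; both function names are hypothetical.

```python
def shard_indices(num_samples: int, num_shards: int, shard_id: int) -> list[int]:
    """Round-robin assignment: shard k takes samples k, k + N, k + 2N, ..."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    return list(range(shard_id, num_samples, num_shards))


def merged_coverage_ok(shards: list[list[int]], num_samples: int) -> bool:
    """True only if every sample appears exactly once across all shards."""
    merged = sorted(index for shard in shards for index in shard)
    return merged == list(range(num_samples))


# Example: 448 held-out windows split across 4 eval workers.
shards = [shard_indices(448, 4, k) for k in range(4)]
print([len(shard) for shard in shards], merged_coverage_ok(shards, 448))
```

A merge that is missing a shard or contains duplicates fails the check, which is why only the merged directory is suitable for reporting.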
|
| After dataset export, a model-neutral window index can be created for future |
| backbones: |
|
|
| ```bash |
| python scripts/omni/export_model_neutral_window_index.py \ |
| --dataset-jsonl results/omni_finetune/xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu_dataset/dataset.jsonl |
| ``` |
|
|
| This produces `window_index.jsonl` and `window_index_manifest.json` so Cosmos- |
| style world models and VLA/policy branches can reuse the same split-checked |
| windows without depending on Qwen chat-message records. |
|
|
| ### Uploading Qwen3-Omni LoRA Artifacts
|
|
| The public-safe verified package intentionally excludes raw data, base Qwen |
| weights, LoRA weights, and full checkpoints. Adapter upload is a separate step: |
| use it only when the intended adapter directory is present and the model card |
| clearly distinguishes older smoke weights from the final selected-episode |
| diagnostic run. |
|
|
| Keep weight-bearing repositories model-specific: the final 128-episode
| Qwen3-Omni adapter belongs in `cy0307/ropedia-qwen3-omni-lora-128ep`, while older
| Qwen smoke material remains historical. Cosmos3-Nano remains an artifacts-only
| compatibility result; Cosmos3-Super Forward-Dynamics now has a separate |
| weight-bearing model repo at |
| `cy0307/ropedia-cosmos3-super-forward-dynamics-lora-128ep`. |
| Metrics, predictions, audits, and reports stay in the artifact dataset. |
|
|
| ```bash |
| python3 scripts/omni/upload_qwen3_omni_lora_to_hf.py \ |
| --repo-id cy0307/ropedia-qwen3-omni-lora-128ep \ |
| --source-dir /path/to/adapter_upload_package \ |
| --message "Upload Xperience-10M Qwen3-Omni LoRA pilot" |
| ``` |
|
|
| This script requires network access to `huggingface.co` and a valid Hugging Face
| token, supplied via `HF_TOKEN` or `--token`.
|
|
| ### Foundation Backbone Plan |
|
|
| The next modeling plan tracks several foundation-model branches instead of |
| assuming one backbone solves every Xperience-10M objective. |
|
|
| | Branch | Current role | When to use it | |
| | --- | --- | --- | |
| | Qwen3-Omni | First trainable multimodal LoRA pilot | Use for the selected 128-episode held-out baseline over video/audio/language plus sensor-bridge features. | |
| | Cosmos 3 | First world-model/action-generation branch | Use now for future-window compatibility analysis and the verified Cosmos3-Super forward-dynamics LoRA branch; compare its loss metrics separately from Qwen JSON-task accuracy. | |
| | GR00T | Humanoid/action-policy branch | Use after mocap/contact retargeting creates well-defined humanoid action targets. | |
| | OpenVLA / openpi | Open VLA/policy baselines | Use after the project defines robot-compatible or action-token targets. | |
| | Gemini Robotics | External reasoning reference | Use only for qualitative comparison or annotation support unless local trainable access exists. | |
| | Xperience Embodied Foundation Model | Future Xperience-native pretraining goal | Use only after multi-episode pilots, full-corpus storage, distributed training infrastructure, and scaling evidence justify a from-scratch domain model. | |
|
|
| See [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md) and |
| [`docs/data/foundation_model_plan.json`](docs/data/foundation_model_plan.json) |
| for the full selection matrix, source links, and model-specific evaluation |
| additions. See |
| [`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md) |
| for the long-term full-corpus pretraining plan. |
|
|
| Backbone-specific contracts now live in [`configs/omni_backbones`](configs/omni_backbones). |
| The extension contract is documented in |
| [`OMNI_MODEL_EXTENSION_CONTRACT.md`](OMNI_MODEL_EXTENSION_CONTRACT.md), and the |
| registry can be checked with: |
|
|
| ```bash |
| python scripts/omni/backbone_registry.py --validate --json |
| ``` |
|
|
| Verify that every configured backbone can pass the public-safe packaging |
| contract on synthetic derived artifacts: |
|
|
| ```bash |
| python scripts/omni/smoke_test_backbone_packaging.py |
| ``` |
|
|
| After a real held-out package is created, audit it before updating README, |
| website, or Hugging Face pages: |
|
|
| ```bash |
| python scripts/omni/audit_verified_omni_package.py \ |
| --package-dir results/omni_finetune/verified_public/<eval_run_id> |
| ``` |
|
|
| Create a new planned backbone branch from an existing contract template with: |
|
|
| ```bash |
| python scripts/omni/scaffold_omni_backbone.py \ |
| --template-backbone policy_vla_branch \ |
| --id new_policy_branch \ |
| --display-name "New Policy Branch" \ |
| --model-family "Model family name" \ |
| --dataset-contract xperience10m_observation_action_v1 \ |
| --training-objective observation_to_action_policy \ |
| --checkpoint-gate policy_checkpoint_action_space_and_normalizer \ |
| --dry-run |
| ``` |
|
|
| Each backbone config declares the checkpoint gate, required train/eval files, |
| allowed public artifacts, and forbidden private or heavyweight artifacts. This |
| keeps Qwen3-Omni, Cosmos-style world models, and policy/VLA branches on the same |
| split, validation, and publication discipline even though their training targets |
| are different. |
|
|
| ## Additional Development Directions |
|
|
| Beyond backbone selection and fine-tuning, Xperience-10M supports several |
| concrete research-development tracks: |
|
|
| | Direction | First useful artifact | Role in the project | |
| | --- | --- | --- | |
| | Episode taxonomy and data engine | Episode atlas, balance report, and split builder | Select representative data before training. | |
| | Standardized benchmark protocol | Versioned train/val/test manifests and metric scripts | Make future model results comparable. | |
| | Multimodal representation learning | Contrastive and masked-window encoder objectives | Learn reusable video/audio/depth/pose/mocap/IMU/language features. | |
| | Skill and procedure graph mining | Step graph, transitions, preconditions, and effects | Connect perception to planning and long-horizon reasoning. | |
| | Human-object affordance modeling | Contact, reachable-object, tool-use, and next-affordance tasks | Model what actions the scene makes possible. | |
| | 3D/4D scene and object memory | Persistent scene/object maps from depth, pose, multiview video, and objects | Track world state beyond single frames. | |
| | Data-quality and synchronization diagnostics | Per-episode QA for drift, missing streams, calibration, and corrupted files | Keep large multimodal training trustworthy. | |
| | Policy, retargeting, and simulation transfer | Action-token conversion and robot-compatible imitation examples | Bridge human egocentric experience to robot policy work. | |
|
|
| See [`ADDITIONAL_DEVELOPMENT_DIRECTIONS.md`](ADDITIONAL_DEVELOPMENT_DIRECTIONS.md) |
| and [`docs/data/additional_development_directions.json`](docs/data/additional_development_directions.json). |
|
|
| ## Four Research Directions |
|
|
| The original task contracts are mapped to the four Ropedia research directions in
| generated artifacts, not only in prose:
|
|
| - [`research_direction_taxonomy.json`](results/episode_task_suite/research_directions/research_direction_taxonomy.json) |
| - [`research_direction_task_map.csv`](results/episode_task_suite/research_directions/research_direction_task_map.csv) |
| - [`research_direction_summary.md`](results/episode_task_suite/research_directions/research_direction_summary.md) |
| - [`docs/data/research_directions.json`](docs/data/research_directions.json) |
|
|
| The taxonomy uses two current baselines for every task: |
|
|
| | Baseline | Role | |
| | --- | --- | |
| | Minimal interpretable heads | Softmax, logistic, ridge, and retrieval heads over the 8,546-dimensional multimodal representation. These expose the input/output contract cleanly. | |
| | Neural MLP heads | Small PyTorch MLP classifiers/regressors on the same features and splits. These check whether nonlinear heads help before moving to Qwen/Omni fine-tuning. | |
|
|
| Current direction-level coverage: |
|
|
| | Direction | Current status | Covered task evidence | What is not solved yet | |
| | --- | --- | --- | --- | |
| | A. Human Modeling & Motion Understanding | Partially implemented | Hand Trajectory Forecasting and Contact State Prediction are direct; Action Recognition and Object Relevance Prediction are proxies. Neural MLP improves hand forecasting from `0.8647` to `0.1079` MPJPE. | No full body/shape model, SMPL/MANO target, deformation prior, or multi-episode motion-generation evaluation yet. | |
| | B. 3D/4D Reconstruction & Neural Rendering | Proxy tasks only | Cross-Modal Retrieval, Cross-Modal Reconstruction, and Multimodal Synchronization Detection test alignment/reconstruction prerequisites. | No NeRF, Gaussian Splatting, TSDF, mesh, novel-view synthesis, or calibrated 4D reconstruction model yet. | |
| | C. Egocentric Vision & Interaction | Strongest implemented track | 6 direct tasks: action, subtask, transition, next-action, object relevance, and caption grounding, plus alignment/order diagnostics and audio ablation. | Single-episode chronological split limits generalization; stronger audio and video-language backbones still need multi-episode testing. | |
| | D. Scene Reconstruction & World Modeling | Early proxy tasks | Procedure Step Recognition, Next-Action Prediction, Object Relevance Prediction, Cross-Modal Retrieval, Cross-Modal Reconstruction, Temporal Order Verification, and Multimodal Synchronization Detection provide state/world-model probes. | No persistent scene graph, object permanence task, long-term map, or held-out-episode world model yet. | |
|
|
| The important interpretation is that all four directions can be **started** from |
| the Xperience-10M sample modalities, but only direction C is strongly represented |
| by the original task suite. Directions A, B, and D need additional targets and |
| multi-episode training before they become full research deliverables. |
|
|
| ## Four Direction-Extension Probes |
|
|
| Beyond the original task contracts, the repo now includes one extra data-backed
| probe for each research direction. These probes are computed from the same
| `shared_windows.npz`, `windows.csv`, and `feature_manifest.json` artifacts, so
| every reported number traces back to sample-derived features and a saved metric artifact.
|
|
| - [`research_direction_extension_results.json`](results/episode_task_suite/research_direction_extensions/research_direction_extension_results.json) |
| - [`research_direction_extension_summary.md`](results/episode_task_suite/research_direction_extensions/research_direction_extension_summary.md) |
| - [`docs/data/research_direction_extensions.json`](docs/data/research_direction_extensions.json) |
| - [`research_direction_extension_tasks.svg`](docs/assets/charts/research_direction_extension_tasks.svg) |
|
|
|  |
|
|
| | Direction | New extension task | Input | Output | Minimal | Neural MLP | Why it matters | |
| | --- | --- | --- | --- | ---: | ---: | --- | |
| | A. Human Modeling & Motion Understanding | Body and Hand Motion Intensity | non-mocap video/depth/pose/IMU/SLAM/language features | high vs low body/hand motion | `0.7827` macro-F1 | `0.7986` macro-F1 | Starts a human-motion-energy target without leaking mocap input. | |
| | B. 3D/4D Reconstruction & Neural Rendering | Multi-View Consistency Retrieval | fisheye camera feature query | synchronized stereo-left view rank | `0.5534` MRR | `0.3469` MRR | Tests whether multi-view features preserve synchronized 4D scene identity. | |
| | C. Egocentric Vision & Interaction | Action Phase Progress Estimation | non-caption multimodal window | progress inside current action segment | `0.3416` MAE | `0.3038` MAE | Adds a task-structure/intent-style target beyond class labels. | |
| | D. Scene Reconstruction & World Modeling | Short-Horizon Ego-Motion Forecasting | current sensors excluding camera translation and captions | future camera-translation delta | `0.1989` MAE | `0.0989` MAE | Starts a short-horizon world-model target over wearer motion. | |
|
|
| Run: |
|
|
| ```bash |
| python scripts/research_direction_extension_tasks.py |
| ``` |
|
|
| These four probes make the four-direction mapping more concrete, but they are
| still single-episode extension baselines. Full research conclusions require
| multi-episode training, held-out episode evaluation, and stronger task-specific
| models.
|
|
| ## Unified 20-Task Suite |
|
|
| The sample task surface is now presented as 20 tasks in one suite. Tasks 1-12 |
| are the original public-sample contracts; tasks 13-20 add long-horizon |
| forecasting, interaction text, action-object binding, object-set forecasting, |
| IMU-to-hand reconstruction, camera synchronization, and transition timing while |
| keeping the same 20-frame window unit, 5-frame stride, chronological split, and |
| minimal/neural comparison style. |
|
|
| The historical `tier2_task_suite` file and directory names remain only for |
| stable artifact links. They should be read as the result bundle for tasks |
| 13-20, not as a separate benchmark tier. |
|
|
| - [`TASK_SUITE_20.md`](TASK_SUITE_20.md) |
| - [`docs/data/task_suite_20.json`](docs/data/task_suite_20.json) |
| - [`docs/data/unified_task_model_radar.json`](docs/data/unified_task_model_radar.json) |
| - [`docs/data/single_episode_task_model_radar.json`](docs/data/single_episode_task_model_radar.json) |
| - [`docs/data/episode128_task_model_radar.json`](docs/data/episode128_task_model_radar.json) |
| - [`docs/data/task_method_20_result_matrix.json`](docs/data/task_method_20_result_matrix.json) |
| - [`docs/data/task_method_20_gap_audit.json`](docs/data/task_method_20_gap_audit.json) |
| - [`TASK_METHOD_20_GAP_AUDIT.md`](TASK_METHOD_20_GAP_AUDIT.md) |
| - [`TIER2_TASK_BASELINES.md`](results/episode_task_suite/tier2_task_suite/TIER2_TASK_BASELINES.md) |
| - [`tier2_task_suite_results.json`](results/episode_task_suite/tier2_task_suite/tier2_task_suite_results.json) |
| - [`docs/data/tier2_task_suite.json`](docs/data/tier2_task_suite.json) |
| - [`unified_task_model_radar.svg`](docs/assets/charts/unified_task_model_radar.svg) |
| - [`single_episode_task_model_radar.svg`](docs/assets/charts/single_episode_task_model_radar.svg) |
| - [`episode128_task_model_radar.svg`](docs/assets/charts/episode128_task_model_radar.svg) |
| - [`tier2_task_suite.svg`](docs/assets/charts/tier2_task_suite.svg) |
|
|
|  |
|
|
|  |
|
|
|  |
|
|
|  |
|
|
| | # | Task | Input | Output | Minimal | Neural MLP | Meaning | |
| | ---: | --- | --- | --- | ---: | ---: | --- | |
| | 13 | Long-Horizon Next-Action Forecasting | current non-caption multimodal window | action label five seconds later | `0.0750` macro-F1 | `0.0655` macro-F1 | Tests procedure context beyond the one-second next-action task. | |
| | 14 | Long-Horizon Next-Subtask Forecasting | current non-caption multimodal window | subtask five seconds later | `0.0455` macro-F1 | `0.0507` macro-F1 | Moves anticipation from low-level action to high-level procedure state. | |
| | 15 | Interaction Text Prediction | current sensor window without caption text | raw interaction phrase | `0.0444` macro-F1 | `0.0381` macro-F1 | Uses the original annotation interaction text instead of only hashed features. | |
| | 16 | Action-Object Relation Prediction | current sensor window without caption text | joint action plus object-set label | `0.0000` macro-F1 | `0.0000` macro-F1 | Exposes a hard binding target for action-object reasoning. | |
| | 17 | Future Object-Set Forecasting | current sensor window without caption text | object set five seconds later | `0.1694` micro-F1 | `0.1972` micro-F1 | Predicts which objects become relevant soon. | |
| | 18 | IMU-to-Hand Pose Reconstruction | IMU feature block only | current left/right hand joints | `0.0420` MAE | `0.0426` MAE | Tests inertial-to-hand sensor bridging. | |
| | 19 | Camera-View Synchronization Retrieval | fisheye camera-1 query | synchronized fisheye camera-3 window | `0.4943` MRR | `0.2409` MRR | Stress-tests multi-camera temporal alignment. | |
| | 20 | Time-to-Next-Transition Regression | current non-caption multimodal window | capped frames until next action boundary | `10.5374` MAE frames | `10.5545` MAE frames | Converts boundary detection into continuous timing. | |
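
The capped regression target of task 20 can be illustrated with a small sketch. This is not the repo's implementation: the function name `frames_to_next_transition`, the per-frame label input, the `cap` values, and the choice to assign the cap to frames after the last boundary are all assumptions made for illustration.

```python
def frames_to_next_transition(labels, cap=30):
    """For each frame, count frames until the action label next changes, capped at `cap`."""
    n = len(labels)
    out = [cap] * n  # assumption: frames with no later boundary receive the cap
    next_change = None  # index of the nearest later frame that starts a new action
    for i in range(n - 1, -1, -1):
        if i + 1 < n and labels[i + 1] != labels[i]:
            next_change = i + 1
        if next_change is not None:
            out[i] = min(next_change - i, cap)
    return out


# Toy per-frame labels (hypothetical, not taken from the sample episode).
labels = ["reach"] * 3 + ["pour"] * 2 + ["place"] * 2
print(frames_to_next_transition(labels, cap=2))
```

Capping keeps long steady segments from dominating the MAE, which is one plausible reason to cap the target; the actual cap and the window-level aggregation are defined by `scripts/tier2_task_suite.py`.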
|
|
| Run: |
|
|
| ```bash |
| /path/to/python-with-h5py scripts/tier2_task_suite.py |
| ``` |
|
|
| Regeneration needs either `HOMIE-toolkit` or an environment with `h5py` because |
| the interaction/object targets come from the raw public-sample |
| `annotation.hdf5`. The raw HDF5 and MP4 files remain excluded from the public |
| repo and Hugging Face mirrors. |
|
|
| ## Task Walkthroughs For Juniors |
|
|
| Every task now has a beginner-facing explanation with: |
|
|
| - a concrete coffee-episode case study, |
| - exact input contract, |
| - middle process modules, |
| - output contract, |
| - minimal and neural metrics,
| - one important limitation. |
|
|
| Primary files: |
|
|
| - [`TASK_WALKTHROUGHS.md`](results/episode_task_suite/task_walkthroughs/TASK_WALKTHROUGHS.md) |
| - [`task_walkthroughs.json`](results/episode_task_suite/task_walkthroughs/task_walkthroughs.json) |
| - [`docs/data/task_walkthroughs.json`](docs/data/task_walkthroughs.json) |
| - [`docs/data/task_surface_integrity.json`](docs/data/task_surface_integrity.json) |
|
|
| Compact map: |
|
|
| | Task | Case study | Input -> process -> output | |
| | --- | --- | --- | |
| | Action Recognition | A pouring window should be named as the current action. | all-modality window -> action label builder + classifier -> action class | |
| | Procedure Step Recognition | A fine action is grouped into a broader drink-preparation stage. | all-modality window -> subtask label builder + classifier -> subtask label | |
| | Action Boundary Detection | Detect the change from preparing to pouring. | window -> boundary builder + binary classifier -> boundary/steady | |
| | Next-Action Prediction | A preparing window predicts what happens 20 frames later. | current window -> future-label shift + classifier -> next action | |
| | Hand Trajectory Forecasting | A hand moving toward a cup becomes a future 3D hand path. | current window -> future mocap target + regressor -> hand trajectory | |
| | Contact State Prediction | Decide whether hand/body contact is happening. | non-contact features -> contact target + binary classifier -> contact label | |
| | Object Relevance Prediction | Infer milk, cup, coffee, or related objects during pouring. | non-caption features -> multi-hot object target + sigmoid heads -> object set | |
| | Language Grounding | Query "Pour milk into coffee" and retrieve the matching moment. | text-like query + candidates -> projection + cosine ranker -> ranked windows | |
| | Cross-Modal Retrieval | Motion/IMU from pouring retrieves matching depth/video. | motion/IMU/camera -> projection + candidate index -> ranked depth/video windows | |
| | Cross-Modal Reconstruction | Infer depth/video features from motion, IMU, and camera pose. | source modalities -> scaler + regressor -> target modality vector | |
| | Temporal Order Verification | Tell whether reaching then pouring was reversed. | adjacent window pair -> pair combiner + binary classifier -> correct/reversed | |
| | Multimodal Synchronization Detection | Catch motion paired with visual/depth features shifted in time. | motion side + visual side -> aligned/shifted pair builder + classifier -> aligned/shifted | |
|
|
| ## Minimal 12-Task Architectures |
|
|
| These are deliberately minimal baselines. They are useful because every |
| input/output contract is explicit, not because they are strong embodied-AI |
| models. |
|
|
| Shared setup: |
|
|
| ```text |
| raw episode -> 20-frame windows, stride 5 -> 8,546-dimensional multimodal representation |
| chronological split: first 70% train, last 30% test |
| scalers are fit on train windows only |
| ``` |
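
The shared setup above can be sketched in a few lines. This is a minimal illustration rather than the repo's code: `make_windows` and `chronological_split` are hypothetical helper names, and the sketch assumes the 70/30 cut is taken over windows in time order.

```python
def make_windows(num_frames, window=20, stride=5):
    """Return (start, end) frame index pairs for fixed-length sliding windows."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


def chronological_split(windows, train_fraction=0.7):
    """Split time-ordered windows: earliest part for training, the rest for testing."""
    cut = int(len(windows) * train_fraction)
    return windows[:cut], windows[cut:]


# Toy episode length, chosen only to keep the example small.
windows = make_windows(num_frames=100)
train, test = chronological_split(windows)
print(len(windows), len(train), len(test))
```

With a stride smaller than the window length, the last training window and the first test window share frames. Whether the pipeline leaves a gap at the cut is not stated in this section, so treat the sketch as the simplest possible reading.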
|
|
| There are four reusable head families: |
|
|
| | Head family | Used by | What it means | |
| | --- | --- | --- | |
| | Linear softmax classifier | Action Recognition, Procedure Step Recognition, Action Boundary Detection, Next-Action Prediction, Contact State Prediction, Temporal Order Verification, Multimodal Synchronization Detection | z-score features, then `XW+b`, softmax, cross-entropy, L2 | |
| | Dual ridge regression/projection | Hand Trajectory Forecasting, Cross-Modal Reconstruction | z-score input/target, solve ridge regression with L2=10 | |
| | Ridge + cosine ranking | Language Grounding, Cross-Modal Retrieval | project one modality into another feature space, then rank candidates by cosine | |
| | Multi-label logistic regression | Object Relevance Prediction | z-score non-caption features, sigmoid object heads, threshold at 0.5 | |
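
As an illustration of the ridge + cosine ranking family, the sketch below fits a closed-form ridge projection and ranks candidates by cosine similarity. It is a sketch under assumptions: the helper names, the synthetic data, and the convention that query `i` matches candidate `i` are hypothetical, z-scoring is omitted for brevity, and the repo's solver may differ.

```python
import numpy as np


def fit_ridge(X, Y, l2=10.0):
    """Closed-form ridge: W = (X^T X + l2 * I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ Y)


def cosine_rank(queries, candidates):
    """Return, per query, candidate indices sorted from most to least similar."""
    q = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + 1e-12)
    c = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(q @ c.T), axis=1)


def mean_reciprocal_rank(ranking):
    """MRR when the correct candidate for query i is candidate i."""
    targets = np.arange(ranking.shape[0])[:, None]
    ranks = np.argmax(ranking == targets, axis=1) + 1  # 1-based rank of the match
    return float(np.mean(1.0 / ranks))


rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(200, 16)), rng.normal(size=(50, 16))
A = rng.normal(size=(16, 8))  # synthetic ground-truth mapping, for the demo only
Y_train, Y_test = X_train @ A, X_test @ A

W = fit_ridge(X_train, Y_train)  # project source features into the target space
ranking = cosine_rank(X_test @ W, Y_test)
print(round(mean_reciprocal_rank(ranking), 3))
```

The synthetic mapping is exactly linear, so the demo scores far higher than the real single-episode tasks do; it only shows the mechanics of projecting, ranking, and scoring.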
|
|
| The optional neural run keeps the same window representation, leakage filters, |
| chronological splits, and metrics, but replaces the task heads with small |
| PyTorch MLP classifiers or regressors. Its outputs live under |
| [`results/episode_task_suite/neural_mlp/`](results/episode_task_suite/neural_mlp/), |
| and the rollup is stored in the `neural_tasks` section of |
| [`results/episode_task_suite/summary_report.json`](results/episode_task_suite/summary_report.json). |
|
|
| The task-specific heads are: |
|
|
| | Task | Input | Minimal head | Output | |
| | --- | --- | --- | --- | |
| | Action Recognition | all featurized modalities | linear softmax | current action class | |
| | Procedure Step Recognition | all featurized modalities | linear softmax | current subtask class | |
| | Action Boundary Detection | all featurized modalities | linear softmax | steady vs action boundary | |
| | Next-Action Prediction | all featurized modalities at `t` | linear softmax | action at `t+20` frames | |
| | Hand Trajectory Forecasting | all featurized modalities at `t` | ridge regression | future 10-frame left/right hand joints | |
| | Contact State Prediction | non-contact and non-caption signals | linear softmax | any body contact | |
| | Object Relevance Prediction | non-caption signals | multi-label logistic | relevant object set | |
| | Language Grounding | sensor windows projected to text space | ridge projection + cosine ranking | matching time window for text query | |
| | Cross-Modal Retrieval | motion/IMU/camera projected to visual space | ridge projection + cosine ranking | matching depth/video window | |
| | Cross-Modal Reconstruction | motion/IMU/camera | ridge regression | compressed depth/video target | |
| | Temporal Order Verification | `[x_t, x_{t+1}, x_{t+1} - x_t]` | binary linear softmax | correct vs reversed order | |
| | Multimodal Synchronization Detection | motion plus visual pair | binary linear softmax | aligned vs shifted by 8 windows | |
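
The two pair-based tasks at the bottom of the table build their examples from window pairs. The sketch below shows one way to do that; the function names, the label encoding (1 for correct or aligned, 0 otherwise), the forward direction of the shift, and the one-negative-per-positive sampling are assumptions, not the repo's confirmed behavior.

```python
def order_pair_features(x_t, x_next):
    """Concatenate [x_t, x_{t+1}, x_{t+1} - x_t] for one adjacent window pair."""
    return x_t + x_next + [b - a for a, b in zip(x_t, x_next)]


def build_order_examples(windows):
    """Return (features, label) pairs: 1 for correct order, 0 for the reversed pair."""
    examples = []
    for a, b in zip(windows, windows[1:]):
        examples.append((order_pair_features(a, b), 1))
        examples.append((order_pair_features(b, a), 0))
    return examples


def build_sync_examples(motion, visual, shift=8):
    """Pair motion window i with visual window i (aligned, 1) or i + shift (shifted, 0)."""
    examples = []
    for i in range(len(motion) - shift):
        examples.append((motion[i] + visual[i], 1))
        examples.append((motion[i] + visual[i + shift], 0))
    return examples


# Tiny hypothetical feature vectors, two dimensions per window.
windows = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0]]
for features, label in build_order_examples(windows):
    print(label, features)
```

Both builders produce balanced positives and negatives by construction, so a binary head that ignores its input should land near chance.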
|
|
| ## Key Results |
|
|
| | Experiment | Main score | Accuracy | Notes | |
| | --- | ---: | ---: | --- | |
| | Motion-only action | 0.9688 macro-F1 | 0.9828 | Uses motion/IMU features only | |
| | Current all-feature action | 0.9829 macro-F1 | 0.9863 | 8,546-dimensional multimodal representation | |
| | Motion-only subtask | 0.9528 macro-F1 | 0.9759 | Strong within-episode subtask signal | |
| | Current all-feature subtask | 0.9173 macro-F1 | 0.9828 | High accuracy, lower class-balanced score | |
| | Cross-modal retrieval | 0.3678 top-5 | n/a | Motion/IMU/camera/audio retrieves matching depth/video | |
| | Transition detection | 0.6118 macro-F1 | 0.9080 | Boundary F1 is 0.1250 | |
| | Hand trajectory forecast | 0.8647 MPJPE | n/a | Predicts future hand-joint trajectory | |
| | Neural MLP hand forecast | 0.1079 MPJPE | n/a | Same features/split, nonlinear regression head | |
| | Neural MLP temporal order | 0.8520 F1 | 0.8578 | Strong improvement on adjacent-window ordering | |
| | Neural MLP misalignment | 0.7153 F1 | 0.7009 | Detects shifted motion/visual/audio pairs better than the linear head | |
| | Audio ablation | +0.0418 mean delta | n/a | Current audio variant improves the primary metric on 6 of 12 task contracts | |
| | Alternate audio representation | +0.0936 mean delta | n/a | Alternate audio-window representation improves over the baseline audio variant on 6 of 12 task contracts | |
|
|
| ## Audio Contribution Study |
|
|
| The audio ablation keeps the same windows and task labels, then compares input |
| variants under the same chronological split. The script |
| [`scripts/audio_ablation_and_raw_upgrade.py`](scripts/audio_ablation_and_raw_upgrade.py) |
| reuses the real task-suite windows and evaluates six variants for |
| every task: current inputs, no audio, audio-only, alternate audio-only, audio |
| representation replacement, and all inputs plus the alternate audio representation. |
|
|
| The measured single-episode result is task-specific: |
|
|
| | Readout | Value | |
| | --- | ---: | |
| | Tasks where current audio improves the primary metric | 6 / 12 | |
| | Mean current-audio delta | +0.0418 | |
| | Tasks where alternate audio representation improves over baseline audio | 6 / 12 | |
| | Mean alternate-representation delta vs baseline audio | +0.0936 | |
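
The two readouts (tasks improved and mean delta) can be computed as sketched below. This assumes deltas are sign-adjusted so that a positive value always means the audio variant is better, including for error metrics; the actual aggregation rule lives in the linked script and may differ. All names and numbers in the sketch are made up for illustration.

```python
LOWER_IS_BETTER = {"MPJPE", "MAE"}  # error metrics; everything else is treated as a score


def signed_delta(metric_name, with_audio, without_audio):
    """Positive means the audio variant is better, whatever the metric direction."""
    delta = with_audio - without_audio
    return -delta if metric_name in LOWER_IS_BETTER else delta


def summarize(rows):
    """rows: (task, metric_name, with_audio, without_audio) tuples."""
    deltas = [signed_delta(m, a, b) for _, m, a, b in rows]
    improved = sum(d > 0 for d in deltas)
    return improved, len(deltas), sum(deltas) / len(deltas)


# Made-up numbers for illustration only; they are not results from this repo.
rows = [
    ("task_a", "macro-F1", 0.60, 0.50),
    ("task_b", "MPJPE", 0.45, 0.30),
    ("task_c", "MRR", 0.25, 0.25),
]
improved, total, mean_delta = summarize(rows)
print(f"{improved} / {total} improved, mean delta {mean_delta:+.4f}")
```

A mean over mixed metrics is only a coarse readout, which is why the per-task rows in `audio_delta_summary.csv` are the better place to look for task-specific effects.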
|
|
| Full files: |
|
|
| - [`results/audio_ablation/AUDIO_ABLATION_SUMMARY.md`](results/audio_ablation/AUDIO_ABLATION_SUMMARY.md) |
| - [`results/audio_ablation/audio_ablation_metrics.csv`](results/audio_ablation/audio_ablation_metrics.csv) |
| - [`results/audio_ablation/audio_delta_summary.csv`](results/audio_ablation/audio_delta_summary.csv) |
| - [`docs/data/audio_ablation_summary.json`](docs/data/audio_ablation_summary.json) |
| - [`docs/assets/charts/audio_ablation_delta.svg`](docs/assets/charts/audio_ablation_delta.svg) |
|
|
| ## Neural MLP Results |
|
|
| The neural baseline was run locally with `--include-neural` for all 12 tasks |
| using 80 epochs, hidden size 128, batch size 128, and CPU execution. It is not a |
| foundation model result; it is a controlled nonlinear-head comparison over the |
| same 8,546-dimensional multimodal representation. |
|
|
| | Task | Neural metric | Minimal metric | Readout | |
| | --- | ---: | ---: | --- | |
| | Action Recognition | 0.0148 macro-F1 | 0.0500 macro-F1 | Still blocked by unseen future classes | |
| | Procedure Step Recognition | 0.0281 macro-F1 | 0.0506 macro-F1 | Same single-episode split limitation | |
| | Action Boundary Detection | 0.5862 macro-F1 | 0.6118 macro-F1 | Similar to the linear baseline | |
| | Next-Action Prediction | 0.0419 macro-F1 | 0.0593 macro-F1 | Same unseen-label issue | |
| | Hand Trajectory Forecasting | 0.1079 MPJPE | 0.8647 MPJPE | Neural regression improves this target | |
| | Contact State Prediction | 1.0000 macro-F1 | 1.0000 macro-F1 | Degenerate one-class sample | |
| | Object Relevance Prediction | 0.1679 micro-F1 | 0.1803 micro-F1 | Similar weak object signal | |
| | Language Grounding | 0.0168 MRR | 0.0160 MRR | Similar ranking behavior | |
| | Cross-Modal Retrieval | 0.1300 MRR | 0.2693 MRR | Linear ridge remains stronger here | |
| | Cross-Modal Reconstruction | -0.0102 R2 | -0.0153 R2 | Small improvement but still weak | |
| | Temporal Order Verification | 0.8520 F1 | 0.5400 F1 | Neural head captures local temporal structure | |
| | Multimodal Synchronization Detection | 0.7153 F1 | 0.5052 F1 | Neural head improves alignment detection | |
|
|
| The strongest single-episode self-supervised signal is cross-modal retrieval:
| motion/IMU/camera/audio features retrieve matching depth/video windows substantially
| better than random (`0.2693` MRR with the minimal ridge head, `0.1300` MRR with the neural MLP head).
|
|
| ## Single-Episode Diagnostics and Explorer |
|
|
| While waiting for broader Xperience-10M access, the repo now includes an |
| artifact-driven diagnostics pass over the public sample episode: |
|
|
| - `results/single_episode_diagnostics/object_labels/window_object_labels.csv` |
| exports 1,161 real window-level object-label sets from `annotation.hdf5`. |
| - `results/single_episode_diagnostics/modality_ablation/ablation_metrics.csv` |
| recomputes all 96 task/modality cells, including object relevance. |
| - `results/single_episode_diagnostics/timeline_overlay/timeline_overlay.csv` |
| aligns 2,079 existing prediction rows back to the episode timeline. |
| - `results/single_episode_diagnostics/alignment_stress/alignment_shift_metrics.csv` |
| evaluates cross-modal retrieval under explicit time shifts. |
| - `docs/single_episode_explorer.html` is a static interactive page for |
| inspecting window labels, objects, predictions, modality statistics, and |
| diagnostic scores. |
|
|
| These are single-episode research diagnostics. They are useful for studying |
| task definitions, feature behavior, and model errors before scaling to more |
| episodes; they are not reported as multi-episode benchmark results. |
|
|
| ## Reproducibility Check |
|
|
| I re-ran the full pipeline from the local raw public sample into a temporary |
| local workspace and compared regenerated metrics with the committed |
| artifacts. The baseline metrics, 12 task metrics, feature manifest, and |
| available modality manifest matched exactly after float normalization. |
|
|
| See [`notes/reproducibility_audit.md`](notes/reproducibility_audit.md) for the |
| commands and verification evidence. |
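
A comparison "after float normalization" can be sketched as below. This is an assumed reading of the check, not the audited procedure: `normalize`, `metrics_match`, the six-digit rounding, and the toy payloads are hypothetical, and the audit note remains the authoritative record.

```python
import json


def normalize(value, digits=6):
    """Round every float in a nested JSON-like structure to a fixed precision."""
    if isinstance(value, float):
        return round(value, digits)
    if isinstance(value, dict):
        return {k: normalize(v, digits) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize(v, digits) for v in value]
    return value


def metrics_match(committed, regenerated, digits=6):
    """True when both metric payloads are identical after float normalization."""
    return normalize(committed, digits) == normalize(regenerated, digits)


# Toy payloads; the keys and values are illustrative, not repo metrics.
committed = json.loads('{"score": 0.5, "per_class": [0.25, 0.75]}')
regenerated = {"score": 0.5000000000001, "per_class": [0.25, 0.7500000000002]}
print(metrics_match(committed, regenerated))
```

Rounding absorbs last-bit differences between runs while still flagging any change that would be visible in a reported metric.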
|
|
| ## Why Some Scores Are Low |
|
|
| The task suite intentionally uses a chronological split: |
|
|
| ```text |
| first 70% of the episode -> train |
| last 30% of the episode -> test |
| ``` |
|
|
| The test segment contains some action/subtask labels never seen during training.
| The action, procedure-step, and next-action classifiers therefore expose the core
| limitation of single-episode learning instead of hiding it behind random splits.
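
The effect of unseen labels on macro-F1 can be reproduced on a toy label sequence. The sketch below is illustrative only: the helper names and the toy labels are hypothetical, and `macro_f1` averages over the classes present in the ground truth, which is a simplification of what a full metrics library does.

```python
def unseen_test_labels(labels, train_fraction=0.7):
    """Labels that occur in the chronological test part but never in the train part."""
    cut = int(len(labels) * train_fraction)
    return sorted(set(labels[cut:]) - set(labels[:cut]))


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy episode: the last actions only appear late, so training never sees them.
labels = ["grind"] * 4 + ["pour"] * 3 + ["stir"] * 2 + ["serve"] * 1
print(unseen_test_labels(labels))

# A classifier restricted to training labels cannot score on the unseen classes.
y_test = labels[7:]
y_pred = ["pour"] * len(y_test)
print(macro_f1(y_test, y_pred))
```

A random split would place every label on both sides of the cut and hide this failure mode, which is the reason the suite keeps the chronological split.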
|
|
| ## Modalities Used |
|
|
| The current public-sample pipeline uses: |
|
|
| - hand/body mocap joints and contact labels, |
| - camera translation and rotation, |
| - IMU acceleration and gyroscope traces, |
| - depth confidence features, |
| - six video streams, |
| - audio from the sample MP4 stream, |
| - caption/object/interaction text features, |
| - SLAM point-cloud summary features, |
| - calibration parameters. |
|
|
| The full technical source manifest is stored in |
| [`results/episode_task_suite/feature_manifest.json`](results/episode_task_suite/feature_manifest.json). |
|
|
| ## Data Notice |
|
|
| Xperience-10M data belongs to its original authors and is subject to the |
| official Ropedia dataset license and access terms. This repo contains code and |
| derived experiment artifacts only; it does not redistribute the raw videos or |
| raw annotation dataset. |
|
|