| # Research Roadmap |
|
|
| This roadmap connects the current public-sample task lab to the next |
| multi-episode Xperience-10M experiments and the later foundation-model branches. |
| Each stage lists the entry condition, the deliverables, and the evidence that |
| should exist before the stage is treated as complete. |
|
|
| ## Roadmap Summary |
|
|
| | Stage | Status | Entry condition | Research deliverables | Completion evidence | |
| | --- | --- | --- | --- | --- | |
| | Public-Sample Task Lab | Implemented | One public Xperience-10M sample episode is available. | 1,161 aligned windows, 12 task contracts, minimal heads, neural MLP heads, modality atlas, task walkthroughs, and derived figures. | `PROJECT_STATUS.md`, `EVALUATION_PROTOCOL.md`, `RESEARCH_TAKEAWAYS.md`, `docs/data/summary_metrics.json`, `results/episode_task_suite/summary_report.json` | |
| | Multi-Episode Data Preparation | Implemented for first selected pilot | Gated dataset availability and enough storage for selected episodes. | 128 selected episodes, episode manifest, missing-view manifest, held-out episode split, and source-discovery report. | `results/omni_finetune/DATA_ACCESS_STATUS.md`, `results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md`, `results/omni_finetune/xperience10m_128_episode_selection.json` | |
| | Qwen3-Omni LoRA Latest Diagnostic Branch | Verified latest branch | Selected episodes prepared locally with no train/test episode leakage. | Dataset JSONL/media manifests, LoRA adapter checkpoint, progress logs, validation monitoring, held-out predictions, metrics, confusion matrices, v5/v6 comparison, run report, and public LoRA adapter repo. | `docs/data/omni_finetune_verified_result.json`, `docs/data/qwen3_v5_v6_comparison.json`, `results/omni_finetune/QWEN3_V5_V6_COMPARISON_20260614.md`, `results/omni_finetune/verified_public/`, `metrics.json`, `predictions.jsonl`, `RUN_REPORT.md`, `https://huggingface.co/cy0307/ropedia-qwen3-omni-lora-128ep` | |
| | 128-Episode Same-Split Simple/NN Baselines | Verified companion result | Derived Qwen JSONL export for the selected 96/16/16 split. | Same 12 task ids, simple metadata/text baselines, neural MLP baselines where JSON labels support them, and explicit unsupported markers for tasks that still require raw 128 feature blocks. | `results/omni_finetune/multi_episode_128_task_baselines/BASELINE_ALIGNMENT_REPORT.md`, `summary_report.json`, `scripts/omni/run_128_task_baselines.py` | |
| | 128-Episode Task Suite Enhancement Pack | Current no-new-episode plan | Same selected 96/16/16 split and current public 3,808-window export. | Dense-window and multiscale export estimates, hierarchical action/subtask target contract, raw-feature shard priorities for unsupported tasks, Qwen v5 and Cosmos continuation run cards, and publication-ready artifacts. | `TASK_SUITE_ENHANCEMENT_128.md`, `docs/data/task_suite_enhancement_128.json`, `results/omni_finetune/task_suite_enhancement_128_v1_20260608/enhancement_plan.json`, `scripts/omni/build_task_suite_enhancement_128.py` | |
| | Action/Subtask Error-Analysis Pass | Active next step | The final diagnostic package meets strict JSON validity but has weak action/subtask held-out quality. | Same 96/16/16 split, action/subtask confusion analysis, unseen-label analysis, object/action family breakdowns, and comparison to the final verified Qwen baseline. | Updated error-analysis tables, held-out metrics by failure type, and verified public package. | |
| | Foundation-Model Selection Matrix | Current | The selected pilot episodes are prepared, or a 3-8 episode dry run is available for preprocessing checks. | Backbone registry, Cosmos 3 world-model branch plan, Qwen3-Omni baseline plan, OpenVLA/openpi/GR00T policy candidates, and model-specific evaluation additions. | `FOUNDATION_MODEL_PLAN.md`, `docs/data/foundation_model_plan.json`, `research_roadmap_interactive.json` | |
| | 64-128 Episode Robustness Run | Planned | The final selected-episode Qwen diagnostic run trains and evaluates cleanly. | Split-by-session metrics, modality ablations, calibration/object/language error analysis, and sensitivity to missing views. | Held-out metrics by session, task, and modality; ablation tables; qualitative error analysis. | |
| | Cosmos 3 and Policy-Model Extensions | Planned | Enough multi-episode data, compute budget, and model-specific action/world-state targets. | Cosmos 3 future-window or action-conditioned world-model probes, OpenVLA/openpi/GR00T action-policy baselines, modality-conditioning checks, affordance tasks, and synthetic-data usefulness tests. | Task-specific held-out evaluations, qualitative inspection, and updated model cards. | |
| | Xperience Embodied Foundation Model Pretraining | Future | Full-corpus access, PB-scale storage path, multi-node compute, and positive scaling evidence from smaller runs. | Xperience-native temporal multimodal model, full-corpus manifests, pretraining shards, scaling curves, held-out evaluations, and model card. | Pretraining metadata, checkpoint inventory, held-out metrics, scaling report, and data-boundary report. | |
|
|
| ## Current Decision Point |
|
|
| The useful next decision is model-quality improvement plus backbone fit without |
| requiring more raw episodes first: keep the public-sample task suite as the |
| development harness, use the verified Qwen3-Omni v6 diagnostic branch plus the |
| pinned v5 row as the current cross-episode references, then improve action/subtask quality before |
| claiming model quality. The earlier simple and neural baseline framing is now |
| aligned to the same 96/16/16 split through metadata/text baselines for |
| JSON-supported task ids; raw-feature-only tasks remain marked as needing the |
| 128-run sensor feature blocks. The current no-new-episode recommendation is to |
| export `multiscale_20s10_40s20_80s40` windows, add hierarchical |
| action/subtask targets, and publish separate verified packages rather than |
| overwriting the existing Qwen, Cosmos, or baseline results. |
| Qwen3-Omni remains the first trainable multimodal LoRA target. Cosmos 3 becomes |
| the first world-model/action-generation branch. OpenVLA, openpi, GR00T, Octo, |
| and SmolVLA-style models become policy/action branches only after the action |
| target is explicit. A from-scratch Xperience Embodied Foundation Model is the |
| long-term native-pretraining goal, not the immediate experiment. The public |
| sample is already enough for task design, feature contracts, walkthroughs, and |
| baseline comparisons. The first multi-episode pilot is enough to verify the |
| end-to-end training loop, but its weak metrics are not final model quality. |
|
|
| The three headline directions should therefore be organized as pipeline tracks: |
| spatial intelligence models, human-video world models, and |
| vision-language-action models. All three are legitimate directions for |
| Xperience-10M, but each needs a different artifact gate. Spatial intelligence |
| needs depth/pose-backed scene-memory targets and held-out spatial metrics; |
| world modeling needs future-state or latent/visual future metrics beyond |
| structured task probes; VLA needs traceable action-token conversion, |
| normalization, and policy-style held-out metrics. The detailed track contract is |
| [`THREE_FOUNDATION_PIPELINES.md`](THREE_FOUNDATION_PIPELINES.md), with the |
| website data copy in |
| [`docs/data/three_foundation_pipelines.json`](docs/data/three_foundation_pipelines.json). |
|
|
| ## Additional Concrete Development Directions |
|
|
| The project can also grow through smaller, high-leverage directions that do not |
| depend on immediately training a larger foundation model: |
|
|
| | Direction | First artifact | Research value | |
| | --- | --- | --- | |
| | Episode taxonomy and data engine | Episode atlas, category tags, balance report, and split builder. | Makes episode selection representative and measurable. | |
| | Standardized benchmark protocol | Fixed splits, task cards, metric scripts, and leakage checks. | Makes future model comparisons fair. | |
| | Multimodal representation learning | Contrastive and masked-window objectives over synchronized modalities. | Learns reusable encoders before expensive large-model training. | |
| | Skill and procedure graph mining | Steps, transitions, preconditions, effects, and temporal skill graphs. | Connects perception to planning and long-horizon reasoning. | |
| | Human-object interaction and affordance modeling | Contact, reachable-object, tool-use, and next-affordance tasks. | Models what the scene makes possible, not only the current label. | |
| | 3D/4D scene and object memory | Persistent scene/object maps from depth, pose, multiview video, and objects. | Supports object permanence and spatial reasoning. | |
| | Data quality and synchronization diagnostics | Per-episode QA for drift, missing streams, calibration, and corrupted files. | Prevents silent failures in large multimodal training. | |
| | Policy, retargeting, and simulation transfer | Action-token conversion and robot-compatible imitation examples. | Bridges human egocentric experience to robot policy work. | |
|
|
| The concise public source is |
| `ADDITIONAL_DEVELOPMENT_DIRECTIONS.md`; the website/Hugging Face data copy is |
| `docs/data/additional_development_directions.json`. |
|
|
| ## No-New-Episode Enhancement Pack |
|
|
| The current 128-episode setup still has headroom before adding more data. The |
| non-overwriting enhancement pack estimates denser and multiscale windows from |
| the observed frame spans, identifies the action/subtask and next-action |
| label-pressure bottleneck, and defines the next export/model contracts. |
|
|
| Evidence to inspect: |
|
|
| - `TASK_SUITE_ENHANCEMENT_128.md` |
| - `docs/data/task_suite_enhancement_128.json` |
| - `results/omni_finetune/task_suite_enhancement_128_v1_20260608/enhancement_plan.json` |
| - `results/omni_finetune/task_suite_enhancement_128_v1_20260608/dense_window_scenarios.csv` |
| - `scripts/omni/build_task_suite_enhancement_128.py` |
|
|
| ## Stage Details |
|
|
| ### 1. Public-Sample Task Lab |
|
|
| This stage turns one synchronized egocentric episode into a clean research |
| surface. It defines what one model input is, what each task predicts, how the |
| split is constructed, and how minimal and neural heads are compared. |
|
|
| Evidence to inspect: |
|
|
| - `results/episode_task_suite/windows.csv` |
| - `results/episode_task_suite/feature_manifest.json` |
| - `results/episode_task_suite/summary_report.json` |
| - `results/episode_task_suite/neural_mlp/` |
| - `docs/data/task_walkthroughs.json` |
|
|
| ### 2. Multi-Episode Data Preparation |
|
|
| This stage expands the same data contract to official gated episodes. The key |
| research requirement is episode-level separation: training and test examples |
| must come from different episodes, not different windows inside the same |
| episode. The first selected 96/16/16 split has been used for a verified |
| Qwen3-Omni diagnostic pilot. |
|
|
| Evidence to inspect: |
|
|
| - `results/omni_finetune/DATA_ACCESS_STATUS.md` |
| - `results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md` |
| - `scripts/omni/discover_xperience10m_sources.py` |
| - `results/omni_finetune/source_discovery.json` |
| - `results/omni_finetune/multi_episode_128_task_baselines/BASELINE_ALIGNMENT_REPORT.md` |
|
|
| ### 3. Qwen3-Omni LoRA Pilot |
|
|
| This stage uses Qwen3-Omni as the multimodal backbone and trains lightweight |
| LoRA adapters. The final held-out diagnostic package now exists. It proves the |
| export, training, evaluation, validation, public-safe packaging, and adapter |
| publication loop. The current v4 four-epoch evaluation reaches 100.00% JSON |
| validity, 97.32% transition accuracy, 72.99% contact accuracy, and 31.10% |
| object micro-F1, but action macro-F1 is 0.0019 and subtask accuracy is 0.0000. |
| Treat it as a baseline and error-analysis starting point, not as a strong |
| action/subtask model. |
|
|
| Expected outputs: |
|
|
| - `dataset_manifest.json` |
| - `episode_manifest.json` |
| - `training_metadata.json` |
| - `progress.jsonl` |
| - `metrics.json` |
| - `predictions.jsonl` |
| - `predictions.csv` |
| - `confusion_matrix.csv` |
| - `RUN_REPORT.md` |
|
|
| ### 4. 64-128 Episode Robustness Run |
|
|
| This stage asks whether the pilot conclusions survive more sessions, |
| different objects, missing views, and stronger modality ablations. It should |
| report performance by task, session, modality, and failure type. |
|
|
| ### 5. Foundation-Model Selection Matrix |
|
|
| This stage records which foundation model is suitable for which Xperience-10M |
| objective. The current decision is: |
|
|
| - Qwen3-Omni first for multimodal instruction, structured JSON prediction, and |
| LoRA over video/audio/language plus sensor-bridge features. |
| - Cosmos 3 next for world modeling, action-conditioned future prediction, and |
| synthetic-data experiments. |
| - OpenVLA, openpi, GR00T, Octo, and SmolVLA-style policies after action-space |
| conversion and retargeting are traceable. |
| - Gemini Robotics only as an external reasoning/reference surface unless local |
| trainable access becomes available. |
|
|
| Evidence to inspect: |
|
|
| - `FOUNDATION_MODEL_PLAN.md` |
| - `docs/data/foundation_model_plan.json` |
| - `docs/data/research_roadmap_interactive.json` |
|
|
| ### 6. Cosmos 3 and Policy-Model Extensions |
|
|
| This stage moves beyond lightweight heads and LoRA pilots into richer multimodal |
| objectives: audio-visible alignment, future-window prediction, |
| action-conditioned world modeling, synthetic-data usefulness tests, policy-style |
| next action, contact, object relevance, and affordance reasoning. |
|
|
| Current Cosmos3-Super status: a camera-pose proxy action target export augments |
| all 3,808 selected 128-episode windows, passes the contract audit, and now has |
| a verified 8-GPU FSDP forward-dynamics LoRA run. The full run trains 26.2M LoRA |
| parameters on 2,848 train rows and evaluates 512 validation plus 448 held-out |
| test rows. It supervises noisy future vision velocity under camera-pose action |
| conditioning, not semantic JSON labels or `preds_action`; supervised |
| action-token prediction still needs a separate policy or inverse-dynamics |
| target export. |
|
|
| ### 7. Xperience Embodied Foundation Model Pretraining |
|
|
| This stage is the long-term full-corpus goal. Instead of adapting an existing |
| backbone, it would pretrain a domain model directly on the synchronized |
| Xperience-10M modality structure: video, audio, depth, pose/SLAM, hand/body |
| mocap, IMU, calibration, and language annotations. |
|
|
| The first realistic target is a 3B-7B Xperience-native domain model after |
| smaller 0.3B-1B and 1B-3B pilots prove that the objectives and data loaders |
| scale. The training objective should combine masked multimodal modeling, |
| cross-modal alignment, future-state prediction, ego-motion and hand-motion |
| forecasting, action/procedure prediction, language grounding, contact and |
| affordance prediction, and optional policy-style targets after action |
| conversion. |
|
|
| This stage needs full-corpus access, PB-scale storage planning, high-throughput |
| media decoding, distributed training, reliable checkpoints, and held-out |
| evaluation across episodes, sessions, activities, objects, and missing |
| modalities. The plan is reader-facing in |
| `XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`. |
|
|
| ## Public Artifacts That Should Move Together |
|
|
| When a roadmap stage advances, update these public surfaces together: |
|
|
| - `README.md` |
| - `PROJECT_STATUS.md` |
| - `RESEARCH_TAKEAWAYS.md` |
| - `EVALUATION_PROTOCOL.md` |
| - `ARTIFACT_GUIDE.md` |
| - `ADDITIONAL_DEVELOPMENT_DIRECTIONS.md` |
| - `XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md` |
| - `docs/index.html` |
| - `docs/data/additional_development_directions.json` |
| - `docs/data/research_roadmap.json` |
| - Hugging Face Space, artifact dataset, and model cards |
|
|