| # Research Roadmap |
|
|
| This roadmap connects the current public-sample task lab to the next |
| multi-episode Xperience-10M experiments. Each stage lists the entry condition, |
| the deliverables, and the evidence that should exist before the stage is treated |
| as complete. |
|
|
| ## Roadmap Summary |
|
|
| | Stage | Status | Entry condition | Research deliverables | Completion evidence | |
| | --- | --- | --- | --- | --- | |
| | Public-Sample Task Lab | Implemented | One public Xperience-10M sample episode is available. | 1,161 aligned windows, 12 task contracts, minimal heads, neural MLP heads, modality atlas, task walkthroughs, and derived figures. | `PROJECT_STATUS.md`, `EVALUATION_PROTOCOL.md`, `RESEARCH_TAKEAWAYS.md`, `docs/data/summary_metrics.json`, `results/episode_task_suite/summary_report.json` | |
| | Multi-Episode Data Staging | Active | Gated dataset access and enough storage for selected episodes. | 32 valid episodes, episode manifest, missing-view manifest, held-out episode split, and source-discovery report. | `results/omni_finetune/DATA_ACCESS_STATUS.md`, `results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md`, `results/omni_finetune/source_discovery.json` | |
| | 32-Episode Qwen3-Omni LoRA Pilot | Next | At least 32 valid episodes staged locally with no train/test episode leakage. | Dataset JSONL/media manifests, LoRA adapter checkpoint, progress logs, held-out predictions, metrics, confusion matrices, and run report. | `dataset_manifest.json`, `episode_manifest.json`, `training_metadata.json`, `progress.jsonl`, `metrics.json`, `predictions.jsonl`, `predictions.csv`, `confusion_matrix.csv`, `RUN_REPORT.md` | |
| | 64-128 Episode Robustness Run | Planned | The 32-episode pilot trains and evaluates cleanly. | Split-by-session metrics, modality ablations, calibration/object/language error analysis, and sensitivity to missing views. | Held-out metrics by session, task, and modality; ablation tables; qualitative error analysis. | |
| | Foundation and World-Model Extensions | Planned | Enough multi-episode data and compute budget for larger multimodal objectives. | Audio encoder integration, depth/image reconstruction, SLAM/world modeling probes, policy-style next-action tasks, and affordance/object interaction tasks. | Task-specific held-out evaluations, qualitative inspection, and updated model cards. | |
|
|
| ## Current Decision Point |
|
|
| The next decision concerns data scale: keep the public-sample task suite as the |
| development harness, then stage enough official Xperience-10M episodes to run |
| the 32-episode held-out pilot. The public sample already suffices for task |
| design, feature contracts, walkthroughs, and baseline comparisons. It is not |
| enough to measure general embodied-AI model quality, because a single episode |
| cannot provide held-out episodes for evaluation. |
|
|
| ## Stage Details |
|
|
| ### 1. Public-Sample Task Lab |
|
|
| This stage turns one synchronized egocentric episode into a clean research |
| surface. It defines what one model input is, what each task predicts, how the |
| split is constructed, and how minimal and neural heads are compared. |
|
|
| Evidence to inspect: |
|
|
| - `results/episode_task_suite/windows.csv` |
| - `results/episode_task_suite/feature_manifest.json` |
| - `results/episode_task_suite/summary_report.json` |
| - `results/episode_task_suite/neural_mlp/` |
| - `docs/data/task_walkthroughs.json` |
|
|
| ### 2. Multi-Episode Data Staging |
|
|
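| This stage expands the same data contract to official gated episodes. The key |
| research requirement is episode-level separation: training and test examples |
| must come from different episodes, not merely from different windows inside the |
| same episode. Windows from one episode come from the same recording, so a |
| within-episode split would overstate how well a model generalizes. |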
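
The episode-level separation described above can be sketched as follows. This is a minimal illustration, not the project's actual split code: the `episode_id` field, the `assign_split` and `split_windows` helpers, and the 25% test fraction are all assumptions made for the example.

```python
import hashlib


def assign_split(episode_id: str, test_fraction: float = 0.25) -> str:
    """Deterministically map an episode ID to 'train' or 'test'."""
    digest = hashlib.sha256(episode_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_windows(windows, test_fraction=0.25):
    """Split window records by episode so no episode spans both sides.

    Each record is assumed (hypothetically) to carry an 'episode_id' key.
    """
    split = {"train": [], "test": []}
    for window in windows:
        side = assign_split(window["episode_id"], test_fraction)
        split[side].append(window)
    return split


def leaked_episodes(split):
    """Return episode IDs present in both train and test (should be empty)."""
    train_ids = {w["episode_id"] for w in split["train"]}
    test_ids = {w["episode_id"] for w in split["test"]}
    return train_ids & test_ids
```

Hashing the episode ID keeps each episode's assignment stable as more episodes are staged, at the cost of only approximating the requested test fraction. A run could call `leaked_episodes` on its final split and refuse to train if the result is non-empty.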
|
|
| Evidence to inspect: |
|
|
| - `results/omni_finetune/DATA_ACCESS_STATUS.md` |
| - `results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md` |
| - `scripts/omni/discover_xperience10m_sources.py` |
| - `results/omni_finetune/source_discovery.json` |
|
|
| ### 3. 32-Episode Qwen3-Omni LoRA Pilot |
|
|
| This stage uses Qwen3-Omni as the multimodal backbone and trains lightweight |
| LoRA adapters. The first target is a complete held-out-episode training and |
| evaluation loop with inspectable manifests, predictions, and metrics. |
|
|
| Expected outputs: |
|
|
| - `dataset_manifest.json` |
| - `episode_manifest.json` |
| - `training_metadata.json` |
| - `progress.jsonl` |
| - `metrics.json` |
| - `predictions.jsonl` |
| - `predictions.csv` |
| - `confusion_matrix.csv` |
| - `RUN_REPORT.md` |
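
One way to confirm that a pilot run produced every file listed above is a small completeness check. This is a sketch under stated assumptions: it supposes all outputs land in a single run directory, and the `missing_outputs` helper is hypothetical rather than part of the repository.

```python
from pathlib import Path

# File names copied from the expected-outputs list above.
EXPECTED_OUTPUTS = [
    "dataset_manifest.json",
    "episode_manifest.json",
    "training_metadata.json",
    "progress.jsonl",
    "metrics.json",
    "predictions.jsonl",
    "predictions.csv",
    "confusion_matrix.csv",
    "RUN_REPORT.md",
]


def missing_outputs(run_dir):
    """Return expected file names that are absent or empty in run_dir.

    Assumes (hypothetically) a flat layout with every output in one directory.
    """
    run_path = Path(run_dir)
    missing = []
    for name in EXPECTED_OUTPUTS:
        path = run_path / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

An empty return value means every expected file exists and is non-empty; it says nothing about whether the contents are valid.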
|
|
| ### 4. 64-128 Episode Robustness Run |
|
|
| This stage asks whether the 32-episode conclusions survive more sessions, |
| different objects, missing views, and stronger modality ablations. It should |
| report performance by task, session, modality, and failure type. |
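
Reporting by task, session, and modality amounts to grouping held-out predictions on those keys. The sketch below assumes each prediction record is a dictionary with `prediction` and `label` fields plus the grouping fields; these field names and the `accuracy_by` helper are illustrative assumptions, not the schema of `predictions.jsonl`.

```python
from collections import defaultdict


def accuracy_by(records, keys):
    """Compute accuracy for each combination of the given grouping keys.

    Records are assumed (hypothetically) to be dicts with 'prediction',
    'label', and one entry per grouping key.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [correct, count]
    for record in records:
        group = tuple(record[key] for key in keys)
        totals[group][0] += int(record["prediction"] == record["label"])
        totals[group][1] += 1
    return {group: correct / count for group, (correct, count) in totals.items()}
```

For example, `accuracy_by(records, ("task", "session"))` yields one accuracy per task and session pair, while `accuracy_by(records, ("modality",))` collapses everything except modality. Groups with few examples give noisy estimates, so counts are worth reporting next to each score.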
|
|
| ### 5. Foundation and World-Model Extensions |
|
|
| This stage moves beyond lightweight heads and LoRA pilots into richer multimodal |
| objectives: audio-visual alignment, depth/image reconstruction, dynamic scene |
| state, SLAM/world modeling, policy-style next-action prediction, contact, object |
| relevance, and affordance reasoning. |
|
|
| ## Public Artifacts That Should Move Together |
|
|
| When a roadmap stage advances, update these public surfaces together: |
|
|
| - `README.md` |
| - `PROJECT_STATUS.md` |
| - `RESEARCH_TAKEAWAYS.md` |
| - `EVALUATION_PROTOCOL.md` |
| - `ARTIFACT_GUIDE.md` |
| - `docs/index.html` |
| - `docs/data/research_roadmap.json` |
| - Hugging Face Space, artifact dataset, and model cards |
|
|