Research Roadmap
This roadmap connects the current public-sample task lab to the next multi-episode Xperience-10M experiments. Each stage lists the entry condition, the deliverables, and the evidence that should exist before the stage is treated as complete.
Roadmap Summary
| Stage | Status | Entry condition | Research deliverables | Completion evidence |
|---|---|---|---|---|
| Public-Sample Task Lab | Implemented | One public Xperience-10M sample episode is available. | 1,161 aligned windows, 12 task contracts, minimal heads, neural MLP heads, modality atlas, task walkthroughs, and derived figures. | PROJECT_STATUS.md, EVALUATION_PROTOCOL.md, RESEARCH_TAKEAWAYS.md, docs/data/summary_metrics.json, results/episode_task_suite/summary_report.json |
| Multi-Episode Data Staging | Active | Gated dataset access and enough storage for selected episodes. | 32 valid episodes, episode manifest, missing-view manifest, held-out episode split, and source-discovery report. | results/omni_finetune/DATA_ACCESS_STATUS.md, results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md, results/omni_finetune/source_discovery.json |
| 32-Episode Qwen3-Omni LoRA Pilot | Next | At least 32 valid episodes staged locally with no train/test episode leakage. | Dataset JSONL/media manifests, LoRA adapter checkpoint, progress logs, held-out predictions, metrics, confusion matrices, and run report. | dataset_manifest.json, training_metadata.json, progress.jsonl, metrics.json, predictions.jsonl, RUN_REPORT.md |
| 64-128 Episode Robustness Run | Planned | The 32-episode pilot trains and evaluates cleanly. | Split-by-session metrics, modality ablations, calibration/object/language error analysis, and sensitivity to missing views. | Held-out metrics by session, task, and modality; ablation tables; qualitative error analysis. |
| Foundation and World-Model Extensions | Planned | Enough multi-episode data and compute budget for larger multimodal objectives. | Audio encoder integration, depth/image reconstruction, SLAM/world modeling probes, policy-style next-action tasks, and affordance/object interaction tasks. | Task-specific held-out evaluations, qualitative inspection, and updated model cards. |
Current Decision Point
The useful next decision is data scale: keep the public-sample task suite as the development harness, then stage enough official Xperience-10M episodes to run the 32-episode held-out pilot. The public sample is already enough for task design, feature contracts, walkthroughs, and baseline comparisons. It is not enough to measure general embodied-AI model quality.
Stage Details
1. Public-Sample Task Lab
This stage turns one synchronized egocentric episode into a clean research surface. It defines what one model input is, what each task predicts, how the split is constructed, and how minimal and neural heads are compared.
Evidence to inspect:
results/episode_task_suite/windows.csvresults/episode_task_suite/feature_manifest.jsonresults/episode_task_suite/summary_report.jsonresults/episode_task_suite/neural_mlp/docs/data/task_walkthroughs.json
2. Multi-Episode Data Staging
This stage expands the same data contract to official gated episodes. The key research requirement is episode-level separation: training and test examples must come from different episodes, not different windows inside the same episode.
Evidence to inspect:
results/omni_finetune/DATA_ACCESS_STATUS.mdresults/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.mdscripts/omni/discover_xperience10m_sources.pyresults/omni_finetune/source_discovery.json
3. 32-Episode Qwen3-Omni LoRA Pilot
This stage uses Qwen3-Omni as the multimodal backbone and trains lightweight LoRA adapters. The first target is a complete held-out-episode training and evaluation loop with inspectable manifests, predictions, and metrics.
Expected outputs:
dataset_manifest.jsonepisode_manifest.jsontraining_metadata.jsonprogress.jsonlmetrics.jsonpredictions.jsonlpredictions.csvconfusion_matrix.csvRUN_REPORT.md
4. 64-128 Episode Robustness Run
This stage asks whether the 32-episode conclusions survive more sessions, different objects, missing views, and stronger modality ablations. It should report performance by task, session, modality, and failure type.
5. Foundation and World-Model Extensions
This stage moves beyond lightweight heads and LoRA pilots into richer multimodal objectives: audio-visible alignment, depth/image reconstruction, dynamic scene state, SLAM/world modeling, policy-style next action, contact, object relevance, and affordance reasoning.
Public Artifacts That Should Move Together
When a roadmap stage advances, update these public surfaces together:
README.mdPROJECT_STATUS.mdRESEARCH_TAKEAWAYS.mdEVALUATION_PROTOCOL.mdARTIFACT_GUIDE.mddocs/index.htmldocs/data/research_roadmap.json- Hugging Face Space, artifact dataset, and model cards