Additional Development Directions

This note records concrete directions that can grow from Xperience-10M beyond the current minimal baselines, Qwen3-Omni LoRA plan, Cosmos/world-model branch, and long-term Xperience-native pretraining goal. These are project directions, not completed benchmark results.

Direction	What to build first	Why it matters
Episode taxonomy and data engine	Episode atlas, category tags, balance report, and split builder across activities, objects, scenes, people, sessions, and missing modalities.	Fine-tuning quality depends on selecting representative episodes instead of sampling randomly from a large corpus.
Standardized benchmark protocol	Fixed train/val/test manifests, task cards, leakage checks, metric scripts, and small reference baselines.	Makes future model results comparable across Qwen, Cosmos-style world models, policy models, and smaller task heads.
Multimodal representation learning	Contrastive and masked-prediction objectives over video, audio, depth, pose, mocap, IMU, and language windows.	Turns Xperience-10M into a reusable encoder-learning dataset before committing to expensive large-model training.
Skill and procedure graph mining	Segment actions into steps, transitions, preconditions, effects, and temporal skill graphs.	Connects egocentric perception to task structure, planning, and long-horizon embodied reasoning.
Human-object interaction and affordance modeling	Contact, hand-object state, reachable object, likely tool use, and next-affordance prediction tasks.	Uses the dataset's hands, mocap, objects, contacts, and language to model what actions the scene affords.
3D/4D scene and object memory	Fuse depth, pose/SLAM, multiview video, and object cues into persistent scene/object maps.	Moves beyond frame-level recognition toward world-state tracking, object permanence, and spatial reasoning.
Data quality, synchronization, and missing-modality diagnostics	Per-episode QA for timestamp drift, camera/audio/depth availability, calibration consistency, and corrupted files.	Large multimodal training fails quietly without strong data-quality gates; this should become a first-class artifact.
Policy, retargeting, and simulation transfer	Convert mocap/hand/contact traces into action tokens, robot-compatible targets, imitation-learning data, and simulation probes.	Creates a bridge from human egocentric experience to robot policies while keeping action-space assumptions explicit.

Practical Order

Build the episode taxonomy and data-quality diagnostics first.
Lock the benchmark protocol and split manifests before reporting model scores.
Add representation-learning and skill-graph objectives once enough episodes are staged.
Add affordance, 3D/4D memory, and policy-retargeting branches after the labels and action targets are measurable.

The current public sample is useful for prototyping the contracts and visual explanations. Strong claims for these directions require multi-episode training, held-out evaluation, and artifact-level evidence.