# Additional Development Directions

This note records concrete directions that can grow from Xperience-10M beyond the current minimal baselines, Qwen3-Omni LoRA plan, Cosmos/world-model track, and long-term Xperience-native pretraining goal. These are project directions, not completed benchmark results.

| Direction | What to build first | Why it matters |
| --- | --- | --- |
| Episode taxonomy and data engine | Episode atlas, category tags, balance report, and split builder across activities, objects, scenes, people, sessions, and missing modalities. | Fine-tuning quality depends on selecting representative episodes instead of sampling randomly from a large corpus. |
| Standardized benchmark protocol | Fixed train/val/test manifests, task cards, leakage checks, metric scripts, and small reference baselines. | Makes future model results comparable across Qwen, Cosmos-style world models, policy models, and smaller task heads. |
| Multimodal representation learning | Contrastive and masked-prediction objectives over video, audio, depth, pose, mocap, IMU, and language windows. | Turns Xperience-10M into a reusable encoder-learning dataset before committing to expensive large-model training. |
| Skill and procedure graph mining | Segment actions into steps, transitions, preconditions, effects, and temporal skill graphs. | Connects egocentric perception to task structure, planning, and long-horizon embodied reasoning. |
| Human-object interaction and affordance modeling | Contact, hand-object state, reachable object, likely tool use, and next-affordance prediction tasks. | Uses the dataset's hands, mocap, objects, contacts, and language to model what actions the scene affords. |
| 3D/4D scene and object memory | Fuse depth, pose/SLAM, multiview video, and object cues into persistent scene/object maps. | Moves beyond frame-level recognition toward world-state tracking, object permanence, and spatial reasoning. |
| Data quality, synchronization, and missing-modality diagnostics | Per-episode QA for timestamp drift, camera/audio/depth availability, calibration consistency, and corrupted files. | Large multimodal training fails quietly without strong data-quality gates; this should become a first-class artifact. |
| Policy, retargeting, and simulation transfer | Convert mocap/hand/contact traces into action tokens, robot-compatible targets, imitation-learning data, and simulation probes. | Creates a bridge from human egocentric experience to robot policies while keeping action-space assumptions explicit. |

## Practical Order

1. Build the episode taxonomy and data-quality diagnostics first.
2. Lock the benchmark protocol and split manifests before reporting model scores.
3. Add representation-learning and skill-graph objectives once enough episodes are staged.
4. Add affordance, 3D/4D memory, and policy-retargeting branches after the labels and action targets are measurable.

The current public sample is useful for prototyping the contracts and visual explanations. Stronger direction-level results require multi-episode training, held-out evaluation, and artifact-level evidence.
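
As one way to prototype the split-manifest contract and its leakage check, the sketch below assigns episodes to train/val/test by hashing a grouping key, so that every episode from the same group lands in the same split. It is a minimal illustration, not the project's split builder: the field names `episode_id` and `session_id`, the helper names, the split fractions, and the `seed` string are all hypothetical, and a real manifest would likely group by person or scene as well.

```python
import hashlib
from collections import defaultdict

SPLITS = ("train", "val", "test")


def assign_split(group_id, val_frac=0.1, test_frac=0.1, seed="v0"):
    """Map a group id to a split deterministically via a hash bucket in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{group_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"


def build_manifest(episodes, group_key="session_id", **split_kwargs):
    """Return {split: sorted episode ids}; episodes sharing a group share a split.

    `episodes` is a list of dicts; the keys used here are assumed, not the
    dataset's actual schema.
    """
    manifest = {name: [] for name in SPLITS}
    for ep in episodes:
        split = assign_split(ep[group_key], **split_kwargs)
        manifest[split].append(ep["episode_id"])
    return {name: sorted(ids) for name, ids in manifest.items()}


def find_leakage(episodes, manifest, group_key="session_id"):
    """Return {group id: set of splits} for groups that span more than one split."""
    split_of = {eid: name for name, ids in manifest.items() for eid in ids}
    seen = defaultdict(set)
    for ep in episodes:
        seen[ep[group_key]].add(split_of[ep["episode_id"]])
    return {group: splits for group, splits in seen.items() if len(splits) > 1}


if __name__ == "__main__":
    # Toy episode list standing in for a staged episode atlas.
    toy = [{"episode_id": f"ep{i}", "session_id": f"s{i % 7}"} for i in range(50)]
    manifest = build_manifest(toy)
    print({name: len(ids) for name, ids in manifest.items()})
    print("leaking groups:", find_leakage(toy, manifest))
```

Hashing the group key rather than shuffling keeps the assignment stable as new episodes are staged, which is one way to keep manifests fixed before scores are reported. With few groups the realized split sizes can drift far from the requested fractions, so a balance report would still be needed alongside it.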