File size: 3,137 Bytes
d96f266 eeac43c d96f266 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | # Additional Development Directions
This note records concrete directions that can grow from Xperience-10M beyond
the current minimal baselines, Qwen3-Omni LoRA plan, Cosmos/world-model branch,
and long-term Xperience-native pretraining goal. These are project directions,
not completed benchmark results.
| Direction | What to build first | Why it matters |
| --- | --- | --- |
| Episode taxonomy and data engine | Episode atlas, category tags, balance report, and split builder across activities, objects, scenes, people, sessions, and missing modalities. | Fine-tuning quality depends on selecting representative episodes instead of sampling randomly from a large corpus. |
| Standardized benchmark protocol | Fixed train/val/test manifests, task cards, leakage checks, metric scripts, and small reference baselines. | Makes future model results comparable across Qwen, Cosmos-style world models, policy models, and smaller task heads. |
| Multimodal representation learning | Contrastive and masked-prediction objectives over video, audio, depth, pose, mocap, IMU, and language windows. | Turns Xperience-10M into a reusable encoder-learning dataset before committing to expensive large-model training. |
| Skill and procedure graph mining | Segment actions into steps, transitions, preconditions, effects, and temporal skill graphs. | Connects egocentric perception to task structure, planning, and long-horizon embodied reasoning. |
| Human-object interaction and affordance modeling | Contact, hand-object state, reachable object, likely tool use, and next-affordance prediction tasks. | Uses the dataset's hands, mocap, objects, contacts, and language to model what actions the scene affords. |
| 3D/4D scene and object memory | Fuse depth, pose/SLAM, multiview video, and object cues into persistent scene/object maps. | Moves beyond frame-level recognition toward world-state tracking, object permanence, and spatial reasoning. |
| Data quality, synchronization, and missing-modality diagnostics | Per-episode QA for timestamp drift, camera/audio/depth availability, calibration consistency, and corrupted files. | Large multimodal training fails quietly without strong data-quality gates; this should become a first-class artifact. |
| Policy, retargeting, and simulation transfer | Convert mocap/hand/contact traces into action tokens, robot-compatible targets, imitation-learning data, and simulation probes. | Creates a bridge from human egocentric experience to robot policies while keeping action-space assumptions explicit. |
## Practical Order
1. Build the episode taxonomy and data-quality diagnostics first.
2. Lock the benchmark protocol and split manifests before reporting model scores.
3. Add representation-learning and skill-graph objectives once enough episodes
are staged.
4. Add affordance, 3D/4D memory, and policy-retargeting branches after the
labels and action targets are measurable.
The current public sample is useful for prototyping the contracts and visual
explanations. Strong claims for these directions require multi-episode training,
held-out evaluation, and artifact-level evidence.
|