# Additional Development Directions

This note records concrete directions that can grow from Xperience-10M beyond
the current minimal baselines, the Qwen3-Omni LoRA plan, the Cosmos/world-model
branch, and the long-term Xperience-native pretraining goal. These are project
directions, not completed benchmark results.

| Direction | What to build first | Why it matters |
| --- | --- | --- |
| Episode taxonomy and data engine | Episode atlas, category tags, balance report, and split builder across activities, objects, scenes, people, sessions, and missing modalities. | Fine-tuning quality depends on selecting representative episodes instead of sampling randomly from a large corpus. |
| Standardized benchmark protocol | Fixed train/val/test manifests, task cards, leakage checks, metric scripts, and small reference baselines. | Makes future model results comparable across Qwen, Cosmos-style world models, policy models, and smaller task heads. |
| Multimodal representation learning | Contrastive and masked-prediction objectives over video, audio, depth, pose, mocap, IMU, and language windows. | Turns Xperience-10M into a reusable encoder-learning dataset before committing to expensive large-model training. |
| Skill and procedure graph mining | Segment actions into steps, transitions, preconditions, effects, and temporal skill graphs. | Connects egocentric perception to task structure, planning, and long-horizon embodied reasoning. |
| Human-object interaction and affordance modeling | Contact, hand-object state, reachable object, likely tool use, and next-affordance prediction tasks. | Uses the dataset's hands, mocap, objects, contacts, and language to model what actions the scene affords. |
| 3D/4D scene and object memory | Fuse depth, pose/SLAM, multiview video, and object cues into persistent scene/object maps. | Moves beyond frame-level recognition toward world-state tracking, object permanence, and spatial reasoning. |
| Data quality, synchronization, and missing-modality diagnostics | Per-episode QA for timestamp drift, camera/audio/depth availability, calibration consistency, and corrupted files. | Large multimodal training fails silently without strong data-quality gates, so these diagnostics should be a first-class artifact. |
| Policy, retargeting, and simulation transfer | Convert mocap/hand/contact traces into action tokens, robot-compatible targets, imitation-learning data, and simulation probes. | Creates a bridge from human egocentric experience to robot policies while keeping action-space assumptions explicit. |

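As one illustration of the per-episode QA row above, the following is a minimal sketch of a missing-modality and timestamp-drift check. It is not the project's actual tooling: the modality names, the `EpisodeReport` type, the `diagnose_episode` function, and the 50 ms threshold are all assumptions made for the example, and the real Xperience-10M schema may differ. Drift is measured here as the change in a stream's offset from a reference stream between the first and last sample.

```python
from dataclasses import dataclass, field

# Hypothetical modality names; the real dataset schema may differ.
EXPECTED_MODALITIES = ("video", "audio", "depth", "imu")


@dataclass
class EpisodeReport:
    episode_id: str
    missing: list = field(default_factory=list)   # expected streams that are absent or empty
    drift_s: dict = field(default_factory=dict)   # modality -> drift vs. reference, in seconds
    flagged: list = field(default_factory=list)   # modalities whose drift exceeds the threshold

    @property
    def ok(self):
        return not self.missing and not self.flagged


def diagnose_episode(episode_id, timestamps, reference="video", max_drift_s=0.05):
    """Check one episode.

    timestamps: dict mapping modality name -> sorted list of timestamps (seconds).
    max_drift_s: assumed tolerance; tune it per sensor in practice.
    """
    report = EpisodeReport(episode_id)
    for name in EXPECTED_MODALITIES:
        if not timestamps.get(name):
            report.missing.append(name)

    ref = timestamps.get(reference)
    if not ref:
        return report  # drift cannot be measured without the reference stream

    for name, ts in timestamps.items():
        if name == reference or not ts:
            continue
        start_offset = ts[0] - ref[0]
        end_offset = ts[-1] - ref[-1]
        drift = end_offset - start_offset  # how far the offset moved over the episode
        report.drift_s[name] = drift
        if abs(drift) > max_drift_s:
            report.flagged.append(name)
    return report
```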
## Practical Order

1. Build the episode taxonomy and data-quality diagnostics first.
2. Lock the benchmark protocol and split manifests before reporting model scores.
3. Add representation-learning and skill-graph objectives once enough episodes
   are staged.
4. Add affordance, 3D/4D memory, and policy-retargeting branches after the
   labels and action targets are measurable.

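Step 2 can be prototyped before any model is trained. The sketch below assumes each episode record carries a hypothetical `episode_id` and `session_id` field, and it assigns whole sessions to a split from a hash of the session ID so that manifests stay fixed across runs. The function names and the 80/10/10 proportions are illustrative assumptions, not the project's protocol. `find_leaks` is the matching leakage check: it returns every session whose episodes ended up in more than one split, so an empty list is the condition to enforce before manifests are frozen.

```python
import hashlib


def assign_split(session_id, val_pct=10, test_pct=10):
    """Map a session to a split deterministically from a hash of its ID."""
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def build_manifests(episodes):
    """Group episode IDs into train/val/test, keeping each session in one split."""
    manifests = {"train": [], "val": [], "test": []}
    for ep in episodes:
        manifests[assign_split(ep["session_id"])].append(ep["episode_id"])
    return manifests


def find_leaks(episodes, manifests):
    """Return the session IDs that appear in more than one split."""
    split_of_episode = {
        eid: split for split, ids in manifests.items() for eid in ids
    }
    splits_per_session = {}
    for ep in episodes:
        split = split_of_episode[ep["episode_id"]]
        splits_per_session.setdefault(ep["session_id"], set()).add(split)
    return sorted(s for s, splits in splits_per_session.items() if len(splits) > 1)
```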
The current public sample is useful for prototyping the contracts and visual
explanations. Strong claims for these directions require multi-episode training,
held-out evaluation, and artifact-level evidence.