ropedia-xperience-10m-task-baselines / ADDITIONAL_DEVELOPMENT_DIRECTIONS.md

Publish Ropedia Xperience-10M task baseline cards

eeac43c verified 20 days ago

3.14 kB

	# Additional Development Directions

	This note records concrete directions that can grow from Xperience-10M beyond
	the current minimal baselines, Qwen3-Omni LoRA plan, Cosmos/world-model branch,
	and long-term Xperience-native pretraining goal. These are project directions,
	not completed benchmark results.

	\| Direction \| What to build first \| Why it matters \|
	\| --- \| --- \| --- \|
	\| Episode taxonomy and data engine \| Episode atlas, category tags, balance report, and split builder across activities, objects, scenes, people, sessions, and missing modalities. \| Fine-tuning quality depends on selecting representative episodes instead of sampling randomly from a large corpus. \|
	\| Standardized benchmark protocol \| Fixed train/val/test manifests, task cards, leakage checks, metric scripts, and small reference baselines. \| Makes future model results comparable across Qwen, Cosmos-style world models, policy models, and smaller task heads. \|
	\| Multimodal representation learning \| Contrastive and masked-prediction objectives over video, audio, depth, pose, mocap, IMU, and language windows. \| Turns Xperience-10M into a reusable encoder-learning dataset before committing to expensive large-model training. \|
	\| Skill and procedure graph mining \| Segment actions into steps, transitions, preconditions, effects, and temporal skill graphs. \| Connects egocentric perception to task structure, planning, and long-horizon embodied reasoning. \|
	\| Human-object interaction and affordance modeling \| Contact, hand-object state, reachable object, likely tool use, and next-affordance prediction tasks. \| Uses the dataset's hands, mocap, objects, contacts, and language to model what actions the scene affords. \|
	\| 3D/4D scene and object memory \| Fuse depth, pose/SLAM, multiview video, and object cues into persistent scene/object maps. \| Moves beyond frame-level recognition toward world-state tracking, object permanence, and spatial reasoning. \|
	\| Data quality, synchronization, and missing-modality diagnostics \| Per-episode QA for timestamp drift, camera/audio/depth availability, calibration consistency, and corrupted files. \| Large multimodal training fails quietly without strong data-quality gates; this should become a first-class artifact. \|
	\| Policy, retargeting, and simulation transfer \| Convert mocap/hand/contact traces into action tokens, robot-compatible targets, imitation-learning data, and simulation probes. \| Creates a bridge from human egocentric experience to robot policies while keeping action-space assumptions explicit. \|

	## Practical Order

	1. Build the episode taxonomy and data-quality diagnostics first.
	2. Lock the benchmark protocol and split manifests before reporting model scores.
	3. Add representation-learning and skill-graph objectives once enough episodes
	are staged.
	4. Add affordance, 3D/4D memory, and policy-retargeting branches after the
	labels and action targets are measurable.

	The current public sample is useful for prototyping the contracts and visual
	explanations. Strong claims for these directions require multi-episode training,
	held-out evaluation, and artifact-level evidence.