Four-Direction Task Taxonomy

This file is generated by scripts/research_direction_taxonomy.py from the committed 12-task metrics. It maps the current Xperience-10M sample tasks to the four Ropedia research directions and marks which parts require multi-episode evidence.

Baseline Families

| Baseline | Meaning |
| --- | --- |
| Minimal | Interpretable softmax, logistic, ridge, and retrieval heads over the 8,546-d window feature vector. |
| Neural MLP | Small PyTorch MLP classifiers/regressors using the same features, splits, and task contracts. |
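
As an illustration of what a retrieval head and its MRR score involve, here is a minimal sketch. It assumes cosine similarity as the ranking score and uses toy 3-d vectors in place of the 8,546-d window features; the function names are hypothetical and are not taken from the repository.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_reciprocal_rank(queries, gallery, targets):
    """MRR of a nearest-neighbour retrieval head.

    queries: list of query vectors
    gallery: list of candidate vectors
    targets: targets[i] is the gallery index that is correct for queries[i]
    """
    total = 0.0
    for query, target in zip(queries, targets):
        scores = [cosine(query, item) for item in gallery]
        # Rank = 1 + number of candidates scoring strictly higher than the target.
        rank = 1 + sum(s > scores[target] for s in scores)
        total += 1.0 / rank
    return total / len(queries)


# Toy data (assumption): three gallery items, two queries.
gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries = [[0.9, 0.1, 0.0], [0.6, 0.5, 0.0]]
targets = [0, 1]

mrr = mean_reciprocal_rank(queries, gallery, targets)
print(f"MRR = {mrr:.4f}")
```

The first query ranks its target first and the second ranks it second, so the toy MRR is the mean of 1 and 1/2.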

Direction Coverage

| Direction | Current status | Direct | Proxy | Diagnostic | Current readout |
| --- | --- | --- | --- | --- | --- |
| A. Human Modeling & Motion Understanding | partially implemented | 2 | 2 | 0 | The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors. |
| B. 3D/4D Reconstruction & Neural Rendering | proxy tasks only | 0 | 2 | 1 | The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry. |
| C. Egocentric Vision & Interaction | strongest implemented track | 6 | 2 | 3 | Most of the 12 tasks directly target egocentric action, task state, interaction, grounding, and alignment. |
| D. Scene Reconstruction & World Modeling | early proxy tasks | 0 | 6 | 3 | The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs. |
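
The Direct, Proxy, and Diagnostic counts can be cross-checked against the per-task "Related directions" entries in the task mapping. The sketch below is one way to do that tally; it is not the logic of scripts/research_direction_taxonomy.py, and the `RELATED` dictionary and `tally` helper are names introduced here for illustration.

```python
from collections import Counter

# Related-direction strings copied from the task mapping table, keyed by artifact id.
RELATED = {
    "timeline_action": "C:direct, A:proxy",
    "timeline_subtask": "C:direct, D:proxy",
    "transition_detection": "C:direct, D:diagnostic",
    "next_action": "C:direct, D:proxy",
    "hand_trajectory_forecast": "A:direct, C:proxy",
    "contact_prediction": "A:direct, C:proxy",
    "object_relevance": "C:direct, A:proxy, D:proxy",
    "caption_grounding": "C:direct, D:proxy",
    "cross_modal_retrieval": "C:diagnostic, B:proxy, D:proxy",
    "modality_reconstruction": "B:proxy, D:proxy",
    "temporal_order": "C:diagnostic, D:diagnostic",
    "misalignment_detection": "C:diagnostic, B:diagnostic, D:diagnostic",
}


def tally(related):
    """Count how many tasks touch each direction in each role."""
    counts = {direction: Counter() for direction in "ABCD"}
    for spec in related.values():
        for item in spec.split(","):
            direction, role = item.strip().split(":")
            counts[direction][role] += 1
    return counts


counts = tally(RELATED)
for direction in "ABCD":
    c = counts[direction]
    print(direction, c["direct"], c["proxy"], c["diagnostic"])
```

Running the tally on the 12 rows reproduces the counts listed for directions A through D.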

Task Mapping With Two Baselines

| Task | Artifact id | Primary direction | Related directions | Minimal | Neural MLP | Readout |
| --- | --- | --- | --- | --- | --- | --- |
| Action Recognition | timeline_action | C | C:direct, A:proxy | 0.0500 macro-F1 | 0.0148 macro-F1 | Minimal baseline is stronger. Chronological single-episode split creates unseen future action classes. |
| Procedure Step Recognition | timeline_subtask | C | C:direct, D:proxy | 0.0506 macro-F1 | 0.0281 macro-F1 | Minimal baseline is stronger. Single-episode ordering makes future subtasks hard to generalize. |
| Action Boundary Detection | transition_detection | C | C:direct, D:diagnostic | 0.6118 macro-F1 | 0.5862 macro-F1 | Minimal baseline is stronger. Boundary class is sparse, so accuracy alone is misleading. |
| Next-Action Prediction | next_action | C | C:direct, D:proxy | 0.0593 macro-F1 | 0.0419 macro-F1 | Minimal baseline is stronger. Unseen future labels dominate the single-episode chronological test. |
| Hand Trajectory Forecasting | hand_trajectory_forecast | A | A:direct, C:proxy | 0.8647 MPJPE | 0.1079 MPJPE | Neural MLP is stronger. Forecasting is window-level and not yet a full sequence or policy model. |
| Contact State Prediction | contact_prediction | A | A:direct, C:proxy | 1.0000 macro-F1 | 1.0000 macro-F1 | Both baselines are tied. The public sample is degenerate for this target because one class dominates. |
| Object Relevance Prediction | object_relevance | C | C:direct, A:proxy, D:proxy | 0.1803 micro-F1 | 0.1679 micro-F1 | Minimal baseline is stronger. Object labels are language-derived and sparse in one episode. |
| Language Grounding | caption_grounding | C | C:direct, D:proxy | 0.0160 MRR | 0.0168 MRR | Neural MLP is stronger. Bag-of-objects language features are too weak for rich grounding. |
| Cross-Modal Retrieval | cross_modal_retrieval | C | C:diagnostic, B:proxy, D:proxy | 0.2693 MRR | 0.1300 MRR | Minimal baseline is stronger. Retrieval shows an alignment signal, not geometric reconstruction. |
| Cross-Modal Reconstruction | modality_reconstruction | B | B:proxy, D:proxy | -0.0153 R2 | -0.0102 R2 | Neural MLP is stronger. Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction. |
| Temporal Order Verification | temporal_order | C | C:diagnostic, D:diagnostic | 0.5400 F1 | 0.8520 F1 | Neural MLP is stronger. Only local adjacent ordering, not long-horizon causal modeling. |
| Multimodal Synchronization Detection | misalignment_detection | C | C:diagnostic, B:diagnostic, D:diagnostic | 0.5052 F1 | 0.7153 F1 | Neural MLP is stronger. Synthetic shifts diagnose alignment but do not solve calibration or mapping. |
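
The "stronger" verdicts depend on metric direction: MPJPE is an error, so lower is better, while F1, MRR, and R2 are higher-is-better. The sketch below shows one way such a verdict could be derived. The `stronger` helper and the `LOWER_IS_BETTER` set are assumptions made for illustration, not the generator's actual code.

```python
# Metrics where a smaller value means a better model (assumption: only MPJPE here).
LOWER_IS_BETTER = {"MPJPE"}


def stronger(minimal, neural, metric, tol=1e-12):
    """Return which baseline wins for a metric, honouring its direction."""
    if abs(minimal - neural) <= tol:
        return "tied"
    if metric in LOWER_IS_BETTER:
        minimal_wins = minimal < neural
    else:
        minimal_wins = minimal > neural
    return "Minimal" if minimal_wins else "Neural MLP"


# A few rows copied from the task mapping: (artifact id, Minimal, Neural MLP, metric).
rows = [
    ("hand_trajectory_forecast", 0.8647, 0.1079, "MPJPE"),
    ("contact_prediction", 1.0000, 1.0000, "macro-F1"),
    ("cross_modal_retrieval", 0.2693, 0.1300, "MRR"),
    ("modality_reconstruction", -0.0153, -0.0102, "R2"),
]

verdicts = {name: stronger(a, b, metric) for name, a, b, metric in rows}
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

Note that the reconstruction row compares two negative R2 values, so the "stronger" model there is still worse than predicting the mean.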

Next-Step Interpretation

A. Human Modeling & Motion Understanding

The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors.

  • Add SMPL/SMPL-X or MANO-style body/hand parameter targets where available.
  • Train sequence models over multi-episode motion trajectories instead of isolated windows.
  • Evaluate affordance prediction on held-out objects and held-out episodes.

B. 3D/4D Reconstruction & Neural Rendering

The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry.

  • Use calibrated multi-view video plus SLAM pose to build per-episode camera trajectories.
  • Add depth-supervised point clouds, TSDF, Gaussian Splatting, or NeRF baselines.
  • Evaluate novel-view synthesis and temporal consistency across held-out views/time.

C. Egocentric Vision & Interaction

Most of the 12 tasks directly target egocentric action, task state, interaction, grounding, and alignment.

  • Move from single-episode chronological splits to held-out-episode splits.
  • Use audio together with stronger multimodal backbones for action, intent, and grounding.
  • Evaluate long-horizon task success prediction and action-conditioned generation.
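
To make the first point concrete, the sketch below contrasts a chronological split inside one episode with a held-out-episode split and reports which test labels never appear in training. The window fields (`episode`, `t`, `label`) and the toy labels are hypothetical and do not reflect the actual sample schema.

```python
def chronological_split(windows, train_fraction=0.8):
    """Train on the earliest windows, test on the latest ones."""
    ordered = sorted(windows, key=lambda w: w["t"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def held_out_episode_split(windows, test_episodes):
    """Train and test on disjoint sets of whole episodes."""
    train = [w for w in windows if w["episode"] not in test_episodes]
    test = [w for w in windows if w["episode"] in test_episodes]
    return train, test


def unseen_labels(train, test):
    """Labels that occur in the test set but never in the training set."""
    return {w["label"] for w in test} - {w["label"] for w in train}


# Toy data (assumption): two short episodes of the same procedure.
ep0 = [{"episode": "ep0", "t": t, "label": lab}
       for t, lab in enumerate(["reach", "reach", "grasp", "grasp", "pour"])]
ep1 = [{"episode": "ep1", "t": t, "label": lab}
       for t, lab in enumerate(["reach", "grasp", "pour"])]

# Single-episode chronological split: the last action is never seen in training.
train_c, test_c = chronological_split(ep0)
chrono_unseen = unseen_labels(train_c, test_c)

# Held-out-episode split: every test label also occurs in the training episode.
train_h, test_h = held_out_episode_split(ep0 + ep1, {"ep1"})
held_out_unseen = unseen_labels(train_h, test_h)

print("chronological unseen:", chrono_unseen)
print("held-out-episode unseen:", held_out_unseen)
```

In the toy case the chronological test set consists only of a label absent from training, which is the failure mode the action and next-action readouts describe.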

D. Scene Reconstruction & World Modeling

The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs.

  • Convert windows into persistent object/scene-state nodes with timestamps and camera poses.
  • Add map consistency, object permanence, and spatial relation prediction tasks.
  • Train held-out-episode world models that predict future observations and task state.