# Four-Direction Task Taxonomy

This file is generated by `scripts/research_direction_taxonomy.py` from the committed 12-task metrics. It maps the current Xperience-10M sample tasks to the four Ropedia research directions and marks which parts require multi-episode evidence.

## Baseline Families

| Baseline | Meaning |
| --- | --- |
| Minimal | Interpretable softmax, logistic, ridge, and retrieval heads over the 8,546-d window feature vector. |
| Neural MLP | Small PyTorch MLP classifiers/regressors using the same features, splits, and task contracts. |

## Direction Coverage

| Direction | Current status | Direct | Proxy | Diagnostic | Current readout |
| --- | --- | ---: | ---: | ---: | --- |
| A. Human Modeling & Motion Understanding | partially implemented | 2 | 2 | 0 | The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors. |
| B. 3D/4D Reconstruction & Neural Rendering | proxy tasks only | 0 | 2 | 1 | The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry. |
| C. Egocentric Vision & Interaction | strongest implemented track | 6 | 2 | 3 | Most of the 12 tasks directly target egocentric action, task state, interaction, grounding, and alignment. |
| D. Scene Reconstruction & World Modeling | early proxy tasks | 0 | 6 | 3 | The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs. |

## Task Mapping With Two Baselines

| Task | Artifact id | Primary direction | Related directions | Minimal | Neural MLP | Readout |
| --- | --- | --- | --- | ---: | ---: | --- |
| Action Recognition | `timeline_action` | C | C:direct, A:proxy | 0.0500 macro-F1 | 0.0148 macro-F1 | Minimal baseline is stronger. Chronological single-episode split creates unseen future action classes. |
| Procedure Step Recognition | `timeline_subtask` | C | C:direct, D:proxy | 0.0506 macro-F1 | 0.0281 macro-F1 | Minimal baseline is stronger. Single-episode ordering makes future subtasks hard to generalize. |
| Action Boundary Detection | `transition_detection` | C | C:direct, D:diagnostic | 0.6118 macro-F1 | 0.5862 macro-F1 | Minimal baseline is stronger. Boundary class is sparse, so accuracy alone is misleading. |
| Next-Action Prediction | `next_action` | C | C:direct, D:proxy | 0.0593 macro-F1 | 0.0419 macro-F1 | Minimal baseline is stronger. Unseen future labels dominate the single-episode chronological test. |
| Hand Trajectory Forecasting | `hand_trajectory_forecast` | A | A:direct, C:proxy | 0.8647 MPJPE | 0.1079 MPJPE | Neural MLP is stronger. Forecasting is window-level and not yet a full sequence or policy model. |
| Contact State Prediction | `contact_prediction` | A | A:direct, C:proxy | 1.0000 macro-F1 | 1.0000 macro-F1 | Both baselines are tied. The public sample is degenerate for this target because one class dominates. |
| Object Relevance Prediction | `object_relevance` | C | C:direct, A:proxy, D:proxy | 0.1803 micro-F1 | 0.1679 micro-F1 | Minimal baseline is stronger. Object labels are language-derived and sparse in one episode. |
| Language Grounding | `caption_grounding` | C | C:direct, D:proxy | 0.0160 MRR | 0.0168 MRR | Neural MLP is stronger. Bag-of-objects language features are too weak for rich grounding. |
| Cross-Modal Retrieval | `cross_modal_retrieval` | C | C:diagnostic, B:proxy, D:proxy | 0.2693 MRR | 0.1300 MRR | Minimal baseline is stronger. Retrieval shows an alignment signal, not geometric reconstruction. |
| Cross-Modal Reconstruction | `modality_reconstruction` | B | B:proxy, D:proxy | -0.0153 R2 | -0.0102 R2 | Neural MLP is stronger. Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction. |
| Temporal Order Verification | `temporal_order` | C | C:diagnostic, D:diagnostic | 0.5400 F1 | 0.8520 F1 | Neural MLP is stronger. Only local adjacent ordering, not long-horizon causal modeling. |
| Multimodal Synchronization Detection | `misalignment_detection` | C | C:diagnostic, B:diagnostic, D:diagnostic | 0.5052 F1 | 0.7153 F1 | Neural MLP is stronger. Synthetic shifts diagnose alignment but do not solve calibration or mapping. |

## Next-Step Interpretation

### A. Human Modeling & Motion Understanding

The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors.

- Add SMPL/SMPL-X or MANO-style body/hand parameter targets where available.
- Train sequence models over multi-episode motion trajectories instead of isolated windows.
- Evaluate affordance prediction on held-out objects and held-out episodes.

### B. 3D/4D Reconstruction & Neural Rendering

The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry.

- Use calibrated multi-view video plus SLAM pose to build per-episode camera trajectories.
- Add depth-supervised point clouds, TSDF, Gaussian Splatting, or NeRF baselines.
- Evaluate novel-view synthesis and temporal consistency across held-out views/time.

### C. Egocentric Vision & Interaction

Most of the 12 tasks directly target egocentric action, task state, interaction, grounding, and alignment.

- Move from single-episode chronological splits to held-out-episode splits.
- Use audio together with stronger multimodal backbones for action, intent, and grounding.
- Evaluate long-horizon task success prediction and action-conditioned generation.

### D. Scene Reconstruction & World Modeling

The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs.

- Convert windows into persistent object/scene-state nodes with timestamps and camera poses.
- Add map consistency, object permanence, and spatial relation prediction tasks.
- Train held-out-episode world models that predict future observations and task state.
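
Several of the next steps above depend on held-out-episode splits instead of the current single-episode chronological split. The following is a minimal sketch of such a split, not the project's implementation. It assumes each window is a dict with hypothetical `episode_id` and `label` keys; the function names `held_out_episode_split` and `unseen_label_rate` are illustrative and do not come from `scripts/research_direction_taxonomy.py`.

```python
import random
from collections import defaultdict


def held_out_episode_split(windows, test_fraction=0.25, seed=0):
    """Split windows so that no episode contributes to both train and test.

    `windows` is a list of dicts with a hypothetical `episode_id` key.
    """
    by_episode = defaultdict(list)
    for window in windows:
        by_episode[window["episode_id"]].append(window)

    episode_ids = sorted(by_episode)
    if len(episode_ids) < 2:
        # A single episode cannot be held out from itself.
        raise ValueError("held-out-episode split needs at least two episodes")

    rng = random.Random(seed)
    rng.shuffle(episode_ids)

    # Keep at least one episode on each side of the split.
    n_test = max(1, round(len(episode_ids) * test_fraction))
    n_test = min(n_test, len(episode_ids) - 1)

    test = [w for eid in episode_ids[:n_test] for w in by_episode[eid]]
    train = [w for eid in episode_ids[n_test:] for w in by_episode[eid]]
    return train, test


def unseen_label_rate(train, test):
    """Fraction of test windows whose (hypothetical) `label` never occurs in train."""
    seen = {w["label"] for w in train}
    if not test:
        return 0.0
    return sum(w["label"] not in seen for w in test) / len(test)
```

Reporting something like `unseen_label_rate` next to macro-F1 would show how much of a low score comes from labels the model never saw in training, which the readouts above identify as the main limitation of the chronological single-episode split.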