Four-Direction Task Taxonomy

This file is generated by `scripts/research_direction_taxonomy.py` from the committed 12-task metrics. It maps the current Xperience-10M sample tasks to the four Ropedia research directions and marks which parts require multi-episode evidence.

Baseline Families

Baseline	Meaning
Minimal	Interpretable softmax, logistic, ridge, and retrieval heads over the 8,546-d window feature vector.
Neural MLP	Small PyTorch MLP classifiers/regressors using the same features, splits, and task contracts.
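
To make the "Minimal" family concrete, the following is a rough sketch of one such head: a closed-form ridge regressor over a window feature matrix. It is not taken from the repository; the function names `fit_ridge_head` and `predict_ridge_head`, the `alpha` value, the absence of a bias term, and the 16-d stand-in features (in place of the 8,546-d vector) are all assumptions made for illustration.

```python
import numpy as np

def fit_ridge_head(X, Y, alpha=1.0):
    """Closed-form ridge weights: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

def predict_ridge_head(X, W):
    """Linear prediction; each weight maps one feature to one target."""
    return X @ W

# Synthetic stand-in data (hypothetical, not the Xperience-10M sample).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))       # 200 windows, 16-d features
W_true = rng.normal(size=(16, 3))    # 3 regression targets
Y = X @ W_true                       # noise-free targets
W = fit_ridge_head(X, Y, alpha=1e-6)
Y_hat = predict_ridge_head(X, W)
```

Because the head is a single weight matrix, each learned coefficient can be inspected directly, which is presumably what "interpretable" refers to in the table above.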

Direction Coverage

Direction	Current status	Direct	Proxy	Diagnostic	Current readout
A. Human Modeling & Motion Understanding	partially implemented	2	2	0	The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors.
B. 3D/4D Reconstruction & Neural Rendering	proxy tasks only	0	2	1	The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry.
C. Egocentric Vision & Interaction	strongest implemented track	6	2	3	Nine of the 12 tasks list C as their primary direction, and six target it directly, covering egocentric action, task state, interaction, grounding, and alignment.
D. Scene Reconstruction & World Modeling	early proxy tasks	0	6	3	The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs.

Task Mapping With Two Baselines

Task	Artifact id	Primary direction	Related directions	Minimal	Neural MLP	Readout
Action Recognition	`timeline_action`	C	C:direct, A:proxy	0.0500 macro-F1	0.0148 macro-F1	Minimal baseline is stronger. Chronological single-episode split creates unseen future action classes.
Procedure Step Recognition	`timeline_subtask`	C	C:direct, D:proxy	0.0506 macro-F1	0.0281 macro-F1	Minimal baseline is stronger. Single-episode ordering makes future subtasks hard to generalize.
Action Boundary Detection	`transition_detection`	C	C:direct, D:diagnostic	0.6118 macro-F1	0.5862 macro-F1	Minimal baseline is stronger. Boundary class is sparse, so accuracy alone is misleading.
Next-Action Prediction	`next_action`	C	C:direct, D:proxy	0.0593 macro-F1	0.0419 macro-F1	Minimal baseline is stronger. Unseen future labels dominate the single-episode chronological test.
Hand Trajectory Forecasting	`hand_trajectory_forecast`	A	A:direct, C:proxy	0.8647 MPJPE	0.1079 MPJPE	Neural MLP is stronger (lower MPJPE is better). Forecasting is window-level and not yet a full sequence or policy model.
Contact State Prediction	`contact_prediction`	A	A:direct, C:proxy	1.0000 macro-F1	1.0000 macro-F1	Both baselines are tied. The public sample is degenerate for this target because one class dominates, so the perfect scores are not informative.
Object Relevance Prediction	`object_relevance`	C	C:direct, A:proxy, D:proxy	0.1803 micro-F1	0.1679 micro-F1	Minimal baseline is stronger. Object labels are language-derived and sparse in one episode.
Language Grounding	`caption_grounding`	C	C:direct, D:proxy	0.0160 MRR	0.0168 MRR	Neural MLP is marginally stronger (by 0.0008 MRR). Bag-of-objects language features are too weak for rich grounding.
Cross-Modal Retrieval	`cross_modal_retrieval`	C	C:diagnostic, B:proxy, D:proxy	0.2693 MRR	0.1300 MRR	Minimal baseline is stronger. Retrieval shows an alignment signal, not geometric reconstruction.
Cross-Modal Reconstruction	`modality_reconstruction`	B	B:proxy, D:proxy	-0.0153 R2	-0.0102 R2	Neural MLP is stronger, but both R2 values are negative, so neither baseline beats a constant mean predictor. Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction.
Temporal Order Verification	`temporal_order`	C	C:diagnostic, D:diagnostic	0.5400 F1	0.8520 F1	Neural MLP is stronger. Only local adjacent ordering, not long-horizon causal modeling.
Multimodal Synchronization Detection	`misalignment_detection`	C	C:diagnostic, B:diagnostic, D:diagnostic	0.5052 F1	0.7153 F1	Neural MLP is stronger. Synthetic shifts diagnose alignment but do not solve calibration or mapping.
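
As a consistency check, the Direct/Proxy/Diagnostic counts in the Direction Coverage table can be recomputed from the "Related directions" column above. The sketch below is one way to do that tally; it is not necessarily how `scripts/research_direction_taxonomy.py` derives the counts, and the names `RELATED` and `tally_roles` are hypothetical.

```python
from collections import Counter

# "Related directions" column, copied from the task mapping table.
RELATED = {
    "timeline_action": "C:direct, A:proxy",
    "timeline_subtask": "C:direct, D:proxy",
    "transition_detection": "C:direct, D:diagnostic",
    "next_action": "C:direct, D:proxy",
    "hand_trajectory_forecast": "A:direct, C:proxy",
    "contact_prediction": "A:direct, C:proxy",
    "object_relevance": "C:direct, A:proxy, D:proxy",
    "caption_grounding": "C:direct, D:proxy",
    "cross_modal_retrieval": "C:diagnostic, B:proxy, D:proxy",
    "modality_reconstruction": "B:proxy, D:proxy",
    "temporal_order": "C:diagnostic, D:diagnostic",
    "misalignment_detection": "C:diagnostic, B:diagnostic, D:diagnostic",
}

def tally_roles(related):
    """Count how many tasks play each role for each direction A-D."""
    counts = {direction: Counter() for direction in "ABCD"}
    for spec in related.values():
        for item in spec.split(","):
            direction, role = item.strip().split(":")
            counts[direction][role] += 1
    return counts

coverage = tally_roles(RELATED)
for d in "ABCD":
    print(d, coverage[d]["direct"], coverage[d]["proxy"], coverage[d]["diagnostic"])
# A 2 2 0
# B 0 2 1
# C 6 2 3
# D 0 6 3
```

The printed rows match the Direct, Proxy, and Diagnostic columns of the coverage table.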

Next-Step Interpretation

A. Human Modeling & Motion Understanding

The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors.

Add SMPL/SMPL-X or MANO-style body/hand parameter targets where available.
Train sequence models over multi-episode motion trajectories instead of isolated windows.
Evaluate affordance prediction on held-out objects and held-out episodes.
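
The Hand Trajectory Forecasting row reports MPJPE (mean per-joint position error), and a multi-episode sequence model would presumably be scored the same way. A minimal sketch of the metric follows, assuming predictions and targets are arrays shaped `(frames, joints, 3)`; the array layout, the 21-joint hand, and the units are assumptions, since the source does not state them.

```python
import numpy as np

def mpjpe(pred, target):
    """Mean Euclidean distance between predicted and true joint positions.

    Both arrays are assumed to have shape (..., joints, 3); the mean is
    taken over every joint in every frame. Lower is better.
    """
    per_joint_error = np.linalg.norm(pred - target, axis=-1)
    return float(per_joint_error.mean())

# Hypothetical example: every joint is off by the vector (3, 4, 0).
target = np.zeros((2, 21, 3))
pred = target + np.array([3.0, 4.0, 0.0])
error = mpjpe(pred, target)
print(error)  # 5.0
```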

B. 3D/4D Reconstruction & Neural Rendering

The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry.

Use calibrated multi-view video plus SLAM pose to build per-episode camera trajectories.
Add depth-supervised point clouds, TSDF, Gaussian Splatting, or NeRF baselines.
Evaluate novel-view synthesis and temporal consistency across held-out views/time.

C. Egocentric Vision & Interaction

Nine of the 12 tasks list C as their primary direction, and six target it directly, covering egocentric action, task state, interaction, grounding, and alignment.

Move from single-episode chronological splits to held-out-episode splits.
Use audio together with stronger multimodal backbones for action, intent, and grounding.
Evaluate long-horizon task success prediction and action-conditioned generation.
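
The first step above, moving from chronological to held-out-episode splits, can be sketched as follows. This is an illustration only: the window fields `episode`, `t`, and `label`, both function names, and the toy action labels are hypothetical, not the repository's task contracts.

```python
def chronological_split(windows, train_fraction=0.8):
    """Train on the earliest windows and test on the latest ones."""
    ordered = sorted(windows, key=lambda w: w["t"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

def held_out_episode_split(windows, test_episodes):
    """Keep whole episodes out of training so no episode spans both sets."""
    test_episodes = set(test_episodes)
    train = [w for w in windows if w["episode"] not in test_episodes]
    test = [w for w in windows if w["episode"] in test_episodes]
    return train, test

# One toy episode whose actions progress over time.
actions = ["reach", "reach", "grasp", "grasp", "pour",
           "pour", "place", "place", "wipe", "wipe"]
single = [{"episode": "ep0", "t": i, "label": a} for i, a in enumerate(actions)]

train, test = chronological_split(single, train_fraction=0.5)
unseen = {w["label"] for w in test} - {w["label"] for w in train}
print(sorted(unseen))  # ['place', 'wipe']

# Three toy episodes that repeat the same actions.
multi = [{"episode": f"ep{e}", "t": i, "label": a}
         for e in range(3) for i, a in enumerate(actions)]
train_m, test_m = held_out_episode_split(multi, test_episodes=["ep2"])
unseen_m = {w["label"] for w in test_m} - {w["label"] for w in train_m}
print(sorted(unseen_m))  # []
```

In the toy data, the chronological split leaves test labels that never appear in training, which mirrors the "unseen future action classes" readout in the task mapping. The held-out-episode split avoids this only when the training episodes cover the same actions.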

D. Scene Reconstruction & World Modeling

The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs.

Convert windows into persistent object/scene-state nodes with timestamps and camera poses.
Add map consistency, object permanence, and spatial relation prediction tasks.
Train held-out-episode world models that predict future observations and task state.
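
One possible shape for the persistent nodes in the first step is sketched below. The schema is hypothetical: the `SceneStateNode` fields, the observation keys `object_id`, `t`, and `pose`, and the use of a flat 4x4 pose placeholder are assumptions, and nothing like this exists in the current tasks.

```python
from dataclasses import dataclass, field

@dataclass
class SceneStateNode:
    """A persistent record for one object, built up across windows."""
    object_id: str
    first_seen: float
    last_seen: float
    camera_poses: list = field(default_factory=list)  # one pose per observation

def merge_windows(observations):
    """Fold per-window observations into one node per object id."""
    nodes = {}
    for obs in observations:
        node = nodes.get(obs["object_id"])
        if node is None:
            node = SceneStateNode(obs["object_id"], obs["t"], obs["t"])
            nodes[obs["object_id"]] = node
        node.first_seen = min(node.first_seen, obs["t"])
        node.last_seen = max(node.last_seen, obs["t"])
        node.camera_poses.append(obs["pose"])
    return nodes

# Placeholder pose: a 4x4 identity matrix as nested lists.
IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
observations = [
    {"object_id": "cup", "t": 1.0, "pose": IDENTITY},
    {"object_id": "knife", "t": 2.0, "pose": IDENTITY},
    {"object_id": "cup", "t": 3.0, "pose": IDENTITY},
]
nodes = merge_windows(observations)
```

Keeping first and last timestamps per node is the minimum needed to pose object permanence questions such as "was this object seen before time t"; spatial relations would need object positions in addition to camera poses.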