# Four-Direction Task Taxonomy

This file is generated by `scripts/research_direction_taxonomy.py` from the committed 12-task metrics.
It maps the current Xperience-10M sample tasks to the four Ropedia research directions and marks which parts require multi-episode evidence.

## Baseline Families

| Baseline | Meaning |
| --- | --- |
| Minimal | Interpretable softmax, logistic, ridge, and retrieval heads over the 8,546-d window feature vector. |
| Neural MLP | Small PyTorch MLP classifiers/regressors using the same features, splits, and task contracts. |

## Direction Coverage

| Direction | Current status | Direct | Proxy | Diagnostic | Current readout |
| --- | --- | ---: | ---: | ---: | --- |
| A. Human Modeling & Motion Understanding | partially implemented | 2 | 2 | 0 | The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors. |
| B. 3D/4D Reconstruction & Neural Rendering | proxy tasks only | 0 | 2 | 1 | The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry. |
| C. Egocentric Vision & Interaction | strongest implemented track | 6 | 2 | 3 | Six of the 12 tasks directly target egocentric action, task state, interaction, and grounding; five more relate to this direction as proxy or diagnostic tasks, including alignment. |
| D. Scene Reconstruction & World Modeling | early proxy tasks | 0 | 6 | 3 | The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs. |
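
The Direct, Proxy, and Diagnostic counts above can be re-derived from the `Related directions` column of the task mapping. The following is a minimal sketch of that tally, not the logic of `scripts/research_direction_taxonomy.py`; the `RELATIONS` dict and the `coverage_counts` helper are illustrative names, and the relation strings are copied from this file.

```python
from collections import Counter

# Relation strings copied from the "Related directions" column of this file.
RELATIONS = {
    "timeline_action": "C:direct, A:proxy",
    "timeline_subtask": "C:direct, D:proxy",
    "transition_detection": "C:direct, D:diagnostic",
    "next_action": "C:direct, D:proxy",
    "hand_trajectory_forecast": "A:direct, C:proxy",
    "contact_prediction": "A:direct, C:proxy",
    "object_relevance": "C:direct, A:proxy, D:proxy",
    "caption_grounding": "C:direct, D:proxy",
    "cross_modal_retrieval": "C:diagnostic, B:proxy, D:proxy",
    "modality_reconstruction": "B:proxy, D:proxy",
    "temporal_order": "C:diagnostic, D:diagnostic",
    "misalignment_detection": "C:diagnostic, B:diagnostic, D:diagnostic",
}


def coverage_counts(relations):
    """Tally (direction, role) pairs across all tasks (hypothetical helper)."""
    counts = Counter()
    for spec in relations.values():
        for item in spec.split(","):
            direction, role = item.strip().split(":")
            counts[(direction, role)] += 1
    return counts


counts = coverage_counts(RELATIONS)
for direction in "ABCD":
    row = [counts[(direction, role)] for role in ("direct", "proxy", "diagnostic")]
    print(direction, row)
```

Because a task can list several directions, the per-direction rows sum to more than 12.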

## Task Mapping With Two Baselines

| Task | Artifact id | Primary direction | Related directions | Minimal | Neural MLP | Readout |
| --- | --- | --- | --- | ---: | ---: | --- |
| Action Recognition | `timeline_action` | C | C:direct, A:proxy | 0.0500 macro-F1 | 0.0148 macro-F1 | Minimal baseline is stronger. Chronological single-episode split creates unseen future action classes. |
| Procedure Step Recognition | `timeline_subtask` | C | C:direct, D:proxy | 0.0506 macro-F1 | 0.0281 macro-F1 | Minimal baseline is stronger. Single-episode ordering makes future subtasks hard to generalize. |
| Action Boundary Detection | `transition_detection` | C | C:direct, D:diagnostic | 0.6118 macro-F1 | 0.5862 macro-F1 | Minimal baseline is stronger. Boundary class is sparse, so accuracy alone is misleading. |
| Next-Action Prediction | `next_action` | C | C:direct, D:proxy | 0.0593 macro-F1 | 0.0419 macro-F1 | Minimal baseline is stronger. Unseen future labels dominate the single-episode chronological test. |
| Hand Trajectory Forecasting | `hand_trajectory_forecast` | A | A:direct, C:proxy | 0.8647 MPJPE | 0.1079 MPJPE | Neural MLP is stronger (MPJPE is an error metric, so lower is better). Forecasting is window-level and not yet a full sequence or policy model. |
| Contact State Prediction | `contact_prediction` | A | A:direct, C:proxy | 1.0000 macro-F1 | 1.0000 macro-F1 | Both baselines are tied. The perfect score reflects a degenerate target in the public sample, where one class dominates, rather than model skill. |
| Object Relevance Prediction | `object_relevance` | C | C:direct, A:proxy, D:proxy | 0.1803 micro-F1 | 0.1679 micro-F1 | Minimal baseline is stronger. Object labels are language-derived and sparse in one episode. |
| Language Grounding | `caption_grounding` | C | C:direct, D:proxy | 0.0160 MRR | 0.0168 MRR | Neural MLP is marginally stronger. Bag-of-objects language features are too weak for rich grounding. |
| Cross-Modal Retrieval | `cross_modal_retrieval` | C | C:diagnostic, B:proxy, D:proxy | 0.2693 MRR | 0.1300 MRR | Minimal baseline is stronger. Retrieval shows an alignment signal, not geometric reconstruction. |
| Cross-Modal Reconstruction | `modality_reconstruction` | B | B:proxy, D:proxy | -0.0153 R2 | -0.0102 R2 | Neural MLP is stronger, but both R2 values are negative, so neither baseline beats a constant mean predictor. Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction. |
| Temporal Order Verification | `temporal_order` | C | C:diagnostic, D:diagnostic | 0.5400 F1 | 0.8520 F1 | Neural MLP is stronger. The task checks only local adjacent ordering, not long-horizon causal modeling. |
| Multimodal Synchronization Detection | `misalignment_detection` | C | C:diagnostic, B:diagnostic, D:diagnostic | 0.5052 F1 | 0.7153 F1 | Neural MLP is stronger. Synthetic shifts diagnose alignment but do not solve calibration or mapping. |
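
The "stronger" wording in the Readout column depends on metric direction: F1, MRR, and R2 improve as they grow, while MPJPE is an error and improves as it shrinks. The sketch below shows one way to encode that comparison. The `HIGHER_IS_BETTER` table and `stronger_baseline` function are illustrative names and not necessarily how the generator script decides; the numbers are copied from the table above.

```python
# Metric direction: True means a larger value is better.
HIGHER_IS_BETTER = {
    "macro-F1": True,
    "micro-F1": True,
    "F1": True,
    "MRR": True,
    "R2": True,
    "MPJPE": False,  # an error metric, so smaller is better
}


def stronger_baseline(metric, minimal, neural):
    """Return which baseline scores better on the given metric (hypothetical helper)."""
    if minimal == neural:
        return "tied"
    minimal_wins = (minimal > neural) == HIGHER_IS_BETTER[metric]
    return "Minimal" if minimal_wins else "Neural MLP"


# Values copied from the task mapping table.
print(stronger_baseline("MPJPE", 0.8647, 0.1079))     # Neural MLP
print(stronger_baseline("R2", -0.0153, -0.0102))      # Neural MLP
print(stronger_baseline("macro-F1", 1.0000, 1.0000))  # tied
print(stronger_baseline("MRR", 0.2693, 0.1300))       # Minimal
```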

## Next-Step Interpretation

### A. Human Modeling & Motion Understanding

The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors.

- Add SMPL/SMPL-X or MANO-style body/hand parameter targets where available.
- Train sequence models over multi-episode motion trajectories instead of isolated windows.
- Evaluate affordance prediction on held-out objects and held-out episodes.

### B. 3D/4D Reconstruction & Neural Rendering

The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry.

- Use calibrated multi-view video plus SLAM pose to build per-episode camera trajectories.
- Add depth-supervised point clouds, TSDF, Gaussian Splatting, or NeRF baselines.
- Evaluate novel-view synthesis and temporal consistency across held-out views/time.

### C. Egocentric Vision & Interaction

Six of the 12 tasks directly target egocentric action, task state, interaction, and grounding; five more relate to this direction as proxy or diagnostic tasks, including alignment.

- Move from single-episode chronological splits to held-out-episode splits.
- Use audio together with stronger multimodal backbones for action, intent, and grounding.
- Evaluate long-horizon task success prediction and action-conditioned generation.
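
The first item above, moving from chronological to held-out-episode splits, can be illustrated with a toy example. This is a sketch under stated assumptions: the window dicts, their `episode`, `t`, and `label` fields, the action names, and all three helper functions are hypothetical and do not come from the repository.

```python
def chronological_split(windows, train_fraction=0.8):
    """Split one episode's windows by time: the earliest part trains, the rest tests."""
    ordered = sorted(windows, key=lambda w: w["t"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def held_out_episode_split(windows, test_episodes):
    """Split by episode id so that no test episode contributes training windows."""
    train = [w for w in windows if w["episode"] not in test_episodes]
    test = [w for w in windows if w["episode"] in test_episodes]
    return train, test


def unseen_labels(train, test):
    """Labels that occur in the test split but never in the training split."""
    return {w["label"] for w in test} - {w["label"] for w in train}


# Toy data: two episodes that run through the same three actions in order.
ACTIONS = ["reach", "grasp", "pour"]
windows = [
    {"episode": ep, "t": t, "label": ACTIONS[t // 2]}
    for ep in ("ep0", "ep1")
    for t in range(6)
]

# Chronological split inside a single episode: the last action is never trained on.
single_episode = [w for w in windows if w["episode"] == "ep0"]
chrono_train, chrono_test = chronological_split(single_episode, train_fraction=0.67)
print(unseen_labels(chrono_train, chrono_test))  # {'pour'}

# Held-out-episode split: every test label also appears in training.
ep_train, ep_test = held_out_episode_split(windows, test_episodes={"ep1"})
print(unseen_labels(ep_train, ep_test))  # set()
```

In the toy data the chronological split leaves a test label that training never sees, which is the failure mode the action and next-action readouts describe; the held-out-episode split removes it only when other episodes contain the same labels.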

### D. Scene Reconstruction & World Modeling

The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs.

- Convert windows into persistent object/scene-state nodes with timestamps and camera poses.
- Add map consistency, object permanence, and spatial relation prediction tasks.
- Train held-out-episode world models that predict future observations and task state.
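
As a rough illustration of the first item above, per-window detections could be merged into persistent object nodes that carry timestamps and camera poses. Everything in this sketch is an assumption made for illustration: the `Observation` and `ObjectNode` classes, the `build_nodes` helper, the window fields, and the three-number pose placeholder are not part of the current code base.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    timestamp: float    # window time in seconds
    camera_pose: tuple  # placeholder pose, here just (x, y, z)


@dataclass
class ObjectNode:
    object_id: str
    observations: list = field(default_factory=list)

    @property
    def first_seen(self):
        return min(o.timestamp for o in self.observations)

    @property
    def last_seen(self):
        return max(o.timestamp for o in self.observations)


def build_nodes(windows):
    """Merge per-window object detections into one persistent node per object."""
    nodes = {}
    for w in windows:
        for object_id in w["objects"]:
            node = nodes.setdefault(object_id, ObjectNode(object_id))
            node.observations.append(Observation(w["t"], w["pose"]))
    return nodes


# Toy windows with hypothetical object labels and poses.
toy_windows = [
    {"t": 0.0, "pose": (0.0, 0.0, 0.0), "objects": ["cup"]},
    {"t": 1.0, "pose": (0.1, 0.0, 0.0), "objects": ["cup", "kettle"]},
    {"t": 2.0, "pose": (0.2, 0.0, 0.0), "objects": ["kettle"]},
]
nodes = build_nodes(toy_windows)
for node in nodes.values():
    print(node.object_id, node.first_seen, node.last_seen)
```

A node's `first_seen` and `last_seen` span gives a simple hook for the object permanence and map consistency tasks listed above, since gaps between observations of the same object become explicit.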