Four-Direction Task Taxonomy
This file is generated by scripts/research_direction_taxonomy.py from the committed 12-task metrics.
It maps the current Xperience-10M sample tasks to the four Ropedia research directions and marks which parts require multi-episode evidence.
Baseline Families
| Baseline | Meaning |
|---|---|
| Minimal | Interpretable softmax, logistic, ridge, and retrieval heads over the 8,546-d window feature vector. |
| Neural MLP | Small PyTorch MLP classifiers/regressors using the same features, splits, and task contracts. |
Direction Coverage
| Direction | Current status | Direct | Proxy | Diagnostic | Current readout |
|---|---|---|---|---|---|
| A. Human Modeling & Motion Understanding | partially implemented | 2 | 2 | 0 | The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors. |
| B. 3D/4D Reconstruction & Neural Rendering | proxy tasks only | 0 | 2 | 1 | The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry. |
| C. Egocentric Vision & Interaction | strongest implemented track | 6 | 2 | 3 | Half of the 12 tasks directly target egocentric action, task state, interaction, and grounding; three more probe alignment as diagnostics. |
| D. Scene Reconstruction & World Modeling | early proxy tasks | 0 | 6 | 3 | The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs. |
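
The Direct, Proxy, and Diagnostic counts above can be re-derived from the "Related directions" column of the task mapping table in this file. The sketch below is a minimal illustration of that tally, not the logic of scripts/research_direction_taxonomy.py; the names `RELATED` and `tally_coverage` are hypothetical.

```python
from collections import Counter

# Related-direction strings copied from the task mapping table in this file,
# keyed by artifact id.
RELATED = {
    "timeline_action": "C:direct, A:proxy",
    "timeline_subtask": "C:direct, D:proxy",
    "transition_detection": "C:direct, D:diagnostic",
    "next_action": "C:direct, D:proxy",
    "hand_trajectory_forecast": "A:direct, C:proxy",
    "contact_prediction": "A:direct, C:proxy",
    "object_relevance": "C:direct, A:proxy, D:proxy",
    "caption_grounding": "C:direct, D:proxy",
    "cross_modal_retrieval": "C:diagnostic, B:proxy, D:proxy",
    "modality_reconstruction": "B:proxy, D:proxy",
    "temporal_order": "C:diagnostic, D:diagnostic",
    "misalignment_detection": "C:diagnostic, B:diagnostic, D:diagnostic",
}


def tally_coverage(related):
    """Count (direction, role) pairs across all tasks."""
    counts = Counter()
    for spec in related.values():
        for pair in spec.split(","):
            direction, role = pair.strip().split(":")
            counts[(direction, role)] += 1
    return counts


coverage = tally_coverage(RELATED)
for direction in "ABCD":
    row = [coverage[(direction, role)] for role in ("direct", "proxy", "diagnostic")]
    print(direction, row)
```

Running it prints one row of counts per direction, which can be compared against the table by hand.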
Task Mapping With Two Baselines
| Task | Artifact id | Primary direction | Related directions | Minimal | Neural MLP | Readout |
|---|---|---|---|---|---|---|
| Action Recognition | timeline_action | C | C:direct, A:proxy | 0.0500 macro-F1 | 0.0148 macro-F1 | Minimal baseline is stronger. Chronological single-episode split creates unseen future action classes. |
| Procedure Step Recognition | timeline_subtask | C | C:direct, D:proxy | 0.0506 macro-F1 | 0.0281 macro-F1 | Minimal baseline is stronger. Single-episode ordering makes future subtasks hard to generalize. |
| Action Boundary Detection | transition_detection | C | C:direct, D:diagnostic | 0.6118 macro-F1 | 0.5862 macro-F1 | Minimal baseline is stronger. Boundary class is sparse, so accuracy alone is misleading. |
| Next-Action Prediction | next_action | C | C:direct, D:proxy | 0.0593 macro-F1 | 0.0419 macro-F1 | Minimal baseline is stronger. Unseen future labels dominate the single-episode chronological test. |
| Hand Trajectory Forecasting | hand_trajectory_forecast | A | A:direct, C:proxy | 0.8647 MPJPE | 0.1079 MPJPE | Neural MLP is stronger. Forecasting is window-level and not yet a full sequence or policy model. |
| Contact State Prediction | contact_prediction | A | A:direct, C:proxy | 1.0000 macro-F1 | 1.0000 macro-F1 | Both baselines are tied. The public sample is degenerate for this target because one class dominates. |
| Object Relevance Prediction | object_relevance | C | C:direct, A:proxy, D:proxy | 0.1803 micro-F1 | 0.1679 micro-F1 | Minimal baseline is stronger. Object labels are language-derived and sparse in one episode. |
| Language Grounding | caption_grounding | C | C:direct, D:proxy | 0.0160 MRR | 0.0168 MRR | Neural MLP is stronger. Bag-of-objects language features are too weak for rich grounding. |
| Cross-Modal Retrieval | cross_modal_retrieval | C | C:diagnostic, B:proxy, D:proxy | 0.2693 MRR | 0.1300 MRR | Minimal baseline is stronger. Retrieval shows an alignment signal, not geometric reconstruction. |
| Cross-Modal Reconstruction | modality_reconstruction | B | B:proxy, D:proxy | -0.0153 R2 | -0.0102 R2 | Neural MLP is stronger. Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction. |
| Temporal Order Verification | temporal_order | C | C:diagnostic, D:diagnostic | 0.5400 F1 | 0.8520 F1 | Neural MLP is stronger. Only local adjacent ordering, not long-horizon causal modeling. |
| Multimodal Synchronization Detection | misalignment_detection | C | C:diagnostic, B:diagnostic, D:diagnostic | 0.5052 F1 | 0.7153 F1 | Neural MLP is stronger. Synthetic shifts diagnose alignment but do not solve calibration or mapping. |
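
The "stronger" verdict in each readout depends on the metric's direction: MPJPE is an error, so lower is better, while the F1, MRR, and R2 columns are scores where higher is better. The sketch below shows one way to apply that rule to a few rows from the table; the function name `stronger_baseline` and the `LOWER_IS_BETTER` set are hypothetical and not taken from the generating script.

```python
# Assumption: MPJPE is the only lower-is-better metric in the table;
# macro-F1, micro-F1, F1, MRR, and R2 are treated as higher-is-better.
LOWER_IS_BETTER = {"MPJPE"}


def stronger_baseline(minimal, neural, metric, tol=1e-12):
    """Return which baseline wins on a metric, or 'tied' within tolerance."""
    if abs(minimal - neural) <= tol:
        return "tied"
    if metric in LOWER_IS_BETTER:
        minimal_wins = minimal < neural
    else:
        minimal_wins = minimal > neural
    return "Minimal" if minimal_wins else "Neural MLP"


# (artifact id, Minimal value, Neural MLP value, metric) copied from the table.
rows = [
    ("hand_trajectory_forecast", 0.8647, 0.1079, "MPJPE"),
    ("contact_prediction", 1.0000, 1.0000, "macro-F1"),
    ("cross_modal_retrieval", 0.2693, 0.1300, "MRR"),
    ("modality_reconstruction", -0.0153, -0.0102, "R2"),
]

verdicts = {name: stronger_baseline(a, b, metric) for name, a, b, metric in rows}
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

Keeping the metric direction in one explicit set makes it harder to misread the hand trajectory row, where the larger number belongs to the weaker baseline.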
Next-Step Interpretation
A. Human Modeling & Motion Understanding
The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors.
- Add SMPL/SMPL-X or MANO-style body/hand parameter targets where available.
- Train sequence models over multi-episode motion trajectories instead of isolated windows.
- Evaluate affordance prediction on held-out objects and held-out episodes.
B. 3D/4D Reconstruction & Neural Rendering
The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry.
- Use calibrated multi-view video plus SLAM pose to build per-episode camera trajectories.
- Add depth-supervised point clouds, TSDF, Gaussian Splatting, or NeRF baselines.
- Evaluate novel-view synthesis and temporal consistency across held-out views/time.
C. Egocentric Vision & Interaction
Half of the 12 tasks directly target egocentric action, task state, interaction, and grounding; three more probe alignment as diagnostics.
- Move from single-episode chronological splits to held-out-episode splits.
- Use audio together with stronger multimodal backbones for action, intent, and grounding.
- Evaluate long-horizon task success prediction and action-conditioned generation.
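
The first item above can be illustrated with toy data. The sketch below contrasts a chronological split inside one episode with a held-out-episode split and reports which test labels never appear in training. The window dictionaries, their keys (`episode`, `t`, `label`), and the toy action names are all hypothetical and do not reflect the real Xperience-10M schema.

```python
def chronological_split(windows, train_fraction=0.8):
    """Train on the earliest windows, test on the latest ones."""
    ordered = sorted(windows, key=lambda w: w["t"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def held_out_episode_split(windows, test_episodes):
    """Keep whole episodes together: test episodes never leak into training."""
    train = [w for w in windows if w["episode"] not in test_episodes]
    test = [w for w in windows if w["episode"] in test_episodes]
    return train, test


def unseen_labels(train, test):
    """Labels that occur in the test set but never in the training set."""
    return {w["label"] for w in test} - {w["label"] for w in train}


def toy_episode(name):
    """Toy episode: actions happen in a fixed order, one label per window."""
    labels = ["pour"] * 4 + ["stir"] * 4 + ["serve"] * 2
    return [{"episode": name, "t": t, "label": lab} for t, lab in enumerate(labels)]


single = toy_episode("ep0")
multi = toy_episode("ep0") + toy_episode("ep1")

chrono_unseen = unseen_labels(*chronological_split(single))
held_out_unseen = unseen_labels(*held_out_episode_split(multi, {"ep1"}))
print("chronological, one episode:", chrono_unseen)
print("held-out episode:", held_out_unseen)
```

In this toy setup the chronological split leaves the final action unseen at training time, whereas the held-out-episode split covers every label, which mirrors the unseen-future-class issue noted in the task readouts.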
D. Scene Reconstruction & World Modeling
The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs.
- Convert windows into persistent object/scene-state nodes with timestamps and camera poses.
- Add map consistency, object permanence, and spatial relation prediction tasks.
- Train held-out-episode world models that predict future observations and task state.
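
As a rough illustration of the first item above, the sketch below folds per-window object detections into persistent nodes that remember when and from where each object was observed. Everything here is hypothetical: the `Observation` and `SceneNode` classes, the window keys (`timestamp`, `camera_pose`, `objects`), and the position-only pose are assumptions for illustration, and real SLAM poses would also carry orientation.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    timestamp: float
    camera_pose: tuple  # assumed (x, y, z) camera position only


@dataclass
class SceneNode:
    object_id: str
    observations: list = field(default_factory=list)

    @property
    def first_seen(self):
        return self.observations[0].timestamp

    @property
    def last_seen(self):
        return self.observations[-1].timestamp


def build_scene_nodes(windows):
    """Merge per-window object lists into one persistent node per object."""
    nodes = {}
    for window in sorted(windows, key=lambda w: w["timestamp"]):
        obs = Observation(window["timestamp"], window["camera_pose"])
        for object_id in window["objects"]:
            node = nodes.setdefault(object_id, SceneNode(object_id))
            node.observations.append(obs)
    return nodes


# Toy windows, not real data.
windows = [
    {"timestamp": 0.0, "camera_pose": (0.0, 0.0, 1.5), "objects": ["cup"]},
    {"timestamp": 1.0, "camera_pose": (0.1, 0.0, 1.5), "objects": ["cup", "kettle"]},
    {"timestamp": 2.0, "camera_pose": (0.2, 0.0, 1.5), "objects": ["kettle"]},
]

nodes = build_scene_nodes(windows)
for node in nodes.values():
    print(node.object_id, node.first_seen, node.last_seen, len(node.observations))
```

A node's gap between `last_seen` and the current time is one simple signal an object permanence task could build on.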