cy0307 committed
Commit 477807f · verified · 1 Parent(s): b6a4313

Publish Xperience-10M minimal and neural task baseline cards

README.md CHANGED
@@ -94,6 +94,7 @@ transfers them to H20 for manifest building, training, and evaluation.
  | `artifacts/episode_task_suite/neural_mlp/**/model.pt` | stores the neural MLP checkpoints |
  | `artifacts/**/metrics.json` | records the committed metric values |
  | `artifacts/**/feature_manifest.json` | maps feature blocks back to source modalities |
+ | `artifacts/episode_task_suite/research_directions/` | maps every task to the four Ropedia research directions with minimal-vs-neural readouts |
  | `assets/task_architectures.png` | shows the shared pipeline and all 12 heads |
  | `assets/task_suite_infographic.png` | presents the 12 heads with public-sample modality thumbnails and verified metrics |
 
@@ -104,6 +105,7 @@ transfers them to H20 for manifest building, training, and evaluation.
  - `artifacts/episode_task_suite/neural_mlp/**/history.json`: neural training traces
  - `artifacts/**/metrics.json`: committed metrics
  - `artifacts/**/feature_manifest.json`: feature block boundaries where relevant
+ - `artifacts/episode_task_suite/research_directions/*.json|*.csv|*.md`: four-track task taxonomy
  - `scripts/*.py`: training and visualization scripts
  - `notes/*.md`: interpretation and reproducibility notes
 
@@ -127,6 +129,21 @@ https://huggingface.co/collections/cy0307/ropedia-episode-task-suite
 
  ![Minimal 12-task architecture](assets/task_architectures.png)
 
+ ## Four Research Directions
+
+ The baselines are also grouped by the four Ropedia research tracks:
+
+ | Direction | Current status | Baseline evidence |
+ | --- | --- | --- |
+ | A. Human Modeling & Motion Understanding | partially implemented | hand trajectory forecasting improves from `0.8223` to `0.1116` MPJPE with the neural MLP; contact is degenerate in this sample |
+ | B. 3D/4D Reconstruction & Neural Rendering | proxy tasks only | cross-modal retrieval, feature reconstruction, and misalignment are prerequisites, not full neural rendering |
+ | C. Egocentric Vision & Interaction | strongest implemented track | action/subtask/transition/next-action/object/caption tasks plus alignment/order diagnostics |
+ | D. Scene Reconstruction & World Modeling | early proxy tasks | state, object, retrieval, reconstruction, and temporal tasks are first probes before scene graphs or maps |
+
+ Primary taxonomy file:
+
+ `artifacts/episode_task_suite/research_directions/research_direction_taxonomy.json`
+
  ## Metrics Snapshot
 
  | Task | Neural MLP metric | Minimal metric |
artifacts/episode_task_suite/research_directions/research_direction_summary.md ADDED
@@ -0,0 +1,71 @@
+ # Four-Direction Task Taxonomy
+
+ This file is generated by `scripts/research_direction_taxonomy.py` from the committed 12-task metrics.
+ It maps the current Xperience-10M sample tasks to the four Ropedia research directions without claiming that a single episode solves any full direction.
+
+ ## Baseline Families
+
+ | Baseline | Meaning |
+ | --- | --- |
+ | Minimal | Interpretable softmax, logistic, ridge, and retrieval heads over the 8,378-d window feature vector. |
+ | Neural MLP | Small PyTorch MLP classifiers/regressors using the same features, splits, and task contracts. |
+
+ ## Direction Coverage
+
+ | Direction | Current status | Direct | Proxy | Diagnostic | Current readout |
+ | --- | --- | ---: | ---: | ---: | --- |
+ | A. Human Modeling & Motion Understanding | partially implemented | 2 | 2 | 0 | The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors. |
+ | B. 3D/4D Reconstruction & Neural Rendering | proxy tasks only | 0 | 2 | 1 | The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry. |
+ | C. Egocentric Vision & Interaction | strongest implemented track | 6 | 2 | 3 | Most of the 12 tasks directly target egocentric action, task state, interaction, grounding, and alignment. |
+ | D. Scene Reconstruction & World Modeling | early proxy tasks | 0 | 6 | 3 | The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs. |
+
+ ## Task Mapping With Two Baselines
+
+ | Task | Primary direction | Related directions | Minimal | Neural MLP | Readout |
+ | --- | --- | --- | ---: | ---: | --- |
+ | `timeline_action` | C | C:direct, A:proxy | 0.0500 macro-F1 | 0.0263 macro-F1 | Minimal baseline is stronger. Chronological single-episode split creates unseen future action classes. |
+ | `timeline_subtask` | C | C:direct, D:proxy | 0.0495 macro-F1 | 0.0175 macro-F1 | Minimal baseline is stronger. Single-episode ordering makes future subtasks hard to generalize. |
+ | `transition_detection` | C | C:direct, D:diagnostic | 0.6552 macro-F1 | 0.6485 macro-F1 | Minimal baseline is stronger. Boundary class is sparse, so accuracy alone is misleading. |
+ | `next_action` | C | C:direct, D:proxy | 0.0593 macro-F1 | 0.0235 macro-F1 | Minimal baseline is stronger. Unseen future labels dominate the single-episode chronological test. |
+ | `hand_trajectory_forecast` | A | A:direct, C:proxy | 0.8223 MPJPE | 0.1116 MPJPE | Neural MLP is stronger. Forecasting is window-level and not yet a full sequence or policy model. |
+ | `contact_prediction` | A | A:direct, C:proxy | 1.0000 macro-F1 | 1.0000 macro-F1 | Both baselines are tied. The public sample is degenerate for this target because one class dominates. |
+ | `object_relevance` | C | C:direct, A:proxy, D:proxy | 0.1839 micro-F1 | 0.1798 micro-F1 | Minimal baseline is stronger. Object labels are language-derived and sparse in one episode. |
+ | `caption_grounding` | C | C:direct, D:proxy | 0.0172 MRR | 0.0178 MRR | Neural MLP is stronger. Bag-of-objects language features are too weak for rich grounding. |
+ | `cross_modal_retrieval` | C | C:diagnostic, B:proxy, D:proxy | 0.2634 MRR | 0.1530 MRR | Minimal baseline is stronger. Retrieval proves alignment signal, not geometric reconstruction. |
+ | `modality_reconstruction` | B | B:proxy, D:proxy | -0.0160 R2 | -0.0102 R2 | Neural MLP is stronger. Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction. |
+ | `temporal_order` | C | C:diagnostic, D:diagnostic | 0.5487 F1 | 0.8718 F1 | Neural MLP is stronger. Only local adjacent ordering, not long-horizon causal modeling. |
+ | `misalignment_detection` | C | C:diagnostic, B:diagnostic, D:diagnostic | 0.4866 F1 | 0.7335 F1 | Neural MLP is stronger. Synthetic shifts diagnose alignment but do not solve calibration or mapping. |
+
+ ## Next-Step Interpretation
+
+ ### A. Human Modeling & Motion Understanding
+
+ The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors.
+
+ - Add SMPL/SMPL-X or MANO-style body/hand parameter targets where available.
+ - Train sequence models over multi-episode motion trajectories instead of isolated windows.
+ - Evaluate affordance prediction on held-out objects and held-out episodes.
+
+ ### B. 3D/4D Reconstruction & Neural Rendering
+
+ The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry.
+
+ - Use calibrated multi-view video plus SLAM pose to build per-episode camera trajectories.
+ - Add depth-supervised point clouds, TSDF, Gaussian Splatting, or NeRF baselines.
+ - Evaluate novel-view synthesis and temporal consistency across held-out views/time.
+
+ ### C. Egocentric Vision & Interaction
+
+ Most of the 12 tasks directly target egocentric action, task state, interaction, grounding, and alignment.
+
+ - Move from single-episode chronological splits to held-out-episode splits.
+ - Add audio features and stronger multimodal backbones for action, intent, and grounding.
+ - Evaluate long-horizon task success prediction and action-conditioned generation.
+
+ ### D. Scene Reconstruction & World Modeling
+
+ The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs.
+
+ - Convert windows into persistent object/scene-state nodes with timestamps and camera poses.
+ - Add map consistency, object permanence, and spatial relation prediction tasks.
+ - Train held-out-episode world models that predict future observations and task state.
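
Reviewer sketch: the Direct/Proxy/Diagnostic counts in the Direction Coverage table can be re-derived from the "Related directions" column of the task mapping table. The snippet below is a minimal illustration under that assumption; `DIRECTION_ROLES` is copied from the table, while `coverage_counts` is a hypothetical helper name and is not taken from `scripts/research_direction_taxonomy.py`, whose contents are not shown in this commit view.

```python
from collections import Counter

# Role of each task per direction, copied from the "Related directions" column.
DIRECTION_ROLES = {
    "timeline_action": {"C": "direct", "A": "proxy"},
    "timeline_subtask": {"C": "direct", "D": "proxy"},
    "transition_detection": {"C": "direct", "D": "diagnostic"},
    "next_action": {"C": "direct", "D": "proxy"},
    "hand_trajectory_forecast": {"A": "direct", "C": "proxy"},
    "contact_prediction": {"A": "direct", "C": "proxy"},
    "object_relevance": {"C": "direct", "A": "proxy", "D": "proxy"},
    "caption_grounding": {"C": "direct", "D": "proxy"},
    "cross_modal_retrieval": {"C": "diagnostic", "B": "proxy", "D": "proxy"},
    "modality_reconstruction": {"B": "proxy", "D": "proxy"},
    "temporal_order": {"C": "diagnostic", "D": "diagnostic"},
    "misalignment_detection": {"C": "diagnostic", "B": "diagnostic", "D": "diagnostic"},
}


def coverage_counts(direction_roles):
    """Count how many tasks link to each direction, split by role."""
    counts = {}
    for roles in direction_roles.values():
        for direction, role in roles.items():
            counts.setdefault(direction, Counter())[role] += 1
    return counts


counts = coverage_counts(DIRECTION_ROLES)
for direction in sorted(counts):
    c = counts[direction]  # Counter returns 0 for roles that never occur
    print(direction, c["direct"], c["proxy"], c["diagnostic"])
```

Run against the table above, this reproduces the committed coverage rows (A: 2/2/0, B: 0/2/1, C: 6/2/3, D: 0/6/3), which is a cheap consistency check to rerun whenever a task is added or re-mapped.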
artifacts/episode_task_suite/research_directions/research_direction_task_map.csv ADDED
@@ -0,0 +1,28 @@
+ direction,direction_name,task,task_name,family,relationship,primary_direction,metric_name,minimal_metric,neural_mlp_metric,better_baseline,why,current_limit
+ C,Egocentric Vision & Interaction,timeline_action,Timeline action recognition,supervised,direct,C,macro-F1,0.05,0.0263157894737,minimal,Reads egocentric sensor state as the current human action; also provides a weak human-motion readout.,Chronological single-episode split creates unseen future action classes.
+ A,Human Modeling & Motion Understanding,timeline_action,Timeline action recognition,supervised,proxy,C,macro-F1,0.05,0.0263157894737,minimal,Reads egocentric sensor state as the current human action; also provides a weak human-motion readout.,Chronological single-episode split creates unseen future action classes.
+ C,Egocentric Vision & Interaction,timeline_subtask,Timeline subtask recognition,supervised,direct,C,macro-F1,0.0495412112118,0.0175182481752,minimal,Segments egocentric task state and provides a first proxy for symbolic world/task state.,Single-episode ordering makes future subtasks hard to generalize.
+ D,Scene Reconstruction & World Modeling,timeline_subtask,Timeline subtask recognition,supervised,proxy,C,macro-F1,0.0495412112118,0.0175182481752,minimal,Segments egocentric task state and provides a first proxy for symbolic world/task state.,Single-episode ordering makes future subtasks hard to generalize.
+ C,Egocentric Vision & Interaction,transition_detection,Action transition detection,diagnostic,direct,C,macro-F1,0.655182926829,0.648484848485,minimal,Localizes egocentric task boundaries and diagnoses temporal state changes.,"Boundary class is sparse, so accuracy alone is misleading."
+ D,Scene Reconstruction & World Modeling,transition_detection,Action transition detection,diagnostic,diagnostic,C,macro-F1,0.655182926829,0.648484848485,minimal,Localizes egocentric task boundaries and diagnoses temporal state changes.,"Boundary class is sparse, so accuracy alone is misleading."
+ C,Egocentric Vision & Interaction,next_action,Short-horizon next action,supervised,direct,C,macro-F1,0.0592592592593,0.0235294117647,minimal,Tests action intention/task-flow prediction from egocentric context.,Unseen future labels dominate the single-episode chronological test.
+ D,Scene Reconstruction & World Modeling,next_action,Short-horizon next action,supervised,proxy,C,macro-F1,0.0592592592593,0.0235294117647,minimal,Tests action intention/task-flow prediction from egocentric context.,Unseen future labels dominate the single-episode chronological test.
+ A,Human Modeling & Motion Understanding,hand_trajectory_forecast,Hand trajectory forecasting,forecast,direct,A,MPJPE,0.822264492512,0.11163123697,neural_mlp,Directly predicts human hand motion and supports hand-object interaction modeling.,Forecasting is window-level and not yet a full sequence or policy model.
+ C,Egocentric Vision & Interaction,hand_trajectory_forecast,Hand trajectory forecasting,forecast,proxy,A,MPJPE,0.822264492512,0.11163123697,neural_mlp,Directly predicts human hand motion and supports hand-object interaction modeling.,Forecasting is window-level and not yet a full sequence or policy model.
+ A,Human Modeling & Motion Understanding,contact_prediction,Body/object contact prediction,supervised,direct,A,macro-F1,1,1,tie,"Targets physical interaction state, a core affordance and manipulation signal.",The public sample is degenerate for this target because one class dominates.
+ C,Egocentric Vision & Interaction,contact_prediction,Body/object contact prediction,supervised,proxy,A,macro-F1,1,1,tie,"Targets physical interaction state, a core affordance and manipulation signal.",The public sample is degenerate for this target because one class dominates.
+ C,Egocentric Vision & Interaction,object_relevance,Relevant object set prediction,supervised,direct,C,micro-F1,0.183930300097,0.179758308157,minimal,Connects egocentric activity to manipulated objects and early object-centric state.,Object labels are language-derived and sparse in one episode.
+ A,Human Modeling & Motion Understanding,object_relevance,Relevant object set prediction,supervised,proxy,C,micro-F1,0.183930300097,0.179758308157,minimal,Connects egocentric activity to manipulated objects and early object-centric state.,Object labels are language-derived and sparse in one episode.
+ D,Scene Reconstruction & World Modeling,object_relevance,Relevant object set prediction,supervised,proxy,C,micro-F1,0.183930300097,0.179758308157,minimal,Connects egocentric activity to manipulated objects and early object-centric state.,Object labels are language-derived and sparse in one episode.
+ C,Egocentric Vision & Interaction,caption_grounding,Caption-to-window grounding,retrieval,direct,C,MRR,0.0171839460838,0.0178111116104,neural_mlp,Grounds language annotation into egocentric sensor time and task state.,Bag-of-objects language features are too weak for rich grounding.
+ D,Scene Reconstruction & World Modeling,caption_grounding,Caption-to-window grounding,retrieval,proxy,C,MRR,0.0171839460838,0.0178111116104,neural_mlp,Grounds language annotation into egocentric sensor time and task state.,Bag-of-objects language features are too weak for rich grounding.
+ C,Egocentric Vision & Interaction,cross_modal_retrieval,Cross-modal retrieval,retrieval,diagnostic,C,MRR,0.263359840066,0.15300700222,minimal,"Tests whether synchronized modalities identify the same 4D moment, a prerequisite for reconstruction and world modeling.","Retrieval proves alignment signal, not geometric reconstruction."
+ B,3D/4D Reconstruction & Neural Rendering,cross_modal_retrieval,Cross-modal retrieval,retrieval,proxy,C,MRR,0.263359840066,0.15300700222,minimal,"Tests whether synchronized modalities identify the same 4D moment, a prerequisite for reconstruction and world modeling.","Retrieval proves alignment signal, not geometric reconstruction."
+ D,Scene Reconstruction & World Modeling,cross_modal_retrieval,Cross-modal retrieval,retrieval,proxy,C,MRR,0.263359840066,0.15300700222,minimal,"Tests whether synchronized modalities identify the same 4D moment, a prerequisite for reconstruction and world modeling.","Retrieval proves alignment signal, not geometric reconstruction."
+ B,3D/4D Reconstruction & Neural Rendering,modality_reconstruction,Modality reconstruction,forecast,proxy,B,R2,-0.0160228467711,-0.0101981718914,neural_mlp,Predicts visual/depth state from non-target sensors as a weak reconstruction/world-model objective.,"Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction."
+ D,Scene Reconstruction & World Modeling,modality_reconstruction,Modality reconstruction,forecast,proxy,B,R2,-0.0160228467711,-0.0101981718914,neural_mlp,Predicts visual/depth state from non-target sensors as a weak reconstruction/world-model objective.,"Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction."
+ C,Egocentric Vision & Interaction,temporal_order,Temporal order verification,diagnostic,diagnostic,C,F1,0.548736462094,0.871794871795,neural_mlp,Checks whether features encode local time direction and task progression.,"Only local adjacent ordering, not long-horizon causal modeling."
+ D,Scene Reconstruction & World Modeling,temporal_order,Temporal order verification,diagnostic,diagnostic,C,F1,0.548736462094,0.871794871795,neural_mlp,Checks whether features encode local time direction and task progression.,"Only local adjacent ordering, not long-horizon causal modeling."
+ C,Egocentric Vision & Interaction,misalignment_detection,Cross-modal misalignment detection,diagnostic,diagnostic,C,F1,0.486567164179,0.733524355301,neural_mlp,"Detects temporal desynchronization, a key data-quality gate for multimodal reconstruction and world models.",Synthetic shifts diagnose alignment but do not solve calibration or mapping.
+ B,3D/4D Reconstruction & Neural Rendering,misalignment_detection,Cross-modal misalignment detection,diagnostic,diagnostic,C,F1,0.486567164179,0.733524355301,neural_mlp,"Detects temporal desynchronization, a key data-quality gate for multimodal reconstruction and world models.",Synthetic shifts diagnose alignment but do not solve calibration or mapping.
+ D,Scene Reconstruction & World Modeling,misalignment_detection,Cross-modal misalignment detection,diagnostic,diagnostic,C,F1,0.486567164179,0.733524355301,neural_mlp,"Detects temporal desynchronization, a key data-quality gate for multimodal reconstruction and world models.",Synthetic shifts diagnose alignment but do not solve calibration or mapping.
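
Reviewer sketch: the task map stores one row per (direction, task) link, and several `why` and `current_limit` values contain commas inside double quotes, so the file should be read with a real CSV parser and not split on commas. The example below uses Python's standard `csv` module on a four-row excerpt copied from the file above; `load_links` is a hypothetical helper name, and reading the full file from disk is left out.

```python
import csv
import io
from collections import defaultdict

# Header plus four rows copied verbatim from research_direction_task_map.csv.
CSV_EXCERPT = """\
direction,direction_name,task,task_name,family,relationship,primary_direction,metric_name,minimal_metric,neural_mlp_metric,better_baseline,why,current_limit
C,Egocentric Vision & Interaction,transition_detection,Action transition detection,diagnostic,direct,C,macro-F1,0.655182926829,0.648484848485,minimal,Localizes egocentric task boundaries and diagnoses temporal state changes.,"Boundary class is sparse, so accuracy alone is misleading."
D,Scene Reconstruction & World Modeling,transition_detection,Action transition detection,diagnostic,diagnostic,C,macro-F1,0.655182926829,0.648484848485,minimal,Localizes egocentric task boundaries and diagnoses temporal state changes.,"Boundary class is sparse, so accuracy alone is misleading."
A,Human Modeling & Motion Understanding,hand_trajectory_forecast,Hand trajectory forecasting,forecast,direct,A,MPJPE,0.822264492512,0.11163123697,neural_mlp,Directly predicts human hand motion and supports hand-object interaction modeling.,Forecasting is window-level and not yet a full sequence or policy model.
C,Egocentric Vision & Interaction,hand_trajectory_forecast,Hand trajectory forecasting,forecast,proxy,A,MPJPE,0.822264492512,0.11163123697,neural_mlp,Directly predicts human hand motion and supports hand-object interaction modeling.,Forecasting is window-level and not yet a full sequence or policy model.
"""


def load_links(text):
    """Parse the task map and group each task's direction -> relationship links."""
    rows = list(csv.DictReader(io.StringIO(text)))
    links = defaultdict(dict)
    for row in rows:
        links[row["task"]][row["direction"]] = row["relationship"]
    return rows, dict(links)


rows, links = load_links(CSV_EXCERPT)
for task, roles in links.items():
    # Every task should have a row for its own primary direction.
    primary = next(r["primary_direction"] for r in rows if r["task"] == task)
    assert primary in roles, f"{task} has no row for primary direction {primary}"
    print(task, primary, roles)
```

Because metric columns are repeated on every link row, numeric aggregation over this file should first deduplicate by `task`; otherwise tasks with three links are counted three times.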
artifacts/episode_task_suite/research_directions/research_direction_taxonomy.json ADDED
@@ -0,0 +1,384 @@
1
+ {
2
+ "source": "results/episode_task_suite/summary_report.json",
3
+ "dataset_scope": {
4
+ "sample_episode_count": 1,
5
+ "num_frames": 5821,
6
+ "num_windows": 1161,
7
+ "feature_dim": 8378,
8
+ "warning": "Single public sample episode; this supports pipeline/task evidence, not cross-episode generalization claims."
9
+ },
10
+ "baselines": {
11
+ "minimal": "Interpretable softmax, logistic, ridge, and retrieval heads over the 8,378-d window feature vector.",
12
+ "neural_mlp": "Small PyTorch MLP classifiers/regressors using the same features, splits, and task contracts."
13
+ },
14
+ "directions": {
15
+ "A": {
16
+ "id": "human_motion",
17
+ "name": "Human Modeling & Motion Understanding",
18
+ "focus": "Human/hand/body motion, deformation priors, human-object interaction, affordance modeling.",
19
+ "preferred_background": "Human pose/shape estimation, SMPL-style models, motion capture, or motion generation.",
20
+ "current_status": "partially implemented",
21
+ "current_readout": "The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors.",
22
+ "next_steps": [
23
+ "Add SMPL/SMPL-X or MANO-style body/hand parameter targets where available.",
24
+ "Train sequence models over multi-episode motion trajectories instead of isolated windows.",
25
+ "Evaluate affordance prediction on held-out objects and held-out episodes."
26
+ ],
27
+ "tasks": [
28
+ "timeline_action",
29
+ "hand_trajectory_forecast",
30
+ "contact_prediction",
31
+ "object_relevance"
32
+ ],
33
+ "counts": {
34
+ "direct": 2,
35
+ "proxy": 2,
36
+ "diagnostic": 0,
37
+ "total_links": 4
38
+ }
39
+ },
40
+ "B": {
41
+ "id": "reconstruction_rendering",
42
+ "name": "3D/4D Reconstruction & Neural Rendering",
43
+ "focus": "Multi-view dynamic scene reconstruction, NeRF/Gaussian Splatting, novel-view synthesis.",
44
+ "preferred_background": "3D reconstruction, neural rendering, camera calibration, and bundle adjustment.",
45
+ "current_status": "proxy tasks only",
46
+ "current_readout": "The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry.",
47
+ "next_steps": [
48
+ "Use calibrated multi-view video plus SLAM pose to build per-episode camera trajectories.",
49
+ "Add depth-supervised point clouds, TSDF, Gaussian Splatting, or NeRF baselines.",
50
+ "Evaluate novel-view synthesis and temporal consistency across held-out views/time."
51
+ ],
52
+ "tasks": [
53
+ "cross_modal_retrieval",
54
+ "modality_reconstruction",
55
+ "misalignment_detection"
56
+ ],
57
+ "counts": {
58
+ "direct": 0,
59
+ "proxy": 2,
60
+ "diagnostic": 1,
61
+ "total_links": 3
62
+ }
63
+ },
64
+ "C": {
65
+ "id": "egocentric_interaction",
66
+ "name": "Egocentric Vision & Interaction",
67
+ "focus": "Egocentric action and intention understanding, hand-object interaction, gaze/attention modeling, task structure modeling.",
68
+ "preferred_background": "Video understanding, action recognition, or egocentric vision.",
69
+ "current_status": "strongest implemented track",
70
+ "current_readout": "Most of the 12 tasks directly target egocentric action, task state, interaction, grounding, and alignment.",
71
+ "next_steps": [
72
+ "Move from single-episode chronological splits to held-out-episode splits.",
73
+ "Add audio features and stronger multimodal backbones for action, intent, and grounding.",
74
+ "Evaluate long-horizon task success prediction and action-conditioned generation."
75
+ ],
76
+ "tasks": [
77
+ "timeline_action",
78
+ "timeline_subtask",
79
+ "transition_detection",
80
+ "next_action",
81
+ "hand_trajectory_forecast",
82
+ "contact_prediction",
83
+ "object_relevance",
84
+ "caption_grounding",
85
+ "cross_modal_retrieval",
86
+ "temporal_order",
87
+ "misalignment_detection"
88
+ ],
89
+ "counts": {
90
+ "direct": 6,
91
+ "proxy": 2,
92
+ "diagnostic": 3,
93
+ "total_links": 11
94
+ }
95
+ },
96
+ "D": {
97
+ "id": "world_modeling",
98
+ "name": "Scene Reconstruction & World Modeling",
99
+ "focus": "Long-term consistent 3D/4D scene mapping, scene graphs, object- and space-centric representations, spatial reasoning.",
100
+ "preferred_background": "Large-scale mapping, semantic reconstruction, or agent world models.",
101
+ "current_status": "early proxy tasks",
102
+ "current_readout": "The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs.",
103
+ "next_steps": [
104
+ "Convert windows into persistent object/scene-state nodes with timestamps and camera poses.",
105
+ "Add map consistency, object permanence, and spatial relation prediction tasks.",
106
+ "Train held-out-episode world models that predict future observations and task state."
107
+ ],
108
+ "tasks": [
109
+ "timeline_subtask",
110
+ "transition_detection",
111
+ "next_action",
112
+ "object_relevance",
113
+ "caption_grounding",
114
+ "cross_modal_retrieval",
115
+ "modality_reconstruction",
116
+ "temporal_order",
117
+ "misalignment_detection"
118
+ ],
119
+ "counts": {
120
+ "direct": 0,
121
+ "proxy": 6,
122
+ "diagnostic": 3,
123
+ "total_links": 9
124
+ }
125
+ }
126
+ },
127
+ "tasks": {
128
+ "timeline_action": {
129
+ "name": "Timeline action recognition",
130
+ "family": "supervised",
131
+ "input": "all featurized modalities",
132
+ "output": "current action label",
133
+ "primary_direction": "C",
134
+ "direction_roles": {
135
+ "C": "direct",
136
+ "A": "proxy"
137
+ },
138
+ "why": "Reads egocentric sensor state as the current human action; also provides a weak human-motion readout.",
139
+ "current_limit": "Chronological single-episode split creates unseen future action classes.",
140
+ "metric": {
141
+ "key": "macro_f1",
142
+ "name": "macro-F1",
143
+ "direction": "higher",
144
+ "minimal": 0.05,
145
+ "neural_mlp": 0.02631578947368421,
146
+ "better_baseline": "minimal"
147
+ }
148
+ },
149
+ "timeline_subtask": {
150
+ "name": "Timeline subtask recognition",
151
+ "family": "supervised",
152
+ "input": "all featurized modalities",
153
+ "output": "current subtask label",
154
+ "primary_direction": "C",
155
+ "direction_roles": {
156
+ "C": "direct",
157
+ "D": "proxy"
158
+ },
159
+ "why": "Segments egocentric task state and provides a first proxy for symbolic world/task state.",
160
+ "current_limit": "Single-episode ordering makes future subtasks hard to generalize.",
161
+ "metric": {
162
+ "key": "macro_f1",
163
+ "name": "macro-F1",
164
+ "direction": "higher",
165
+ "minimal": 0.04954121121178666,
166
+ "neural_mlp": 0.017518248175182476,
167
+ "better_baseline": "minimal"
168
+ }
169
+ },
170
+ "transition_detection": {
171
+ "name": "Action transition detection",
172
+ "family": "diagnostic",
173
+ "input": "all featurized modalities",
174
+ "output": "boundary vs steady state",
175
+ "primary_direction": "C",
176
+ "direction_roles": {
177
+ "C": "direct",
178
+ "D": "diagnostic"
179
+ },
180
+ "why": "Localizes egocentric task boundaries and diagnoses temporal state changes.",
181
+ "current_limit": "Boundary class is sparse, so accuracy alone is misleading.",
182
+ "metric": {
183
+ "key": "macro_f1",
184
+ "name": "macro-F1",
185
+ "direction": "higher",
186
+ "minimal": 0.6551829268292684,
187
+ "neural_mlp": 0.6484848484848484,
188
+ "better_baseline": "minimal"
189
+ }
190
+ },
191
+ "next_action": {
192
+ "name": "Short-horizon next action",
193
+ "family": "supervised",
194
+ "input": "current multimodal window",
195
+ "output": "action 20 frames later",
196
+ "primary_direction": "C",
197
+ "direction_roles": {
198
+ "C": "direct",
199
+ "D": "proxy"
200
+ },
201
+ "why": "Tests action intention/task-flow prediction from egocentric context.",
202
+ "current_limit": "Unseen future labels dominate the single-episode chronological test.",
203
+ "metric": {
204
+ "key": "macro_f1",
205
+ "name": "macro-F1",
206
+ "direction": "higher",
207
+ "minimal": 0.05925925925925927,
208
+ "neural_mlp": 0.023529411764705882,
209
+ "better_baseline": "minimal"
210
+ }
211
+ },
212
+ "hand_trajectory_forecast": {
213
+ "name": "Hand trajectory forecasting",
214
+ "family": "forecast",
215
+ "input": "current multimodal window",
216
+ "output": "future left/right hand 3D joints",
217
+ "primary_direction": "A",
218
+ "direction_roles": {
219
+ "A": "direct",
220
+ "C": "proxy"
221
+ },
222
+ "why": "Directly predicts human hand motion and supports hand-object interaction modeling.",
223
+ "current_limit": "Forecasting is window-level and not yet a full sequence or policy model.",
224
+ "metric": {
225
+ "key": "mpjpe",
226
+ "name": "MPJPE",
227
+ "direction": "lower",
228
+ "minimal": 0.8222644925117493,
229
+ "neural_mlp": 0.11163123697042465,
230
+ "better_baseline": "neural_mlp"
231
+ }
232
+ },
233
+ "contact_prediction": {
234
+ "name": "Body/object contact prediction",
235
+ "family": "supervised",
236
+ "input": "non-contact/non-caption features",
237
+ "output": "binary contact label",
238
+ "primary_direction": "A",
239
+ "direction_roles": {
240
+ "A": "direct",
241
+ "C": "proxy"
242
+ },
243
+ "why": "Targets physical interaction state, a core affordance and manipulation signal.",
244
+ "current_limit": "The public sample is degenerate for this target because one class dominates.",
245
+ "metric": {
246
+ "key": "macro_f1",
247
+ "name": "macro-F1",
248
+ "direction": "higher",
249
+ "minimal": 1.0,
250
+ "neural_mlp": 1.0,
251
+ "better_baseline": "tie"
252
+ }
253
+ },
254
+ "object_relevance": {
255
+ "name": "Relevant object set prediction",
256
+ "family": "supervised",
257
+ "input": "non-caption feature blocks",
258
+ "output": "multi-label object set",
259
+ "primary_direction": "C",
260
+ "direction_roles": {
261
+ "C": "direct",
262
+ "A": "proxy",
263
+ "D": "proxy"
264
+ },
265
+ "why": "Connects egocentric activity to manipulated objects and early object-centric state.",
266
+ "current_limit": "Object labels are language-derived and sparse in one episode.",
267
+      "metric": {
+        "key": "micro_f1",
+        "name": "micro-F1",
+        "direction": "higher",
+        "minimal": 0.18393030009680542,
+        "neural_mlp": 0.1797583081570997,
+        "better_baseline": "minimal"
+      }
+    },
+    "caption_grounding": {
+      "name": "Caption-to-window grounding",
+      "family": "retrieval",
+      "input": "caption objects/interaction query and candidate sensor windows",
+      "output": "matching time window",
+      "primary_direction": "C",
+      "direction_roles": {
+        "C": "direct",
+        "D": "proxy"
+      },
+      "why": "Grounds language annotation into egocentric sensor time and task state.",
+      "current_limit": "Bag-of-objects language features are too weak for rich grounding.",
+      "metric": {
+        "key": "mrr",
+        "name": "MRR",
+        "direction": "higher",
+        "minimal": 0.017183946083791223,
+        "neural_mlp": 0.01781111161035397,
+        "better_baseline": "neural_mlp"
+      }
+    },
+    "cross_modal_retrieval": {
+      "name": "Cross-modal retrieval",
+      "family": "retrieval",
+      "input": "motion/IMU/camera query",
+      "output": "matching depth/video window",
+      "primary_direction": "C",
+      "direction_roles": {
+        "C": "diagnostic",
+        "B": "proxy",
+        "D": "proxy"
+      },
+      "why": "Tests whether synchronized modalities identify the same 4D moment, a prerequisite for reconstruction and world modeling.",
+      "current_limit": "Retrieval proves alignment signal, not geometric reconstruction.",
+      "metric": {
+        "key": "mrr",
+        "name": "MRR",
+        "direction": "higher",
+        "minimal": 0.26335984006618296,
+        "neural_mlp": 0.1530070022204131,
+        "better_baseline": "minimal"
+      }
+    },
+    "modality_reconstruction": {
+      "name": "Modality reconstruction",
+      "family": "forecast",
+      "input": "motion/IMU/camera",
+      "output": "depth/video feature vector",
+      "primary_direction": "B",
+      "direction_roles": {
+        "B": "proxy",
+        "D": "proxy"
+      },
+      "why": "Predicts visual/depth state from non-target sensors as a weak reconstruction/world-model objective.",
+      "current_limit": "Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction.",
+      "metric": {
+        "key": "r2",
+        "name": "R2",
+        "direction": "higher",
+        "minimal": -0.016022846771134747,
+        "neural_mlp": -0.010198171891414143,
+        "better_baseline": "neural_mlp"
+      }
+    },
+    "temporal_order": {
+      "name": "Temporal order verification",
+      "family": "diagnostic",
+      "input": "two adjacent windows",
+      "output": "correct vs reversed order",
+      "primary_direction": "C",
+      "direction_roles": {
+        "C": "diagnostic",
+        "D": "diagnostic"
+      },
+      "why": "Checks whether features encode local time direction and task progression.",
+      "current_limit": "Only local adjacent ordering, not long-horizon causal modeling.",
+      "metric": {
+        "key": "f1",
+        "name": "F1",
+        "direction": "higher",
+        "minimal": 0.5487364620938628,
+        "neural_mlp": 0.8717948717948718,
+        "better_baseline": "neural_mlp"
+      }
+    },
+    "misalignment_detection": {
+      "name": "Cross-modal misalignment detection",
+      "family": "diagnostic",
+      "input": "motion plus visual/depth pair",
+      "output": "aligned vs shifted",
+      "primary_direction": "C",
+      "direction_roles": {
+        "C": "diagnostic",
+        "B": "diagnostic",
+        "D": "diagnostic"
+      },
+      "why": "Detects temporal desynchronization, a key data-quality gate for multimodal reconstruction and world models.",
+      "current_limit": "Synthetic shifts diagnose alignment but do not solve calibration or mapping.",
+      "metric": {
+        "key": "f1",
+        "name": "F1",
+        "direction": "higher",
+        "minimal": 0.4865671641791045,
+        "neural_mlp": 0.7335243553008597,
+        "better_baseline": "neural_mlp"
+      }
+    }
+  }
+}
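As a reader-side sanity check, the `better_baseline` labels in the records above can be recomputed from the two metric values and the stated `direction`. The sketch below is not part of the commit: the `RECORDS` tuple layout and the helper names `better_baseline` and `check_records` are illustrative assumptions, and the numbers are copied from the JSON hunk above.

```python
# Reader-side check; names and layout here are hypothetical, not from the commit.
# Each entry: (direction, minimal, neural_mlp, stored better_baseline label),
# with values copied from the JSON records shown above.
RECORDS = {
    "object_relevance": ("higher", 0.18393030009680542, 0.1797583081570997, "minimal"),
    "caption_grounding": ("higher", 0.017183946083791223, 0.01781111161035397, "neural_mlp"),
    "cross_modal_retrieval": ("higher", 0.26335984006618296, 0.1530070022204131, "minimal"),
    "modality_reconstruction": ("higher", -0.016022846771134747, -0.010198171891414143, "neural_mlp"),
    "temporal_order": ("higher", 0.5487364620938628, 0.8717948717948718, "neural_mlp"),
    "misalignment_detection": ("higher", 0.4865671641791045, 0.7335243553008597, "neural_mlp"),
}


def better_baseline(direction: str, minimal: float, neural: float, tol: float = 1e-9) -> str:
    """Pick the stronger baseline, honoring whether lower or higher is better."""
    delta = neural - minimal
    if abs(delta) < tol:
        return "tie"
    if direction == "lower":
        return "neural_mlp" if delta < 0 else "minimal"
    return "neural_mlp" if delta > 0 else "minimal"


def check_records(records: dict) -> list:
    """Return the task names whose stored label disagrees with the recomputed one."""
    return [
        task
        for task, (direction, minimal, neural, label) in records.items()
        if better_baseline(direction, minimal, neural) != label
    ]


if __name__ == "__main__":
    print(check_records(RECORDS))
```

Note that the negative R2 values for `modality_reconstruction` still compare correctly under "higher is better": the neural MLP's value is less negative, so it is labeled the stronger baseline even though both fall below zero.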
assets/charts/research_direction_coverage.svg ADDED
scripts/research_direction_taxonomy.py ADDED
@@ -0,0 +1,589 @@
+#!/usr/bin/env python3
+"""Organize the 12 Xperience-10M tasks into the four Ropedia research tracks.
+
+The script is intentionally deterministic: it reads the committed task metrics,
+adds a hand-audited taxonomy, and writes machine-readable artifacts used by the
+README, website, and Hugging Face pages.
+"""
+
+from __future__ import annotations
+
+import csv
+import html
+import json
+from collections import OrderedDict
+from pathlib import Path
+from typing import Any
+
+
+ROOT = Path(__file__).resolve().parents[1]
+RESULTS = ROOT / "results" / "episode_task_suite"
+OUT_DIR = RESULTS / "research_directions"
+DOCS_DATA = ROOT / "docs" / "data"
+CHARTS = ROOT / "docs" / "assets" / "charts"
+
+SUMMARY_REPORT = RESULTS / "summary_report.json"
+
+
+DIRECTIONS: OrderedDict[str, dict[str, Any]] = OrderedDict(
+    [
+        (
+            "A",
+            {
+                "id": "human_motion",
+                "name": "Human Modeling & Motion Understanding",
+                "focus": "Human/hand/body motion, deformation priors, human-object interaction, affordance modeling.",
+                "preferred_background": "Human pose/shape estimation, SMPL-style models, motion capture, or motion generation.",
+                "current_status": "partially implemented",
+                "current_readout": "The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors.",
+                "next_steps": [
+                    "Add SMPL/SMPL-X or MANO-style body/hand parameter targets where available.",
+                    "Train sequence models over multi-episode motion trajectories instead of isolated windows.",
+                    "Evaluate affordance prediction on held-out objects and held-out episodes.",
+                ],
+            },
+        ),
+        (
+            "B",
+            {
+                "id": "reconstruction_rendering",
+                "name": "3D/4D Reconstruction & Neural Rendering",
+                "focus": "Multi-view dynamic scene reconstruction, NeRF/Gaussian Splatting, novel-view synthesis.",
+                "preferred_background": "3D reconstruction, neural rendering, camera calibration, and bundle adjustment.",
+                "current_status": "proxy tasks only",
+                "current_readout": "The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry.",
+                "next_steps": [
+                    "Use calibrated multi-view video plus SLAM pose to build per-episode camera trajectories.",
+                    "Add depth-supervised point clouds, TSDF, Gaussian Splatting, or NeRF baselines.",
+                    "Evaluate novel-view synthesis and temporal consistency across held-out views/time.",
+                ],
+            },
+        ),
+        (
+            "C",
+            {
+                "id": "egocentric_interaction",
+                "name": "Egocentric Vision & Interaction",
+                "focus": "Egocentric action and intention understanding, hand-object interaction, gaze/attention modeling, task structure modeling.",
+                "preferred_background": "Video understanding, action recognition, or egocentric vision.",
+                "current_status": "strongest implemented track",
+                "current_readout": "Most of the 12 tasks directly target egocentric action, task state, interaction, grounding, and alignment.",
+                "next_steps": [
+                    "Move from single-episode chronological splits to held-out-episode splits.",
+                    "Add audio features and stronger multimodal backbones for action, intent, and grounding.",
+                    "Evaluate long-horizon task success prediction and action-conditioned generation.",
+                ],
+            },
+        ),
+        (
+            "D",
+            {
+                "id": "world_modeling",
+                "name": "Scene Reconstruction & World Modeling",
+                "focus": "Long-term consistent 3D/4D scene mapping, scene graphs, object- and space-centric representations, spatial reasoning.",
+                "preferred_background": "Large-scale mapping, semantic reconstruction, or agent world models.",
+                "current_status": "early proxy tasks",
+                "current_readout": "The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs.",
+                "next_steps": [
+                    "Convert windows into persistent object/scene-state nodes with timestamps and camera poses.",
+                    "Add map consistency, object permanence, and spatial relation prediction tasks.",
+                    "Train held-out-episode world models that predict future observations and task state.",
+                ],
+            },
+        ),
+    ]
+)
+
+
+TASK_TAXONOMY: OrderedDict[str, dict[str, Any]] = OrderedDict(
+    [
+        (
+            "timeline_action",
+            {
+                "name": "Timeline action recognition",
+                "family": "supervised",
+                "input": "all featurized modalities",
+                "output": "current action label",
+                "primary_direction": "C",
+                "direction_roles": {"C": "direct", "A": "proxy"},
+                "why": "Reads egocentric sensor state as the current human action; also provides a weak human-motion readout.",
+                "current_limit": "Chronological single-episode split creates unseen future action classes.",
+            },
+        ),
+        (
+            "timeline_subtask",
+            {
+                "name": "Timeline subtask recognition",
+                "family": "supervised",
+                "input": "all featurized modalities",
+                "output": "current subtask label",
+                "primary_direction": "C",
+                "direction_roles": {"C": "direct", "D": "proxy"},
+                "why": "Segments egocentric task state and provides a first proxy for symbolic world/task state.",
+                "current_limit": "Single-episode ordering makes future subtasks hard to generalize.",
+            },
+        ),
+        (
+            "transition_detection",
+            {
+                "name": "Action transition detection",
+                "family": "diagnostic",
+                "input": "all featurized modalities",
+                "output": "boundary vs steady state",
+                "primary_direction": "C",
+                "direction_roles": {"C": "direct", "D": "diagnostic"},
+                "why": "Localizes egocentric task boundaries and diagnoses temporal state changes.",
+                "current_limit": "Boundary class is sparse, so accuracy alone is misleading.",
+            },
+        ),
+        (
+            "next_action",
+            {
+                "name": "Short-horizon next action",
+                "family": "supervised",
+                "input": "current multimodal window",
+                "output": "action 20 frames later",
+                "primary_direction": "C",
+                "direction_roles": {"C": "direct", "D": "proxy"},
+                "why": "Tests action intention/task-flow prediction from egocentric context.",
+                "current_limit": "Unseen future labels dominate the single-episode chronological test.",
+            },
+        ),
+        (
+            "hand_trajectory_forecast",
+            {
+                "name": "Hand trajectory forecasting",
+                "family": "forecast",
+                "input": "current multimodal window",
+                "output": "future left/right hand 3D joints",
+                "primary_direction": "A",
+                "direction_roles": {"A": "direct", "C": "proxy"},
+                "why": "Directly predicts human hand motion and supports hand-object interaction modeling.",
+                "current_limit": "Forecasting is window-level and not yet a full sequence or policy model.",
+            },
+        ),
+        (
+            "contact_prediction",
+            {
+                "name": "Body/object contact prediction",
+                "family": "supervised",
+                "input": "non-contact/non-caption features",
+                "output": "binary contact label",
+                "primary_direction": "A",
+                "direction_roles": {"A": "direct", "C": "proxy"},
+                "why": "Targets physical interaction state, a core affordance and manipulation signal.",
+                "current_limit": "The public sample is degenerate for this target because one class dominates.",
+            },
+        ),
+        (
+            "object_relevance",
+            {
+                "name": "Relevant object set prediction",
+                "family": "supervised",
+                "input": "non-caption feature blocks",
+                "output": "multi-label object set",
+                "primary_direction": "C",
+                "direction_roles": {"C": "direct", "A": "proxy", "D": "proxy"},
+                "why": "Connects egocentric activity to manipulated objects and early object-centric state.",
+                "current_limit": "Object labels are language-derived and sparse in one episode.",
+            },
+        ),
+        (
+            "caption_grounding",
+            {
+                "name": "Caption-to-window grounding",
+                "family": "retrieval",
+                "input": "caption objects/interaction query and candidate sensor windows",
+                "output": "matching time window",
+                "primary_direction": "C",
+                "direction_roles": {"C": "direct", "D": "proxy"},
+                "why": "Grounds language annotation into egocentric sensor time and task state.",
+                "current_limit": "Bag-of-objects language features are too weak for rich grounding.",
+            },
+        ),
+        (
+            "cross_modal_retrieval",
+            {
+                "name": "Cross-modal retrieval",
+                "family": "retrieval",
+                "input": "motion/IMU/camera query",
+                "output": "matching depth/video window",
+                "primary_direction": "C",
+                "direction_roles": {"C": "diagnostic", "B": "proxy", "D": "proxy"},
+                "why": "Tests whether synchronized modalities identify the same 4D moment, a prerequisite for reconstruction and world modeling.",
+                "current_limit": "Retrieval proves alignment signal, not geometric reconstruction.",
+            },
+        ),
+        (
+            "modality_reconstruction",
+            {
+                "name": "Modality reconstruction",
+                "family": "forecast",
+                "input": "motion/IMU/camera",
+                "output": "depth/video feature vector",
+                "primary_direction": "B",
+                "direction_roles": {"B": "proxy", "D": "proxy"},
+                "why": "Predicts visual/depth state from non-target sensors as a weak reconstruction/world-model objective.",
+                "current_limit": "Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction.",
+            },
+        ),
+        (
+            "temporal_order",
+            {
+                "name": "Temporal order verification",
+                "family": "diagnostic",
+                "input": "two adjacent windows",
+                "output": "correct vs reversed order",
+                "primary_direction": "C",
+                "direction_roles": {"C": "diagnostic", "D": "diagnostic"},
+                "why": "Checks whether features encode local time direction and task progression.",
+                "current_limit": "Only local adjacent ordering, not long-horizon causal modeling.",
+            },
+        ),
+        (
+            "misalignment_detection",
+            {
+                "name": "Cross-modal misalignment detection",
+                "family": "diagnostic",
+                "input": "motion plus visual/depth pair",
+                "output": "aligned vs shifted",
+                "primary_direction": "C",
+                "direction_roles": {"C": "diagnostic", "B": "diagnostic", "D": "diagnostic"},
+                "why": "Detects temporal desynchronization, a key data-quality gate for multimodal reconstruction and world models.",
+                "current_limit": "Synthetic shifts diagnose alignment but do not solve calibration or mapping.",
+            },
+        ),
+    ]
+)
+
+
+METRIC_SPECS = {
+    "timeline_action": ("macro_f1", "macro-F1", "higher"),
+    "timeline_subtask": ("macro_f1", "macro-F1", "higher"),
+    "transition_detection": ("macro_f1", "macro-F1", "higher"),
+    "next_action": ("macro_f1", "macro-F1", "higher"),
+    "hand_trajectory_forecast": ("mpjpe", "MPJPE", "lower"),
+    "contact_prediction": ("macro_f1", "macro-F1", "higher"),
+    "object_relevance": ("micro_f1", "micro-F1", "higher"),
+    "caption_grounding": ("mrr", "MRR", "higher"),
+    "cross_modal_retrieval": ("mrr", "MRR", "higher"),
+    "modality_reconstruction": ("r2", "R2", "higher"),
+    "temporal_order": ("f1", "F1", "higher"),
+    "misalignment_detection": ("f1", "F1", "higher"),
+}
+
+
+def load_summary() -> dict[str, Any]:
+    return json.loads(SUMMARY_REPORT.read_text(encoding="utf-8"))
+
+
+def metric_value(metrics: dict[str, Any] | None, task: str) -> float | None:
+    if not metrics:
+        return None
+    key = METRIC_SPECS[task][0]
+    value = metrics.get(key)
+    return float(value) if value is not None else None
+
+
+def choose_better(task: str, minimal: float | None, neural: float | None) -> str:
+    if minimal is None or neural is None:
+        return "unavailable"
+    _, _, direction = METRIC_SPECS[task]
+    delta = neural - minimal
+    if abs(delta) < 1e-9:
+        return "tie"
+    if direction == "lower":
+        return "neural_mlp" if delta < 0 else "minimal"
+    return "neural_mlp" if delta > 0 else "minimal"
+
+
+def fmt_metric(value: float | None) -> str:
+    if value is None:
+        return "n/a"
+    if abs(value) >= 10:
+        return f"{value:.3f}"
+    return f"{value:.4f}"
+
+
+def baseline_readout(label: str) -> str:
+    if label == "tie":
+        return "Both baselines are tied"
+    if label == "minimal":
+        return "Minimal baseline is stronger"
+    if label == "neural_mlp":
+        return "Neural MLP is stronger"
+    return "Baseline comparison is unavailable"
+
+
+def build_taxonomy(summary: dict[str, Any]) -> dict[str, Any]:
+    minimal_tasks = summary["tasks"]
+    neural_tasks = summary.get("neural_tasks", {})
+
+    task_records: OrderedDict[str, dict[str, Any]] = OrderedDict()
+    direction_counts = {
+        code: {"direct": 0, "proxy": 0, "diagnostic": 0, "total_links": 0}
+        for code in DIRECTIONS
+    }
+
+    for task, spec in TASK_TAXONOMY.items():
+        metric_key, metric_name, metric_direction = METRIC_SPECS[task]
+        minimal_metric = metric_value(minimal_tasks.get(task), task)
+        neural_metric = metric_value(neural_tasks.get(task), task)
+        better = choose_better(task, minimal_metric, neural_metric)
+
+        roles = spec["direction_roles"]
+        for direction_code, role in roles.items():
+            direction_counts[direction_code][role] += 1
+            direction_counts[direction_code]["total_links"] += 1
+
+        task_records[task] = {
+            **spec,
+            "metric": {
+                "key": metric_key,
+                "name": metric_name,
+                "direction": metric_direction,
+                "minimal": minimal_metric,
+                "neural_mlp": neural_metric,
+                "better_baseline": better,
+            },
+        }
+
+    direction_records = OrderedDict()
+    for code, info in DIRECTIONS.items():
+        linked_tasks = [
+            task
+            for task, spec in task_records.items()
+            if code in spec["direction_roles"]
+        ]
+        direction_records[code] = {
+            **info,
+            "tasks": linked_tasks,
+            "counts": direction_counts[code],
+        }
+
+    return {
+        "source": "results/episode_task_suite/summary_report.json",
+        "dataset_scope": {
+            "sample_episode_count": 1,
+            "num_frames": summary.get("num_frames"),
+            "num_windows": summary.get("num_windows"),
+            "feature_dim": summary.get("feature_dim"),
+            "warning": "Single public sample episode; this supports pipeline/task evidence, not cross-episode generalization claims.",
+        },
+        "baselines": {
+            "minimal": "Interpretable softmax, logistic, ridge, and retrieval heads over the 8,378-d window feature vector.",
+            "neural_mlp": "Small PyTorch MLP classifiers/regressors using the same features, splits, and task contracts.",
+        },
+        "directions": direction_records,
+        "tasks": task_records,
+    }
+
+
+def write_csv(taxonomy: dict[str, Any]) -> None:
+    path = OUT_DIR / "research_direction_task_map.csv"
+    with path.open("w", newline="", encoding="utf-8") as handle:
+        writer = csv.writer(handle, lineterminator="\n")
+        writer.writerow(
+            [
+                "direction",
+                "direction_name",
+                "task",
+                "task_name",
+                "family",
+                "relationship",
+                "primary_direction",
+                "metric_name",
+                "minimal_metric",
+                "neural_mlp_metric",
+                "better_baseline",
+                "why",
+                "current_limit",
+            ]
+        )
+        for task, spec in taxonomy["tasks"].items():
+            metric = spec["metric"]
+            for direction_code, relationship in spec["direction_roles"].items():
+                writer.writerow(
+                    [
+                        direction_code,
+                        taxonomy["directions"][direction_code]["name"],
+                        task,
+                        spec["name"],
+                        spec["family"],
+                        relationship,
+                        spec["primary_direction"],
+                        metric["name"],
+                        "" if metric["minimal"] is None else f"{metric['minimal']:.12g}",
+                        "" if metric["neural_mlp"] is None else f"{metric['neural_mlp']:.12g}",
+                        metric["better_baseline"],
+                        spec["why"],
+                        spec["current_limit"],
+                    ]
+                )
+
+
+def write_markdown(taxonomy: dict[str, Any]) -> None:
+    lines = [
+        "# Four-Direction Task Taxonomy",
+        "",
+        "This file is generated by `scripts/research_direction_taxonomy.py` from the committed 12-task metrics.",
+        "It maps the current Xperience-10M sample tasks to the four Ropedia research directions without claiming that a single episode solves any full direction.",
+        "",
+        "## Baseline Families",
+        "",
+        "| Baseline | Meaning |",
+        "| --- | --- |",
+        f"| Minimal | {taxonomy['baselines']['minimal']} |",
+        f"| Neural MLP | {taxonomy['baselines']['neural_mlp']} |",
+        "",
+        "## Direction Coverage",
+        "",
+        "| Direction | Current status | Direct | Proxy | Diagnostic | Current readout |",
+        "| --- | --- | ---: | ---: | ---: | --- |",
+    ]
+    for code, info in taxonomy["directions"].items():
+        counts = info["counts"]
+        lines.append(
+            f"| {code}. {info['name']} | {info['current_status']} | {counts['direct']} | {counts['proxy']} | {counts['diagnostic']} | {info['current_readout']} |"
+        )
+
+    lines.extend(
+        [
+            "",
+            "## Task Mapping With Two Baselines",
+            "",
+            "| Task | Primary direction | Related directions | Minimal | Neural MLP | Readout |",
+            "| --- | --- | --- | ---: | ---: | --- |",
+        ]
+    )
+    for task, spec in taxonomy["tasks"].items():
+        metric = spec["metric"]
+        related = ", ".join(
+            f"{code}:{role}" for code, role in spec["direction_roles"].items()
+        )
+        minimal = f"{fmt_metric(metric['minimal'])} {metric['name']}"
+        neural = f"{fmt_metric(metric['neural_mlp'])} {metric['name']}"
+        readout = f"{baseline_readout(metric['better_baseline'])}. {spec['current_limit']}"
+        lines.append(
+            f"| `{task}` | {spec['primary_direction']} | {related} | {minimal} | {neural} | {readout} |"
+        )
+
+    lines.extend(["", "## Next-Step Interpretation", ""])
+    for code, info in taxonomy["directions"].items():
+        lines.append(f"### {code}. {info['name']}")
+        lines.append("")
+        lines.append(info["current_readout"])
+        lines.append("")
+        for step in info["next_steps"]:
+            lines.append(f"- {step}")
+        lines.append("")
+
+    (OUT_DIR / "research_direction_summary.md").write_text(
+        "\n".join(lines).rstrip() + "\n", encoding="utf-8"
+    )
+
+
+def svg_text(x: int, y: int, text: str, size: int = 16, weight: int = 500, color: str = "#16213a") -> str:
+    return (
+        f'<text x="{x}" y="{y}" font-size="{size}" font-weight="{weight}" '
+        f'fill="{color}">{html.escape(text)}</text>'
+    )
+
+
+def write_svg(taxonomy: dict[str, Any]) -> None:
+    width = 1180
+    height = 700
+    margin = 58
+    card_w = 515
+    card_h = 220
+    colors = {"direct": "#1f6c9f", "proxy": "#2e7775", "diagnostic": "#956400"}
+    cards = []
+
+    for idx, (code, info) in enumerate(taxonomy["directions"].items()):
+        row = idx // 2
+        col = idx % 2
+        x = margin + col * (card_w + 34)
+        y = 130 + row * (card_h + 34)
+        counts = info["counts"]
+        total = max(1, counts["direct"] + counts["proxy"] + counts["diagnostic"])
+        bar_x = x + 24
+        bar_y = y + 132
+        bar_w = card_w - 48
+        cursor = bar_x
+        segments = []
+        for key in ("direct", "proxy", "diagnostic"):
+            seg_w = round(bar_w * counts[key] / total)
+            if counts[key] > 0:
+                segments.append(
+                    f'<rect x="{cursor}" y="{bar_y}" width="{seg_w}" height="16" rx="8" fill="{colors[key]}"/>'
+                )
+            cursor += seg_w
+
+        task_labels = ", ".join(info["tasks"][:5])
+        if len(info["tasks"]) > 5:
+            task_labels += f", +{len(info['tasks']) - 5}"
+
+        cards.append(
+            "\n".join(
+                [
+                    f'<rect x="{x}" y="{y}" width="{card_w}" height="{card_h}" rx="8" fill="#ffffff" stroke="#d9e1ea"/>',
+                    svg_text(x + 24, y + 42, f"{code}. {info['name']}", 21, 700),
+                    svg_text(x + 24, y + 75, info["current_status"], 15, 700, "#566273"),
+                    svg_text(x + 24, y + 108, f"Tasks: {task_labels}", 14, 500, "#30394a"),
+                    *segments,
+                    svg_text(x + 24, y + 174, f"Direct {counts['direct']}", 14, 700, colors["direct"]),
+                    svg_text(x + 150, y + 174, f"Proxy {counts['proxy']}", 14, 700, colors["proxy"]),
+                    svg_text(x + 270, y + 174, f"Diagnostic {counts['diagnostic']}", 14, 700, colors["diagnostic"]),
+                ]
+            )
+        )
+
+    legend = []
+    lx = margin
+    for key, label in (
+        ("direct", "Direct task"),
+        ("proxy", "Proxy / prerequisite"),
+        ("diagnostic", "Diagnostic probe"),
+    ):
+        legend.extend(
+            [
+                f'<rect x="{lx}" y="622" width="16" height="16" rx="4" fill="{colors[key]}"/>',
+                svg_text(lx + 24, 636, label, 14, 600, "#30394a"),
+            ]
+        )
+        lx += 200
+
+    svg = f"""<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}" viewBox="0 0 {width} {height}" role="img" aria-label="Xperience-10M task coverage across four research directions">
+<rect width="100%" height="100%" fill="#f7f9fb"/>
+{svg_text(margin, 64, "Xperience-10M 12-Task Suite: Four Research Directions", 30, 800)}
+{svg_text(margin, 96, "One public sample episode, two baseline families, explicit direct/proxy/diagnostic coverage.", 16, 500, "#566273")}
+{"".join(cards)}
+{"".join(legend)}
+{svg_text(margin, 670, "Generated from results/episode_task_suite/summary_report.json and scripts/research_direction_taxonomy.py", 13, 500, "#6d7787")}
+</svg>
+"""
+    (CHARTS / "research_direction_coverage.svg").write_text(svg, encoding="utf-8")
+
+
+def main() -> None:
+    OUT_DIR.mkdir(parents=True, exist_ok=True)
+    DOCS_DATA.mkdir(parents=True, exist_ok=True)
+    CHARTS.mkdir(parents=True, exist_ok=True)
+
+    taxonomy = build_taxonomy(load_summary())
+    json_text = json.dumps(taxonomy, indent=2, ensure_ascii=False)
+    (OUT_DIR / "research_direction_taxonomy.json").write_text(json_text + "\n", encoding="utf-8")
+    (DOCS_DATA / "research_directions.json").write_text(json_text + "\n", encoding="utf-8")
+    write_csv(taxonomy)
+    write_markdown(taxonomy)
+    write_svg(taxonomy)
+
+    print(f"Wrote {OUT_DIR / 'research_direction_taxonomy.json'}")
+    print(f"Wrote {OUT_DIR / 'research_direction_task_map.csv'}")
+    print(f"Wrote {OUT_DIR / 'research_direction_summary.md'}")
+    print(f"Wrote {DOCS_DATA / 'research_directions.json'}")
+    print(f"Wrote {CHARTS / 'research_direction_coverage.svg'}")
+
+
+if __name__ == "__main__":
+    main()
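One design detail in `write_svg`: each coverage-bar segment is rounded independently with `round(bar_w * counts[key] / total)`, so in general the segment widths may not sum to exactly `bar_w` (for example, three equal counts over a 467-pixel bar). The sketch below shows one possible alternative, a largest-remainder split that always sums to the bar width. It is an illustration only and not part of the commit; the function name `segment_widths` is hypothetical.

```python
# Illustrative alternative to per-segment rounding; not from the commit.
def segment_widths(counts: list, bar_w: int) -> list:
    """Split bar_w pixels across counts so the integer widths sum to bar_w."""
    total = sum(counts)
    if total == 0:
        # Nothing to draw: every segment gets zero width.
        return [0] * len(counts)
    exact = [bar_w * c / total for c in counts]
    widths = [int(e) for e in exact]  # floor, since all values are non-negative
    leftover = bar_w - sum(widths)
    # Hand the remaining pixels to the segments with the largest fractional parts.
    order = sorted(range(len(counts)), key=lambda i: exact[i] - widths[i], reverse=True)
    for i in order[:leftover]:
        widths[i] += 1
    return widths


if __name__ == "__main__":
    # 467 matches bar_w = card_w - 48 in write_svg.
    print(segment_widths([6, 2, 3], 467))
    print(segment_widths([1, 1, 1], 467))
```

Whether this matters depends on the counts: if the rounded widths already add up to the bar width for the committed taxonomy, the change is cosmetic and only guards against future count combinations.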