Publish Xperience-10M minimal and neural task baseline cards
- README.md +17 -0
- artifacts/episode_task_suite/research_directions/research_direction_summary.md +71 -0
- artifacts/episode_task_suite/research_directions/research_direction_task_map.csv +28 -0
- artifacts/episode_task_suite/research_directions/research_direction_taxonomy.json +384 -0
- assets/charts/research_direction_coverage.svg +41 -0
- scripts/research_direction_taxonomy.py +589 -0
README.md
CHANGED
|
@@ -94,6 +94,7 @@ transfers them to H20 for manifest building, training, and evaluation.
|
|
| 94 |
| `artifacts/episode_task_suite/neural_mlp/**/model.pt` | stores the neural MLP checkpoints |
|
| 95 |
| `artifacts/**/metrics.json` | records the committed metric values |
|
| 96 |
| `artifacts/**/feature_manifest.json` | maps feature blocks back to source modalities |
|
|
|
|
| 97 |
| `assets/task_architectures.png` | shows the shared pipeline and all 12 heads |
|
| 98 |
| `assets/task_suite_infographic.png` | presents the 12 heads with public-sample modality thumbnails and verified metrics |
|
| 99 |
|
|
@@ -104,6 +105,7 @@ transfers them to H20 for manifest building, training, and evaluation.
|
|
| 104 |
- `artifacts/episode_task_suite/neural_mlp/**/history.json`: neural training traces
|
| 105 |
- `artifacts/**/metrics.json`: committed metrics
|
| 106 |
- `artifacts/**/feature_manifest.json`: feature block boundaries where relevant
|
|
|
|
| 107 |
- `scripts/*.py`: training and visualization scripts
|
| 108 |
- `notes/*.md`: interpretation and reproducibility notes
|
| 109 |
|
|
@@ -127,6 +129,21 @@ https://huggingface.co/collections/cy0307/ropedia-episode-task-suite
|
|
| 127 |
|
| 128 |

|
| 129 |
|
| 130 |
## Metrics Snapshot
|
| 131 |
|
| 132 |
| Task | Neural MLP metric | Minimal metric |
|
|
|
|
| 94 |
| `artifacts/episode_task_suite/neural_mlp/**/model.pt` | stores the neural MLP checkpoints |
|
| 95 |
| `artifacts/**/metrics.json` | records the committed metric values |
|
| 96 |
| `artifacts/**/feature_manifest.json` | maps feature blocks back to source modalities |
|
| 97 |
+
| `artifacts/episode_task_suite/research_directions/` | maps every task to the four Ropedia research directions with minimal-vs-neural readouts |
|
| 98 |
| `assets/task_architectures.png` | shows the shared pipeline and all 12 heads |
|
| 99 |
| `assets/task_suite_infographic.png` | presents the 12 heads with public-sample modality thumbnails and verified metrics |
|
| 100 |
|
|
|
|
| 105 |
- `artifacts/episode_task_suite/neural_mlp/**/history.json`: neural training traces
|
| 106 |
- `artifacts/**/metrics.json`: committed metrics
|
| 107 |
- `artifacts/**/feature_manifest.json`: feature block boundaries where relevant
|
| 108 |
+
- `artifacts/episode_task_suite/research_directions/*.json|*.csv|*.md`: four-track task taxonomy
|
| 109 |
- `scripts/*.py`: training and visualization scripts
|
| 110 |
- `notes/*.md`: interpretation and reproducibility notes
|
| 111 |
|
|
|
|
| 129 |
|
| 130 |

|
| 131 |
|
| 132 |
+
## Four Research Directions
|
| 133 |
+
|
| 134 |
+
The baselines are also grouped by the four Ropedia research tracks:
|
| 135 |
+
|
| 136 |
+
| Direction | Current status | Baseline evidence |
|
| 137 |
+
| --- | --- | --- |
|
| 138 |
+
| A. Human Modeling & Motion Understanding | partially implemented | hand trajectory forecasting MPJPE drops from `0.8223` (minimal) to `0.1116` (neural MLP; lower is better); contact prediction is degenerate in this sample |
|
| 139 |
+
| B. 3D/4D Reconstruction & Neural Rendering | proxy tasks only | cross-modal retrieval, feature reconstruction, and misalignment are prerequisites, not full neural rendering |
|
| 140 |
+
| C. Egocentric Vision & Interaction | strongest implemented track | action/subtask/transition/next-action/object/caption tasks plus alignment/order diagnostics |
|
| 141 |
+
| D. Scene Reconstruction & World Modeling | early proxy tasks | state, object, retrieval, reconstruction, and temporal tasks are first probes before scene graphs or maps |
|
| 142 |
+
|
| 143 |
+
Primary taxonomy file:
|
| 144 |
+
|
| 145 |
+
`artifacts/episode_task_suite/research_directions/research_direction_taxonomy.json`
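
A minimal sketch of how the taxonomy file might be read, assuming the `directions`, `current_status`, and `counts` fields shown in the committed JSON. The helper name `summarize_directions` is hypothetical, and the inline excerpt copies only directions A and B so the example runs without the repository checked out; replace `json.loads(...)` with `json.load(open(path))` to read the real file.

```python
import json

# Inline excerpt standing in for research_direction_taxonomy.json.
# Values are copied from the committed file and trimmed to the fields used here.
TAXONOMY_EXCERPT = """
{
  "directions": {
    "A": {"name": "Human Modeling & Motion Understanding",
          "current_status": "partially implemented",
          "counts": {"direct": 2, "proxy": 2, "diagnostic": 0, "total_links": 4}},
    "B": {"name": "3D/4D Reconstruction & Neural Rendering",
          "current_status": "proxy tasks only",
          "counts": {"direct": 0, "proxy": 2, "diagnostic": 1, "total_links": 3}}
  }
}
"""


def summarize_directions(taxonomy):
    """Return one summary line per direction (hypothetical helper, not part of the repo)."""
    lines = []
    for key, info in sorted(taxonomy["directions"].items()):
        counts = info["counts"]
        lines.append(
            f"{key}. {info['name']}: {info['current_status']} "
            f"({counts['direct']} direct, {counts['proxy']} proxy, "
            f"{counts['diagnostic']} diagnostic)"
        )
    return lines


taxonomy = json.loads(TAXONOMY_EXCERPT)
for line in summarize_directions(taxonomy):
    print(line)
```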
|
| 146 |
+
|
| 147 |
## Metrics Snapshot
|
| 148 |
|
| 149 |
| Task | Neural MLP metric | Minimal metric |
|
artifacts/episode_task_suite/research_directions/research_direction_summary.md
ADDED
|
@@ -0,0 +1,71 @@
|
|
| 1 |
+
# Four-Direction Task Taxonomy
|
| 2 |
+
|
| 3 |
+
This file is generated by `scripts/research_direction_taxonomy.py` from the committed 12-task metrics.
|
| 4 |
+
It maps the current Xperience-10M sample tasks to the four Ropedia research directions without claiming that a single episode solves any full direction.
|
| 5 |
+
|
| 6 |
+
## Baseline Families
|
| 7 |
+
|
| 8 |
+
| Baseline | Meaning |
|
| 9 |
+
| --- | --- |
|
| 10 |
+
| Minimal | Interpretable softmax, logistic, ridge, and retrieval heads over the 8,378-d window feature vector. |
|
| 11 |
+
| Neural MLP | Small PyTorch MLP classifiers/regressors using the same features, splits, and task contracts. |
|
| 12 |
+
|
| 13 |
+
## Direction Coverage
|
| 14 |
+
|
| 15 |
+
| Direction | Current status | Direct | Proxy | Diagnostic | Current readout |
|
| 16 |
+
| --- | --- | ---: | ---: | ---: | --- |
|
| 17 |
+
| A. Human Modeling & Motion Understanding | partially implemented | 2 | 2 | 0 | The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors. |
|
| 18 |
+
| B. 3D/4D Reconstruction & Neural Rendering | proxy tasks only | 0 | 2 | 1 | The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry. |
|
| 19 |
+
| C. Egocentric Vision & Interaction | strongest implemented track | 6 | 2 | 3 | Most of the 12 tasks directly target egocentric action, task state, interaction, grounding, and alignment. |
|
| 20 |
+
| D. Scene Reconstruction & World Modeling | early proxy tasks | 0 | 6 | 3 | The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs. |
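
The Direct, Proxy, and Diagnostic counts above can be re-derived from the per-task direction roles listed in the task mapping in this file. The sketch below is one way to do that tally; it is not taken from `scripts/research_direction_taxonomy.py`, and the helper name `count_links` is hypothetical. The role mapping is transcribed by hand from the "Related directions" column.

```python
from collections import Counter

# Per-task direction roles, transcribed from the "Related directions" column.
DIRECTION_ROLES = {
    "timeline_action": {"C": "direct", "A": "proxy"},
    "timeline_subtask": {"C": "direct", "D": "proxy"},
    "transition_detection": {"C": "direct", "D": "diagnostic"},
    "next_action": {"C": "direct", "D": "proxy"},
    "hand_trajectory_forecast": {"A": "direct", "C": "proxy"},
    "contact_prediction": {"A": "direct", "C": "proxy"},
    "object_relevance": {"C": "direct", "A": "proxy", "D": "proxy"},
    "caption_grounding": {"C": "direct", "D": "proxy"},
    "cross_modal_retrieval": {"C": "diagnostic", "B": "proxy", "D": "proxy"},
    "modality_reconstruction": {"B": "proxy", "D": "proxy"},
    "temporal_order": {"C": "diagnostic", "D": "diagnostic"},
    "misalignment_detection": {"C": "diagnostic", "B": "diagnostic", "D": "diagnostic"},
}


def count_links(direction_roles):
    """Tally how many tasks link to each direction under each role (hypothetical helper)."""
    counts = {}
    for roles in direction_roles.values():
        for direction, role in roles.items():
            counts.setdefault(direction, Counter())[role] += 1
    return counts


coverage = count_links(DIRECTION_ROLES)
for direction in sorted(coverage):
    c = coverage[direction]
    # Counter returns 0 for roles a direction never received.
    print(direction, c["direct"], c["proxy"], c["diagnostic"], sum(c.values()))
```

The totals (4, 3, 11, and 9 links) sum to 27 task-direction links, which is more than the 12 tasks because most tasks link to several directions.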
|
| 21 |
+
|
| 22 |
+
## Task Mapping With Two Baselines
|
| 23 |
+
|
| 24 |
+
| Task | Primary direction | Related directions | Minimal | Neural MLP | Readout |
|
| 25 |
+
| --- | --- | --- | ---: | ---: | --- |
|
| 26 |
+
| `timeline_action` | C | C:direct, A:proxy | 0.0500 macro-F1 | 0.0263 macro-F1 | Minimal baseline is stronger. Chronological single-episode split creates unseen future action classes. |
|
| 27 |
+
| `timeline_subtask` | C | C:direct, D:proxy | 0.0495 macro-F1 | 0.0175 macro-F1 | Minimal baseline is stronger. Single-episode ordering makes future subtasks hard to generalize. |
|
| 28 |
+
| `transition_detection` | C | C:direct, D:diagnostic | 0.6552 macro-F1 | 0.6485 macro-F1 | Minimal baseline is stronger. Boundary class is sparse, so accuracy alone is misleading. |
|
| 29 |
+
| `next_action` | C | C:direct, D:proxy | 0.0593 macro-F1 | 0.0235 macro-F1 | Minimal baseline is stronger. Unseen future labels dominate the single-episode chronological test. |
|
| 30 |
+
| `hand_trajectory_forecast` | A | A:direct, C:proxy | 0.8223 MPJPE | 0.1116 MPJPE | Neural MLP is stronger. Forecasting is window-level and not yet a full sequence or policy model. |
|
| 31 |
+
| `contact_prediction` | A | A:direct, C:proxy | 1.0000 macro-F1 | 1.0000 macro-F1 | Both baselines are tied. The public sample is degenerate for this target because one class dominates. |
|
| 32 |
+
| `object_relevance` | C | C:direct, A:proxy, D:proxy | 0.1839 micro-F1 | 0.1798 micro-F1 | Minimal baseline is stronger. Object labels are language-derived and sparse in one episode. |
|
| 33 |
+
| `caption_grounding` | C | C:direct, D:proxy | 0.0172 MRR | 0.0178 MRR | Neural MLP is stronger. Bag-of-objects language features are too weak for rich grounding. |
|
| 34 |
+
| `cross_modal_retrieval` | C | C:diagnostic, B:proxy, D:proxy | 0.2634 MRR | 0.1530 MRR | Minimal baseline is stronger. Retrieval proves alignment signal, not geometric reconstruction. |
|
| 35 |
+
| `modality_reconstruction` | B | B:proxy, D:proxy | -0.0160 R2 | -0.0102 R2 | Neural MLP is stronger. Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction. |
|
| 36 |
+
| `temporal_order` | C | C:diagnostic, D:diagnostic | 0.5487 F1 | 0.8718 F1 | Neural MLP is stronger. Only local adjacent ordering, not long-horizon causal modeling. |
|
| 37 |
+
| `misalignment_detection` | C | C:diagnostic, B:diagnostic, D:diagnostic | 0.4866 F1 | 0.7335 F1 | Neural MLP is stronger. Synthetic shifts diagnose alignment but do not solve calibration or mapping. |
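
Because the table mixes higher-is-better metrics (macro-F1, micro-F1, MRR, R2, F1) with a lower-is-better one (MPJPE), the "stronger" label depends on the metric direction. The following sketch shows one way such a comparison could be made; the function name and the exact-equality tie rule are assumptions, and the generating script may use a different rule (for example a tolerance).

```python
def better_baseline(minimal, neural_mlp, direction):
    """Pick the stronger baseline for one task.

    `direction` is "higher" or "lower", matching the metric's optimization sense.
    Ties are decided by exact equality, which is an assumption of this sketch.
    """
    if direction not in ("higher", "lower"):
        raise ValueError(f"unknown metric direction: {direction!r}")
    if minimal == neural_mlp:
        return "tie"
    if direction == "higher":
        minimal_wins = minimal > neural_mlp
    else:
        minimal_wins = minimal < neural_mlp
    return "minimal" if minimal_wins else "neural_mlp"


# Rounded values from the table above.
print(better_baseline(0.0500, 0.0263, "higher"))    # timeline_action, macro-F1
print(better_baseline(0.8223, 0.1116, "lower"))     # hand_trajectory_forecast, MPJPE
print(better_baseline(1.0, 1.0, "higher"))          # contact_prediction, macro-F1
print(better_baseline(-0.0160, -0.0102, "higher"))  # modality_reconstruction, R2
```

Note that a negative R2 for both baselines means neither beats predicting the mean, so "stronger" there only means "less bad".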
|
| 38 |
+
|
| 39 |
+
## Next-Step Interpretation
|
| 40 |
+
|
| 41 |
+
### A. Human Modeling & Motion Understanding
|
| 42 |
+
|
| 43 |
+
The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors.
|
| 44 |
+
|
| 45 |
+
- Add SMPL/SMPL-X or MANO-style body/hand parameter targets where available.
|
| 46 |
+
- Train sequence models over multi-episode motion trajectories instead of isolated windows.
|
| 47 |
+
- Evaluate affordance prediction on held-out objects and held-out episodes.
|
| 48 |
+
|
| 49 |
+
### B. 3D/4D Reconstruction & Neural Rendering
|
| 50 |
+
|
| 51 |
+
The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry.
|
| 52 |
+
|
| 53 |
+
- Use calibrated multi-view video plus SLAM pose to build per-episode camera trajectories.
|
| 54 |
+
- Add depth-supervised point clouds, TSDF, Gaussian Splatting, or NeRF baselines.
|
| 55 |
+
- Evaluate novel-view synthesis and temporal consistency across held-out views/time.
|
| 56 |
+
|
| 57 |
+
### C. Egocentric Vision & Interaction
|
| 58 |
+
|
| 59 |
+
Most of the 12 tasks directly target egocentric action, task state, interaction, grounding, and alignment.
|
| 60 |
+
|
| 61 |
+
- Move from single-episode chronological splits to held-out-episode splits.
|
| 62 |
+
- Add audio features and stronger multimodal backbones for action, intent, and grounding.
|
| 63 |
+
- Evaluate long-horizon task success prediction and action-conditioned generation.
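
To illustrate the first bullet, here is a toy sketch contrasting a chronological split within one episode with a held-out-episode split. All labels, episode IDs, and function names are hypothetical; the public sample contains a single episode, so the second split only becomes possible once more episodes are available.

```python
def chronological_split(items, n_train):
    """Split one episode in time order: the first n_train windows train, the rest test."""
    return list(items[:n_train]), list(items[n_train:])


def unseen_test_labels(train_labels, test_labels):
    """Labels that appear in the test portion but never in the training portion."""
    return sorted(set(test_labels) - set(train_labels))


def held_out_episode_split(episode_ids, test_episodes):
    """Return (train_idx, test_idx) such that no episode contributes to both sides."""
    train_idx = [i for i, ep in enumerate(episode_ids) if ep not in test_episodes]
    test_idx = [i for i, ep in enumerate(episode_ids) if ep in test_episodes]
    return train_idx, test_idx


# Toy action labels for ten consecutive windows of one episode (made up for illustration).
episode_labels = ["reach", "reach", "grasp", "grasp", "pour",
                  "pour", "stir", "stir", "place", "place"]
train, test = chronological_split(episode_labels, n_train=6)
print(unseen_test_labels(train, test))  # ['place', 'stir']

# With several episodes, whole episodes are held out instead of a time suffix.
window_episode_ids = ["ep1", "ep1", "ep2", "ep2", "ep3"]
print(held_out_episode_split(window_episode_ids, {"ep3"}))  # ([0, 1, 2, 3], [4])
```

In the toy chronological split every test label is absent from training, which is the failure mode the task readouts describe for the single-episode action and subtask tasks.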
|
| 64 |
+
|
| 65 |
+
### D. Scene Reconstruction & World Modeling
|
| 66 |
+
|
| 67 |
+
The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs.
|
| 68 |
+
|
| 69 |
+
- Convert windows into persistent object/scene-state nodes with timestamps and camera poses.
|
| 70 |
+
- Add map consistency, object permanence, and spatial relation prediction tasks.
|
| 71 |
+
- Train held-out-episode world models that predict future observations and task state.
|
artifacts/episode_task_suite/research_directions/research_direction_task_map.csv
ADDED
|
@@ -0,0 +1,28 @@
|
|
| 1 |
+
direction,direction_name,task,task_name,family,relationship,primary_direction,metric_name,minimal_metric,neural_mlp_metric,better_baseline,why,current_limit
|
| 2 |
+
C,Egocentric Vision & Interaction,timeline_action,Timeline action recognition,supervised,direct,C,macro-F1,0.05,0.0263157894737,minimal,Reads egocentric sensor state as the current human action; also provides a weak human-motion readout.,Chronological single-episode split creates unseen future action classes.
|
| 3 |
+
A,Human Modeling & Motion Understanding,timeline_action,Timeline action recognition,supervised,proxy,C,macro-F1,0.05,0.0263157894737,minimal,Reads egocentric sensor state as the current human action; also provides a weak human-motion readout.,Chronological single-episode split creates unseen future action classes.
|
| 4 |
+
C,Egocentric Vision & Interaction,timeline_subtask,Timeline subtask recognition,supervised,direct,C,macro-F1,0.0495412112118,0.0175182481752,minimal,Segments egocentric task state and provides a first proxy for symbolic world/task state.,Single-episode ordering makes future subtasks hard to generalize.
|
| 5 |
+
D,Scene Reconstruction & World Modeling,timeline_subtask,Timeline subtask recognition,supervised,proxy,C,macro-F1,0.0495412112118,0.0175182481752,minimal,Segments egocentric task state and provides a first proxy for symbolic world/task state.,Single-episode ordering makes future subtasks hard to generalize.
|
| 6 |
+
C,Egocentric Vision & Interaction,transition_detection,Action transition detection,diagnostic,direct,C,macro-F1,0.655182926829,0.648484848485,minimal,Localizes egocentric task boundaries and diagnoses temporal state changes.,"Boundary class is sparse, so accuracy alone is misleading."
|
| 7 |
+
D,Scene Reconstruction & World Modeling,transition_detection,Action transition detection,diagnostic,diagnostic,C,macro-F1,0.655182926829,0.648484848485,minimal,Localizes egocentric task boundaries and diagnoses temporal state changes.,"Boundary class is sparse, so accuracy alone is misleading."
|
| 8 |
+
C,Egocentric Vision & Interaction,next_action,Short-horizon next action,supervised,direct,C,macro-F1,0.0592592592593,0.0235294117647,minimal,Tests action intention/task-flow prediction from egocentric context.,Unseen future labels dominate the single-episode chronological test.
|
| 9 |
+
D,Scene Reconstruction & World Modeling,next_action,Short-horizon next action,supervised,proxy,C,macro-F1,0.0592592592593,0.0235294117647,minimal,Tests action intention/task-flow prediction from egocentric context.,Unseen future labels dominate the single-episode chronological test.
|
| 10 |
+
A,Human Modeling & Motion Understanding,hand_trajectory_forecast,Hand trajectory forecasting,forecast,direct,A,MPJPE,0.822264492512,0.11163123697,neural_mlp,Directly predicts human hand motion and supports hand-object interaction modeling.,Forecasting is window-level and not yet a full sequence or policy model.
|
| 11 |
+
C,Egocentric Vision & Interaction,hand_trajectory_forecast,Hand trajectory forecasting,forecast,proxy,A,MPJPE,0.822264492512,0.11163123697,neural_mlp,Directly predicts human hand motion and supports hand-object interaction modeling.,Forecasting is window-level and not yet a full sequence or policy model.
|
| 12 |
+
A,Human Modeling & Motion Understanding,contact_prediction,Body/object contact prediction,supervised,direct,A,macro-F1,1,1,tie,"Targets physical interaction state, a core affordance and manipulation signal.",The public sample is degenerate for this target because one class dominates.
|
| 13 |
+
C,Egocentric Vision & Interaction,contact_prediction,Body/object contact prediction,supervised,proxy,A,macro-F1,1,1,tie,"Targets physical interaction state, a core affordance and manipulation signal.",The public sample is degenerate for this target because one class dominates.
|
| 14 |
+
C,Egocentric Vision & Interaction,object_relevance,Relevant object set prediction,supervised,direct,C,micro-F1,0.183930300097,0.179758308157,minimal,Connects egocentric activity to manipulated objects and early object-centric state.,Object labels are language-derived and sparse in one episode.
|
| 15 |
+
A,Human Modeling & Motion Understanding,object_relevance,Relevant object set prediction,supervised,proxy,C,micro-F1,0.183930300097,0.179758308157,minimal,Connects egocentric activity to manipulated objects and early object-centric state.,Object labels are language-derived and sparse in one episode.
|
| 16 |
+
D,Scene Reconstruction & World Modeling,object_relevance,Relevant object set prediction,supervised,proxy,C,micro-F1,0.183930300097,0.179758308157,minimal,Connects egocentric activity to manipulated objects and early object-centric state.,Object labels are language-derived and sparse in one episode.
|
| 17 |
+
C,Egocentric Vision & Interaction,caption_grounding,Caption-to-window grounding,retrieval,direct,C,MRR,0.0171839460838,0.0178111116104,neural_mlp,Grounds language annotation into egocentric sensor time and task state.,Bag-of-objects language features are too weak for rich grounding.
|
| 18 |
+
D,Scene Reconstruction & World Modeling,caption_grounding,Caption-to-window grounding,retrieval,proxy,C,MRR,0.0171839460838,0.0178111116104,neural_mlp,Grounds language annotation into egocentric sensor time and task state.,Bag-of-objects language features are too weak for rich grounding.
|
| 19 |
+
C,Egocentric Vision & Interaction,cross_modal_retrieval,Cross-modal retrieval,retrieval,diagnostic,C,MRR,0.263359840066,0.15300700222,minimal,"Tests whether synchronized modalities identify the same 4D moment, a prerequisite for reconstruction and world modeling.","Retrieval proves alignment signal, not geometric reconstruction."
|
| 20 |
+
B,3D/4D Reconstruction & Neural Rendering,cross_modal_retrieval,Cross-modal retrieval,retrieval,proxy,C,MRR,0.263359840066,0.15300700222,minimal,"Tests whether synchronized modalities identify the same 4D moment, a prerequisite for reconstruction and world modeling.","Retrieval proves alignment signal, not geometric reconstruction."
|
| 21 |
+
D,Scene Reconstruction & World Modeling,cross_modal_retrieval,Cross-modal retrieval,retrieval,proxy,C,MRR,0.263359840066,0.15300700222,minimal,"Tests whether synchronized modalities identify the same 4D moment, a prerequisite for reconstruction and world modeling.","Retrieval proves alignment signal, not geometric reconstruction."
|
| 22 |
+
B,3D/4D Reconstruction & Neural Rendering,modality_reconstruction,Modality reconstruction,forecast,proxy,B,R2,-0.0160228467711,-0.0101981718914,neural_mlp,Predicts visual/depth state from non-target sensors as a weak reconstruction/world-model objective.,"Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction."
|
| 23 |
+
D,Scene Reconstruction & World Modeling,modality_reconstruction,Modality reconstruction,forecast,proxy,B,R2,-0.0160228467711,-0.0101981718914,neural_mlp,Predicts visual/depth state from non-target sensors as a weak reconstruction/world-model objective.,"Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction."
|
| 24 |
+
C,Egocentric Vision & Interaction,temporal_order,Temporal order verification,diagnostic,diagnostic,C,F1,0.548736462094,0.871794871795,neural_mlp,Checks whether features encode local time direction and task progression.,"Only local adjacent ordering, not long-horizon causal modeling."
|
| 25 |
+
D,Scene Reconstruction & World Modeling,temporal_order,Temporal order verification,diagnostic,diagnostic,C,F1,0.548736462094,0.871794871795,neural_mlp,Checks whether features encode local time direction and task progression.,"Only local adjacent ordering, not long-horizon causal modeling."
|
| 26 |
+
C,Egocentric Vision & Interaction,misalignment_detection,Cross-modal misalignment detection,diagnostic,diagnostic,C,F1,0.486567164179,0.733524355301,neural_mlp,"Detects temporal desynchronization, a key data-quality gate for multimodal reconstruction and world models.",Synthetic shifts diagnose alignment but do not solve calibration or mapping.
|
| 27 |
+
B,3D/4D Reconstruction & Neural Rendering,misalignment_detection,Cross-modal misalignment detection,diagnostic,diagnostic,C,F1,0.486567164179,0.733524355301,neural_mlp,"Detects temporal desynchronization, a key data-quality gate for multimodal reconstruction and world models.",Synthetic shifts diagnose alignment but do not solve calibration or mapping.
|
| 28 |
+
D,Scene Reconstruction & World Modeling,misalignment_detection,Cross-modal misalignment detection,diagnostic,diagnostic,C,F1,0.486567164179,0.733524355301,neural_mlp,"Detects temporal desynchronization, a key data-quality gate for multimodal reconstruction and world models.",Synthetic shifts diagnose alignment but do not solve calibration or mapping.
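
The CSV holds one row per task-direction link (27 data rows for 12 tasks), so a task's metrics repeat on every row it appears in. A reader who wants one row per task can keep only the rows where `direction` equals `primary_direction`. The sketch below shows this with Python's standard `csv` module on a three-row excerpt; the helper name `primary_rows` is hypothetical.

```python
import csv
import io

# Header plus three rows copied from the CSV above (two of them share one task).
CSV_EXCERPT = '''direction,direction_name,task,task_name,family,relationship,primary_direction,metric_name,minimal_metric,neural_mlp_metric,better_baseline,why,current_limit
A,Human Modeling & Motion Understanding,hand_trajectory_forecast,Hand trajectory forecasting,forecast,direct,A,MPJPE,0.822264492512,0.11163123697,neural_mlp,Directly predicts human hand motion and supports hand-object interaction modeling.,Forecasting is window-level and not yet a full sequence or policy model.
C,Egocentric Vision & Interaction,transition_detection,Action transition detection,diagnostic,direct,C,macro-F1,0.655182926829,0.648484848485,minimal,Localizes egocentric task boundaries and diagnoses temporal state changes.,"Boundary class is sparse, so accuracy alone is misleading."
D,Scene Reconstruction & World Modeling,transition_detection,Action transition detection,diagnostic,diagnostic,C,macro-F1,0.655182926829,0.648484848485,minimal,Localizes egocentric task boundaries and diagnoses temporal state changes.,"Boundary class is sparse, so accuracy alone is misleading."
'''


def primary_rows(csv_text):
    """Keep one row per task: the row whose direction is the task's primary direction."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["direction"] == row["primary_direction"]]


rows = primary_rows(CSV_EXCERPT)
for row in rows:
    # Metric columns are stored as text and need an explicit float conversion.
    print(row["task"], row["metric_name"],
          float(row["minimal_metric"]), float(row["neural_mlp_metric"]))
```

`csv.DictReader` also handles the quoted `why` and `current_limit` fields that contain commas, which a plain `str.split(",")` would break apart.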
|
artifacts/episode_task_suite/research_directions/research_direction_taxonomy.json
ADDED
|
@@ -0,0 +1,384 @@
|
|
| 1 |
+
{
|
| 2 |
+
"source": "results/episode_task_suite/summary_report.json",
|
| 3 |
+
"dataset_scope": {
|
| 4 |
+
"sample_episode_count": 1,
|
| 5 |
+
"num_frames": 5821,
|
| 6 |
+
"num_windows": 1161,
|
| 7 |
+
"feature_dim": 8378,
|
| 8 |
+
"warning": "Single public sample episode; this supports pipeline/task evidence, not cross-episode generalization claims."
|
| 9 |
+
},
|
| 10 |
+
"baselines": {
|
| 11 |
+
"minimal": "Interpretable softmax, logistic, ridge, and retrieval heads over the 8,378-d window feature vector.",
|
| 12 |
+
"neural_mlp": "Small PyTorch MLP classifiers/regressors using the same features, splits, and task contracts."
|
| 13 |
+
},
|
| 14 |
+
"directions": {
|
| 15 |
+
"A": {
|
| 16 |
+
"id": "human_motion",
|
| 17 |
+
"name": "Human Modeling & Motion Understanding",
|
| 18 |
+
"focus": "Human/hand/body motion, deformation priors, human-object interaction, affordance modeling.",
|
| 19 |
+
"preferred_background": "Human pose/shape estimation, SMPL-style models, motion capture, or motion generation.",
|
| 20 |
+
"current_status": "partially implemented",
|
| 21 |
+
"current_readout": "The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors.",
|
| 22 |
+
"next_steps": [
|
| 23 |
+
"Add SMPL/SMPL-X or MANO-style body/hand parameter targets where available.",
|
| 24 |
+
"Train sequence models over multi-episode motion trajectories instead of isolated windows.",
|
| 25 |
+
"Evaluate affordance prediction on held-out objects and held-out episodes."
|
| 26 |
+
],
|
| 27 |
+
"tasks": [
|
| 28 |
+
"timeline_action",
|
| 29 |
+
"hand_trajectory_forecast",
|
| 30 |
+
"contact_prediction",
|
| 31 |
+
"object_relevance"
|
| 32 |
+
],
|
| 33 |
+
"counts": {
|
| 34 |
+
"direct": 2,
|
| 35 |
+
"proxy": 2,
|
| 36 |
+
"diagnostic": 0,
|
| 37 |
+
"total_links": 4
|
| 38 |
+
}
|
| 39 |
+
},
|
| 40 |
+
"B": {
|
| 41 |
+
"id": "reconstruction_rendering",
|
| 42 |
+
"name": "3D/4D Reconstruction & Neural Rendering",
|
| 43 |
+
"focus": "Multi-view dynamic scene reconstruction, NeRF/Gaussian Splatting, novel-view synthesis.",
|
| 44 |
+
"preferred_background": "3D reconstruction, neural rendering, camera calibration, and bundle adjustment.",
|
| 45 |
+
"current_status": "proxy tasks only",
|
| 46 |
+
"current_readout": "The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry.",
|
| 47 |
+
"next_steps": [
|
| 48 |
+
"Use calibrated multi-view video plus SLAM pose to build per-episode camera trajectories.",
|
| 49 |
+
"Add depth-supervised point clouds, TSDF, Gaussian Splatting, or NeRF baselines.",
|
| 50 |
+
"Evaluate novel-view synthesis and temporal consistency across held-out views/time."
|
| 51 |
+
],
|
| 52 |
+
"tasks": [
|
| 53 |
+
"cross_modal_retrieval",
|
| 54 |
+
"modality_reconstruction",
|
| 55 |
+
"misalignment_detection"
|
| 56 |
+
],
|
| 57 |
+
"counts": {
|
| 58 |
+
"direct": 0,
|
| 59 |
+
"proxy": 2,
|
| 60 |
+
"diagnostic": 1,
|
| 61 |
+
"total_links": 3
|
| 62 |
+
}
|
| 63 |
+
},
|
| 64 |
+
"C": {
|
| 65 |
+
"id": "egocentric_interaction",
|
| 66 |
+
"name": "Egocentric Vision & Interaction",
|
| 67 |
+
"focus": "Egocentric action and intention understanding, hand-object interaction, gaze/attention modeling, task structure modeling.",
|
| 68 |
+
"preferred_background": "Video understanding, action recognition, or egocentric vision.",
|
| 69 |
+
"current_status": "strongest implemented track",
|
| 70 |
+
"current_readout": "Most of the 12 tasks directly target egocentric action, task state, interaction, grounding, and alignment.",
|
| 71 |
+
"next_steps": [
|
| 72 |
+
"Move from single-episode chronological splits to held-out-episode splits.",
|
| 73 |
+
"Add audio features and stronger multimodal backbones for action, intent, and grounding.",
|
| 74 |
+
"Evaluate long-horizon task success prediction and action-conditioned generation."
|
| 75 |
+
],
|
| 76 |
+
"tasks": [
|
| 77 |
+
"timeline_action",
|
| 78 |
+
"timeline_subtask",
|
| 79 |
+
"transition_detection",
|
| 80 |
+
"next_action",
|
| 81 |
+
"hand_trajectory_forecast",
|
| 82 |
+
"contact_prediction",
|
| 83 |
+
"object_relevance",
|
| 84 |
+
"caption_grounding",
|
| 85 |
+
"cross_modal_retrieval",
|
| 86 |
+
"temporal_order",
|
| 87 |
+
"misalignment_detection"
|
| 88 |
+
],
|
| 89 |
+
"counts": {
|
| 90 |
+
"direct": 6,
|
| 91 |
+
"proxy": 2,
|
| 92 |
+
"diagnostic": 3,
|
| 93 |
+
"total_links": 11
|
| 94 |
+
}
|
| 95 |
+
},
|
| 96 |
+
"D": {
|
| 97 |
+
"id": "world_modeling",
|
| 98 |
+
"name": "Scene Reconstruction & World Modeling",
|
| 99 |
+
"focus": "Long-term consistent 3D/4D scene mapping, scene graphs, object- and space-centric representations, spatial reasoning.",
|
| 100 |
+
"preferred_background": "Large-scale mapping, semantic reconstruction, or agent world models.",
|
| 101 |
+
"current_status": "early proxy tasks",
|
| 102 |
+
"current_readout": "The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs.",
|
| 103 |
+
"next_steps": [
|
| 104 |
+
"Convert windows into persistent object/scene-state nodes with timestamps and camera poses.",
|
| 105 |
+
"Add map consistency, object permanence, and spatial relation prediction tasks.",
|
| 106 |
+
"Train held-out-episode world models that predict future observations and task state."
|
| 107 |
+
],
|
| 108 |
+
"tasks": [
|
| 109 |
+
"timeline_subtask",
|
| 110 |
+
"transition_detection",
|
| 111 |
+
"next_action",
|
| 112 |
+
"object_relevance",
|
| 113 |
+
"caption_grounding",
|
| 114 |
+
"cross_modal_retrieval",
|
| 115 |
+
"modality_reconstruction",
|
| 116 |
+
"temporal_order",
|
| 117 |
+
"misalignment_detection"
|
| 118 |
+
],
|
| 119 |
+
"counts": {
|
| 120 |
+
"direct": 0,
|
| 121 |
+
"proxy": 6,
|
| 122 |
+
"diagnostic": 3,
|
| 123 |
+
"total_links": 9
|
| 124 |
+
}
|
| 125 |
+
}
|
| 126 |
+
},
|
| 127 |
+
"tasks": {
|
| 128 |
+
"timeline_action": {
|
| 129 |
+
"name": "Timeline action recognition",
|
| 130 |
+
"family": "supervised",
|
| 131 |
+
"input": "all featurized modalities",
|
| 132 |
+
"output": "current action label",
|
| 133 |
+
"primary_direction": "C",
|
| 134 |
+
"direction_roles": {
|
| 135 |
+
"C": "direct",
|
| 136 |
+
"A": "proxy"
|
| 137 |
+
},
|
| 138 |
+
"why": "Reads egocentric sensor state as the current human action; also provides a weak human-motion readout.",
|
| 139 |
+
"current_limit": "Chronological single-episode split creates unseen future action classes.",
|
| 140 |
+
"metric": {
|
| 141 |
+
"key": "macro_f1",
|
| 142 |
+
"name": "macro-F1",
|
| 143 |
+
"direction": "higher",
|
| 144 |
+
"minimal": 0.05,
|
| 145 |
+
"neural_mlp": 0.02631578947368421,
|
| 146 |
+
"better_baseline": "minimal"
|
| 147 |
+
}
|
| 148 |
+
},
|
| 149 |
+
"timeline_subtask": {
|
| 150 |
+
"name": "Timeline subtask recognition",
|
| 151 |
+
"family": "supervised",
|
| 152 |
+
"input": "all featurized modalities",
|
| 153 |
+
"output": "current subtask label",
|
| 154 |
+
"primary_direction": "C",
|
| 155 |
+
"direction_roles": {
|
| 156 |
+
"C": "direct",
|
| 157 |
+
"D": "proxy"
|
| 158 |
+
},
|
| 159 |
+
"why": "Segments egocentric task state and provides a first proxy for symbolic world/task state.",
|
| 160 |
+
"current_limit": "Single-episode ordering makes future subtasks hard to generalize.",
|
| 161 |
+
"metric": {
|
| 162 |
+
"key": "macro_f1",
|
| 163 |
+
"name": "macro-F1",
|
| 164 |
+
"direction": "higher",
|
| 165 |
+
"minimal": 0.04954121121178666,
|
| 166 |
+
"neural_mlp": 0.017518248175182476,
|
| 167 |
+
"better_baseline": "minimal"
|
| 168 |
+
}
|
| 169 |
+
},
|
| 170 |
+
"transition_detection": {
|
| 171 |
+
"name": "Action transition detection",
|
| 172 |
+
"family": "diagnostic",
|
| 173 |
+
"input": "all featurized modalities",
|
| 174 |
+
"output": "boundary vs steady state",
|
| 175 |
+
"primary_direction": "C",
|
| 176 |
+
"direction_roles": {
|
| 177 |
+
"C": "direct",
|
| 178 |
+
"D": "diagnostic"
|
| 179 |
+
},
|
| 180 |
+
"why": "Localizes egocentric task boundaries and diagnoses temporal state changes.",
|
| 181 |
+
"current_limit": "Boundary class is sparse, so accuracy alone is misleading.",
|
| 182 |
+
"metric": {
|
| 183 |
+
"key": "macro_f1",
|
| 184 |
+
"name": "macro-F1",
|
| 185 |
+
"direction": "higher",
|
| 186 |
+
"minimal": 0.6551829268292684,
|
| 187 |
+
"neural_mlp": 0.6484848484848484,
|
| 188 |
+
"better_baseline": "minimal"
|
| 189 |
+
}
|
| 190 |
+
},
|
| 191 |
+
"next_action": {
|
| 192 |
+
"name": "Short-horizon next action",
|
| 193 |
+
"family": "supervised",
|
| 194 |
+
"input": "current multimodal window",
|
| 195 |
+
"output": "action 20 frames later",
|
| 196 |
+
"primary_direction": "C",
|
| 197 |
+
"direction_roles": {
|
| 198 |
+
"C": "direct",
|
| 199 |
+
"D": "proxy"
|
| 200 |
+
},
|
| 201 |
+
"why": "Tests action intention/task-flow prediction from egocentric context.",
|
| 202 |
+
"current_limit": "Unseen future labels dominate the single-episode chronological test.",
|
| 203 |
+
"metric": {
|
| 204 |
+
"key": "macro_f1",
|
| 205 |
+
"name": "macro-F1",
|
| 206 |
+
"direction": "higher",
|
| 207 |
+
"minimal": 0.05925925925925927,
|
| 208 |
+
"neural_mlp": 0.023529411764705882,
|
| 209 |
+
"better_baseline": "minimal"
|
| 210 |
+
}
|
| 211 |
+
},
|
| 212 |
+
"hand_trajectory_forecast": {
|
| 213 |
+
"name": "Hand trajectory forecasting",
|
| 214 |
+
"family": "forecast",
|
| 215 |
+
"input": "current multimodal window",
|
| 216 |
+
"output": "future left/right hand 3D joints",
|
| 217 |
+
"primary_direction": "A",
|
| 218 |
+
"direction_roles": {
|
| 219 |
+
"A": "direct",
|
| 220 |
+
"C": "proxy"
|
| 221 |
+
},
|
| 222 |
+
"why": "Directly predicts human hand motion and supports hand-object interaction modeling.",
|
| 223 |
+
"current_limit": "Forecasting is window-level and not yet a full sequence or policy model.",
|
| 224 |
+
"metric": {
|
| 225 |
+
"key": "mpjpe",
|
| 226 |
+
"name": "MPJPE",
|
| 227 |
+
"direction": "lower",
|
| 228 |
+
"minimal": 0.8222644925117493,
|
| 229 |
+
"neural_mlp": 0.11163123697042465,
|
| 230 |
+
"better_baseline": "neural_mlp"
|
| 231 |
+
}
|
| 232 |
+
},
|
| 233 |
+
"contact_prediction": {
|
| 234 |
+
"name": "Body/object contact prediction",
|
| 235 |
+
"family": "supervised",
|
| 236 |
+
"input": "non-contact/non-caption features",
|
| 237 |
+
"output": "binary contact label",
|
| 238 |
+
"primary_direction": "A",
|
| 239 |
+
"direction_roles": {
|
| 240 |
+
"A": "direct",
|
| 241 |
+
"C": "proxy"
|
| 242 |
+
},
|
| 243 |
+
"why": "Targets physical interaction state, a core affordance and manipulation signal.",
|
| 244 |
+
"current_limit": "The public sample is degenerate for this target because one class dominates.",
|
| 245 |
+
"metric": {
|
| 246 |
+
"key": "macro_f1",
|
| 247 |
+
"name": "macro-F1",
|
| 248 |
+
"direction": "higher",
|
| 249 |
+
"minimal": 1.0,
|
| 250 |
+
"neural_mlp": 1.0,
|
| 251 |
+
"better_baseline": "tie"
|
| 252 |
+
}
|
| 253 |
+
},
|
| 254 |
+
"object_relevance": {
|
| 255 |
+
"name": "Relevant object set prediction",
|
| 256 |
+
"family": "supervised",
|
| 257 |
+
"input": "non-caption feature blocks",
|
| 258 |
+
"output": "multi-label object set",
|
| 259 |
+
"primary_direction": "C",
|
| 260 |
+
"direction_roles": {
|
| 261 |
+
"C": "direct",
|
| 262 |
+
"A": "proxy",
|
| 263 |
+
"D": "proxy"
|
| 264 |
+
},
|
| 265 |
+
"why": "Connects egocentric activity to manipulated objects and early object-centric state.",
|
| 266 |
+
"current_limit": "Object labels are language-derived and sparse in one episode.",
|
| 267 |
+
"metric": {
|
| 268 |
+
"key": "micro_f1",
|
| 269 |
+
"name": "micro-F1",
|
| 270 |
+
"direction": "higher",
|
| 271 |
+
"minimal": 0.18393030009680542,
|
| 272 |
+
"neural_mlp": 0.1797583081570997,
|
| 273 |
+
"better_baseline": "minimal"
|
| 274 |
+
}
|
| 275 |
+
},
|
| 276 |
+
"caption_grounding": {
|
| 277 |
+
"name": "Caption-to-window grounding",
|
| 278 |
+
"family": "retrieval",
|
| 279 |
+
"input": "caption objects/interaction query and candidate sensor windows",
|
| 280 |
+
"output": "matching time window",
|
| 281 |
+
"primary_direction": "C",
|
| 282 |
+
"direction_roles": {
|
| 283 |
+
"C": "direct",
|
| 284 |
+
"D": "proxy"
|
| 285 |
+
},
|
| 286 |
+
"why": "Grounds language annotation into egocentric sensor time and task state.",
|
| 287 |
+
"current_limit": "Bag-of-objects language features are too weak for rich grounding.",
|
| 288 |
+
"metric": {
|
| 289 |
+
"key": "mrr",
|
| 290 |
+
"name": "MRR",
|
| 291 |
+
"direction": "higher",
|
| 292 |
+
"minimal": 0.017183946083791223,
|
| 293 |
+
"neural_mlp": 0.01781111161035397,
|
| 294 |
+
"better_baseline": "neural_mlp"
|
| 295 |
+
}
|
| 296 |
+
},
|
| 297 |
+
"cross_modal_retrieval": {
|
| 298 |
+
"name": "Cross-modal retrieval",
|
| 299 |
+
"family": "retrieval",
|
| 300 |
+
"input": "motion/IMU/camera query",
|
| 301 |
+
"output": "matching depth/video window",
|
| 302 |
+
"primary_direction": "C",
|
| 303 |
+
"direction_roles": {
|
| 304 |
+
"C": "diagnostic",
|
| 305 |
+
"B": "proxy",
|
| 306 |
+
"D": "proxy"
|
| 307 |
+
},
|
| 308 |
+
"why": "Tests whether synchronized modalities identify the same 4D moment, a prerequisite for reconstruction and world modeling.",
|
| 309 |
+
"current_limit": "Retrieval proves alignment signal, not geometric reconstruction.",
|
| 310 |
+
"metric": {
|
| 311 |
+
"key": "mrr",
|
| 312 |
+
"name": "MRR",
|
| 313 |
+
"direction": "higher",
|
| 314 |
+
"minimal": 0.26335984006618296,
|
| 315 |
+
"neural_mlp": 0.1530070022204131,
|
| 316 |
+
"better_baseline": "minimal"
|
| 317 |
+
}
|
| 318 |
+
},
|
| 319 |
+
"modality_reconstruction": {
|
| 320 |
+
"name": "Modality reconstruction",
|
| 321 |
+
"family": "forecast",
|
| 322 |
+
"input": "motion/IMU/camera",
|
| 323 |
+
"output": "depth/video feature vector",
|
| 324 |
+
"primary_direction": "B",
|
| 325 |
+
"direction_roles": {
|
| 326 |
+
"B": "proxy",
|
| 327 |
+
"D": "proxy"
|
| 328 |
+
},
|
| 329 |
+
"why": "Predicts visual/depth state from non-target sensors as a weak reconstruction/world-model objective.",
|
| 330 |
+
"current_limit": "Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction.",
|
| 331 |
+
"metric": {
|
| 332 |
+
"key": "r2",
|
| 333 |
+
"name": "R2",
|
| 334 |
+
"direction": "higher",
|
| 335 |
+
"minimal": -0.016022846771134747,
|
| 336 |
+
"neural_mlp": -0.010198171891414143,
|
| 337 |
+
"better_baseline": "neural_mlp"
|
| 338 |
+
}
|
| 339 |
+
},
|
| 340 |
+
"temporal_order": {
|
| 341 |
+
"name": "Temporal order verification",
|
| 342 |
+
"family": "diagnostic",
|
| 343 |
+
"input": "two adjacent windows",
|
| 344 |
+
"output": "correct vs reversed order",
|
| 345 |
+
"primary_direction": "C",
|
| 346 |
+
"direction_roles": {
|
| 347 |
+
"C": "diagnostic",
|
| 348 |
+
"D": "diagnostic"
|
| 349 |
+
},
|
| 350 |
+
"why": "Checks whether features encode local time direction and task progression.",
|
| 351 |
+
"current_limit": "Only local adjacent ordering, not long-horizon causal modeling.",
|
| 352 |
+
"metric": {
|
| 353 |
+
"key": "f1",
|
| 354 |
+
"name": "F1",
|
| 355 |
+
"direction": "higher",
|
| 356 |
+
"minimal": 0.5487364620938628,
|
| 357 |
+
"neural_mlp": 0.8717948717948718,
|
| 358 |
+
"better_baseline": "neural_mlp"
|
| 359 |
+
}
|
| 360 |
+
},
|
| 361 |
+
"misalignment_detection": {
|
| 362 |
+
"name": "Cross-modal misalignment detection",
|
| 363 |
+
"family": "diagnostic",
|
| 364 |
+
"input": "motion plus visual/depth pair",
|
| 365 |
+
"output": "aligned vs shifted",
|
| 366 |
+
"primary_direction": "C",
|
| 367 |
+
"direction_roles": {
|
| 368 |
+
"C": "diagnostic",
|
| 369 |
+
"B": "diagnostic",
|
| 370 |
+
"D": "diagnostic"
|
| 371 |
+
},
|
| 372 |
+
"why": "Detects temporal desynchronization, a key data-quality gate for multimodal reconstruction and world models.",
|
| 373 |
+
"current_limit": "Synthetic shifts diagnose alignment but do not solve calibration or mapping.",
|
| 374 |
+
"metric": {
|
| 375 |
+
"key": "f1",
|
| 376 |
+
"name": "F1",
|
| 377 |
+
"direction": "higher",
|
| 378 |
+
"minimal": 0.4865671641791045,
|
| 379 |
+
"neural_mlp": 0.7335243553008597,
|
| 380 |
+
"better_baseline": "neural_mlp"
|
| 381 |
+
}
|
| 382 |
+
}
|
| 383 |
+
}
|
| 384 |
+
}
|
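The taxonomy JSON above stores a `better_baseline` label for every task, so a reader can summarize the comparison without recomputing any metric. The following is a minimal sketch of that readout, not part of the commit: `EXCERPT`, `tally_winners`, and `winners_by_direction` are hypothetical names, and the excerpt holds only four of the task records, with labels copied from the JSON above.

```python
import json
from collections import Counter

# Hypothetical excerpt: four task records whose labels are copied from the
# JSON above; the committed file holds all 12 tasks plus direction metadata.
EXCERPT = """
{
  "tasks": {
    "hand_trajectory_forecast": {"primary_direction": "A",
      "metric": {"name": "MPJPE", "better_baseline": "neural_mlp"}},
    "contact_prediction": {"primary_direction": "A",
      "metric": {"name": "macro-F1", "better_baseline": "tie"}},
    "cross_modal_retrieval": {"primary_direction": "C",
      "metric": {"name": "MRR", "better_baseline": "minimal"}},
    "modality_reconstruction": {"primary_direction": "B",
      "metric": {"name": "R2", "better_baseline": "neural_mlp"}}
  }
}
"""


def tally_winners(taxonomy: dict) -> Counter:
    """Count how many tasks each baseline wins, using the stored label."""
    return Counter(
        record["metric"]["better_baseline"]
        for record in taxonomy["tasks"].values()
    )


def winners_by_direction(taxonomy: dict) -> dict:
    """Group the stored winner labels by each task's primary direction."""
    grouped: dict = {}
    for record in taxonomy["tasks"].values():
        code = record["primary_direction"]
        label = record["metric"]["better_baseline"]
        grouped.setdefault(code, Counter())[label] += 1
    return grouped


taxonomy = json.loads(EXCERPT)
print(tally_winners(taxonomy))
print(winners_by_direction(taxonomy))
```

To run the same tally on the committed artifact, replace `json.loads(EXCERPT)` with a load of `research_direction_taxonomy.json`, assuming the file keeps the top-level `tasks` key shown above.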
assets/charts/research_direction_coverage.svg
ADDED
|
|
scripts/research_direction_taxonomy.py
ADDED
|
@@ -0,0 +1,589 @@
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""Organize the 12 Xperience-10M tasks into the four Ropedia research tracks.
|
| 3 |
+
|
| 4 |
+
The script is intentionally deterministic: it reads the committed task metrics,
|
| 5 |
+
adds a hand-audited taxonomy, and writes machine-readable artifacts used by the
|
| 6 |
+
README, website, and Hugging Face pages.
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
from __future__ import annotations
|
| 10 |
+
|
| 11 |
+
import csv
|
| 12 |
+
import html
|
| 13 |
+
import json
|
| 14 |
+
from collections import OrderedDict
|
| 15 |
+
from pathlib import Path
|
| 16 |
+
from typing import Any
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
ROOT = Path(__file__).resolve().parents[1]
|
| 20 |
+
RESULTS = ROOT / "results" / "episode_task_suite"
|
| 21 |
+
OUT_DIR = RESULTS / "research_directions"
|
| 22 |
+
DOCS_DATA = ROOT / "docs" / "data"
|
| 23 |
+
CHARTS = ROOT / "docs" / "assets" / "charts"
|
| 24 |
+
|
| 25 |
+
SUMMARY_REPORT = RESULTS / "summary_report.json"
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
DIRECTIONS: OrderedDict[str, dict[str, Any]] = OrderedDict(
|
| 29 |
+
[
|
| 30 |
+
(
|
| 31 |
+
"A",
|
| 32 |
+
{
|
| 33 |
+
"id": "human_motion",
|
| 34 |
+
"name": "Human Modeling & Motion Understanding",
|
| 35 |
+
"focus": "Human/hand/body motion, deformation priors, human-object interaction, affordance modeling.",
|
| 36 |
+
"preferred_background": "Human pose/shape estimation, SMPL-style models, motion capture, or motion generation.",
|
| 37 |
+
"current_status": "partially implemented",
|
| 38 |
+
"current_readout": "The sample supports hand trajectory forecasting and contact/object probes, but it does not yet include a full body/shape model or multi-person priors.",
|
| 39 |
+
"next_steps": [
|
| 40 |
+
"Add SMPL/SMPL-X or MANO-style body/hand parameter targets where available.",
|
| 41 |
+
"Train sequence models over multi-episode motion trajectories instead of isolated windows.",
|
| 42 |
+
"Evaluate affordance prediction on held-out objects and held-out episodes.",
|
| 43 |
+
],
|
| 44 |
+
},
|
| 45 |
+
),
|
| 46 |
+
(
|
| 47 |
+
"B",
|
| 48 |
+
{
|
| 49 |
+
"id": "reconstruction_rendering",
|
| 50 |
+
"name": "3D/4D Reconstruction & Neural Rendering",
|
| 51 |
+
"focus": "Multi-view dynamic scene reconstruction, NeRF/Gaussian Splatting, novel-view synthesis.",
|
| 52 |
+
"preferred_background": "3D reconstruction, neural rendering, camera calibration, and bundle adjustment.",
|
| 53 |
+
"current_status": "proxy tasks only",
|
| 54 |
+
"current_readout": "The current suite checks cross-modal alignment and depth/video reconstruction proxies; it does not yet train a renderer or reconstruct geometry.",
|
| 55 |
+
"next_steps": [
|
| 56 |
+
"Use calibrated multi-view video plus SLAM pose to build per-episode camera trajectories.",
|
| 57 |
+
"Add depth-supervised point clouds, TSDF, Gaussian Splatting, or NeRF baselines.",
|
| 58 |
+
"Evaluate novel-view synthesis and temporal consistency across held-out views/time.",
|
| 59 |
+
],
|
| 60 |
+
},
|
| 61 |
+
),
|
| 62 |
+
(
|
| 63 |
+
"C",
|
| 64 |
+
{
|
| 65 |
+
"id": "egocentric_interaction",
|
| 66 |
+
"name": "Egocentric Vision & Interaction",
|
| 67 |
+
"focus": "Egocentric action and intention understanding, hand-object interaction, gaze/attention modeling, task structure modeling.",
|
| 68 |
+
"preferred_background": "Video understanding, action recognition, or egocentric vision.",
|
| 69 |
+
"current_status": "strongest implemented track",
|
| 70 |
+
"current_readout": "Most of the 12 tasks directly target egocentric action, task state, interaction, grounding, and alignment.",
|
| 71 |
+
"next_steps": [
|
| 72 |
+
"Move from single-episode chronological splits to held-out-episode splits.",
|
| 73 |
+
"Add audio features and stronger multimodal backbones for action, intent, and grounding.",
|
| 74 |
+
"Evaluate long-horizon task success prediction and action-conditioned generation.",
|
| 75 |
+
],
|
| 76 |
+
},
|
| 77 |
+
),
|
| 78 |
+
(
|
| 79 |
+
"D",
|
| 80 |
+
{
|
| 81 |
+
"id": "world_modeling",
|
| 82 |
+
"name": "Scene Reconstruction & World Modeling",
|
| 83 |
+
"focus": "Long-term consistent 3D/4D scene mapping, scene graphs, object- and space-centric representations, spatial reasoning.",
|
| 84 |
+
"preferred_background": "Large-scale mapping, semantic reconstruction, or agent world models.",
|
| 85 |
+
"current_status": "early proxy tasks",
|
| 86 |
+
"current_readout": "The current tasks probe temporal structure, object relevance, cross-modal retrieval, and modality prediction, but they do not yet build persistent maps or scene graphs.",
|
| 87 |
+
"next_steps": [
|
| 88 |
+
"Convert windows into persistent object/scene-state nodes with timestamps and camera poses.",
|
| 89 |
+
"Add map consistency, object permanence, and spatial relation prediction tasks.",
|
| 90 |
+
"Train held-out-episode world models that predict future observations and task state.",
|
| 91 |
+
],
|
| 92 |
+
},
|
| 93 |
+
),
|
| 94 |
+
]
|
| 95 |
+
)
|
| 96 |
+
|
| 97 |
+
|
| 98 |
+
TASK_TAXONOMY: OrderedDict[str, dict[str, Any]] = OrderedDict(
|
| 99 |
+
[
|
| 100 |
+
(
|
| 101 |
+
"timeline_action",
|
| 102 |
+
{
|
| 103 |
+
"name": "Timeline action recognition",
|
| 104 |
+
"family": "supervised",
|
| 105 |
+
"input": "all featurized modalities",
|
| 106 |
+
"output": "current action label",
|
| 107 |
+
"primary_direction": "C",
|
| 108 |
+
"direction_roles": {"C": "direct", "A": "proxy"},
|
| 109 |
+
"why": "Reads egocentric sensor state as the current human action; also provides a weak human-motion readout.",
|
| 110 |
+
"current_limit": "Chronological single-episode split creates unseen future action classes.",
|
| 111 |
+
},
|
| 112 |
+
),
|
| 113 |
+
(
|
| 114 |
+
"timeline_subtask",
|
| 115 |
+
{
|
| 116 |
+
"name": "Timeline subtask recognition",
|
| 117 |
+
"family": "supervised",
|
| 118 |
+
"input": "all featurized modalities",
|
| 119 |
+
"output": "current subtask label",
|
| 120 |
+
"primary_direction": "C",
|
| 121 |
+
"direction_roles": {"C": "direct", "D": "proxy"},
|
| 122 |
+
"why": "Segments egocentric task state and provides a first proxy for symbolic world/task state.",
|
| 123 |
+
"current_limit": "Single-episode ordering makes future subtasks hard to generalize.",
|
| 124 |
+
},
|
| 125 |
+
),
|
| 126 |
+
(
|
| 127 |
+
"transition_detection",
|
| 128 |
+
{
|
| 129 |
+
"name": "Action transition detection",
|
| 130 |
+
"family": "diagnostic",
|
| 131 |
+
"input": "all featurized modalities",
|
| 132 |
+
"output": "boundary vs steady state",
|
| 133 |
+
"primary_direction": "C",
|
| 134 |
+
"direction_roles": {"C": "direct", "D": "diagnostic"},
|
| 135 |
+
"why": "Localizes egocentric task boundaries and diagnoses temporal state changes.",
|
| 136 |
+
"current_limit": "Boundary class is sparse, so accuracy alone is misleading.",
|
| 137 |
+
},
|
| 138 |
+
),
|
| 139 |
+
(
|
| 140 |
+
"next_action",
|
| 141 |
+
{
|
| 142 |
+
"name": "Short-horizon next action",
|
| 143 |
+
"family": "supervised",
|
| 144 |
+
"input": "current multimodal window",
|
| 145 |
+
"output": "action 20 frames later",
|
| 146 |
+
"primary_direction": "C",
|
| 147 |
+
"direction_roles": {"C": "direct", "D": "proxy"},
|
| 148 |
+
"why": "Tests action intention/task-flow prediction from egocentric context.",
|
| 149 |
+
"current_limit": "Unseen future labels dominate the single-episode chronological test.",
|
| 150 |
+
},
|
| 151 |
+
),
|
| 152 |
+
(
|
| 153 |
+
"hand_trajectory_forecast",
|
| 154 |
+
{
|
| 155 |
+
"name": "Hand trajectory forecasting",
|
| 156 |
+
"family": "forecast",
|
| 157 |
+
"input": "current multimodal window",
|
| 158 |
+
"output": "future left/right hand 3D joints",
|
| 159 |
+
"primary_direction": "A",
|
| 160 |
+
"direction_roles": {"A": "direct", "C": "proxy"},
|
| 161 |
+
"why": "Directly predicts human hand motion and supports hand-object interaction modeling.",
|
| 162 |
+
"current_limit": "Forecasting is window-level and not yet a full sequence or policy model.",
|
| 163 |
+
},
|
| 164 |
+
),
|
| 165 |
+
(
|
| 166 |
+
"contact_prediction",
|
| 167 |
+
{
|
| 168 |
+
"name": "Body/object contact prediction",
|
| 169 |
+
"family": "supervised",
|
| 170 |
+
"input": "non-contact/non-caption features",
|
| 171 |
+
"output": "binary contact label",
|
| 172 |
+
"primary_direction": "A",
|
| 173 |
+
"direction_roles": {"A": "direct", "C": "proxy"},
|
| 174 |
+
"why": "Targets physical interaction state, a core affordance and manipulation signal.",
|
| 175 |
+
"current_limit": "The public sample is degenerate for this target because one class dominates.",
|
| 176 |
+
},
|
| 177 |
+
),
|
| 178 |
+
(
|
| 179 |
+
"object_relevance",
|
| 180 |
+
{
|
| 181 |
+
"name": "Relevant object set prediction",
|
| 182 |
+
"family": "supervised",
|
| 183 |
+
"input": "non-caption feature blocks",
|
| 184 |
+
"output": "multi-label object set",
|
| 185 |
+
"primary_direction": "C",
|
| 186 |
+
"direction_roles": {"C": "direct", "A": "proxy", "D": "proxy"},
|
| 187 |
+
"why": "Connects egocentric activity to manipulated objects and early object-centric state.",
|
| 188 |
+
"current_limit": "Object labels are language-derived and sparse in one episode.",
|
| 189 |
+
},
|
| 190 |
+
),
|
| 191 |
+
(
|
| 192 |
+
"caption_grounding",
|
| 193 |
+
{
|
| 194 |
+
"name": "Caption-to-window grounding",
|
| 195 |
+
"family": "retrieval",
|
| 196 |
+
"input": "caption objects/interaction query and candidate sensor windows",
|
| 197 |
+
"output": "matching time window",
|
| 198 |
+
"primary_direction": "C",
|
| 199 |
+
"direction_roles": {"C": "direct", "D": "proxy"},
|
| 200 |
+
"why": "Grounds language annotation into egocentric sensor time and task state.",
|
| 201 |
+
"current_limit": "Bag-of-objects language features are too weak for rich grounding.",
|
| 202 |
+
},
|
| 203 |
+
),
|
| 204 |
+
(
|
| 205 |
+
"cross_modal_retrieval",
|
| 206 |
+
{
|
| 207 |
+
"name": "Cross-modal retrieval",
|
| 208 |
+
"family": "retrieval",
|
| 209 |
+
"input": "motion/IMU/camera query",
|
| 210 |
+
"output": "matching depth/video window",
|
| 211 |
+
"primary_direction": "C",
|
| 212 |
+
"direction_roles": {"C": "diagnostic", "B": "proxy", "D": "proxy"},
|
| 213 |
+
"why": "Tests whether synchronized modalities identify the same 4D moment, a prerequisite for reconstruction and world modeling.",
|
| 214 |
+
"current_limit": "Retrieval proves alignment signal, not geometric reconstruction.",
|
| 215 |
+
},
|
| 216 |
+
),
|
| 217 |
+
(
|
| 218 |
+
"modality_reconstruction",
|
| 219 |
+
{
|
| 220 |
+
"name": "Modality reconstruction",
|
| 221 |
+
"family": "forecast",
|
| 222 |
+
"input": "motion/IMU/camera",
|
| 223 |
+
"output": "depth/video feature vector",
|
| 224 |
+
"primary_direction": "B",
|
| 225 |
+
"direction_roles": {"B": "proxy", "D": "proxy"},
|
| 226 |
+
"why": "Predicts visual/depth state from non-target sensors as a weak reconstruction/world-model objective.",
|
| 227 |
+
"current_limit": "Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction.",
|
| 228 |
+
},
|
| 229 |
+
),
|
| 230 |
+
(
|
| 231 |
+
"temporal_order",
|
| 232 |
+
{
|
| 233 |
+
"name": "Temporal order verification",
|
| 234 |
+
"family": "diagnostic",
|
| 235 |
+
"input": "two adjacent windows",
|
| 236 |
+
"output": "correct vs reversed order",
|
| 237 |
+
"primary_direction": "C",
|
| 238 |
+
"direction_roles": {"C": "diagnostic", "D": "diagnostic"},
|
| 239 |
+
"why": "Checks whether features encode local time direction and task progression.",
|
| 240 |
+
"current_limit": "Only local adjacent ordering, not long-horizon causal modeling.",
|
| 241 |
+
},
|
| 242 |
+
),
|
| 243 |
+
(
|
| 244 |
+
"misalignment_detection",
|
| 245 |
+
{
|
| 246 |
+
"name": "Cross-modal misalignment detection",
|
| 247 |
+
"family": "diagnostic",
|
| 248 |
+
"input": "motion plus visual/depth pair",
|
| 249 |
+
"output": "aligned vs shifted",
|
| 250 |
+
"primary_direction": "C",
|
| 251 |
+
"direction_roles": {"C": "diagnostic", "B": "diagnostic", "D": "diagnostic"},
|
| 252 |
+
"why": "Detects temporal desynchronization, a key data-quality gate for multimodal reconstruction and world models.",
|
| 253 |
+
"current_limit": "Synthetic shifts diagnose alignment but do not solve calibration or mapping.",
|
| 254 |
+
},
|
| 255 |
+
),
|
| 256 |
+
]
|
| 257 |
+
)
|
| 258 |
+
|
| 259 |
+
|
| 260 |
+
METRIC_SPECS = {
|
| 261 |
+
"timeline_action": ("macro_f1", "macro-F1", "higher"),
|
| 262 |
+
"timeline_subtask": ("macro_f1", "macro-F1", "higher"),
|
| 263 |
+
"transition_detection": ("macro_f1", "macro-F1", "higher"),
|
| 264 |
+
"next_action": ("macro_f1", "macro-F1", "higher"),
|
| 265 |
+
"hand_trajectory_forecast": ("mpjpe", "MPJPE", "lower"),
|
| 266 |
+
"contact_prediction": ("macro_f1", "macro-F1", "higher"),
|
| 267 |
+
"object_relevance": ("micro_f1", "micro-F1", "higher"),
|
| 268 |
+
"caption_grounding": ("mrr", "MRR", "higher"),
|
| 269 |
+
"cross_modal_retrieval": ("mrr", "MRR", "higher"),
|
| 270 |
+
"modality_reconstruction": ("r2", "R2", "higher"),
|
| 271 |
+
"temporal_order": ("f1", "F1", "higher"),
|
| 272 |
+
"misalignment_detection": ("f1", "F1", "higher"),
|
| 273 |
+
}
|
| 274 |
+
|
| 275 |
+
|
| 276 |
+
def load_summary() -> dict[str, Any]:
|
| 277 |
+
return json.loads(SUMMARY_REPORT.read_text(encoding="utf-8"))
|
| 278 |
+
|
| 279 |
+
|
| 280 |
+
def metric_value(metrics: dict[str, Any] | None, task: str) -> float | None:
|
| 281 |
+
if not metrics:
|
| 282 |
+
return None
|
| 283 |
+
key = METRIC_SPECS[task][0]
|
| 284 |
+
value = metrics.get(key)
|
| 285 |
+
return float(value) if value is not None else None
|
| 286 |
+
|
| 287 |
+
|
| 288 |
+
def choose_better(task: str, minimal: float | None, neural: float | None) -> str:
|
| 289 |
+
if minimal is None or neural is None:
|
| 290 |
+
return "unavailable"
|
| 291 |
+
_, _, direction = METRIC_SPECS[task]
|
| 292 |
+
delta = neural - minimal
|
| 293 |
+
if abs(delta) < 1e-9:
|
| 294 |
+
return "tie"
|
| 295 |
+
if direction == "lower":
|
| 296 |
+
return "neural_mlp" if delta < 0 else "minimal"
|
| 297 |
+
return "neural_mlp" if delta > 0 else "minimal"
|
| 298 |
+
|
| 299 |
+
|
| 300 |
+
def fmt_metric(value: float | None) -> str:
|
| 301 |
+
if value is None:
|
| 302 |
+
return "n/a"
|
| 303 |
+
if abs(value) >= 10:
|
| 304 |
+
return f"{value:.3f}"
|
| 305 |
+
return f"{value:.4f}"
|
| 306 |
+
|
| 307 |
+
|
| 308 |
+
def baseline_readout(label: str) -> str:
|
| 309 |
+
if label == "tie":
|
| 310 |
+
return "Both baselines are tied"
|
| 311 |
+
if label == "minimal":
|
| 312 |
+
return "Minimal baseline is stronger"
|
| 313 |
+
if label == "neural_mlp":
|
| 314 |
+
return "Neural MLP is stronger"
|
| 315 |
+
return "Baseline comparison is unavailable"
|
| 316 |
+
|
| 317 |
+
|
| 318 |
+
def build_taxonomy(summary: dict[str, Any]) -> dict[str, Any]:
|
| 319 |
+
minimal_tasks = summary["tasks"]
|
| 320 |
+
neural_tasks = summary.get("neural_tasks", {})
|
| 321 |
+
|
| 322 |
+
task_records: OrderedDict[str, dict[str, Any]] = OrderedDict()
|
| 323 |
+
direction_counts = {
|
| 324 |
+
code: {"direct": 0, "proxy": 0, "diagnostic": 0, "total_links": 0}
|
| 325 |
+
for code in DIRECTIONS
|
| 326 |
+
}
|
| 327 |
+
|
| 328 |
+
for task, spec in TASK_TAXONOMY.items():
|
| 329 |
+
metric_key, metric_name, metric_direction = METRIC_SPECS[task]
|
| 330 |
+
minimal_metric = metric_value(minimal_tasks.get(task), task)
|
| 331 |
+
neural_metric = metric_value(neural_tasks.get(task), task)
|
| 332 |
+
better = choose_better(task, minimal_metric, neural_metric)
|
| 333 |
+
|
| 334 |
+
roles = spec["direction_roles"]
|
| 335 |
+
for direction_code, role in roles.items():
|
| 336 |
+
direction_counts[direction_code][role] += 1
|
| 337 |
+
direction_counts[direction_code]["total_links"] += 1
|
| 338 |
+
|
| 339 |
+
task_records[task] = {
|
| 340 |
+
**spec,
|
| 341 |
+
"metric": {
|
| 342 |
+
"key": metric_key,
|
| 343 |
+
"name": metric_name,
|
| 344 |
+
"direction": metric_direction,
|
| 345 |
+
"minimal": minimal_metric,
|
| 346 |
+
"neural_mlp": neural_metric,
|
| 347 |
+
"better_baseline": better,
|
| 348 |
+
},
|
| 349 |
+
}
|
| 350 |
+
|
| 351 |
+
direction_records = OrderedDict()
|
| 352 |
+
for code, info in DIRECTIONS.items():
|
| 353 |
+
linked_tasks = [
|
| 354 |
+
task
|
| 355 |
+
for task, spec in task_records.items()
|
| 356 |
+
if code in spec["direction_roles"]
|
| 357 |
+
]
|
| 358 |
+
direction_records[code] = {
|
| 359 |
+
**info,
|
| 360 |
+
"tasks": linked_tasks,
|
| 361 |
+
"counts": direction_counts[code],
|
| 362 |
+
}
|
| 363 |
+
|
| 364 |
+
return {
|
| 365 |
+
"source": "results/episode_task_suite/summary_report.json",
|
| 366 |
+
"dataset_scope": {
|
| 367 |
+
"sample_episode_count": 1,
|
| 368 |
+
"num_frames": summary.get("num_frames"),
|
| 369 |
+
"num_windows": summary.get("num_windows"),
|
| 370 |
+
"feature_dim": summary.get("feature_dim"),
|
| 371 |
+
"warning": "Single public sample episode; this supports pipeline/task evidence, not cross-episode generalization claims.",
|
| 372 |
+
},
|
| 373 |
+
"baselines": {
|
| 374 |
+
"minimal": "Interpretable softmax, logistic, ridge, and retrieval heads over the 8,378-d window feature vector.",
|
| 375 |
+
"neural_mlp": "Small PyTorch MLP classifiers/regressors using the same features, splits, and task contracts.",
|
| 376 |
+
},
|
| 377 |
+
"directions": direction_records,
|
| 378 |
+
"tasks": task_records,
|
| 379 |
+
}
|
| 380 |
+
|
| 381 |
+
|
| 382 |
+
def write_csv(taxonomy: dict[str, Any]) -> None:
|
| 383 |
+
path = OUT_DIR / "research_direction_task_map.csv"
|
| 384 |
+
with path.open("w", newline="", encoding="utf-8") as handle:
|
| 385 |
+
writer = csv.writer(handle, lineterminator="\n")
|
| 386 |
+
writer.writerow(
|
| 387 |
+
[
|
| 388 |
+
"direction",
|
| 389 |
+
"direction_name",
|
| 390 |
+
"task",
|
| 391 |
+
"task_name",
|
| 392 |
+
"family",
|
| 393 |
+
"relationship",
|
| 394 |
+
"primary_direction",
|
| 395 |
+
"metric_name",
|
| 396 |
+
"minimal_metric",
|
| 397 |
+
"neural_mlp_metric",
|
| 398 |
+
"better_baseline",
|
| 399 |
+
"why",
|
| 400 |
+
"current_limit",
|
| 401 |
+
]
|
| 402 |
+
)
|
| 403 |
+
for task, spec in taxonomy["tasks"].items():
|
| 404 |
+
metric = spec["metric"]
|
| 405 |
+
for direction_code, relationship in spec["direction_roles"].items():
|
| 406 |
+
writer.writerow(
|
| 407 |
+
[
|
| 408 |
+
direction_code,
|
| 409 |
+
taxonomy["directions"][direction_code]["name"],
|
| 410 |
+
task,
|
| 411 |
+
spec["name"],
|
| 412 |
+
spec["family"],
|
| 413 |
+
relationship,
|
| 414 |
+
spec["primary_direction"],
|
| 415 |
+
metric["name"],
|
| 416 |
+
"" if metric["minimal"] is None else f"{metric['minimal']:.12g}",
|
| 417 |
+
"" if metric["neural_mlp"] is None else f"{metric['neural_mlp']:.12g}",
|
| 418 |
+
metric["better_baseline"],
|
| 419 |
+
spec["why"],
|
| 420 |
+
spec["current_limit"],
|
| 421 |
+
]
|
| 422 |
+
)
|
| 423 |
+
|
| 424 |
+
|
| 425 |
+
def write_markdown(taxonomy: dict[str, Any]) -> None:
|
| 426 |
+
lines = [
|
| 427 |
+
"# Four-Direction Task Taxonomy",
|
| 428 |
+
"",
|
+        "This file is generated by `scripts/research_direction_taxonomy.py` from the committed 12-task metrics.",
+        "It maps the current Xperience-10M sample tasks to the four Ropedia research directions without claiming that a single episode solves any full direction.",
+        "",
+        "## Baseline Families",
+        "",
+        "| Baseline | Meaning |",
+        "| --- | --- |",
+        f"| Minimal | {taxonomy['baselines']['minimal']} |",
+        f"| Neural MLP | {taxonomy['baselines']['neural_mlp']} |",
+        "",
+        "## Direction Coverage",
+        "",
+        "| Direction | Current status | Direct | Proxy | Diagnostic | Current readout |",
+        "| --- | --- | ---: | ---: | ---: | --- |",
+    ]
+    for code, info in taxonomy["directions"].items():
+        counts = info["counts"]
+        lines.append(
+            f"| {code}. {info['name']} | {info['current_status']} | {counts['direct']} | {counts['proxy']} | {counts['diagnostic']} | {info['current_readout']} |"
+        )
+
+    lines.extend(
+        [
+            "",
+            "## Task Mapping With Two Baselines",
+            "",
+            "| Task | Primary direction | Related directions | Minimal | Neural MLP | Readout |",
+            "| --- | --- | --- | ---: | ---: | --- |",
+        ]
+    )
+    for task, spec in taxonomy["tasks"].items():
+        metric = spec["metric"]
+        related = ", ".join(
+            f"{code}:{role}" for code, role in spec["direction_roles"].items()
+        )
+        minimal = f"{fmt_metric(metric['minimal'])} {metric['name']}"
+        neural = f"{fmt_metric(metric['neural_mlp'])} {metric['name']}"
+        readout = f"{baseline_readout(metric['better_baseline'])}. {spec['current_limit']}"
+        lines.append(
+            f"| `{task}` | {spec['primary_direction']} | {related} | {minimal} | {neural} | {readout} |"
+        )
+
+    lines.extend(["", "## Next-Step Interpretation", ""])
+    for code, info in taxonomy["directions"].items():
+        lines.append(f"### {code}. {info['name']}")
+        lines.append("")
+        lines.append(info["current_readout"])
+        lines.append("")
+        for step in info["next_steps"]:
+            lines.append(f"- {step}")
+        lines.append("")
+
+    (OUT_DIR / "research_direction_summary.md").write_text(
+        "\n".join(lines).rstrip() + "\n", encoding="utf-8"
+    )
+
+
+def svg_text(x: int, y: int, text: str, size: int = 16, weight: int = 500, color: str = "#16213a") -> str:
+    return (
+        f'<text x="{x}" y="{y}" font-size="{size}" font-weight="{weight}" '
+        f'fill="{color}">{html.escape(text)}</text>'
+    )
+
+
+def write_svg(taxonomy: dict[str, Any]) -> None:
+    width = 1180
+    height = 700
+    margin = 58
+    card_w = 515
+    card_h = 220
+    colors = {"direct": "#1f6c9f", "proxy": "#2e7775", "diagnostic": "#956400"}
+    cards = []
+
+    for idx, (code, info) in enumerate(taxonomy["directions"].items()):
+        row = idx // 2
+        col = idx % 2
+        x = margin + col * (card_w + 34)
+        y = 130 + row * (card_h + 34)
+        counts = info["counts"]
+        total = max(1, counts["direct"] + counts["proxy"] + counts["diagnostic"])
+        bar_x = x + 24
+        bar_y = y + 132
+        bar_w = card_w - 48
+        cursor = bar_x
+        segments = []
+        for key in ("direct", "proxy", "diagnostic"):
+            seg_w = round(bar_w * counts[key] / total)
+            if counts[key] > 0:
+                segments.append(
+                    f'<rect x="{cursor}" y="{bar_y}" width="{seg_w}" height="16" rx="8" fill="{colors[key]}"/>'
+                )
+            cursor += seg_w
+
+        task_labels = ", ".join(info["tasks"][:5])
+        if len(info["tasks"]) > 5:
+            task_labels += f", +{len(info['tasks']) - 5}"
+
+        cards.append(
+            "\n".join(
+                [
+                    f'<rect x="{x}" y="{y}" width="{card_w}" height="{card_h}" rx="8" fill="#ffffff" stroke="#d9e1ea"/>',
+                    svg_text(x + 24, y + 42, f"{code}. {info['name']}", 21, 700),
+                    svg_text(x + 24, y + 75, info["current_status"], 15, 700, "#566273"),
+                    svg_text(x + 24, y + 108, f"Tasks: {task_labels}", 14, 500, "#30394a"),
+                    *segments,
+                    svg_text(x + 24, y + 174, f"Direct {counts['direct']}", 14, 700, colors["direct"]),
+                    svg_text(x + 150, y + 174, f"Proxy {counts['proxy']}", 14, 700, colors["proxy"]),
+                    svg_text(x + 270, y + 174, f"Diagnostic {counts['diagnostic']}", 14, 700, colors["diagnostic"]),
+                ]
+            )
+        )
+
+    legend = []
+    lx = margin
+    for key, label in (
+        ("direct", "Direct task"),
+        ("proxy", "Proxy / prerequisite"),
+        ("diagnostic", "Diagnostic probe"),
+    ):
+        legend.extend(
+            [
+                f'<rect x="{lx}" y="622" width="16" height="16" rx="4" fill="{colors[key]}"/>',
+                svg_text(lx + 24, 636, label, 14, 600, "#30394a"),
+            ]
+        )
+        lx += 200
+
+    svg = f"""<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}" viewBox="0 0 {width} {height}" role="img" aria-label="Xperience-10M task coverage across four research directions">
+<rect width="100%" height="100%" fill="#f7f9fb"/>
+{svg_text(margin, 64, "Xperience-10M 12-Task Suite: Four Research Directions", 30, 800)}
+{svg_text(margin, 96, "One public sample episode, two baseline families, explicit direct/proxy/diagnostic coverage.", 16, 500, "#566273")}
+{"".join(cards)}
+{"".join(legend)}
+{svg_text(margin, 670, "Generated from results/episode_task_suite/summary_report.json and scripts/research_direction_taxonomy.py", 13, 500, "#6d7787")}
+</svg>
+"""
+    (CHARTS / "research_direction_coverage.svg").write_text(svg, encoding="utf-8")
+
+
+def main() -> None:
+    OUT_DIR.mkdir(parents=True, exist_ok=True)
+    DOCS_DATA.mkdir(parents=True, exist_ok=True)
+    CHARTS.mkdir(parents=True, exist_ok=True)
+
+    taxonomy = build_taxonomy(load_summary())
+    json_text = json.dumps(taxonomy, indent=2, ensure_ascii=False)
+    (OUT_DIR / "research_direction_taxonomy.json").write_text(json_text + "\n", encoding="utf-8")
+    (DOCS_DATA / "research_directions.json").write_text(json_text + "\n", encoding="utf-8")
+    write_csv(taxonomy)
+    write_markdown(taxonomy)
+    write_svg(taxonomy)
+
+    print(f"Wrote {OUT_DIR / 'research_direction_taxonomy.json'}")
+    print(f"Wrote {OUT_DIR / 'research_direction_task_map.csv'}")
+    print(f"Wrote {OUT_DIR / 'research_direction_summary.md'}")
+    print(f"Wrote {DOCS_DATA / 'research_directions.json'}")
+    print(f"Wrote {CHARTS / 'research_direction_coverage.svg'}")
+
+
+if __name__ == "__main__":
+    main()
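A note on the stacked bar in `write_svg`: each segment width is computed with its own `round(...)`, so the widths are not guaranteed to sum to `bar_w`. With `bar_w = 467` (`card_w - 48`) and three equal counts, each segment rounds to 156 and the bar runs one pixel past its intended width. The sketch below shows one possible alternative, largest-remainder allocation. `segment_widths` is a hypothetical helper that is not part of the committed script, and whether a one-pixel drift matters for this chart is a judgment call.

```python
def segment_widths(counts: dict[str, int], bar_w: int) -> dict[str, int]:
    """Split bar_w pixels across keys in proportion to counts.

    Hypothetical helper (not in the committed script). The returned widths
    always sum to bar_w, unless every count is zero.
    """
    total = sum(counts.values())
    if total == 0:
        return {key: 0 for key in counts}

    # Integer floor of each proportional share, and what was cut off by flooring.
    floors = {key: bar_w * n // total for key, n in counts.items()}
    remainders = {key: bar_w * n % total for key, n in counts.items()}

    # Pixels still unassigned after flooring; always fewer than the number of keys.
    leftover = bar_w - sum(floors.values())

    # Give one extra pixel to the keys with the largest remainders.
    # sorted() is stable, so ties keep the input key order.
    ranked = sorted(counts, key=lambda k: remainders[k], reverse=True)
    for key in ranked[:leftover]:
        floors[key] += 1
    return floors


if __name__ == "__main__":
    widths = segment_widths({"direct": 1, "proxy": 1, "diagnostic": 1}, 467)
    print(widths, sum(widths.values()))
```

Inside the `for key in (...)` loop, `seg_w` could then be read from the returned mapping instead of being rounded per key. Keys with a zero count always receive a width of zero, so the existing `if counts[key] > 0` guard would keep working unchanged.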