# Unified 20-Task Suite

The public Xperience-10M sample exposes a single unified suite of 20 tasks.
All 20 task contracts share the same window, split, feature, baseline, and
leakage-control setup.

Historical artifact paths containing `tier2_task_suite` are kept for stable
links, but they should be read as provenance directories inside the unified task suite, not
as a separate benchmark tier.

## Shared Setup

- Episode scope: `1` public sample episode.
- Frames/windows: `5,821` frames and `1,161` aligned windows.
- Windowing: `20` frames per window, stride `5` frames.
- Feature vector: `8,546` dimensions from the shared feature manifest.
- Split: chronological 70/30 train/test within the sample episode.
- Baselines: a minimal interpretable head and a compact MLP head per task (the `Minimal` and `Neural` columns in the task table).
- Raw data: MP4/HDF5/RRD files are not redistributed.
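
The window count above follows directly from the frame count, window length, and stride. The sketch below reproduces that arithmetic and shows one way a chronological 70/30 split could be taken over window indices. The function names `window_starts` and `chronological_split` are hypothetical, and splitting on window index (rather than on timestamps, or with a gap between train and test) is an assumption; the suite's actual split code may differ.

```python
def window_starts(n_frames, window=20, stride=5):
    """Return the start frame of every full window that fits in n_frames."""
    if n_frames < window:
        return []
    return list(range(0, n_frames - window + 1, stride))


def chronological_split(n_windows, train_frac=0.7):
    """Split window indices in time order: earliest windows train, latest test.

    Assumption: the split is taken over window indices with no gap.
    """
    n_train = int(n_windows * train_frac)
    train_idx = list(range(0, n_train))
    test_idx = list(range(n_train, n_windows))
    return train_idx, test_idx


starts = window_starts(5821, window=20, stride=5)
print(len(starts))  # 1161, matching the aligned window count listed above

train_idx, test_idx = chronological_split(len(starts))
# Every training window precedes every test window in time.
print(train_idx[-1] < test_idx[0])  # True
```

Because the stride (5) is smaller than the window length (20), neighbouring windows share frames, so the last training windows and the first test windows overlap in this simple sketch.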

## Task Table

| # | Task | Artifact id | Input -> output | Primary metric | Minimal | Neural |
| ---: | --- | --- | --- | --- | ---: | ---: |
| 1 | Action Recognition | `timeline_action` | 20-frame multimodal window -> current action class | macro-F1 (higher better) | 0.0500 | 0.0148 |
| 2 | Procedure Step Recognition | `timeline_subtask` | 20-frame multimodal window -> current procedure step | macro-F1 (higher better) | 0.0506 | 0.0281 |
| 3 | Action Boundary Detection | `transition_detection` | current window with boundary target -> boundary or steady | macro-F1 (higher better) | 0.6118 | 0.5862 |
| 4 | Next-Action Prediction | `next_action` | current window at time t -> action at t+20 frames | macro-F1 (higher better) | 0.0593 | 0.0419 |
| 5 | Hand Trajectory Forecasting | `hand_trajectory_forecast` | current multimodal window -> future hand-joint trajectory | MPJPE (lower better) | 0.8647 | 0.1079 |
| 6 | Contact State Prediction | `contact_prediction` | non-contact, non-caption features -> contact or no contact | macro-F1 (higher better) | 1.0000 | 1.0000 |
| 7 | Object Relevance Prediction | `object_relevance` | non-caption multimodal features -> relevant object set | micro-F1 (higher better) | 0.1803 | 0.1679 |
| 8 | Language Grounding | `caption_grounding` | text-like query and candidate windows -> ranked matching moments | MRR (higher better) | 0.0160 | 0.0168 |
| 9 | Cross-Modal Retrieval | `cross_modal_retrieval` | motion/IMU/pose query; depth/video candidates -> ranked visual windows | MRR (higher better) | 0.2693 | 0.1300 |
| 10 | Cross-Modal Reconstruction | `modality_reconstruction` | motion, IMU, and camera/pose features -> reconstructed depth/video vector | R² (higher better) | -0.0153 | -0.0102 |
| 11 | Temporal Order Verification | `temporal_order` | two adjacent windows plus difference vector -> correct or reversed | F1 (higher better) | 0.5400 | 0.8520 |
| 12 | Multimodal Synchronization Detection | `misalignment_detection` | motion-side and visual/depth-side feature groups -> aligned or shifted | F1 (higher better) | 0.5052 | 0.7153 |
| 13 | Long-Horizon Next-Action Forecasting | `long_horizon_next_action` | current 20-frame non-caption multimodal window -> action label five seconds later | macro-F1 (higher better) | 0.0750 | 0.0655 |
| 14 | Long-Horizon Next-Subtask Forecasting | `next_subtask_forecast` | current 20-frame non-caption multimodal window -> procedure subtask label five seconds later | macro-F1 (higher better) | 0.0455 | 0.0507 |
| 15 | Interaction Text Prediction | `interaction_text_prediction` | current 20-frame sensor window with caption-text features removed -> raw annotation interaction phrase for the same window | macro-F1 (higher better) | 0.0444 | 0.0381 |
| 16 | Action-Object Relation Prediction | `action_object_relation` | current 20-frame sensor window with caption-text features removed -> joint action plus active object-set relation | macro-F1 (higher better) | 0.0000 | 0.0000 |
| 17 | Future Object-Set Forecasting | `object_set_forecast` | current 20-frame sensor window with caption-text features removed -> object set active five seconds later | micro-F1 (higher better) | 0.1694 | 0.1972 |
| 18 | IMU-to-Hand Pose Reconstruction | `imu_to_hand_pose` | current IMU acceleration/gyroscope feature block only -> current left/right hand joint feature blocks | MAE (lower better) | 0.0420 | 0.0426 |
| 19 | Camera-View Synchronization Retrieval | `camera_view_sync_retrieval` | fisheye camera-1 feature query projected into fisheye camera-3 feature space -> synchronized held-out camera-3 window | MRR (higher better) | 0.4943 | 0.2409 |
| 20 | Time-to-Next-Transition Regression | `time_to_transition` | current 20-frame non-caption multimodal window -> frames until the next action-label boundary, capped at 200 frames | MAE frames (lower better) | 10.5374 | 10.5545 |
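
For readers less familiar with the metrics in the table, the following is a minimal sketch of the textbook definitions of macro-F1 and MRR. The function names are hypothetical, and details such as which labels enter the macro average (here: every label seen in either the targets or the predictions) are assumptions; the suite's own evaluation code may use different conventions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances scores 0 by convention here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct candidate for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy inputs for illustration only; these are not suite data.
print(macro_f1(["a", "a", "b"], ["a", "b", "b"]))
print(mean_reciprocal_rank([1, 2, 4]))
```

Macro-F1 weights every class equally, so rare classes that are never predicted correctly pull the score down as much as frequent ones; micro-F1, used for the object-set tasks, instead pools counts across all labels before computing F1.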

## Machine-Readable Copy

The JSON mirror is `docs/data/task_suite_20.json`.