# Unified 20-Task Suite

The public Xperience-10M sample task surface is one unified set of 20 tasks. All task contracts are presented together under the same window, split, feature, baseline, and leakage-control contract.

Historical artifact paths containing `tier2_task_suite` are kept for stable links, but they should be read as provenance directories inside the unified task suite, not as a separate benchmark tier.

## Shared Setup

- Episode scope: `1` public sample episode.
- Frames/windows: `5,821` frames and `1,161` aligned windows.
- Windowing: `20` frames per window, stride `5` frames.
- Feature vector: `8,546` dimensions from the shared feature manifest.
- Split: chronological 70/30 train/test by time within the sample episode.
- Baselines: minimal interpretable heads and compact neural MLP heads.
- Raw data: MP4/HDF5/RRD files are not redistributed.

## Task Table

| # | Task | Artifact id | Input -> output | Primary metric | Minimal baseline | Neural baseline |
| ---: | --- | --- | --- | --- | ---: | ---: |
| 1 | Action Recognition | `timeline_action` | 20-frame multimodal window -> current action class | macro-F1 (higher better) | 0.0500 | 0.0148 |
| 2 | Procedure Step Recognition | `timeline_subtask` | 20-frame multimodal window -> current procedure step | macro-F1 (higher better) | 0.0506 | 0.0281 |
| 3 | Action Boundary Detection | `transition_detection` | current window with boundary target -> boundary or steady | macro-F1 (higher better) | 0.6118 | 0.5862 |
| 4 | Next-Action Prediction | `next_action` | current window at time t -> action at t+20 frames | macro-F1 (higher better) | 0.0593 | 0.0419 |
| 5 | Hand Trajectory Forecasting | `hand_trajectory_forecast` | current multimodal window -> future hand-joint trajectory | MPJPE (lower better) | 0.8647 | 0.1079 |
| 6 | Contact State Prediction | `contact_prediction` | non-contact, non-caption features -> contact or no contact | macro-F1 (higher better) | 1.0000 | 1.0000 |
| 7 | Object Relevance Prediction | `object_relevance` | non-caption multimodal features -> relevant object set | micro-F1 (higher better) | 0.1803 | 0.1679 |
| 8 | Language Grounding | `caption_grounding` | text-like query and candidate windows -> ranked matching moments | MRR (higher better) | 0.0160 | 0.0168 |
| 9 | Cross-Modal Retrieval | `cross_modal_retrieval` | motion/IMU/pose query; depth/video candidates -> ranked visual windows | MRR (higher better) | 0.2693 | 0.1300 |
| 10 | Cross-Modal Reconstruction | `modality_reconstruction` | motion, IMU, and camera/pose features -> reconstructed depth/video vector | R2 (higher better) | -0.0153 | -0.0102 |
| 11 | Temporal Order Verification | `temporal_order` | two adjacent windows plus difference vector -> correct or reversed | F1 (higher better) | 0.5400 | 0.8520 |
| 12 | Multimodal Synchronization Detection | `misalignment_detection` | motion-side and visual/depth-side feature groups -> aligned or shifted | F1 (higher better) | 0.5052 | 0.7153 |
| 13 | Long-Horizon Next-Action Forecasting | `long_horizon_next_action` | current 20-frame non-caption multimodal window -> action label five seconds later | macro-F1 (higher better) | 0.0750 | 0.0655 |
| 14 | Long-Horizon Next-Subtask Forecasting | `next_subtask_forecast` | current 20-frame non-caption multimodal window -> procedure subtask label five seconds later | macro-F1 (higher better) | 0.0455 | 0.0507 |
| 15 | Interaction Text Prediction | `interaction_text_prediction` | current 20-frame sensor window with caption-text features removed -> raw annotation interaction phrase for the same window | macro-F1 (higher better) | 0.0444 | 0.0381 |
| 16 | Action-Object Relation Prediction | `action_object_relation` | current 20-frame sensor window with caption-text features removed -> joint action plus active object-set relation | macro-F1 (higher better) | 0.0000 | 0.0000 |
| 17 | Future Object-Set Forecasting | `object_set_forecast` | current 20-frame sensor window with caption-text features removed -> object set active five seconds later | micro-F1 (higher better) | 0.1694 | 0.1972 |
| 18 | IMU-to-Hand Pose Reconstruction | `imu_to_hand_pose` | current IMU acceleration/gyroscope feature block only -> current left/right hand joint feature blocks | MAE (lower better) | 0.0420 | 0.0426 |
| 19 | Camera-View Synchronization Retrieval | `camera_view_sync_retrieval` | fisheye camera-1 feature query projected into fisheye camera-3 feature space -> the synchronized held-out camera-3 window | MRR (higher better) | 0.4943 | 0.2409 |
| 20 | Time-to-Next-Transition Regression | `time_to_transition` | current 20-frame non-caption multimodal window -> frames until the next action-label boundary, capped at 200 frames | MAE frames (lower better) | 10.5374 | 10.5545 |

## Machine-Readable Copy

The JSON mirror is `docs/data/task_suite_20.json`.
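
As a sanity check on the Shared Setup figures, the window count can be derived from the frame count, window length, and stride: `(5821 - 20) // 5 + 1 = 1161`. The Python sketch below reproduces that arithmetic and shows one plausible way to cut a chronological 70/30 split. The function names `window_starts` and `chronological_split` are hypothetical, and the zero-based start indices and the floor rounding at the 70% boundary are assumptions; the suite does not state either.

```python
def window_starts(num_frames: int, window: int = 20, stride: int = 5) -> list[int]:
    """Return the start frame of every full window that fits in num_frames.

    Assumes zero-based frame indices and no padded partial windows.
    """
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, stride))


def chronological_split(items: list[int], train_fraction: float = 0.7):
    """Split a time-ordered list into an earlier train part and a later test part.

    Assumption: the boundary index is floored; the suite may round differently.
    """
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


starts = window_starts(5821)
train, test = chronological_split(starts)

print(len(starts))            # 1161 windows, matching Shared Setup
print(len(train), len(test))  # 812 349 under the floor-rounding assumption
```

Because the stride (5) is smaller than the window length (20), neighbouring windows share 15 frames, so the last train windows and the first test windows overlap in raw frames under this naive cut. The sketch does not remove those windows; how the suite treats them is governed by its leakage-control contract and is not reproduced here.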