# Research Takeaways

This generated note summarizes what the current public Xperience-10M sample pipeline actually shows. It is built from committed metric artifacts, not from hand-edited score text.

## Scope

- validated episodes: 1
- frames: 5,821
- aligned windows: 1,161
- current feature dimension: 8,378
- raw Xperience-10M data is not redistributed
- audio is documented and visualized, but not yet featurized

## Takeaways

### One episode can become a real benchmark contract

The public sample is converted into 5,821 frames, 1,161 aligned 20-frame windows, and an 8,378-dimensional feature contract.

| Metric | Value |
| --- | ---: |
| `frames` | 5,821 |
| `windows` | 1,161 |
| `feature_dim` | 8,378 |

Source: `docs/data/summary_metrics.json`.

Current scope: This benchmark defines the task contract; cross-episode generalization is evaluated in the multi-episode stage.

### Chronological splits expose action-class shift

Earlier all-feature action classifiers reach high macro-F1 on their local split, but the 12-task chronological action/subtask heads are much harder because later held-out windows include unseen labels.

| Metric | Value |
| --- | ---: |
| `all_feature_action_macro_f1` | 0.9791 |
| `suite_action_macro_f1` | 0.0500 |
| `suite_subtask_macro_f1` | 0.0495 |
| `unseen_action_test_classes` | 4 |

Source: `results/episode_task_suite/summary_report.json`.

Current scope: This split is useful for studying label shift; broad action-recognition conclusions need held-out episodes.

### Small neural heads help dynamic and temporal probes

The MLP heads substantially improve hand trajectory forecasting, temporal-order verification, and motion/visual synchronization.

| Metric | Value |
| --- | ---: |
| `hand_mpjpe_minimal` | 0.8223 |
| `hand_mpjpe_neural` | 0.1116 |
| `hand_mpjpe_relative_improvement` | 0.8642 |
| `temporal_order_f1_minimal` | 0.5487 |
| `temporal_order_f1_neural` | 0.8718 |
| `misalignment_f1_minimal` | 0.4866 |
| `misalignment_f1_neural` | 0.7335 |

Source: `results/episode_task_suite/neural_mlp/*/metrics.json`.

Current scope: These gains are measured within one episode and are candidates for held-out-episode testing.

### Retrieval and reconstruction remain the harder multimodal problems

Ridge/cosine retrieval remains stronger than the neural projection on this sample, and cross-modal reconstruction still has negative R².

| Metric | Value |
| --- | ---: |
| `retrieval_mrr_minimal` | 0.2634 |
| `retrieval_mrr_neural` | 0.1530 |
| `retrieval_top5_minimal` | 0.3764 |
| `reconstruction_r2_minimal` | -0.0160 |
| `reconstruction_r2_neural` | -0.0102 |

Source: `results/episode_task_suite/cross_modal_retrieval/metrics.json`.

Current scope: The current reconstruction task predicts feature vectors; depth, mesh, NeRF, and Gaussian-splatting outputs are future task variants.

### The next scientific unit is held-out episodes, not more adjacent windows

The prepared Qwen3-Omni path targets 32 episodes from 32 sessions, but it remains data-gated until access and held-out evaluation complete.

| Metric | Value |
| --- | ---: |
| `target_episodes` | 32 |
| `selected_sessions` | 32 |
| `valid_candidates` | 680 |

Source: `results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md`.

Current scope: The 32-episode Qwen3-Omni fine-tune requires gated data staging and held-out evaluation.

## How To Read These Results

- High single-episode scores are useful pipeline checks for the current task contracts.
- Low chronological action/subtask scores are informative because they expose later-label shift.
- Neural gains on trajectory/order/alignment make those tasks good candidates for the next fine-tuning stage.
- Retrieval and reconstruction remain the main multimodal representation challenges.
- The next credible model-quality result needs held-out episodes.
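
Two of the reported figures can be sanity-checked with simple arithmetic. The sketch below is illustrative only and is not part of the pipeline. It assumes the 20-frame windows are plain sliding windows with a fixed integer stride and no padding (the note does not state the stride), and that `hand_mpjpe_relative_improvement` is the fractional error reduction `(minimal - neural) / minimal`. The helper names are hypothetical.

```python
def window_count(frames: int, window: int, stride: int) -> int:
    """Number of full sliding windows, assuming no padding (an assumption)."""
    if frames < window:
        return 0
    return (frames - window) // stride + 1


def consistent_strides(frames: int, window: int, windows: int) -> list[int]:
    """Integer strides (1..window) that reproduce the reported window count."""
    return [
        s for s in range(1, window + 1)
        if window_count(frames, window, s) == windows
    ]


def relative_improvement(baseline: float, improved: float) -> float:
    """Fractional error reduction for a lower-is-better metric such as MPJPE."""
    return (baseline - improved) / baseline


# Values copied from the Scope list and the hand-trajectory table.
strides = consistent_strides(frames=5821, window=20, windows=1161)
mpjpe_gain = relative_improvement(baseline=0.8223, improved=0.1116)

print("strides consistent with 1,161 windows:", strides)
print("relative MPJPE improvement from rounded inputs:", mpjpe_gain)
```

Under these assumptions, a stride of 5 frames is the only integer stride that yields 1,161 windows from 5,821 frames. The recomputed improvement agrees with the reported 0.8642 to within the rounding of the two table inputs.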