Research Takeaways

This generated note summarizes what the current public Xperience-10M sample pipeline actually shows. It is built from committed metric artifacts, not from hand-edited score text.

Scope

validated episodes: 1
frames: 5,821
aligned windows: 1,161
current feature dimension: 8,378
raw Xperience-10M data is not redistributed
audio is documented and visualized, but not yet featurized

Takeaways

One episode can become a real benchmark contract

The public sample is converted into 5,821 frames, 1,161 aligned 20-frame windows, and an 8,378-dimensional feature contract.

Metric	Value
`frames`	5,821
`windows`	1,161
`feature_dim`	8,378

Source: docs/data/summary_metrics.json.

Current scope: This benchmark defines the task contract; cross-episode generalization is evaluated in the multi-episode stage.
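The relationship between the frame count and the window count can be checked with simple sliding-window arithmetic. The sketch below is illustrative only: the helper name `count_windows` is hypothetical, and the stride of 5 frames is an assumption that this note does not state. It is, however, the only integer stride that yields 1,161 full 20-frame windows from 5,821 frames when no padding is used.

```python
def count_windows(num_frames: int, window: int, stride: int) -> int:
    """Count the full sliding windows that fit in a sequence of num_frames."""
    if num_frames < window:
        return 0
    return (num_frames - window) // stride + 1


# Assumption: stride=5 is not stated in this note; it is inferred because it
# reproduces the reported window count for 20-frame windows without padding.
n_windows = count_windows(num_frames=5821, window=20, stride=5)
print(n_windows)  # 1161
```

If the pipeline pads or drops trailing frames, the same count could arise differently, so treat the stride as a consistency check rather than a documented parameter.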

Chronological splits expose action-class shift

Earlier all-feature action classifiers reach high macro-F1 on their local split, but the action and subtask heads in the 12-task chronological suite score far lower, because the later held-out windows include labels that never appear in training (4 unseen action classes in the test split).

Metric	Value
`all_feature_action_macro_f1`	0.9791
`suite_action_macro_f1`	0.0500
`suite_subtask_macro_f1`	0.0495
`unseen_action_test_classes`	4

Source: results/episode_task_suite/summary_report.json.

Current scope: This split is useful for studying label shift; broad action-recognition conclusions need held-out episodes.
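To see why unseen test labels depress macro-F1 so sharply, consider a toy example. This is a minimal sketch, not the suite's evaluation code: the label names and predictions are made up, and the only assumption is that a classifier trained on the early windows can emit only labels it has seen. Every unseen class then contributes an F1 of zero to the unweighted class average.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over the given labels."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true positives scores 0, including unseen classes.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: "reach" and "grasp" occur in training,
# "pour" and "wipe" only appear in the later held-out windows.
y_true = ["reach", "grasp", "pour", "pour", "wipe", "wipe"]
y_pred = ["reach", "grasp", "grasp", "reach", "grasp", "reach"]

score = macro_f1(y_true, y_pred, labels=sorted(set(y_true)))
print(score)  # 0.25
```

Here the classifier is perfect on the two seen classes, yet macro-F1 is only 0.25: the unseen classes score zero and their windows also count as false positives against the seen classes.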

Small neural heads help dynamic and temporal probes

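Relative to the minimal baselines, the MLP heads cut hand-trajectory forecasting MPJPE from 0.8223 to 0.1116, raise temporal-order verification F1 from 0.5487 to 0.8718, and raise misalignment F1 on the motion/visual synchronization task from 0.4866 to 0.7335.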

Metric	Value
`hand_mpjpe_minimal`	0.8223
`hand_mpjpe_neural`	0.1116
`hand_mpjpe_relative_improvement`	0.8642
`temporal_order_f1_minimal`	0.5487
`temporal_order_f1_neural`	0.8718
`misalignment_f1_minimal`	0.4866
`misalignment_f1_neural`	0.7335

Source: results/episode_task_suite/neural_mlp/*/metrics.json.

Current scope: These gains are measured within one episode and are candidates for held-out-episode testing.
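The hand-trajectory numbers can be sanity-checked directly. The sketch below assumes the standard definition of MPJPE (mean Euclidean distance between predicted and target joints) and assumes that the relative improvement is computed as `(baseline - neural) / baseline`; neither formula is spelled out in this note, and the function names are hypothetical. Under those assumptions, the rounded table values reproduce the reported improvement to about three decimal places.

```python
import math


def mpjpe(pred_joints, target_joints):
    """Mean per-joint position error: average Euclidean distance per joint."""
    dists = [math.dist(p, t) for p, t in zip(pred_joints, target_joints)]
    return sum(dists) / len(dists)


def relative_improvement(baseline, improved):
    """Fractional error reduction for a lower-is-better metric."""
    return (baseline - improved) / baseline


# Toy joints (made up): one joint is off by 1 unit, the other is exact.
toy_error = mpjpe([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 0)])
print(toy_error)  # 0.5

# Rounded values from the table above; the result is close to 0.8642.
gain = relative_improvement(baseline=0.8223, improved=0.1116)
print(round(gain, 3))  # 0.864
```

The small difference in the fourth decimal place is consistent with the reported figure having been computed from unrounded metrics.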

Retrieval and reconstruction remain the harder multimodal problems

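Ridge/cosine retrieval remains stronger than the neural projection on this sample (MRR 0.2634 versus 0.1530), and cross-modal reconstruction still has a slightly negative R² for both heads, meaning the predictions fit worse than a constant prediction of the target mean.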

Metric	Value
`retrieval_mrr_minimal`	0.2634
`retrieval_mrr_neural`	0.1530
`retrieval_top5_minimal`	0.3764
`reconstruction_r2_minimal`	-0.0160
`reconstruction_r2_neural`	-0.0102

Source: results/episode_task_suite/cross_modal_retrieval/metrics.json.

Current scope: The current reconstruction task predicts feature vectors; depth, mesh, NeRF, and Gaussian-splatting outputs are future task variants.
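For readers less familiar with these metrics, the following sketch shows how MRR, top-k retrieval rate, and R² behave on toy inputs. It uses the standard textbook definitions and hypothetical function names; it is not the repository's evaluation code, and the ranks and values are made up.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where rank is the 1-based position of the true match."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def top_k_rate(ranks, k):
    """Fraction of queries whose true match is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


ranks = [1, 2, 4, 10]  # made-up ranks of the correct item for four queries
print(mean_reciprocal_rank(ranks))  # 0.4625
print(top_k_rate(ranks, k=5))       # 0.75

targets = [1.0, 2.0, 3.0]
print(r2_score(targets, [2.0, 2.0, 2.0]))  # 0.0: same as predicting the mean
print(r2_score(targets, [3.0, 2.0, 1.0]))  # -3.0: worse than predicting the mean
```

An R² just below zero, as in the table above, therefore indicates a reconstruction that is roughly on par with, but slightly worse than, always predicting the mean feature vector.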

The next scientific unit is held-out episodes, not more adjacent windows

The prepared Qwen3-Omni path targets 32 episodes from 32 sessions, but it remains gated on data access: it cannot proceed until the data is staged and held-out evaluation is complete.

Metric	Value
`target_episodes`	32
`selected_sessions`	32
`valid_candidates`	680

Source: results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md.

Current scope: The 32-episode Qwen3-Omni fine-tune requires gated data staging and held-out evaluation.

How To Read These Results

High single-episode scores are useful pipeline checks for the current task contracts.
Low chronological action/subtask scores are informative because they expose the label shift in later windows.
Neural gains on trajectory/order/alignment make those tasks good candidates for the next fine-tuning stage.
Retrieval and reconstruction remain the main multimodal representation challenges.
The next credible model-quality result needs held-out episodes.