# Research Takeaways

This generated note summarizes what the current public Xperience-10M sample
pipeline actually shows. It is built from committed metric artifacts, not
from hand-edited score text.

## Scope

- validated episodes: 1
- frames: 5,821
- aligned windows: 1,161
- current feature dimension: 8,378
- raw Xperience-10M data is not redistributed
- audio is documented and visualized, but not yet featurized

## Takeaways

### One episode can become a real benchmark contract

The public sample is converted into 5,821 frames, 1,161 aligned 20-frame windows, and an 8,378-dimensional feature contract.

| Metric | Value |
| --- | ---: |
| `frames` | 5,821 |
| `windows` | 1,161 |
| `feature_dim` | 8,378 |

Source: `docs/data/summary_metrics.json`.

Current scope: This benchmark defines the task contract; cross-episode generalization is evaluated in the multi-episode stage.
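
The relationship between the frame count and the window count can be checked with a short sliding-window calculation. This is a minimal sketch: the 5,821 frames and the 20-frame window length come from the table above, while the stride of 5 is an assumption that happens to reproduce the reported 1,161 windows, not a value confirmed by the pipeline. The function name `count_windows` is hypothetical.

```python
def count_windows(n_frames: int, window: int, stride: int) -> int:
    """Number of full windows of length `window` taken every `stride` frames."""
    if n_frames < window:
        return 0
    return (n_frames - window) // stride + 1


# Frame count and window length are from the note; stride=5 is an assumption.
n_windows = count_windows(5821, window=20, stride=5)
print(n_windows)  # 1161
```

If the real pipeline drops windows for alignment reasons rather than using a fixed stride, the count would need a different derivation.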

### Chronological splits expose action-class shift

Earlier all-feature action classifiers reach a macro-F1 of 0.9791 on their local split, but the chronological action and subtask heads of the 12-task suite score only about 0.05 macro-F1, because the later held-out windows include 4 action classes that are unseen before the split point.

| Metric | Value |
| --- | ---: |
| `all_feature_action_macro_f1` | 0.9791 |
| `suite_action_macro_f1` | 0.0500 |
| `suite_subtask_macro_f1` | 0.0495 |
| `unseen_action_test_classes` | 4 |

Source: `results/episode_task_suite/summary_report.json`.

Current scope: This split is useful for studying label shift; broad action-recognition conclusions need held-out episodes.
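
The effect of unseen labels on macro-F1 can be illustrated on synthetic data. The sketch below is not the suite's implementation: the label names, the 70% cut, and the majority-class baseline are all assumptions chosen for illustration, and macro-F1 is averaged here over the classes present in the test labels, which may differ from the averaging used in the committed metrics.

```python
from collections import Counter


def chronological_split(labels, train_percent=70):
    """Split a time-ordered label list into an earlier train part and a later test part."""
    cut = len(labels) * train_percent // 100
    return labels[:cut], labels[cut:]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class that appears in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Synthetic, time-ordered labels: two new actions appear only late in the episode.
labels = ["reach"] * 40 + ["grasp"] * 30 + ["reach"] * 10 + ["pour"] * 10 + ["stir"] * 10
train, test = chronological_split(labels, train_percent=70)
unseen = sorted(set(test) - set(train))

# A majority-class baseline fit on the train part can never predict an unseen label.
majority = Counter(train).most_common(1)[0][0]
predictions = [majority] * len(test)

print(unseen)                                   # ['pour', 'stir']
print(round(macro_f1(test, predictions), 4))    # 0.1667
```

Every unseen class contributes an F1 of zero to the average, so macro-F1 drops sharply even when the seen classes are handled reasonably well.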

### Small neural heads help dynamic and temporal probes

The MLP heads cut hand-trajectory forecasting MPJPE from 0.8223 to 0.1116 and raise F1 for temporal-order verification (0.5487 to 0.8718) and motion/visual synchronization (0.4866 to 0.7335).

| Metric | Value |
| --- | ---: |
| `hand_mpjpe_minimal` | 0.8223 |
| `hand_mpjpe_neural` | 0.1116 |
| `hand_mpjpe_relative_improvement` | 0.8642 |
| `temporal_order_f1_minimal` | 0.5487 |
| `temporal_order_f1_neural` | 0.8718 |
| `misalignment_f1_minimal` | 0.4866 |
| `misalignment_f1_neural` | 0.7335 |

Source: `results/episode_task_suite/neural_mlp/*/metrics.json`.

Current scope: These gains are measured within one episode and are candidates for held-out-episode testing.
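
The reported relative improvement for hand MPJPE can be reproduced from the two table values. The sketch below assumes the usual definition, (baseline - improved) / baseline for a lower-is-better metric; the function name is hypothetical, and applying the same formula to the F1 pairs is an illustration rather than a committed metric.

```python
def relative_improvement(baseline: float, improved: float, lower_is_better: bool) -> float:
    """Fractional change from the baseline, signed so that positive means better."""
    delta = baseline - improved if lower_is_better else improved - baseline
    return delta / baseline


# Values copied from the table above (minimal vs. neural heads).
hand = relative_improvement(0.8223, 0.1116, lower_is_better=True)
order = relative_improvement(0.5487, 0.8718, lower_is_better=False)
misalignment = relative_improvement(0.4866, 0.7335, lower_is_better=False)

print(f"hand MPJPE: {hand:.3f}")
print(f"temporal order F1: {order:.3f}")
print(f"misalignment F1: {misalignment:.3f}")
```

The hand MPJPE result agrees with the reported 0.8642 to within rounding of the four-decimal inputs.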

### Retrieval and reconstruction remain the harder multimodal problems

Ridge/cosine retrieval remains stronger than the neural projection on this sample (MRR 0.2634 vs. 0.1530), and cross-modal reconstruction still has a negative R² for both the minimal and neural variants.

| Metric | Value |
| --- | ---: |
| `retrieval_mrr_minimal` | 0.2634 |
| `retrieval_mrr_neural` | 0.1530 |
| `retrieval_top5_minimal` | 0.3764 |
| `reconstruction_r2_minimal` | -0.0160 |
| `reconstruction_r2_neural` | -0.0102 |

Source: `results/episode_task_suite/cross_modal_retrieval/metrics.json`.

Current scope: The current reconstruction task predicts feature vectors; depth, mesh, NeRF, and Gaussian-splatting outputs are future task variants.
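
For readers less familiar with these metrics, the sketch below shows how MRR, top-k accuracy, and R² are conventionally computed, and why R² goes negative when predictions are worse than always predicting the target mean. The data are synthetic, the function names are hypothetical, and how the suite aggregates R² across feature dimensions is not specified here, so this is the scalar textbook form only.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct item for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def top_k_accuracy(ranks, k):
    """Fraction of queries whose correct item is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic retrieval ranks for four queries.
ranks = [1, 2, 4, 10]
print(mean_reciprocal_rank(ranks))   # 0.4625
print(top_k_accuracy(ranks, k=5))    # 0.75

# Synthetic regression targets.
y_true = [1.0, 2.0, 3.0, 4.0]
r2_mean = r2_score(y_true, [2.5] * 4)   # predicting the mean gives exactly 0
r2_worse = r2_score(y_true, [3.0] * 4)  # a worse constant gives a negative value
print(r2_mean, round(r2_worse, 4))
```

Read this way, the small negative values in the table indicate reconstructions that are slightly worse than a constant mean predictor.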

### The next scientific unit is held-out episodes, not more adjacent windows

The prepared Qwen3-Omni path targets 32 episodes from 32 sessions, but it remains data-gated until data access is granted and held-out evaluation is complete.

| Metric | Value |
| --- | ---: |
| `target_episodes` | 32 |
| `selected_sessions` | 32 |
| `valid_candidates` | 680 |

Source: `results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md`.

Current scope: The 32-episode Qwen3-Omni fine-tune requires gated data staging and held-out evaluation.
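
One way to keep held-out evaluation free of adjacent-window leakage is to select one episode per session and then split at the session level. The following is a minimal sketch under stated assumptions, not the staged selection logic: the candidate IDs are synthetic, the 40 sessions behind the 680 candidates and the 8 held-out sessions are made-up numbers, and both function names are hypothetical.

```python
import random
from collections import defaultdict


def pick_one_episode_per_session(candidates, n_sessions, seed=0):
    """candidates: list of (session_id, episode_id); returns one episode from each of n_sessions sessions."""
    by_session = defaultdict(list)
    for session_id, episode_id in candidates:
        by_session[session_id].append(episode_id)
    rng = random.Random(seed)
    sessions = rng.sample(sorted(by_session), n_sessions)
    return [(s, rng.choice(sorted(by_session[s]))) for s in sessions]


def split_by_session(selected, n_heldout, seed=0):
    """Hold out whole sessions so no session contributes to both train and test."""
    rng = random.Random(seed)
    sessions = sorted({s for s, _ in selected})
    heldout_sessions = set(rng.sample(sessions, n_heldout))
    train = [e for e in selected if e[0] not in heldout_sessions]
    heldout = [e for e in selected if e[0] in heldout_sessions]
    return train, heldout


# Synthetic stand-in: 680 candidates spread over 40 hypothetical sessions.
candidates = [(f"session_{i % 40:02d}", f"episode_{i:03d}") for i in range(680)]
selected = pick_one_episode_per_session(candidates, n_sessions=32)
train, heldout = split_by_session(selected, n_heldout=8)
print(len(selected), len(train), len(heldout))  # 32 24 8
```

Splitting by session rather than by window is the design choice that matters here: windows from the same episode are highly correlated, so a window-level split would overstate generalization.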

## How To Read These Results

- High single-episode scores are useful pipeline checks for the current task contracts.
- Low chronological action/subtask scores are informative because they expose later-label shift.
- Neural gains on trajectory/order/alignment make those tasks good candidates for the next fine-tuning stage.
- Retrieval and reconstruction remain the main multimodal representation challenges.
- The next credible model-quality result needs held-out episodes.