# Tasks 13-20 Baselines

These eight tasks are part of the unified 20-task public-sample suite. They reuse the same 20-frame windows, 5-frame stride, shared feature tensor, chronological split, and minimal/neural baseline discipline as tasks 1-12.
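The windowing and split conventions above can be sketched as follows. This is a minimal illustration, not the suite's actual code: the function names and the `train_fraction` value are hypothetical, and only the 20-frame window, the 5-frame stride, and the chronological ordering come from this document.

```python
def window_starts(num_frames, window=20, stride=5):
    """Return the start index of every full window that fits in the recording."""
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, stride))


def chronological_split(items, train_fraction=0.8):
    """Split time-ordered items without shuffling: earlier items train, later items test.

    train_fraction is an assumed value for illustration only.
    """
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


starts = window_starts(40)
train_starts, test_starts = chronological_split(starts)
```

Keeping the split chronological means every test window begins after every training window begins, although windows near the cut can still overlap in frames because the stride is shorter than the window.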

The file and directory names still contain `tier2_task_suite` for backwards-compatible public links, but this is not a separate benchmark tier.

## Setup Alignment

- Tasks 1-12: `12` task contracts
- Tasks 13-20: `8` task contracts
- Unified task contracts: `20` in total
- Long-horizon offset: `100` frames, which is `5.0` seconds at 20 FPS
- Raw public-sample HDF5 is required to regenerate the interaction/object targets; raw media/HDF5 files are not redistributed.
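As a rough sketch of how the long-horizon offset maps onto a label sequence: the helper names below are hypothetical, and counting the offset from the last frame of the window is an assumption, since this document does not state the reference frame.

```python
FPS = 20
WINDOW = 20
LONG_HORIZON_OFFSET = 100  # frames


def offset_seconds(offset_frames=LONG_HORIZON_OFFSET, fps=FPS):
    """Convert a frame offset into seconds."""
    return offset_frames / fps


def long_horizon_label(labels, start, window=WINDOW, offset=LONG_HORIZON_OFFSET):
    """Return the label `offset` frames after the window, or None if out of range.

    Assumption: the offset is counted from the last frame of the window.
    """
    target = start + window - 1 + offset
    return labels[target] if target < len(labels) else None


labels = ["pour"] * 119 + ["stir"]
future = long_horizon_label(labels, start=0)
```

Windows whose target frame falls past the end of the recording have no label under this convention and would need to be dropped.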

## Results

| # | Task | Input | Output | Minimal | Neural MLP | Meaning |
| ---: | --- | --- | --- | ---: | ---: | --- |
| 13 | Long-Horizon Next-Action Forecasting | Current 20-frame non-caption multimodal window. | Action label five seconds later. | 0.0750 macro-F1 | 0.0655 macro-F1 | Tests whether the current state carries enough procedure context to forecast beyond the one-second core next-action task. |
| 14 | Long-Horizon Next-Subtask Forecasting | Current 20-frame non-caption multimodal window. | Procedure subtask label five seconds later. | 0.0455 macro-F1 | 0.0507 macro-F1 | Moves from immediate action anticipation to higher-level procedure-state prediction. |
| 15 | Interaction Text Prediction | Current 20-frame sensor window with caption-text features removed. | Raw annotation interaction phrase for the same window. | 0.0444 macro-F1 | 0.0381 macro-F1 | Uses the raw caption JSON interaction field as a language target instead of only the hashed text feature. |
| 16 | Action-Object Relation Prediction | Current 20-frame sensor window with caption-text features removed. | Joint action plus active object-set relation. | 0.0000 macro-F1 | 0.0000 macro-F1 | Evaluates whether a model can bind what action is happening to which objects are involved. |
| 17 | Future Object-Set Forecasting | Current 20-frame sensor window with caption-text features removed. | Object set active five seconds later. | 0.1694 micro-F1 | 0.1972 micro-F1 | Predicts which objects will become relevant soon, not only which objects are relevant now. |
| 18 | IMU-to-Hand Pose Reconstruction | Current IMU acceleration/gyroscope feature block only. | Current left/right hand joint feature blocks. | 0.0420 MAE | 0.0426 MAE | A sensor-bridge probe for how much hand configuration can be recovered from inertial motion alone. |
| 19 | Camera-View Synchronization Retrieval | Fisheye camera-1 feature query projected into fisheye camera-3 feature space. | The synchronized held-out camera-3 window. | 0.4943 MRR | 0.2409 MRR | Stress-tests multi-camera time alignment beyond the core cross-modal retrieval task. |
| 20 | Time-to-Next-Transition Regression | Current 20-frame non-caption multimodal window. | Frames until the next action-label boundary, capped at 200 frames. | 10.5374 MAE frames | 10.5545 MAE frames | Turns boundary detection into a continuous timing estimate for procedural control. |
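Two entries in the table are easier to read with a concrete definition in hand: the Task 20 regression target and the MRR metric reported for Task 19. The sketch below is illustrative only. The function names are hypothetical, and two conventions are assumptions: frames with no later boundary receive the cap, and a retrieval rank is one plus the number of candidates scoring strictly higher than the correct one.

```python
def frames_to_next_transition(labels, cap=200):
    """For each frame, count frames until the next label boundary, capped at `cap`.

    A boundary is a frame whose label differs from the previous frame's label.
    Assumption: frames with no later boundary get the cap value.
    """
    out = [cap] * len(labels)
    next_boundary = None
    for t in range(len(labels) - 1, -1, -1):
        if next_boundary is not None:
            out[t] = min(next_boundary - t, cap)
        if t > 0 and labels[t] != labels[t - 1]:
            next_boundary = t
    return out


def mean_reciprocal_rank(score_rows, correct_indices):
    """Average of 1/rank of the correct candidate over all queries."""
    total = 0.0
    for scores, correct in zip(score_rows, correct_indices):
        # Rank is 1 plus the number of candidates scoring strictly higher.
        rank = 1 + sum(1 for s in scores if s > scores[correct])
        total += 1.0 / rank
    return total / len(score_rows)


targets = frames_to_next_transition(["a", "a", "a", "b", "b"])
mrr = mean_reciprocal_rank([[0.9, 0.1], [0.2, 0.8, 0.5]], [0, 2])
```

Under these definitions an MRR of 1.0 means the synchronized window is always ranked first, and the Task 20 MAE is measured in the same frame units as the capped target.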

## Interpretation Boundary

Tasks 13-20 are sample-level baselines in the same unified public-sample suite. They show that the sample can support richer task contracts, but they do not establish cross-episode model quality.