| # Audio Ablation and Raw-Audio Upgrade |
|
|
| This report is generated from the committed task-suite artifacts and the audio stream of the local public-sample MP4. |
| It measures how much audio changes the primary metric of each single-episode task when every variant is evaluated on the same chronological split. |
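
The comparison only makes sense if every variant sees identical train and test windows. The sketch below illustrates that idea under stated assumptions: the window layout (`start_s`, `label`, `features`), the `other` feature group, the `train_fraction` value, and both helper names are hypothetical and are not taken from the task suite.

```python
def chronological_split(windows, train_fraction=0.75):
    """Earliest windows go to train, the remainder to test; nothing is shuffled."""
    ordered = sorted(windows, key=lambda w: w["start_s"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def select_feature_groups(window, keep):
    """Copy a window, keeping only the named feature groups."""
    return {
        "start_s": window["start_s"],
        "label": window["label"],
        "features": {name: window["features"][name] for name in keep},
    }


# Toy episode: eight one-second windows with two hypothetical feature groups.
windows = [
    {"start_s": float(i), "label": i % 2,
     "features": {"other": [0.1 * i], "audio": [0.2 * i]}}
    for i in range(8)
]

# Split once, then derive every ablation variant from the same split.
train, test = chronological_split(windows)
variants = {"current_audio": ["other", "audio"], "no_audio": ["other"]}
splits = {
    name: ([select_feature_groups(w, keep) for w in train],
           [select_feature_groups(w, keep) for w in test])
    for name, keep in variants.items()
}
```

Splitting before dropping feature groups means any metric difference between variants comes from the features, not from a different partition of the episode.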
|
|
| ## Raw Audio Feature |
|
|
| - Source: `local_public_sample/fisheye_cam0.mp4` |
| - Has audio: `True` |
| - Sample rate: `16000` |
| - Window feature dim: `588` |
| - Feature: per-window statistics of the log-mel spectrogram (computed from an STFT of the raw waveform), plus statistics of its delta and of the waveform envelope. |
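
To make the feature description concrete, here is a minimal sketch of one way to compute such a per-window vector. It is not the report's extractor. The frame size, hop, number of mel bands, the choice of statistics (mean, standard deviation, min, max), the reading of "delta" as a frame-to-frame difference, and the absolute-value envelope are all assumptions, and the result has 68 dimensions rather than the 588 reported above. It uses only the standard library so it runs anywhere; a real pipeline would more likely use NumPy or an audio library.

```python
import cmath
import math
from statistics import mean, pstdev


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale, one row per band."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def power_spectrum(frame):
    """Naive DFT power spectrum of one Hann-windowed frame."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))) ** 2
            for k in range(n // 2 + 1)]


def log_mel_frames(samples, sample_rate, n_fft=64, hop=32, n_mels=8):
    """Short-time log-mel energies: one list of n_mels values per frame."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        power = power_spectrum(samples[start:start + n_fft])
        frames.append([math.log(sum(w * p for w, p in zip(row, power)) + 1e-10)
                       for row in bank])
    return frames


def stats(values):
    return [mean(values), pstdev(values), min(values), max(values)]


def window_features(samples, sample_rate=16000):
    """Concatenate log-mel, delta, and envelope statistics for one window."""
    frames = log_mel_frames(samples, sample_rate)
    deltas = [[b - a for a, b in zip(prev, cur)]
              for prev, cur in zip(frames, frames[1:])]
    features = []
    for band in zip(*frames):   # per-band log-mel statistics
        features.extend(stats(band))
    for band in zip(*deltas):   # per-band frame-to-frame delta statistics
        features.extend(stats(band))
    features.extend(stats([abs(x) for x in samples]))  # crude envelope
    return features


# Example: a 0.1 s window of a 440 Hz tone at the report's 16 kHz sample rate.
tone = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(1600)]
vector = window_features(tone)
```

With these assumed settings the vector holds 8 bands x 4 statistics for the log-mel energies, the same again for the deltas, and 4 envelope statistics.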
|
|
| ## Task Deltas |
|
|
| | Task | Metric | Current audio | No audio | Current-audio delta (vs. no audio) | Raw replaces audio | Raw-replacement delta (vs. current audio) | |
| | --- | --- | ---: | ---: | ---: | ---: | ---: | |
| | Current Action Recognition | macro_f1 | 0.0091 | 0.0088 | 0.0003 | 0.0013 | -0.0077 | |
| | Current Subtask Recognition | macro_f1 | 0.0113 | 0.0112 | 0.0001 | 0.0008 | -0.0104 | |
| | Action Transition Detection | macro_f1 | 0.4621 | 0.4687 | -0.0066 | 0.4792 | 0.0171 | |
| | Next-Action Prediction | macro_f1 | 0.0106 | 0.0107 | -0.0001 | 0.0060 | -0.0046 | |
| | Future Hand Motion Forecasting | mae | 4.4664 | 4.3038 | -0.1626 | 4.3059 | 0.1605 | |
| | Contact State Prediction | macro_f1 | 1.0000 | 1.0000 | 0.0000 | 1.0000 | 0.0000 | |
| | Relevant Object Prediction | micro_f1 | 0.1581 | 0.1479 | 0.0102 | 0.1787 | 0.0206 | |
| | Language-to-Time Grounding | mrr | 0.0321 | 0.0272 | 0.0049 | 0.0248 | -0.0072 | |
| | Cross-Modal Window Retrieval | mrr | 0.3751 | 0.3892 | -0.0141 | 0.3275 | -0.0476 | |
| | Sensor-to-Visual Reconstruction | mae | 9.7942 | 10.4467 | 0.6524 | 8.8307 | 0.9635 | |
| | Temporal Order Verification | macro_f1 | 0.5172 | 0.4943 | 0.0230 | 0.5302 | 0.0129 | |
| | Cross-Modal Misalignment Detection | macro_f1 | 0.4173 | 0.4226 | -0.0052 | 0.4438 | 0.0264 | |
|
|
| ## Aggregate |
|
|
| - Mean current-audio delta (unweighted over all 12 tasks): `0.041849794979543296` |
| - Tasks where current handcrafted audio improves the primary metric: `6` of 12 |
| - Mean raw-replacement delta vs. current handcrafted audio (unweighted over all 12 tasks): `0.09362598132150173` |
| - Tasks where raw log-mel replacement improves over current handcrafted audio: `6` of 12 |
|
|
| A positive delta always means an improvement in the task's primary metric. For MAE tasks, where lower is better, the sign is flipped so that a reduction in MAE is reported as a positive delta. |
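
The sign convention and the two aggregate figures can be reproduced with a few lines. This is a sketch under assumptions: `signed_delta`, `summarize`, and `LOWER_IS_BETTER` are illustrative names, not the report generator's code. The inputs are the four-decimal values copied from the table, so results can differ from the reported figures in the last digit.

```python
from statistics import mean

# Metrics where a lower value is better; MAE is the only one in this report.
LOWER_IS_BETTER = {"mae"}


def signed_delta(metric, candidate, baseline):
    """Improvement of candidate over baseline; positive always means better."""
    raw = candidate - baseline
    return -raw if metric in LOWER_IS_BETTER else raw


def summarize(deltas):
    """Unweighted mean of the deltas and the count that are strictly positive."""
    return mean(deltas), sum(1 for d in deltas if d > 0)


# (task, metric, current audio, no audio, raw replaces audio) from the table.
rows = [
    ("Action Transition Detection", "macro_f1", 0.4621, 0.4687, 0.4792),
    ("Future Hand Motion Forecasting", "mae", 4.4664, 4.3038, 4.3059),
    ("Relevant Object Prediction", "micro_f1", 0.1581, 0.1479, 0.1787),
]
for task, metric, current, no_audio, raw in rows:
    audio_delta = signed_delta(metric, current, no_audio)  # audio vs. no audio
    raw_delta = signed_delta(metric, raw, current)          # raw vs. current
    print(f"{task}: {audio_delta:+.4f} {raw_delta:+.4f}")

# The twelve current-audio deltas, in table order.
current_audio_deltas = [0.0003, 0.0001, -0.0066, -0.0001, -0.1626, 0.0000,
                        0.0102, 0.0049, -0.0141, 0.6524, 0.0230, -0.0052]
mean_delta, improved = summarize(current_audio_deltas)
print(f"mean={mean_delta:.4f} improved={improved}")
```

Run on the rounded table values, the summary gives six improved tasks and a mean of about 0.0419, which agrees with the reported aggregate up to rounding. Because the mean averages F1, MRR, and MAE deltas without rescaling, the two MAE tasks contribute most of its magnitude.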
|
|