128-Episode Task Suite Enhancement Pack

Run id: task_suite_enhancement_128_v1_20260608

This non-overwriting enhancement pack records how to extract more training and evaluation signal from the current 128-episode task suite without adding raw episodes.

Current Evidence

Current public export windows: 3808
Window split counts: train 2848 / val 512 / test 448
Selected episode split: train 96 / val 16 / test 16
Windowed episode ids in baseline CSV: train 89 / val 16 / test 14 (119 of the 128 selected episodes)
Qwen3 v4 JSON validity: 1.0000
Qwen3 v4 action macro-F1: 0.001868
Qwen3 v4 subtask accuracy: 0.000000
Qwen3 v4 unseen-label sample share: 0.7076
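
The unseen-label sample share helps explain the near-zero action and subtask scores: a held-out sample whose label never occurs in training cannot be matched by exact-label scoring. The sketch below shows one plausible way to compute such a share. The function name and the toy labels are assumptions for illustration; the exact definition behind the 0.7076 figure is not stated in this pack.

```python
from typing import Iterable, Sequence


def unseen_label_share(train_labels: Iterable[str], eval_labels: Sequence[str]) -> float:
    """Fraction of evaluation samples whose label never appears in training."""
    seen = set(train_labels)
    if not eval_labels:
        return 0.0
    unseen = sum(1 for label in eval_labels if label not in seen)
    return unseen / len(eval_labels)


# Hypothetical toy labels, not taken from the task suite.
train = ["pick cup", "place cup", "open drawer"]
held_out = ["pick cup", "pick red mug", "close drawer", "open drawer"]
share = unseen_label_share(train, held_out)
print(f"unseen-label sample share: {share:.4f}")
```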

Dense-Window Scenarios

scenario	estimated windows	multiplier	role
`current_export`	3808	1.0	current public 128-episode JSON-task export
`dense_20f_stride20`	30422	7.99	non-overlapping dense coverage over each observed episode frame span
`dense_20f_stride10`	60725	15.95	50% overlap (2x coverage) action/subtask densification
`dense_20f_stride5`	121331	31.86	high-overlap action boundary and transition stress setting
`medium_40f_stride20`	30303	7.96	subtask/procedure context window
`long_80f_stride40`	15067	3.96	procedure and world-model context window
`multiscale_20s10_40s20_80s40`	106095	27.86	recommended no-new-episode v5 export: short action windows plus medium/long procedure context
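
The arithmetic behind the table can be checked with a short sketch. It assumes that a sliding-window export counts `(span - window) // stride + 1` windows per episode frame span, which is a common convention and not confirmed by this pack. The estimated window counts are copied from the table above, and `count_windows` is a hypothetical helper.

```python
def count_windows(span_frames: int, window: int, stride: int) -> int:
    """Number of full windows that fit in one frame span (assumed convention)."""
    if span_frames < window:
        return 0
    return (span_frames - window) // stride + 1


CURRENT_WINDOWS = 3808  # current public export

# Estimated window counts copied from the scenario table.
estimates = {
    "dense_20f_stride20": 30422,
    "dense_20f_stride10": 60725,
    "dense_20f_stride5": 121331,
    "medium_40f_stride20": 30303,
    "long_80f_stride40": 15067,
    "multiscale_20s10_40s20_80s40": 106095,
}

# Multiplier relative to the current export, rounded as in the table.
multipliers = {name: round(n / CURRENT_WINDOWS, 2) for name, n in estimates.items()}

# The multiscale scenario is the sum of its three component scenarios.
multiscale_total = (
    estimates["dense_20f_stride10"]
    + estimates["medium_40f_stride20"]
    + estimates["long_80f_stride40"]
)

for name, mult in multipliers.items():
    print(f"{name}: {estimates[name]} windows, {mult}x")
```

Under this convention, a hypothetical 100-frame span yields 5 windows at 20 frames with stride 20, 9 at stride 10, and 17 at stride 5.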

Highest-Priority Bottlenecks

task	priority	simple score	bottleneck	next action
Next-Action Prediction	highest	0.000200	fine-grained label explosion and held-out unseen labels	add hierarchical action/subtask families plus label-normalized scoring
Action Recognition	highest	0.000175	fine-grained label explosion and held-out unseen labels	add hierarchical action/subtask families plus label-normalized scoring
Procedure Step Recognition	highest	0.000000	fine-grained label explosion and held-out unseen labels	add hierarchical action/subtask families plus label-normalized scoring
Cross-Modal Retrieval	high		missing raw 128-episode feature blocks	export compact raw-feature shards for this task before model comparison
Hand Trajectory Forecasting	high		missing raw 128-episode feature blocks	export compact raw-feature shards for this task before model comparison
Multimodal Synchronization Detection	high		missing raw 128-episode feature blocks	export compact raw-feature shards for this task before model comparison
Cross-Modal Reconstruction	high		missing raw 128-episode feature blocks	export compact raw-feature shards for this task before model comparison
Language Grounding	medium	0.012786	weak public-safe metadata/text baseline	add dense windows and stronger fusion baselines before interpreting model quality
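
One way to read "hierarchical action/subtask families plus label-normalized scoring" is sketched below. Everything in it is an assumption: the normalization rule, the choice of the leading verb as the family, and the toy labels are hypothetical and do not describe the suite's actual label scheme. The point is only that family-level scoring can give credit where exact fine-grained matching gives none.

```python
from typing import Sequence


def normalize_label(label: str) -> str:
    """Lowercase, treat underscores as spaces, and collapse whitespace."""
    return " ".join(label.lower().replace("_", " ").split())


def action_family(label: str) -> str:
    """Hypothetical family rule: the first token (the verb) names the family."""
    normalized = normalize_label(label)
    return normalized.split(" ", 1)[0] if normalized else ""


def exact_accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Exact match on normalized fine-grained labels."""
    if not golds:
        return 0.0
    hits = sum(normalize_label(p) == normalize_label(g) for p, g in zip(preds, golds))
    return hits / len(golds)


def family_accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Match at the coarser family level."""
    if not golds:
        return 0.0
    hits = sum(action_family(p) == action_family(g) for p, g in zip(preds, golds))
    return hits / len(golds)


# Hypothetical predictions and references, not taken from the suite.
preds = ["Pick_Cup", "open drawer", "place mug"]
golds = ["pick red mug", "close drawer", "place  mug"]
print(exact_accuracy(preds, golds), family_accuracy(preds, golds))
```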

Recommended Next Run

Use `multiscale_20s10_40s20_80s40` as the next export target, then train a Qwen3 v5 hierarchical-target LoRA/partial-unfreeze run against the unchanged 96/16/16 episode split.

In parallel, export compact raw 128-episode feature shards for the trajectory, retrieval, reconstruction, and synchronization tasks so that the simple and neural baselines can be compared on the same tasks, not only on the labels supported by the JSON export.

The current artifacts remain the baseline; future runs should use new run ids and publish separate verified packages.
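
The non-overwriting rule can be enforced mechanically. The sketch below assumes run ids follow the `<name>_v<version>_<YYYYMMDD>` pattern seen in this pack's run id and that each run writes to its own directory; `new_run_id`, `create_run_dir`, and the directory layout are hypothetical.

```python
from datetime import date
from pathlib import Path


def new_run_id(name: str, version: int, day: date) -> str:
    """Build a run id in the assumed <name>_v<version>_<YYYYMMDD> pattern."""
    return f"{name}_v{version}_{day:%Y%m%d}"


def create_run_dir(root: Path, run_id: str) -> Path:
    """Create a fresh run directory; raise FileExistsError rather than overwrite."""
    run_dir = root / run_id
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


print(new_run_id("task_suite_enhancement_128", 1, date(2026, 6, 8)))
```

Calling `create_run_dir` twice with the same run id fails on the second call, so an existing baseline package cannot be replaced by accident.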