	# 128-Episode Task Suite Enhancement Pack

	Run id: `task_suite_enhancement_128_v1_20260608`

	This enhancement pack does not overwrite any existing artifact. It records how to get more training and evaluation signal out of the current 128-episode task suite without adding raw episodes.

	## Current Evidence

	- Current public export windows: `3808`
	- Window split counts: `train 2848 / val 512 / test 448`
	- Selected episode split: `train 96 / val 16 / test 16`
	- Windowed episode ids in baseline CSV: `train 89 / val 16 / test 14` (119 of the 128 selected episodes; 7 train and 2 test episodes contribute no windows)
	- Qwen3 v4 JSON validity: `1.0000`
	- Qwen3 v4 action macro-F1: `0.001868`
	- Qwen3 v4 subtask accuracy: `0.000000`
	- Qwen3 v4 unseen-label sample share: `0.7076`
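The unseen-label sample share reported above can be illustrated with a minimal sketch. It assumes the metric is the fraction of held-out samples whose label string never appears in the training split; the source does not spell out the definition. The function name and the toy labels are hypothetical.

```python
def unseen_label_share(train_labels, eval_labels):
    """Fraction of evaluation samples whose label never occurs in training.

    Assumed definition; the report does not state how the share is computed.
    """
    if not eval_labels:
        return 0.0
    seen = set(train_labels)
    unseen = sum(1 for label in eval_labels if label not in seen)
    return unseen / len(eval_labels)


# Toy labels, invented for illustration only.
train = ["pick up cup", "open drawer", "pick up cup"]
held_out = ["pick up cup", "pick up mug", "close drawer", "open drawer"]
share = unseen_label_share(train, held_out)
print(f"unseen-label sample share: {share:.4f}")

# Consistency check on the window split counts listed above.
window_split = {"train": 2848, "val": 512, "test": 448}
assert sum(window_split.values()) == 3808
```

Under this reading, a share of `0.7076` would mean that roughly 71% of held-out samples carry a label the model never saw during training, which is consistent with the near-zero exact-match scores.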

	## Dense-Window Scenarios

	| scenario | estimated windows | multiplier | role |
	| --- | ---: | ---: | --- |
	| `current_export` | 3808 | 1.00 | current public 128-episode JSON-task export |
	| `dense_20f_stride20` | 30422 | 7.99 | non-overlapping dense coverage of each observed episode frame span |
	| `dense_20f_stride10` | 60725 | 15.95 | 50% overlap (each frame covered twice) for action/subtask densification |
	| `dense_20f_stride5` | 121331 | 31.86 | high-overlap stress setting for action boundaries and transitions |
	| `medium_40f_stride20` | 30303 | 7.96 | subtask/procedure context window |
	| `long_80f_stride40` | 15067 | 3.96 | procedure and world-model context window |
	| `multiscale_20s10_40s20_80s40` | 106095 | 27.86 | recommended no-new-episode v5 export: short action windows plus medium/long procedure context |
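The window estimates in the table can be reproduced in spirit with a sliding-window count. The sketch below assumes that only full windows are kept and that each episode contributes one contiguous frame span; the actual export rules may differ. The function names and the toy spans are hypothetical. The final assertions only check arithmetic that the table itself states.

```python
def count_windows(span_frames, window, stride):
    """Number of full windows of `window` frames taken every `stride` frames."""
    if span_frames < window:
        return 0  # assumption: partial windows are dropped
    return (span_frames - window) // stride + 1


def estimate_windows(spans, window, stride):
    """Total window count over a list of per-episode frame-span lengths."""
    return sum(count_windows(n, window, stride) for n in spans)


# Toy per-episode span lengths in frames, invented for illustration.
spans = [100, 250, 35]
dense_s20 = estimate_windows(spans, window=20, stride=20)
dense_s10 = estimate_windows(spans, window=20, stride=10)
medium = estimate_windows(spans, window=40, stride=20)
long_ctx = estimate_windows(spans, window=80, stride=40)
multiscale = dense_s10 + medium + long_ctx
print(dense_s20, dense_s10, medium, long_ctx, multiscale)

# Arithmetic taken from the table: the multiscale row is the sum of the
# stride-10, medium, and long rows, and multipliers are relative to 3808.
assert 60725 + 30303 + 15067 == 106095
assert round(106095 / 3808, 2) == 27.86
```

Halving the stride roughly doubles the count, which matches the table's progression from 7.99x to 15.95x to 31.86x for the 20-frame window.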

	## Highest-Priority Bottlenecks

	| task | priority | simple score | bottleneck | next action |
	| --- | --- | ---: | --- | --- |
	| Next-Action Prediction | highest | 0.000200 | fine-grained label explosion and held-out unseen labels | add hierarchical action/subtask families plus label-normalized scoring |
	| Action Recognition | highest | 0.000175 | fine-grained label explosion and held-out unseen labels | add hierarchical action/subtask families plus label-normalized scoring |
	| Procedure Step Recognition | highest | 0.000000 | fine-grained label explosion and held-out unseen labels | add hierarchical action/subtask families plus label-normalized scoring |
	| Cross-Modal Retrieval | high | | missing raw 128-episode feature blocks | export compact raw-feature shards for this task before model comparison |
	| Hand Trajectory Forecasting | high | | missing raw 128-episode feature blocks | export compact raw-feature shards for this task before model comparison |
	| Multimodal Synchronization Detection | high | | missing raw 128-episode feature blocks | export compact raw-feature shards for this task before model comparison |
	| Cross-Modal Reconstruction | high | | missing raw 128-episode feature blocks | export compact raw-feature shards for this task before model comparison |
	| Language Grounding | medium | 0.012786 | weak public-safe metadata/text baseline | add dense windows and stronger fusion baselines before interpreting model quality |
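One possible reading of "hierarchical action/subtask families plus label-normalized scoring" is sketched below. It assumes that normalization means canonicalizing the label text, and that a family can be approximated by the leading verb token. Both are illustrative assumptions; the source does not define the family scheme. All function names and example labels are hypothetical.

```python
import re


def normalize_label(label: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", label.lower())
    return " ".join(text.split())


def action_family(label: str) -> str:
    """Hypothetical coarse family: first token of the normalized label."""
    tokens = normalize_label(label).split()
    return tokens[0] if tokens else "unknown"


def accuracy(gold, pred, key=lambda label: label):
    """Share of positions where key(gold) equals key(pred)."""
    if not gold:
        return 0.0
    hits = sum(1 for g, p in zip(gold, pred) if key(g) == key(p))
    return hits / len(gold)


# Toy labels, invented for illustration only.
gold = ["Pick up the red cup.", "open drawer", "Pour water into cup"]
pred = ["pick up the red cup", "open the top drawer", "pick up cup"]

exact = accuracy(gold, pred)
normalized = accuracy(gold, pred, key=normalize_label)
family = accuracy(gold, pred, key=action_family)
print(f"exact={exact:.3f} normalized={normalized:.3f} family={family:.3f}")
```

Reporting the three levels side by side separates formatting mismatches from genuinely wrong predictions, and gives partial credit when a prediction lands in the right family but on an unseen fine-grained label.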

	## Recommended Next Run

	Use `multiscale_20s10_40s20_80s40` as the next export target, then train a Qwen3 v5 hierarchical-target LoRA/partial-unfreeze run against the unchanged 96/16/16 episode split.
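A minimal sketch of how the multiscale export could stay on the unchanged episode split: each window inherits the split of its source episode, so overlapping windows never cross the train/val/test boundary. It assumes the scenario name encodes `(window, stride)` pairs as `<window>s<stride>`, which matches the single-scale rows of the scenario table, and that frame spans are half-open. The record fields and the episode id are hypothetical.

```python
import re


def parse_scales(scenario: str):
    """Read (window, stride) pairs such as '20s10' out of a scenario name."""
    return [(int(w), int(s)) for w, s in re.findall(r"(\d+)s(\d+)", scenario)]


def episode_windows(episode_id, split, start, end, scales):
    """Yield window records for one episode over frames [start, end).

    Every record carries the episode's split, so no window leaks across splits.
    Partial trailing windows are dropped (an assumption).
    """
    for window, stride in scales:
        for first in range(start, end - window + 1, stride):
            yield {
                "episode_id": episode_id,
                "split": split,
                "window": window,
                "start": first,
                "end": first + window,  # exclusive
            }


scales = parse_scales("multiscale_20s10_40s20_80s40")
records = list(episode_windows("ep_0001", "train", 0, 100, scales))
print(scales, len(records))
```

Assigning the split at the episode level matters more as overlap grows, because neighbouring windows share most of their frames.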

	In parallel, export compact raw 128-episode feature shards for the trajectory, retrieval, reconstruction, and synchronization tasks, so that the simple and neural baselines can be compared on the same inputs for tasks that the JSON-supported labels do not cover.

	The current artifacts remain the baseline; future runs should write new run ids and publish separate verified packages.
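The non-overwriting convention can be enforced mechanically. The sketch below assumes run ids follow the `<name>_v<version>_<YYYYMMDD>` pattern seen in `task_suite_enhancement_128_v1_20260608` and that each run writes into its own directory. The helper name and directory layout are hypothetical.

```python
import tempfile
from datetime import date
from pathlib import Path


def new_run_dir(root: Path, name: str, version: int, day: date) -> Path:
    """Create a fresh run directory and refuse to reuse an existing run id."""
    run_id = f"{name}_v{version}_{day:%Y%m%d}"
    run_dir = root / run_id
    # exist_ok=False raises FileExistsError instead of overwriting a prior run.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    created = new_run_dir(root, "task_suite_enhancement_128", 2, date(2026, 6, 9))
    created_name = created.name
    try:
        new_run_dir(root, "task_suite_enhancement_128", 2, date(2026, 6, 9))
        reused = True
    except FileExistsError:
        reused = False
    print(created_name, "reused" if reused else "refused to overwrite")
```

Failing on an existing directory keeps the current baseline artifacts intact even if a later run is launched with a stale run id.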
