Publish Ropedia Xperience-10M task baseline cards

a8124a8 verified 24 days ago

8.73 kB

	# Episode Task Suite

	Script:

	```text
	scripts/episode_task_suite.py
	```

	This script turns the single public Xperience-10M sample episode into many end-to-end tasks. It is designed for learning, debugging, and task design. It is not a generalization benchmark because the data is still one episode.

	Run:

	```bash
	cd /path/to/Ropedia
	source .venv/bin/activate
	python scripts/episode_task_suite.py
	```

	Output:

	```text
	outputs/episode_task_suite/
	```

	Shared setup:

	```text
	sample episode: 5821 frames
	windows: 1161
	window size: 20 frames
	stride: 5 frames
	feature dim: 8546
	split: chronological, first 70% train and last 30% test
	```

	## Implemented Tasks

	\| Task \| Input \| Output \| Main artifact \|
	\| --- \| --- \| --- \| --- \|
	\| `timeline_action` \| all modality window \| current action label \| `timeline_action/metrics.json` \|
	\| `timeline_subtask` \| all modality window \| current subtask label \| `timeline_subtask/metrics.json` \|
	\| `transition_detection` \| all modality window \| steady vs action boundary \| `transition_detection/metrics.json` \|
	\| `next_action` \| current all modality window \| action 20 frames later \| `next_action/metrics.json` \|
	\| `hand_trajectory_forecast` \| current all modality window \| future 10-frame left/right hand joints \| `hand_trajectory_forecast/predictions.npz` \|
	\| `contact_prediction` \| non-contact modalities \| any body contact in window \| `contact_prediction/metrics.json` \|
	\| `object_relevance` \| non-caption modalities \| relevant object set \| `object_relevance/predictions.csv` \|
	\| `caption_grounding` \| caption objects/interaction query + sensor candidates \| matching time window \| `caption_grounding/metrics.json` \|
	\| `cross_modal_retrieval` \| motion/IMU/camera/audio query \| matching depth/video window \| `cross_modal_retrieval/metrics.json` \|
	\| `modality_reconstruction` \| motion/IMU/camera/audio \| depth/video feature vector \| `modality_reconstruction/predictions.npz` \|
	\| `temporal_order` \| two adjacent windows \| whether order is correct \| `temporal_order/metrics.json` \|
	\| `misalignment_detection` \| motion+visual/audio pair \| aligned vs shifted \| `misalignment_detection/metrics.json` \|

	## Minimal Model Architectures

	All tasks share the same window builder unless a task explicitly removes a
	feature block to avoid label leakage.

	```text
	raw sample episode
	-> 20-frame sliding windows, stride 5
	-> all-modality feature vector X_all, 8,546 dimensions
	-> chronological split, first 70% train and last 30% test
	-> train-only z-score scaler
	-> task-specific minimal head
	```

	The task suite intentionally uses simple heads:

	\| Family \| Formula \| Tasks \|
	\| --- \| --- \| --- \|
	\| Linear softmax \| `softmax(z(X)W + b)`, cross-entropy, L2 \| `timeline_action`, `timeline_subtask`, `transition_detection`, `next_action`, `contact_prediction`, `temporal_order`, `misalignment_detection` \|
	\| Ridge regression/projection \| dual ridge regression with L2=10 on z-scored X/Y \| `hand_trajectory_forecast`, `caption_grounding`, `cross_modal_retrieval`, `modality_reconstruction` \|
	\| Multi-label logistic \| `sigmoid(z(X)W + b)`, weighted object heads \| `object_relevance` \|

	Task-specific architecture details:

	\| Task \| Input tensor/vector \| Minimal head \| Output target \|
	\| --- \| --- \| --- \| --- \|
	\| `timeline_action` \| `X_all`, 8,546d \| class-weighted linear softmax \| current action label \|
	\| `timeline_subtask` \| `X_all`, 8,546d \| class-weighted linear softmax \| current subtask label \|
	\| `transition_detection` \| `X_all`, 8,546d \| class-weighted linear softmax \| steady vs transition near action boundary \|
	\| `next_action` \| `X_all(t)`, 8,546d \| class-weighted linear softmax \| action at `t+20` frames \|
	\| `hand_trajectory_forecast` \| `X_all(t)`, 8,546d \| ridge regression \| future 10 frames of left/right hand joints, 1,260d \|
	\| `contact_prediction` \| all features except `body_contacts` and caption text, 7,503d \| linear softmax on observed labels \| any body contact in window \|
	\| `object_relevance` \| all features except caption text, 7,650d \| multi-label logistic regression \| 34-object multi-hot vector \|
	\| `caption_grounding` \| sensor features, 7,650d, projected into 896d text space \| ridge projection plus cosine ranking \| matching time window for a text query \|
	\| `cross_modal_retrieval` \| motion/IMU/camera/audio, 2,415d, projected into 5,096d visual space \| ridge projection plus cosine ranking \| matching depth/video window \|
	\| `modality_reconstruction` \| motion/IMU/camera/audio, 2,415d \| ridge regression \| depth/video feature vector, 5,096d \|
	\| `temporal_order` \| `[x_t, x_t+1, x_t+1-x_t]`, 25,638d \| binary linear softmax \| correct vs reversed order \|
	\| `misalignment_detection` \| motion plus visual/audio pair, 7,511d \| binary linear softmax \| aligned vs shifted by 8 windows \|

	Diagram:

	```text
	docs/assets/task_architectures.png
	```

	## Neural Baseline

	The suite can also run a lightweight PyTorch MLP baseline for every selected
	task while preserving the NumPy baseline artifacts:

	```bash
	python scripts/episode_task_suite.py \
	--output-dir results/episode_task_suite \
	--include-neural
	```

	This requires `torch`; use `requirements-omni.txt` when the base environment
	does not already include PyTorch.

	The neural path reuses the same windows, features, chronological split, leakage
	filters, and metrics as the minimal heads. It writes parallel artifacts under:

	```text
	results/episode_task_suite/neural_mlp/<task>/
	```

	Each neural task directory contains `metrics.json`, `history.json`, a
	`model.pt` checkpoint, and the same prediction artifact shape used by the
	corresponding minimal task (`predictions.csv` or `predictions.npz`). The suite
	rollup adds a `neural_tasks` section to `summary_report.json`; visualization
	generation adds neural-only and minimal-vs-neural score charts when those
	metrics are present.

	Useful knobs:

	```bash
	python scripts/episode_task_suite.py \
	--include-neural \
	--neural-epochs 80 \
	--neural-hidden-dim 128 \
	--neural-batch-size 128 \
	--neural-device auto
	```

	This neural baseline is intentionally small. It tests whether a nonlinear head
	over the current handcrafted feature vector improves per-task behavior before
	moving to heavier sequence or vision-language models.

	## Qwen/Omni Neural Track

	The Qwen3-Omni scripts remain a separate neural/VLM track under
	`scripts/omni/`. They are better suited for action/subtask adapter checks, sensor-adapter
	experiments, and LoRA fine-tuning than for the full 12-task matrix. A useful
	comparison order is:

	- current NumPy task suite
	- lightweight `neural_mlp` task suite
	- adapter-only smoke tests from `scripts/omni/qwen3_omni_adapter_smoke.py`
	- Qwen3-Omni zero-shot or LoRA runs where GPU/model access is available

	## Current Results

	```text
	timeline_action:
	accuracy: 0.0292
	macro_f1: 0.0500
	note: future test region contains unseen action classes

	timeline_subtask:
	accuracy: 0.0581
	macro_f1: 0.0506
	note: future test region contains unseen subtask classes

	transition_detection:
	accuracy: 0.9080
	macro_f1: 0.6118
	boundary_f1: 0.1250

	next_action:
	accuracy: 0.0345
	macro_f1: 0.0593
	note: same unseen-future-class problem as timeline_action

	hand_trajectory_forecast:
	MPJPE: 0.8647
	final-frame MPJPE: 1.0331

	contact_prediction:
	accuracy: 1.0000
	note: degenerate on this sample because the binary contact label has only one class

	object_relevance:
	micro_f1: 0.1803
	macro_f1: 0.0633

	caption_grounding:
	top1: 0.0029
	top5: 0.0115
	MRR: 0.0160

	cross_modal_retrieval:
	top1: 0.1638
	top5: 0.3678
	top10: 0.4713
	MRR: 0.2693

	modality_reconstruction:
	R2: -0.0153

	temporal_order:
	accuracy: 0.4540
	f1: 0.5400

	misalignment_detection:
	accuracy: 0.5159
	f1: 0.5052
	```

	## How To Read These Results

	Low scores are useful here. They show which tasks are not learnable from this one chronological sample with this minimal model.

	The strongest signal is `cross_modal_retrieval`: motion/IMU/camera/audio features can retrieve the matching depth/video window better than random. That means the modalities are synchronized and contain shared temporal structure.

	The weakest supervised timeline tasks are weak mainly because of the split. The last 30% of a single ordered episode contains actions/subtasks not present in the first 70%, so a classifier trained on the first part cannot predict labels it never saw.

	For serious research, keep the same task code but change the dataset unit:

	```text
	many episodes -> train episodes -> test unseen episodes
	```

	For single-episode learning, these tasks are best used as:

	- data pipeline tests
	- modality ablations
	- label-alignment checks
	- self-supervised retrieval experiments
	- debugging templates before scaling to many episodes