| # Episode Task Suite | |
| Script: | |
| ```text | |
| scripts/episode_task_suite.py | |
| ``` | |
| This script turns the single public Xperience-10M sample episode into many end-to-end tasks. It is designed for learning, debugging, and task design. It is **not** a generalization benchmark because the data is still one episode. | |
| Run: | |
| ```bash | |
| cd /path/to/Ropedia | |
| source .venv/bin/activate | |
| python scripts/episode_task_suite.py | |
| ``` | |
| Output: | |
| ```text | |
| outputs/episode_task_suite/ | |
| ``` | |
| Shared setup: | |
| ```text | |
| sample episode: 5821 frames | |
| windows: 1161 | |
| window size: 20 frames | |
| stride: 5 frames | |
| feature dim: 8546 | |
| split: chronological, first 70% train and last 30% test | |
| ``` | |
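The window count follows from the frame count, window size, and stride. A minimal sketch of that arithmetic, assuming only full 20-frame windows are kept and the split point is the floor of 70% of the window count (the script's exact rounding is not stated here; `count_windows` and `chronological_split` are illustrative names, not functions from the script):

```python
def count_windows(n_frames: int, size: int, stride: int) -> int:
    """Number of full sliding windows that fit in n_frames."""
    if n_frames < size:
        return 0
    return (n_frames - size) // stride + 1


def chronological_split(n_windows: int, train_frac: float = 0.7):
    """Return (train_indices, test_indices) in time order, no shuffling."""
    cut = int(n_windows * train_frac)  # assumed rounding: floor
    return range(0, cut), range(cut, n_windows)


n_windows = count_windows(5821, size=20, stride=5)
train_idx, test_idx = chronological_split(n_windows)
print(n_windows, len(train_idx), len(test_idx))
```

With a stride of 5 and a window of 20 frames, neighbouring windows share 15 frames, so the last training window and the first test window overlap in raw frames even though the split is chronological.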
| ## Implemented Tasks | |
| | Task | Input | Output | Main artifact | | |
| | --- | --- | --- | --- | | |
| | `timeline_action` | all-modality window | current action label | `timeline_action/metrics.json` | |
| | `timeline_subtask` | all-modality window | current subtask label | `timeline_subtask/metrics.json` | |
| | `transition_detection` | all-modality window | steady vs action boundary | `transition_detection/metrics.json` | |
| | `next_action` | current all-modality window | action 20 frames later | `next_action/metrics.json` | |
| | `hand_trajectory_forecast` | current all-modality window | future 10-frame left/right hand joints | `hand_trajectory_forecast/predictions.npz` | |
| | `contact_prediction` | non-contact modalities | any body contact in window | `contact_prediction/metrics.json` | | |
| | `object_relevance` | non-caption modalities | relevant object set | `object_relevance/predictions.csv` | | |
| | `caption_grounding` | caption objects/interaction query + sensor candidates | matching time window | `caption_grounding/metrics.json` | | |
| | `cross_modal_retrieval` | motion/IMU/camera/audio query | matching depth/video window | `cross_modal_retrieval/metrics.json` | | |
| | `modality_reconstruction` | motion/IMU/camera/audio | depth/video feature vector | `modality_reconstruction/predictions.npz` | | |
| | `temporal_order` | two adjacent windows | whether order is correct | `temporal_order/metrics.json` | | |
| | `misalignment_detection` | motion+visual/audio pair | aligned vs shifted | `misalignment_detection/metrics.json` | | |
| ## Minimal Model Architectures | |
| All tasks share the same window builder unless a task explicitly removes a | |
| feature block to avoid label leakage. | |
| ```text | |
| raw sample episode | |
| -> 20-frame sliding windows, stride 5 | |
| -> all-modality feature vector X_all, 8,546 dimensions | |
| -> chronological split, first 70% train and last 30% test | |
| -> train-only z-score scaler | |
| -> task-specific minimal head | |
| ``` | |
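The "train-only z-score scaler" step means the mean and standard deviation come from the training windows alone and are then applied unchanged to the test windows. A minimal NumPy sketch, using a small random stand-in for the feature matrix (the function names and the handling of constant features are assumptions, not taken from the script):

```python
import numpy as np


def fit_zscore(X_train: np.ndarray, eps: float = 1e-8):
    """Estimate per-feature mean/std from training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Constant features would divide by zero; leave them unscaled instead.
    std = np.where(std < eps, 1.0, std)
    return mean, std


def apply_zscore(X: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    return (X - mean) / std


rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 16))  # stand-in for X_all
cut = 70  # chronological cut: first 70 rows train, last 30 test

mean, std = fit_zscore(X[:cut])
X_train = apply_zscore(X[:cut], mean, std)
X_test = apply_zscore(X[cut:], mean, std)  # test rows never influence mean/std
```

Fitting the scaler on all rows would leak test-region statistics into training, which matters here because the later part of the episode differs from the earlier part.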
| The task suite intentionally uses simple heads: | |
| | Family | Formula | Tasks | | |
| | --- | --- | --- | | |
| | Linear softmax | `softmax(z(X)W + b)`, cross-entropy, L2 | `timeline_action`, `timeline_subtask`, `transition_detection`, `next_action`, `contact_prediction`, `temporal_order`, `misalignment_detection` | | |
| | Ridge regression/projection | dual ridge regression with L2=10 on z-scored X/Y | `hand_trajectory_forecast`, `caption_grounding`, `cross_modal_retrieval`, `modality_reconstruction` | | |
| | Multi-label logistic | `sigmoid(z(X)W + b)`, weighted object heads | `object_relevance` | | |
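For the ridge family, "dual" ridge regression solves an n-by-n system over training windows instead of a d-by-d system over features, which is the cheaper direction when there are roughly 800 training windows and 8,546 features. The sketch below is one standard way to write it, assuming z-scored inputs and targets so that no intercept is needed; the function names are illustrative and the script may differ in detail:

```python
import numpy as np


def fit_dual_ridge(X: np.ndarray, Y: np.ndarray, l2: float = 10.0) -> np.ndarray:
    """Solve (X X^T + l2 * I) alpha = Y for the dual coefficients alpha."""
    K = X @ X.T  # (n, n) linear kernel between training rows
    return np.linalg.solve(K + l2 * np.eye(X.shape[0]), Y)


def predict_dual_ridge(X_train: np.ndarray, alpha: np.ndarray, X_new: np.ndarray) -> np.ndarray:
    """Predict via kernel values between new rows and training rows."""
    return (X_new @ X_train.T) @ alpha


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))  # few rows, many features (toy sizes)
Y = rng.normal(size=(30, 5))

alpha = fit_dual_ridge(X, Y, l2=10.0)
pred_dual = predict_dual_ridge(X, alpha, X[:4])

# Primal form for comparison: W = (X^T X + l2 * I)^-1 X^T Y
W = np.linalg.solve(X.T @ X + 10.0 * np.eye(X.shape[1]), X.T @ Y)
pred_primal = X[:4] @ W
```

Both forms give the same predictions; only the size of the linear system changes.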
| Task-specific architecture details: | |
| | Task | Input tensor/vector | Minimal head | Output target | | |
| | --- | --- | --- | --- | | |
| | `timeline_action` | `X_all`, 8,546d | class-weighted linear softmax | current action label | | |
| | `timeline_subtask` | `X_all`, 8,546d | class-weighted linear softmax | current subtask label | | |
| | `transition_detection` | `X_all`, 8,546d | class-weighted linear softmax | steady vs transition near action boundary | | |
| | `next_action` | `X_all(t)`, 8,546d | class-weighted linear softmax | action at `t+20` frames | | |
| | `hand_trajectory_forecast` | `X_all(t)`, 8,546d | ridge regression | future 10 frames of left/right hand joints, 1,260d | | |
| | `contact_prediction` | all features except `body_contacts` and caption text, 7,503d | linear softmax on observed labels | any body contact in window | | |
| | `object_relevance` | all features except caption text, 7,650d | multi-label logistic regression | 34-object multi-hot vector | | |
| | `caption_grounding` | sensor features, 7,650d, projected into 896d text space | ridge projection plus cosine ranking | matching time window for a text query | | |
| | `cross_modal_retrieval` | motion/IMU/camera/audio, 2,415d, projected into 5,096d visual space | ridge projection plus cosine ranking | matching depth/video window | | |
| | `modality_reconstruction` | motion/IMU/camera/audio, 2,415d | ridge regression | depth/video feature vector, 5,096d | | |
| | `temporal_order` | `[x_t, x_{t+1}, x_{t+1} - x_t]`, 25,638d | binary linear softmax | correct vs reversed order | |
| | `misalignment_detection` | motion plus visual/audio pair, 7,511d | binary linear softmax | aligned vs shifted by 8 windows | | |
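As an illustration of how a pair input such as the `temporal_order` vector can be assembled, the sketch below concatenates two adjacent window vectors and their difference, which triples the feature dimension (3 x 8,546 = 25,638). How the script builds negative examples is not stated; swapping the two windows of each pair is an assumption made here, and `temporal_order_pairs` is a hypothetical helper:

```python
import numpy as np


def temporal_order_pairs(X: np.ndarray):
    """Build [first, second, second - first] features for adjacent windows.

    Positives keep the true order; negatives swap the two windows (assumed).
    """
    a, b = X[:-1], X[1:]
    pos = np.concatenate([a, b, b - a], axis=1)  # correct order
    neg = np.concatenate([b, a, a - b], axis=1)  # reversed order
    feats = np.concatenate([pos, neg], axis=0)
    labels = np.concatenate([
        np.ones(len(pos), dtype=np.int64),
        np.zeros(len(neg), dtype=np.int64),
    ])
    return feats, labels


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 4))  # 6 windows, 4 features (toy sizes)
feats, labels = temporal_order_pairs(X_demo)
print(feats.shape, labels.tolist())
```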
| Diagram: | |
| ```text | |
| docs/assets/task_architectures.png | |
| ``` | |
| ## Neural Baseline | |
| The suite can also run a lightweight PyTorch MLP baseline for every selected | |
| task while preserving the NumPy baseline artifacts: | |
| ```bash | |
| python scripts/episode_task_suite.py \ | |
| --output-dir results/episode_task_suite \ | |
| --include-neural | |
| ``` | |
| This requires `torch`; use `requirements-omni.txt` when the base environment | |
| does not already include PyTorch. | |
| The neural path reuses the same windows, features, chronological split, leakage | |
| filters, and metrics as the minimal heads. It writes parallel artifacts under: | |
| ```text | |
| results/episode_task_suite/neural_mlp/<task>/ | |
| ``` | |
| Each neural task directory contains `metrics.json`, `history.json`, a | |
| `model.pt` checkpoint, and the same prediction artifact shape used by the | |
| corresponding minimal task (`predictions.csv` or `predictions.npz`). The suite | |
| rollup adds a `neural_tasks` section to `summary_report.json`; visualization | |
| generation adds neural-only and minimal-vs-neural score charts when those | |
| metrics are present. | |
| Useful knobs: | |
| ```bash | |
| python scripts/episode_task_suite.py \ | |
| --include-neural \ | |
| --neural-epochs 80 \ | |
| --neural-hidden-dim 128 \ | |
| --neural-batch-size 128 \ | |
| --neural-device auto | |
| ``` | |
| This neural baseline is intentionally small. It tests whether a nonlinear head | |
| over the current handcrafted feature vector improves per-task behavior before | |
| moving to heavier sequence or vision-language models. | |
| ## Qwen/Omni Neural Track | |
| The Qwen3-Omni scripts remain a separate neural/VLM track under | |
| `scripts/omni/`. They are better suited for action/subtask adapter checks, sensor-adapter | |
| experiments, and LoRA fine-tuning than for the full 12-task matrix. A useful | |
| comparison order is: | |
| - current NumPy task suite | |
| - lightweight `neural_mlp` task suite | |
| - adapter-only setup checks from `scripts/omni/qwen3_omni_adapter_smoke.py` | |
| - Qwen3-Omni zero-shot or LoRA runs where GPU/model access is available | |
| ## Current Results | |
| ```text | |
| timeline_action: | |
| accuracy: 0.0292 | |
| macro_f1: 0.0500 | |
| note: future test region contains unseen action classes | |
| timeline_subtask: | |
| accuracy: 0.0581 | |
| macro_f1: 0.0506 | |
| note: future test region contains unseen subtask classes | |
| transition_detection: | |
| accuracy: 0.9080 | |
| macro_f1: 0.6118 | |
| boundary_f1: 0.1250 | |
| next_action: | |
| accuracy: 0.0345 | |
| macro_f1: 0.0593 | |
| note: same unseen-future-class problem as timeline_action | |
| hand_trajectory_forecast: | |
| MPJPE: 0.8647 | |
| final-frame MPJPE: 1.0331 | |
| contact_prediction: | |
| accuracy: 1.0000 | |
| note: degenerate on this sample because the binary contact label has only one class | |
| object_relevance: | |
| micro_f1: 0.1803 | |
| macro_f1: 0.0633 | |
| caption_grounding: | |
| top1: 0.0029 | |
| top5: 0.0115 | |
| MRR: 0.0160 | |
| cross_modal_retrieval: | |
| top1: 0.1638 | |
| top5: 0.3678 | |
| top10: 0.4713 | |
| MRR: 0.2693 | |
| modality_reconstruction: | |
| R2: -0.0153 | |
| temporal_order: | |
| accuracy: 0.4540 | |
| f1: 0.5400 | |
| misalignment_detection: | |
| accuracy: 0.5159 | |
| f1: 0.5052 | |
| ``` | |
| ## How To Read These Results | |
| Low scores are useful here. They show which tasks are not learnable from this one chronological sample with this minimal model. | |
| The strongest signal is `cross_modal_retrieval`: motion/IMU/camera/audio features retrieve the matching depth/video window better than random (top-1 of 0.1638, top-10 of 0.4713). This suggests that the modalities are synchronized and share temporal structure. | |
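To make "better than random" concrete, the sketch below computes top-k and MRR from a query-by-candidate cosine similarity matrix and compares them with chance levels. It assumes query `i` matches candidate `i` and that the candidate pool is about 348 windows (roughly 30% of 1,161); the pool size is an assumption, since this page does not state it, and the function names are illustrative:

```python
import numpy as np


def cosine_similarity_matrix(Q: np.ndarray, C: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + eps)
    Cn = C / (np.linalg.norm(C, axis=1, keepdims=True) + eps)
    return Qn @ Cn.T


def retrieval_metrics(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    """sim[i, j] scores query i against candidate j; candidate i is the match."""
    n = sim.shape[0]
    true_scores = sim[np.arange(n), np.arange(n)][:, None]
    ranks = 1 + (sim > true_scores).sum(axis=1)  # rank 1 is best; ties count in favour
    out = {f"top{k}": float((ranks <= k).mean()) for k in ks}
    out["mrr"] = float((1.0 / ranks).mean())
    return out


def chance_level(n_candidates: int, ks=(1, 5, 10)) -> dict:
    """Expected metrics when the true match has a uniformly random rank."""
    out = {f"top{k}": min(k, n_candidates) / n_candidates for k in ks}
    out["mrr"] = float((1.0 / np.arange(1, n_candidates + 1)).mean())
    return out


rng = np.random.default_rng(0)
C = rng.normal(size=(50, 32))                # toy candidate vectors
Q = C + 0.01 * rng.normal(size=C.shape)      # queries: lightly perturbed matches
metrics = retrieval_metrics(cosine_similarity_matrix(Q, C))
chance = chance_level(348)                   # assumed pool size
print(metrics, chance)
```

Under that assumed pool size, chance top-1 is about 0.003 and chance MRR about 0.018. The reported `cross_modal_retrieval` scores sit far above those values, while the reported `caption_grounding` scores (top1 0.0029, MRR 0.0160) are close to them.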
| The weakest supervised timeline tasks are weak mainly because of the split. The last 30% of a single ordered episode contains actions/subtasks not present in the first 70%, so a classifier trained on the first part cannot predict labels it never saw. | |
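One way to check this effect on any chronological split is to measure how many test windows carry a label that never appears in training; that fraction caps the accuracy of any classifier trained on the first part. A small sketch with made-up labels (the label names and `unseen_label_report` are illustrative, not from the dataset or the script):

```python
import numpy as np


def unseen_label_report(y_train: np.ndarray, y_test: np.ndarray) -> dict:
    """Report test labels absent from training and the resulting accuracy ceiling."""
    seen = set(y_train.tolist())
    unseen_mask = np.array([y not in seen for y in y_test.tolist()])
    unseen_fraction = float(unseen_mask.mean())
    return {
        "unseen_classes": sorted(set(y_test.tolist()) - seen),
        "unseen_fraction": unseen_fraction,
        "accuracy_ceiling": 1.0 - unseen_fraction,
    }


# Made-up, time-ordered window labels for illustration only.
labels = np.array(["pick"] * 4 + ["pour"] * 2 + ["pick"] * 2 + ["stir"] * 2)
cut = int(len(labels) * 0.7)  # chronological 70/30 split
report = unseen_label_report(labels[:cut], labels[cut:])
print(report)
```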
| For serious research, keep the same task code but change the dataset unit: | |
| ```text | |
| many episodes -> train episodes -> test unseen episodes | |
| ``` | |
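A sketch of what an episode-level split could look like once more episodes are available: whole episodes are assigned to train or test, so no window from a test episode is seen during training. The episode IDs and the helper name are hypothetical, and the 30% test fraction simply mirrors the current chronological split:

```python
import random


def split_by_episode(episode_ids, test_frac: float = 0.3, seed: int = 0):
    """Assign whole episodes to train or test; returns two disjoint sets of IDs."""
    ids = sorted(set(episode_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_frac))
    return set(ids[n_test:]), set(ids[:n_test])


# Hypothetical per-window episode IDs: 10 episodes, 3 windows each.
window_episode = [f"ep_{i:02d}" for i in range(10) for _ in range(3)]
train_eps, test_eps = split_by_episode(window_episode)

train_rows = [i for i, ep in enumerate(window_episode) if ep in train_eps]
test_rows = [i for i, ep in enumerate(window_episode) if ep in test_eps]
print(len(train_eps), len(test_eps), len(train_rows), len(test_rows))
```

With a single episode this helper would leave the training set empty, which is the same limitation the chronological split works around.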
| For single-episode learning, these tasks are best used as: | |
| - data pipeline tests | |
| - modality ablations | |
| - label-alignment checks | |
| - self-supervised retrieval experiments | |
| - debugging templates before scaling to many episodes | |