| # Junior-Friendly 12-Task Walkthroughs | |
| This file explains every task in the Xperience-10M episode suite as an input -> process -> output pipeline. | |
| It is generated by `scripts/task_walkthroughs.py` from committed metrics plus hand-curated task explanations. | |
| ## Shared Pipeline | |
| - Read `annotation.hdf5` and synchronized video-derived features. | |
| - Slice the episode into 20-frame windows with stride 5. | |
| - Build an 8,546-dimensional aligned feature vector from the synchronized modality groups. | |
| - Construct a task-specific target from labels, future frames, paired windows, or modality splits. | |
| - Train a minimal head and, when enabled, a neural MLP head. | |
| - Write metrics, predictions, and model artifacts for downstream exploration. | |
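The windowing step above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the function name `window_starts`, the 100-frame episode length, and the choice to drop a trailing partial window are all assumptions; only the window length of 20 and the stride of 5 come from the text.

```python
def window_starts(num_frames, window=20, stride=5):
    """Start index of every full window; a trailing partial window is dropped (assumption)."""
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, stride))


starts = window_starts(100)              # hypothetical 100-frame episode
spans = [(s, s + 20) for s in starts]    # half-open [start, end) frame ranges
print(len(starts), spans[0], spans[-1])  # 17 (0, 20) (80, 100)
```

With a stride of 5 and a length of 20, consecutive windows share 15 frames, which is why neighbouring examples in every task below are highly correlated.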
| ## Task Walkthroughs | |
| ### Action Recognition (`timeline_action`) | |
| **Research name:** Egocentric Action Recognition | |
| **Family:** supervised; multiclass classifier; C. Egocentric Vision & Interaction. | |
| **Goal:** Look at one short multimodal window and name what action is happening now. | |
| **Case study:** In the coffee-making sample, if the 20-frame window is during a pouring moment, the task asks the model to output an action such as "Pour coffee" or "Pour milk into coffee". | |
| **Input:** One 20-frame window represented by the current feature vector: video/audio/depth summaries, pose, SLAM/camera pose, motion capture, IMU, calibration, and language-derived context. | |
| **Middle process modules:** | |
| - Window builder slices the episode into short overlapping windows. | |
| - Feature assembler concatenates all current feature blocks. | |
| - Label builder reads the action annotation for the center of the window. | |
| - Classifier head maps the window vector to one action class. | |
| - Evaluator compares predicted action labels against the true labels of the held-out chronological segment. | |
| **Output:** A single action class for the current window. | |
| **Metric:** macro-F1 (higher is better). Minimal `0.0500`, neural MLP `0.0148`. | |
| **Junior mental model:** This is like asking: given this tiny movie clip plus sensor readings, what is the person doing right now? | |
| **Current limitation:** The one-episode chronological split contains future action classes that were not present in training, so low test macro-F1 is expected. | |
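To see why unseen test classes pull macro-F1 down, here is a small pure-Python sketch of the metric. It assumes the average runs over every class that appears in either the true or the predicted labels; how the committed metrics treat absent classes is not stated in this file. The labels are made up.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn))  # denominator >= 1 for listed classes
    return sum(scores) / len(scores)


# Hypothetical test split: "stir" never occurred in training, so it is never predicted.
y_true = ["pour", "pour", "stir", "stir"]
y_pred = ["pour", "pour", "pour", "grasp"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.2667
```

Half of the windows are classified correctly, yet the score is about 0.27, because the never-predicted class and the wrongly predicted class each contribute an F1 of zero to the average.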
| ### Procedure Step Recognition (`timeline_subtask`) | |
| **Research name:** Temporal Subtask Recognition | |
| **Family:** supervised; multiclass classifier; C. Egocentric Vision & Interaction. | |
| **Goal:** Predict the higher-level task stage for the current window. | |
| **Case study:** A pouring action may belong to a broader subtask such as preparing or pouring a drink. The model predicts that broader stage instead of a fine action. | |
| **Input:** The same all-modality window vector used by action recognition. | |
| **Middle process modules:** | |
| - Window builder creates the current temporal slice. | |
| - Feature assembler keeps all available modality blocks. | |
| - Subtask label builder maps the current timestamp to a subtask annotation. | |
| - Classifier head predicts the subtask class. | |
| - Evaluator reports class-balanced scores so rare subtasks matter. | |
| **Output:** A single subtask label for the current window. | |
| **Metric:** macro-F1 (higher is better). Minimal `0.0506`, neural MLP `0.0281`. | |
| **Junior mental model:** Action is the verb; subtask is the chapter of the activity. | |
| **Current limitation:** Single-episode ordering means some later subtasks appear only in test, so this is a pipeline check rather than a general benchmark. | |
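The subtask label builder can be pictured as an interval lookup. The sketch below is illustrative only: the segment list, the names `subtask_at` and `window_subtask`, and the use of the window's center frame as the "current timestamp" are assumptions, not details confirmed by the source.

```python
from bisect import bisect_right

# Hypothetical subtask annotation: (first_frame, label), sorted by first_frame.
SUBTASKS = [(0, "prepare cup"), (40, "pour drink"), (90, "clean up")]


def subtask_at(frame, segments=SUBTASKS):
    """Label of the segment covering `frame`, or None before the first segment."""
    firsts = [first for first, _ in segments]
    idx = bisect_right(firsts, frame) - 1
    return segments[idx][1] if idx >= 0 else None


def window_subtask(start, window=20):
    # Assumption: the window's "current timestamp" is its center frame.
    return subtask_at(start + window // 2)


print(window_subtask(25), "|", window_subtask(30))  # prepare cup | pour drink
```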
| ### Action Boundary Detection (`transition_detection`) | |
| **Research name:** Temporal Action Segmentation | |
| **Family:** diagnostic; binary classifier; C. Egocentric Vision & Interaction. | |
| **Goal:** Detect whether the current window is near a boundary between actions. | |
| **Case study:** When the demonstrator changes from preparing to pouring, the model should flag a boundary instead of a steady action window. | |
| **Input:** One all-modality window vector plus labels derived from action-change timestamps. | |
| **Middle process modules:** | |
| - Boundary builder scans action labels over time and marks windows near a change. | |
| - Feature assembler supplies all current modality features. | |
| - Binary classifier predicts steady vs boundary. | |
| - Boundary matcher checks whether predicted boundary times are close to true boundary times. | |
| - Evaluator reports macro-F1 and timing error, not just accuracy. | |
| **Output:** A binary label: boundary or steady. | |
| **Metric:** macro-F1 (higher is better). Minimal `0.6118`, neural MLP `0.5862`. | |
| **Junior mental model:** This is the model's way of saying: something just changed here. | |
| **Current limitation:** Boundaries are rare, so high accuracy can be misleading if the model predicts steady too often. | |
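A rough sketch of the boundary builder makes the class imbalance concrete. The `tolerance` of 5 frames, the center-frame convention, and the toy two-action episode are assumptions chosen for illustration.

```python
def boundary_labels(frame_actions, starts, window=20, tolerance=5):
    """1 if an action change lies within `tolerance` frames of the window center, else 0."""
    changes = [i for i in range(1, len(frame_actions))
               if frame_actions[i] != frame_actions[i - 1]]
    labels = []
    for s in starts:
        center = s + window // 2
        labels.append(int(any(abs(center - c) <= tolerance for c in changes)))
    return labels


# Hypothetical episode: 50 frames of "prepare", then 50 frames of "pour".
actions = ["prepare"] * 50 + ["pour"] * 50
labels = boundary_labels(actions, list(range(0, 81, 5)))
always_steady_accuracy = labels.count(0) / len(labels)
print(sum(labels), len(labels), round(always_steady_accuracy, 4))  # 3 17 0.8235
```

Only 3 of 17 windows are boundaries here, so a model that always answers "steady" already reaches about 82% accuracy while detecting nothing. That is the reason the suite reports macro-F1 for this task.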
| ### Next-Action Prediction (`next_action`) | |
| **Research name:** Short-Horizon Intention Prediction | |
| **Family:** supervised; future-label classifier; C. Egocentric Vision & Interaction. | |
| **Goal:** Use the current window to guess the action that will happen shortly after it. | |
| **Case study:** If a window shows the person preparing to pour, the target can be the action 20 frames later, such as the start of pouring. | |
| **Input:** The current all-modality window vector at time t. | |
| **Middle process modules:** | |
| - Window builder picks a current time window. | |
| - Future label builder shifts the action target by 20 frames. | |
| - Feature assembler uses only current information, not future features. | |
| - Classifier head predicts the future action class. | |
| - Evaluator checks whether the future action label is correct. | |
| **Output:** A single action class for t+20 frames. | |
| **Metric:** macro-F1 (higher is better). Minimal `0.0593`, neural MLP `0.0419`. | |
| **Junior mental model:** This is short-horizon intention prediction: what will the person do next? | |
| **Current limitation:** The public sample has unseen future classes in the chronological test split, which makes this very hard with one episode. | |
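The future label builder amounts to a lookup shifted by the horizon. In this sketch, time t is taken to be the window's center frame and windows whose target would fall past the end of the episode are dropped; both choices are assumptions, as is the toy episode.

```python
def next_action_targets(frame_actions, starts, window=20, horizon=20):
    """Pair each window start with the action label `horizon` frames after its center."""
    pairs = []
    for s in starts:
        target_frame = s + window // 2 + horizon
        if target_frame < len(frame_actions):  # drop windows with no future label
            pairs.append((s, frame_actions[target_frame]))
    return pairs


# Hypothetical episode: 50 frames of "prepare", then 50 frames of "pour".
actions = ["prepare"] * 50 + ["pour"] * 50
pairs = next_action_targets(actions, list(range(0, 81, 5)))
print(len(pairs), pairs[3], pairs[4])  # 14 (15, 'prepare') (20, 'pour')
```

The window starting at frame 20 covers only "prepare" frames, yet its target is already "pour". The features stay in the present while the label comes from the future, which is what separates this task from plain action recognition.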
| ### Hand Trajectory Forecasting (`hand_trajectory_forecast`) | |
| **Research name:** 3D Hand Motion Forecasting | |
| **Family:** forecast; continuous regressor; A. Human Modeling & Motion Understanding. | |
| **Goal:** Predict where the hands will move over the next few frames. | |
| **Case study:** When the hand is moving toward a cup or bottle, the model predicts the future 3D hand-joint path. | |
| **Input:** The current all-modality window vector at time t. | |
| **Middle process modules:** | |
| - Window builder chooses the current sensor window. | |
| - Target builder extracts future left/right hand 3D joints from motion capture. | |
| - Regression head predicts a continuous trajectory, not a class label. | |
| - Output reshaper interprets the vector as future frames and joints. | |
| - Evaluator computes MPJPE, the average 3D joint-position error. | |
| **Output:** A future trajectory vector for left and right hand joints. | |
| **Metric:** MPJPE (lower is better). Minimal `0.8647`, neural MLP `0.1079`. | |
| **Junior mental model:** Instead of naming an action, this task draws the next hand path in 3D. | |
| **Current limitation:** It is still a window-level forecast, not a full policy or long-horizon motion generator. | |
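The output reshaper and the MPJPE evaluator can be sketched as follows. The flat layout (frame-major, then joint, then x/y/z) and the tiny 2-frame, 2-joint example are assumptions; the source does not state the layout or the unit of the reported error.

```python
import math


def reshape(flat, num_frames, num_joints):
    """Interpret a flat vector as frames x joints x (x, y, z). Layout is an assumption."""
    assert len(flat) == num_frames * num_joints * 3
    return [[tuple(flat[(f * num_joints + j) * 3:(f * num_joints + j) * 3 + 3])
             for j in range(num_joints)]
            for f in range(num_frames)]


def mpjpe(pred, target):
    """Mean Euclidean distance between predicted and true 3D joint positions."""
    dists = [math.dist(p, t)
             for pred_frame, true_frame in zip(pred, target)
             for p, t in zip(pred_frame, true_frame)]
    return sum(dists) / len(dists)


target = reshape([0.0] * 12, num_frames=2, num_joints=2)
pred = reshape([3.0, 4.0, 0.0] + [0.0] * 9, num_frames=2, num_joints=2)
print(mpjpe(pred, target))  # 1.25
```

One joint is off by a distance of 5 and the other three are exact, so the mean over four joints is 1.25.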
| ### Contact State Prediction (`contact_prediction`) | |
| **Research name:** Human-Object Contact Prediction | |
| **Family:** supervised; binary classifier; A. Human Modeling & Motion Understanding. | |
| **Goal:** Predict whether the body or hand is in contact with something. | |
| **Case study:** During manipulation, the hand may touch a cup, table, or bottle. The task asks whether any contact is happening. | |
| **Input:** Non-contact and non-caption feature blocks, so the answer is not directly leaked from the target labels. | |
| **Middle process modules:** | |
| - Feature selector removes contact-label and caption-label blocks. | |
| - Target builder converts contact annotations into a binary label. | |
| - Binary classifier predicts contact vs no contact. | |
| - Evaluator reports macro-F1 and accuracy. | |
| - Degeneracy checker records whether only one class appears. | |
| **Output:** A binary contact label. | |
| **Metric:** macro-F1 (higher is better). Minimal `1.0000`, neural MLP `1.0000`. | |
| **Junior mental model:** This is a simple physical-interaction probe: is the person touching something now? | |
| **Current limitation:** The current public sample is degenerate for this task because one class dominates, so the perfect score does not mean the model learned contact physics. | |
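A degeneracy check of the kind described above might look like the sketch below. The function name, the report fields, and the 5% minority threshold are hypothetical; the source only says that the checker records whether a single class appears.

```python
from collections import Counter


def degeneracy_report(labels, min_minority_fraction=0.05):
    """Flag a binary target whose minority class is missing or very rare."""
    counts = Counter(labels)
    minority = min(counts.values()) / len(labels) if len(counts) > 1 else 0.0
    return {
        "num_classes": len(counts),
        "minority_fraction": minority,
        "degenerate": len(counts) < 2 or minority < min_minority_fraction,
    }


print(degeneracy_report([1] * 200)["degenerate"])             # True
print(degeneracy_report([1] * 198 + [0] * 2)["degenerate"])   # True
print(degeneracy_report([1] * 120 + [0] * 80)["degenerate"])  # False
```

When every test label is the same, a head that outputs that constant is scored as perfect, so a flag like this is worth reading before the metric itself.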
| ### Object Relevance Prediction (`object_relevance`) | |
| **Research name:** Object-Centric Interaction Recognition | |
| **Family:** supervised; multi-label classifier; C. Egocentric Vision & Interaction. | |
| **Goal:** Predict which objects matter in the current window. | |
| **Case study:** If the person is pouring milk into coffee, relevant objects may include milk, cup, coffee, or container-like items. | |
| **Input:** Non-caption feature blocks, so the model must infer objects from sensors rather than copying the caption words. | |
| **Middle process modules:** | |
| - Object vocabulary builder collects object labels from annotations. | |
| - Feature selector removes caption-derived label blocks. | |
| - Multi-label target builder creates a multi-hot object vector. | |
| - Sigmoid heads predict each object's relevance independently. | |
| - Evaluator reports micro-F1 and exact-match quality. | |
| **Output:** A multi-label object set for the current window. | |
| **Metric:** micro-F1 (higher is better). Minimal `0.1803`, neural MLP `0.1679`. | |
| **Junior mental model:** A window can involve more than one object, so this is not a single-label classifier: each object gets its own yes/no decision. | |
| **Current limitation:** Object labels are sparse and language-derived, so this is currently a weak object-centric probe. | |
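The multi-hot target and the two reported scores can be illustrated with a toy vocabulary. The vocabulary, the example windows, and the assumption that predictions have already been thresholded to 0/1 are all hypothetical.

```python
VOCAB = ["coffee", "cup", "milk", "spoon"]  # hypothetical object vocabulary


def multi_hot(objects, vocab=VOCAB):
    """One 0/1 entry per vocabulary item."""
    return [int(name in objects) for name in vocab]


def micro_f1(y_true, y_pred):
    """F1 computed over all (window, object) decisions pooled together."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def exact_match(y_true, y_pred):
    """Fraction of windows whose whole object set is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [multi_hot({"milk", "cup", "coffee"}), multi_hot({"cup"})]
y_pred = [[1, 1, 0, 0], [0, 1, 0, 1]]  # misses milk; adds a spurious spoon
print(micro_f1(y_true, y_pred), exact_match(y_true, y_pred))  # 0.75 0.0
```

Micro-F1 gives partial credit for getting most objects right, while exact match gives none, which is why the two numbers can differ widely.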
| ### Language Grounding (`caption_grounding`) | |
| **Research name:** Language-to-Moment Grounding | |
| **Family:** retrieval; retrieval ranker; C. Egocentric Vision & Interaction. | |
| **Goal:** Given a text-like query from annotation, find the matching time window. | |
| **Case study:** A query like "Pour milk into coffee" should rank the windows from the actual pouring moment higher than unrelated windows. | |
| **Input:** Caption/object/interaction query features and a set of candidate sensor-window features. | |
| **Middle process modules:** | |
| - Query builder converts annotation words into a compact query representation. | |
| - Candidate builder gathers held-out sensor windows. | |
| - Projection head maps sensor windows into the query space. | |
| - Ranker scores candidates by cosine similarity. | |
| - Evaluator reports MRR and top-k retrieval accuracy. | |
| **Output:** A ranked list of windows, with the correct matching window ideally near rank 1. | |
| **Metric:** MRR (higher is better). Minimal `0.0160`, neural MLP `0.0168`. | |
| **Junior mental model:** This is search: type a description, retrieve the matching moment. | |
| **Current limitation:** Bag-of-objects text features are too simple for rich language grounding. | |
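The ranker and the MRR evaluator can be sketched in plain Python. The 2-dimensional vectors stand in for projected query and window features and are invented for the example; ties in similarity are broken by candidate order, which is also an assumption.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_of_match(query, candidates, match_index):
    """1-based rank of the correct candidate when sorted by cosine similarity."""
    scores = [cosine(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order.index(match_index) + 1


def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)


candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # hypothetical window embeddings
queries = [([0.9, 0.1], 0), ([0.6, 0.5], 1)]         # (query embedding, index of true window)
ranks = [rank_of_match(q, candidates, m) for q, m in queries]
print(ranks, round(mean_reciprocal_rank(ranks), 4))  # [1, 3] 0.6667
```

MRR rewards putting the correct window at rank 1 and decays quickly after that: a hit at rank 3 contributes only one third.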
| ### Cross-Modal Retrieval (`cross_modal_retrieval`) | |
| **Research name:** Multimodal Representation Retrieval | |
| **Family:** retrieval; two-tower retrieval head; D. Scene Reconstruction & World Modeling. | |
| **Goal:** Use one group of modalities to retrieve the matching window from another group. | |
| **Case study:** Use motion, IMU, and camera-pose signals from a pouring moment to retrieve the matching depth/video representation for that same moment. | |
| **Input:** Query side: motion, IMU, and camera/pose features. Candidate side: depth and video features. | |
| **Middle process modules:** | |
| - Feature splitter separates query modalities from target modalities. | |
| - Projection head maps the query vector into target-modality space. | |
| - Candidate index stores target vectors from held-out windows. | |
| - Ranker retrieves nearest candidates by cosine similarity. | |
| - Evaluator reports MRR, top-1, top-5, and top-10 accuracy. | |
| **Output:** A ranked list of candidate depth/video windows. | |
| **Metric:** MRR (higher is better). Minimal `0.2693`, neural MLP `0.1300`. | |
| **Junior mental model:** This checks whether different sensors agree about the same moment in time. | |
| **Current limitation:** Good retrieval means useful alignment signal, but it is not yet 3D reconstruction or rendering. | |
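To read the reported numbers, it helps to have top-k accuracy and a chance-level reference. The sketch below assumes each query has exactly one correct candidate and that a random ranker places it uniformly among N candidates; the ranks and N are made up, since this file does not state the candidate pool size.

```python
def top_k_accuracy(ranks, k):
    """Fraction of queries whose correct candidate is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def chance_mrr(num_candidates):
    """Expected MRR when the correct candidate's rank is uniform on 1..N."""
    return sum(1.0 / r for r in range(1, num_candidates + 1)) / num_candidates


ranks = [1, 3, 7, 12]  # hypothetical 1-based ranks of the true window
print([top_k_accuracy(ranks, k) for k in (1, 5, 10)])  # [0.25, 0.5, 0.75]
print(chance_mrr(4))
```

Comparing a measured MRR against `chance_mrr(N)` for the actual number of held-out candidates indicates how much alignment signal the projection head has really captured.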
| ### Cross-Modal Reconstruction (`modality_reconstruction`) | |
| **Research name:** Modality Feature Reconstruction | |
| **Family:** forecast; feature regressor; B. 3D/4D Reconstruction & Neural Rendering. | |
| **Goal:** Predict one modality feature block from other modality blocks. | |
| **Case study:** Given motion, IMU, and camera-pose signals while the hand moves, predict the matching depth/video feature vector. | |
| **Input:** Motion, IMU, and camera/pose features as input; depth/video features as the regression target. | |
| **Middle process modules:** | |
| - Feature splitter defines source and target modality blocks. | |
| - Scaler normalizes source and target vectors using train statistics. | |
| - Regression head predicts the target feature vector. | |
| - Inverse scaler returns predictions to target scale. | |
| - Evaluator reports MSE, MAE, and R2. | |
| **Output:** A reconstructed depth/video feature vector. | |
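| **Metric:** R2 (higher is better). Minimal `-0.0153`, neural MLP `-0.0102`. Both values are slightly below zero, which means that, on average, neither head does better than predicting the held-out mean of each target feature. | |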
| **Junior mental model:** This is feature-level imagination: can the model infer what another sensor would see? | |
| **Current limitation:** This reconstructs compressed features, not raw pixels, depth maps, meshes, NeRFs, or Gaussian splats. | |
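A short sketch of R2 for a single target column shows how a score can dip below zero. The numbers are invented, and how the suite aggregates R2 across the many target feature columns is not specified here.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination for one target column."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_train = [1.0, 2.0, 3.0]  # hypothetical early part of the episode
y_test = [2.0, 3.0, 4.0]   # later part, with a shifted mean

train_mean = sum(y_train) / len(y_train)
print(r2_score(y_test, [3.0] * 3))         # 0.0  (constant at the test mean)
print(r2_score(y_test, [train_mean] * 3))  # -1.5 (constant at the train mean)
```

A constant prediction at the test mean scores exactly 0, so any drift between the train and test segments of a chronological split can push even a sensible predictor below zero.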
| ### Temporal Order Verification (`temporal_order`) | |
| **Research name:** Temporal Order Verification | |
| **Family:** diagnostic; pairwise classifier; D. Scene Reconstruction & World Modeling. | |
| **Goal:** Tell whether two nearby windows are in the correct time order. | |
| **Case study:** If window A shows reaching and window B shows pouring, the model should distinguish A then B from B then A. | |
| **Input:** A pair of adjacent window vectors, plus their difference vector. | |
| **Middle process modules:** | |
| - Pair builder creates correct-order and reversed-order examples. | |
| - Feature combiner concatenates first window, second window, and their difference. | |
| - Binary classifier predicts correct vs reversed. | |
| - Evaluator reports F1, precision, and recall. | |
| - Diagnostic reader interprets whether features encode local time direction. | |
| **Output:** A binary label: correct order or reversed order. | |
| **Metric:** F1 (higher is better). Minimal `0.5400`, neural MLP `0.8520`. | |
| **Junior mental model:** This asks whether the representation knows which moment came first. | |
| **Current limitation:** It only tests local ordering, not long-term planning or causality. | |
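The pair builder and feature combiner can be sketched together. The label convention (1 for correct order, 0 for reversed), the sign of the difference (second minus first), and the 2-dimensional toy windows are assumptions.

```python
def order_pairs(windows):
    """Build (features, label) examples from adjacent windows.

    Features are first + second + (second - first); label 1 = correct order, 0 = reversed.
    """
    examples = []
    for a, b in zip(windows, windows[1:]):
        for first, second, label in ((a, b, 1), (b, a, 0)):
            diff = [y - x for x, y in zip(first, second)]
            examples.append((first + second + diff, label))
    return examples


windows = [[0.0, 1.0], [0.5, 1.5], [1.0, 2.0]]  # hypothetical window vectors
examples = order_pairs(windows)
print(len(examples), examples[0], examples[1])
# 4 ([0.0, 1.0, 0.5, 1.5, 0.5, 0.5], 1) ([0.5, 1.5, 0.0, 1.0, -0.5, -0.5], 0)
```

Every adjacent pair yields one positive and one negative example, so the classes are balanced by construction and plain F1 is easy to interpret.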
| ### Multimodal Synchronization Detection (`misalignment_detection`) | |
| **Research name:** Cross-Modal Misalignment Detection | |
| **Family:** diagnostic; pairwise classifier; B. 3D/4D Reconstruction & Neural Rendering. | |
| **Goal:** Detect when modalities that should match are shifted out of sync. | |
| **Case study:** Motion from a pouring moment is paired with video/depth from several windows later. The task asks the model to detect that mismatch. | |
| **Input:** A motion-side feature group and a visual/depth-side feature group, either aligned or artificially shifted. | |
| **Middle process modules:** | |
| - Alignment builder creates positive pairs from the same time window. | |
| - Shift builder creates negative pairs by offsetting one modality group. | |
| - Feature combiner joins both sides into one example. | |
| - Binary classifier predicts aligned vs misaligned. | |
| - Evaluator reports F1 and accuracy. | |
| **Output:** A binary label: aligned or shifted. | |
| **Metric:** F1 (higher is better). Minimal `0.5052`, neural MLP `0.7153`. | |
| **Junior mental model:** This is a synchronization alarm for multimodal data. | |
| **Current limitation:** Synthetic shifts are useful diagnostics but do not solve calibration, reconstruction, or mapping by themselves. | |
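The alignment and shift builders can be sketched as one function. The shift of 3 windows, the label convention (1 for aligned, 0 for shifted), and the one-dimensional toy features are assumptions made for illustration.

```python
def alignment_pairs(motion, visual, shift=3):
    """Positives pair motion[i] with visual[i]; negatives pair it with visual[i + shift]."""
    examples = []
    for i in range(len(motion)):
        examples.append((motion[i] + visual[i], 1))      # same window: aligned
        j = i + shift
        if j < len(visual):
            examples.append((motion[i] + visual[j], 0))  # later window: shifted
    return examples


motion = [[float(i)] for i in range(5)]       # hypothetical motion-side features
visual = [[10.0 * i] for i in range(5)]       # hypothetical visual/depth-side features
examples = alignment_pairs(motion, visual)
print(len(examples), examples[0], examples[1])  # 7 ([0.0, 0.0], 1) ([0.0, 30.0], 0)
```

Because the windows are 20 frames long with stride 5, a shift of 3 windows still leaves 5 frames of overlap between the two sides, so small shifts produce harder negatives than large ones.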