
Junior-Friendly 12-Task Walkthroughs

This file explains every task in the Xperience-10M episode suite as an input -> process -> output pipeline. It is generated by scripts/task_walkthroughs.py from committed metrics plus hand-audited task explanations.

Shared Pipeline

  • Read annotation.hdf5 and synchronized video-derived features.
  • Slice the episode into 20-frame windows with stride 5.
  • Build an 8,378-d current feature vector from available modality blocks.
  • Construct a task-specific target from labels, future frames, paired windows, or modality splits.
  • Train a minimal head and, when enabled, a neural MLP head.
  • Write metrics, predictions, and model artifacts for review.
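As a rough illustration of the slicing step, the sketch below lists the start frame of every 20-frame window at stride 5. The function name `window_starts` is hypothetical, and dropping a trailing partial window is an assumption; the committed pipeline may pad or handle the tail differently.

```python
def window_starts(num_frames, window=20, stride=5):
    """Return the start index of every full window that fits in the episode.

    Assumption: a trailing partial window is dropped rather than padded.
    """
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, stride))


# A 40-frame toy episode yields five overlapping windows.
starts = window_starts(40)
spans = [(s, s + 20) for s in starts]  # each window covers frames [start, start + 20)
```

With these settings, consecutive windows share 15 of their 20 frames, which is worth remembering when reading the pairwise tasks later in this file.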

Task Walkthroughs

timeline_action

Goal: Look at one short multimodal window and name what action is happening now.

Case study: In the coffee-making sample, if the 20-frame window is during a pouring moment, the task asks the model to output an action such as "Pour coffee" or "Pour milk into coffee".

Input: One 20-frame window represented by the current 8,378-d feature vector: video/depth summaries, pose, SLAM/camera pose, motion capture, IMU, calibration, and language-derived context.

Middle process modules:

  • Window builder slices the episode into short overlapping windows.
  • Feature assembler concatenates all current feature blocks.
  • Label builder reads the action annotation for the center of the window.
  • Classifier head maps the window vector to one action class.
  • Evaluator compares predicted action labels against the held-out chronological segment.

Output: A single action class for the current window.

Metric: macro-F1 (higher is better). Minimal 0.0500, neural MLP 0.0263.

Junior mental model: This is like asking: given this tiny movie clip plus sensor readings, what is the person doing right now?

Current limitation: The single episode is split chronologically, so some action classes occur only in the later held-out segment and never in training. Low test macro-F1 is therefore expected.
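To see why unseen classes pull macro-F1 down, here is a minimal pure-Python sketch of the metric. It averages per-class F1 over every class that appears in the labels or the predictions, which is one common convention; the committed evaluator may fix the class set differently. The toy labels are invented for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in truth or prediction."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy held-out segment: "stir" never appeared in training, so it is never predicted.
y_true = ["pour", "pour", "stir", "stir"]
y_pred = ["pour", "pour", "pour", "pour"]
score = macro_f1(y_true, y_pred)  # "pour" scores 2/3, "stir" scores 0
```

Half of the toy predictions are correct, yet the unseen class contributes an F1 of zero and the macro average drops to one third.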

timeline_subtask

Goal: Predict the higher-level task stage for the current window.

Case study: A pouring action may belong to a broader subtask such as preparing or pouring a drink. The model predicts that broader stage instead of a fine action.

Input: The same all-modality 8,378-d window vector used by action recognition.

Middle process modules:

  • Window builder creates the current temporal slice.
  • Feature assembler keeps all available modality blocks.
  • Subtask label builder maps the current timestamp to a subtask annotation.
  • Classifier head predicts the subtask class.
  • Evaluator reports class-balanced scores so rare subtasks matter.

Output: A single subtask label for the current window.

Metric: macro-F1 (higher is better). Minimal 0.0495, neural MLP 0.0175.

Junior mental model: Action is the verb; subtask is the chapter of the activity.

Current limitation: Single-episode ordering means some later subtasks appear only in test, so this is a pipeline check rather than a general benchmark.
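The label builder step can be pictured as a lookup from a frame index to the annotated segment that contains it. The sketch below assumes contiguous segments that each run until the next segment starts, and it uses the window's center frame as the lookup time; both are assumptions, and the segment names are made up.

```python
from bisect import bisect_right


def label_at(frame, segment_starts, segment_labels):
    """Return the label of the annotated segment that contains `frame`."""
    idx = bisect_right(segment_starts, frame) - 1
    if idx < 0:
        return None  # frame lies before the first annotated segment
    return segment_labels[idx]


# Hypothetical subtask annotation: each segment runs until the next start frame.
segment_starts = [0, 100, 250]
segment_labels = ["prepare cup", "pour drink", "clean up"]

center = 90 + 20 // 2  # center frame of the 20-frame window starting at frame 90
label = label_at(center, segment_starts, segment_labels)
```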

transition_detection

Goal: Detect whether the current window is near a boundary between actions.

Case study: When the demonstrator changes from preparing to pouring, the model should flag a boundary instead of a steady action window.

Input: One all-modality window vector plus labels derived from action-change timestamps.

Middle process modules:

  • Boundary builder scans action labels over time and marks windows near a change.
  • Feature assembler supplies all current modality features.
  • Binary classifier predicts steady vs boundary.
  • Boundary matcher checks whether predicted boundary times are close to true boundary times.
  • Evaluator reports macro-F1 and timing error, not just accuracy.

Output: A binary label: boundary or steady.

Metric: macro-F1 (higher is better). Minimal 0.6552, neural MLP 0.6485.

Junior mental model: This is the model's way of saying: something just changed here.

Current limitation: Boundaries are rare, so high accuracy can be misleading if the model predicts steady too often.
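A minimal sketch of the boundary builder, assuming per-frame action labels and a hypothetical `tolerance` of 5 frames around each change. The real definition of "near a change" is not stated in this file, so treat the tolerance as a placeholder.

```python
def boundary_labels(frame_labels, window_centers, tolerance=5):
    """Label a window 1 (boundary) if an action change lies within
    `tolerance` frames of its center, else 0 (steady)."""
    changes = [i for i in range(1, len(frame_labels))
               if frame_labels[i] != frame_labels[i - 1]]
    return [int(any(abs(c - ch) <= tolerance for ch in changes))
            for c in window_centers]


frame_labels = ["prepare"] * 30 + ["pour"] * 30  # one change, at frame 30
centers = [10, 15, 20, 25, 30, 35, 40, 45, 50]
labels = boundary_labels(frame_labels, centers)

# A classifier that always answers "steady" is already right on most windows.
always_steady_accuracy = labels.count(0) / len(labels)
```

In this toy case the always-steady baseline reaches two thirds accuracy without detecting a single boundary, which is why the task reports macro-F1.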

next_action

Goal: Use the current window to guess the action that will happen shortly after it.

Case study: If a window shows the person preparing to pour, the target can be the action 20 frames later, such as the start of pouring.

Input: The current all-modality window vector at time t.

Middle process modules:

  • Window builder picks a current time window.
  • Future label builder shifts the action target by 20 frames.
  • Feature assembler uses only current information, not future features.
  • Classifier head predicts the future action class.
  • Evaluator checks whether the future action label is correct.

Output: A single action class for time t + 20 frames.

Metric: macro-F1 (higher is better). Minimal 0.0593, neural MLP 0.0235.

Junior mental model: This is short-horizon intention prediction: what will the person do next?

Current limitation: The public sample has unseen future classes in the chronological test split, which makes this very hard with one episode.
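The future label builder can be sketched as pairing each window with a label taken 20 frames ahead. Measuring the horizon from the window's last frame is an assumption here (it could equally be measured from the center), and windows whose target would fall past the end of the episode are dropped.

```python
def future_targets(frame_labels, starts, window=20, horizon=20):
    """Pair each window start with the action label `horizon` frames
    after the window's last frame (assumed reference point)."""
    pairs = []
    for s in starts:
        target_frame = s + window - 1 + horizon
        if target_frame < len(frame_labels):  # skip windows with no future label
            pairs.append((s, frame_labels[target_frame]))
    return pairs


frame_labels = ["prepare"] * 40 + ["pour"] * 40  # 80-frame toy episode
pairs = future_targets(frame_labels, starts=[0, 5, 40, 45])
```

The features for each pair still come only from frames inside the window; only the label looks ahead.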

hand_trajectory_forecast

Goal: Predict where the hands will move over the next few frames.

Case study: When the hand is moving toward a cup or bottle, the model predicts the future 3D hand-joint path.

Input: The current all-modality window vector at time t.

Middle process modules:

  • Window builder chooses the current sensor window.
  • Target builder extracts future left/right hand 3D joints from motion capture.
  • Regression head predicts a continuous trajectory, not a class label.
  • Output reshaper interprets the vector as future frames and joints.
  • Evaluator computes MPJPE, the average 3D joint-position error.

Output: A future trajectory vector for left and right hand joints.

Metric: MPJPE (lower is better). Minimal 0.8223, neural MLP 0.1116.

Junior mental model: Instead of naming an action, this task draws the next hand path in 3D.

Current limitation: It is still a window-level forecast, not a full policy or long-horizon motion generator.
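The output reshaper and the MPJPE evaluator can be sketched together. The flat layout assumed below (frame-major, then joint, then x/y/z) and the helper names are illustrative assumptions, not the committed format.

```python
import math


def reshape_trajectory(flat, num_frames, num_joints):
    """Interpret a flat vector as frames x joints x (x, y, z)."""
    if len(flat) != num_frames * num_joints * 3:
        raise ValueError("flat vector length does not match frames * joints * 3")
    values = iter(flat)
    return [[(next(values), next(values), next(values)) for _ in range(num_joints)]
            for _ in range(num_frames)]


def mpjpe(pred, target):
    """Mean Euclidean distance between predicted and true 3D joints."""
    distances = [math.dist(p, t)
                 for pred_frame, target_frame in zip(pred, target)
                 for p, t in zip(pred_frame, target_frame)]
    return sum(distances) / len(distances)


pred = reshape_trajectory([3, 4, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1], num_frames=2, num_joints=2)
target = reshape_trajectory([0] * 12, num_frames=2, num_joints=2)
error = mpjpe(pred, target)  # per-joint distances are 5, 0, 1, 1
```

The result is in whatever length unit the joint coordinates use.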

contact_prediction

Goal: Predict whether the body or hand is in contact with something.

Case study: During manipulation, the hand may touch a cup, table, or bottle. The task asks whether any contact is happening.

Input: Non-contact and non-caption feature blocks, so the answer is not directly leaked from the target labels.

Middle process modules:

  • Feature selector removes contact-label and caption-label blocks.
  • Target builder converts contact annotations into a binary label.
  • Binary classifier predicts contact vs no contact.
  • Evaluator reports macro-F1 and accuracy.
  • Degeneracy checker records whether only one class appears.

Output: A binary contact label.

Metric: macro-F1 (higher is better). Minimal 1.0000, neural MLP 1.0000.

Junior mental model: This is a simple physical-interaction probe: is the person touching something now?

Current limitation: The current public sample is degenerate for this task because one class dominates, so a perfect score does not mean the model learned contact physics.
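A degeneracy check can be as small as counting classes in each split. The sketch below is one possible form; the report keys and the function name are hypothetical.

```python
from collections import Counter


def degeneracy_report(train_labels, test_labels):
    """Flag a split as degenerate when either side contains fewer than two classes."""
    test_counts = Counter(test_labels)
    return {
        "test_class_counts": dict(test_counts),
        "degenerate": len(set(train_labels)) < 2 or len(test_counts) < 2,
    }


# Toy split where every window is labeled "contact" (1).
report = degeneracy_report(train_labels=[1, 1, 1], test_labels=[1, 1])
```

When the flag is set, any classifier that outputs the only available class scores perfectly, so the metric says nothing about what was learned.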

object_relevance

Goal: Predict which objects matter in the current window.

Case study: If the person is pouring milk into coffee, relevant objects may include milk, cup, coffee, or container-like items.

Input: Non-caption feature blocks, so the model must infer objects from sensors rather than copying the caption words.

Middle process modules:

  • Object vocabulary builder collects object labels from annotations.
  • Feature selector removes caption-derived label blocks.
  • Multi-label target builder creates a multi-hot object vector.
  • Sigmoid heads predict each object's relevance independently.
  • Evaluator reports micro-F1 and exact-match quality.

Output: A multi-label object set for the current window.

Metric: micro-F1 (higher is better). Minimal 0.1839, neural MLP 0.1798.

Junior mental model: A window can involve more than one object, so this is not a one-class classifier.

Current limitation: Object labels are sparse and language-derived, so this is currently a weak object-centric probe.
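The multi-hot target and the micro-F1 score can be sketched in a few lines. The vocabulary and the predictions are toy values; micro-F1 here pools true positives, false positives, and false negatives over all windows and all objects before computing one F1.

```python
def multi_hot(objects, vocab):
    """Encode a set of object names as a 0/1 vector over the vocabulary."""
    return [int(name in objects) for name in vocab]


def micro_f1(true_rows, pred_rows):
    """F1 computed from counts pooled over every (window, object) cell."""
    tp = fp = fn = 0
    for t_row, p_row in zip(true_rows, pred_rows):
        for t, p in zip(t_row, p_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


vocab = ["coffee", "cup", "milk", "spoon"]  # toy vocabulary
true_rows = [multi_hot({"milk", "cup", "coffee"}, vocab), multi_hot({"spoon", "cup"}, vocab)]
pred_rows = [[1, 1, 0, 0], [0, 1, 0, 0]]

score = micro_f1(true_rows, pred_rows)
exact_match = sum(t == p for t, p in zip(true_rows, pred_rows)) / len(true_rows)
```

The toy predictions find three of five relevant objects with no false alarms, yet neither window matches exactly, which shows why the two numbers are reported side by side.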

caption_grounding

Goal: Given a text-like query from annotation, find the matching time window.

Case study: A query like "Pour milk into coffee" should rank the windows from the actual pouring moment higher than unrelated windows.

Input: Caption/object/interaction query features and a set of candidate sensor-window features.

Middle process modules:

  • Query builder converts annotation words into a compact query representation.
  • Candidate builder gathers held-out sensor windows.
  • Projection head maps sensor windows into the query space.
  • Ranker scores candidates by cosine similarity.
  • Evaluator reports MRR and top-k retrieval accuracy.

Output: A ranked list of windows, with the correct matching window ideally near rank 1.

Metric: MRR (higher is better). Minimal 0.0172, neural MLP 0.0178.

Junior mental model: This is search: type a description, retrieve the matching moment.

Current limitation: Bag-of-objects text features are too simple for rich language grounding.
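The ranker and the MRR evaluator can be illustrated with plain cosine similarity over toy 2-d vectors. Ties are resolved optimistically here (only strictly higher scores push the match down), which is an assumption; the committed evaluator may break ties differently.

```python
import math


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_of_match(query, candidates, match_index):
    """1 + number of candidates scoring strictly higher than the true match."""
    scores = [cosine(query, c) for c in candidates]
    return 1 + sum(s > scores[match_index] for s in scores)


def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)


candidates = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # toy projected windows
ranks = [
    rank_of_match((1.0, 0.1), candidates, match_index=0),  # match ranks first
    rank_of_match((0.2, 1.0), candidates, match_index=2),  # one window scores higher
]
mrr = mean_reciprocal_rank(ranks)
```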

cross_modal_retrieval

Goal: Use one group of modalities to retrieve the matching window from another group.

Case study: Use motion, IMU, and camera-pose signals from a pouring moment to retrieve the matching depth/video representation for that same moment.

Input: Query side: motion, IMU, and camera/pose features. Candidate side: depth and video features.

Middle process modules:

  • Feature splitter separates query modalities from target modalities.
  • Projection head maps the query vector into target-modality space.
  • Candidate index stores target vectors from held-out windows.
  • Ranker retrieves nearest candidates by cosine similarity.
  • Evaluator reports MRR, top-1, top-5, and top-10 accuracy.

Output: A ranked list of candidate depth/video windows.

Metric: MRR (higher is better). Minimal 0.2634, neural MLP 0.1530.

Junior mental model: This checks whether different sensors agree about the same moment in time.

Current limitation: Good retrieval means useful alignment signal, but it is not yet 3D reconstruction or rendering.
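The feature splitter amounts to selecting named slices of the window vector. The block offsets below are invented for a 13-d toy vector; the real 8,378-d layout is defined by the feature assembler and is not reproduced here. A small top-k helper shows how the reported accuracies follow from match ranks.

```python
# Hypothetical block layout: name -> (start, stop) in a 13-d toy vector.
BLOCKS = {
    "video": (0, 4),
    "depth": (4, 6),
    "motion": (6, 9),
    "imu": (9, 11),
    "camera_pose": (11, 13),
}


def select_blocks(vector, names, blocks=BLOCKS):
    """Concatenate the named modality blocks of one window vector."""
    out = []
    for name in names:
        start, stop = blocks[name]
        out.extend(vector[start:stop])
    return out


def top_k_accuracy(ranks, k):
    """Fraction of queries whose true match is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


vector = list(range(13))
query_side = select_blocks(vector, ["motion", "imu", "camera_pose"])
candidate_side = select_blocks(vector, ["depth", "video"])
top5 = top_k_accuracy([1, 3, 7, 12], k=5)
```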

modality_reconstruction

Goal: Predict one modality feature block from other modality blocks.

Case study: Given motion, IMU, and camera-pose signals while the hand moves, predict the matching depth/video feature vector.

Input: Motion, IMU, and camera/pose features as input; depth/video features as the regression target.

Middle process modules:

  • Feature splitter defines source and target modality blocks.
  • Scaler normalizes source and target vectors using train statistics.
  • Regression head predicts the target feature vector.
  • Inverse scaler returns predictions to target scale.
  • Evaluator reports MSE, MAE, and R2.

Output: A reconstructed depth/video feature vector.

Metric: R2 (higher is better; 0 is the score of always predicting the mean of the evaluated targets). Minimal -0.0160, neural MLP -0.0102.

Junior mental model: This is feature-level imagination: can the model infer what another sensor would see?

Current limitation: This reconstructs compressed features, not raw pixels, depth maps, meshes, NeRFs, or Gaussian splats.
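The scaling round trip and the meaning of a negative R2 can be shown on toy numbers. This sketch standardizes with train statistics only and computes R2 for a single target dimension; how the committed evaluator aggregates R2 across dimensions is not stated here, so that part is left out.

```python
def fit_scaler(rows):
    """Per-dimension mean and standard deviation from training rows."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((r[d] - means[d]) ** 2 for r in rows) / n
        stds.append(var ** 0.5 or 1.0)  # avoid dividing by zero on constant columns
    return means, stds


def transform(rows, scaler):
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def inverse_transform(rows, scaler):
    means, stds = scaler
    return [[v * s + m for v, m, s in zip(r, means, stds)] for r in rows]


def r2_score(y_true, y_pred):
    """Coefficient of determination for one target dimension."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


train = [[1.0, 10.0], [3.0, 30.0]]
scaler = fit_scaler(train)
restored = inverse_transform(transform(train, scaler), scaler)

y_true = [1.0, 2.0, 3.0, 4.0]
r2_mean = r2_score(y_true, [2.5] * 4)    # predicting the mean gives 0
r2_worse = r2_score(y_true, [3.0] * 4)   # a worse constant gives a negative score
```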

temporal_order

Goal: Tell whether two nearby windows are in the correct time order.

Case study: If window A shows reaching and window B shows pouring, the model should distinguish "A then B" from "B then A".

Input: A pair of adjacent window vectors, plus their difference vector.

Middle process modules:

  • Pair builder creates correct-order and reversed-order examples.
  • Feature combiner concatenates first window, second window, and their difference.
  • Binary classifier predicts correct vs reversed.
  • Evaluator reports F1, precision, and recall.
  • Diagnostic reader interprets whether features encode local time direction.

Output: A binary label: correct order or reversed order.

Metric: F1 (higher is better). Minimal 0.5487, neural MLP 0.8718.

Junior mental model: This asks whether the representation knows which moment came first.

Current limitation: It only tests local ordering, not long-term planning or causality.
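The pair builder and feature combiner can be sketched as follows. Each adjacent pair yields one correct-order example and one reversed example; taking the difference as second minus first, and labeling correct order as 1, are assumptions for illustration.

```python
def order_pairs(windows):
    """Build (features, label) examples from adjacent window vectors.

    Features are [first, second, second - first]; label 1 means correct order.
    """
    examples = []
    for a, b in zip(windows, windows[1:]):
        forward_diff = [y - x for x, y in zip(a, b)]
        backward_diff = [x - y for x, y in zip(a, b)]
        examples.append((a + b + forward_diff, 1))   # a really precedes b
        examples.append((b + a + backward_diff, 0))  # same pair, swapped
    return examples


windows = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0]]  # three toy 2-d window vectors
examples = order_pairs(windows)
```

Because the dataset is balanced by construction, a score near 0.5 means the features carry little information about time direction.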

misalignment_detection

Goal: Detect when modalities that should match are shifted out of sync.

Case study: Motion from a pouring moment is paired with video/depth from several windows later. The task asks the model to detect that mismatch.

Input: A motion-side feature group and a visual/depth-side feature group, either aligned or artificially shifted.

Middle process modules:

  • Alignment builder creates positive pairs from the same time window.
  • Shift builder creates negative pairs by offsetting one modality group.
  • Feature combiner joins both sides into one example.
  • Binary classifier predicts aligned vs misaligned.
  • Evaluator reports F1 and accuracy.

Output: A binary label: aligned or shifted.

Metric: F1 (higher is better). Minimal 0.4866, neural MLP 0.7335.

Junior mental model: This is a synchronization alarm for multimodal data.

Current limitation: Synthetic shifts are useful diagnostics but do not solve calibration, reconstruction, or mapping by themselves.
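The alignment and shift builders can be sketched as one loop over window indices. The shift of 3 windows and the label convention (1 for aligned, 0 for shifted) are placeholders; the offsets used by the committed pipeline are not given in this file.

```python
def alignment_pairs(motion, visual, shift=3):
    """Build (features, label) examples: aligned pairs use the same window,
    shifted pairs take the visual side from `shift` windows later."""
    examples = []
    for i in range(len(motion)):
        examples.append((motion[i] + visual[i], 1))      # aligned
        j = i + shift
        if j < len(visual):
            examples.append((motion[i] + visual[j], 0))  # artificially shifted
    return examples


motion = [[float(i)] for i in range(5)]       # toy motion-side features
visual = [[10.0 * i] for i in range(5)]       # toy visual/depth-side features
examples = alignment_pairs(motion, visual)
```

Windows near the end of the episode have no later partner, so this sketch produces more aligned than shifted examples; a real builder might also shift backward to balance the classes.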