# Glossary

This glossary defines project-specific terms and adjacent technical field terms that can be easy to confuse across the GitHub repo, website, Hugging Face Space, artifact dataset, model repos, result matrices, and embodied-AI training discussions. Use it with `PUBLIC_READER_MAP.md` when choosing what to read first, and with `docs/data/glossary.json` when a tool needs the same terms in machine-readable form.

## How To Read The Terms

| Category | What it clarifies |
| --- | --- |
| Dataset and scope | Public data boundaries, evidence lines, and how each result family should be read. |
| Files and features | Raw sample files, windows, feature manifests, and public-safe derivatives. |
| Multimodal sensing | Video, audio, depth, IMU, motion capture, calibration, and synchronization terms. |
| Spatial geometry | Camera pose, SLAM, coordinate frames, point clouds, 3D reconstruction, and spatial grounding. |
| Temporal and world models | Future prediction, rollouts, forward dynamics, long-horizon forecasting, and temporal leakage. |
| Robotics and VLA | Vision-language-action, policies, action chunks, imitation learning, contact, and dexterity. |
| Tasks and metrics | Task contracts, scored records, direct scores, compact proxies, and audits. |
| Training and evaluation | Splits, held-out evaluation, metric types, prompt/schema checks, adapters, and distributed training. |
| Models and runs | Baseline families, Qwen3-Omni, Cosmos3, LoRA adapters, and full-parameter gates. |
| Public surfaces | GitHub, website, Hugging Face repos, parity checks, and package validation. |

## Core And Field Terms

### Dataset and scope

| Term | Plain meaning | In this project | Do not confuse with |
| --- | --- | --- | --- |
| Evidence line | A reading lane for a group of results. | Line 1 is one public sample episode; Line 2 is selected-128 held-out comparison. | Qwen run versions v1-v6, which are model-run lineage. |
| Official gated data | Upstream files that require official dataset access. | Raw Xperience-10M MP4/HDF5/RRD files and full source directories remain outside the public repo. | Public-safe metrics, derived features, figures, and manifests. |
| Public sample episode | One officially available sample episode. | The fully inspectable Line 1 unit used for raw-file browsing, 20-frame windows, task construction, and single-episode baselines. | The selected-128 comparison rows. |
| Selected 128 episodes | A public-safe selected subset of official gated episode paths. | Line 2 uses derived windows/features and keeps links back to official episode ids and gated source paths. | Redistributed raw MP4/HDF5/RRD data. |
| Xperience-10M | The upstream embodied human-interaction dataset. | Source dataset behind the public sample, selected-128 features, task suite, and model diagnostics. | This repo, which only redistributes public-safe derived artifacts. |

### Files and features

| Term | Plain meaning | In this project | Do not confuse with |
| --- | --- | --- | --- |
| 20-frame window | A fixed short clip slice. | The sample episode is converted into aligned 20-frame units for features, labels, and many task heads. | A full episode or arbitrary video segment. |
| annotation.hdf5 | Upstream annotation container for the sample. | Contains original labels/metadata; some public derived files expose processed features instead of every raw text field. | Task result summaries. |
| Episode | One recorded interaction sequence. | The basic source unit behind windows, labels, and train/val/test splits. | A 20-frame window. |
| Feature manifest | A map from model-input columns to source modalities. | Explains feature groups and dimensions for the sample task suite. | The raw annotation file. |
| Interaction text | Natural-language interaction/caption content. | Used by task 15 and some derived text features; public matrices record direct or compact-proxy status. | Numeric action ids or subtask ids. |
| Modality | A type of signal. | Video, audio, depth, pose/SLAM, motion capture, inertial, calibration, and language-derived signals. | A task target. |
| Raw sample file map | A human-readable inventory of the sample episode files. | Explains videos, annotations, calibration, motion, and derived previews. | A training manifest. |
| visualization.rrd | Rerun viewer recording for visual inspection. | Can be downloaded from the official sample dataset and opened in Rerun 0.29.0 to inspect the sample episode. It is not used for published training or metric rows. | MP4 video streams or model inputs. |
| Window stride | The frame step between neighboring windows. | Creates overlapping examples while preserving chronological order and leakage controls. | Video frame rate. |

### Multimodal sensing

| Term | Plain meaning | In this project | Do not confuse with |
| --- | --- | --- | --- |
| Audio waveform | A time-series pressure signal from sound. | The audio ablation measures whether embedded audio helps selected task contracts. | Language captions or text labels. |
| Calibration | Parameters that relate sensors to each other and to physical space. | Needed to interpret camera streams, depth, pose, and synchronized multimodal features together. | A model training hyperparameter. |
| Camera extrinsics | A camera position and orientation relative to another coordinate frame. | Connects different camera streams and world coordinates. | Camera intrinsics. |
| Camera intrinsics | Internal camera parameters such as focal length and distortion. | Explain how image pixels project to rays for geometry tasks. | Camera extrinsics. |
| Depth map | A per-pixel estimate of distance from the camera. | Depth-derived signals support spatial and geometry-oriented tasks. | RGB brightness or semantic segmentation. |
| Egocentric video | Video captured from a first-person or body-mounted viewpoint. | The sample streams are egocentric views of human interaction and are the visual basis for many tasks. | Third-person robot-camera footage. |
| Fisheye camera | A wide-angle camera with strong lens distortion. | Multiple fisheye MP4 streams give broad room coverage but need calibration-aware interpretation. | A rectilinear pinhole camera image. |
| IMU | An inertial measurement unit with accelerometer and gyroscope signals. | Supports motion, temporal, and sensor-bridging tasks. | Motion capture skeleton data. |
| Metric depth | Depth expressed in physical units rather than arbitrary relative scale. | Useful for distance-sensitive spatial reasoning and reconstruction targets. | Relative monocular depth. |
| Motion capture | A system that records body or hand motion over time. | Provides hand/body motion evidence when exposed through public-safe derived features. | Video-only pose estimation. |
| RGB frame | A color image frame from a video stream. | Used for visual statistics, previews, and many model inputs. | Depth values or point-cloud coordinates. |
| Sensor alignment | Putting different sensor streams into a shared temporal or spatial reference. | Used to make video, audio, pose, depth, IMU, and mocap usable in the same task input. | Model ensembling. |
| Stereo camera | A paired-camera setup that supports depth or geometry estimation. | The sample browser exposes stereo streams as part of the visual modality set. | Single-view RGB video. |
| Timestamp synchronization | Aligning sensor samples by time. | The task suite assumes aligned windows across modalities so labels and features refer to the same moment. | Randomly joining files with similar names. |

### Spatial geometry

| Term | Plain meaning | In this project | Do not confuse with |
| --- | --- | --- | --- |
| 3D reconstruction | Recovering 3D scene structure from sensor data. | One core spatial-intelligence direction for Xperience-style data. | Next-action classification. |
| Affordance | An action possibility offered by an object or scene. | Relevant when moving from observed human interaction to robot-action or VLA tasks. | A detected object category alone. |
| Camera pose | The camera position and orientation at a time step. | Supports spatial-intelligence tasks, view synchronization, and geometry diagnostics. | The human body pose. |
| Coordinate frame | A reference system for positions and orientations. | Needed when comparing camera, body, object, and world measurements. | A video frame. |
| Object-centric representation | A representation organized around objects and their relations. | Useful for object relevance, object-set forecast, and action-object relation tasks. | A flat feature vector without object identity. |
| Odometry | Motion estimated from sensor changes over time. | A relevant spatial term for ego-motion and camera-pose reasoning. | Ground-truth motion capture. |
| Point cloud | A set of 3D points representing scene structure. | A likely target or intermediate representation for spatial-intelligence extensions. | A 2D image grid. |
| SLAM | Simultaneous localization and mapping. | A field term for estimating camera motion and scene structure from sensor observations. | A task label or action class. |
| Spatial grounding | Linking language or labels to locations, objects, or geometry. | Connects language grounding tasks with 3D/spatial reasoning. | General text classification. |
| Trajectory | A sequence of positions over time. | Used for hand motion, camera motion, and future-path tasks. | A single coordinate or label. |

### Temporal and world models

| Term | Plain meaning | In this project | Do not confuse with |
| --- | --- | --- | --- |
| Action forecasting | Predicting a future action before it happens. | Covered by next-action and long-horizon task contracts. | Recognizing the current action only. |
| Autoregressive prediction | Generating each future token, state, or frame conditioned on prior outputs. | Relevant for model branches that produce structured JSON or temporal predictions. | A one-shot classifier. |
| Forward dynamics | Predicting the next state from the current state and action/context. | The Cosmos3-Super LoRA branch uses a forward-dynamics-style diagnostic contract. | Reverse inference from result back to cause. |
| Latent state | A hidden representation that summarizes observed context. | Useful for future foundation-model and world-model training plans. | A visible annotation column. |
| Long-horizon prediction | Predicting outcomes several seconds or steps ahead. | Tasks 13 and 14 test longer temporal context beyond immediate recognition. | Single-frame classification. |
| Next-frame prediction | Predicting future visual frames from past frames. | A field-level world-model objective related to the human-video world-model direction. | Next-action prediction. |
| Object persistence | Tracking that an object remains present over time even when view or interaction changes. | Relevant for object-set forecast and long-video reasoning. | A single-frame object detection. |
| Rollout | Repeatedly predicting future steps from a model state. | Important for judging world models beyond one-step prediction. | A held-out static test row. |
| Subtask forecasting | Predicting the next higher-level step in an activity. | Used in the future-task probe line for Qwen3-Omni. | Frame-level action classification. |
| Teacher forcing | Training a sequence model using ground-truth previous outputs. | A likely training option for future sequence/world-model baselines. | Free-running rollout evaluation. |
| Temporal leakage | Using future information that would not be available at prediction time. | Avoided by chronological splits and target-side feature controls. | A low model score. |
| Transition timing | Estimating when the next state or action transition happens. | Task 20 turns temporal change into a regression target. | Classifying the transition type only. |

### Robotics and VLA

| Term | Plain meaning | In this project | Do not confuse with |
| --- | --- | --- | --- |
| Action chunk | A short sequence of low-level actions predicted together. | The VLA figure and plan use action chunks as the policy-output concept. | A natural-language action label. |
| Behavior cloning | A supervised imitation-learning method for predicting demonstrated actions. | A plausible baseline once action targets are converted. | Generative video modeling. |
| Contact event | A moment when a hand, body, or tool touches an object or surface. | Used in contact-related tasks and action-quality interpretation. | Visual co-occurrence without touch. |
| Dexterity | Fine-grained physical manipulation ability. | Relevant to hand-object interaction, contact, and VLA/policy directions. | High text-generation accuracy. |
| End effector | The robot part that acts on the world, such as a gripper or hand. | A key target frame for future manipulation-policy conversion. | A camera or global scene coordinate. |
| Hand-object interaction | A physical interaction between hands and objects. | A central signal family behind action, contact, object relevance, and interaction-text tasks. | Object detection without action. |
| Imitation learning | Training a policy to imitate demonstrated behavior. | Relevant when converting human video/motion into action supervision. | Reinforcement learning from online robot trials. |
| Language grounding | Connecting text to observed objects, actions, or spatial context. | Task 8 and VLA directions use language as grounded supervision rather than standalone text. | Caption fluency alone. |
| Policy | A mapping from observations to actions. | A future target for robot-compatible Xperience-derived action data. | A benchmark metric. |
| Robot-compatible action target | An action representation a robot policy can execute or imitate. | Needed before OpenVLA/openpi/GR00T-style policy training is meaningful here. | Human-only caption text. |
| Vision-language-action model | A model that maps visual context and language into actions. | The VLA direction is a future path after action targets are converted into robot-compatible chunks. | A vision-language model that only answers text. |

### Tasks and metrics

| Term | Plain meaning | In this project | Do not confuse with |
| --- | --- | --- | --- |
| Compact-proxy score | A bounded proxy metric when a direct raw target is not publicly available. | Kept explicit in the matrix and gap audit so readers do not over-read it. | A direct target measurement. |
| Direct score | A metric computed against the task target directly. | The preferred score type in the 20-task matrix. | Compact-proxy score. |
| Gap audit | A coverage and source-status audit. | Explains scored, proxy, and unsupported cells. | A performance leaderboard. |
| Leakage control | A split or feature rule that prevents using target information unfairly. | Chronological splits, held-out splits, and source audits protect task interpretation. | Lower training accuracy. |
| Normalized radar value | A 0-1 plotting value used only to draw comparable radar polygons. | Helps visualize metrics with different scales and directions. | The raw metric value to cite. |
| Raw metric value | The original metric value emitted by the runner or verified result package. | This is the value to cite from the 180-result table. | The normalized radar value. |
| Task contract | The definition of one benchmark task. | Includes input, target/output, metric, split, source artifact, and limitation. | A model architecture. |
| Task-method record | One method evaluated on one task. | 9 methods x 20 tasks gives 180 public result records. | A single prediction row. |
| Unified 20-task suite | The current task surface. | All 20 task contracts are presented together and scored across methods where real artifacts exist. | Historical tier2_task_suite filenames, which are provenance paths rather than a second suite. |

### Training and evaluation

| Term | Plain meaning | In this project | Do not confuse with |
| --- | --- | --- | --- |
| Adapter checkpoint | Saved adapter weights from a fine-tuning run. | The public model branches publish adapters when validated and public-safe. | Full base-model checkpoint. |
| Balanced accuracy | Accuracy averaged across classes to reduce majority-class dominance. | Useful for imbalanced task labels. | Overall accuracy. |
| Chronological split | A split ordered by time. | Used for the single-episode baselines to reduce future-window leakage. | A random row split. |
| Confusion matrix | A table of predicted classes versus true classes. | Helps inspect which task labels a method confuses. | A scalar leaderboard score. |
| FSDP | Fully Sharded Data Parallel, a distributed training strategy. | Appears in full-parameter feasibility and multi-GPU training notes. | A model architecture. |
| Held-out evaluation | Testing on examples not used for training. | Required before promoting Qwen/Cosmos results to public evidence. | Training-set loss. |
| JSON validity | Whether model output parses as the required JSON schema. | A key diagnostic for Qwen3-Omni structured-output runs. | Task correctness after parsing. |
| Macro F1 | The average F1 score across classes, usually treating classes equally. | Used when class imbalance matters in classification tasks. | Accuracy dominated by frequent classes. |
| Mean absolute error | The average absolute difference between predicted and true numeric values. | Used for regression-style task rows such as timing or trajectory targets. | A classification F1 score. |
| Overfit check | A small training test that verifies a model can learn a tiny subset. | Useful for catching data/model wiring bugs before full training. | Evidence of generalization. |
| Parameter-efficient fine-tuning | Updating a small number of added or selected parameters. | LoRA is the current parameter-efficient path for Qwen/Cosmos branches. | Full-parameter fine-tuning. |
| Schema compliance | Whether an output follows the expected field names and value types. | Needed for structured task probes and public package validation. | High semantic accuracy. |
| Smoke run | A short run that checks whether a pipeline can start and execute key steps. | Used for feasibility gates before expensive full runs. | A complete benchmark result. |
| Top-k accuracy | A score that counts a prediction correct if the target is among the k highest-ranked outputs. | Useful for large-label or retrieval-style tasks. | Top-1 exact accuracy. |
| Train/validation/test split | A partition that separates model fitting, tuning, and final evaluation examples. | The selected-128 setup uses a held-out split discipline for model branches. | A random shuffle without temporal or episode boundaries. |

### Models and runs

| Term | Plain meaning | In this project | Do not confuse with |
| --- | --- | --- | --- |
| Cosmos3-Nano | A smaller Cosmos3 compatibility/future-window branch. | Used for the Nano Future Window row and related diagnostics. | Cosmos3-Super fine-tuned adapter. |
| Cosmos3-Super | The larger Cosmos3-style branch tracked in this project. | Published as Reasoner diagnostics and a separate forward-dynamics LoRA adapter/result branch when verified. | Cosmos3-Nano. |
| Foundation pipeline | A high-level training direction. | Spatial intelligence, human-video world modeling, and vision-language-action are documented as trainable directions with task mappings. | A completed public result row. |
| Full-parameter fine-tuning | Updating the whole model rather than only adapters. | This project records feasibility gates and short pilots, but does not publish full checkpoints. | LoRA adapter publication. |
| Human-video world model | Learning future frames, actions, and interaction dynamics from human video. | Uses temporal prediction, next-action, transition, and object-forecast tasks. | Robot policy execution. |
| LoRA adapter | A lightweight set of trainable adapter weights. | Published only when the package is verified and public-safe. | Full base-model weights. |
| Metadata baseline | A selected-128 baseline using metadata or text-derived public-safe features. | Compares simple and neural heads on the held-out split. | Raw video, depth, or audio feature baselines. |
| Minimal baseline | A simple non-neural task head; the "minimum" reference row in casual wording. | Provides a reproducible lower-complexity comparison for task feasibility. | Metadata-only selected-128 baseline family. |
| Neural MLP | A compact neural task head. | Used for single-episode and selected-128 baseline comparisons. | Foundation-model fine-tuning. |
| Qwen v1-v6 | The Qwen3-Omni run lineage. | v1-v4 are earlier pipeline/ablation evidence, v5 is the prior pinned release, and v6 is the current public 20-task row. | Six different evidence lines. |
| Qwen3-Omni | The multimodal foundation-model family used for the Qwen branch. | The current public 20-task Qwen row is Qwen3-Omni v6 LoRA plus task-specific probes. | Cosmos3 or single-episode task-head baselines. |
| Raw-feature baseline | A selected-128 baseline using exported public-safe raw-feature groups. | Tracks what non-foundation heads can do with richer processed inputs. | Raw gated media redistribution. |
| Simple baseline | A non-neural baseline family for the selected-128 rows. | Used for metadata/text and raw-feature 128-episode comparisons before NN/foundation-model rows. | The single-episode Minimal baseline. |
| Spatial intelligence | Learning geometry and spatial reasoning from egocentric data. | Uses video, depth, camera pose, and language tasks to target 3D/space reasoning. | World-model future prediction. |
| Vision-language-action | Mapping perception and language to action chunks. | A future policy/VLA direction that needs action-target conversion and stronger policy packaging. | Qwen3-Omni diagnostic scoring. |

### Public surfaces

| Term | Plain meaning | In this project | Do not confuse with |
| --- | --- | --- | --- |
| HF artifact dataset | Hugging Face dataset repo for derived evidence. | Stores public-safe reports, metrics, website JSON, and sanitized result packages. | Original Xperience-10M dataset. |
| HF baseline model repo | Hugging Face model repo for lightweight baseline artifacts. | Mirrors baseline weights, figures, metrics, and task artifacts. | Qwen/Cosmos adapter-specific repos. |
| HF Space | Hugging Face-hosted app/site surface. | Mirrors the dashboard and static website assets. | HF artifact dataset or model repo. |
| HF weights/results repo | A consolidated public-safe model-result bundle. | Groups baseline weights, verified model artifacts, analysis files, and manifests. | The upstream raw dataset. |
| Mirror parity | A check that public copies match the source files. | Records whether GitHub, website, and HF mirrors agree. | A model-quality metric. |
| Public-safe artifact | A file that can be mirrored publicly without raw gated content. | Metrics, JSON summaries, model cards, figures, derived manifests, and approved lightweight weights/adapters. | Raw dataset redistribution. |
| Publication audit | A public-package validation report. | Confirms required files exist and forbidden raw/private assets are not included. | Scientific peer review. |
| Verified package | A result or artifact bundle that passed local/public validators. | Only verified packages are promoted to README, website, and HF surfaces as public evidence. | A running or exploratory experiment. |

## File Entry Points

| Need | Open |
| --- | --- |
| Reader navigation | `PUBLIC_READER_MAP.md`, `docs/data/public_reader_map.json` |
| Task definitions | `TASK_SUITE_20.md`, `docs/data/task_suite_20.json` |
| Result matrix | `TASK_METHOD_20_RESULT_MATRIX.md`, `docs/data/task_method_20_result_matrix.json` |
| Direct/proxy status | `TASK_METHOD_20_GAP_AUDIT.md`, `docs/data/task_method_20_gap_audit.json` |
| Qwen lineage | `QWEN3_OMNI_RUN_LINEAGE.md`, `docs/data/qwen3_omni_run_lineage.json` |
| 128-episode source/features | `XPERIENCE10M_128_EPISODE_FEATURE_INDEX.md`, `docs/data/xperience10m_128_episode_feature_index.json` |
| Public mirrors | `PUBLIC_SURFACE_QA.md`, `docs/data/mirror_parity.json`, `docs/data/live_publication_status.json` |
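
The glossary entries for 20-frame window, window stride, chronological split, and temporal leakage fit together in a way that a small sketch may make easier to see. The code below is illustrative only and is not the project's windowing code: the function names, the stride of 10 frames, the 60/20/20 split fractions, and the rule that drops boundary-overlapping windows are all assumptions chosen for the example. Only the 20-frame window length comes from the glossary.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # (start_frame, end_frame), end exclusive


def make_windows(num_frames: int, window: int = 20, stride: int = 10) -> List[Window]:
    """Slice an episode into fixed-length windows in chronological order.

    window=20 follows the glossary; stride=10 is an illustrative assumption.
    A stride smaller than the window length produces overlapping windows.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


def chronological_split(
    windows: List[Window], train_frac: float = 0.6, val_frac: float = 0.2
) -> Tuple[List[Window], List[Window], List[Window]]:
    """Split time-ordered windows into train/val/test without shuffling.

    Overlapping windows share frames, so any later-split window that starts
    before the previous split ends is dropped. This purge rule is a
    hypothetical leakage control, not one documented by the project.
    """
    n_train = int(len(windows) * train_frac)
    n_val = int(len(windows) * val_frac)
    train = windows[:n_train]
    val = windows[n_train:n_train + n_val]
    test = windows[n_train + n_val:]

    # Drop validation windows that share frames with the training range.
    cutoff = train[-1][1] if train else 0
    val = [w for w in val if w[0] >= cutoff]
    # Drop test windows that share frames with the last kept earlier window.
    cutoff = val[-1][1] if val else cutoff
    test = [w for w in test if w[0] >= cutoff]
    return train, val, test


# A hypothetical 200-frame episode.
windows = make_windows(num_frames=200)
train, val, test = chronological_split(windows)
```

Splitting by position in time rather than by random row keeps every evaluation window strictly later than the training windows, which is the property the chronological split entry describes. Dropping the windows that straddle a split boundary costs a few examples but removes shared frames between splits.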
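
The metric entries under Training and evaluation (balanced accuracy, macro F1, mean absolute error, top-k accuracy) are each paired with something they should not be confused with. The sketch below implements the textbook definitions in plain Python so the differences can be checked by hand. It is not the project's scoring code, and the `idle`/`grasp` labels and all numeric values are invented for illustration. Macro F1 here averages over the union of true and predicted classes, which is one common convention.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of rows predicted correctly (overall accuracy)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean per-class recall over the classes that appear in y_true."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1; a class with no support scores 0."""
    scores = []
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def top_k_accuracy(y_true: Sequence[Hashable], ranked: Sequence[Sequence[Hashable]], k: int) -> float:
    """Fraction of rows whose target is among the k highest-ranked outputs."""
    return sum(t in r[:k] for t, r in zip(y_true, ranked)) / len(y_true)


# Invented, imbalanced labels: a model that always predicts the majority class.
y_true = ["idle"] * 9 + ["grasp"]
y_pred = ["idle"] * 10
overall = accuracy(y_true, y_pred)            # 9 of 10 rows correct
balanced = balanced_accuracy(y_true, y_pred)  # recall 1.0 for idle, 0.0 for grasp
macro = macro_f1(y_true, y_pred)
```

On these invented labels the overall accuracy is 0.9 while the balanced accuracy is 0.5, which is the majority-class effect the glossary warns about when it separates the two terms.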