	# Glossary

	This glossary defines project-specific terms and adjacent technical field terms that are easy to confuse across the GitHub repo, website, Hugging Face Space, artifact dataset, model repos, result matrices, and embodied-AI training discussions. Use it with `PUBLIC_READER_MAP.md` when choosing what to read first, and with `docs/data/glossary.json` when a tool needs the same terms in machine-readable form.

	## How To Read The Terms

	\| Category \| What it clarifies \|
	\| --- \| --- \|
	\| Dataset and scope \| Public data boundaries, evidence lines, and how each result family should be read. \|
	\| Files and features \| Raw sample files, windows, feature manifests, and public-safe derivatives. \|
	\| Multimodal sensing \| Video, audio, depth, IMU, motion capture, calibration, and synchronization terms. \|
	\| Spatial geometry \| Camera pose, SLAM, coordinate frames, point clouds, 3D reconstruction, and spatial grounding. \|
	\| Temporal and world models \| Future prediction, rollouts, forward dynamics, long-horizon forecasting, and temporal leakage. \|
	\| Robotics and VLA \| Vision-language-action, policies, action chunks, imitation learning, contact, and dexterity. \|
	\| Tasks and metrics \| Task contracts, scored records, direct scores, compact proxies, and audits. \|
	\| Training and evaluation \| Splits, held-out evaluation, metric types, prompt/schema checks, adapters, and distributed training. \|
	\| Models and runs \| Baseline families, Qwen3-Omni, Cosmos3, LoRA adapters, and full-parameter gates. \|
	\| Public surfaces \| GitHub, website, Hugging Face repos, parity checks, and package validation. \|

	## Core And Field Terms

	### Dataset and scope

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| Evidence line \| A reading lane for a group of results. \| Line 1 is the single public sample episode; Line 2 is the selected-128 held-out comparison. \| Qwen run versions v1-v6, which are model-run lineage. \|
	\| Official gated data \| Upstream files that require official dataset access. \| Raw Xperience-10M MP4/HDF5/RRD files and full source directories remain outside the public repo. \| Public-safe metrics, derived features, figures, and manifests. \|
	\| Public sample episode \| One officially available sample episode. \| The fully inspectable Line 1 unit used for raw-file browsing, 20-frame windows, task construction, and single-episode baselines. \| The selected-128 comparison rows. \|
	\| Selected 128 episodes \| A public-safe selected subset of official gated episode paths. \| Line 2 uses derived windows/features and keeps links back to official episode ids and gated source paths. \| Redistributed raw MP4/HDF5/RRD data. \|
	\| Xperience-10M \| The upstream embodied human-interaction dataset. \| Source dataset behind the public sample, selected-128 features, task suite, and model diagnostics. \| This repo, which only redistributes public-safe derived artifacts. \|

	### Files and features

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| 20-frame window \| A fixed short clip slice. \| The sample episode is converted into aligned 20-frame units for features, labels, and many task heads. \| A full episode or arbitrary video segment. \|
	\| annotation.hdf5 \| Upstream annotation container for the sample. \| Contains original labels/metadata; some public derived files expose processed features instead of every raw text field. \| Task result summaries. \|
	\| Episode \| One recorded interaction sequence. \| The basic source unit behind windows, labels, and train/val/test splits. \| A 20-frame window. \|
	\| Feature manifest \| A map from model-input columns to source modalities. \| Explains feature groups and dimensions for the sample task suite. \| The raw annotation file. \|
	\| Interaction text \| Natural-language interaction/caption content. \| Used by task 15 and some derived text features; public matrices record direct or compact-proxy status. \| Numeric action ids or subtask ids. \|
	\| Modality \| A type of signal. \| Video, audio, depth, pose/SLAM, motion capture, inertial, calibration, and language-derived signals. \| A task target. \|
	\| Raw sample file map \| A human-readable inventory of the sample episode files. \| Explains videos, annotations, calibration, motion, and derived previews. \| A training manifest. \|
	\| visualization.rrd \| Rerun viewer recording for visual inspection. \| Can be downloaded from the official sample dataset and opened in Rerun 0.29.0 to inspect the sample episode. It is not used for published training or metric rows. \| MP4 video streams or model inputs. \|
	\| Window stride \| The frame step between neighboring windows. \| Creates overlapping examples while preserving chronological order and leakage controls. \| Video frame rate. \|
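
The relationship between a fixed window length and a window stride can be sketched as below. This is a minimal illustration, not the project's windowing code: the function name `make_windows` and the stride of 10 frames are assumptions chosen for the example, and only the 20-frame window length comes from the glossary.

```python
def make_windows(num_frames, window=20, stride=10):
    """Return (start, end) frame index pairs, end exclusive, in chronological order.

    The stride default is a hypothetical value for illustration only.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # The last valid start still leaves room for a full window;
    # a partial tail shorter than `window` is dropped.
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


windows = make_windows(num_frames=50, window=20, stride=10)
print(windows)
```

A stride smaller than the window length yields overlapping windows, so neighboring windows share frames. That overlap is one reason a split by time is safer than a random row split.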

	### Multimodal sensing

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| Audio waveform \| A time-series pressure signal from sound. \| The audio ablation measures whether embedded audio helps selected task contracts. \| Language captions or text labels. \|
	\| Calibration \| Parameters that relate sensors to each other and to physical space. \| Needed to interpret camera streams, depth, pose, and synchronized multimodal features together. \| A model training hyperparameter. \|
	\| Camera extrinsics \| A camera's position and orientation relative to another coordinate frame. \| Connects different camera streams and world coordinates. \| Camera intrinsics. \|
	\| Camera intrinsics \| Internal camera parameters such as focal length and distortion. \| Explain how image pixels project to rays for geometry tasks. \| Camera extrinsics. \|
	\| Depth map \| A per-pixel estimate of distance from the camera. \| Depth-derived signals support spatial and geometry-oriented tasks. \| RGB brightness or semantic segmentation. \|
	\| Egocentric video \| Video captured from a first-person or body-mounted viewpoint. \| The sample streams are egocentric views of human interaction and are the visual basis for many tasks. \| Third-person robot-camera footage. \|
	\| Fisheye camera \| A wide-angle camera with strong lens distortion. \| Multiple fisheye MP4 streams give broad room coverage but need calibration-aware interpretation. \| A rectilinear pinhole camera image. \|
	\| IMU \| An inertial measurement unit with accelerometer and gyroscope signals. \| Supports motion, temporal, and sensor-bridging tasks. \| Motion capture skeleton data. \|
	\| Metric depth \| Depth expressed in physical units rather than arbitrary relative scale. \| Useful for distance-sensitive spatial reasoning and reconstruction targets. \| Relative monocular depth. \|
	\| Motion capture \| A system that records body or hand motion over time. \| Provides hand/body motion evidence when exposed through public-safe derived features. \| Video-only pose estimation. \|
	\| RGB frame \| A color image frame from a video stream. \| Used for visual statistics, previews, and many model inputs. \| Depth values or point-cloud coordinates. \|
	\| Sensor alignment \| Putting different sensor streams into a shared temporal or spatial reference. \| Used to make video, audio, pose, depth, IMU, and mocap usable in the same task input. \| Model ensembling. \|
	\| Stereo camera \| A paired-camera setup that supports depth or geometry estimation. \| The sample browser exposes stereo streams as part of the visual modality set. \| Single-view RGB video. \|
	\| Timestamp synchronization \| Aligning sensor samples by time. \| The task suite assumes aligned windows across modalities so labels and features refer to the same moment. \| Randomly joining files with similar names. \|
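
Timestamp synchronization can be illustrated with a nearest-timestamp match between two streams. This is a sketch under stated assumptions, not the project's alignment pipeline: the function names, the `max_gap` tolerance, and the example timestamps are all hypothetical.

```python
from bisect import bisect_left


def nearest_index(sorted_ts, t):
    """Index of the timestamp in sorted_ts that is closest to t."""
    if not sorted_ts:
        raise ValueError("sorted_ts must not be empty")
    i = bisect_left(sorted_ts, t)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(sorted_ts)]
    return min(candidates, key=lambda j: abs(sorted_ts[j] - t))


def align(reference_ts, other_ts, max_gap):
    """For each reference timestamp, return the nearest index in other_ts.

    Returns None for a reference sample when the nearest match is
    further away than max_gap (same units as the timestamps).
    """
    matches = []
    for t in reference_ts:
        j = nearest_index(other_ts, t)
        matches.append(j if abs(other_ts[j] - t) <= max_gap else None)
    return matches


# Hypothetical video and IMU timestamps in seconds.
video_ts = [0.0, 0.1, 0.2]
imu_ts = [0.0, 0.04, 0.09, 0.21]
print(align(video_ts, imu_ts, max_gap=0.02))
```

Rejecting matches beyond a tolerance makes gaps visible instead of silently pairing samples from different moments.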

	### Spatial geometry

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| 3D reconstruction \| Recovering 3D scene structure from sensor data. \| One core spatial-intelligence direction for Xperience-style data. \| Next-action classification. \|
	\| Affordance \| An action possibility offered by an object or scene. \| Relevant when moving from observed human interaction to robot-action or VLA tasks. \| A detected object category alone. \|
	\| Camera pose \| The camera position and orientation at a time step. \| Supports spatial-intelligence tasks, view synchronization, and geometry diagnostics. \| The human body pose. \|
	\| Coordinate frame \| A reference system for positions and orientations. \| Needed when comparing camera, body, object, and world measurements. \| A video frame. \|
	\| Object-centric representation \| A representation organized around objects and their relations. \| Useful for object relevance, object-set forecast, and action-object relation tasks. \| A flat feature vector without object identity. \|
	\| Odometry \| Motion estimated from sensor changes over time. \| A relevant spatial term for ego-motion and camera-pose reasoning. \| Ground-truth motion capture. \|
	\| Point cloud \| A set of 3D points representing scene structure. \| A likely target or intermediate representation for spatial-intelligence extensions. \| A 2D image grid. \|
	\| SLAM \| Simultaneous localization and mapping. \| A field term for estimating camera motion and scene structure from sensor observations. \| A task label or action class. \|
	\| Spatial grounding \| Linking language or labels to locations, objects, or geometry. \| Connects language grounding tasks with 3D/spatial reasoning. \| General text classification. \|
	\| Trajectory \| A sequence of positions over time. \| Used for hand motion, camera motion, and future-path tasks. \| A single coordinate or label. \|
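
The roles of coordinate frames, camera extrinsics, and camera intrinsics can be sketched with an ideal pinhole projection. The assumptions here are loud ones: extrinsics are taken as a world-to-camera rotation and translation, lens distortion is ignored (so this does not model a fisheye camera), and all names and numbers are illustrative rather than taken from the project's calibration files.

```python
def world_to_camera(point_world, rotation, translation):
    """Apply extrinsics: p_cam = R * p_world + t, with R given as three row lists."""
    return tuple(
        sum(r * p for r, p in zip(row, point_world)) + t
        for row, t in zip(rotation, translation)
    )


def project_pinhole(point_cam, fx, fy, cx, cy):
    """Project a camera-frame 3D point to pixel coordinates (no distortion model)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is not in front of the camera")
    # Intrinsics: focal lengths scale the normalized ray, the principal point shifts it.
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical values: identity rotation, camera offset 1 unit along the optical axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = world_to_camera((1.0, 2.0, 3.0), identity, (0.0, 0.0, 1.0))
print(project_pinhole(p_cam, fx=100.0, fy=100.0, cx=50.0, cy=40.0))
```

Extrinsics change which frame a point is expressed in; intrinsics turn a camera-frame point into a pixel. Mixing the two up is the confusion the glossary rows warn about.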

	### Temporal and world models

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| Action forecasting \| Predicting a future action before it happens. \| Covered by next-action and long-horizon task contracts. \| Recognizing the current action only. \|
	\| Autoregressive prediction \| Generating each future token, state, or frame conditioned on prior outputs. \| Relevant for model branches that produce structured JSON or temporal predictions. \| A one-shot classifier. \|
	\| Forward dynamics \| Predicting the next state from the current state and action/context. \| The Cosmos3-Super LoRA branch uses a forward-dynamics-style diagnostic contract. \| Reverse inference from result back to cause. \|
	\| Latent state \| A hidden representation that summarizes observed context. \| Useful for future foundation-model and world-model training plans. \| A visible annotation column. \|
	\| Long-horizon prediction \| Predicting outcomes several seconds or steps ahead. \| Tasks 13 and 14 test longer temporal context beyond immediate recognition. \| Single-frame classification. \|
	\| Next-frame prediction \| Predicting future visual frames from past frames. \| A field-level world-model objective related to the human-video world-model direction. \| Next-action prediction. \|
	\| Object persistence \| Tracking that an object remains present over time even when view or interaction changes. \| Relevant for object-set forecast and long-video reasoning. \| A single-frame object detection. \|
	\| Rollout \| Repeatedly predicting future steps from a model state. \| Important for judging world models beyond one-step prediction. \| A held-out static test row. \|
	\| Subtask forecasting \| Predicting the next higher-level step in an activity. \| Used in the future-task probe line for Qwen3-Omni. \| Frame-level action classification. \|
	\| Teacher forcing \| Training a sequence model using ground-truth previous outputs. \| A likely training option for future sequence/world-model baselines. \| Free-running rollout evaluation. \|
	\| Temporal leakage \| Using future information that would not be available at prediction time. \| Avoided by chronological splits and target-side feature controls. \| A low model score. \|
	\| Transition timing \| Estimating when the next state or action transition happens. \| Task 20 turns temporal change into a regression target. \| Classifying the transition type only. \|
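
The difference between a rollout and teacher-forced prediction can be shown with a toy one-step model. This sketch assumes a scalar state and a hypothetical model that slightly underestimates a doubling process; it is not tied to any model branch in this project.

```python
def rollout(step_fn, initial_state, num_steps):
    """Free-running prediction: each step consumes the model's own previous output."""
    states, state = [], initial_state
    for _ in range(num_steps):
        state = step_fn(state)
        states.append(state)
    return states


def teacher_forced(step_fn, true_states):
    """One-step predictions where every input is a ground-truth state."""
    return [step_fn(s) for s in true_states]


def toy_model(state):
    """Hypothetical imperfect model of a process that doubles every step."""
    return 1.9 * state


true_states = [1.0, 2.0, 4.0, 8.0]
free_running = rollout(toy_model, true_states[0], num_steps=3)
one_step = teacher_forced(toy_model, true_states[:3])

# Compare each prediction with the true next state.
for target, fr, ts in zip(true_states[1:], free_running, one_step):
    print(f"target={target} rollout_error={abs(fr - target):.3f} "
          f"one_step_error={abs(ts - target):.3f}")
```

In the rollout the per-step error compounds because each prediction is fed back in, which is why a good one-step score does not by itself show that a world model is usable over a long horizon.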

	### Robotics and VLA

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| Action chunk \| A short sequence of low-level actions predicted together. \| The VLA figure and plan use action chunks as the policy-output concept. \| A natural-language action label. \|
	\| Behavior cloning \| A supervised imitation-learning method for predicting demonstrated actions. \| A plausible baseline once action targets are converted. \| Generative video modeling. \|
	\| Contact event \| A moment when a hand, body, or tool touches an object or surface. \| Used in contact-related tasks and action-quality interpretation. \| Visual co-occurrence without touch. \|
	\| Dexterity \| Fine-grained physical manipulation ability. \| Relevant to hand-object interaction, contact, and VLA/policy directions. \| High text-generation accuracy. \|
	\| End effector \| The robot part that acts on the world, such as a gripper or hand. \| A key target frame for future manipulation-policy conversion. \| A camera or global scene coordinate. \|
	\| Hand-object interaction \| A physical interaction between hands and objects. \| A central signal family behind action, contact, object relevance, and interaction-text tasks. \| Object detection without action. \|
	\| Imitation learning \| Training a policy to imitate demonstrated behavior. \| Relevant when converting human video/motion into action supervision. \| Reinforcement learning from online robot trials. \|
	\| Language grounding \| Connecting text to observed objects, actions, or spatial context. \| Task 8 and VLA directions use language as grounded supervision rather than standalone text. \| Caption fluency alone. \|
	\| Policy \| A mapping from observations to actions. \| A future target for robot-compatible Xperience-derived action data. \| A benchmark metric. \|
	\| Robot-compatible action target \| An action representation a robot policy can execute or imitate. \| Needed before OpenVLA/openpi/GR00T-style policy training is meaningful here. \| Human-only caption text. \|
	\| Vision-language-action model \| A model that maps visual context and language into actions. \| The VLA direction is a future path after action targets are converted into robot-compatible chunks. \| A vision-language model that only answers text. \|

	### Tasks and metrics

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| Compact-proxy score \| A bounded proxy metric when a direct raw target is not publicly available. \| Kept explicit in the matrix and gap audit so readers do not over-read it. \| A direct target measurement. \|
	\| Direct score \| A metric computed against the task target directly. \| The preferred score type in the 20-task matrix. \| Compact-proxy score. \|
	\| Gap audit \| A coverage and source-status audit. \| Explains scored, proxy, and unsupported cells. \| A performance leaderboard. \|
	\| Leakage control \| A split or feature rule that prevents using target information unfairly. \| Chronological splits, held-out splits, and source audits protect task interpretation. \| Lower training accuracy. \|
	\| Normalized radar value \| A 0-1 plotting value used only to draw comparable radar polygons. \| Helps visualize metrics with different scales and directions. \| The raw metric value to cite. \|
	\| Raw metric value \| The original metric value emitted by the runner or verified result package. \| This is the value to cite from the 180-result table. \| The normalized radar value. \|
	\| Task contract \| The definition of one benchmark task. \| Includes input, target/output, metric, split, source artifact, and limitation. \| A model architecture. \|
	\| Task-method record \| One method evaluated on one task. \| 9 methods x 20 tasks gives 180 public result records. \| A single prediction row. \|
	\| Unified 20-task suite \| The current task surface. \| All 20 task contracts are presented together and scored across methods where real artifacts exist. \| Historical tier2_task_suite filenames, which are provenance paths rather than a second suite. \|
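
The distinction between a raw metric value and a normalized radar value can be illustrated with min-max scaling plus a direction flip for lower-is-better metrics. Min-max scaling is an assumption made for this sketch; the glossary only states that radar values lie in 0-1 and are used for plotting, not how the project computes them.

```python
def normalize_for_radar(raw_values, higher_is_better=True):
    """Map one metric's raw values across methods to 0-1 plotting values.

    Assumes min-max scaling. Lower-is-better metrics (for example an error)
    are flipped so that a larger radar value always means a better method.
    """
    lo, hi = min(raw_values), max(raw_values)
    if hi == lo:
        # All methods tie on this metric; place them mid-scale.
        return [0.5 for _ in raw_values]
    scaled = [(v - lo) / (hi - lo) for v in raw_values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


# Hypothetical raw values for three methods on one task.
print(normalize_for_radar([2.0, 4.0, 6.0], higher_is_better=True))
print(normalize_for_radar([2.0, 4.0, 6.0], higher_is_better=False))
```

Because the scaled value depends on which methods are in the comparison, it changes when a method is added or removed. That is why the raw metric value is the one to cite.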

	### Training and evaluation

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| Adapter checkpoint \| Saved adapter weights from a fine-tuning run. \| The public model branches publish adapters when validated and public-safe. \| Full base-model checkpoint. \|
	\| Balanced accuracy \| Accuracy averaged across classes to reduce majority-class dominance. \| Useful for imbalanced task labels. \| Overall accuracy. \|
	\| Chronological split \| A split ordered by time. \| Used for the single-episode baselines to reduce future-window leakage. \| A random row split. \|
	\| Confusion matrix \| A table of predicted classes versus true classes. \| Helps inspect which task labels a method confuses. \| A scalar leaderboard score. \|
	\| FSDP \| Fully Sharded Data Parallel, a distributed training strategy. \| Appears in full-parameter feasibility and multi-GPU training notes. \| A model architecture. \|
	\| Held-out evaluation \| Testing on examples not used for training. \| Required before promoting Qwen/Cosmos results to public evidence. \| Training-set loss. \|
	\| JSON validity \| Whether model output parses as the required JSON schema. \| A key diagnostic for Qwen3-Omni structured-output runs. \| Task correctness after parsing. \|
	\| Macro F1 \| The average F1 score across classes, usually treating classes equally. \| Used when class imbalance matters in classification tasks. \| Accuracy dominated by frequent classes. \|
	\| Mean absolute error \| The average absolute difference between predicted and true numeric values. \| Used for regression-style task rows such as timing or trajectory targets. \| A classification F1 score. \|
	\| Overfit check \| A small training test that verifies a model can learn a tiny subset. \| Useful for catching data/model wiring bugs before full training. \| Evidence of generalization. \|
	\| Parameter-efficient fine-tuning \| Updating a small number of added or selected parameters. \| LoRA is the current parameter-efficient path for Qwen/Cosmos branches. \| Full-parameter fine-tuning. \|
	\| Schema compliance \| Whether an output follows the expected field names and value types. \| Needed for structured task probes and public package validation. \| High semantic accuracy. \|
	\| Smoke run \| A short run that checks whether a pipeline can start and execute key steps. \| Used for feasibility gates before expensive full runs. \| A complete benchmark result. \|
	\| Top-k accuracy \| A score that counts a prediction correct if the target is among the k highest-ranked outputs. \| Useful for large-label or retrieval-style tasks. \| Top-1 exact accuracy. \|
	\| Train/validation/test split \| A partition that separates model fitting, tuning, and final evaluation examples. \| The selected-128 setup uses a held-out split discipline for model branches. \| A random shuffle without temporal or episode boundaries. \|
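
The classification and regression metrics above can be written out in a few lines of plain Python. This is a reference sketch using the standard definitions, not the project's scoring code; the labels and predictions are made up to show how a majority-class predictor looks under each metric.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in labels or predictions."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def top_k_accuracy(y_true, ranked_preds, k):
    """ranked_preds[i] lists candidate labels for example i, best first."""
    return sum(t in r[:k] for t, r in zip(y_true, ranked_preds)) / len(y_true)


# A predictor that always answers the majority class "a".
labels = ["a", "a", "a", "b"]
preds = ["a", "a", "a", "a"]
print(accuracy(labels, preds), balanced_accuracy(labels, preds), macro_f1(labels, preds))
```

On this toy input overall accuracy looks strong while balanced accuracy and macro F1 expose that class "b" is never predicted, which is the distinction the glossary rows draw.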

	### Models and runs

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| Cosmos3-Nano \| A smaller Cosmos3 compatibility/future-window branch. \| Used for the Nano Future Window row and related diagnostics. \| Cosmos3-Super fine-tuned adapter. \|
	\| Cosmos3-Super \| The larger Cosmos3-style branch tracked in this project. \| Published as Reasoner diagnostics and a separate forward-dynamics LoRA adapter/result branch when verified. \| Cosmos3-Nano. \|
	\| Foundation pipeline \| A high-level training direction. \| Spatial intelligence, human-video world modeling, and vision-language-action are documented as trainable directions with task mappings. \| A completed public result row. \|
	\| Full-parameter fine-tuning \| Updating the whole model rather than only adapters. \| This project records feasibility gates and short pilots, but does not publish full checkpoints. \| LoRA adapter publication. \|
	\| Human-video world model \| Learning future frames, actions, and interaction dynamics from human video. \| Uses temporal prediction, next-action, transition, and object-forecast tasks. \| Robot policy execution. \|
	\| LoRA adapter \| A lightweight set of trainable adapter weights. \| Published only when the package is verified and public-safe. \| Full base-model weights. \|
	\| Metadata baseline \| A selected-128 baseline using metadata or text-derived public-safe features. \| Compares simple and neural heads on the held-out split. \| Raw video, depth, or audio feature baselines. \|
	\| Minimal baseline \| A simple non-neural task head, informally called the "minimum" reference row. \| Provides a reproducible lower-complexity comparison for task feasibility. \| Metadata-only selected-128 baseline family. \|
	\| Neural MLP \| A compact neural task head. \| Used for single-episode and selected-128 baseline comparisons. \| Foundation-model fine-tuning. \|
	\| Qwen v1-v6 \| The Qwen3-Omni run lineage. \| v1-v4 are earlier pipeline/ablation evidence, v5 is the prior pinned release, and v6 is the current public 20-task row. \| Six different evidence lines. \|
	\| Qwen3-Omni \| The multimodal foundation-model family used for the Qwen branch. \| The current public 20-task Qwen row is Qwen3-Omni v6 LoRA plus task-specific probes. \| Cosmos3 or single-episode task-head baselines. \|
	\| Raw-feature baseline \| A selected-128 baseline using exported public-safe raw-feature groups. \| Tracks what non-foundation heads can do with richer processed inputs. \| Raw gated media redistribution. \|
	\| Simple baseline \| A non-neural baseline family for the selected-128 rows. \| Used for metadata/text and raw-feature 128-episode comparisons before NN/foundation-model rows. \| The single-episode Minimal baseline. \|
	\| Spatial intelligence \| Learning geometry and spatial reasoning from egocentric data. \| Uses video, depth, camera pose, and language tasks to target 3D/space reasoning. \| World-model future prediction. \|
	\| Vision-language-action \| Mapping perception and language to action chunks. \| A future policy/VLA direction that needs action-target conversion and stronger policy packaging. \| Qwen3-Omni diagnostic scoring. \|

	### Public surfaces

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| HF artifact dataset \| Hugging Face dataset repo for derived evidence. \| Stores public-safe reports, metrics, website JSON, and sanitized result packages. \| Original Xperience-10M dataset. \|
	\| HF baseline model repo \| Hugging Face model repo for lightweight baseline artifacts. \| Mirrors baseline weights, figures, metrics, and task artifacts. \| Qwen/Cosmos adapter-specific repos. \|
	\| HF Space \| Hugging Face-hosted app/site surface. \| Mirrors the dashboard and static website assets. \| HF artifact dataset or model repo. \|
	\| HF weights/results repo \| A consolidated public-safe model-result bundle. \| Groups baseline weights, verified model artifacts, analysis files, and manifests. \| The upstream raw dataset. \|
	\| Mirror parity \| A check that public copies match the source files. \| Records whether GitHub, website, and HF mirrors agree. \| A model-quality metric. \|
	\| Public-safe artifact \| A file that can be mirrored publicly without raw gated content. \| Metrics, JSON summaries, model cards, figures, derived manifests, and approved lightweight weights/adapters. \| Raw dataset redistribution. \|
	\| Publication audit \| A public-package validation report. \| Confirms required files exist and forbidden raw/private assets are not included. \| Scientific peer review. \|
	\| Verified package \| A result or artifact bundle that passed local/public validators. \| Only verified packages are promoted to README, website, and HF surfaces as public evidence. \| A running or exploratory experiment. \|
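
A mirror parity check can be sketched as a content-hash comparison between a source file set and a mirrored copy. This is an illustration only: the function names, the report keys, and the in-memory `{path: bytes}` inputs are hypothetical, and the actual layout of `docs/data/mirror_parity.json` is not assumed here.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def parity_report(source, mirror):
    """Compare two {relative_path: bytes} mappings by content hash."""
    report = {"match": [], "mismatch": [], "missing_in_mirror": []}
    for path in sorted(source):
        if path not in mirror:
            report["missing_in_mirror"].append(path)
        elif sha256_hex(source[path]) == sha256_hex(mirror[path]):
            report["match"].append(path)
        else:
            report["mismatch"].append(path)
    # Files that exist only on the mirror are also worth surfacing.
    report["extra_in_mirror"] = sorted(set(mirror) - set(source))
    return report


# Hypothetical file contents standing in for a repo and one of its mirrors.
source_files = {"README.md": b"v2", "docs/data/glossary.json": b"{}"}
mirror_files = {"README.md": b"v1", "docs/data/glossary.json": b"{}", "old.txt": b""}
print(parity_report(source_files, mirror_files))
```

A check of this kind says whether copies agree byte for byte. It says nothing about model quality, which matches the "Do not confuse with" column for mirror parity.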

	## File Entry Points

	\| Need \| Open \|
	\| --- \| --- \|
	\| Reader navigation \| `PUBLIC_READER_MAP.md`, `docs/data/public_reader_map.json` \|
	\| Task definitions \| `TASK_SUITE_20.md`, `docs/data/task_suite_20.json` \|
	\| Result matrix \| `TASK_METHOD_20_RESULT_MATRIX.md`, `docs/data/task_method_20_result_matrix.json` \|
	\| Direct/proxy status \| `TASK_METHOD_20_GAP_AUDIT.md`, `docs/data/task_method_20_gap_audit.json` \|
	\| Qwen lineage \| `QWEN3_OMNI_RUN_LINEAGE.md`, `docs/data/qwen3_omni_run_lineage.json` \|
	\| 128-episode source/features \| `XPERIENCE10M_128_EPISODE_FEATURE_INDEX.md`, `docs/data/xperience10m_128_episode_feature_index.json` \|
	\| Public mirrors \| `PUBLIC_SURFACE_QA.md`, `docs/data/mirror_parity.json`, `docs/data/live_publication_status.json` \|

	# Glossary

	This glossary defines project-specific terms and adjacent technical field terms that can be easy to confuse across the GitHub repo, website, Hugging Face Space, artifact dataset, model repos, result matrices, and embodied-AI training discussions. Use it with `PUBLIC_READER_MAP.md` when choosing what to read first, and with `docs/data/glossary.json` when a tool needs the same terms in machine-readable form.

	## How To Read The Terms

	\| Category \| What it clarifies \|
	\| --- \| --- \|
	\| Dataset and scope \| Public data boundaries, evidence lines, and how each result family should be read. \|
	\| Files and features \| Raw sample files, windows, feature manifests, and public-safe derivatives. \|
	\| Multimodal sensing \| Video, audio, depth, IMU, motion capture, calibration, and synchronization terms. \|
	\| Spatial geometry \| Camera pose, SLAM, coordinate frames, point clouds, 3D reconstruction, and spatial grounding. \|
	\| Temporal and world models \| Future prediction, rollouts, forward dynamics, long-horizon forecasting, and temporal leakage. \|
	\| Robotics and VLA \| Vision-language-action, policies, action chunks, imitation learning, contact, and dexterity. \|
	\| Tasks and metrics \| Task contracts, scored records, direct scores, compact proxies, and audits. \|
	\| Training and evaluation \| Splits, held-out evaluation, metric types, prompt/schema checks, adapters, and distributed training. \|
	\| Models and runs \| Baseline families, Qwen3-Omni, Cosmos3, LoRA adapters, and full-parameter gates. \|
	\| Public surfaces \| GitHub, website, Hugging Face repos, parity checks, and package validation. \|

	## Core And Field Terms

	### Dataset and scope

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| Evidence line \| A reading lane for a group of results. \| Line 1 is one public sample episode; Line 2 is selected-128 held-out comparison. \| Qwen run versions v1-v6, which are model-run lineage. \|
	\| Official gated data \| Upstream files that require official dataset access. \| Raw Xperience-10M MP4/HDF5/RRD files and full source directories remain outside the public repo. \| Public-safe metrics, derived features, figures, and manifests. \|
	\| Public sample episode \| One officially available sample episode. \| The fully inspectable Line 1 unit used for raw-file browsing, 20-frame windows, task construction, and single-episode baselines. \| The selected-128 comparison rows. \|
	\| Selected 128 episodes \| A public-safe selected subset of official gated episode paths. \| Line 2 uses derived windows/features and keeps links back to official episode ids and gated source paths. \| Redistributed raw MP4/HDF5/RRD data. \|
	\| Xperience-10M \| The upstream embodied human-interaction dataset. \| Source dataset behind the public sample, selected-128 features, task suite, and model diagnostics. \| This repo, which only redistributes public-safe derived artifacts. \|

	### Files and features

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| 20-frame window \| A fixed short clip slice. \| The sample episode is converted into aligned 20-frame units for features, labels, and many task heads. \| A full episode or arbitrary video segment. \|
	\| annotation.hdf5 \| Upstream annotation container for the sample. \| Contains original labels/metadata; some public derived files expose processed features instead of every raw text field. \| Task result summaries. \|
	\| Episode \| One recorded interaction sequence. \| The basic source unit behind windows, labels, and train/val/test splits. \| A 20-frame window. \|
	\| Feature manifest \| A map from model-input columns to source modalities. \| Explains feature groups and dimensions for the sample task suite. \| The raw annotation file. \|
	\| Interaction text \| Natural-language interaction/caption content. \| Used by task 15 and some derived text features; public matrices record direct or compact-proxy status. \| Numeric action ids or subtask ids. \|
	\| Modality \| A type of signal. \| Video, audio, depth, pose/SLAM, motion capture, inertial, calibration, and language-derived signals. \| A task target. \|
	\| Raw sample file map \| A human-readable inventory of the sample episode files. \| Explains videos, annotations, calibration, motion, and derived previews. \| A training manifest. \|
	\| visualization.rrd \| Rerun viewer recording for visual inspection. \| Can be downloaded from the official sample dataset and opened in Rerun 0.29.0 to inspect the sample episode. It is not used for published training or metric rows. \| MP4 video streams or model inputs. \|
	\| Window stride \| The frame step between neighboring windows. \| Creates overlapping examples while preserving chronological order and leakage controls. \| Video frame rate. \|

	### Multimodal sensing

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| Audio waveform \| A time-series pressure signal from sound. \| The audio ablation measures whether embedded audio helps selected task contracts. \| Language captions or text labels. \|
	\| Calibration \| Parameters that relate sensors to each other and to physical space. \| Needed to interpret camera streams, depth, pose, and synchronized multimodal features together. \| A model training hyperparameter. \|
	\| Camera extrinsics \| A camera position and orientation relative to another coordinate frame. \| Connects different camera streams and world coordinates. \| Camera intrinsics. \|
	\| Camera intrinsics \| Internal camera parameters such as focal length and distortion. \| Explain how image pixels project to rays for geometry tasks. \| Camera extrinsics. \|
	\| Depth map \| A per-pixel estimate of distance from the camera. \| Depth-derived signals support spatial and geometry-oriented tasks. \| RGB brightness or semantic segmentation. \|
	\| Egocentric video \| Video captured from a first-person or body-mounted viewpoint. \| The sample streams are egocentric views of human interaction and are the visual basis for many tasks. \| Third-person robot-camera footage. \|
	\| Fisheye camera \| A wide-angle camera with strong lens distortion. \| Multiple fisheye MP4 streams give broad room coverage but need calibration-aware interpretation. \| A rectilinear pinhole camera image. \|
	\| IMU \| An inertial measurement unit with accelerometer and gyroscope signals. \| Supports motion, temporal, and sensor-bridging tasks. \| Motion capture skeleton data. \|
	\| Metric depth \| Depth expressed in physical units rather than arbitrary relative scale. \| Useful for distance-sensitive spatial reasoning and reconstruction targets. \| Relative monocular depth. \|
	\| Motion capture \| A system that records body or hand motion over time. \| Provides hand/body motion evidence when exposed through public-safe derived features. \| Video-only pose estimation. \|
	\| RGB frame \| A color image frame from a video stream. \| Used for visual statistics, previews, and many model inputs. \| Depth values or point-cloud coordinates. \|
	\| Sensor alignment \| Putting different sensor streams into a shared temporal or spatial reference. \| Used to make video, audio, pose, depth, IMU, and mocap usable in the same task input. \| Model ensembling. \|
	\| Stereo camera \| A paired-camera setup that supports depth or geometry estimation. \| The sample browser exposes stereo streams as part of the visual modality set. \| Single-view RGB video. \|
	\| Timestamp synchronization \| Aligning sensor samples by time. \| The task suite assumes aligned windows across modalities so labels and features refer to the same moment. \| Randomly joining files with similar names. \|
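
The Timestamp synchronization and Sensor alignment entries can be illustrated with nearest-timestamp matching. This is a minimal sketch under stated assumptions, not the project's alignment code: the function names `nearest_index` and `align`, the `max_gap` tolerance, and the millisecond timestamps are all hypothetical.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Return the index of the sorted timestamp closest to t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer in time (ties go to the earlier one).
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align(reference_ts, other_ts, max_gap):
    """Pair each reference timestamp with the nearest sample of another stream.

    Returns one index into other_ts per reference timestamp, or None when
    the nearest sample is further away than max_gap (hypothetical tolerance).
    """
    if not other_ts:
        return [None] * len(reference_ts)
    pairs = []
    for t in reference_ts:
        j = nearest_index(other_ts, t)
        pairs.append(j if abs(other_ts[j] - t) <= max_gap else None)
    return pairs


# Hypothetical timestamps in milliseconds: video frames vs. IMU samples.
video_ms = [0, 500, 1000]
imu_ms = [0, 100, 200, 300, 400, 520, 900]
print(align(video_ms, imu_ms, max_gap=50))  # [0, 5, None]
```

Returning `None` for gaps above the tolerance keeps unmatched windows visible instead of silently pairing samples that do not refer to the same moment.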

	### Spatial geometry

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| 3D reconstruction \| Recovering 3D scene structure from sensor data. \| One core spatial-intelligence direction for Xperience-style data. \| Next-action classification. \|
	\| Affordance \| An action possibility offered by an object or scene. \| Relevant when moving from observed human interaction to robot-action or VLA tasks. \| A detected object category alone. \|
	\| Camera pose \| The camera position and orientation at a time step. \| Supports spatial-intelligence tasks, view synchronization, and geometry diagnostics. \| The human body pose. \|
	\| Coordinate frame \| A reference system for positions and orientations. \| Needed when comparing camera, body, object, and world measurements. \| A video frame. \|
	\| Object-centric representation \| A representation organized around objects and their relations. \| Useful for object relevance, object-set forecast, and action-object relation tasks. \| A flat feature vector without object identity. \|
	\| Odometry \| Motion estimated from sensor changes over time. \| A relevant spatial term for ego-motion and camera-pose reasoning. \| Ground-truth motion capture. \|
	\| Point cloud \| A set of 3D points representing scene structure. \| A likely target or intermediate representation for spatial-intelligence extensions. \| A 2D image grid. \|
	\| SLAM \| Simultaneous localization and mapping. \| A field term for estimating camera motion and scene structure from sensor observations. \| A task label or action class. \|
	\| Spatial grounding \| Linking language or labels to locations, objects, or geometry. \| Connects language grounding tasks with 3D/spatial reasoning. \| General text classification. \|
	\| Trajectory \| A sequence of positions over time. \| Used for hand motion, camera motion, and future-path tasks. \| A single coordinate or label. \|
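
The Camera intrinsics, Camera extrinsics, Depth map, Coordinate frame, and Point cloud entries fit together in one short calculation. The sketch below assumes an undistorted pinhole camera, so it does not apply directly to fisheye streams, and every name and number in it (`backproject`, `to_world`, the focal lengths, the principal point) is illustrative rather than taken from the project's calibration files.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with metric depth to a 3D point in the camera frame.

    fx, fy are focal lengths in pixels and (cx, cy) is the principal point:
    these are the camera intrinsics of an ideal pinhole model.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_world(point_cam, rotation, translation):
    """Apply camera-to-world extrinsics: p_world = R @ p_cam + t."""
    return tuple(
        sum(rotation[r][c] * point_cam[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


# Hypothetical intrinsics for a 640x480 image.
point_cam = backproject(u=420, v=240, depth=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# Hypothetical extrinsics: no rotation, camera shifted 1 m along world z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
point_world = to_world(point_cam, identity, (0.0, 0.0, 1.0))
print(point_cam, point_world)
```

Running `backproject` over every pixel of a depth map and collecting the results is one simple way to obtain a point cloud in the camera frame.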

	### Temporal and world models

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| Action forecasting \| Predicting a future action before it happens. \| Covered by next-action and long-horizon task contracts. \| Recognizing the current action only. \|
	\| Autoregressive prediction \| Generating each future token, state, or frame conditioned on prior outputs. \| Relevant for model branches that produce structured JSON or temporal predictions. \| A one-shot classifier. \|
	\| Forward dynamics \| Predicting the next state from the current state and action/context. \| The Cosmos3-Super LoRA branch uses a forward-dynamics-style diagnostic contract. \| Reverse inference from result back to cause. \|
	\| Latent state \| A hidden representation that summarizes observed context. \| Useful for future foundation-model and world-model training plans. \| A visible annotation column. \|
	\| Long-horizon prediction \| Predicting outcomes several seconds or steps ahead. \| Tasks 13 and 14 test longer temporal context beyond immediate recognition. \| Single-frame classification. \|
	\| Next-frame prediction \| Predicting future visual frames from past frames. \| A field-level world-model objective related to the human-video world-model direction. \| Next-action prediction. \|
	\| Object persistence \| Tracking that an object remains present over time even when view or interaction changes. \| Relevant for object-set forecast and long-video reasoning. \| Single-frame object detection. \|
	\| Rollout \| Repeatedly predicting future steps from a model state. \| Important for judging world models beyond one-step prediction. \| A held-out static test row. \|
	\| Subtask forecasting \| Predicting the next higher-level step in an activity. \| Used in the future-task probe line for Qwen3-Omni. \| Frame-level action classification. \|
	\| Teacher forcing \| Training a sequence model using ground-truth previous outputs. \| A likely training option for future sequence/world-model baselines. \| Free-running rollout evaluation. \|
	\| Temporal leakage \| Using future information that would not be available at prediction time. \| Avoided by chronological splits and target-side feature controls. \| A low model score. \|
	\| Transition timing \| Estimating when the next state or action transition happens. \| Task 20 turns temporal change into a regression target. \| Classifying the transition type only. \|
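
The difference between the Teacher forcing and Rollout entries is easiest to see on a toy model. The sketch below is an assumption-laden illustration: `step` is a hypothetical one-step predictor that is deliberately biased, and the observed sequence is made up.

```python
def teacher_forced(step, observed):
    """One-step predictions where each input is the ground-truth previous state."""
    return [step(s) for s in observed[:-1]]


def rollout(step, initial, horizon):
    """Free-running predictions where each input is the model's own last output."""
    states, s = [], initial
    for _ in range(horizon):
        s = step(s)
        states.append(s)
    return states


def step(s):
    """Hypothetical biased model: predicts +1 while the true dynamics are +2."""
    return s + 1


observed = [0, 2, 4, 6]   # made-up ground-truth states
targets = observed[1:]

tf_errors = [abs(p - t) for p, t in zip(teacher_forced(step, observed), targets)]
ro_errors = [abs(p - t) for p, t in zip(rollout(step, observed[0], 3), targets)]
print(tf_errors)  # [1, 1, 1]
print(ro_errors)  # [1, 2, 3]
```

The per-step error stays constant under teacher forcing but compounds in the rollout, which is why one-step scores alone can overstate how well a world model predicts several steps ahead.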

	### Robotics and VLA

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| Action chunk \| A short sequence of low-level actions predicted together. \| The VLA figure and plan use action chunks as the policy-output concept. \| A natural-language action label. \|
	\| Behavior cloning \| A supervised imitation-learning method for predicting demonstrated actions. \| A plausible baseline once action targets are converted. \| Generative video modeling. \|
	\| Contact event \| A moment when a hand, body, or tool touches an object or surface. \| Used in contact-related tasks and action-quality interpretation. \| Visual co-occurrence without touch. \|
	\| Dexterity \| Fine-grained physical manipulation ability. \| Relevant to hand-object interaction, contact, and VLA/policy directions. \| High text-generation accuracy. \|
	\| End effector \| The robot part that acts on the world, such as a gripper or hand. \| A key target frame for future manipulation-policy conversion. \| A camera or global scene coordinate. \|
	\| Hand-object interaction \| A physical interaction between hands and objects. \| A central signal family behind action, contact, object relevance, and interaction-text tasks. \| Object detection without action. \|
	\| Imitation learning \| Training a policy to imitate demonstrated behavior. \| Relevant when converting human video/motion into action supervision. \| Reinforcement learning from online robot trials. \|
	\| Language grounding \| Connecting text to observed objects, actions, or spatial context. \| Task 8 and VLA directions use language as grounded supervision rather than standalone text. \| Caption fluency alone. \|
	\| Policy \| A mapping from observations to actions. \| A future target for robot-compatible Xperience-derived action data. \| A benchmark metric. \|
	\| Robot-compatible action target \| An action representation a robot policy can execute or imitate. \| Needed before OpenVLA/openpi/GR00T-style policy training is meaningful here. \| Human-only caption text. \|
	\| Vision-language-action model \| A model that maps visual context and language into actions. \| The VLA direction is a future path after action targets are converted into robot-compatible chunks. \| A vision-language model that only answers text. \|

	### Tasks and metrics

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| Compact-proxy score \| A bounded proxy metric when a direct raw target is not publicly available. \| Kept explicit in the matrix and gap audit so readers do not over-read it. \| A direct target measurement. \|
	\| Direct score \| A metric computed against the task target directly. \| The preferred score type in the 20-task matrix. \| Compact-proxy score. \|
	\| Gap audit \| A coverage and source-status audit. \| Explains scored, proxy, and unsupported cells. \| A performance leaderboard. \|
	\| Leakage control \| A split or feature rule that prevents using target information unfairly. \| Chronological splits, held-out splits, and source audits protect task interpretation. \| Lower training accuracy. \|
	\| Normalized radar value \| A 0-1 plotting value used only to draw comparable radar polygons. \| Helps visualize metrics with different scales and directions. \| The raw metric value to cite. \|
	\| Raw metric value \| The original metric value emitted by the runner or verified result package. \| This is the value to cite from the 180-result table. \| The normalized radar value. \|
	\| Task contract \| The definition of one benchmark task. \| Includes input, target/output, metric, split, source artifact, and limitation. \| A model architecture. \|
	\| Task-method record \| One method evaluated on one task. \| 9 methods x 20 tasks gives 180 public result records. \| A single prediction row. \|
	\| Unified 20-task suite \| The current task surface. \| All 20 task contracts are presented together and scored across methods where real artifacts exist. \| Historical tier2_task_suite filenames, which are provenance paths rather than a second suite. \|
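
To make the distinction between a Raw metric value and a Normalized radar value concrete, here is one common way to map metrics with different scales and directions onto a 0-1 plotting range. It is a sketch only: the project's actual normalization may differ, and `normalize_for_radar`, its bounds, and the `higher_is_better` flag are hypothetical names.

```python
def normalize_for_radar(raw, lo, hi, higher_is_better=True):
    """Min-max scale a raw metric into [0, 1] for plotting.

    lo and hi are the assumed plotting bounds for this metric. Metrics where
    lower is better (for example an error) are flipped so that a larger
    radar value always means a better result.
    """
    if hi == lo:
        return 0.0  # degenerate range: nothing to scale
    scaled = (raw - lo) / (hi - lo)
    scaled = min(1.0, max(0.0, scaled))  # clamp values outside the bounds
    return scaled if higher_is_better else 1.0 - scaled


# An accuracy-like metric and an error-like metric land on the same axis scale.
print(normalize_for_radar(0.75, 0.0, 1.0))                          # 0.75
print(normalize_for_radar(2.0, 0.0, 8.0, higher_is_better=False))   # 0.75
```

Because both calls print the same plotting value for very different raw numbers, the radar value is useful for shape comparison only; the raw metric value remains the number to cite.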

	### Training and evaluation

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| Adapter checkpoint \| Saved adapter weights from a fine-tuning run. \| The public model branches publish adapters when validated and public-safe. \| Full base-model checkpoint. \|
	\| Balanced accuracy \| The mean of per-class recall, which reduces majority-class dominance. \| Useful for imbalanced task labels. \| Overall accuracy. \|
	\| Chronological split \| A split ordered by time. \| Used for the single-episode baselines to reduce future-window leakage. \| A random row split. \|
	\| Confusion matrix \| A table of predicted classes versus true classes. \| Helps inspect which task labels a method confuses. \| A scalar leaderboard score. \|
	\| FSDP \| Fully Sharded Data Parallel, a distributed training strategy. \| Appears in full-parameter feasibility and multi-GPU training notes. \| A model architecture. \|
	\| Held-out evaluation \| Testing on examples not used for training. \| Required before promoting Qwen/Cosmos results to public evidence. \| Training-set loss. \|
	\| JSON validity \| Whether model output parses as JSON that fits the required schema. \| A key diagnostic for Qwen3-Omni structured-output runs. \| Task correctness after parsing. \|
	\| Macro F1 \| The unweighted mean of per-class F1 scores, so every class counts equally. \| Used when class imbalance matters in classification tasks. \| Accuracy dominated by frequent classes. \|
	\| Mean absolute error \| The average absolute difference between predicted and true numeric values. \| Used for regression-style task rows such as timing or trajectory targets. \| A classification F1 score. \|
	\| Overfit check \| A small training test that verifies a model can learn a tiny subset. \| Useful for catching data/model wiring bugs before full training. \| Evidence of generalization. \|
	\| Parameter-efficient fine-tuning \| Updating a small number of added or selected parameters. \| LoRA is the current parameter-efficient path for Qwen/Cosmos branches. \| Full-parameter fine-tuning. \|
	\| Schema compliance \| Whether an output follows the expected field names and value types. \| Needed for structured task probes and public package validation. \| High semantic accuracy. \|
	\| Smoke run \| A short run that checks whether a pipeline can start and execute key steps. \| Used for feasibility gates before expensive full runs. \| A complete benchmark result. \|
	\| Top-k accuracy \| A score that counts a prediction correct if the target is among the k highest-ranked outputs. \| Useful for large-label or retrieval-style tasks. \| Top-1 exact accuracy. \|
	\| Train/validation/test split \| A partition that separates model fitting, tuning, and final evaluation examples. \| The selected-128 setup uses a held-out split discipline for model branches. \| A random shuffle without temporal or episode boundaries. \|
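
The Balanced accuracy, Macro F1, Top-k accuracy, and Mean absolute error entries follow standard textbook definitions, sketched below in plain Python. The function names and the toy labels are illustrative; the project's runners may compute these metrics with a library and with different edge-case handling.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    scores = []
    for c in set(y_true) | set(y_pred):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def top_k_accuracy(y_true, ranked, k):
    """Fraction of rows whose target is among the k highest-ranked outputs."""
    return sum(t in r[:k] for t, r in zip(y_true, ranked)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between predicted and true numeric values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# A majority-class predictor on imbalanced toy labels.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a"]
print(balanced_accuracy(y_true, y_pred))  # 0.5
print(macro_f1(y_true, y_pred))
```

On these toy labels overall accuracy is 0.75, while balanced accuracy drops to 0.5 and macro F1 is lower still, because the never-predicted class `b` counts as much as the frequent class `a`.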

	### Models and runs

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| Cosmos3-Nano \| A smaller Cosmos3 compatibility/future-window branch. \| Used for the Nano Future Window row and related diagnostics. \| Cosmos3-Super fine-tuned adapter. \|
	\| Cosmos3-Super \| The larger Cosmos3-style branch tracked in this project. \| Published as Reasoner diagnostics and a separate forward-dynamics LoRA adapter/result branch when verified. \| Cosmos3-Nano. \|
	\| Foundation pipeline \| A high-level training direction. \| Spatial intelligence, human-video world modeling, and vision-language-action are documented as trainable directions with task mappings. \| A completed public result row. \|
	\| Full-parameter fine-tuning \| Updating the whole model rather than only adapters. \| This project records feasibility gates and short pilots, but does not publish full checkpoints. \| LoRA adapter publication. \|
	\| Human-video world model \| Learning future frames, actions, and interaction dynamics from human video. \| Uses temporal prediction, next-action, transition, and object-forecast tasks. \| Robot policy execution. \|
	\| LoRA adapter \| A lightweight set of trainable adapter weights. \| Published only when the package is verified and public-safe. \| Full base-model weights. \|
	\| Metadata baseline \| A selected-128 baseline using metadata or text-derived public-safe features. \| Compares simple and neural heads on the held-out split. \| Raw video, depth, or audio feature baselines. \|
	\| Minimal baseline \| A simple non-neural task head, informally called the "minimum" reference row. \| Provides a reproducible lower-complexity comparison for task feasibility. \| Metadata-only selected-128 baseline family. \|
	\| Neural MLP \| A compact neural task head. \| Used for single-episode and selected-128 baseline comparisons. \| Foundation-model fine-tuning. \|
	\| Qwen v1-v6 \| The Qwen3-Omni run lineage. \| v1-v4 are earlier pipeline/ablation evidence, v5 is the prior pinned release, and v6 is the current public 20-task row. \| Six different evidence lines. \|
	\| Qwen3-Omni \| The multimodal foundation-model family used for the Qwen branch. \| The current public 20-task Qwen row is Qwen3-Omni v6 LoRA plus task-specific probes. \| Cosmos3 or single-episode task-head baselines. \|
	\| Raw-feature baseline \| A selected-128 baseline using exported public-safe raw-feature groups. \| Tracks what non-foundation heads can do with richer processed inputs. \| Raw gated media redistribution. \|
	\| Simple baseline \| A non-neural baseline family for the selected-128 rows. \| Used for metadata/text and raw-feature 128-episode comparisons before NN/foundation-model rows. \| The single-episode Minimal baseline. \|
	\| Spatial intelligence \| Learning geometry and spatial reasoning from egocentric data. \| Uses video, depth, camera pose, and language tasks to target 3D/space reasoning. \| World-model future prediction. \|
	\| Vision-language-action \| Mapping perception and language to action chunks. \| A future policy/VLA direction that needs action-target conversion and stronger policy packaging. \| Qwen3-Omni diagnostic scoring. \|

	### Public surfaces

	\| Term \| Plain meaning \| In this project \| Do not confuse with \|
	\| --- \| --- \| --- \| --- \|
	\| HF artifact dataset \| Hugging Face dataset repo for derived evidence. \| Stores public-safe reports, metrics, website JSON, and sanitized result packages. \| Original Xperience-10M dataset. \|
	\| HF baseline model repo \| Hugging Face model repo for lightweight baseline artifacts. \| Mirrors baseline weights, figures, metrics, and task artifacts. \| Qwen/Cosmos adapter-specific repos. \|
	\| HF Space \| Hugging Face-hosted app/site surface. \| Mirrors the dashboard and static website assets. \| HF artifact dataset or model repo. \|
	\| HF weights/results repo \| A consolidated public-safe model-result bundle. \| Groups baseline weights, verified model artifacts, analysis files, and manifests. \| The upstream raw dataset. \|
	\| Mirror parity \| A check that public copies match the source files. \| Records whether GitHub, website, and HF mirrors agree. \| A model-quality metric. \|
	\| Public-safe artifact \| A file that can be mirrored publicly without raw gated content. \| Metrics, JSON summaries, model cards, figures, derived manifests, and approved lightweight weights/adapters. \| Raw dataset redistribution. \|
	\| Publication audit \| A public-package validation report. \| Confirms required files exist and forbidden raw/private assets are not included. \| Scientific peer review. \|
	\| Verified package \| A result or artifact bundle that passed local/public validators. \| Only verified packages are promoted to README, website, and HF surfaces as public evidence. \| A running or exploratory experiment. \|
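
A Mirror parity check can be pictured as a digest comparison between the source files and a public copy. The sketch below is a hypothetical illustration, not the project's validator: `parity_report`, its report keys, and the in-memory file contents are assumptions, and a real check would read files from the repo and its mirrors.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


def parity_report(source: dict, mirror: dict) -> dict:
    """Compare two {relative_path: bytes} mappings by content digest."""
    report = {"match": [], "mismatch": [], "missing_in_mirror": []}
    for path in sorted(source):
        if path not in mirror:
            report["missing_in_mirror"].append(path)
        elif sha256_hex(source[path]) == sha256_hex(mirror[path]):
            report["match"].append(path)
        else:
            report["mismatch"].append(path)
    # Files that exist only in the mirror are reported separately.
    report["extra_in_mirror"] = sorted(set(mirror) - set(source))
    return report


# Made-up file contents standing in for a source tree and one mirror.
source = {"GLOSSARY.md": b"terms", "metrics.json": b"{}"}
mirror = {"GLOSSARY.md": b"terms", "metrics.json": b"{ }", "extra.txt": b"x"}
print(parity_report(source, mirror))
```

A report like this says only whether copies agree byte for byte; it carries no information about model quality.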

	## File Entry Points

	\| Need \| Open \|
	\| --- \| --- \|
	\| Reader navigation \| `PUBLIC_READER_MAP.md`, `docs/data/public_reader_map.json` \|
	\| Task definitions \| `TASK_SUITE_20.md`, `docs/data/task_suite_20.json` \|
	\| Result matrix \| `TASK_METHOD_20_RESULT_MATRIX.md`, `docs/data/task_method_20_result_matrix.json` \|
	\| Direct/proxy status \| `TASK_METHOD_20_GAP_AUDIT.md`, `docs/data/task_method_20_gap_audit.json` \|
	\| Qwen lineage \| `QWEN3_OMNI_RUN_LINEAGE.md`, `docs/data/qwen3_omni_run_lineage.json` \|
	\| 128-episode source/features \| `XPERIENCE10M_128_EPISODE_FEATURE_INDEX.md`, `docs/data/xperience10m_128_episode_feature_index.json` \|
	\| Public mirrors \| `PUBLIC_SURFACE_QA.md`, `docs/data/mirror_parity.json`, `docs/data/live_publication_status.json` \|