# Xperience-10M Official Dataset Card Alignment

This file records the public description of the official [`ropedia-ai/xperience-10m`](https://huggingface.co/datasets/ropedia-ai/xperience-10m) dataset card and how this repo uses only one public sample episode from that larger source. It is a description-alignment artifact, not a raw-data mirror.

Checked on: 2026-06-01 11:14:51 UTC against the public Hugging Face dataset page/API and the public sample dataset card.

## Official Dataset Scope

The official Xperience-10M dataset is described by Ropedia as a large-scale egocentric multimodal dataset for embodied AI, robotics, world models, and spatial intelligence. The dataset card frames it as human-experience data with roughly 10 million interaction/experience units and about 10,000 hours of synchronized first-person recording.

The official card metadata lists these task and modality categories:

- task categories: video classification, image-to-text, depth estimation, robotics
- modalities: 3D, audio, video
- language: English
- license field: `other`
- size category: `1M…`

```
/
  ep/
    fisheye_cam0.mp4
    fisheye_cam1.mp4
    fisheye_cam2.mp4
    fisheye_cam3.mp4
    stereo_left.mp4
    stereo_right.mp4
    annotation.hdf5
    visualization.rrd  # optional viewer artifact; excluded from training downloads
```

For this repo, a valid training/evaluation episode requires `annotation.hdf5`. Full-omni mode prefers all six MP4 streams. Degraded mode may use `fisheye_cam0.mp4` plus the annotation file, but must record missing views in the manifest. `visualization.rrd` is useful for human inspection in Rerun, but it is excluded from training downloads and public artifact bundles.

## Annotation File Content

The official card describes the HDF5 annotation file as carrying aligned multimodal records.
The relevant groups include:

- calibration: camera intrinsics/extrinsics for fisheye and stereo cameras
- SLAM/camera pose: quaternions, translations, frame names, and point cloud
- depth: depth map, confidence, scale, min/max, and validity metadata
- hand motion capture: left/right hand joints, translations, and MANO-related records
- full-body motion capture: body keypoints, contacts, transforms, and body rotations
- IMU: timestamps, accelerometer, gyroscope, and keyframe metadata
- video timing: timestamps, frame numbers, and video duration
- language/caption annotations and metadata

This repo's current 8,546-d feature vector uses video-derived statistics, audio, depth, pose/SLAM, calibration, mocap, IMU, and language-derived blocks.

## Intended Research Uses

The official dataset card supports research directions such as:

- egocentric video/action understanding
- task and subtask recognition
- temporal action localization and human-object interaction analysis
- action-language grounding and action captioning
- object grounding and caption/language grounding
- audio-visual learning and multimodal pretraining
- embodied reasoning, world-model learning, and robotics imitation learning
- depth estimation, visual odometry, camera trajectory, SLAM, and scene reconstruction
- hand/body pose, human motion understanding, and sensor fusion

This repo currently implements a single-episode task suite that starts several of those directions, but it does not solve the full official task list. The 12 current tasks cover action/subtask labels, next-action prediction, transition and temporal diagnostics, hand trajectory forecasting, contact prediction, object relevance, caption grounding, cross-modal retrieval, modality reconstruction, and misalignment detection. Missing or only-proxy coverage includes real audio-visual modeling, full caption generation, depth-pixel estimation, full SLAM estimation, neural rendering, policy learning, and cross-episode generalization.
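
The episode layout rule stated earlier in this file (`annotation.hdf5` is required, full-omni mode prefers all six MP4 streams, and degraded mode may use `fisheye_cam0.mp4` plus the annotation file while recording missing views) can be sketched as a small directory check. This is a minimal sketch, not the repo's actual inspection or pilot-readiness script: the function name `classify_episode`, the mode strings, the returned dictionary shape, and the `unsupported` fallback are all hypothetical.

```python
from pathlib import Path

# Stream and annotation file names as listed in the episode layout in this file.
EXPECTED_STREAMS = (
    "fisheye_cam0.mp4",
    "fisheye_cam1.mp4",
    "fisheye_cam2.mp4",
    "fisheye_cam3.mp4",
    "stereo_left.mp4",
    "stereo_right.mp4",
)
ANNOTATION = "annotation.hdf5"


def classify_episode(episode_dir):
    """Return a manifest-style record for one episode directory.

    The record shape and mode names are illustrative assumptions.
    """
    episode_dir = Path(episode_dir)
    present = {p.name for p in episode_dir.iterdir() if p.is_file()}
    missing = [name for name in EXPECTED_STREAMS if name not in present]

    if ANNOTATION not in present:
        mode = "invalid"          # no annotation file: not a usable episode
    elif not missing:
        mode = "full_omni"        # all six MP4 streams are available
    elif "fisheye_cam0.mp4" in present:
        mode = "degraded"         # annotation plus at least fisheye_cam0
    else:
        mode = "unsupported"      # assumption: this file does not define this case

    return {
        "episode": episode_dir.name,
        "mode": mode,
        "missing_views": missing,  # recorded so degraded runs stay auditable
    }
```

Returning the missing views alongside the mode keeps the degraded-mode requirement (record what is absent) in the same record that a manifest writer would consume.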
## Responsible Use and Scope

The official dataset is gated and intended for approved non-commercial research use, while the public sample card lists `cc-by-nc-4.0`. This repo therefore does not redistribute raw MP4 files, raw `annotation.hdf5`, private gated data, raw `visualization.rrd`, or any full Qwen weights. Public assets here are derived metrics, small thumbnails, manifests, scripts, charts, and lightweight baseline artifacts.

The official card also makes clear that the data is not meant for identity recognition, re-identification, biometric profiling, surveillance, sensitive attribute inference, or safety-critical deployment without appropriate safeguards. It also describes the open-source dataset as limited in diversity and showcase/production quality, so downstream work still needs robust evaluation and safeguards.

## Limitations To Preserve In This Project

When describing Xperience-10M in this repo, keep these limitations visible:

- one public sample episode cannot prove cross-environment generalization
- full-dataset performance requires gated access, many episodes, and held-out episode splits
- motion capture, SLAM, depth, captions, and other annotations can contain noise
- language annotations are not exhaustive descriptions of every scene state
- large-scale training requires substantial storage, preprocessing, and compute
- the current feature vector includes compact audio features, while larger audio-visual representation learning remains a multi-episode milestone

## Current Project Alignment

| Official dataset card concept | Current repo status |
| --- | --- |
| Full Xperience-10M is large, gated, and multi-episode | Acknowledged; not redistributed |
| HF API lists many gated episode paths | Recorded as upstream metadata, not local possession |
| Public sample repo is `cc-by-nc-4.0` and points to HOMIE/Rerun | Preserved in data notice and reproducibility docs |
| Public sample includes video/audio/depth/pose/mocap/IMU/language | Represented in the modality atlas |
| Episode layout uses six MP4 streams and `annotation.hdf5` | Used by sample inspection and pilot-readiness scripts |
| Audio exists in MP4 streams | Represented in the current multimodal feature contract |
| 4D reconstruction/world modeling are intended research directions | Represented by proxy/diagnostic tasks only |
| Real model quality requires held-out multi-episode evaluation | Pending selected multi-episode data preparation, training, and evaluation |
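
The redistribution rule in the Responsible Use section (no raw MP4 files, raw `annotation.hdf5`, or raw `visualization.rrd` in public bundles) could be checked before publishing with a simple path filter. The following is a minimal sketch under stated assumptions: `find_blocked_files` and the two blocklists are hypothetical names, the filter conservatively blocks every `.mp4` and `.rrd` file rather than only the raw ones, and it cannot recognize model weights or gated data that are not identifiable by file name.

```python
from pathlib import PurePosixPath

# Assumption: blocking by suffix/name is a conservative stand-in for "raw" files.
BLOCKED_SUFFIXES = {".mp4", ".rrd"}
BLOCKED_NAMES = {"annotation.hdf5"}


def find_blocked_files(paths):
    """Return the bundle paths that match the raw-data blocklists, in input order."""
    blocked = []
    for raw in paths:
        p = PurePosixPath(raw)
        if p.suffix.lower() in BLOCKED_SUFFIXES or p.name.lower() in BLOCKED_NAMES:
            blocked.append(raw)
    return blocked


if __name__ == "__main__":
    # Hypothetical bundle listing; a real check would walk the bundle directory.
    bundle = [
        "metrics/summary.json",
        "thumbs/frame_000.png",
        "ep/fisheye_cam0.mp4",
        "ep/annotation.hdf5",
    ]
    for path in find_blocked_files(bundle):
        print(f"blocked from public bundle: {path}")
```

A name-based filter like this only catches accidental inclusion; it does not replace a manual review of what a bundle actually contains.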