| # Xperience-10M Official Dataset Card Alignment |
|
|
| This file records the public description of the official |
| [`ropedia-ai/xperience-10m`](https://huggingface.co/datasets/ropedia-ai/xperience-10m) |
| dataset card and how this repo uses only one public sample episode from that |
| larger source. It is a description-alignment artifact, not a raw-data mirror. |
|
|
| Checked on: 2026-06-01 11:14:51 UTC against the public Hugging Face dataset |
| page/API and the public sample dataset card. |
|
|
| ## Official Dataset Scope |
|
|
| The official Xperience-10M dataset is described by Ropedia as a large-scale |
| egocentric multimodal dataset for embodied AI, robotics, world models, and |
| spatial intelligence. The dataset card frames it as human-experience data with |
| roughly 10 million interaction/experience units and about 10,000 hours of |
| synchronized first-person recording. |
|
|
| The official card metadata lists these task and modality categories: |
|
|
| - task categories: video classification, image-to-text, depth estimation, robotics |
| - modalities: 3D, audio, video |
| - language: English |
| - license field: `other` |
| - size category: `1M<n<10M` |
| - access: manually gated, reviewed access for approved non-commercial use |
|
|
| The current public Hugging Face API metadata reports the dataset repo as |
| `gated: manual` and notes that an external DocuSign agreement may be required |
| before approval. The API snapshot checked for this project reported: |
|
|
| | Field | Observed value | |
| | --- | --- | |
| | repo id | `ropedia-ai/xperience-10m` | |
| | pretty name | `Xperience-10M` | |
| | repo commit | `ce943cf271a758b60240084892d05cf6dc12dd90` | |
| | last modified | `2026-04-21T05:03:45.000Z` | |
| | gated mode | manual | |
| | listed task categories | video classification, image-to-text, depth estimation, robotics | |
| | listed modalities | 3D, audio, video | |
| | dataset-card tags | egocentric, first-person, multimodal, 3d/4d, embodied-ai, robotics, human-motion, mocap, imu, audio, depth, captions, video | |
| | license field | `other` | |
| | live HF total file-size display | 31.9 TB | |
|
|
| The API file listing is useful for planning, but it is not the same as local |
| access. The public metadata snapshot listed 85,258 repository siblings, 803 |
| session folders, 12,103 episode folders with `annotation.hdf5`, 72,612 MP4 |
| files, and 541 `visualization.rrd` files. This repo treats those as upstream |
| metadata only; no full-dataset files are redistributed here, and model claims |
| remain limited to the one public sample episode actually processed. |
|
|
| ## Official Modalities |
|
|
| The official dataset card describes the full dataset as synchronized 4D |
| multimodal egocentric data spanning: |
|
|
| - six RGB video streams: four fisheye views and two rectified stereo views |
| - audio embedded in the video streams |
| - stereo depth and depth confidence |
| - camera pose, SLAM trajectory, and point-cloud information |
| - two-hand motion capture, including hand joints and MANO-related data |
| - full-body motion capture, keypoints, contacts, and body orientation data |
| - inertial sensing from accelerometer and gyroscope streams |
| - hierarchical language/caption annotations |
| - metadata and calibration records |
|
|
| ## Official Scale Statistics |
|
|
| The official dataset card describes Xperience-10M at full scale with these |
| headline counts: |
|
|
| | Quantity | Official-card scale | |
| | --- | --- | |
| | Human experience / interaction units | about 10 million | |
| | Recording duration | about 10,000 hours | |
| | RGB frames | about 2.88 billion | |
| | Depth frames | about 720 million | |
| | Camera-pose records | about 576 million | |
| | Motion-capture frames | about 576 million | |
| | IMU records | about 7.2 billion | |
| | Caption sentences | about 16 million | |
| | Caption words | about 200 million | |
| | Vocabulary size | about 6,000 words | |
| | Object annotations | about 350,000 objects | |
| | Trajectory distance | about 39,000 km | |
| | Total storage described by the card | about 1 PB | |
|
|
| The public Hugging Face page/API currently shows a separate live hosted |
| file-size display of 31.9 TB (`usedStorage` observed as 31,871,115,497,224 |
| bytes). This project keeps those concepts separate: the official card scale |
| describes the full dataset design, the HF display describes the currently |
| reported hosted file size, and this repo validates only the files that are |
| actually available to the project. |
|
|
| ## Public Sample Dataset Card |
|
|
| The public sample repo is |
| [`ropedia-ai/xperience-10m-sample`](https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample). |
| Its dataset card describes it as a sample episode for Xperience-10M and points |
| readers to HOMIE Toolkit for understanding the videos and annotations. It also |
| notes that an `.rrd` file can be opened with Rerun 0.29.0 to inspect the 3D/4D |
| structured annotations. |
|
|
| The sample card metadata observed for this project is: |
|
|
| | Field | Observed value | |
| | --- | --- | |
| | pretty name | `Xperience-10M-Sample` | |
| | license | `cc-by-nc-4.0` | |
| | tags | `sample`, `xperience-10k` | |
| | size category | `n<1K` | |
| | recommended toolkit | HOMIE Toolkit | |
| | visualization tool | Rerun 0.29.0 for `.rrd` | |
|
|
| This project uses the public sample to build the 5,821-frame / 1,161-window |
| task-development suite. The sample license and the full gated dataset terms are both |
| preserved in the public documentation; this repo's MIT code license does not |
| grant additional rights to the raw data. |
|
|
| ## Episode File Layout |
|
|
| The official gated file listing and the public sample use episode folders with |
| this practical layout: |
|
|
| ```text |
| <session_uuid>/ |
| ep<episode_id>/ |
| fisheye_cam0.mp4 |
| fisheye_cam1.mp4 |
| fisheye_cam2.mp4 |
| fisheye_cam3.mp4 |
| stereo_left.mp4 |
| stereo_right.mp4 |
| annotation.hdf5 |
| visualization.rrd # optional viewer artifact; excluded from training downloads |
| ``` |
|
|
| For this repo, a valid training/evaluation episode requires `annotation.hdf5`. |
| Full-omni mode prefers all six MP4 streams. Degraded mode may use |
| `fisheye_cam0.mp4` plus the annotation file, but must record missing views in |
| the manifest. `visualization.rrd` is useful for human inspection in Rerun, but |
| it is excluded from training downloads and public artifact bundles. |
|
|
| ## Annotation File Content |
|
|
| The official card describes the HDF5 annotation file as carrying aligned |
| multimodal records. The relevant groups include: |
|
|
| - calibration: camera intrinsics/extrinsics for fisheye and stereo cameras |
| - SLAM/camera pose: quaternions, translations, frame names, and point cloud |
| - depth: depth map, confidence, scale, min/max, and validity metadata |
| - hand motion capture: left/right hand joints, translations, and MANO-related records |
| - full-body motion capture: body keypoints, contacts, transforms, and body rotations |
| - IMU: timestamps, accelerometer, gyroscope, and keyframe metadata |
| - video timing: timestamps, frame numbers, and video duration |
| - language/caption annotations and metadata |
|
|
| This repo's current 8,546-d feature vector uses video-derived statistics, |
| audio, depth, pose/SLAM, calibration, mocap, IMU, and language-derived |
| blocks. |
|
|
| ## Intended Research Uses |
|
|
| The official dataset card supports research directions such as: |
|
|
| - egocentric video/action understanding |
| - task and subtask recognition |
| - temporal action localization and human-object interaction analysis |
| - action-language grounding and action captioning |
| - object grounding and caption/language grounding |
| - audio-visual learning and multimodal pretraining |
| - embodied reasoning, world-model learning, and robotics imitation learning |
| - depth estimation, visual odometry, camera trajectory, SLAM, and scene reconstruction |
| - hand/body pose, human motion understanding, and sensor fusion |
|
|
| This repo currently implements a single-episode task suite that starts several |
| of those directions, but it does not solve the full official task list. The 12 |
| current tasks cover action/subtask labels, next-action prediction, transition |
| and temporal diagnostics, hand trajectory forecasting, contact prediction, |
| object relevance, caption grounding, cross-modal retrieval, modality |
| reconstruction, and misalignment detection. Missing or only-proxy coverage |
| includes real audio-visual modeling, full caption generation, depth-pixel |
| estimation, full SLAM estimation, neural rendering, policy learning, and |
| cross-episode generalization. |
|
|
| ## Responsible Use and Scope |
|
|
| The official dataset is gated and intended for approved non-commercial research |
| use, while the public sample card lists `cc-by-nc-4.0`. This repo therefore |
| does not redistribute raw MP4 files, raw `annotation.hdf5`, private gated data, |
| raw `visualization.rrd`, or any full Qwen weights. Public assets here are |
| derived metrics, small thumbnails, manifests, scripts, charts, and lightweight |
| baseline artifacts. |
|
|
| The official card also makes clear that the data is not meant for identity |
| recognition, re-identification, biometric profiling, surveillance, sensitive |
| attribute inference, or safety-critical deployment without appropriate |
| safeguards. It also describes the open-source dataset as limited in diversity |
| and showcase/production quality, so downstream work still needs robust |
| evaluation and safeguards. |
|
|
| ## Limitations To Preserve In This Project |
|
|
| When describing Xperience-10M in this repo, keep these limitations visible: |
|
|
| - one public sample episode cannot prove cross-environment generalization |
| - full-dataset claims require gated access, many episodes, and held-out episode splits |
| - motion capture, SLAM, depth, captions, and other annotations can contain noise |
| - language annotations are not exhaustive descriptions of every scene state |
| - large-scale training requires substantial storage, preprocessing, and compute |
| - the current feature vector includes compact audio features, while |
| larger audio-visual representation learning remains a multi-episode milestone |
|
|
| ## Current Project Alignment |
|
|
| | Official dataset card concept | Current repo status | |
| | --- | --- | |
| | Full Xperience-10M is large, gated, and multi-episode | Acknowledged; not redistributed | |
| | HF API lists many gated episode paths | Recorded as upstream metadata, not local possession | |
| | Public sample repo is `cc-by-nc-4.0` and points to HOMIE/Rerun | Preserved in data notice and reproducibility docs | |
| | Public sample includes video/audio/depth/pose/mocap/IMU/language | Represented in the modality atlas | |
| | Episode layout uses six MP4 streams and `annotation.hdf5` | Used by sample inspection and pilot-readiness scripts | |
| | Audio exists in MP4 streams | Represented in the current multimodal feature contract | |
| | 4D reconstruction/world modeling are intended research directions | Represented by proxy/diagnostic tasks only | |
| | Real model quality requires held-out multi-episode evaluation | Pending selected multi-episode data preparation, training, and evaluation | |
|
|