File size: 10,625 Bytes
94a5118
 
 
 
 
 
 
9d58132
 
94a5118
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cca436c
 
 
 
 
9d58132
cca436c
 
 
 
 
9d58132
cca436c
9d58132
cca436c
 
 
 
 
 
 
94a5118
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9d58132
 
 
 
 
 
94a5118
cca436c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
c325020
cca436c
 
 
94a5118
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cca436c
 
94a5118
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a8124a8
45c1706
 
94a5118
 
 
 
 
 
 
 
9d58132
94a5118
 
 
 
 
 
c325020
94a5118
 
 
 
 
 
 
 
 
6a1869c
94a5118
 
cca436c
 
 
 
 
94a5118
 
 
 
9d58132
 
6a1869c
94a5118
 
 
 
 
 
 
 
 
 
45c1706
a8124a8
94a5118
 
 
 
 
 
cca436c
 
94a5118
 
45c1706
94a5118
4602161
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
# Xperience-10M Official Dataset Card Alignment

This file records the public description of the official
[`ropedia-ai/xperience-10m`](https://huggingface.co/datasets/ropedia-ai/xperience-10m)
dataset card and how this repo uses only one public sample episode from that
larger source. It is a description-alignment artifact, not a raw-data mirror.

Checked on: 2026-06-01 11:14:51 UTC against the public Hugging Face dataset
page/API and the public sample dataset card.

## Official Dataset Scope

The official Xperience-10M dataset is described by Ropedia as a large-scale
egocentric multimodal dataset for embodied AI, robotics, world models, and
spatial intelligence. The dataset card frames it as human-experience data with
roughly 10 million interaction/experience units and about 10,000 hours of
synchronized first-person recording.

The official card metadata lists these task and modality categories:

- task categories: video classification, image-to-text, depth estimation, robotics
- modalities: 3D, audio, video
- language: English
- license field: `other`
- size category: `1M<n<10M`
- access: manually gated, reviewed access for approved non-commercial use

The current public Hugging Face API metadata reports the dataset repo as
`gated: manual` and notes that an external DocuSign agreement may be required
before approval. The API snapshot checked for this project reported:

| Field | Observed value |
| --- | --- |
| repo id | `ropedia-ai/xperience-10m` |
| pretty name | `Xperience-10M` |
| repo commit | `ce943cf271a758b60240084892d05cf6dc12dd90` |
| last modified | `2026-04-21T05:03:45.000Z` |
| gated mode | manual |
| listed task categories | video classification, image-to-text, depth estimation, robotics |
| listed modalities | 3D, audio, video |
| dataset-card tags | egocentric, first-person, multimodal, 3d/4d, embodied-ai, robotics, human-motion, mocap, imu, audio, depth, captions, video |
| license field | `other` |
| live HF total file-size display | 31.9 TB |

The API file listing is useful for planning, but it is not the same as local
access. The public metadata snapshot listed 85,258 repository siblings, 803
session folders, 12,103 episode folders with `annotation.hdf5`, 72,612 MP4
files, and 541 `visualization.rrd` files. This repo treats those as upstream
metadata only; no full-dataset files are redistributed here, and model claims
remain limited to the one public sample episode actually processed.

## Official Modalities

The official dataset card describes the full dataset as synchronized 4D
multimodal egocentric data spanning:

- six RGB video streams: four fisheye views and two rectified stereo views
- audio embedded in the video streams
- stereo depth and depth confidence
- camera pose, SLAM trajectory, and point-cloud information
- two-hand motion capture, including hand joints and MANO-related data
- full-body motion capture, keypoints, contacts, and body orientation data
- inertial sensing from accelerometer and gyroscope streams
- hierarchical language/caption annotations
- metadata and calibration records

## Official Scale Statistics

The official dataset card describes Xperience-10M at full scale with these
headline counts:

| Quantity | Official-card scale |
| --- | --- |
| Human experience / interaction units | about 10 million |
| Recording duration | about 10,000 hours |
| RGB frames | about 2.88 billion |
| Depth frames | about 720 million |
| Camera-pose records | about 576 million |
| Motion-capture frames | about 576 million |
| IMU records | about 7.2 billion |
| Caption sentences | about 16 million |
| Caption words | about 200 million |
| Vocabulary size | about 6,000 words |
| Object annotations | about 350,000 objects |
| Trajectory distance | about 39,000 km |
| Total storage described by the card | about 1 PB |

The public Hugging Face page/API currently shows a separate live hosted
file-size display of 31.9 TB (`usedStorage` observed as 31,871,115,497,224
bytes). This project keeps those concepts separate: the official card scale
describes the full dataset design, the HF display describes the currently
reported hosted file size, and this repo validates only the files that are
actually available to the project.

## Public Sample Dataset Card

The public sample repo is
[`ropedia-ai/xperience-10m-sample`](https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample).
Its dataset card describes it as a sample episode for Xperience-10M and points
readers to HOMIE Toolkit for understanding the videos and annotations. It also
notes that an `.rrd` file can be opened with Rerun 0.29.0 to inspect the 3D/4D
structured annotations.

The sample card metadata observed for this project is:

| Field | Observed value |
| --- | --- |
| pretty name | `Xperience-10M-Sample` |
| license | `cc-by-nc-4.0` |
| tags | `sample`, `xperience-10k` |
| size category | `n<1K` |
| recommended toolkit | HOMIE Toolkit |
| visualization tool | Rerun 0.29.0 for `.rrd` |

This project uses the public sample to build the 5,821-frame / 1,161-window
task-development suite. The sample license and the full gated dataset terms are both
preserved in the public documentation; this repo's MIT code license does not
grant additional rights to the raw data.

## Episode File Layout

The official gated file listing and the public sample use episode folders with
this practical layout:

```text
<session_uuid>/
  ep<episode_id>/
    fisheye_cam0.mp4
    fisheye_cam1.mp4
    fisheye_cam2.mp4
    fisheye_cam3.mp4
    stereo_left.mp4
    stereo_right.mp4
    annotation.hdf5
    visualization.rrd        # optional viewer artifact; excluded from training downloads
```

For this repo, a valid training/evaluation episode requires `annotation.hdf5`.
Full-omni mode prefers all six MP4 streams. Degraded mode may use
`fisheye_cam0.mp4` plus the annotation file, but must record missing views in
the manifest. `visualization.rrd` is useful for human inspection in Rerun, but
it is excluded from training downloads and public artifact bundles.

## Annotation File Content

The official card describes the HDF5 annotation file as carrying aligned
multimodal records. The relevant groups include:

- calibration: camera intrinsics/extrinsics for fisheye and stereo cameras
- SLAM/camera pose: quaternions, translations, frame names, and point cloud
- depth: depth map, confidence, scale, min/max, and validity metadata
- hand motion capture: left/right hand joints, translations, and MANO-related records
- full-body motion capture: body keypoints, contacts, transforms, and body rotations
- IMU: timestamps, accelerometer, gyroscope, and keyframe metadata
- video timing: timestamps, frame numbers, and video duration
- language/caption annotations and metadata

This repo's current 8,546-d feature vector uses video-derived statistics,
audio, depth, pose/SLAM, calibration, mocap, IMU, and language-derived
blocks.

## Intended Research Uses

The official dataset card supports research directions such as:

- egocentric video/action understanding
- task and subtask recognition
- temporal action localization and human-object interaction analysis
- action-language grounding and action captioning
- object grounding and caption/language grounding
- audio-visual learning and multimodal pretraining
- embodied reasoning, world-model learning, and robotics imitation learning
- depth estimation, visual odometry, camera trajectory, SLAM, and scene reconstruction
- hand/body pose, human motion understanding, and sensor fusion

This repo currently implements a single-episode task suite that starts several
of those directions, but it does not solve the full official task list. The 12
current tasks cover action/subtask labels, next-action prediction, transition
and temporal diagnostics, hand trajectory forecasting, contact prediction,
object relevance, caption grounding, cross-modal retrieval, modality
reconstruction, and misalignment detection. Missing or only-proxy coverage
includes real audio-visual modeling, full caption generation, depth-pixel
estimation, full SLAM estimation, neural rendering, policy learning, and
cross-episode generalization.

## Responsible Use and Scope

The official dataset is gated and intended for approved non-commercial research
use, while the public sample card lists `cc-by-nc-4.0`. This repo therefore
does not redistribute raw MP4 files, raw `annotation.hdf5`, private gated data,
raw `visualization.rrd`, or any full Qwen weights. Public assets here are
derived metrics, small thumbnails, manifests, scripts, charts, and lightweight
baseline artifacts.

The official card also makes clear that the data is not meant for identity
recognition, re-identification, biometric profiling, surveillance, sensitive
attribute inference, or safety-critical deployment without appropriate
safeguards. It also describes the open-source dataset as limited in diversity
and showcase/production quality, so downstream work still needs robust
evaluation and safeguards.

## Limitations To Preserve In This Project

When describing Xperience-10M in this repo, keep these limitations visible:

- one public sample episode cannot prove cross-environment generalization
- full-dataset claims require gated access, many episodes, and held-out episode splits
- motion capture, SLAM, depth, captions, and other annotations can contain noise
- language annotations are not exhaustive descriptions of every scene state
- large-scale training requires substantial storage, preprocessing, and compute
- the current feature vector includes compact audio features, while
  larger audio-visual representation learning remains a multi-episode milestone

## Current Project Alignment

| Official dataset card concept | Current repo status |
| --- | --- |
| Full Xperience-10M is large, gated, and multi-episode | Acknowledged; not redistributed |
| HF API lists many gated episode paths | Recorded as upstream metadata, not local possession |
| Public sample repo is `cc-by-nc-4.0` and points to HOMIE/Rerun | Preserved in data notice and reproducibility docs |
| Public sample includes video/audio/depth/pose/mocap/IMU/language | Represented in the modality atlas |
| Episode layout uses six MP4 streams and `annotation.hdf5` | Used by sample inspection and pilot-readiness scripts |
| Audio exists in MP4 streams | Represented in the current multimodal feature contract |
| 4D reconstruction/world modeling are intended research directions | Represented by proxy/diagnostic tasks only |
| Real model quality requires held-out multi-episode evaluation | Pending selected multi-episode data preparation, training, and evaluation |