# Xperience-10M Official Dataset Card Alignment

This file records the public description of the official
[`ropedia-ai/xperience-10m`](https://huggingface.co/datasets/ropedia-ai/xperience-10m)
dataset card and how this repo uses only one public sample episode from that
larger source. It is a description-alignment artifact, not a raw-data mirror.

Checked on: 2026-06-01 11:14:51 UTC against the public Hugging Face dataset
page/API and the public sample dataset card.

## Official Dataset Scope

The official Xperience-10M dataset is described by Ropedia as a large-scale
egocentric multimodal dataset for embodied AI, robotics, world models, and
spatial intelligence. The dataset card frames it as human-experience data with
roughly 10 million interaction/experience units and about 10,000 hours of
synchronized first-person recording.

The official card metadata lists these task and modality categories:

- task categories: video classification, image-to-text, depth estimation, robotics
- modalities: 3D, audio, video
- language: English
- license field: `other`
- size category: `1M<n<10M`
- access: manually gated, reviewed access for approved non-commercial use

The current public Hugging Face API metadata reports the dataset repo as
`gated: manual` and notes that an external DocuSign agreement may be required
before approval. The API snapshot checked for this project reported:

| Field | Observed value |
| --- | --- |
| repo id | `ropedia-ai/xperience-10m` |
| pretty name | `Xperience-10M` |
| repo commit | `ce943cf271a758b60240084892d05cf6dc12dd90` |
| last modified | `2026-04-21T05:03:45.000Z` |
| gated mode | manual |
| listed task categories | video classification, image-to-text, depth estimation, robotics |
| listed modalities | 3D, audio, video |
| dataset-card tags | egocentric, first-person, multimodal, 3d/4d, embodied-ai, robotics, human-motion, mocap, imu, audio, depth, captions, video |
| license field | `other` |
| live HF total file-size display | 31.9 TB |

The API file listing is useful for planning, but it is not the same as local
access. The public metadata snapshot listed 85,258 repository siblings, 803
session folders, 12,103 episode folders with `annotation.hdf5`, 72,612 MP4
files, and 541 `visualization.rrd` files. This repo treats those as upstream
metadata only; no full-dataset files are redistributed here, and model claims
remain limited to the one public sample episode actually processed.

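
Counts like the ones in the snapshot can be derived from a plain list of repository paths. The sketch below is illustrative, not this repo's actual script: `summarize_listing` and its key names are hypothetical, and it assumes the `<session_uuid>/ep<episode_id>/<file>` path shape. In practice the paths might come from the `siblings` field returned by `huggingface_hub.HfApi().dataset_info(...)`, but the example uses a synthetic list so it runs offline.

```python
from collections import Counter
from pathlib import PurePosixPath


def summarize_listing(paths):
    """Count sessions, annotated episodes, MP4 files, and .rrd files in a path list."""
    counts = Counter()
    sessions = set()
    for raw in paths:
        path = PurePosixPath(raw)
        counts["siblings"] += 1
        # Assumption: files inside an episode sit at <session>/<episode>/<file>.
        if len(path.parts) >= 3:
            sessions.add(path.parts[0])
        if path.name == "annotation.hdf5":
            counts["episodes_with_annotation"] += 1
        elif path.suffix == ".mp4":
            counts["mp4_files"] += 1
        elif path.name == "visualization.rrd":
            counts["rrd_files"] += 1
    counts["session_folders"] = len(sessions)
    return dict(counts)


# Synthetic listing for illustration only; these are not real repository paths.
example = [
    "README.md",
    "sess-a/ep1/annotation.hdf5",
    "sess-a/ep1/fisheye_cam0.mp4",
    "sess-a/ep1/stereo_left.mp4",
    "sess-a/ep2/annotation.hdf5",
    "sess-b/ep1/annotation.hdf5",
    "sess-b/ep1/visualization.rrd",
]
summary = summarize_listing(example)

# Consistency arithmetic on the counts reported in the snapshot above.
reported = {"siblings": 85_258, "episodes": 12_103, "mp4": 72_612, "rrd": 541}
unclassified = reported["siblings"] - (
    reported["episodes"] + reported["mp4"] + reported["rrd"]
)
mp4_shortfall = reported["episodes"] * 6 - reported["mp4"]
```

On the reported numbers, the three file types account for all but 2 of the 85,258 siblings, and the MP4 count is 6 short of six streams per annotated episode. If the snapshot is accurate, at least one annotated episode therefore lacks one or more video streams.
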
## Official Modalities

The official dataset card describes the full dataset as synchronized 4D
multimodal egocentric data spanning:

- six RGB video streams: four fisheye views and two rectified stereo views
- audio embedded in the video streams
- stereo depth and depth confidence
- camera pose, SLAM trajectory, and point-cloud information
- two-hand motion capture, including hand joints and MANO-related data
- full-body motion capture, keypoints, contacts, and body orientation data
- inertial sensing from accelerometer and gyroscope streams
- hierarchical language/caption annotations
- metadata and calibration records

## Official Scale Statistics

The official dataset card describes Xperience-10M at full scale with these
headline counts:

| Quantity | Official-card scale |
| --- | --- |
| Human experience / interaction units | about 10 million |
| Recording duration | about 10,000 hours |
| RGB frames | about 2.88 billion |
| Depth frames | about 720 million |
| Camera-pose records | about 576 million |
| Motion-capture frames | about 576 million |
| IMU records | about 7.2 billion |
| Caption sentences | about 16 million |
| Caption words | about 200 million |
| Vocabulary size | about 6,000 words |
| Object annotations | about 350,000 objects |
| Trajectory distance | about 39,000 km |
| Total storage described by the card | about 1 PB |

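
As a back-of-envelope reading of the table above, dividing each headline count by the stated recording duration gives an implied average rate. These rates are derived arithmetic on rounded figures, not values stated by the official card, and the RGB figure is aggregated over all video streams rather than per camera.

```python
# Headline figures copied from the official-card table (all are "about" values).
HOURS = 10_000
SECONDS = HOURS * 3600  # 36,000,000 seconds of recording

card_counts = {
    "rgb_frames": 2.88e9,
    "depth_frames": 720e6,
    "camera_pose_records": 576e6,
    "mocap_frames": 576e6,
    "imu_records": 7.2e9,
}

# Implied average records per second of recording, summed over all streams.
implied_rate_hz = {name: count / SECONDS for name, count in card_counts.items()}

# Average caption length implied by the word and sentence totals.
words_per_sentence = 200e6 / 16e6

for name, rate in implied_rate_hz.items():
    print(f"{name}: {rate:.1f} per second")
print(f"words per caption sentence: {words_per_sentence:.1f}")
```

The implied averages are 80 RGB frames, 20 depth frames, 16 pose records, 16 motion-capture frames, and 200 IMU records per second of recording, with about 12.5 words per caption sentence. They are useful only as a sanity check when planning storage and preprocessing.
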
The public Hugging Face page/API currently shows a separate live hosted
file-size display of 31.9 TB (`usedStorage` observed as 31,871,115,497,224
bytes). This project keeps those concepts separate: the official card scale
describes the full dataset design, the HF display describes the currently
reported hosted file size, and this repo validates only the files that are
actually available to the project.

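
The relationship between the raw `usedStorage` byte count and the 31.9 TB display can be checked with a short conversion. This is a minimal sketch; it assumes the card's "about 1 PB" is a decimal petabyte (10^15 bytes), which the card does not specify.

```python
USED_STORAGE_BYTES = 31_871_115_497_224  # usedStorage value observed via the HF API

decimal_tb = USED_STORAGE_BYTES / 10**12  # terabytes, powers of ten
binary_tib = USED_STORAGE_BYTES / 2**40   # tebibytes, powers of two

# Assumption: "about 1 PB" on the official card means 10**15 bytes.
CARD_SCALE_BYTES = 10**15
hosted_fraction = USED_STORAGE_BYTES / CARD_SCALE_BYTES

print(f"{decimal_tb:.1f} TB, {binary_tib:.1f} TiB, "
      f"{hosted_fraction:.1%} of the card-described scale")
```

The byte count rounds to 31.9 in decimal terabytes and to about 29.0 in tebibytes, so the displayed figure matches the decimal convention. Under the stated assumption, the hosted size is roughly 3% of the storage the card describes.
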
## Public Sample Dataset Card

The public sample repo is
[`ropedia-ai/xperience-10m-sample`](https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample).
Its dataset card describes it as a sample episode for Xperience-10M and points
readers to HOMIE Toolkit for understanding the videos and annotations. It also
notes that an `.rrd` file can be opened with Rerun 0.29.0 to inspect the 3D/4D
structured annotations.

The sample card metadata observed for this project is:

| Field | Observed value |
| --- | --- |
| pretty name | `Xperience-10M-Sample` |
| license | `cc-by-nc-4.0` |
| tags | `sample`, `xperience-10k` |
| size category | `n<1K` |
| recommended toolkit | HOMIE Toolkit |
| visualization tool | Rerun 0.29.0 for `.rrd` |

This project uses the public sample to build the 5,821-frame / 1,161-window
task-development suite. The sample license and the full gated dataset terms are both
preserved in the public documentation; this repo's MIT code license does not
grant additional rights to the raw data.

## Episode File Layout

The official gated file listing and the public sample use episode folders with
this practical layout:

```text
<session_uuid>/
  ep<episode_id>/
    fisheye_cam0.mp4
    fisheye_cam1.mp4
    fisheye_cam2.mp4
    fisheye_cam3.mp4
    stereo_left.mp4
    stereo_right.mp4
    annotation.hdf5
    visualization.rrd  # optional viewer artifact; excluded from training downloads
```

For this repo, a valid training/evaluation episode requires `annotation.hdf5`.
Full-omni mode prefers all six MP4 streams. Degraded mode may use
`fisheye_cam0.mp4` plus the annotation file, but must record missing views in
the manifest. `visualization.rrd` is useful for human inspection in Rerun, but
it is excluded from training downloads and public artifact bundles.

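
The validity rules above can be sketched as a small folder check. This is an illustration under stated assumptions, not the repo's pilot-readiness script: `classify_episode`, the mode strings, and the manifest keys are hypothetical names, and the `annotation_only` case (annotation present but no `fisheye_cam0.mp4`) is an assumption because the rules above do not cover it.

```python
import tempfile
from pathlib import Path

VIDEO_STREAMS = (
    "fisheye_cam0.mp4", "fisheye_cam1.mp4", "fisheye_cam2.mp4", "fisheye_cam3.mp4",
    "stereo_left.mp4", "stereo_right.mp4",
)


def classify_episode(episode_dir):
    """Return a manifest-style record with the usable mode and any missing views."""
    episode_dir = Path(episode_dir)
    missing = [name for name in VIDEO_STREAMS if not (episode_dir / name).is_file()]
    if not (episode_dir / "annotation.hdf5").is_file():
        mode = "invalid"  # annotation.hdf5 is required for a valid episode
    elif not missing:
        mode = "full_omni"  # all six MP4 streams are present
    elif "fisheye_cam0.mp4" not in missing:
        mode = "degraded"  # cam0 plus annotation; missing views are recorded
    else:
        mode = "annotation_only"  # hypothetical label; not defined in the rules above
    return {"episode": episode_dir.name, "mode": mode, "missing_views": missing}


# Demo on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    full = Path(tmp) / "ep1"
    full.mkdir()
    for name in VIDEO_STREAMS + ("annotation.hdf5", "visualization.rrd"):
        (full / name).touch()

    partial = Path(tmp) / "ep2"
    partial.mkdir()
    for name in ("fisheye_cam0.mp4", "annotation.hdf5"):
        (partial / name).touch()

    broken = Path(tmp) / "ep3"
    broken.mkdir()
    (broken / "fisheye_cam0.mp4").touch()

    full_record = classify_episode(full)
    partial_record = classify_episode(partial)
    broken_record = classify_episode(broken)
```

Note that `visualization.rrd` plays no part in the decision, which matches its role as an optional viewer artifact.
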
## Annotation File Content

The official card describes the HDF5 annotation file as carrying aligned
multimodal records. The relevant groups include:

- calibration: camera intrinsics/extrinsics for fisheye and stereo cameras
- SLAM/camera pose: quaternions, translations, frame names, and point cloud
- depth: depth map, confidence, scale, min/max, and validity metadata
- hand motion capture: left/right hand joints, translations, and MANO-related records
- full-body motion capture: body keypoints, contacts, transforms, and body rotations
- IMU: timestamps, accelerometer, gyroscope, and keyframe metadata
- video timing: timestamps, frame numbers, and video duration
- language/caption annotations and metadata

This repo's current 8,546-d feature vector uses video-derived statistics,
audio, depth, pose/SLAM, calibration, mocap, IMU, and language-derived
blocks.

## Intended Research Uses

The official dataset card supports research directions such as:

- egocentric video/action understanding
- task and subtask recognition
- temporal action localization and human-object interaction analysis
- action-language grounding and action captioning
- object grounding and caption/language grounding
- audio-visual learning and multimodal pretraining
- embodied reasoning, world-model learning, and robotics imitation learning
- depth estimation, visual odometry, camera trajectory, SLAM, and scene reconstruction
- hand/body pose, human motion understanding, and sensor fusion

This repo currently implements a single-episode task suite that starts several
of those directions, but it does not solve the full official task list. The 12
current tasks cover action/subtask labels, next-action prediction, transition
and temporal diagnostics, hand trajectory forecasting, contact prediction,
object relevance, caption grounding, cross-modal retrieval, modality
reconstruction, and misalignment detection. Missing or only-proxy coverage
includes real audio-visual modeling, full caption generation, depth-pixel
estimation, full SLAM estimation, neural rendering, policy learning, and
cross-episode generalization.

## Responsible Use and Scope

The official dataset is gated and intended for approved non-commercial research
use, while the public sample card lists `cc-by-nc-4.0`. This repo therefore
does not redistribute raw MP4 files, raw `annotation.hdf5`, private gated data,
raw `visualization.rrd`, or any full Qwen weights. Public assets here are
derived metrics, small thumbnails, manifests, scripts, charts, and lightweight
baseline artifacts.

The official card also makes clear that the data is not meant for identity
recognition, re-identification, biometric profiling, surveillance, sensitive
attribute inference, or safety-critical deployment without appropriate
safeguards. It also describes the open-source dataset as limited in diversity
and showcase/production quality, so downstream work still needs robust
evaluation and safeguards.

## Limitations To Preserve In This Project

When describing Xperience-10M in this repo, keep these limitations visible:

- one public sample episode cannot prove cross-environment generalization
- full-dataset claims require gated access, many episodes, and held-out episode splits
- motion capture, SLAM, depth, captions, and other annotations can contain noise
- language annotations are not exhaustive descriptions of every scene state
- large-scale training requires substantial storage, preprocessing, and compute
- the current feature vector includes compact audio features, while
  larger audio-visual representation learning remains a multi-episode milestone

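
The held-out split requirement in the list above could be met in several ways. The sketch below shows one illustrative option, not a method this repo has implemented: it assigns whole sessions to a split by hashing the session id, so episodes from one session never land on both sides. Grouping by session rather than by episode is an assumption, and `assign_split`, `split_episodes`, and the 20% default are hypothetical.

```python
import hashlib


def assign_split(session_id, holdout_fraction=0.2):
    """Deterministically map a session id to 'heldout' or 'train'."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "heldout" if bucket < holdout_fraction else "train"


def split_episodes(episodes, holdout_fraction=0.2):
    """Group (session_id, episode_id) pairs by the split of their session."""
    splits = {"train": [], "heldout": []}
    for session_id, episode_id in episodes:
        split = assign_split(session_id, holdout_fraction)
        splits[split].append((session_id, episode_id))
    return splits


# Synthetic ids for illustration only.
demo_episodes = [(f"session-{s}", f"ep{e}") for s in range(10) for e in range(3)]
demo_splits = split_episodes(demo_episodes)
train_sessions = {s for s, _ in demo_splits["train"]}
heldout_sessions = {s for s, _ in demo_splits["heldout"]}
```

Because the assignment depends only on the session id, it stays stable when new sessions are added, which keeps earlier held-out results comparable.
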
## Current Project Alignment

| Official dataset card concept | Current repo status |
| --- | --- |
| Full Xperience-10M is large, gated, and multi-episode | Acknowledged; not redistributed |
| HF API lists many gated episode paths | Recorded as upstream metadata, not local possession |
| Public sample repo is `cc-by-nc-4.0` and points to HOMIE/Rerun | Preserved in data notice and reproducibility docs |
| Public sample includes video/audio/depth/pose/mocap/IMU/language | Represented in the modality atlas |
| Episode layout uses six MP4 streams and `annotation.hdf5` | Used by sample inspection and pilot-readiness scripts |
| Audio exists in MP4 streams | Represented in the current multimodal feature contract |
| 4D reconstruction/world modeling are intended research directions | Represented by proxy/diagnostic tasks only |
| Real model quality requires held-out multi-episode evaluation | Pending selected multi-episode data preparation, training, and evaluation |