Add pipeline tag, links to paper/code/project, and improve documentation

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +23 -28
README.md CHANGED
@@ -1,43 +1,38 @@
1
  ---
 
2
  tags:
3
  - robot manipulation
4
  - multi-modal perception
5
  - vision-language-action
6
  ---
7
 
8
- # UniLACT
9
 
10
- UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models.
11
 
12
  ## Abstract
13
- Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for
14
- pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived
15
- solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure,
16
- which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UNILACT, a
17
- transformer-based VLA model that incorporates geometric
18
- structure through depth-aware latent pretraining, enabling
19
- downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UNILARN, a unified latent action
20
- learning framework based on inverse and forward dynamics
21
- objectives that learns a shared embedding space for RGB and
22
- depth while explicitly modeling their cross-modal interactions.
23
- This formulation produces modality-specific and unified latent
24
- action representations that serve as pseudo-labels for the depth-aware pretraining of UNILACT. Extensive experiments in both
25
- simulation and real-world settings demonstrate the effectiveness
26
- of depth-aware unified latent action representations. UNILACT
27
- consistently outperforms RGB-based latent action baselines
28
- under in-domain and out-of-domain pretraining regimes, as
29
- well as on both seen and unseen manipulation tasks.
30
 
31
 
32
  ## Citation
33
 
34
  ```bibtex
35
- @misc{govind2026unilactdepthawarergblatent,
36
- title={UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models},
37
- author={Manish Kumar Govind and Dominick Reilly and Pu Wang and Srijan Das},
38
- year={2026},
39
- eprint={2602.20231},
40
- archivePrefix={arXiv},
41
- primaryClass={cs.RO},
42
- url={https://arxiv.org/abs/2602.20231}
43
- }
 
1
  ---
2
+ pipeline_tag: robotics
3
  tags:
4
  - robot manipulation
5
  - multi-modal perception
6
  - vision-language-action
7
  ---
8
 
9
+ # UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
10
 
11
+ [**Paper**](https://huggingface.co/papers/2602.20231) | [**Project Page**](https://manishgovind.github.io/unilact-vla/) | [**Code**](https://github.com/ManishGovind/UniLACT)
12
+
13
+ UniLACT is a transformer-based Vision-Language-Action (VLA) model that incorporates 3D geometric structure through depth-aware latent pretraining. By utilizing UniLARN, a unified latent action learning framework, the model learns a shared embedding space for RGB and depth, enabling downstream policies to inherit stronger spatial priors for precise and contact-rich robot manipulation.
14
 
15
  ## Abstract
16
+ Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure. To address this limitation, we introduce UniLACT, which incorporates geometric structure through depth-aware latent pretraining. Our proposed UniLARN framework learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. Extensive experiments demonstrate that UniLACT consistently outperforms RGB-based latent action baselines under both in-domain and out-of-domain pretraining regimes.
17
+
18
+ ## Setup
19
+
20
+ ```bash
21
+ conda create -n unilact python=3.10 -y
22
+ conda activate unilact
23
 
24
+ git clone https://github.com/ManishGovind/UniLACT.git
25
+ cd UniLACT
26
+ pip install -r requirements.txt
27
+ ```
28
 
29
  ## Citation
30
 
31
  ```bibtex
32
+ @article{govind2026unilactdepthawarergblatent,
33
+ title= {UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models},
34
+ author= {Manish Kumar Govind and Dominick Reilly and Pu Wang and Srijan Das},
35
+ journal={arXiv preprint arXiv:2602.20231},
36
+ year={2026}
37
+ }
38
+ ```