This is a lightweight baseline that covers all modalities.
RGB, stereo, fisheye, depth, point-cloud, calibration, and text inputs are compressed into handcrafted features.
It is not a deep multimodal model.
Do not treat random windows drawn from a single episode as a final generalization benchmark.
Label text was not used as input; only object and interaction text were used.
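
The caution about random windows can be made concrete with a minimal sketch of an episode-level split, in which no episode contributes windows to both train and test. This is an illustration, not this baseline's evaluation code: the function name `split_by_episode`, the `(episode_id, window)` pair layout, and the default `test_fraction` are all assumptions.

```python
import random


def split_by_episode(windows, test_fraction=0.2, seed=0):
    """Split windows so that no episode appears in both train and test.

    `windows` is a list of (episode_id, window) pairs. This layout is a
    hypothetical one chosen for illustration, not the baseline's format.
    """
    episode_ids = sorted({episode_id for episode_id, _ in windows})
    if len(episode_ids) < 2:
        # With one episode, any split only measures within-episode performance.
        raise ValueError("need at least two episodes for a held-out split")

    # Shuffle episode ids (not windows) so whole episodes are held out.
    rng = random.Random(seed)
    rng.shuffle(episode_ids)
    n_test = max(1, int(len(episode_ids) * test_fraction))
    test_ids = set(episode_ids[:n_test])

    train = [pair for pair in windows if pair[0] not in test_ids]
    test = [pair for pair in windows if pair[0] in test_ids]
    return train, test
```

With data from a single episode the function raises instead of returning a split, which mirrors the note above: such data cannot support a held-out generalization estimate.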