This is an all-modality lightweight baseline.
All inputs (RGB, stereo, fisheye, depth, point cloud, calibration, text) are compressed into handcrafted features.
It is not a deep multimodal model.
Random windows drawn from one episode come from the same recording, so do not treat results on them as a final generalization benchmark.
Label text was not included as input; only objects and interaction text were used.
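
To illustrate the caveat about random windows, here is a minimal sketch of an episode-level split, in which every window of a held-out episode goes to the test set. The function name `split_by_episode`, the `test_fraction` parameter, and the episode IDs are hypothetical and are not part of this baseline's code.

```python
import random


def split_by_episode(episode_ids, test_fraction=0.2, seed=0):
    """Split window indices so that no episode appears in both sets.

    episode_ids[i] is the (hypothetical) episode ID of window i.
    """
    episodes = sorted(set(episode_ids))
    if len(episodes) < 2:
        # With a single episode, an episode-level split is impossible.
        raise ValueError("need at least two episodes for an episode-level split")

    # Shuffle episodes (not windows) with a fixed seed for reproducibility.
    random.Random(seed).shuffle(episodes)

    # Hold out at least one episode, but always leave one for training.
    n_test = max(1, round(len(episodes) * test_fraction))
    n_test = min(n_test, len(episodes) - 1)
    test_episodes = set(episodes[:n_test])

    train_idx = [i for i, ep in enumerate(episode_ids) if ep not in test_episodes]
    test_idx = [i for i, ep in enumerate(episode_ids) if ep in test_episodes]
    return train_idx, test_idx


# Hypothetical example: 12 windows drawn from 3 episodes.
window_episode = ["ep_a"] * 4 + ["ep_b"] * 3 + ["ep_c"] * 5
train_idx, test_idx = split_by_episode(window_episode, test_fraction=0.34)
```

With only one episode available the function raises an error instead of splitting, which matches the caveat: a score on windows from a single episode is not a generalization estimate.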