This is an all-modality lightweight baseline.
RGB/stereo/fisheye/depth/point-cloud/calibration/text are compressed into handcrafted features.
It is not a deep multimodal model.
Do not treat random windows from one episode as a final generalization benchmark.
Label text was not included as input; only objects and interaction text were used.