This is an all-modality lightweight baseline. RGB/stereo/fisheye/depth/point-cloud/calibration/text are compressed into handcrafted features. It is not a deep multimodal model. Do not treat random windows from one episode as a final generalization benchmark. Label text was not included as input; only objects and interaction text were used.