{"id": "xperience-10m-sample:qa:0", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 0, "end_frame": 19, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 0, "context_end_frame": 119, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00000_ctx0_119_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00000_ctx0_119_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 0, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pick up kettle", "subtask": "Pick up and position kettle", "objects": ["kettle", "coffee dripper", "scale", "bottle"], "contact": "yes", "transition": "no", "next_action": "Pick up kettle", "evidence_window": {"start_frame": 0, "end_frame": 19}}, "label": "Pick up kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00000_ctx0_119_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00000_ctx0_119_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 0-19\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pick up kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 19, \"start_frame\": 0}, \"next_action\": \"Pick up kettle\", \"objects\": [\"kettle\", \"coffee dripper\", \"scale\", \"bottle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:1", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 20, "end_frame": 39, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 0, "context_end_frame": 119, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00001_ctx0_119_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00001_ctx0_119_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 1, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pick up kettle", "subtask": "Pick up and position kettle", "objects": ["kettle", "coffee dripper", "scale", "bottle"], "contact": "yes", "transition": "no", "next_action": "Pick up kettle", "evidence_window": {"start_frame": 20, "end_frame": 39}}, "label": "Pick up kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00001_ctx0_119_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00001_ctx0_119_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 20-39\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pick up kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 39, \"start_frame\": 20}, \"next_action\": \"Pick up kettle\", \"objects\": [\"kettle\", \"coffee dripper\", \"scale\", \"bottle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:2", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 40, "end_frame": 59, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 0, "context_end_frame": 119, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00002_ctx0_119_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00002_ctx0_119_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 2, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pick up kettle", "subtask": "Pick up and position kettle", "objects": ["kettle", "coffee dripper", "scale", "bottle"], "contact": "yes", "transition": "no", "next_action": "Pick up kettle", "evidence_window": {"start_frame": 40, "end_frame": 59}}, "label": "Pick up kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00002_ctx0_119_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00002_ctx0_119_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 40-59\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pick up kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 59, \"start_frame\": 40}, \"next_action\": \"Pick up kettle\", \"objects\": [\"kettle\", \"coffee dripper\", \"scale\", \"bottle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:3", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 60, "end_frame": 79, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 9, "context_end_frame": 128, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00003_ctx9_128_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00003_ctx9_128_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 3, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pick up kettle", "subtask": "Pick up and position kettle", "objects": ["kettle", "coffee dripper", "scale", "bottle"], "contact": "yes", "transition": "no", "next_action": "Pick up kettle", "evidence_window": {"start_frame": 60, "end_frame": 79}}, "label": "Pick up kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00003_ctx9_128_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00003_ctx9_128_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 60-79\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pick up kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 79, \"start_frame\": 60}, \"next_action\": \"Pick up kettle\", \"objects\": [\"kettle\", \"coffee dripper\", \"scale\", \"bottle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:4", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 80, "end_frame": 99, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 29, "context_end_frame": 148, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00004_ctx29_148_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00004_ctx29_148_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 4, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pick up kettle", "subtask": "Pick up and position kettle", "objects": ["kettle", "coffee dripper", "scale", "bottle"], "contact": "yes", "transition": "no", "next_action": "Pick up kettle", "evidence_window": {"start_frame": 80, "end_frame": 99}}, "label": "Pick up kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00004_ctx29_148_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00004_ctx29_148_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 80-99\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pick up kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 99, \"start_frame\": 80}, \"next_action\": \"Pick up kettle\", \"objects\": [\"kettle\", \"coffee dripper\", \"scale\", \"bottle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:5", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 100, "end_frame": 119, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 49, "context_end_frame": 168, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00005_ctx49_168_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00005_ctx49_168_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 5, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pick up kettle", "subtask": "Pick up and position kettle", "objects": ["kettle", "coffee dripper", "scale", "bottle"], "contact": "yes", "transition": "no", "next_action": "Pick up kettle", "evidence_window": {"start_frame": 100, "end_frame": 119}}, "label": "Pick up kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00005_ctx49_168_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00005_ctx49_168_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 100-119\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pick up kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 119, \"start_frame\": 100}, \"next_action\": \"Pick up kettle\", \"objects\": [\"kettle\", \"coffee dripper\", \"scale\", \"bottle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:6", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 120, "end_frame": 139, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 69, "context_end_frame": 188, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00006_ctx69_188_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00006_ctx69_188_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 6, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pick up kettle", "subtask": "Pick up and position kettle", "objects": ["kettle"], "contact": "yes", "transition": "no", "next_action": "Pick up kettle", "evidence_window": {"start_frame": 120, "end_frame": 139}}, "label": "Pick up kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00006_ctx69_188_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00006_ctx69_188_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 120-139\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pick up kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 139, \"start_frame\": 120}, \"next_action\": \"Pick up kettle\", \"objects\": [\"kettle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:7", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 140, "end_frame": 159, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 89, "context_end_frame": 208, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00007_ctx89_208_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00007_ctx89_208_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 7, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pick up kettle", "subtask": "Pick up and position kettle", "objects": ["kettle"], "contact": "yes", "transition": "no", "next_action": "Position kettle to pour", "evidence_window": {"start_frame": 140, "end_frame": 159}}, "label": "Pick up kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00007_ctx89_208_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00007_ctx89_208_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 140-159\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pick up kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 159, \"start_frame\": 140}, \"next_action\": \"Position kettle to pour\", \"objects\": [\"kettle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:8", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 160, "end_frame": 179, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 109, "context_end_frame": 228, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00008_ctx109_228_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00008_ctx109_228_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 8, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Position kettle to pour", "subtask": "Pick up and position kettle", "objects": ["kettle"], "contact": "yes", "transition": "no", "next_action": "Position kettle to pour", "evidence_window": {"start_frame": 160, "end_frame": 179}}, "label": "Position kettle to pour", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00008_ctx109_228_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00008_ctx109_228_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 160-179\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Position kettle to pour\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 179, \"start_frame\": 160}, \"next_action\": \"Position kettle to pour\", \"objects\": [\"kettle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:9", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 180, "end_frame": 199, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 129, "context_end_frame": 248, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00009_ctx129_248_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00009_ctx129_248_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 9, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Position kettle to pour", "subtask": "Pick up and position kettle", "objects": ["kettle"], "contact": "yes", "transition": "no", "next_action": "Position kettle to pour", "evidence_window": {"start_frame": 180, "end_frame": 199}}, "label": "Position kettle to pour", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00009_ctx129_248_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00009_ctx129_248_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 180-199\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Position kettle to pour\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 199, \"start_frame\": 180}, \"next_action\": \"Position kettle to pour\", \"objects\": [\"kettle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:10", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 200, "end_frame": 219, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 149, "context_end_frame": 268, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00010_ctx149_268_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00010_ctx149_268_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 10, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Position kettle to pour", "subtask": "Pick up and position kettle", "objects": ["kettle"], "contact": "yes", "transition": "no", "next_action": "Position kettle to pour", "evidence_window": {"start_frame": 200, "end_frame": 219}}, "label": "Position kettle to pour", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00010_ctx149_268_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00010_ctx149_268_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 200-219\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Position kettle to pour\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 219, \"start_frame\": 200}, \"next_action\": \"Position kettle to pour\", \"objects\": [\"kettle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:11", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 220, "end_frame": 239, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 169, "context_end_frame": 288, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00011_ctx169_288_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00011_ctx169_288_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 11, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Position kettle to pour", "subtask": "Pick up and position kettle", "objects": ["kettle"], "contact": "yes", "transition": "no", "next_action": "Position kettle to pour", "evidence_window": {"start_frame": 220, "end_frame": 239}}, "label": "Position kettle to pour", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00011_ctx169_288_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00011_ctx169_288_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 220-239\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Position kettle to pour\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 239, \"start_frame\": 220}, \"next_action\": \"Position kettle to pour\", \"objects\": [\"kettle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:12", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 240, "end_frame": 259, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 189, "context_end_frame": 308, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00012_ctx189_308_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00012_ctx189_308_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 12, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Position kettle to pour", "subtask": "Pick up and position kettle", "objects": ["kettle"], "contact": "yes", "transition": "no", "next_action": "Position kettle to pour", "evidence_window": {"start_frame": 240, "end_frame": 259}}, "label": "Position kettle to pour", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00012_ctx189_308_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00012_ctx189_308_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 240-259\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Position kettle to pour\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 259, \"start_frame\": 240}, \"next_action\": \"Position kettle to pour\", \"objects\": [\"kettle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:13", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 260, "end_frame": 279, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 209, "context_end_frame": 328, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00013_ctx209_328_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00013_ctx209_328_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 13, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Position kettle to pour", "subtask": "Pick up and position kettle", "objects": ["kettle"], "contact": "yes", "transition": "no", "next_action": "Position kettle to pour", "evidence_window": {"start_frame": 260, "end_frame": 279}}, "label": "Position kettle to pour", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00013_ctx209_328_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00013_ctx209_328_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 260-279\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Position kettle to pour\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 279, \"start_frame\": 260}, \"next_action\": \"Position kettle to pour\", \"objects\": [\"kettle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:14", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 280, "end_frame": 299, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 229, "context_end_frame": 348, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00014_ctx229_348_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00014_ctx229_348_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 14, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Position kettle to pour", "subtask": "Pick up and position kettle", "objects": ["kettle"], "contact": "yes", "transition": "no", "next_action": "Position kettle to pour", "evidence_window": {"start_frame": 280, "end_frame": 299}}, "label": "Position kettle to pour", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00014_ctx229_348_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00014_ctx229_348_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 280-299\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Position kettle to pour\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 299, \"start_frame\": 280}, \"next_action\": \"Position kettle to pour\", \"objects\": [\"kettle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:15", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 300, "end_frame": 319, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 249, "context_end_frame": 368, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00015_ctx249_368_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00015_ctx249_368_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 15, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Position kettle to pour", "subtask": "Pick up and position kettle", "objects": ["kettle"], "contact": "yes", "transition": "no", "next_action": "Position kettle to pour", "evidence_window": {"start_frame": 300, "end_frame": 319}}, "label": "Position kettle to pour", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00015_ctx249_368_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00015_ctx249_368_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 300-319\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Position kettle to pour\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 319, \"start_frame\": 300}, \"next_action\": \"Position kettle to pour\", \"objects\": [\"kettle\"], \"subtask\": \"Pick up and position kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:40", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 800, "end_frame": 819, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 749, "context_end_frame": 868, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00040_ctx749_868_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00040_ctx749_868_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 40, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle", "subtask": "Move kettle", "objects": ["kettle", "dripper", "scale", "coffee filter", "table"], "contact": "yes", "transition": "no", "next_action": "Move kettle", "evidence_window": {"start_frame": 800, "end_frame": 819}}, "label": "Move kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00040_ctx749_868_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00040_ctx749_868_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 800-819\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 819, \"start_frame\": 800}, \"next_action\": \"Move kettle\", \"objects\": [\"kettle\", \"dripper\", \"scale\", \"coffee filter\", \"table\"], \"subtask\": \"Move kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:41", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 820, "end_frame": 839, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 769, "context_end_frame": 888, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00041_ctx769_888_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00041_ctx769_888_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 41, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle", "subtask": "Move kettle", "objects": ["kettle", "dripper", "scale", "coffee filter", "table"], "contact": "yes", "transition": "no", "next_action": "Move kettle", "evidence_window": {"start_frame": 820, "end_frame": 839}}, "label": "Move kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00041_ctx769_888_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00041_ctx769_888_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 820-839\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 839, \"start_frame\": 820}, \"next_action\": \"Move kettle\", \"objects\": [\"kettle\", \"dripper\", \"scale\", \"coffee filter\", \"table\"], \"subtask\": \"Move kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:42", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 840, "end_frame": 859, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 789, "context_end_frame": 908, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00042_ctx789_908_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00042_ctx789_908_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 42, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle", "subtask": "Move kettle", "objects": ["kettle", "dripper", "scale", "coffee filter", "table"], "contact": "yes", "transition": "no", "next_action": "Move kettle", "evidence_window": {"start_frame": 840, "end_frame": 859}}, "label": "Move kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00042_ctx789_908_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00042_ctx789_908_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 840-859\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 859, \"start_frame\": 840}, \"next_action\": \"Move kettle\", \"objects\": [\"kettle\", \"dripper\", \"scale\", \"coffee filter\", \"table\"], \"subtask\": \"Move kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:43", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 860, "end_frame": 879, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 809, "context_end_frame": 928, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00043_ctx809_928_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00043_ctx809_928_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 43, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle", "subtask": "Move kettle", "objects": ["kettle", "dripper", "scale", "coffee filter", "table"], "contact": "yes", "transition": "no", "next_action": "Move kettle", "evidence_window": {"start_frame": 860, "end_frame": 879}}, "label": "Move kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00043_ctx809_928_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00043_ctx809_928_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 860-879\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 879, \"start_frame\": 860}, \"next_action\": \"Move kettle\", \"objects\": [\"kettle\", \"dripper\", \"scale\", \"coffee filter\", \"table\"], \"subtask\": \"Move kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:44", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 880, "end_frame": 899, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 829, "context_end_frame": 948, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00044_ctx829_948_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00044_ctx829_948_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 44, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle", "subtask": "Move kettle", "objects": ["kettle", "dripper", "scale", "coffee filter", "table"], "contact": "yes", "transition": "no", "next_action": "Move kettle", "evidence_window": {"start_frame": 880, "end_frame": 899}}, "label": "Move kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00044_ctx829_948_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00044_ctx829_948_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 880-899\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 899, \"start_frame\": 880}, \"next_action\": \"Move kettle\", \"objects\": [\"kettle\", \"dripper\", \"scale\", \"coffee filter\", \"table\"], \"subtask\": \"Move kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:45", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 900, "end_frame": 919, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 849, "context_end_frame": 968, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00045_ctx849_968_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00045_ctx849_968_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 45, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle", "subtask": "Move kettle", "objects": ["kettle", "dripper", "scale", "coffee filter", "table"], "contact": "yes", "transition": "no", "next_action": "Move kettle", "evidence_window": {"start_frame": 900, "end_frame": 919}}, "label": "Move kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00045_ctx849_968_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00045_ctx849_968_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 900-919\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 919, \"start_frame\": 900}, \"next_action\": \"Move kettle\", \"objects\": [\"kettle\", \"dripper\", \"scale\", \"coffee filter\", \"table\"], \"subtask\": \"Move kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:46", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 920, "end_frame": 939, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 869, "context_end_frame": 988, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00046_ctx869_988_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00046_ctx869_988_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 46, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle", "subtask": "Move kettle", "objects": ["kettle", "dripper", "scale", "coffee filter", "table"], "contact": "yes", "transition": "no", "next_action": "Move kettle", "evidence_window": {"start_frame": 920, "end_frame": 939}}, "label": "Move kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00046_ctx869_988_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00046_ctx869_988_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 920-939\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 939, \"start_frame\": 920}, \"next_action\": \"Move kettle\", \"objects\": [\"kettle\", \"dripper\", \"scale\", \"coffee filter\", \"table\"], \"subtask\": \"Move kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:47", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 940, "end_frame": 959, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 889, "context_end_frame": 1008, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00047_ctx889_1008_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00047_ctx889_1008_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 47, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle", "subtask": "Move kettle", "objects": ["kettle", "dripper", "scale", "coffee filter", "table"], "contact": "yes", "transition": "no", "next_action": "Move kettle", "evidence_window": {"start_frame": 940, "end_frame": 959}}, "label": "Move kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00047_ctx889_1008_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00047_ctx889_1008_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 940-959\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 959, \"start_frame\": 940}, \"next_action\": \"Move kettle\", \"objects\": [\"kettle\", \"dripper\", \"scale\", \"coffee filter\", \"table\"], \"subtask\": \"Move kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:50", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1000, "end_frame": 1019, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 949, "context_end_frame": 1068, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00050_ctx949_1068_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00050_ctx949_1068_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 50, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold coffee carafe", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "coffee scale", "coffee dripper", "white mug", "bottle"], "contact": "yes", "transition": "no", "next_action": "Hold coffee carafe", "evidence_window": {"start_frame": 1000, "end_frame": 1019}}, "label": "Hold coffee carafe", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00050_ctx949_1068_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00050_ctx949_1068_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1000-1019\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold coffee carafe\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1019, \"start_frame\": 1000}, \"next_action\": \"Hold coffee carafe\", \"objects\": [\"glass carafe\", \"coffee scale\", \"coffee dripper\", \"white mug\", \"bottle\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:51", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1020, "end_frame": 1039, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 969, "context_end_frame": 1088, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00051_ctx969_1088_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00051_ctx969_1088_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 51, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold coffee carafe", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "coffee scale", "coffee dripper", "white mug", "bottle"], "contact": "yes", "transition": "no", "next_action": "Hold coffee carafe", "evidence_window": {"start_frame": 1020, "end_frame": 1039}}, "label": "Hold coffee carafe", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00051_ctx969_1088_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00051_ctx969_1088_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1020-1039\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold coffee carafe\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1039, \"start_frame\": 1020}, \"next_action\": \"Hold coffee carafe\", \"objects\": [\"glass carafe\", \"coffee scale\", \"coffee dripper\", \"white mug\", \"bottle\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:52", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1040, "end_frame": 1059, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 989, "context_end_frame": 1108, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00052_ctx989_1108_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00052_ctx989_1108_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 52, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold coffee carafe", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "coffee scale", "coffee dripper", "white mug", "bottle"], "contact": "yes", "transition": "no", "next_action": "Hold coffee carafe", "evidence_window": {"start_frame": 1040, "end_frame": 1059}}, "label": "Hold coffee carafe", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00052_ctx989_1108_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00052_ctx989_1108_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1040-1059\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold coffee carafe\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1059, \"start_frame\": 1040}, \"next_action\": \"Hold coffee carafe\", \"objects\": [\"glass carafe\", \"coffee scale\", \"coffee dripper\", \"white mug\", \"bottle\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:53", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1060, "end_frame": 1079, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1009, "context_end_frame": 1128, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00053_ctx1009_1128_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00053_ctx1009_1128_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 53, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold coffee carafe", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "coffee scale", "coffee dripper", "white mug", "bottle"], "contact": "yes", "transition": "no", "next_action": "Hold coffee carafe", "evidence_window": {"start_frame": 1060, "end_frame": 1079}}, "label": "Hold coffee carafe", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00053_ctx1009_1128_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00053_ctx1009_1128_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1060-1079\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold coffee carafe\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1079, \"start_frame\": 1060}, \"next_action\": \"Hold coffee carafe\", \"objects\": [\"glass carafe\", \"coffee scale\", \"coffee dripper\", \"white mug\", \"bottle\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:54", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1080, "end_frame": 1099, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1029, "context_end_frame": 1148, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00054_ctx1029_1148_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00054_ctx1029_1148_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 54, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold coffee carafe", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "coffee scale", "coffee dripper", "white mug", "bottle"], "contact": "yes", "transition": "no", "next_action": "Hold coffee carafe", "evidence_window": {"start_frame": 1080, "end_frame": 1099}}, "label": "Hold coffee carafe", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00054_ctx1029_1148_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00054_ctx1029_1148_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1080-1099\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold coffee carafe\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1099, \"start_frame\": 1080}, \"next_action\": \"Hold coffee carafe\", \"objects\": [\"glass carafe\", \"coffee scale\", \"coffee dripper\", \"white mug\", \"bottle\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:55", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1100, "end_frame": 1119, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1049, "context_end_frame": 1168, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00055_ctx1049_1168_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00055_ctx1049_1168_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 55, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold coffee carafe", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "coffee scale", "coffee dripper", "white mug", "bottle"], "contact": "yes", "transition": "no", "next_action": "Hold coffee carafe", "evidence_window": {"start_frame": 1100, "end_frame": 1119}}, "label": "Hold coffee carafe", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00055_ctx1049_1168_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00055_ctx1049_1168_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1100-1119\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold coffee carafe\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1119, \"start_frame\": 1100}, \"next_action\": \"Hold coffee carafe\", \"objects\": [\"glass carafe\", \"coffee scale\", \"coffee dripper\", \"white mug\", \"bottle\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:56", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1120, "end_frame": 1139, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1069, "context_end_frame": 1188, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00056_ctx1069_1188_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00056_ctx1069_1188_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 56, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold coffee carafe", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "coffee scale", "coffee dripper", "white mug", "bottle"], "contact": "yes", "transition": "no", "next_action": "Hold coffee carafe", "evidence_window": {"start_frame": 1120, "end_frame": 1139}}, "label": "Hold coffee carafe", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00056_ctx1069_1188_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00056_ctx1069_1188_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1120-1139\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold coffee carafe\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1139, \"start_frame\": 1120}, \"next_action\": \"Hold coffee carafe\", \"objects\": [\"glass carafe\", \"coffee scale\", \"coffee dripper\", \"white mug\", \"bottle\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:57", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1140, "end_frame": 1159, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1089, "context_end_frame": 1208, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00057_ctx1089_1208_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00057_ctx1089_1208_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 57, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold coffee carafe", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "coffee scale", "coffee dripper", "white mug", "bottle"], "contact": "yes", "transition": "no", "next_action": "Hold coffee carafe", "evidence_window": {"start_frame": 1140, "end_frame": 1159}}, "label": "Hold coffee carafe", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00057_ctx1089_1208_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00057_ctx1089_1208_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1140-1159\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold coffee carafe\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1159, \"start_frame\": 1140}, \"next_action\": \"Hold coffee carafe\", \"objects\": [\"glass carafe\", \"coffee scale\", \"coffee dripper\", \"white mug\", \"bottle\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:62", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1240, "end_frame": 1259, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1189, "context_end_frame": 1308, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00062_ctx1189_1308_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00062_ctx1189_1308_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 62, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Grasp coffee scoop", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "wooden scoop"], "contact": "yes", "transition": "no", "next_action": "Grasp coffee scoop", "evidence_window": {"start_frame": 1240, "end_frame": 1259}}, "label": "Grasp coffee scoop", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00062_ctx1189_1308_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00062_ctx1189_1308_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1240-1259\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Grasp coffee scoop\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1259, \"start_frame\": 1240}, \"next_action\": \"Grasp coffee scoop\", \"objects\": [\"glass carafe\", \"wooden scoop\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:63", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1260, "end_frame": 1279, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1209, "context_end_frame": 1328, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00063_ctx1209_1328_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00063_ctx1209_1328_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 63, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Grasp coffee scoop", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "wooden scoop"], "contact": "yes", "transition": "no", "next_action": "Grasp coffee scoop", "evidence_window": {"start_frame": 1260, "end_frame": 1279}}, "label": "Grasp coffee scoop", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00063_ctx1209_1328_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00063_ctx1209_1328_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1260-1279\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Grasp coffee scoop\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1279, \"start_frame\": 1260}, \"next_action\": \"Grasp coffee scoop\", \"objects\": [\"glass carafe\", \"wooden scoop\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:64", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1280, "end_frame": 1299, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1229, "context_end_frame": 1348, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00064_ctx1229_1348_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00064_ctx1229_1348_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 64, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Grasp coffee scoop", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "wooden scoop"], "contact": "yes", "transition": "no", "next_action": "Grasp coffee scoop", "evidence_window": {"start_frame": 1280, "end_frame": 1299}}, "label": "Grasp coffee scoop", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00064_ctx1229_1348_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00064_ctx1229_1348_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1280-1299\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Grasp coffee scoop\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1299, \"start_frame\": 1280}, \"next_action\": \"Grasp coffee scoop\", \"objects\": [\"glass carafe\", \"wooden scoop\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:65", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1300, "end_frame": 1319, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1249, "context_end_frame": 1368, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00065_ctx1249_1368_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00065_ctx1249_1368_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 65, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Grasp coffee scoop", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "wooden scoop"], "contact": "yes", "transition": "no", "next_action": "Grasp coffee scoop", "evidence_window": {"start_frame": 1300, "end_frame": 1319}}, "label": "Grasp coffee scoop", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00065_ctx1249_1368_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00065_ctx1249_1368_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1300-1319\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Grasp coffee scoop\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1319, \"start_frame\": 1300}, \"next_action\": \"Grasp coffee scoop\", \"objects\": [\"glass carafe\", \"wooden scoop\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:66", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1320, "end_frame": 1339, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1269, "context_end_frame": 1388, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00066_ctx1269_1388_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00066_ctx1269_1388_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 66, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Grasp coffee scoop", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "wooden scoop"], "contact": "yes", "transition": "no", "next_action": "Grasp coffee scoop", "evidence_window": {"start_frame": 1320, "end_frame": 1339}}, "label": "Grasp coffee scoop", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00066_ctx1269_1388_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00066_ctx1269_1388_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1320-1339\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Grasp coffee scoop\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1339, \"start_frame\": 1320}, \"next_action\": \"Grasp coffee scoop\", \"objects\": [\"glass carafe\", \"wooden scoop\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:67", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1340, "end_frame": 1359, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1289, "context_end_frame": 1408, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00067_ctx1289_1408_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00067_ctx1289_1408_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 67, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Grasp coffee scoop", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "wooden scoop"], "contact": "yes", "transition": "no", "next_action": "Grasp coffee scoop", "evidence_window": {"start_frame": 1340, "end_frame": 1359}}, "label": "Grasp coffee scoop", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00067_ctx1289_1408_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00067_ctx1289_1408_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1340-1359\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Grasp coffee scoop\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1359, \"start_frame\": 1340}, \"next_action\": \"Grasp coffee scoop\", \"objects\": [\"glass carafe\", \"wooden scoop\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:68", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1360, "end_frame": 1379, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1309, "context_end_frame": 1428, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00068_ctx1309_1428_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00068_ctx1309_1428_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 68, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Grasp coffee scoop", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "wooden scoop"], "contact": "yes", "transition": "no", "next_action": "Grasp coffee scoop", "evidence_window": {"start_frame": 1360, "end_frame": 1379}}, "label": "Grasp coffee scoop", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00068_ctx1309_1428_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00068_ctx1309_1428_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1360-1379\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Grasp coffee scoop\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1379, \"start_frame\": 1360}, \"next_action\": \"Grasp coffee scoop\", \"objects\": [\"glass carafe\", \"wooden scoop\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:69", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1380, "end_frame": 1399, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1329, "context_end_frame": 1448, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00069_ctx1329_1448_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00069_ctx1329_1448_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 69, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Grasp coffee scoop", "subtask": "Prepare coffee equipment and scoop grounds", "objects": ["glass carafe", "wooden scoop"], "contact": "yes", "transition": "no", "next_action": "Grasp coffee scoop", "evidence_window": {"start_frame": 1380, "end_frame": 1399}}, "label": "Grasp coffee scoop", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00069_ctx1329_1448_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00069_ctx1329_1448_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1380-1399\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Grasp coffee scoop\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1399, \"start_frame\": 1380}, \"next_action\": \"Grasp coffee scoop\", \"objects\": [\"glass carafe\", \"wooden scoop\"], \"subtask\": \"Prepare coffee equipment and scoop grounds\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:78", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1560, "end_frame": 1579, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1509, "context_end_frame": 1628, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00078_ctx1509_1628_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00078_ctx1509_1628_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 78, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Transfer coffee to dripper", "subtask": "Transfer coffee grounds to dripper", "objects": ["coffee jar", "coffee scoop"], "contact": "yes", "transition": "yes", "next_action": "Transfer coffee to dripper", "evidence_window": {"start_frame": 1560, "end_frame": 1579}}, "label": "Transfer coffee to dripper", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00078_ctx1509_1628_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00078_ctx1509_1628_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1560-1579\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Transfer coffee to dripper\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1579, \"start_frame\": 1560}, \"next_action\": \"Transfer coffee to dripper\", \"objects\": [\"coffee jar\", \"coffee scoop\"], \"subtask\": \"Transfer coffee grounds to dripper\", \"transition\": \"yes\"}"}]} {"id": "xperience-10m-sample:qa:79", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1580, "end_frame": 1599, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1529, "context_end_frame": 1648, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00079_ctx1529_1648_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00079_ctx1529_1648_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 79, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Transfer coffee to dripper", "subtask": "Transfer coffee grounds to dripper", "objects": ["coffee jar", "coffee scoop"], "contact": "yes", "transition": "no", "next_action": "Transfer coffee to dripper", "evidence_window": {"start_frame": 1580, "end_frame": 1599}}, "label": "Transfer coffee to dripper", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00079_ctx1529_1648_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00079_ctx1529_1648_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1580-1599\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Transfer coffee to dripper\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1599, \"start_frame\": 1580}, \"next_action\": \"Transfer coffee to dripper\", \"objects\": [\"coffee jar\", \"coffee scoop\"], \"subtask\": \"Transfer coffee grounds to dripper\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:80", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1600, "end_frame": 1619, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1549, "context_end_frame": 1668, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00080_ctx1549_1668_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00080_ctx1549_1668_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 80, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Transfer coffee to dripper", "subtask": "Transfer coffee grounds to dripper", "objects": ["coffee scoop", "dripper"], "contact": "yes", "transition": "no", "next_action": "Transfer coffee to dripper", "evidence_window": {"start_frame": 1600, "end_frame": 1619}}, "label": "Transfer coffee to dripper", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00080_ctx1549_1668_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00080_ctx1549_1668_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1600-1619\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Transfer coffee to dripper\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1619, \"start_frame\": 1600}, \"next_action\": \"Transfer coffee to dripper\", \"objects\": [\"coffee scoop\", \"dripper\"], \"subtask\": \"Transfer coffee grounds to dripper\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:81", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1620, "end_frame": 1639, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1569, "context_end_frame": 1688, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00081_ctx1569_1688_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00081_ctx1569_1688_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 81, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Transfer coffee to dripper", "subtask": "Transfer coffee grounds to dripper", "objects": ["coffee scoop", "dripper"], "contact": "yes", "transition": "no", "next_action": "Transfer coffee to dripper", "evidence_window": {"start_frame": 1620, "end_frame": 1639}}, "label": "Transfer coffee to dripper", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00081_ctx1569_1688_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00081_ctx1569_1688_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1620-1639\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Transfer coffee to dripper\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1639, \"start_frame\": 1620}, \"next_action\": \"Transfer coffee to dripper\", \"objects\": [\"coffee scoop\", \"dripper\"], \"subtask\": \"Transfer coffee grounds to dripper\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:82", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1640, "end_frame": 1659, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1589, "context_end_frame": 1708, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00082_ctx1589_1708_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00082_ctx1589_1708_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 82, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Transfer coffee to dripper", "subtask": "Transfer coffee grounds to dripper", "objects": ["coffee scoop", "dripper"], "contact": "yes", "transition": "no", "next_action": "Transfer coffee to dripper", "evidence_window": {"start_frame": 1640, "end_frame": 1659}}, "label": "Transfer coffee to dripper", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00082_ctx1589_1708_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00082_ctx1589_1708_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1640-1659\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Transfer coffee to dripper\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1659, \"start_frame\": 1640}, \"next_action\": \"Transfer coffee to dripper\", \"objects\": [\"coffee scoop\", \"dripper\"], \"subtask\": \"Transfer coffee grounds to dripper\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:83", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1660, "end_frame": 1679, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1609, "context_end_frame": 1728, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00083_ctx1609_1728_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00083_ctx1609_1728_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 83, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Transfer coffee to dripper", "subtask": "Transfer coffee grounds to dripper", "objects": ["coffee scoop", "dripper"], "contact": "yes", "transition": "no", "next_action": "Transfer coffee to dripper", "evidence_window": {"start_frame": 1660, "end_frame": 1679}}, "label": "Transfer coffee to dripper", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00083_ctx1609_1728_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00083_ctx1609_1728_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1660-1679\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Transfer coffee to dripper\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1679, \"start_frame\": 1660}, \"next_action\": \"Transfer coffee to dripper\", \"objects\": [\"coffee scoop\", \"dripper\"], \"subtask\": \"Transfer coffee grounds to dripper\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:84", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1680, "end_frame": 1699, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1629, "context_end_frame": 1748, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00084_ctx1629_1748_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00084_ctx1629_1748_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 84, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Transfer coffee to dripper", "subtask": "Transfer coffee grounds to dripper", "objects": ["coffee scoop", "dripper"], "contact": "yes", "transition": "no", "next_action": "Transfer coffee to dripper", "evidence_window": {"start_frame": 1680, "end_frame": 1699}}, "label": "Transfer coffee to dripper", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00084_ctx1629_1748_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00084_ctx1629_1748_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1680-1699\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Transfer coffee to dripper\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1699, \"start_frame\": 1680}, \"next_action\": \"Transfer coffee to dripper\", \"objects\": [\"coffee scoop\", \"dripper\"], \"subtask\": \"Transfer coffee grounds to dripper\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:85", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 1700, "end_frame": 1719, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1649, "context_end_frame": 1768, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00085_ctx1649_1768_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00085_ctx1649_1768_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 85, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Transfer coffee to dripper", "subtask": "Transfer coffee grounds to dripper", "objects": ["coffee scoop", "dripper"], "contact": "yes", "transition": "no", "next_action": "Transfer coffee to dripper", "evidence_window": {"start_frame": 1700, "end_frame": 1719}}, "label": "Transfer coffee to dripper", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00085_ctx1649_1768_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00085_ctx1649_1768_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 1700-1719\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Transfer coffee to dripper\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 1719, \"start_frame\": 1700}, \"next_action\": \"Transfer coffee to dripper\", \"objects\": [\"coffee scoop\", \"dripper\"], \"subtask\": \"Transfer coffee grounds to dripper\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:100", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2000, "end_frame": 2019, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1949, "context_end_frame": 2068, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00100_ctx1949_2068_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00100_ctx1949_2068_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 100, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold gooseneck kettle", "subtask": "Handle gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "digital scale", "water bottle", "coffee mug"], "contact": "yes", "transition": "no", "next_action": "Hold gooseneck kettle", "evidence_window": {"start_frame": 2000, "end_frame": 2019}}, "label": "Hold gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00100_ctx1949_2068_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00100_ctx1949_2068_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2000-2019\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2019, \"start_frame\": 2000}, \"next_action\": \"Hold gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"digital scale\", \"water bottle\", \"coffee mug\"], \"subtask\": \"Handle gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:101", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2020, "end_frame": 2039, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1969, "context_end_frame": 2088, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00101_ctx1969_2088_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00101_ctx1969_2088_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 101, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold gooseneck kettle", "subtask": "Handle gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "digital scale", "water bottle", "coffee mug"], "contact": "yes", "transition": "no", "next_action": "Hold gooseneck kettle", "evidence_window": {"start_frame": 2020, "end_frame": 2039}}, "label": "Hold gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00101_ctx1969_2088_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00101_ctx1969_2088_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2020-2039\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2039, \"start_frame\": 2020}, \"next_action\": \"Hold gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"digital scale\", \"water bottle\", \"coffee mug\"], \"subtask\": \"Handle gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:102", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2040, "end_frame": 2059, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 1989, "context_end_frame": 2108, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00102_ctx1989_2108_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00102_ctx1989_2108_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 102, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold gooseneck kettle", "subtask": "Handle gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "digital scale", "water bottle", "coffee mug"], "contact": "yes", "transition": "no", "next_action": "Hold gooseneck kettle", "evidence_window": {"start_frame": 2040, "end_frame": 2059}}, "label": "Hold gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00102_ctx1989_2108_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00102_ctx1989_2108_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2040-2059\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2059, \"start_frame\": 2040}, \"next_action\": \"Hold gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"digital scale\", \"water bottle\", \"coffee mug\"], \"subtask\": \"Handle gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:103", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2060, "end_frame": 2079, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2009, "context_end_frame": 2128, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00103_ctx2009_2128_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00103_ctx2009_2128_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 103, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold gooseneck kettle", "subtask": "Handle gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "digital scale", "water bottle", "coffee mug"], "contact": "yes", "transition": "no", "next_action": "Hold gooseneck kettle", "evidence_window": {"start_frame": 2060, "end_frame": 2079}}, "label": "Hold gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00103_ctx2009_2128_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00103_ctx2009_2128_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2060-2079\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2079, \"start_frame\": 2060}, \"next_action\": \"Hold gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"digital scale\", \"water bottle\", \"coffee mug\"], \"subtask\": \"Handle gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:104", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2080, "end_frame": 2099, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2029, "context_end_frame": 2148, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00104_ctx2029_2148_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00104_ctx2029_2148_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 104, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold gooseneck kettle", "subtask": "Handle gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "digital scale", "water bottle", "coffee mug"], "contact": "yes", "transition": "no", "next_action": "Hold gooseneck kettle", "evidence_window": {"start_frame": 2080, "end_frame": 2099}}, "label": "Hold gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00104_ctx2029_2148_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00104_ctx2029_2148_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2080-2099\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2099, \"start_frame\": 2080}, \"next_action\": \"Hold gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"digital scale\", \"water bottle\", \"coffee mug\"], \"subtask\": \"Handle gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:105", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2100, "end_frame": 2119, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2049, "context_end_frame": 2168, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00105_ctx2049_2168_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00105_ctx2049_2168_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 105, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold gooseneck kettle", "subtask": "Handle gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "digital scale", "water bottle", "coffee mug"], "contact": "yes", "transition": "no", "next_action": "Hold gooseneck kettle", "evidence_window": {"start_frame": 2100, "end_frame": 2119}}, "label": "Hold gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00105_ctx2049_2168_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00105_ctx2049_2168_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2100-2119\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2119, \"start_frame\": 2100}, \"next_action\": \"Hold gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"digital scale\", \"water bottle\", \"coffee mug\"], \"subtask\": \"Handle gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:106", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2120, "end_frame": 2139, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2069, "context_end_frame": 2188, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00106_ctx2069_2188_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00106_ctx2069_2188_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 106, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold gooseneck kettle", "subtask": "Handle gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "digital scale", "water bottle", "coffee mug"], "contact": "yes", "transition": "no", "next_action": "Hold gooseneck kettle", "evidence_window": {"start_frame": 2120, "end_frame": 2139}}, "label": "Hold gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00106_ctx2069_2188_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00106_ctx2069_2188_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2120-2139\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2139, \"start_frame\": 2120}, \"next_action\": \"Hold gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"digital scale\", \"water bottle\", \"coffee mug\"], \"subtask\": \"Handle gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:107", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2140, "end_frame": 2159, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2089, "context_end_frame": 2208, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00107_ctx2089_2208_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00107_ctx2089_2208_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 107, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Hold gooseneck kettle", "subtask": "Handle gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "digital scale", "water bottle", "coffee mug"], "contact": "yes", "transition": "no", "next_action": "Hold gooseneck kettle", "evidence_window": {"start_frame": 2140, "end_frame": 2159}}, "label": "Hold gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00107_ctx2089_2208_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00107_ctx2089_2208_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2140-2159\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Hold gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2159, \"start_frame\": 2140}, \"next_action\": \"Hold gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"digital scale\", \"water bottle\", \"coffee mug\"], \"subtask\": \"Handle gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:140", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2800, "end_frame": 2819, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2749, "context_end_frame": 2868, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00140_ctx2749_2868_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00140_ctx2749_2868_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 140, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Grasp gooseneck kettle", "subtask": "Lift gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "water bottle", "white cup"], "contact": "yes", "transition": "no", "next_action": "Grasp gooseneck kettle", "evidence_window": {"start_frame": 2800, "end_frame": 2819}}, "label": "Grasp gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00140_ctx2749_2868_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00140_ctx2749_2868_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2800-2819\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Grasp gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2819, \"start_frame\": 2800}, \"next_action\": \"Grasp gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"water bottle\", \"white cup\"], \"subtask\": \"Lift gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:141", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2820, "end_frame": 2839, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2769, "context_end_frame": 2888, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00141_ctx2769_2888_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00141_ctx2769_2888_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 141, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Grasp gooseneck kettle", "subtask": "Lift gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "water bottle", "white cup"], "contact": "yes", "transition": "no", "next_action": "Grasp gooseneck kettle", "evidence_window": {"start_frame": 2820, "end_frame": 2839}}, "label": "Grasp gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00141_ctx2769_2888_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00141_ctx2769_2888_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2820-2839\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Grasp gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2839, \"start_frame\": 2820}, \"next_action\": \"Grasp gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"water bottle\", \"white cup\"], \"subtask\": \"Lift gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:142", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2840, "end_frame": 2859, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2789, "context_end_frame": 2908, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00142_ctx2789_2908_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00142_ctx2789_2908_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 142, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Grasp gooseneck kettle", "subtask": "Lift gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "water bottle", "white cup"], "contact": "yes", "transition": "no", "next_action": "Grasp gooseneck kettle", "evidence_window": {"start_frame": 2840, "end_frame": 2859}}, "label": "Grasp gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00142_ctx2789_2908_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00142_ctx2789_2908_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2840-2859\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Grasp gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2859, \"start_frame\": 2840}, \"next_action\": \"Grasp gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"water bottle\", \"white cup\"], \"subtask\": \"Lift gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:143", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2860, "end_frame": 2879, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2809, "context_end_frame": 2928, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00143_ctx2809_2928_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00143_ctx2809_2928_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 143, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Grasp gooseneck kettle", "subtask": "Lift gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "water bottle", "white cup"], "contact": "yes", "transition": "no", "next_action": "Grasp gooseneck kettle", "evidence_window": {"start_frame": 2860, "end_frame": 2879}}, "label": "Grasp gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00143_ctx2809_2928_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00143_ctx2809_2928_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2860-2879\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Grasp gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2879, \"start_frame\": 2860}, \"next_action\": \"Grasp gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"water bottle\", \"white cup\"], \"subtask\": \"Lift gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:144", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2880, "end_frame": 2899, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2829, "context_end_frame": 2948, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00144_ctx2829_2948_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00144_ctx2829_2948_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 144, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Grasp gooseneck kettle", "subtask": "Lift gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "water bottle", "white cup"], "contact": "yes", "transition": "no", "next_action": "Grasp gooseneck kettle", "evidence_window": {"start_frame": 2880, "end_frame": 2899}}, "label": "Grasp gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00144_ctx2829_2948_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00144_ctx2829_2948_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2880-2899\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Grasp gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2899, \"start_frame\": 2880}, \"next_action\": \"Grasp gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"water bottle\", \"white cup\"], \"subtask\": \"Lift gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:145", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2900, "end_frame": 2919, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2849, "context_end_frame": 2968, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00145_ctx2849_2968_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00145_ctx2849_2968_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 145, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Grasp gooseneck kettle", "subtask": "Lift gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "water bottle", "white cup"], "contact": "yes", "transition": "no", "next_action": "Lift gooseneck kettle", "evidence_window": {"start_frame": 2900, "end_frame": 2919}}, "label": "Grasp gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00145_ctx2849_2968_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00145_ctx2849_2968_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2900-2919\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Grasp gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2919, \"start_frame\": 2900}, \"next_action\": \"Lift gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"water bottle\", \"white cup\"], \"subtask\": \"Lift gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:146", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2920, "end_frame": 2939, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2869, "context_end_frame": 2988, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00146_ctx2869_2988_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00146_ctx2869_2988_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 146, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Lift gooseneck kettle", "subtask": "Lift gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "water bottle", "white cup"], "contact": "yes", "transition": "no", "next_action": "Lift gooseneck kettle", "evidence_window": {"start_frame": 2920, "end_frame": 2939}}, "label": "Lift gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00146_ctx2869_2988_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00146_ctx2869_2988_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2920-2939\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Lift gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2939, \"start_frame\": 2920}, \"next_action\": \"Lift gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"water bottle\", \"white cup\"], \"subtask\": \"Lift gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:147", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2940, "end_frame": 2959, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2889, "context_end_frame": 3008, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00147_ctx2889_3008_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00147_ctx2889_3008_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 147, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Lift gooseneck kettle", "subtask": "Lift gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "water bottle", "white cup"], "contact": "yes", "transition": "no", "next_action": "Lift gooseneck kettle", "evidence_window": {"start_frame": 2940, "end_frame": 2959}}, "label": "Lift gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00147_ctx2889_3008_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00147_ctx2889_3008_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2940-2959\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Lift gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2959, \"start_frame\": 2940}, \"next_action\": \"Lift gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"water bottle\", \"white cup\"], \"subtask\": \"Lift gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:148", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2960, "end_frame": 2979, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2909, "context_end_frame": 3028, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00148_ctx2909_3028_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00148_ctx2909_3028_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 148, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Lift gooseneck kettle", "subtask": "Lift gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "scale"], "contact": "yes", "transition": "no", "next_action": "Lift gooseneck kettle", "evidence_window": {"start_frame": 2960, "end_frame": 2979}}, "label": "Lift gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00148_ctx2909_3028_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00148_ctx2909_3028_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2960-2979\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Lift gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2979, \"start_frame\": 2960}, \"next_action\": \"Lift gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\"], \"subtask\": \"Lift gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:149", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 2980, "end_frame": 2999, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2929, "context_end_frame": 3048, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00149_ctx2929_3048_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00149_ctx2929_3048_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 149, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Lift gooseneck kettle", "subtask": "Lift gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "scale"], "contact": "yes", "transition": "no", "next_action": "Lift gooseneck kettle", "evidence_window": {"start_frame": 2980, "end_frame": 2999}}, "label": "Lift gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00149_ctx2929_3048_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00149_ctx2929_3048_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 2980-2999\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Lift gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 2999, \"start_frame\": 2980}, \"next_action\": \"Lift gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\"], \"subtask\": \"Lift gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:150", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3000, "end_frame": 3019, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2949, "context_end_frame": 3068, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00150_ctx2949_3068_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00150_ctx2949_3068_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 150, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Lift gooseneck kettle", "subtask": "Lift gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "digital scale", "white mug"], "contact": "yes", "transition": "no", "next_action": "Lift gooseneck kettle", "evidence_window": {"start_frame": 3000, "end_frame": 3019}}, "label": "Lift gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00150_ctx2949_3068_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00150_ctx2949_3068_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3000-3019\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Lift gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3019, \"start_frame\": 3000}, \"next_action\": \"Lift gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"digital scale\", \"white mug\"], \"subtask\": \"Lift gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:151", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3020, "end_frame": 3039, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2969, "context_end_frame": 3088, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00151_ctx2969_3088_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00151_ctx2969_3088_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 151, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Lift gooseneck kettle", "subtask": "Lift gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "digital scale", "white mug"], "contact": "yes", "transition": "no", "next_action": "Lift gooseneck kettle", "evidence_window": {"start_frame": 3020, "end_frame": 3039}}, "label": "Lift gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00151_ctx2969_3088_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00151_ctx2969_3088_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3020-3039\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Lift gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3039, \"start_frame\": 3020}, \"next_action\": \"Lift gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"digital scale\", \"white mug\"], \"subtask\": \"Lift gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:152", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3040, "end_frame": 3059, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 2989, "context_end_frame": 3108, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00152_ctx2989_3108_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00152_ctx2989_3108_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 152, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Lift gooseneck kettle", "subtask": "Lift gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "digital scale", "white mug"], "contact": "yes", "transition": "no", "next_action": "Lift gooseneck kettle", "evidence_window": {"start_frame": 3040, "end_frame": 3059}}, "label": "Lift gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00152_ctx2989_3108_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00152_ctx2989_3108_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3040-3059\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Lift gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3059, \"start_frame\": 3040}, \"next_action\": \"Lift gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"digital scale\", \"white mug\"], \"subtask\": \"Lift gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:153", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3060, "end_frame": 3079, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3009, "context_end_frame": 3128, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00153_ctx3009_3128_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00153_ctx3009_3128_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 153, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Lift gooseneck kettle", "subtask": "Lift gooseneck kettle", "objects": ["gooseneck kettle", "coffee dripper", "digital scale", "white mug"], "contact": "yes", "transition": "no", "next_action": "Lift gooseneck kettle", "evidence_window": {"start_frame": 3060, "end_frame": 3079}}, "label": "Lift gooseneck kettle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00153_ctx3009_3128_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00153_ctx3009_3128_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3060-3079\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Lift gooseneck kettle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3079, \"start_frame\": 3060}, \"next_action\": \"Lift gooseneck kettle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"digital scale\", \"white mug\"], \"subtask\": \"Lift gooseneck kettle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:170", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3400, "end_frame": 3419, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3349, "context_end_frame": 3468, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00170_ctx3349_3468_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00170_ctx3349_3468_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 170, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle away", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Move kettle away", "evidence_window": {"start_frame": 3400, "end_frame": 3419}}, "label": "Move kettle away", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00170_ctx3349_3468_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00170_ctx3349_3468_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3400-3419\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle away\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3419, \"start_frame\": 3400}, \"next_action\": \"Move kettle away\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:171", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3420, "end_frame": 3439, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3369, "context_end_frame": 3488, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00171_ctx3369_3488_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00171_ctx3369_3488_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 171, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle away", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Move kettle away", "evidence_window": {"start_frame": 3420, "end_frame": 3439}}, "label": "Move kettle away", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00171_ctx3369_3488_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00171_ctx3369_3488_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3420-3439\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle away\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3439, \"start_frame\": 3420}, \"next_action\": \"Move kettle away\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:172", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3440, "end_frame": 3459, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3389, "context_end_frame": 3508, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00172_ctx3389_3508_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00172_ctx3389_3508_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 172, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle away", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Move kettle away", "evidence_window": {"start_frame": 3440, "end_frame": 3459}}, "label": "Move kettle away", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00172_ctx3389_3508_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00172_ctx3389_3508_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3440-3459\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle away\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3459, \"start_frame\": 3440}, \"next_action\": \"Move kettle away\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:173", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3460, "end_frame": 3479, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3409, "context_end_frame": 3528, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00173_ctx3409_3528_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00173_ctx3409_3528_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 173, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle away", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Move kettle away", "evidence_window": {"start_frame": 3460, "end_frame": 3479}}, "label": "Move kettle away", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00173_ctx3409_3528_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00173_ctx3409_3528_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3460-3479\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle away\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3479, \"start_frame\": 3460}, \"next_action\": \"Move kettle away\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:174", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3480, "end_frame": 3499, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3429, "context_end_frame": 3548, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00174_ctx3429_3548_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00174_ctx3429_3548_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 174, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle away", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Move kettle away", "evidence_window": {"start_frame": 3480, "end_frame": 3499}}, "label": "Move kettle away", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00174_ctx3429_3548_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00174_ctx3429_3548_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3480-3499\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle away\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3499, \"start_frame\": 3480}, \"next_action\": \"Move kettle away\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:175", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3500, "end_frame": 3519, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3449, "context_end_frame": 3568, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00175_ctx3449_3568_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00175_ctx3449_3568_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 175, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle away", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Move kettle away", "evidence_window": {"start_frame": 3500, "end_frame": 3519}}, "label": "Move kettle away", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00175_ctx3449_3568_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00175_ctx3449_3568_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3500-3519\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle away\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3519, \"start_frame\": 3500}, \"next_action\": \"Move kettle away\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:176", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3520, "end_frame": 3539, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3469, "context_end_frame": 3588, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00176_ctx3469_3588_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00176_ctx3469_3588_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 176, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle away", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Move kettle away", "evidence_window": {"start_frame": 3520, "end_frame": 3539}}, "label": "Move kettle away", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00176_ctx3469_3588_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00176_ctx3469_3588_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3520-3539\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle away\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3539, \"start_frame\": 3520}, \"next_action\": \"Move kettle away\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:177", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3540, "end_frame": 3559, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3489, "context_end_frame": 3608, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00177_ctx3489_3608_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00177_ctx3489_3608_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 177, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Move kettle away", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Place kettle on table", "evidence_window": {"start_frame": 3540, "end_frame": 3559}}, "label": "Move kettle away", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00177_ctx3489_3608_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00177_ctx3489_3608_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3540-3559\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Move kettle away\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3559, \"start_frame\": 3540}, \"next_action\": \"Place kettle on table\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:178", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3560, "end_frame": 3579, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3509, "context_end_frame": 3628, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00178_ctx3509_3628_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00178_ctx3509_3628_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 178, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Place kettle on table", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Place kettle on table", "evidence_window": {"start_frame": 3560, "end_frame": 3579}}, "label": "Place kettle on table", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00178_ctx3509_3628_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00178_ctx3509_3628_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3560-3579\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Place kettle on table\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3579, \"start_frame\": 3560}, \"next_action\": \"Place kettle on table\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:179", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3580, "end_frame": 3599, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3529, "context_end_frame": 3648, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00179_ctx3529_3648_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00179_ctx3529_3648_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 179, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Place kettle on table", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Place kettle on table", "evidence_window": {"start_frame": 3580, "end_frame": 3599}}, "label": "Place kettle on table", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00179_ctx3529_3648_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00179_ctx3529_3648_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3580-3599\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Place kettle on table\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3599, \"start_frame\": 3580}, \"next_action\": \"Place kettle on table\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:180", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3600, "end_frame": 3619, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3549, "context_end_frame": 3668, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00180_ctx3549_3668_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00180_ctx3549_3668_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 180, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Place kettle on table", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Place kettle on table", "evidence_window": {"start_frame": 3600, "end_frame": 3619}}, "label": "Place kettle on table", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00180_ctx3549_3668_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00180_ctx3549_3668_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3600-3619\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Place kettle on table\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3619, \"start_frame\": 3600}, \"next_action\": \"Place kettle on table\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:181", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3620, "end_frame": 3639, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3569, "context_end_frame": 3688, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00181_ctx3569_3688_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00181_ctx3569_3688_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 181, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Place kettle on table", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Place kettle on table", "evidence_window": {"start_frame": 3620, "end_frame": 3639}}, "label": "Place kettle on table", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00181_ctx3569_3688_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00181_ctx3569_3688_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3620-3639\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Place kettle on table\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3639, \"start_frame\": 3620}, \"next_action\": \"Place kettle on table\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:182", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3640, "end_frame": 3659, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3589, "context_end_frame": 3708, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00182_ctx3589_3708_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00182_ctx3589_3708_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 182, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Place kettle on table", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Place kettle on table", "evidence_window": {"start_frame": 3640, "end_frame": 3659}}, "label": "Place kettle on table", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00182_ctx3589_3708_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00182_ctx3589_3708_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3640-3659\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Place kettle on table\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3659, \"start_frame\": 3640}, \"next_action\": \"Place kettle on table\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:183", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3660, "end_frame": 3679, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3609, "context_end_frame": 3728, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00183_ctx3609_3728_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00183_ctx3609_3728_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 183, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Place kettle on table", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Pick up white bottle", "evidence_window": {"start_frame": 3660, "end_frame": 3679}}, "label": "Place kettle on table", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00183_ctx3609_3728_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00183_ctx3609_3728_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3660-3679\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Place kettle on table\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3679, \"start_frame\": 3660}, \"next_action\": \"Pick up white bottle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:184", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3680, "end_frame": 3699, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3629, "context_end_frame": 3748, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00184_ctx3629_3748_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00184_ctx3629_3748_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 184, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pick up white bottle", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Pick up white bottle", "evidence_window": {"start_frame": 3680, "end_frame": 3699}}, "label": "Pick up white bottle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00184_ctx3629_3748_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00184_ctx3629_3748_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3680-3699\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pick up white bottle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3699, \"start_frame\": 3680}, \"next_action\": \"Pick up white bottle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:185", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3700, "end_frame": 3719, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3649, "context_end_frame": 3768, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00185_ctx3649_3768_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00185_ctx3649_3768_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 185, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pick up white bottle", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Pick up white bottle", "evidence_window": {"start_frame": 3700, "end_frame": 3719}}, "label": "Pick up white bottle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00185_ctx3649_3768_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00185_ctx3649_3768_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3700-3719\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pick up white bottle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3719, \"start_frame\": 3700}, \"next_action\": \"Pick up white bottle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:186", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3720, "end_frame": 3739, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3669, "context_end_frame": 3788, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00186_ctx3669_3788_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00186_ctx3669_3788_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 186, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pick up white bottle", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Pick up white bottle", "evidence_window": {"start_frame": 3720, "end_frame": 3739}}, "label": "Pick up white bottle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00186_ctx3669_3788_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00186_ctx3669_3788_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3720-3739\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pick up white bottle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3739, \"start_frame\": 3720}, \"next_action\": \"Pick up white bottle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:187", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3740, "end_frame": 3759, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3689, "context_end_frame": 3808, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00187_ctx3689_3808_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00187_ctx3689_3808_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 187, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pick up white bottle", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Pick up white bottle", "evidence_window": {"start_frame": 3740, "end_frame": 3759}}, "label": "Pick up white bottle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00187_ctx3689_3808_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00187_ctx3689_3808_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3740-3759\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pick up white bottle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3759, \"start_frame\": 3740}, \"next_action\": \"Pick up white bottle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:188", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3760, "end_frame": 3779, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3709, "context_end_frame": 3828, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00188_ctx3709_3828_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00188_ctx3709_3828_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 188, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pick up white bottle", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Pick up white bottle", "evidence_window": {"start_frame": 3760, "end_frame": 3779}}, "label": "Pick up white bottle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00188_ctx3709_3828_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00188_ctx3709_3828_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3760-3779\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pick up white bottle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3779, \"start_frame\": 3760}, \"next_action\": \"Pick up white bottle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:189", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3780, "end_frame": 3799, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3729, "context_end_frame": 3848, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00189_ctx3729_3848_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00189_ctx3729_3848_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 189, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pick up white bottle", "subtask": "Set down kettle and retrieve white bottle", "objects": ["gooseneck kettle", "coffee dripper", "scale", "white bottle"], "contact": "yes", "transition": "no", "next_action": "Pour liquid from white bottle", "evidence_window": {"start_frame": 3780, "end_frame": 3799}}, "label": "Pick up white bottle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00189_ctx3729_3848_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00189_ctx3729_3848_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3780-3799\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pick up white bottle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3799, \"start_frame\": 3780}, \"next_action\": \"Pour liquid from white bottle\", \"objects\": [\"gooseneck kettle\", \"coffee dripper\", \"scale\", \"white bottle\"], \"subtask\": \"Set down kettle and retrieve white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:190", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3800, "end_frame": 3819, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3749, "context_end_frame": 3868, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00190_ctx3749_3868_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00190_ctx3749_3868_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 190, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pour liquid from white bottle", "subtask": "Pour and close white bottle", "objects": ["white bottle", "coffee dripper", "scale"], "contact": "yes", "transition": "no", "next_action": "Pour liquid from white bottle", "evidence_window": {"start_frame": 3800, "end_frame": 3819}}, "label": "Pour liquid from white bottle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00190_ctx3749_3868_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00190_ctx3749_3868_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3800-3819\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pour liquid from white bottle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3819, \"start_frame\": 3800}, \"next_action\": \"Pour liquid from white bottle\", \"objects\": [\"white bottle\", \"coffee dripper\", \"scale\"], \"subtask\": \"Pour and close white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:191", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3820, "end_frame": 3839, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3769, "context_end_frame": 3888, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00191_ctx3769_3888_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00191_ctx3769_3888_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 191, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pour liquid from white bottle", "subtask": "Pour and close white bottle", "objects": ["white bottle", "coffee dripper", "scale"], "contact": "yes", "transition": "no", "next_action": "Pour liquid from white bottle", "evidence_window": {"start_frame": 3820, "end_frame": 3839}}, "label": "Pour liquid from white bottle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00191_ctx3769_3888_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00191_ctx3769_3888_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3820-3839\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pour liquid from white bottle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3839, \"start_frame\": 3820}, \"next_action\": \"Pour liquid from white bottle\", \"objects\": [\"white bottle\", \"coffee dripper\", \"scale\"], \"subtask\": \"Pour and close white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:192", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3840, "end_frame": 3859, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3789, "context_end_frame": 3908, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00192_ctx3789_3908_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00192_ctx3789_3908_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 192, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pour liquid from white bottle", "subtask": "Pour and close white bottle", "objects": ["white bottle", "coffee dripper", "scale"], "contact": "yes", "transition": "no", "next_action": "Pour liquid from white bottle", "evidence_window": {"start_frame": 3840, "end_frame": 3859}}, "label": "Pour liquid from white bottle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00192_ctx3789_3908_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00192_ctx3789_3908_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3840-3859\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pour liquid from white bottle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3859, \"start_frame\": 3840}, \"next_action\": \"Pour liquid from white bottle\", \"objects\": [\"white bottle\", \"coffee dripper\", \"scale\"], \"subtask\": \"Pour and close white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:193", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3860, "end_frame": 3879, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3809, "context_end_frame": 3928, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00193_ctx3809_3928_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00193_ctx3809_3928_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 193, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pour liquid from white bottle", "subtask": "Pour and close white bottle", "objects": ["white bottle", "coffee dripper", "scale"], "contact": "yes", "transition": "no", "next_action": "Pour liquid from white bottle", "evidence_window": {"start_frame": 3860, "end_frame": 3879}}, "label": "Pour liquid from white bottle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00193_ctx3809_3928_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00193_ctx3809_3928_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3860-3879\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pour liquid from white bottle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3879, \"start_frame\": 3860}, \"next_action\": \"Pour liquid from white bottle\", \"objects\": [\"white bottle\", \"coffee dripper\", \"scale\"], \"subtask\": \"Pour and close white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:194", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3880, "end_frame": 3899, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3829, "context_end_frame": 3948, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00194_ctx3829_3948_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00194_ctx3829_3948_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 194, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pour liquid from white bottle", "subtask": "Pour and close white bottle", "objects": ["white bottle", "coffee dripper", "scale"], "contact": "yes", "transition": "no", "next_action": "Pour liquid from white bottle", "evidence_window": {"start_frame": 3880, "end_frame": 3899}}, "label": "Pour liquid from white bottle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00194_ctx3829_3948_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00194_ctx3829_3948_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3880-3899\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pour liquid from white bottle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3899, \"start_frame\": 3880}, \"next_action\": \"Pour liquid from white bottle\", \"objects\": [\"white bottle\", \"coffee dripper\", \"scale\"], \"subtask\": \"Pour and close white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:195", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3900, "end_frame": 3919, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3849, "context_end_frame": 3968, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00195_ctx3849_3968_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00195_ctx3849_3968_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 195, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pour liquid from white bottle", "subtask": "Pour and close white bottle", "objects": ["white bottle", "coffee dripper", "scale"], "contact": "yes", "transition": "no", "next_action": "Close bottle cap", "evidence_window": {"start_frame": 3900, "end_frame": 3919}}, "label": "Pour liquid from white bottle", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00195_ctx3849_3968_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00195_ctx3849_3968_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3900-3919\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pour liquid from white bottle\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3919, \"start_frame\": 3900}, \"next_action\": \"Close bottle cap\", \"objects\": [\"white bottle\", \"coffee dripper\", \"scale\"], \"subtask\": \"Pour and close white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:196", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3920, "end_frame": 3939, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3869, "context_end_frame": 3988, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00196_ctx3869_3988_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00196_ctx3869_3988_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 196, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Close bottle cap", "subtask": "Pour and close white bottle", "objects": ["white bottle"], "contact": "yes", "transition": "no", "next_action": "Close bottle cap", "evidence_window": {"start_frame": 3920, "end_frame": 3939}}, "label": "Close bottle cap", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00196_ctx3869_3988_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00196_ctx3869_3988_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3920-3939\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Close bottle cap\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3939, \"start_frame\": 3920}, \"next_action\": \"Close bottle cap\", \"objects\": [\"white bottle\"], \"subtask\": \"Pour and close white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:197", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3940, "end_frame": 3959, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3889, "context_end_frame": 4008, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00197_ctx3889_4008_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00197_ctx3889_4008_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 197, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Close bottle cap", "subtask": "Pour and close white bottle", "objects": ["white bottle"], "contact": "yes", "transition": "no", "next_action": "Close bottle cap", "evidence_window": {"start_frame": 3940, "end_frame": 3959}}, "label": "Close bottle cap", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00197_ctx3889_4008_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00197_ctx3889_4008_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3940-3959\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Close bottle cap\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3959, \"start_frame\": 3940}, \"next_action\": \"Close bottle cap\", \"objects\": [\"white bottle\"], \"subtask\": \"Pour and close white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:198", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3960, "end_frame": 3979, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3909, "context_end_frame": 4028, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00198_ctx3909_4028_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00198_ctx3909_4028_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 198, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Close bottle cap", "subtask": "Pour and close white bottle", "objects": ["white bottle"], "contact": "yes", "transition": "no", "next_action": "Close bottle cap", "evidence_window": {"start_frame": 3960, "end_frame": 3979}}, "label": "Close bottle cap", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00198_ctx3909_4028_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00198_ctx3909_4028_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3960-3979\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Close bottle cap\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3979, \"start_frame\": 3960}, \"next_action\": \"Close bottle cap\", \"objects\": [\"white bottle\"], \"subtask\": \"Pour and close white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:199", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 3980, "end_frame": 3999, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3929, "context_end_frame": 4048, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00199_ctx3929_4048_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00199_ctx3929_4048_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 199, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Close bottle cap", "subtask": "Pour and close white bottle", "objects": ["white bottle"], "contact": "yes", "transition": "no", "next_action": "Close bottle cap", "evidence_window": {"start_frame": 3980, "end_frame": 3999}}, "label": "Close bottle cap", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00199_ctx3929_4048_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00199_ctx3929_4048_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 3980-3999\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Close bottle cap\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 3999, \"start_frame\": 3980}, \"next_action\": \"Close bottle cap\", \"objects\": [\"white bottle\"], \"subtask\": \"Pour and close white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:200", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4000, "end_frame": 4019, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3949, "context_end_frame": 4068, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00200_ctx3949_4068_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00200_ctx3949_4068_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 200, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Close bottle cap", "subtask": "Pour and close white bottle", "objects": ["white bottle", "coffee dripper", "scale", "mug"], "contact": "yes", "transition": "no", "next_action": "Close bottle cap", "evidence_window": {"start_frame": 4000, "end_frame": 4019}}, "label": "Close bottle cap", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00200_ctx3949_4068_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00200_ctx3949_4068_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4000-4019\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Close bottle cap\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4019, \"start_frame\": 4000}, \"next_action\": \"Close bottle cap\", \"objects\": [\"white bottle\", \"coffee dripper\", \"scale\", \"mug\"], \"subtask\": \"Pour and close white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:201", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4020, "end_frame": 4039, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3969, "context_end_frame": 4088, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00201_ctx3969_4088_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00201_ctx3969_4088_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 201, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Close bottle cap", "subtask": "Pour and close white bottle", "objects": ["white bottle", "coffee dripper", "scale", "mug"], "contact": "yes", "transition": "no", "next_action": "Close bottle cap", "evidence_window": {"start_frame": 4020, "end_frame": 4039}}, "label": "Close bottle cap", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00201_ctx3969_4088_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00201_ctx3969_4088_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4020-4039\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Close bottle cap\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4039, \"start_frame\": 4020}, \"next_action\": \"Close bottle cap\", \"objects\": [\"white bottle\", \"coffee dripper\", \"scale\", \"mug\"], \"subtask\": \"Pour and close white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:202", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4040, "end_frame": 4059, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 3989, "context_end_frame": 4108, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00202_ctx3989_4108_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00202_ctx3989_4108_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 202, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Close bottle cap", "subtask": "Pour and close white bottle", "objects": ["white bottle", "coffee dripper", "scale", "mug"], "contact": "yes", "transition": "no", "next_action": "Close bottle cap", "evidence_window": {"start_frame": 4040, "end_frame": 4059}}, "label": "Close bottle cap", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00202_ctx3989_4108_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00202_ctx3989_4108_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4040-4059\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Close bottle cap\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4059, \"start_frame\": 4040}, \"next_action\": \"Close bottle cap\", \"objects\": [\"white bottle\", \"coffee dripper\", \"scale\", \"mug\"], \"subtask\": \"Pour and close white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:203", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4060, "end_frame": 4079, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4009, "context_end_frame": 4128, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00203_ctx4009_4128_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00203_ctx4009_4128_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 203, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Close bottle cap", "subtask": "Pour and close white bottle", "objects": ["white bottle", "coffee dripper", "scale", "mug"], "contact": "yes", "transition": "no", "next_action": "Close bottle cap", "evidence_window": {"start_frame": 4060, "end_frame": 4079}}, "label": "Close bottle cap", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00203_ctx4009_4128_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00203_ctx4009_4128_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4060-4079\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Close bottle cap\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4079, \"start_frame\": 4060}, \"next_action\": \"Close bottle cap\", \"objects\": [\"white bottle\", \"coffee dripper\", \"scale\", \"mug\"], \"subtask\": \"Pour and close white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:204", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4080, "end_frame": 4099, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4029, "context_end_frame": 4148, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00204_ctx4029_4148_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00204_ctx4029_4148_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 204, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Close bottle cap", "subtask": "Pour and close white bottle", "objects": ["white bottle", "coffee dripper", "scale", "mug"], "contact": "yes", "transition": "no", "next_action": "Close bottle cap", "evidence_window": {"start_frame": 4080, "end_frame": 4099}}, "label": "Close bottle cap", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00204_ctx4029_4148_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00204_ctx4029_4148_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4080-4099\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Close bottle cap\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4099, \"start_frame\": 4080}, \"next_action\": \"Close bottle cap\", \"objects\": [\"white bottle\", \"coffee dripper\", \"scale\", \"mug\"], \"subtask\": \"Pour and close white bottle\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:220", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4400, "end_frame": 4419, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4349, "context_end_frame": 4468, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00220_ctx4349_4468_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00220_ctx4349_4468_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 220, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Place item on table", "subtask": "Prepare for pouring", "objects": ["coffee equipment", "small bottle", "white mug", "weighing scale"], "contact": "yes", "transition": "no", "next_action": "Place item on table", "evidence_window": {"start_frame": 4400, "end_frame": 4419}}, "label": "Place item on table", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00220_ctx4349_4468_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00220_ctx4349_4468_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4400-4419\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Place item on table\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4419, \"start_frame\": 4400}, \"next_action\": \"Place item on table\", \"objects\": [\"coffee equipment\", \"small bottle\", \"white mug\", \"weighing scale\"], \"subtask\": \"Prepare for pouring\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:221", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4420, "end_frame": 4439, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4369, "context_end_frame": 4488, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00221_ctx4369_4488_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00221_ctx4369_4488_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 221, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Place item on table", "subtask": "Prepare for pouring", "objects": ["coffee equipment", "small bottle", "white mug", "weighing scale"], "contact": "yes", "transition": "no", "next_action": "Place item on table", "evidence_window": {"start_frame": 4420, "end_frame": 4439}}, "label": "Place item on table", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00221_ctx4369_4488_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00221_ctx4369_4488_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4420-4439\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Place item on table\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4439, \"start_frame\": 4420}, \"next_action\": \"Place item on table\", \"objects\": [\"coffee equipment\", \"small bottle\", \"white mug\", \"weighing scale\"], \"subtask\": \"Prepare for pouring\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:222", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4440, "end_frame": 4459, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4389, "context_end_frame": 4508, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00222_ctx4389_4508_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00222_ctx4389_4508_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 222, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Place item on table", "subtask": "Prepare for pouring", "objects": ["coffee equipment", "small bottle", "white mug", "weighing scale"], "contact": "yes", "transition": "no", "next_action": "Place item on table", "evidence_window": {"start_frame": 4440, "end_frame": 4459}}, "label": "Place item on table", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00222_ctx4389_4508_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00222_ctx4389_4508_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4440-4459\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Place item on table\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4459, \"start_frame\": 4440}, \"next_action\": \"Place item on table\", \"objects\": [\"coffee equipment\", \"small bottle\", \"white mug\", \"weighing scale\"], \"subtask\": \"Prepare for pouring\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:223", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4460, "end_frame": 4479, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4409, "context_end_frame": 4528, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00223_ctx4409_4528_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00223_ctx4409_4528_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 223, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Place item on table", "subtask": "Prepare for pouring", "objects": ["coffee equipment", "small bottle", "white mug", "weighing scale"], "contact": "yes", "transition": "no", "next_action": "Place item on table", "evidence_window": {"start_frame": 4460, "end_frame": 4479}}, "label": "Place item on table", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00223_ctx4409_4528_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00223_ctx4409_4528_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4460-4479\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Place item on table\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4479, \"start_frame\": 4460}, \"next_action\": \"Place item on table\", \"objects\": [\"coffee equipment\", \"small bottle\", \"white mug\", \"weighing scale\"], \"subtask\": \"Prepare for pouring\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:224", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4480, "end_frame": 4499, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4429, "context_end_frame": 4548, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00224_ctx4429_4548_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00224_ctx4429_4548_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 224, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Place item on table", "subtask": "Prepare for pouring", "objects": ["coffee equipment", "small bottle", "white mug", "weighing scale"], "contact": "yes", "transition": "no", "next_action": "Place item on table", "evidence_window": {"start_frame": 4480, "end_frame": 4499}}, "label": "Place item on table", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00224_ctx4429_4548_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00224_ctx4429_4548_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4480-4499\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Place item on table\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4499, \"start_frame\": 4480}, \"next_action\": \"Place item on table\", \"objects\": [\"coffee equipment\", \"small bottle\", \"white mug\", \"weighing scale\"], \"subtask\": \"Prepare for pouring\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:225", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4500, "end_frame": 4519, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4449, "context_end_frame": 4568, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00225_ctx4449_4568_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00225_ctx4449_4568_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 225, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Place item on table", "subtask": "Prepare for pouring", "objects": ["coffee equipment", "small bottle", "white mug", "weighing scale"], "contact": "yes", "transition": "no", "next_action": "Wait/Prepare for pouring", "evidence_window": {"start_frame": 4500, "end_frame": 4519}}, "label": "Place item on table", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00225_ctx4449_4568_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00225_ctx4449_4568_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4500-4519\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Place item on table\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4519, \"start_frame\": 4500}, \"next_action\": \"Wait/Prepare for pouring\", \"objects\": [\"coffee equipment\", \"small bottle\", \"white mug\", \"weighing scale\"], \"subtask\": \"Prepare for pouring\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:226", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4520, "end_frame": 4539, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4469, "context_end_frame": 4588, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00226_ctx4469_4588_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00226_ctx4469_4588_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 226, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Wait/Prepare for pouring", "subtask": "Prepare for pouring", "objects": ["coffee equipment", "small bottle", "white mug", "weighing scale"], "contact": "yes", "transition": "no", "next_action": "Wait/Prepare for pouring", "evidence_window": {"start_frame": 4520, "end_frame": 4539}}, "label": "Wait/Prepare for pouring", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00226_ctx4469_4588_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00226_ctx4469_4588_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4520-4539\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Wait/Prepare for pouring\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4539, \"start_frame\": 4520}, \"next_action\": \"Wait/Prepare for pouring\", \"objects\": [\"coffee equipment\", \"small bottle\", \"white mug\", \"weighing scale\"], \"subtask\": \"Prepare for pouring\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:227", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4540, "end_frame": 4559, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4489, "context_end_frame": 4608, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00227_ctx4489_4608_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00227_ctx4489_4608_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 227, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Wait/Prepare for pouring", "subtask": "Prepare for pouring", "objects": ["coffee equipment", "small bottle", "white mug", "weighing scale"], "contact": "yes", "transition": "no", "next_action": "Wait/Prepare for pouring", "evidence_window": {"start_frame": 4540, "end_frame": 4559}}, "label": "Wait/Prepare for pouring", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00227_ctx4489_4608_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00227_ctx4489_4608_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4540-4559\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Wait/Prepare for pouring\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4559, \"start_frame\": 4540}, \"next_action\": \"Wait/Prepare for pouring\", \"objects\": [\"coffee equipment\", \"small bottle\", \"white mug\", \"weighing scale\"], \"subtask\": \"Prepare for pouring\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:228", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4560, "end_frame": 4579, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4509, "context_end_frame": 4628, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00228_ctx4509_4628_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00228_ctx4509_4628_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 228, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Wait/Prepare for pouring", "subtask": "Prepare for pouring", "objects": ["coffee equipment", "small bottle", "white mug", "weighing scale"], "contact": "yes", "transition": "no", "next_action": "Wait/Prepare for pouring", "evidence_window": {"start_frame": 4560, "end_frame": 4579}}, "label": "Wait/Prepare for pouring", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00228_ctx4509_4628_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00228_ctx4509_4628_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4560-4579\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Wait/Prepare for pouring\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4579, \"start_frame\": 4560}, \"next_action\": \"Wait/Prepare for pouring\", \"objects\": [\"coffee equipment\", \"small bottle\", \"white mug\", \"weighing scale\"], \"subtask\": \"Prepare for pouring\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:229", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4580, "end_frame": 4599, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4529, "context_end_frame": 4648, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00229_ctx4529_4648_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00229_ctx4529_4648_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 229, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Wait/Prepare for pouring", "subtask": "Prepare for pouring", "objects": ["coffee equipment", "small bottle", "white mug", "weighing scale"], "contact": "yes", "transition": "no", "next_action": "Wait/Prepare for pouring", "evidence_window": {"start_frame": 4580, "end_frame": 4599}}, "label": "Wait/Prepare for pouring", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00229_ctx4529_4648_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00229_ctx4529_4648_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4580-4599\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Wait/Prepare for pouring\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4599, \"start_frame\": 4580}, \"next_action\": \"Wait/Prepare for pouring\", \"objects\": [\"coffee equipment\", \"small bottle\", \"white mug\", \"weighing scale\"], \"subtask\": \"Prepare for pouring\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:230", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4600, "end_frame": 4619, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4549, "context_end_frame": 4668, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00230_ctx4549_4668_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00230_ctx4549_4668_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 230, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Wait/Prepare for pouring", "subtask": "Prepare for pouring", "objects": ["coffee equipment", "small bottle", "white mug", "weighing scale"], "contact": "yes", "transition": "no", "next_action": "Wait/Prepare for pouring", "evidence_window": {"start_frame": 4600, "end_frame": 4619}}, "label": "Wait/Prepare for pouring", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00230_ctx4549_4668_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00230_ctx4549_4668_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4600-4619\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Wait/Prepare for pouring\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4619, \"start_frame\": 4600}, \"next_action\": \"Wait/Prepare for pouring\", \"objects\": [\"coffee equipment\", \"small bottle\", \"white mug\", \"weighing scale\"], \"subtask\": \"Prepare for pouring\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:231", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4620, "end_frame": 4639, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4569, "context_end_frame": 4688, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00231_ctx4569_4688_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00231_ctx4569_4688_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 231, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Wait/Prepare for pouring", "subtask": "Prepare for pouring", "objects": ["coffee equipment", "small bottle", "white mug", "weighing scale"], "contact": "yes", "transition": "no", "next_action": "Wait/Prepare for pouring", "evidence_window": {"start_frame": 4620, "end_frame": 4639}}, "label": "Wait/Prepare for pouring", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00231_ctx4569_4688_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00231_ctx4569_4688_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4620-4639\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Wait/Prepare for pouring\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4639, \"start_frame\": 4620}, \"next_action\": \"Wait/Prepare for pouring\", \"objects\": [\"coffee equipment\", \"small bottle\", \"white mug\", \"weighing scale\"], \"subtask\": \"Prepare for pouring\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:232", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4640, "end_frame": 4659, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4589, "context_end_frame": 4708, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00232_ctx4589_4708_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00232_ctx4589_4708_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 232, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Wait/Prepare for pouring", "subtask": "Prepare for pouring", "objects": ["coffee equipment", "small bottle", "white mug", "weighing scale"], "contact": "yes", "transition": "no", "next_action": "Wait/Prepare for pouring", "evidence_window": {"start_frame": 4640, "end_frame": 4659}}, "label": "Wait/Prepare for pouring", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00232_ctx4589_4708_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00232_ctx4589_4708_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4640-4659\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Wait/Prepare for pouring\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4659, \"start_frame\": 4640}, \"next_action\": \"Wait/Prepare for pouring\", \"objects\": [\"coffee equipment\", \"small bottle\", \"white mug\", \"weighing scale\"], \"subtask\": \"Prepare for pouring\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:233", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 4660, "end_frame": 4679, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4609, "context_end_frame": 4728, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00233_ctx4609_4728_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00233_ctx4609_4728_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 233, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Wait/Prepare for pouring", "subtask": "Prepare for pouring", "objects": ["coffee equipment", "small bottle", "white mug", "weighing scale"], "contact": "yes", "transition": "no", "next_action": "Wait/Prepare for pouring", "evidence_window": {"start_frame": 4660, "end_frame": 4679}}, "label": "Wait/Prepare for pouring", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00233_ctx4609_4728_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00233_ctx4609_4728_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 4660-4679\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Wait/Prepare for pouring\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 4679, \"start_frame\": 4660}, \"next_action\": \"Wait/Prepare for pouring\", \"objects\": [\"coffee equipment\", \"small bottle\", \"white mug\", \"weighing scale\"], \"subtask\": \"Prepare for pouring\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:250", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 5000, "end_frame": 5019, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4949, "context_end_frame": 5068, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00250_ctx4949_5068_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00250_ctx4949_5068_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 250, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pour coffee", "subtask": "Pour coffee", "objects": ["carafe", "coffee mug", "scale"], "contact": "yes", "transition": "no", "next_action": "Pour coffee", "evidence_window": {"start_frame": 5000, "end_frame": 5019}}, "label": "Pour coffee", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00250_ctx4949_5068_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00250_ctx4949_5068_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 5000-5019\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pour coffee\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 5019, \"start_frame\": 5000}, \"next_action\": \"Pour coffee\", \"objects\": [\"carafe\", \"coffee mug\", \"scale\"], \"subtask\": \"Pour coffee\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:251", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 5020, "end_frame": 5039, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4969, "context_end_frame": 5088, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00251_ctx4969_5088_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00251_ctx4969_5088_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 251, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pour coffee", "subtask": "Pour coffee", "objects": ["carafe", "coffee mug", "scale"], "contact": "yes", "transition": "no", "next_action": "Pour coffee", "evidence_window": {"start_frame": 5020, "end_frame": 5039}}, "label": "Pour coffee", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00251_ctx4969_5088_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00251_ctx4969_5088_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 5020-5039\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pour coffee\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 5039, \"start_frame\": 5020}, \"next_action\": \"Pour coffee\", \"objects\": [\"carafe\", \"coffee mug\", \"scale\"], \"subtask\": \"Pour coffee\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:252", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 5040, "end_frame": 5059, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 4989, "context_end_frame": 5108, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00252_ctx4989_5108_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00252_ctx4989_5108_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 252, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pour coffee", "subtask": "Pour coffee", "objects": ["carafe", "coffee mug", "scale"], "contact": "yes", "transition": "no", "next_action": "Pour coffee", "evidence_window": {"start_frame": 5040, "end_frame": 5059}}, "label": "Pour coffee", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00252_ctx4989_5108_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00252_ctx4989_5108_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 5040-5059\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pour coffee\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 5059, \"start_frame\": 5040}, \"next_action\": \"Pour coffee\", \"objects\": [\"carafe\", \"coffee mug\", \"scale\"], \"subtask\": \"Pour coffee\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:253", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 5060, "end_frame": 5079, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 5009, "context_end_frame": 5128, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00253_ctx5009_5128_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00253_ctx5009_5128_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 253, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pour coffee", "subtask": "Pour coffee", "objects": ["carafe", "coffee mug", "scale"], "contact": "yes", "transition": "no", "next_action": "Pour coffee", "evidence_window": {"start_frame": 5060, "end_frame": 5079}}, "label": "Pour coffee", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00253_ctx5009_5128_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00253_ctx5009_5128_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 5060-5079\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pour coffee\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 5079, \"start_frame\": 5060}, \"next_action\": \"Pour coffee\", \"objects\": [\"carafe\", \"coffee mug\", \"scale\"], \"subtask\": \"Pour coffee\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:254", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 5080, "end_frame": 5099, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 5029, "context_end_frame": 5148, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00254_ctx5029_5148_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00254_ctx5029_5148_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 254, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pour coffee", "subtask": "Pour coffee", "objects": ["carafe", "coffee mug", "scale"], "contact": "yes", "transition": "no", "next_action": "Pour coffee", "evidence_window": {"start_frame": 5080, "end_frame": 5099}}, "label": "Pour coffee", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00254_ctx5029_5148_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00254_ctx5029_5148_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 5080-5099\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pour coffee\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 5099, \"start_frame\": 5080}, \"next_action\": \"Pour coffee\", \"objects\": [\"carafe\", \"coffee mug\", \"scale\"], \"subtask\": \"Pour coffee\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:255", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 5100, "end_frame": 5119, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 5049, "context_end_frame": 5168, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00255_ctx5049_5168_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00255_ctx5049_5168_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 255, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pour coffee", "subtask": "Pour coffee", "objects": ["carafe", "coffee mug", "scale"], "contact": "yes", "transition": "no", "next_action": "Pour coffee", "evidence_window": {"start_frame": 5100, "end_frame": 5119}}, "label": "Pour coffee", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00255_ctx5049_5168_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00255_ctx5049_5168_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 5100-5119\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pour coffee\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 5119, \"start_frame\": 5100}, \"next_action\": \"Pour coffee\", \"objects\": [\"carafe\", \"coffee mug\", \"scale\"], \"subtask\": \"Pour coffee\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:256", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 5120, "end_frame": 5139, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 5069, "context_end_frame": 5188, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00256_ctx5069_5188_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00256_ctx5069_5188_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 256, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pour coffee", "subtask": "Pour coffee", "objects": ["carafe", "coffee mug", "scale"], "contact": "yes", "transition": "no", "next_action": "Pour coffee", "evidence_window": {"start_frame": 5120, "end_frame": 5139}}, "label": "Pour coffee", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00256_ctx5069_5188_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00256_ctx5069_5188_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 5120-5139\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pour coffee\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 5139, \"start_frame\": 5120}, \"next_action\": \"Pour coffee\", \"objects\": [\"carafe\", \"coffee mug\", \"scale\"], \"subtask\": \"Pour coffee\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:257", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 5140, "end_frame": 5159, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 5089, "context_end_frame": 5208, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00257_ctx5089_5208_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00257_ctx5089_5208_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 257, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pour coffee", "subtask": "Pour coffee", "objects": ["carafe", "coffee mug", "scale"], "contact": "yes", "transition": "no", "next_action": "Pour coffee", "evidence_window": {"start_frame": 5140, "end_frame": 5159}}, "label": "Pour coffee", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00257_ctx5089_5208_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00257_ctx5089_5208_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 5140-5159\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pour coffee\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 5159, \"start_frame\": 5140}, \"next_action\": \"Pour coffee\", \"objects\": [\"carafe\", \"coffee mug\", \"scale\"], \"subtask\": \"Pour coffee\", \"transition\": \"no\"}"}]} {"id": "xperience-10m-sample:qa:290", "episode_id": "xperience-10m-sample", "split": "train", "target": "episode_qa", "prompt_type": "json_episode_understanding", "center_window": {"start_frame": 5800, "end_frame": 5819, "num_frames": 20}, "media": {"video_paths": [{"name": "fisheye_cam0.mp4", "path": "/path/to/ropedia_workspace/modelscope_data/xperience-10m-sample/fisheye_cam0.mp4"}], "context_start_frame": 5701, "context_end_frame": 5820, "max_video_frames": 16, "mosaic_video_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00290_ctx5701_5820_mosaic.mp4", "audio_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00290_ctx5701_5820_audio.wav"}, "sensor_feature_path": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/sensor_features/xperience-10m-sample_sensor_features.npz", "sensor_feature_index": 290, "sensor_feature_dim": 4262, "question": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.", "answer_json": {"action": "Pour milk into coffee", "subtask": "Pour milk into coffee", "objects": ["coffee mug", "stainless steel milk pitcher", "table", "milk bottle", "digital scale"], "contact": "yes", "transition": "no", "next_action": "Pour milk into coffee", "evidence_window": {"start_frame": 5800, "end_frame": 5819}}, "label": "Pour milk into coffee", "action_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "subtask_options": ["Handle gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Pick up and position kettle", "Pour and close white bottle", "Pour coffee", "Pour milk into coffee", "Prepare coffee equipment and scoop grounds", "Prepare for pouring", "Set down kettle and retrieve white bottle", "Transfer coffee grounds to dripper"], "label_options": ["Close bottle cap", "Grasp coffee scoop", "Grasp gooseneck kettle", "Hold coffee carafe", "Hold gooseneck kettle", "Lift gooseneck kettle", "Move kettle", "Move kettle away", "Pick up kettle", "Pick up white bottle", "Place item on table", "Place kettle on table", "Position kettle to pour", "Pour coffee", "Pour liquid from white bottle", "Pour milk into coffee", "Transfer coffee to dripper", "Wait/Prepare for pouring"], "messages": [{"role": "system", "content": [{"type": "text", "text": "You are an embodied episode-understanding model for Ropedia/Xperience-10M. Answer every question as strict JSON with these keys: action, subtask, objects, contact, transition, next_action, evidence_window. Use \"unknown\" when the evidence is missing instead of guessing."}]}, {"role": "user", "content": [{"type": "video", "video": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00290_ctx5701_5820_mosaic.mp4"}, {"type": "audio", "audio": "/path/to/ropedia_workspace/ropedia-episode-task-suite/results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/media/xperience-10m-sample/xperience-10m-sample_w00290_ctx5701_5820_audio.wav"}, {"type": "text", "text": "Given the synchronized egocentric video/audio context and sensor window, identify the current embodied episode state.\nEpisode: xperience-10m-sample\nLabel window frames: 5800-5819\nReturn strict JSON only with keys: action, subtask, objects, contact, transition, next_action, evidence_window.\nUse \"unknown\" for fields that cannot be determined.\nKnown action labels:\n- Close bottle cap\n- Grasp coffee scoop\n- Grasp gooseneck kettle\n- Hold coffee carafe\n- Hold gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Move kettle away\n- Pick up kettle\n- Pick up white bottle\n- Place item on table\n- Place kettle on table\n- Position kettle to pour\n- Pour coffee\n- Pour liquid from white bottle\n- Pour milk into coffee\n- Transfer coffee to dripper\n- Wait/Prepare for pouring\nKnown subtask labels:\n- Handle gooseneck kettle\n- Lift gooseneck kettle\n- Move kettle\n- Pick up and position kettle\n- Pour and close white bottle\n- Pour coffee\n- Pour milk into coffee\n- Prepare coffee equipment and scoop grounds\n- Prepare for pouring\n- Set down kettle and retrieve white bottle\n- Transfer coffee grounds to dripper"}]}, {"role": "assistant", "content": "{\"action\": \"Pour milk into coffee\", \"contact\": \"yes\", \"evidence_window\": {\"end_frame\": 5819, \"start_frame\": 5800}, \"next_action\": \"Pour milk into coffee\", \"objects\": [\"coffee mug\", \"stainless steel milk pitcher\", \"table\", \"milk bottle\", \"digital scale\"], \"subtask\": \"Pour milk into coffee\", \"transition\": \"no\"}"}]}