Qwen3-Omni-Instruct Fine-Tuning Notes

This directory separates the Xperience-10M -> Qwen3-Omni plan into two layers:

Native Qwen3-Omni inputs: synchronized 2x3 multi-view video mosaic, aligned audio clips, and text prompts.
Ropedia-specific adapter inputs: depth, pose/SLAM, mocap, contacts, IMU, and calibration features.
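
To make the "2x3 multi-view video mosaic" concrete, here is a minimal sketch of how six camera views could be mapped to tile positions in a single frame. It assumes 2 rows by 3 columns in row-major order. The camera names, tile size, and the mosaic_layout helper are hypothetical and are not taken from the export scripts.

```python
from typing import Dict, List, Tuple


def mosaic_layout(
    view_names: List[str],
    tile_w: int,
    tile_h: int,
    rows: int = 2,   # assumption: "2x3" means 2 rows
    cols: int = 3,   # assumption: ... by 3 columns
) -> Dict[str, Tuple[int, int]]:
    """Map each view name to the (x, y) pixel offset of its tile, row-major."""
    if len(view_names) != rows * cols:
        raise ValueError(f"expected {rows * cols} views, got {len(view_names)}")
    layout = {}
    for index, name in enumerate(view_names):
        row, col = divmod(index, cols)
        layout[name] = (col * tile_w, row * tile_h)
    return layout


# Hypothetical camera names and tile size, for illustration only.
views = ["cam0", "cam1", "cam2", "cam3", "cam4", "cam5"]
layout = mosaic_layout(views, tile_w=640, tile_h=360)
mosaic_size = (3 * 640, 2 * 360)  # full mosaic width and height in pixels
```

Keeping the view order fixed across episodes matters more than the specific order chosen, since the model can only learn a stable view-to-position association if the layout never changes.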

The v1 objective is embodied episode understanding and question answering, not a deployable robot-control policy. The default backbone is Qwen/Qwen3-Omni-30B-A3B-Instruct; the Thinking variant is reserved for a later inference comparison. Assistant outputs are strict JSON with these fields:

{
  "action": "unknown",
  "subtask": "unknown",
  "objects": [],
  "contact": "unknown",
  "transition": "unknown",
  "next_action": "unknown",
  "evidence_window": {"start_frame": 0, "end_frame": 0}
}
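
Because the outputs are meant to be strict JSON, it can help to check each generated answer against the field list above before scoring it. The following is a sketch, not the validation used by the eval scripts. It assumes "strict" means no missing or extra keys, that the five label fields are strings, and that frame indices are integers. The validate_assistant_output name is hypothetical.

```python
import json

STRING_FIELDS = ("action", "subtask", "contact", "transition", "next_action")
EXPECTED_FIELDS = set(STRING_FIELDS) | {"objects", "evidence_window"}


def validate_assistant_output(raw: str) -> list:
    """Return a list of problems; an empty list means the output looks valid."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    keys = set(obj)
    for name in sorted(EXPECTED_FIELDS - keys):
        problems.append(f"missing field: {name}")
    for name in sorted(keys - EXPECTED_FIELDS):
        problems.append(f"unexpected field: {name}")  # assumption: extras are errors
    for name in STRING_FIELDS:
        if name in obj and not isinstance(obj[name], str):
            problems.append(f"{name} must be a string")
    if "objects" in obj and not isinstance(obj["objects"], list):
        problems.append("objects must be a list")
    if "evidence_window" in obj:
        window = obj["evidence_window"]
        if not isinstance(window, dict):
            problems.append("evidence_window must be an object")
        else:
            for key in ("start_frame", "end_frame"):
                value = window.get(key)
                # bool is a subclass of int in Python, so reject it explicitly.
                if not isinstance(value, int) or isinstance(value, bool):
                    problems.append(f"evidence_window.{key} must be an integer")
    return problems


example = json.dumps({
    "action": "unknown", "subtask": "unknown", "objects": [],
    "contact": "unknown", "transition": "unknown", "next_action": "unknown",
    "evidence_window": {"start_frame": 0, "end_frame": 0},
})
problems = validate_assistant_output(example)  # empty list for the template above
```

Returning a list of problems rather than raising on the first one makes it easier to report a per-field failure rate across a whole evaluation split.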

Suggested progression:

Phase 0: preflight checks for the accelerator runtime, CUDA/PyTorch, free disk, dataset access, ffmpeg, HOMIE, and local Qwen3-Omni-Instruct weights.
Phase 1: one-episode smoke test of the adapter-only path, plus JSONL/media validation.
Phase 2: three-episode overfit for adapter-only and Qwen LoRA.
Phase 3: 32-episode pilot with held-out episodes and all comparisons.
Phase 4: scale to 64 episodes only if the 32-episode run is stable and the sensor bridge beats the video/audio/text-only LoRA on at least three primary metrics.
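
The Phase 4 gate can be written down as a small check so the scale-up decision is mechanical rather than ad hoc. This sketch covers only the "beats on at least three primary metrics" half of the rule and leaves run stability to be judged separately. It assumes every primary metric is higher-is-better. The metric names and values are placeholders, not results, and bridge_passes_gate is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def bridge_passes_gate(
    bridge: Dict[str, float],
    baseline: Dict[str, float],
    primary_metrics: List[str],
    min_wins: int = 3,
) -> Tuple[bool, List[str]]:
    """Return (passed, wins) where wins lists the metrics the bridge strictly improves.

    Assumes higher is better for every metric in primary_metrics.
    """
    wins = [name for name in primary_metrics if bridge[name] > baseline[name]]
    return len(wins) >= min_wins, wins


# Placeholder metric names and values, for illustration only.
primary = ["action_acc", "subtask_acc", "contact_f1", "next_action_acc"]
lora_only = {"action_acc": 0.5, "subtask_acc": 0.5, "contact_f1": 0.5, "next_action_acc": 0.5}
lora_bridge = {"action_acc": 0.6, "subtask_acc": 0.6, "contact_f1": 0.4, "next_action_acc": 0.6}

passed, wins = bridge_passes_gate(lora_bridge, lora_only, primary)
```

A strict comparison is used here, so ties do not count as wins. With only a handful of held-out episodes, a minimum margin or a repeated-seed comparison may be a safer criterion than a bare inequality.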

Concrete command sequence:

python scripts/omni/build_episode_manifest.py \
  --data-root /path/to/xperience10m_data \
  --max-episodes 32 \
  --output results/omni_finetune/episode_manifest.json

python scripts/omni/export_qwen3_omni_action_dataset.py \
  --manifest results/omni_finetune/episode_manifest.json \
  --run-id xperience10m_qwen3_omni_32ep_dataset

python scripts/omni/qwen3_omni_inference_smoke.py \
  --dataset-jsonl results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/dataset.jsonl \
  --split test \
  --sample-limit 3 \
  --run-id xperience10m_qwen3_omni_32ep_zero_shot

python scripts/omni/train_qwen3_omni_lora.py \
  --dataset-jsonl results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/dataset.jsonl \
  --run-id xperience10m_qwen3_omni_32ep_lora

python scripts/omni/eval_qwen3_omni_lora.py \
  --dataset-jsonl results/omni_finetune/xperience10m_qwen3_omni_32ep_dataset/dataset.jsonl \
  --adapter-dir checkpoints/xperience10m_qwen3_omni_32ep_lora/adapter_lora \
  --run-id xperience10m_qwen3_omni_32ep_eval

python scripts/omni/qwen3_omni_sensor_bridge.py \
  --sensor-adapter-model results/omni_finetune/adapter_only/adapter_only/sensor_adapter_model.pt \
  --qwen-config ../modelscope_models/Qwen__Qwen3-Omni-30B-A3B-Instruct/config.json

python scripts/omni/omni_finetune_runbook.py \
  --run-id xperience10m_qwen3_omni_32ep \
  --metric-file results/omni_finetune/xperience10m_qwen3_omni_32ep_eval/metrics.json

The bridge step intentionally comes after the native Qwen video/audio/text LoRA has overfit a tiny shard and been evaluated on held-out episodes. Before any scale-up decision, the full 32-episode pilot should compare five arms: the existing 12-task baseline, the adapter-only baseline, frozen Qwen zero-shot, Qwen LoRA without the sensor bridge, and Qwen LoRA with the sensor bridge.
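
Held-out evaluation is only meaningful if the split is made per episode, so that windows from one episode never land in both train and test. These notes do not say how the export script assigns splits. The sketch below shows one common approach, a deterministic hash of the episode ID. The assign_split name, the ID format, and the 25% test fraction are assumptions.

```python
import hashlib


def assign_split(episode_id: str, test_fraction: float = 0.25) -> str:
    """Deterministically assign an episode to "train" or "test" by hashing its ID."""
    digest = hashlib.sha256(episode_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


# Hypothetical episode IDs; every sample cut from an episode inherits its split.
episode_ids = [f"episode_{i:04d}" for i in range(32)]
splits = {episode_id: assign_split(episode_id) for episode_id in episode_ids}
```

Hashing keeps the assignment stable when the manifest grows from 32 to 64 episodes, so earlier held-out episodes stay held out. With so few episodes, the realized test share can differ noticeably from the nominal fraction.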