# Omni Model Extension Contract

This project uses one shared Xperience-10M data spine and separate backbone
adapters. Qwen3-Omni is the first implemented fine-tuning path; future
Cosmos-style world models and VLA/policy models should plug into the same
manifest, split, artifact, and evaluation discipline.

## Shared Pipeline

Every trainable branch should keep these stages:

1. **Episode selection:** choose complete Xperience-10M episodes before export.
2. **Episode split:** split by episode/session, not by adjacent windows.
3. **Manifest guard:** record every episode id, path, split, size, and missing
   modality before training.
4. **Backbone export:** convert raw windows into the model-specific sample
   format.
5. **Training:** save model config, adapter config, progress JSONL, and
   checkpoint path.
6. **Held-out evaluation:** evaluate on test episodes only after training.
7. **Run report:** write metrics, predictions, confusion matrices or
   task-specific scoring files, and skipped-episode reasons.
8. **Long-run observability:** stream `progress.jsonl` and
   `predictions.partial.jsonl` during evaluation so multi-hour held-out runs can
   be monitored and resumed without changing the final metric definitions.
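As a rough illustration of stage 8, the sketch below appends one JSON line per finished sample and skips samples that are already in the partial file on restart. The `sample_id` field, the `predict` callable, and both function names are hypothetical; the sketch also assumes every line in the partial file was written whole.

```python
import json
import os

def load_done_ids(partial_path):
    """Return the sample ids already recorded in a partial predictions file."""
    done = set()
    if os.path.exists(partial_path):
        with open(partial_path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    done.add(json.loads(line)["sample_id"])  # assumed field name
    return done

def run_eval(samples, predict, partial_path):
    """Append one JSON line per finished sample, skipping ones already done."""
    done = load_done_ids(partial_path)
    with open(partial_path, "a", encoding="utf-8") as fh:
        for sample in samples:
            if sample["sample_id"] in done:
                continue  # resumed run: this prediction is already on disk
            record = {"sample_id": sample["sample_id"], "prediction": predict(sample)}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # keep the file current for anyone tailing it
    return load_done_ids(partial_path)
```

Because resuming only changes which samples are run, not how they are scored, final metrics can still be computed from the complete prediction file in one pass.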

The current 128-episode pilot uses a fixed `96/16/16` train/val/test split by
episode.
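A minimal sketch of an episode-level split with fixed counts follows. How the pilot split was actually drawn is not stated here, so the seeded shuffle, the `split_episodes` name, and the placeholder episode ids are all assumptions.

```python
import random

def split_episodes(episode_ids, counts=(96, 16, 16), seed=0):
    """Assign whole episodes to train/val/test; windows inherit their episode's split."""
    ids = sorted(set(episode_ids))  # stable order before the seeded shuffle
    if len(ids) != sum(counts):
        raise ValueError(f"expected {sum(counts)} episodes, got {len(ids)}")
    random.Random(seed).shuffle(ids)
    n_train, n_val, _ = counts
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

# Placeholder ids for illustration only.
splits = split_episodes([f"ep{i:03d}" for i in range(128)])
```

Splitting ids before any windowing means adjacent windows from one episode can never land in different splits.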

## Backbone Registry

Backbone contracts live in:

```text
configs/omni_backbones/
```

Inspect them with:

```bash
python scripts/omni/backbone_registry.py --validate --json
```

Create a new planned backbone config from an existing contract template with:

```bash
python scripts/omni/scaffold_omni_backbone.py \
  --template-backbone policy_vla_branch \
  --id new_policy_branch \
  --display-name "New Policy Branch" \
  --model-family "Model family name" \
  --dataset-contract xperience10m_observation_action_v1 \
  --training-objective observation_to_action_policy \
  --checkpoint-gate policy_checkpoint_action_space_and_normalizer \
  --dry-run
```

Current contracts:

| Backbone | Status | Purpose |
| --- | --- | --- |
| `qwen3_omni_lora` | implemented | Structured episode-understanding JSON QA over video/audio/text plus sensor bridge features |
| `cosmos_world_model` | planned adapter | Future-window and action-conditioned world modeling |
| `policy_vla_branch` | planned adapter | Observation-to-action or motion-policy training after action-space conversion |

## Model-Neutral Window Index

The Qwen exporter produces model-ready JSONL records. To avoid tying future
branches to Qwen chat-message formatting, convert those records into a
backbone-neutral window index:

```bash
python scripts/omni/export_model_neutral_window_index.py \
  --dataset-jsonl results/omni_finetune/<run_id>_dataset/dataset.jsonl
```

This writes:

- `window_index.jsonl`
- `window_index_manifest.json`

Each neutral record keeps the same episode split and window boundaries, then
separates:

- media paths,
- sensor feature pointers,
- language context,
- JSON supervision,
- Qwen, Cosmos-style, and policy/VLA adapter views.

Future exporters should consume this neutral index when possible, then add only
the model-specific target conversion that they need.

## Artifact Contract

Every backbone config must declare an `artifact_contract` with:

- `checkpoint_gate`: the model-specific checkpoint validation rule,
- `required_training_files`: files that prove training state and configuration,
- `required_eval_files`: files that prove held-out evaluation outputs,
- `public_package_allowed`: small derived artifacts that may be published,
- `public_package_forbidden`: raw data, weights, checkpoints, or large files
  that must stay out of public packages.
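The required keys above can be checked with a few lines of Python. This is only a sketch: it assumes the config has been loaded into a dict, `missing_contract_keys` is a hypothetical helper rather than the registry's implementation, and the file lists in `example` are illustrative values.

```python
REQUIRED_KEYS = (
    "checkpoint_gate",
    "required_training_files",
    "required_eval_files",
    "public_package_allowed",
    "public_package_forbidden",
)

def missing_contract_keys(backbone_config):
    """Return artifact_contract keys that are absent or empty."""
    contract = backbone_config.get("artifact_contract") or {}
    return [key for key in REQUIRED_KEYS if not contract.get(key)]

# Illustrative config; an empty forbidden list is treated as a missing declaration.
example = {
    "id": "new_policy_branch",
    "artifact_contract": {
        "checkpoint_gate": "policy_checkpoint_action_space_and_normalizer",
        "required_training_files": ["progress.jsonl"],
        "required_eval_files": ["metrics.json", "predictions.jsonl", "RUN_REPORT.md"],
        "public_package_allowed": ["metrics.json", "RUN_REPORT.md"],
        "public_package_forbidden": [],
    },
}
print(missing_contract_keys(example))  # ['public_package_forbidden']
```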

`scripts/omni/backbone_registry.py --validate --json` checks that the contract
exists for Qwen, Cosmos-style, and policy/VLA branches. The validator and
public-safe packager read `required_eval_files`, `primary_metrics`, and
publication rules from the selected backbone config. Export, training, and
evaluation code remain model-specific, but the final validation and
publication gate follows the same contract for every branch.

The registry validation also enforces the minimum held-out evidence surface:
episode-level `train`/`val`/`test` split defaults, a leakage guard,
`held_out_episode_count`, `metrics.json`, a JSONL prediction file,
`RUN_REPORT.md`, training metadata, progress logs, and explicit forbidden
artifact categories for raw data, model weights, checkpoints, and archives.

## Qwen3-Omni Contract

Qwen3-Omni consumes:

- rendered multi-camera mosaic video,
- extracted MP4 audio,
- language prompt and label options,
- optional sensor-bridge summaries/features.

It predicts strict JSON:

```json
{
  "action": "string",
  "subtask": "string",
  "objects": ["string"],
  "contact": "string",
  "transition": "string",
  "next_action": "string",
  "evidence_window": {"start_frame": 0, "end_frame": 0}
}
```
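One way to enforce "strict" on the consumer side is to reject any output that does not match the schema exactly. The sketch below assumes extra keys count as errors and that frame indices are integers; how the project's evaluator treats malformed outputs is not specified here, and `parse_prediction` is a hypothetical helper.

```python
import json

# Expected top-level keys and their JSON types, taken from the schema above.
SCHEMA = {
    "action": str, "subtask": str, "objects": list, "contact": str,
    "transition": str, "next_action": str, "evidence_window": dict,
}

def is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)

def parse_prediction(text):
    """Return (record, errors); an empty error list means the output is valid."""
    try:
        record = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["top-level value is not an object"]
    errors = [f"missing key: {k}" for k in SCHEMA if k not in record]
    errors += [f"unexpected key: {k}" for k in record if k not in SCHEMA]
    errors += [f"wrong type: {k}" for k, t in SCHEMA.items()
               if k in record and not isinstance(record[k], t)]
    objects = record.get("objects")
    if isinstance(objects, list) and not all(isinstance(o, str) for o in objects):
        errors.append("objects must contain only strings")
    window = record.get("evidence_window")
    if isinstance(window, dict):
        errors += [f"evidence_window.{k} must be an integer"
                   for k in ("start_frame", "end_frame") if not is_int(window.get(k))]
    return record, errors
```

Returning the error list instead of raising lets an evaluator log the reason next to the raw output and still count the sample as a failure.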

Implemented entrypoints:

- `scripts/omni/parallel_export_qwen3_omni_action_dataset.py`
- `scripts/omni/train_qwen3_omni_lora.py`
- `scripts/omni/eval_qwen3_omni_lora.py`
- `scripts/omni/watch_omni_train_then_eval.py`
- `scripts/omni/run_128_fullsplit_parallel_export_8gpu.sh`

The watcher is the current post-training gate runner. For the Qwen3-Omni LoRA
branch it waits for `progress.jsonl` to end in `complete`, checks the PEFT LoRA
safetensors shapes, runs the training validator, runs a held-out eval smoke
test, then runs the full held-out test evaluation.

The Qwen evaluator writes partial predictions during inference. After all
selected held-out windows finish, it finalizes `predictions.jsonl`,
`predictions.csv`, `metrics.json`, `confusion_matrix.csv`, and `RUN_REPORT.md`.
A restarted eval can resume from the partial prediction file.
For faster held-out evaluation, the Qwen evaluator can also run deterministic
sample shards via `--sample-offset` and `--sample-stride`. Sharded outputs must
be merged with `scripts/omni/merge_qwen3_omni_eval_shards.py`, which recomputes
the final metrics from the combined predictions and checks for missing or
duplicate sample ids.
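The sketch below illustrates offset/stride sharding and the merge-time integrity check. It assumes a shard takes every `stride`-th sample starting at `offset`, which is one natural reading of the two flags rather than a confirmed description of the evaluator; the function names and the `sample_id` field are hypothetical.

```python
from collections import Counter

def shard_indices(num_samples, offset, stride):
    """Deterministic shard: every `stride`-th sample starting at `offset`."""
    return list(range(offset, num_samples, stride))

def check_merged(expected_ids, shard_predictions):
    """Combine shard prediction lists and report missing or duplicate sample ids."""
    counts = Counter(p["sample_id"] for shard in shard_predictions for p in shard)
    missing = sorted(set(expected_ids) - set(counts))
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return missing, duplicates
```

With a stride equal to the number of shards and offsets `0..stride-1`, the shards partition the sample list, so a clean merge should report no missing and no duplicate ids.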

Future model families can reuse the same wait/eval sequence only if their
checkpoint artifact has a compatible gate. Otherwise they should provide a
model-specific checkpoint check and evaluator, while keeping the same episode
split and held-out reporting discipline.

## Cosmos-Style World Model Contract

Cosmos-style work should not reuse the JSON QA exporter as-is. It needs a
future-window exporter with samples shaped like:

```json
{
  "episode_id": "session__ep",
  "split": "train",
  "context_window": {"start_frame": 0, "end_frame": 119},
  "target_window": {"start_frame": 120, "end_frame": 179},
  "conditioning": {
    "video": "path-or-latent",
    "audio": "path-or-features",
    "pose": "feature path",
    "depth": "feature path",
    "mocap": "feature path",
    "imu": "feature path",
    "language": "task context"
  },
  "target": {
    "future_video": "path-or-latent",
    "future_sensor_features": "path",
    "transition": "label"
  }
}
```
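The context/target layout in that sample can be generated by sliding a fixed-length pair of windows over an episode. This is a sketch under assumptions: frame indices are inclusive as in the example, and the default lengths and step are illustrative, not project settings.

```python
def future_window_pairs(num_frames, context_len=120, target_len=60, step=60):
    """Yield adjacent (context, target) windows with inclusive frame bounds."""
    pairs = []
    start = 0
    while start + context_len + target_len <= num_frames:
        ctx_end = start + context_len - 1
        pairs.append({
            "context_window": {"start_frame": start, "end_frame": ctx_end},
            # The target begins on the frame right after the context ends.
            "target_window": {"start_frame": ctx_end + 1,
                              "end_frame": ctx_end + target_len},
        })
        start += step
    return pairs
```

For a 180-frame episode the defaults yield a single pair, context `0..119` and target `120..179`, matching the sample shape above.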

Minimum evaluators:

- future retrieval MRR / recall@5,
- temporal consistency,
- feature reconstruction error,
- transition/contact prediction,
- qualitative generated or retrieved examples.
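For the retrieval metrics, a minimal reference computation might look like the following. It assumes each query has exactly one correct future window and that candidates arrive as a ranked list of ids; `retrieval_metrics` is a hypothetical helper.

```python
def retrieval_metrics(ranked_candidates, gold, k=5):
    """Compute MRR and recall@k for single-target retrieval.

    ranked_candidates: one ranked list of candidate ids per query (best first).
    gold: the correct candidate id per query.
    """
    reciprocal_ranks = []
    hits = 0
    for ranking, target in zip(ranked_candidates, gold):
        if target in ranking:
            rank = ranking.index(target) + 1  # ranks are 1-based
            reciprocal_ranks.append(1.0 / rank)
            if rank <= k:
                hits += 1
        else:
            reciprocal_ranks.append(0.0)  # target never retrieved
    n = len(gold)
    return {"mrr": sum(reciprocal_ranks) / n, f"recall@{k}": hits / n}
```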

Cosmos-style checkpoints are not LoRA adapters by default. Their post-training
gate should verify generated latent/video checkpoints, model config, scheduler
state, and future-window evaluator outputs instead of using the Qwen LoRA
safetensors check.

## VLA / Policy Contract

Policy branches need an explicit action target before training. A valid sample
must state whether the target is an action class, next action, hand trajectory,
contact event, retargeted humanoid action, or robot-compatible action token.

The first policy exporter should save:

- observation media/features,
- language instruction or task context,
- action target,
- action normalization metadata fit on train episodes only,
- target provenance from the original annotation/mocap/contact fields.

Minimum evaluators:

- action or next-action accuracy,
- contact accuracy,
- trajectory MPJPE when trajectories are used,
- object-affordance F1,
- held-out episode count and leakage check.
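MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and reference joint positions. A dependency-free sketch, assuming trajectories are nested lists shaped `[frame][joint][xyz]` in the same units and coordinate frame:

```python
import math

def mpjpe(pred, target):
    """Mean Euclidean distance over all frames and joints."""
    dists = [
        math.dist(p_joint, t_joint)
        for p_frame, t_frame in zip(pred, target)
        for p_joint, t_joint in zip(p_frame, t_frame)
    ]
    if not dists:
        raise ValueError("empty trajectory")
    return sum(dists) / len(dists)
```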

Policy checkpoints should additionally save the action-space definition,
normalization statistics, and retargeting/conversion metadata. Any fitted
quantities must be fit from train episodes only, and all of this metadata must
be validated before any held-out policy metrics are reported.
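A sketch of fitting normalization statistics on train episodes only is shown below. The `split` and `action` fields and the per-dimension mean/std scheme are assumptions for illustration; the actual action space and normalizer are not defined in this document.

```python
import statistics

def fit_normalizer(samples):
    """Fit per-dimension mean/std from train-split samples only."""
    train = [s["action"] for s in samples if s["split"] == "train"]
    if not train:
        raise ValueError("no train samples to fit on")
    dims = list(zip(*train))  # transpose to one tuple per action dimension
    mean = [statistics.fmean(d) for d in dims]
    std = [statistics.pstdev(d) or 1.0 for d in dims]  # constant dims keep scale 1
    return {"mean": mean, "std": std, "fit_split": "train", "num_samples": len(train)}

def normalize(action, stats):
    """Apply train-fitted statistics to an action from any split."""
    return [(a - m) / s for a, m, s in zip(action, stats["mean"], stats["std"])]
```

Storing `fit_split` and `num_samples` alongside the statistics gives a checkpoint gate something concrete to verify before held-out metrics are reported.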

## Non-Negotiable Invariants

- Do not train on held-out test episodes.
- Do not report model quality without predictions and metrics from held-out
  episodes.
- Do not redistribute raw gated MP4, HDF5, RRD, full checkpoint, or full model
  weight files.
- Do not treat a smoke run or one-episode overfit run as a real held-out model
  result.
- Record skipped episodes with reasons instead of silently dropping them.
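The first invariant can be backed by a simple leakage check over exported records. The sketch assumes each record carries `episode_id` and `split` fields, as in the sample shapes above; `find_leaked_episodes` is a hypothetical helper, not an existing project script.

```python
from collections import defaultdict

def find_leaked_episodes(records):
    """Return episodes whose windows appear in more than one split."""
    splits_by_episode = defaultdict(set)
    for record in records:
        splits_by_episode[record["episode_id"]].add(record["split"])
    return {
        episode: sorted(splits)
        for episode, splits in splits_by_episode.items()
        if len(splits) > 1
    }
```

An empty result means every episode maps to exactly one split; any non-empty result should block training and evaluation.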