	# Omni Model Extension Contract

	This project uses one shared Xperience-10M data spine and separate backbone
	adapters. Qwen3-Omni is the first implemented fine-tuning path; future
	Cosmos-style world models and VLA/policy models should plug into the same
	manifest, split, artifact, and evaluation discipline.

	## Shared Pipeline

	Every trainable branch should keep these stages:

	1. Episode selection: choose complete Xperience-10M episodes before export.
	2. Episode split: split by episode/session, not by adjacent windows.
	3. Manifest guard: record every episode id, path, split, size, and missing
	modality before training.
	4. Backbone export: convert raw windows into the model-specific sample
	format.
	5. Training: save model config, adapter config, progress JSONL, and
	checkpoint path.
	6. Held-out evaluation: evaluate on test episodes only after training.
	7. Run report: write metrics, predictions, confusion matrices or
	task-specific scoring files, and skipped-episode reasons.
	8. Long-run observability: stream `progress.jsonl` and
	`predictions.partial.jsonl` during evaluation so multi-hour held-out runs can
	be monitored and resumed without changing the final metric definitions.
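Stage 8 can be sketched as an append-only writer that skips samples already present in the partial file. This is a minimal illustration, not the repository's evaluator: the `sample_id` and `prediction` field names, the `predict` callable, and both helper names are assumptions, and the sketch assumes every line in the partial file is a complete JSON record.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_done_ids(partial_path: Path) -> set[str]:
    """Collect sample ids already written to the partial prediction file."""
    done: set[str] = set()
    if not partial_path.exists():
        return done
    with partial_path.open("r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                done.add(json.loads(line)["sample_id"])  # hypothetical field name
    return done


def run_resumable_eval(
    samples: Iterable[dict],
    predict: Callable[[dict], object],
    partial_path: Path,
) -> int:
    """Append one record per unfinished sample; return how many were written."""
    done = load_done_ids(partial_path)
    written = 0
    with partial_path.open("a", encoding="utf-8") as handle:
        for sample in samples:
            if sample["sample_id"] in done:
                continue  # finished in an earlier run
            record = {"sample_id": sample["sample_id"], "prediction": predict(sample)}
            handle.write(json.dumps(record) + "\n")
            handle.flush()  # keep the file usable for monitoring mid-run
            written += 1
    return written
```

Because resuming only changes which samples are recomputed, the final metrics can still be derived from the complete prediction set in the usual way.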

	The current 128-episode pilot uses a fixed `96/16/16` train/val/test split by
	episode.
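A minimal sketch of an episode-level `96/16/16` split with a leakage check follows. The seeded shuffle is an assumption made for illustration; the pilot's actual assignment rule is not described here, and `split_episodes` and `assert_no_leakage` are hypothetical helper names.

```python
import random


def split_episodes(episode_ids, counts=(96, 16, 16), seed=0):
    """Assign whole episodes to train/val/test; windows inherit their episode's split."""
    unique_ids = sorted(set(episode_ids))
    if len(unique_ids) != sum(counts):
        raise ValueError(f"expected {sum(counts)} episodes, got {len(unique_ids)}")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique_ids)
    n_train, n_val, _ = counts
    return {
        "train": unique_ids[:n_train],
        "val": unique_ids[n_train:n_train + n_val],
        "test": unique_ids[n_train + n_val:],
    }


def assert_no_leakage(splits):
    """Raise if any episode id appears in more than one split."""
    seen = {}
    for name, ids in splits.items():
        for episode_id in ids:
            if episode_id in seen:
                raise ValueError(
                    f"episode {episode_id} appears in both {seen[episode_id]} and {name}"
                )
            seen[episode_id] = name
```

Splitting before window export means adjacent windows from one episode can never land on both sides of the train/test boundary.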

	## Backbone Registry

	Backbone contracts live in:

	```text
	configs/omni_backbones/
	```

	Inspect them with:

	```bash
	python scripts/omni/backbone_registry.py --validate --json
	```

	Create a new planned backbone config from an existing contract template with:

	```bash
	python scripts/omni/scaffold_omni_backbone.py \
	  --template-backbone policy_vla_branch \
	  --id new_policy_branch \
	  --display-name "New Policy Branch" \
	  --model-family "Model family name" \
	  --dataset-contract xperience10m_observation_action_v1 \
	  --training-objective observation_to_action_policy \
	  --checkpoint-gate policy_checkpoint_action_space_and_normalizer \
	  --dry-run
	```

	Current contracts:

	| Backbone | Status | Purpose |
	| --- | --- | --- |
	| `qwen3_omni_lora` | implemented | Structured episode-understanding JSON QA over video/audio/text plus sensor bridge features |
	| `cosmos_world_model` | planned adapter | Future-window and action-conditioned world modeling |
	| `policy_vla_branch` | planned adapter | Observation-to-action or motion-policy training after action-space conversion |

	## Model-Neutral Window Index

	The Qwen exporter produces model-ready JSONL records. To avoid tying future
	branches to Qwen chat-message formatting, convert those records into a
	backbone-neutral window index:

	```bash
	python scripts/omni/export_model_neutral_window_index.py \
	  --dataset-jsonl results/omni_finetune/<run_id>_dataset/dataset.jsonl
	```

	This writes:

	- `window_index.jsonl`
	- `window_index_manifest.json`

	Each neutral record keeps the same episode split and window boundaries, then
	separates:

	- media paths,
	- sensor feature pointers,
	- language context,
	- JSON supervision,
	- Qwen, Cosmos-style, and policy/VLA adapter views.

	Future exporters should consume this neutral index when possible, then add only
	the model-specific target conversion that they need.
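One way to picture a neutral record is a small dataclass that round-trips through a JSONL line. Every field name below is hypothetical and chosen only to mirror the separation listed above; the real `window_index.jsonl` schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class NeutralWindow:
    # All field names are illustrative assumptions, not the exporter's schema.
    episode_id: str
    split: str
    start_frame: int
    end_frame: int
    media_paths: dict = field(default_factory=dict)
    sensor_feature_pointers: dict = field(default_factory=dict)
    language_context: str = ""
    json_supervision: dict = field(default_factory=dict)
    adapter_views: dict = field(default_factory=dict)  # e.g. "qwen", "cosmos", "policy"

    def to_jsonl_line(self) -> str:
        """Serialize to one JSON line with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)


def parse_window(line: str) -> NeutralWindow:
    """Rebuild a record from one line of a window index file."""
    return NeutralWindow(**json.loads(line))
```

A future exporter would read such records and add only its own target conversion, leaving split and window boundaries untouched.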

	## Artifact Contract

	Every backbone config must declare an `artifact_contract` with:

	- `checkpoint_gate`: the model-specific checkpoint validation rule,
	- `required_training_files`: files that prove training state and configuration,
	- `required_eval_files`: files that prove held-out evaluation outputs,
	- `public_package_allowed`: small derived artifacts that may be published,
	- `public_package_forbidden`: raw data, weights, checkpoints, or large files
	that must stay out of public packages.

	`scripts/omni/backbone_registry.py --validate --json` checks that this contract
	exists for the Qwen, Cosmos-style, and policy/VLA branches. The validator and
	public-safe packager read `required_eval_files`, `primary_metrics`, and
	publication rules from the selected backbone config. Export, training, and
	evaluation code remain model-specific, but the final validation and
	publication gate follows the same contract for every future branch.

	The registry validation also enforces the minimum held-out evidence surface:
	episode-level `train`/`val`/`test` split defaults, a leakage guard,
	`held_out_episode_count`, `metrics.json`, a JSONL prediction file,
	`RUN_REPORT.md`, training metadata, progress logs, and explicit forbidden
	artifact categories for raw data, model weights, checkpoints, and archives.
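The required `artifact_contract` keys lend themselves to a simple presence check. The sketch below is not the logic of `backbone_registry.py`; it only illustrates the kind of check described, and `contract_problems` is a hypothetical helper operating on an already-loaded config dict.

```python
REQUIRED_CONTRACT_KEYS = (
    "checkpoint_gate",
    "required_training_files",
    "required_eval_files",
    "public_package_allowed",
    "public_package_forbidden",
)


def contract_problems(backbone_config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the contract looks complete."""
    contract = backbone_config.get("artifact_contract")
    if not isinstance(contract, dict):
        return ["artifact_contract is missing"]
    problems = []
    for key in REQUIRED_CONTRACT_KEYS:
        if key not in contract:
            problems.append(f"artifact_contract.{key} is missing")
        elif not contract[key]:
            problems.append(f"artifact_contract.{key} is empty")
    return problems
```

Returning a list of problems rather than raising on the first one lets a validator report every gap in a config in a single pass.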

	## Qwen3-Omni Contract

	Qwen3-Omni consumes:

	- rendered multi-camera mosaic video,
	- extracted MP4 audio,
	- language prompt and label options,
	- optional sensor-bridge summaries/features.

	It predicts strict JSON:

	```json
	{
	  "action": "string",
	  "subtask": "string",
	  "objects": ["string"],
	  "contact": "string",
	  "transition": "string",
	  "next_action": "string",
	  "evidence_window": {"start_frame": 0, "end_frame": 0}
	}
	```
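"Strict JSON" can be enforced at evaluation time with a small parser that rejects any output not matching the schema above. This is a sketch under stated assumptions: `parse_strict_prediction` is a hypothetical name, and the `start_frame <= end_frame` rule is an added assumption rather than something the contract states.

```python
import json

STRING_FIELDS = ("action", "subtask", "contact", "transition", "next_action")
EXPECTED_KEYS = set(STRING_FIELDS) | {"objects", "evidence_window"}


def parse_strict_prediction(text: str) -> dict:
    """Parse model output and raise ValueError unless it matches the schema exactly."""
    pred = json.loads(text)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(pred, dict):
        raise ValueError("prediction must be a JSON object")
    if set(pred) != EXPECTED_KEYS:
        raise ValueError(f"missing or unexpected keys: {sorted(set(pred) ^ EXPECTED_KEYS)}")
    for name in STRING_FIELDS:
        if not isinstance(pred[name], str):
            raise ValueError(f"{name} must be a string")
    objects = pred["objects"]
    if not isinstance(objects, list) or not all(isinstance(o, str) for o in objects):
        raise ValueError("objects must be a list of strings")
    window = pred["evidence_window"]
    if not isinstance(window, dict) or set(window) != {"start_frame", "end_frame"}:
        raise ValueError("evidence_window must have start_frame and end_frame")
    start, end = window["start_frame"], window["end_frame"]
    for value in (start, end):
        if not isinstance(value, int) or isinstance(value, bool):
            raise ValueError("frame indices must be integers")
    if start > end:  # assumption: windows are non-empty and ordered
        raise ValueError("start_frame must not exceed end_frame")
    return pred
```

Counting parse failures separately from wrong answers keeps format errors visible in the run report instead of folding them into accuracy.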

	Implemented entrypoints:

	- `scripts/omni/parallel_export_qwen3_omni_action_dataset.py`
	- `scripts/omni/train_qwen3_omni_lora.py`
	- `scripts/omni/eval_qwen3_omni_lora.py`
	- `scripts/omni/watch_omni_train_then_eval.py`
	- `scripts/omni/run_128_fullsplit_parallel_export_8gpu.sh`

	The watcher is the current post-training gate runner. For the Qwen3-Omni LoRA
	branch it waits for `progress.jsonl` to end in `complete`, checks the PEFT LoRA
	safetensors shapes, runs the training validator, runs a held-out eval smoke,
	then runs the full held-out test evaluation.
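The first step of that gate, waiting for `progress.jsonl` to end in `complete`, might look like the polling loop below. The record layout is an assumption: the sketch treats each line as a JSON object with a hypothetical `status` field, and neither function is taken from `watch_omni_train_then_eval.py`.

```python
import json
import time
from pathlib import Path
from typing import Optional


def training_complete(progress_path: Path) -> bool:
    """True when the last non-empty progress line reports completion."""
    if not progress_path.exists():
        return False
    text = progress_path.read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    try:
        last = json.loads(lines[-1])
    except json.JSONDecodeError:
        return False  # the trainer may be mid-write; check again later
    return isinstance(last, dict) and last.get("status") == "complete"  # hypothetical field


def wait_for_training(
    progress_path: Path,
    poll_seconds: float = 60.0,
    timeout_seconds: Optional[float] = None,
) -> bool:
    """Poll until training completes; return False if the timeout expires first."""
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    while not training_complete(progress_path):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True
```

Only after this returns `True` would the later checks (adapter shapes, training validator, eval smoke, full held-out eval) run in order.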

	The Qwen evaluator writes partial predictions during inference and finalizes the
	same `predictions.jsonl`, `predictions.csv`, `metrics.json`,
	`confusion_matrix.csv`, and `RUN_REPORT.md` files after all selected held-out
	windows finish. A restarted eval can resume from the partial prediction file.
	For faster held-out evaluation, the Qwen evaluator can also run deterministic
	sample shards via `--sample-offset` and `--sample-stride`. Sharded outputs must
	be merged with `scripts/omni/merge_qwen3_omni_eval_shards.py`, which recomputes
	the final metrics from combined predictions and checks missing or duplicate
	sample ids.
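Deterministic sharding and the merge check can be illustrated as follows. The sketch assumes `--sample-offset` and `--sample-stride` behave like a Python slice `[offset::stride]` over a fixed sample order; that interpretation, and both function names, are assumptions rather than the documented behavior of the evaluator or merge script.

```python
from collections import Counter


def shard_samples(sample_ids, offset, stride):
    """Select every `stride`-th sample starting at `offset` (assumed semantics)."""
    if stride < 1 or not 0 <= offset < stride:
        raise ValueError("need stride >= 1 and 0 <= offset < stride")
    return list(sample_ids)[offset::stride]


def check_merged_ids(expected_ids, shard_outputs):
    """Report ids that are missing, duplicated, or unexpected across all shards."""
    counts = Counter(sample_id for shard in shard_outputs for sample_id in shard)
    expected = set(expected_ids)
    return {
        "missing": sorted(expected - set(counts)),
        "duplicate": sorted(sid for sid, n in counts.items() if n > 1),
        "unexpected": sorted(set(counts) - expected),
    }
```

With offsets `0..stride-1` the shards partition the sample list, so a clean merge shows all three report lists empty before metrics are recomputed from the combined predictions.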

	Future model families can reuse the same wait/eval sequence only if their
	checkpoint artifact has a compatible gate. Otherwise they should provide a
	model-specific checkpoint check and evaluator, while keeping the same episode
	split and held-out reporting discipline.

	## Cosmos-Style World Model Contract

	Cosmos-style work should not reuse the JSON QA exporter as-is. It needs a
	future-window exporter with samples shaped like:

	```json
	{
	  "episode_id": "session__ep",
	  "split": "train",
	  "context_window": {"start_frame": 0, "end_frame": 119},
	  "target_window": {"start_frame": 120, "end_frame": 179},
	  "conditioning": {
	    "video": "path-or-latent",
	    "audio": "path-or-features",
	    "pose": "feature path",
	    "depth": "feature path",
	    "mocap": "feature path",
	    "imu": "feature path",
	    "language": "task context"
	  },
	  "target": {
	    "future_video": "path-or-latent",
	    "future_sensor_features": "path",
	    "transition": "label"
	  }
	}
	```
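The context/target frame ranges in such a sample could be generated per episode as below. The default lengths (120 context frames, 60 target frames) are read off the example above; the 60-frame hop and the function name are assumptions made for illustration.

```python
def future_window_pairs(num_frames, context_len=120, target_len=60, hop=60):
    """Return inclusive context/target frame ranges that fit inside one episode."""
    if min(context_len, target_len, hop) < 1:
        raise ValueError("context_len, target_len, and hop must be positive")
    pairs = []
    start = 0
    while start + context_len + target_len <= num_frames:
        context_end = start + context_len - 1
        pairs.append({
            "context_window": {"start_frame": start, "end_frame": context_end},
            # The target begins on the frame right after the context ends.
            "target_window": {
                "start_frame": context_end + 1,
                "end_frame": context_end + target_len,
            },
        })
        start += hop
    return pairs
```

Because pairs are built inside a single episode, every sample inherits that episode's split and no target window can cross an episode boundary.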

	Minimum evaluators:

	- future retrieval MRR / recall@5,
	- temporal consistency,
	- feature reconstruction error,
	- transition/contact prediction,
	- qualitative generated or retrieved examples.
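For the retrieval evaluator, MRR and recall@k follow directly from the rank at which each query's true future window is retrieved. The sketch below uses the standard definitions; the function name and the input format (a list of 1-based ranks) are assumptions.

```python
def retrieval_metrics(ranks, k=5):
    """Compute MRR and recall@k from 1-based ranks of the correct future window."""
    if not ranks:
        raise ValueError("need at least one query")
    if any(rank < 1 for rank in ranks):
        raise ValueError("ranks are 1-based")
    mrr = sum(1.0 / rank for rank in ranks) / len(ranks)
    recall_at_k = sum(1 for rank in ranks if rank <= k) / len(ranks)
    return {"mrr": mrr, f"recall@{k}": recall_at_k}
```

For example, ranks of 1, 2, and 10 give an MRR of (1 + 0.5 + 0.1) / 3 and a recall@5 of 2/3, since two of the three queries rank the true window in the top five.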

	Cosmos-style checkpoints are not LoRA adapters by default. Their post-training
	gate should verify generated latent/video checkpoints, model config, scheduler
	state, and future-window evaluator outputs instead of using the Qwen LoRA
	safetensors check.

	## VLA / Policy Contract

	Policy branches need an explicit action target before training. A valid sample
	must state whether the target is an action class, next action, hand trajectory,
	contact event, retargeted humanoid action, or robot-compatible action token.

	The first policy exporter should save:

	- observation media/features,
	- language instruction or task context,
	- action target,
	- action normalization metadata fit on train episodes only,
	- target provenance from the original annotation/mocap/contact fields.
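Fitting normalization on train episodes only can be sketched with per-dimension mean and standard deviation. The sample layout (`split` and `action` keys), the choice of mean/std scaling, and the function names are all assumptions; a real exporter may use a different normalizer.

```python
import math


def fit_action_normalizer(samples):
    """Fit per-dimension mean/std from train samples only; other splits are ignored."""
    train_actions = [s["action"] for s in samples if s["split"] == "train"]
    if not train_actions:
        raise ValueError("no train samples to fit on")
    dims = len(train_actions[0])
    count = len(train_actions)
    mean = [sum(a[d] for a in train_actions) / count for d in range(dims)]
    std = []
    for d in range(dims):
        variance = sum((a[d] - mean[d]) ** 2 for a in train_actions) / count
        std.append(max(math.sqrt(variance), 1e-8))  # avoid division by zero
    return {"mean": mean, "std": std, "fit_split": "train", "num_samples": count}


def normalize(action, stats):
    """Apply train-fitted statistics to an action from any split."""
    return [(x - m) / s for x, m, s in zip(action, stats["mean"], stats["std"])]
```

Storing `fit_split` and `num_samples` next to the statistics gives a later checkpoint gate something concrete to verify.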

	Minimum evaluators:

	- action or next-action accuracy,
	- contact accuracy,
	- trajectory MPJPE when trajectories are used,
	- object-affordance F1,
	- held-out episode count and leakage check.
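Trajectory MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and target joint positions. A minimal sketch, assuming trajectories are given as frames of `(x, y, z)` joints in matching order and units:

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance over all joints and frames, in the input units."""
    if len(pred) != len(target):
        raise ValueError("pred and target need the same number of frames")
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        if len(pred_frame) != len(target_frame):
            raise ValueError("each frame needs the same number of joints")
        for pred_joint, target_joint in zip(pred_frame, target_frame):
            total += math.dist(pred_joint, target_joint)
            count += 1
    if count == 0:
        raise ValueError("no joints to compare")
    return total / count
```

If predictions are made in normalized action space, they need to be mapped back to the original units before this metric is meaningful.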

	Policy checkpoints should additionally save the action-space definition,
	normalization statistics, and retargeting/conversion metadata. Any fitted
	quantities must be derived from train episodes only, and all three must be
	validated before any held-out policy metrics are reported.

	## Non-Negotiable Invariants

	- Do not train on held-out test episodes.
	- Do not report model quality without predictions and metrics from held-out
	episodes.
	- Do not redistribute raw gated MP4, HDF5, RRD, full checkpoint, or full model
	weight files.
	- Do not treat a smoke run or one-episode overfit run as a real held-out model
	result.
	- Record skipped episodes with reasons instead of silently dropping them.
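The last invariant can be made concrete with a selection step that returns skipped episodes alongside kept ones. The manifest layout (`episode_id`, `modalities`), the required-modality list, and the function name are hypothetical and serve only to show the pattern.

```python
REQUIRED_MODALITIES = ("video", "audio")  # hypothetical requirement for illustration


def select_episodes(manifest):
    """Split manifest entries into kept ids and skipped records with reasons."""
    kept, skipped = [], []
    for episode in manifest:
        available = episode.get("modalities", [])
        missing = [m for m in REQUIRED_MODALITIES if m not in available]
        if missing:
            skipped.append({
                "episode_id": episode["episode_id"],
                "reason": "missing modality: " + ", ".join(missing),
            })
        else:
            kept.append(episode["episode_id"])
    return kept, skipped
```

Writing the `skipped` list into the run report keeps the held-out episode count auditable.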