# Omni Model Extension Contract
This project uses one shared Xperience-10M data spine and separate backbone
adapters. Qwen3-Omni is the first implemented fine-tuning path; future
Cosmos-style world models and VLA/policy models should plug into the same
manifest, split, artifact, and evaluation discipline.
## Shared Pipeline
Every trainable branch should keep these stages:
1. **Episode selection:** choose complete Xperience-10M episodes before export.
2. **Episode split:** split by episode/session, not by adjacent windows.
3. **Manifest guard:** record every episode id, path, split, size, and missing
modality before training.
4. **Backbone export:** convert raw windows into the model-specific sample
format.
5. **Training:** save model config, adapter config, progress JSONL, and
checkpoint path.
6. **Held-out evaluation:** evaluate on test episodes only after training.
7. **Run report:** write metrics, predictions, confusion matrices or
task-specific scoring files, and skipped-episode reasons.
8. **Long-run observability:** stream `progress.jsonl` and
`predictions.partial.jsonl` during evaluation so multi-hour held-out runs can
be monitored and resumed without changing the final metric definitions.
The current 128-episode pilot uses a fixed `96/16/16` train/val/test split by
episode.
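
A minimal sketch of an episode-level split with a leakage check, assuming episode ids are plain strings and that a seeded shuffle is acceptable. The names `split_episodes` and `assert_no_leakage`, the seed, and the example ids are illustrative and are not taken from the project's exporter.

```python
import random


def split_episodes(episode_ids, counts=(96, 16, 16), seed=0):
    """Assign whole episodes to train/val/test; windows inherit the episode's split."""
    unique_ids = sorted(set(episode_ids))
    if len(unique_ids) != sum(counts):
        raise ValueError(f"expected {sum(counts)} episodes, got {len(unique_ids)}")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique_ids)
    n_train, n_val, _ = counts
    return {
        "train": unique_ids[:n_train],
        "val": unique_ids[n_train:n_train + n_val],
        "test": unique_ids[n_train + n_val:],
    }


def assert_no_leakage(splits):
    """Fail if any episode id appears in more than one split."""
    seen = {}
    for name, ids in splits.items():
        for episode_id in ids:
            if episode_id in seen:
                raise ValueError(f"{episode_id} is in both {seen[episode_id]} and {name}")
            seen[episode_id] = name


# Hypothetical episode ids, used only to exercise the functions.
splits = split_episodes([f"session{i:03d}__ep" for i in range(128)])
assert_no_leakage(splits)
```

Splitting before windowing is what keeps adjacent windows from the same episode out of different splits.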
## Backbone Registry
Backbone contracts live in:
```text
configs/omni_backbones/
```
Inspect them with:
```bash
python scripts/omni/backbone_registry.py --validate --json
```
Create a new planned backbone config from an existing contract template with:
```bash
python scripts/omni/scaffold_omni_backbone.py \
  --template-backbone policy_vla_branch \
  --id new_policy_branch \
  --display-name "New Policy Branch" \
  --model-family "Model family name" \
  --dataset-contract xperience10m_observation_action_v1 \
  --training-objective observation_to_action_policy \
  --checkpoint-gate policy_checkpoint_action_space_and_normalizer \
  --dry-run
```
Current contracts:
| Backbone | Status | Purpose |
| --- | --- | --- |
| `qwen3_omni_lora` | implemented | Structured episode-understanding JSON QA over video/audio/text plus sensor bridge features |
| `cosmos_world_model` | planned adapter | Future-window and action-conditioned world modeling |
| `policy_vla_branch` | planned adapter | Observation-to-action or motion-policy training after action-space conversion |
## Model-Neutral Window Index
The Qwen exporter produces model-ready JSONL records. To avoid tying future
branches to Qwen chat-message formatting, convert those records into a
backbone-neutral window index:
```bash
python scripts/omni/export_model_neutral_window_index.py \
  --dataset-jsonl results/omni_finetune/<run_id>_dataset/dataset.jsonl
```
This writes:
- `window_index.jsonl`
- `window_index_manifest.json`
Each neutral record keeps the same episode split and window boundaries, then
separates:
- media paths,
- sensor feature pointers,
- language context,
- JSON supervision,
- Qwen, Cosmos-style, and policy/VLA adapter views.
Future exporters should consume this neutral index when possible, then add only
the model-specific target conversion that they need.
## Artifact Contract
Every backbone config must declare an `artifact_contract` with:
- `checkpoint_gate`: the model-specific checkpoint validation rule,
- `required_training_files`: files that prove training state and configuration,
- `required_eval_files`: files that prove held-out evaluation outputs,
- `public_package_allowed`: small derived artifacts that may be published,
- `public_package_forbidden`: raw data, weights, checkpoints, or large files
that must stay out of public packages.
`scripts/omni/backbone_registry.py --validate --json` checks that the contract
exists for Qwen, Cosmos-style, and policy/VLA branches. The validator and
public-safe packager read `required_eval_files`, `primary_metrics`, and
publication rules from the selected backbone config. Export, training, and
evaluation code remain model-specific, but the final validation and
publication gate follows the same contract for every future branch.
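
As a rough illustration of the five `artifact_contract` fields, the sketch below checks that a config dictionary declares each of them. It assumes the config has already been loaded into a Python dict. The helper `missing_contract_keys` and all example values are hypothetical and do not reproduce what `backbone_registry.py` actually does.

```python
REQUIRED_CONTRACT_KEYS = (
    "checkpoint_gate",
    "required_training_files",
    "required_eval_files",
    "public_package_allowed",
    "public_package_forbidden",
)


def missing_contract_keys(backbone_config):
    """Return the artifact_contract keys that are absent or empty."""
    contract = backbone_config.get("artifact_contract") or {}
    return [key for key in REQUIRED_CONTRACT_KEYS if not contract.get(key)]


# Hypothetical config: the values are placeholders, not the real contract.
example_config = {
    "id": "new_policy_branch",
    "artifact_contract": {
        "checkpoint_gate": "policy_checkpoint_action_space_and_normalizer",
        "required_training_files": ["progress.jsonl"],
        "required_eval_files": ["metrics.json", "predictions.jsonl", "RUN_REPORT.md"],
        "public_package_allowed": ["metrics.json", "RUN_REPORT.md"],
        "public_package_forbidden": ["*.safetensors", "*.mp4"],
    },
}

problems = missing_contract_keys(example_config)
```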
The registry validation also enforces the minimum held-out evidence surface:
episode-level `train`/`val`/`test` split defaults, a leakage guard,
`held_out_episode_count`, `metrics.json`, a JSONL prediction file,
`RUN_REPORT.md`, training metadata, progress logs, and explicit forbidden
artifact categories for raw data, model weights, checkpoints, and archives.
## Qwen3-Omni Contract
Qwen3-Omni consumes:
- rendered multi-camera mosaic video,
- extracted MP4 audio,
- language prompt and label options,
- optional sensor-bridge summaries/features.
It predicts strict JSON:
```json
{
  "action": "string",
  "subtask": "string",
  "objects": ["string"],
  "contact": "string",
  "transition": "string",
  "next_action": "string",
  "evidence_window": {"start_frame": 0, "end_frame": 0}
}
```
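
One way to check a prediction against the schema above is sketched here. It assumes that "strict" means exactly these seven keys with the shown types and that `start_frame` must not exceed `end_frame`. Both are assumptions, and `validate_prediction` is an illustrative name, not the evaluator's own parser.

```python
import json

EXPECTED_TYPES = {
    "action": str,
    "subtask": str,
    "objects": list,
    "contact": str,
    "transition": str,
    "next_action": str,
    "evidence_window": dict,
}


def validate_prediction(raw_text):
    """Parse model output and return a list of schema problems (empty if valid)."""
    try:
        record = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in record:
            problems.append(f"missing key: {key}")
        elif not isinstance(record[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    # Assumption: strict output forbids keys outside the schema.
    extra = sorted(set(record) - set(EXPECTED_TYPES))
    problems.extend(f"unexpected key: {key}" for key in extra)
    if isinstance(record.get("objects"), list):
        if not all(isinstance(item, str) for item in record["objects"]):
            problems.append("objects must contain only strings")
    window = record.get("evidence_window")
    if isinstance(window, dict):
        start, end = window.get("start_frame"), window.get("end_frame")
        if not (isinstance(start, int) and isinstance(end, int)):
            problems.append("evidence_window frames must be integers")
        elif start > end:
            problems.append("evidence_window start_frame exceeds end_frame")
    return problems
```

Returning a list of problems instead of raising makes it easy to log every malformed prediction without aborting a long evaluation.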
Implemented entrypoints:
- `scripts/omni/parallel_export_qwen3_omni_action_dataset.py`
- `scripts/omni/train_qwen3_omni_lora.py`
- `scripts/omni/eval_qwen3_omni_lora.py`
- `scripts/omni/watch_omni_train_then_eval.py`
- `scripts/omni/run_128_fullsplit_parallel_export_8gpu.sh`
The watcher is the current post-training gate runner. For the Qwen3-Omni LoRA
branch it waits for `progress.jsonl` to end in `complete`, checks the PEFT LoRA
safetensors shapes, runs the training validator, runs a held-out eval smoke,
then runs the full held-out test evaluation.
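
The first step of that sequence, waiting for training to finish, could look roughly like the sketch below. The record layout of `progress.jsonl` is not specified here, so the `status` field name is an assumption, and `training_is_complete` is a hypothetical helper, not part of the watcher script.

```python
import json
from pathlib import Path


def training_is_complete(progress_path, status_key="status"):
    """Return True when the last non-empty JSONL record reports 'complete'."""
    path = Path(progress_path)
    if not path.exists():
        return False
    lines = [line for line in path.read_text(encoding="utf-8").splitlines() if line.strip()]
    if not lines:
        return False
    try:
        last = json.loads(lines[-1])
    except json.JSONDecodeError:
        return False  # a partially written final line counts as "not done yet"
    # Assumption: completion is signalled by a field such as {"status": "complete"}.
    return isinstance(last, dict) and last.get(status_key) == "complete"
```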
The Qwen evaluator writes partial predictions during inference and finalizes the
same `predictions.jsonl`, `predictions.csv`, `metrics.json`,
`confusion_matrix.csv`, and `RUN_REPORT.md` files after all selected held-out
windows finish. A restarted eval can resume from the partial prediction file.
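
Resuming from a partial prediction file can be sketched as follows, assuming each JSONL line is one prediction object that carries a unique sample identifier. The `sample_id` field name and both helper names are assumptions made for illustration.

```python
import json
from pathlib import Path


def completed_sample_ids(partial_path, id_key="sample_id"):
    """Collect ids already present in a partial prediction JSONL file."""
    done = set()
    path = Path(partial_path)
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # ignore a line truncated by an interrupted write
            if isinstance(record, dict) and id_key in record:
                done.add(record[id_key])
    return done


def remaining_samples(all_sample_ids, partial_path):
    """Keep the original order and drop samples that already have predictions."""
    done = completed_sample_ids(partial_path)
    return [sample_id for sample_id in all_sample_ids if sample_id not in done]
```

Because only the set of pending samples changes, the final metrics are still computed over the full prediction file, which keeps the metric definitions unchanged across restarts.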
For faster held-out evaluation, the Qwen evaluator can also run deterministic
sample shards via `--sample-offset` and `--sample-stride`. Sharded outputs must
be merged with `scripts/omni/merge_qwen3_omni_eval_shards.py`, which recomputes
the final metrics from combined predictions and checks missing or duplicate
sample ids.
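
The sharding and merge checks can be illustrated with a short sketch. It assumes that `--sample-offset` and `--sample-stride` select every stride-th sample starting at the offset, as a Python slice would, and that predictions carry a `sample_id` field. Both points are assumptions, and the functions below are not the merge script's implementation.

```python
from collections import Counter


def shard_samples(sample_ids, offset, stride):
    """Select every `stride`-th sample starting at `offset` (assumed semantics)."""
    if stride < 1 or not 0 <= offset < stride:
        raise ValueError("need stride >= 1 and 0 <= offset < stride")
    return sample_ids[offset::stride]


def check_merged_ids(expected_ids, shard_predictions, id_key="sample_id"):
    """Report sample ids that are missing or duplicated across merged shards."""
    counts = Counter(record[id_key] for shard in shard_predictions for record in shard)
    missing = [sid for sid in expected_ids if counts[sid] == 0]
    duplicates = sorted(sid for sid, n in counts.items() if n > 1)
    return missing, duplicates


# Four shards over ten hypothetical sample ids.
all_ids = [f"s{i}" for i in range(10)]
shards = [
    [{"sample_id": sid} for sid in shard_samples(all_ids, offset, 4)]
    for offset in range(4)
]
missing, duplicates = check_merged_ids(all_ids, shards)
```

With offsets `0..stride-1` the shards partition the sample list, so a clean merge should report no missing and no duplicate ids.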
Future model families can reuse the same wait/eval sequence only if their
checkpoint artifact has a compatible gate. Otherwise they should provide a
model-specific checkpoint check and evaluator, while keeping the same episode
split and held-out reporting discipline.
## Cosmos-Style World Model Contract
Cosmos-style work should not reuse the JSON QA exporter as-is. It needs a
future-window exporter with samples shaped like:
```json
{
  "episode_id": "session__ep",
  "split": "train",
  "context_window": {"start_frame": 0, "end_frame": 119},
  "target_window": {"start_frame": 120, "end_frame": 179},
  "conditioning": {
    "video": "path-or-latent",
    "audio": "path-or-features",
    "pose": "feature path",
    "depth": "feature path",
    "mocap": "feature path",
    "imu": "feature path",
    "language": "task context"
  },
  "target": {
    "future_video": "path-or-latent",
    "future_sensor_features": "path",
    "transition": "label"
  }
}
```
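
The context/target boundaries in the sample above can be generated with a sliding window, sketched below. The 120-frame context and 60-frame target lengths are read off the example's inclusive bounds; the step size and the function name `future_window_pairs` are assumptions.

```python
def future_window_pairs(num_frames, context_len=120, target_len=60, step=60):
    """Return (context_window, target_window) pairs with inclusive frame bounds."""
    pairs = []
    start = 0
    # Only emit a pair when both windows fit entirely inside the episode.
    while start + context_len + target_len <= num_frames:
        context_end = start + context_len - 1
        target_start = context_end + 1  # target begins right after the context
        pairs.append((
            {"start_frame": start, "end_frame": context_end},
            {"start_frame": target_start, "end_frame": target_start + target_len - 1},
        ))
        start += step
    return pairs


pairs = future_window_pairs(num_frames=240)
```

All pairs from one episode should inherit that episode's split, so overlapping windows never straddle train and test.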
Minimum evaluators:
- future retrieval MRR / recall@5,
- temporal consistency,
- feature reconstruction error,
- transition/contact prediction,
- qualitative generated or retrieved examples.
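
For the retrieval metrics in the list above, a minimal sketch follows. It assumes each query has exactly one correct future window and that candidates arrive as a ranked list of ids, best first. The function name and input layout are illustrative.

```python
def retrieval_metrics(ranked_candidates, true_ids, k=5):
    """Compute MRR and recall@k from per-query ranked candidate id lists."""
    if not true_ids or len(ranked_candidates) != len(true_ids):
        raise ValueError("need at least one query and one ranking per query")
    reciprocal_ranks = []
    hits = 0
    for ranking, true_id in zip(ranked_candidates, true_ids):
        if true_id in ranking:
            rank = ranking.index(true_id) + 1  # ranks are 1-based
            reciprocal_ranks.append(1.0 / rank)
            hits += rank <= k
        else:
            reciprocal_ranks.append(0.0)  # target not retrieved at all
    n = len(true_ids)
    return {"mrr": sum(reciprocal_ranks) / n, f"recall@{k}": hits / n}


# Two hypothetical queries: the first target is ranked 1st, the second 3rd.
metrics = retrieval_metrics([["a", "b"], ["x", "y", "z"]], ["a", "z"])
```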
Cosmos-style checkpoints are not LoRA adapters by default. Their post-training
gate should verify generated latent/video checkpoints, model config, scheduler
state, and future-window evaluator outputs instead of using the Qwen LoRA
safetensors check.
## VLA / Policy Contract
Policy branches need an explicit action target before training. A valid sample
must state whether the target is an action class, next action, hand trajectory,
contact event, retargeted humanoid action, or robot-compatible action token.
The first policy exporter should save:
- observation media/features,
- language instruction or task context,
- action target,
- action normalization metadata fit on train episodes only,
- target provenance from the original annotation/mocap/contact fields.
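
Fitting normalization metadata on train episodes only might look like the sketch below. It assumes actions are fixed-length numeric vectors and uses a simple per-dimension mean and standard deviation. The scheme, the field names in the returned dict, and the function names are all assumptions.

```python
import math


def fit_action_normalizer(train_actions):
    """Fit per-dimension mean/std from train-episode action vectors only."""
    if not train_actions:
        raise ValueError("no train actions to fit on")
    dims = len(train_actions[0])
    n = len(train_actions)
    mean = [sum(a[d] for a in train_actions) / n for d in range(dims)]
    std = []
    for d in range(dims):
        variance = sum((a[d] - mean[d]) ** 2 for a in train_actions) / n
        std.append(max(math.sqrt(variance), 1e-6))  # floor avoids division by zero
    # Recording the fit split makes the provenance auditable later.
    return {"mean": mean, "std": std, "fit_split": "train", "num_samples": n}


def normalize_action(action, stats):
    """Apply train-fit statistics to an action from any split."""
    return [(x - m) / s for x, m, s in zip(action, stats["mean"], stats["std"])]


stats = fit_action_normalizer([[0.0, 2.0], [2.0, 2.0]])
normalized = normalize_action([2.0, 2.0], stats)
```

Validation and test actions are transformed with the same `stats`; refitting on them would leak held-out information into the policy target.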
Minimum evaluators:
- action or next-action accuracy,
- contact accuracy,
- trajectory MPJPE when trajectories are used,
- object-affordance F1,
- held-out episode count and leakage check.
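
Trajectory MPJPE (mean per-joint position error) is the Euclidean distance between predicted and ground-truth joint positions, averaged over all joints and frames. A small sketch, assuming trajectories are nested lists of 3D joint coordinates in matching units:

```python
import math


def mpjpe(predicted, target):
    """Mean per-joint position error over frames and joints.

    Both inputs are lists of frames; each frame is a list of (x, y, z) joints.
    """
    if len(predicted) != len(target):
        raise ValueError("frame counts differ")
    total, count = 0.0, 0
    for pred_frame, true_frame in zip(predicted, target):
        if len(pred_frame) != len(true_frame):
            raise ValueError("joint counts differ")
        for pred_joint, true_joint in zip(pred_frame, true_frame):
            total += math.dist(pred_joint, true_joint)  # Euclidean distance
            count += 1
    if count == 0:
        raise ValueError("no joints to score")
    return total / count


# One frame, two joints, with errors of 3 and 4 units.
error = mpjpe([[(0, 0, 0), (1, 0, 0)]], [[(0, 0, 3), (1, 4, 0)]])
```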
Policy checkpoints should additionally save the action-space definition,
normalization statistics, and retargeting/conversion metadata. Any fitted
values among these must be computed from train episodes only and validated
before any held-out policy metrics are reported.
## Non-Negotiable Invariants
- Do not train on held-out test episodes.
- Do not report model quality without predictions and metrics from held-out
episodes.
- Do not redistribute raw gated MP4, HDF5, RRD, full checkpoint, or full model
weight files.
- Do not treat a smoke run or one-episode overfit run as a real held-out model
result.
- Record skipped episodes with reasons instead of silently dropping them.
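
The last invariant can be enforced at manifest time. The sketch below assumes each manifest entry is a dict with an `episode_id` and a list of available `modalities`. Those field names, the reason string format, and `partition_episodes` itself are hypothetical.

```python
def partition_episodes(episodes, required_modalities):
    """Split manifest entries into usable and skipped, keeping a reason for each skip."""
    kept, skipped = [], []
    for episode in episodes:
        missing = sorted(set(required_modalities) - set(episode.get("modalities", [])))
        if missing:
            skipped.append({
                "episode_id": episode["episode_id"],
                "reason": "missing modality: " + ", ".join(missing),
            })
        else:
            kept.append(episode)
    return kept, skipped


# Hypothetical manifest entries.
manifest = [
    {"episode_id": "session_a__ep0", "modalities": ["video", "audio", "imu"]},
    {"episode_id": "session_b__ep0", "modalities": ["video"]},
]
kept, skipped = partition_episodes(manifest, ["video", "audio"])
```

Writing `skipped` into the run report keeps the held-out episode count auditable even when some episodes cannot be exported.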