# Omni Model Extension Contract

This project uses one shared Xperience-10M data spine and separate backbone
adapters. Qwen3-Omni is the first implemented fine-tuning path; future
Cosmos-style world models and VLA/policy models should plug into the same
manifest, split, artifact, and evaluation discipline.

## Shared Pipeline

Every trainable branch should keep these stages:

1. **Episode selection:** choose complete Xperience-10M episodes before export.
2. **Episode split:** split by episode/session, not by adjacent windows.
3. **Manifest guard:** record every episode id, path, split, size, and missing
   modality before training.
4. **Backbone export:** convert raw windows into the model-specific sample
   format.
5. **Training:** save model config, adapter config, progress JSONL, and
   checkpoint path.
6. **Held-out evaluation:** evaluate on test episodes only after training.
7. **Run report:** write metrics, predictions, confusion matrices or
   task-specific scoring files, and skipped-episode reasons.
8. **Long-run observability:** stream `progress.jsonl` and
   `predictions.partial.jsonl` during evaluation so multi-hour held-out runs can
   be monitored and resumed without changing the final metric definitions.

The current 128-episode pilot uses a fixed `96/16/16` train/val/test split by
episode.
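The episode-level split and leakage guard can be sketched as follows. This is a minimal illustration, not the project's exporter: the function names, the seed, and the synthetic episode ids are assumptions made for the example.

```python
import random


def split_episodes(episode_ids, counts=(96, 16, 16), seed=0):
    """Assign whole episodes to train/val/test; windows inherit their episode's split."""
    ids = sorted(set(episode_ids))
    if len(ids) != sum(counts):
        raise ValueError(f"expected {sum(counts)} unique episodes, got {len(ids)}")
    # A seeded shuffle keeps the split reproducible across runs.
    random.Random(seed).shuffle(ids)
    n_train, n_val, _ = counts
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


def assert_no_leakage(split):
    """Fail if any episode id appears in more than one split."""
    seen = {}
    for name, ids in split.items():
        for episode_id in ids:
            if episode_id in seen:
                raise AssertionError(f"{episode_id} is in both {seen[episode_id]} and {name}")
            seen[episode_id] = name


# Hypothetical episode ids, used only to exercise the functions.
episodes = [f"session{i // 4:02d}__ep{i % 4}" for i in range(128)]
split = split_episodes(episodes)
assert_no_leakage(split)
print({name: len(ids) for name, ids in split.items()})
```

Because windows are never split independently, adjacent windows from one episode cannot land on both sides of the train/test boundary.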

## Backbone Registry

Backbone contracts live in:

```text
configs/omni_backbones/
```

Inspect them with:

```bash
python scripts/omni/backbone_registry.py --validate --json
```

Create a new planned backbone config from an existing contract template with:

```bash
python scripts/omni/scaffold_omni_backbone.py \
  --template-backbone policy_vla_branch \
  --id new_policy_branch \
  --display-name "New Policy Branch" \
  --model-family "Model family name" \
  --dataset-contract xperience10m_observation_action_v1 \
  --training-objective observation_to_action_policy \
  --checkpoint-gate policy_checkpoint_action_space_and_normalizer \
  --dry-run
```

Current contracts:

| Backbone | Status | Purpose |
| --- | --- | --- |
| `qwen3_omni_lora` | implemented | Structured episode-understanding JSON QA over video/audio/text plus sensor bridge features |
| `cosmos_world_model` | planned adapter | Future-window and action-conditioned world modeling |
| `policy_vla_branch` | planned adapter | Observation-to-action or motion-policy training after action-space conversion |

## Model-Neutral Window Index

The Qwen exporter produces model-ready JSONL records. To avoid tying future
branches to Qwen chat-message formatting, convert those records into a
backbone-neutral window index:

```bash
python scripts/omni/export_model_neutral_window_index.py \
  --dataset-jsonl results/omni_finetune/<run_id>_dataset/dataset.jsonl
```

This writes:

- `window_index.jsonl`
- `window_index_manifest.json`

Each neutral record keeps the same episode split and window boundaries, then
separates:

- media paths,
- sensor feature pointers,
- language context,
- JSON supervision,
- Qwen, Cosmos-style, and policy/VLA adapter views.

Future exporters should consume this neutral index when possible, then add only
the model-specific target conversion that they need.

## Artifact Contract

Every backbone config must declare an `artifact_contract` with:

- `checkpoint_gate`: the model-specific checkpoint validation rule,
- `required_training_files`: files that prove training state and configuration,
- `required_eval_files`: files that prove held-out evaluation outputs,
- `public_package_allowed`: small derived artifacts that may be published,
- `public_package_forbidden`: raw data, weights, checkpoints, or large files
  that must stay out of public packages.

`scripts/omni/backbone_registry.py --validate --json` checks that the contract
exists for the Qwen, Cosmos-style, and policy/VLA branches. The validator and
public-safe packager read `required_eval_files`, `primary_metrics`, and
publication rules from the selected backbone config. Export, training, and
evaluation code remains model-specific, but the final validation and
publication gate follows the same contract for every future branch.

The registry validation also enforces the minimum held-out evidence surface:
episode-level `train`/`val`/`test` split defaults, a leakage guard,
`held_out_episode_count`, `metrics.json`, a JSONL prediction file,
`RUN_REPORT.md`, training metadata, progress logs, and explicit forbidden
artifact categories for raw data, model weights, checkpoints, and archives.
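A presence check for the five `artifact_contract` fields might look like the sketch below. It assumes the backbone config has already been loaded into a Python dict; the function name and the example values are illustrative and are not taken from `backbone_registry.py`.

```python
REQUIRED_FIELDS = (
    "checkpoint_gate",
    "required_training_files",
    "required_eval_files",
    "public_package_allowed",
    "public_package_forbidden",
)


def check_artifact_contract(config):
    """Return a list of problems; an empty list means all five fields are declared."""
    contract = config.get("artifact_contract")
    if not isinstance(contract, dict):
        return ["artifact_contract is missing or not a mapping"]
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in contract:
            problems.append(f"missing field: {field}")
        elif not contract[field]:
            problems.append(f"empty field: {field}")
    return problems


# Hypothetical config contents, shown only to exercise the check.
example_config = {
    "id": "new_policy_branch",
    "artifact_contract": {
        "checkpoint_gate": "policy_checkpoint_action_space_and_normalizer",
        "required_training_files": ["progress.jsonl"],
        "required_eval_files": ["metrics.json", "predictions.jsonl", "RUN_REPORT.md"],
        "public_package_allowed": ["metrics.json", "RUN_REPORT.md"],
        "public_package_forbidden": ["raw data", "model weights", "checkpoints", "archives"],
    },
}
print(check_artifact_contract(example_config))
```

Returning a list of problems rather than raising on the first one lets a validator report every gap in a config at once.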

## Qwen3-Omni Contract

Qwen3-Omni consumes:

- rendered multi-camera mosaic video,
- extracted MP4 audio,
- language prompt and label options,
- optional sensor-bridge summaries/features.

It predicts strict JSON:

```json
{
  "action": "string",
  "subtask": "string",
  "objects": ["string"],
  "contact": "string",
  "transition": "string",
  "next_action": "string",
  "evidence_window": {"start_frame": 0, "end_frame": 0}
}
```
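One way to enforce the strict schema on a raw model output is sketched below. The function name is hypothetical, and the checks that no extra keys are present and that `start_frame <= end_frame` are assumptions about what "strict" means here.

```python
import json

STRING_FIELDS = ("action", "subtask", "contact", "transition", "next_action")
ALL_FIELDS = set(STRING_FIELDS) | {"objects", "evidence_window"}


def parse_prediction(text):
    """Parse one model output and raise ValueError if it deviates from the schema."""
    record = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(record, dict) or set(record) != ALL_FIELDS:
        raise ValueError("prediction must be an object with exactly the schema keys")
    for field in STRING_FIELDS:
        if not isinstance(record[field], str):
            raise ValueError(f"{field} must be a string")
    objects = record["objects"]
    if not isinstance(objects, list) or not all(isinstance(o, str) for o in objects):
        raise ValueError("objects must be a list of strings")
    window = record["evidence_window"]
    if not isinstance(window, dict) or set(window) != {"start_frame", "end_frame"}:
        raise ValueError("evidence_window must have start_frame and end_frame")
    for key in ("start_frame", "end_frame"):
        value = window[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            raise ValueError(f"evidence_window.{key} must be an integer")
    if window["start_frame"] > window["end_frame"]:
        raise ValueError("evidence_window start_frame must not exceed end_frame")
    return record


# Illustrative label values; the real label vocabulary comes from the dataset.
example = json.dumps({
    "action": "pour", "subtask": "fill cup", "objects": ["kettle", "cup"],
    "contact": "hand-kettle", "transition": "lift-to-pour", "next_action": "place",
    "evidence_window": {"start_frame": 12, "end_frame": 48},
})
record = parse_prediction(example)
print(record["action"], record["evidence_window"])
```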

Implemented entrypoints:

- `scripts/omni/parallel_export_qwen3_omni_action_dataset.py`
- `scripts/omni/train_qwen3_omni_lora.py`
- `scripts/omni/eval_qwen3_omni_lora.py`
- `scripts/omni/watch_omni_train_then_eval.py`
- `scripts/omni/run_128_fullsplit_parallel_export_8gpu.sh`

The watcher is the current post-training gate runner. For the Qwen3-Omni LoRA
branch it waits for `progress.jsonl` to end in `complete`, checks the PEFT LoRA
safetensors shapes, runs the training validator, runs a held-out eval smoke
test, and then runs the full held-out test evaluation.
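The first step of that sequence, detecting that training has finished, could be sketched like this. The record layout is an assumption: the example supposes each `progress.jsonl` line is a JSON object with a `status` field, which may not match what the trainer actually writes.

```python
import json
import os
import tempfile


def training_is_complete(progress_path):
    """Return True if the last non-empty line of the log reports completion."""
    last = None
    with open(progress_path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                last = line
    if last is None:
        return False
    try:
        record = json.loads(last)
    except json.JSONDecodeError:
        return False  # the final line may still be partially written
    # "status" is a hypothetical field name chosen for this sketch.
    return isinstance(record, dict) and record.get("status") == "complete"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "progress.jsonl")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write('{"step": 10, "status": "running"}\n')
    running = training_is_complete(path)
    with open(path, "a", encoding="utf-8") as handle:
        handle.write('{"step": 20, "status": "complete"}\n')
    finished = training_is_complete(path)
print(running, finished)
```

Treating an unparsable final line as "not complete" avoids starting evaluation while the trainer is still flushing its last record.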

The Qwen evaluator writes partial predictions during inference and finalizes the
same `predictions.jsonl`, `predictions.csv`, `metrics.json`,
`confusion_matrix.csv`, and `RUN_REPORT.md` files after all selected held-out
windows finish. A restarted eval can resume from the partial prediction file.
For faster held-out evaluation, the Qwen evaluator can also run deterministic
sample shards via `--sample-offset` and `--sample-stride`. Sharded outputs must
be merged with `scripts/omni/merge_qwen3_omni_eval_shards.py`, which recomputes
the final metrics from the combined predictions and checks for missing or
duplicate sample ids.
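The sharding and merge checks can be illustrated with a small sketch. It assumes that offset/stride sharding selects sample indices `offset, offset + stride, ...` and that each prediction record carries a `sample_id` key; neither detail is confirmed by the merge script itself.

```python
def shard_indices(num_samples, sample_offset, sample_stride):
    """Indices evaluated by one shard under the assumed offset/stride semantics."""
    if sample_stride < 1 or not 0 <= sample_offset < sample_stride:
        raise ValueError("need sample_stride >= 1 and 0 <= sample_offset < sample_stride")
    return list(range(sample_offset, num_samples, sample_stride))


def merge_shards(shards, expected_ids):
    """Combine shard predictions, rejecting missing or duplicate sample ids."""
    merged = {}
    duplicates = []
    for shard in shards:
        for record in shard:
            sample_id = record["sample_id"]
            if sample_id in merged:
                duplicates.append(sample_id)
            else:
                merged[sample_id] = record
    missing = [sample_id for sample_id in expected_ids if sample_id not in merged]
    if duplicates or missing:
        raise ValueError(f"duplicates={duplicates} missing={missing}")
    # Return records in the original sample order so metrics are shard-independent.
    return [merged[sample_id] for sample_id in expected_ids]


# Hypothetical sample ids and placeholder predictions.
expected_ids = [f"w{i}" for i in range(6)]
shards = [
    [{"sample_id": expected_ids[i], "prediction": "..."} for i in shard_indices(6, offset, 2)]
    for offset in range(2)
]
merged = merge_shards(shards, expected_ids)
print([record["sample_id"] for record in merged])
```

Final metrics should then be recomputed from the merged list rather than averaged across per-shard metric files, since shards may differ in size.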

Future model families can reuse the same wait/eval sequence only if their
checkpoint artifact has a compatible gate. Otherwise they should provide a
model-specific checkpoint check and evaluator, while keeping the same episode
split and held-out reporting discipline.

## Cosmos-Style World Model Contract

Cosmos-style work should not reuse the JSON QA exporter as-is. It needs a
future-window exporter with samples shaped like:

```json
{
  "episode_id": "session__ep",
  "split": "train",
  "context_window": {"start_frame": 0, "end_frame": 119},
  "target_window": {"start_frame": 120, "end_frame": 179},
  "conditioning": {
    "video": "path-or-latent",
    "audio": "path-or-features",
    "pose": "feature path",
    "depth": "feature path",
    "mocap": "feature path",
    "imu": "feature path",
    "language": "task context"
  },
  "target": {
    "future_video": "path-or-latent",
    "future_sensor_features": "path",
    "transition": "label"
  }
}
```

Minimum evaluators:

- future retrieval MRR / recall@5,
- temporal consistency,
- feature reconstruction error,
- transition/contact prediction,
- qualitative generated or retrieved examples.
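For the retrieval metrics in this list, a minimal sketch of MRR and recall@5 is given below. It assumes each query comes with a best-first ranked list of candidate ids and one correct target id; the function name and the toy data are illustrative.

```python
def retrieval_metrics(ranked_candidates, targets, k=5):
    """Mean reciprocal rank and recall@k over queries with one correct target each."""
    if not targets or len(ranked_candidates) != len(targets):
        raise ValueError("need one ranked list per target, and at least one query")
    reciprocal_ranks = []
    hits = 0
    for ranking, target in zip(ranked_candidates, targets):
        if target in ranking:
            rank = ranking.index(target) + 1  # ranks are 1-based
            reciprocal_ranks.append(1.0 / rank)
            hits += rank <= k
        else:
            reciprocal_ranks.append(0.0)  # target not retrieved at all
    n = len(targets)
    return {"mrr": sum(reciprocal_ranks) / n, f"recall@{k}": hits / n}


rankings = [
    ["a", "b", "c"],                  # target "a" at rank 1
    ["x", "y", "z", "w", "v", "t"],   # target "t" at rank 6
    ["p", "q"],                       # target absent from the ranking
]
metrics = retrieval_metrics(rankings, ["a", "t", "missing"])
print(metrics)
```

In this toy case the reciprocal ranks are 1, 1/6, and 0, so MRR is about 0.389, and only the first query counts as a hit for recall@5.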

Cosmos-style checkpoints are not LoRA adapters by default. Their post-training
gate should verify generated latent/video checkpoints, model config, scheduler
state, and future-window evaluator outputs instead of using the Qwen LoRA
safetensors check.

## VLA / Policy Contract

Policy branches need an explicit action target before training. A valid sample
must state whether the target is an action class, next action, hand trajectory,
contact event, retargeted humanoid action, or robot-compatible action token.

The first policy exporter should save:

- observation media/features,
- language instruction or task context,
- action target,
- action normalization metadata fit on train episodes only,
- target provenance from the original annotation/mocap/contact fields.

Minimum evaluators:

- action or next-action accuracy,
- contact accuracy,
- trajectory MPJPE when trajectories are used,
- object-affordance F1,
- held-out episode count and leakage check.

Policy checkpoints should additionally save the action-space definition,
normalization statistics, and retargeting/conversion metadata. Any fitted
quantities must be derived from train episodes only, and all of this metadata
must be validated before any held-out policy metrics are reported.
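Fitting normalization statistics on train episodes only can be sketched as follows. The sample layout (`split` and `action` keys, actions as fixed-length float vectors) and the per-dimension mean/std normalizer are assumptions for illustration; a real policy branch may use a different action representation.

```python
import statistics


def fit_action_normalizer(samples):
    """Fit per-dimension mean/std using train samples only."""
    train_actions = [s["action"] for s in samples if s["split"] == "train"]
    if not train_actions:
        raise ValueError("no train samples to fit on")
    dims = range(len(train_actions[0]))
    mean = [statistics.fmean([a[d] for a in train_actions]) for d in dims]
    # Fall back to 1.0 for constant dimensions to avoid dividing by zero.
    std = [statistics.pstdev([a[d] for a in train_actions]) or 1.0 for d in dims]
    return {
        "mean": mean,
        "std": std,
        "fit_split": "train",
        "num_train_samples": len(train_actions),
    }


def normalize(action, normalizer):
    """Apply the stored train statistics to any action, including held-out ones."""
    return [(x - m) / s for x, m, s in zip(action, normalizer["mean"], normalizer["std"])]


# Toy samples; the test action is deliberately far from the train distribution.
samples = [
    {"split": "train", "action": [0.0, 2.0]},
    {"split": "train", "action": [2.0, 2.0]},
    {"split": "test", "action": [100.0, 5.0]},
]
normalizer = fit_action_normalizer(samples)
print(normalizer)
print(normalize(samples[2]["action"], normalizer))
```

Storing `fit_split` and the sample count alongside the statistics gives a checkpoint gate something concrete to verify before held-out metrics are reported.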

## Non-Negotiable Invariants

- Do not train on held-out test episodes.
- Do not report model quality without predictions and metrics from held-out
  episodes.
- Do not redistribute raw gated MP4, HDF5, RRD, full checkpoint, or full model
  weight files.
- Do not treat a smoke run or one-episode overfit run as a real held-out model
  result.
- Record skipped episodes with reasons instead of silently dropping them.