# Xperience-10M 128-Episode Relay and Fine-Tune Plan

This is the executable plan for moving from metadata selection to real
multi-episode training. No model-quality results are claimed until the data is
downloaded, staged, and audited, and a model is trained and evaluated on
held-out sessions.

## Current Preflight

| Host | Role | Status |
| --- | --- | --- |
| HF-reachable relay host | Dataset download relay | Needs Hugging Face access and enough scratch storage for one batch |
| Training host | Persistent data + training | Needs enough storage for the staged selection and the training/eval stack |

Conclusion: use a machine that can reach Hugging Face as the download relay and
the training machine as the persistent data store. The current transfer path
uses chunked parallel file transfer and overlapping batch prefetch, so the relay
can transfer the current batch while later batches are still downloading.
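
The overlap between transfer and prefetch can be pictured with a minimal sketch. This is not the relay script's implementation; `download_batch`, `transfer_batch`, and `relay` are hypothetical stand-ins that only show the scheduling pattern of downloading batch i+1 while batch i is being transferred.

```python
from concurrent.futures import ThreadPoolExecutor


def download_batch(batch_id):
    # Hypothetical placeholder for downloading one batch from Hugging Face.
    return f"batch-{batch_id}-staged"


def transfer_batch(staged):
    # Hypothetical placeholder for the chunked transfer to the training host.
    return staged.replace("staged", "transferred")


def relay(batch_ids):
    """Transfer batch i while batch i+1 downloads in the background."""
    done = []
    if not batch_ids:
        return done
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(download_batch, batch_ids[0])
        for next_id in batch_ids[1:] + [None]:
            staged = pending.result()  # wait for the current batch to finish downloading
            if next_id is not None:
                # Start the next download before transferring the current batch.
                pending = prefetcher.submit(download_batch, next_id)
            done.append(transfer_batch(staged))  # runs while the prefetch is in flight
    return done


print(relay([0, 1, 2]))
```

A single prefetch worker keeps at most one extra batch on the relay's scratch disk, which matches the preflight requirement of storage for roughly one batch plus the one in flight.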

Current execution status:

- a 128-episode relay job has been launched on an HF-reachable host,
- chunked parallel transfer is active for staged files,
- overlapping batch prefetch is active for later batches,
- no multi-episode model-quality training result is claimed yet.

## Selected Data

- Selection file: `results/omni_finetune/xperience10m_128_episode_selection.json`
- Download list: `results/omni_finetune/xperience10m_128_episode_download_files.txt`
- Episodes: 128
- Sessions: 128 unique sessions
- Split: 96 train / 16 val / 16 test
- Files: 896 training files
- Excluded: `visualization.rrd`
- Estimated training-host storage: 277.71 GiB excluding RRD
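
The figures in this list can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the variable names are illustrative, and the per-episode values are averages derived from the totals, not measured sizes.

```python
# Numbers copied from the selection summary above.
episodes = 128
split = {"train": 96, "val": 16, "test": 16}
total_gib = 277.71  # estimated storage, excluding visualization.rrd
files = 896

# The split must account for every selected episode.
assert sum(split.values()) == episodes

# Derived averages only; real episodes vary in size and file count.
avg_gib_per_episode = total_gib / episodes
avg_files_per_episode = files / episodes
fractions = {name: count / episodes for name, count in split.items()}

print(f"average episode size: {avg_gib_per_episode:.2f} GiB")
print(f"average files per episode: {avg_files_per_episode:.0f}")
print(f"split fractions: {fractions}")
```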

## Relay Setup

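Define host-specific paths outside the public artifact. The commands below expand these variables both in the local shell and on the relay, so export them in both places: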

```bash
export RELAY_WORKDIR=/path/to/ropedia-episode-task-suite
export RELAY_ROOT=/path/to/xperience10m_relay
export TRAINING_HOST=<training-user>@<training-host>
export TRAINING_REPO=/path/to/ropedia-episode-task-suite
export TRAINING_DATA_ROOT=/path/to/xperience10m_128
```

Create a dedicated relay-to-training SSH key:

```bash
ssh <relay-host> 'mkdir -p ~/.ssh && chmod 700 ~/.ssh && test -f ~/.ssh/xperience10m_relay_ed25519 || ssh-keygen -t ed25519 -N "" -f ~/.ssh/xperience10m_relay_ed25519 -C xperience10m-relay-to-training'
ssh <relay-host> 'cat ~/.ssh/xperience10m_relay_ed25519.pub'
```

Append that public key to `~/.ssh/authorized_keys` on the training host, then verify the connection from the relay:

```bash
ssh <relay-host> 'ssh -i ~/.ssh/xperience10m_relay_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=accept-new <training-user>@<training-host> hostname'
```

## Copy Minimal Repo Files to Relay

```bash
# Expand $RELAY_WORKDIR locally so it matches the rsync destination below.
ssh <relay-host> "mkdir -p '$RELAY_WORKDIR'"
rsync -av \
  scripts/omni/relay_xperience10m_selection.py \
  scripts/omni/parallel_chunk_transfer.py \
  results/omni_finetune/xperience10m_128_episode_selection.json \
  <relay-host>:"$RELAY_WORKDIR"/
```

## Relay Dry Run

```bash
# Double quotes expand the exported variables locally; the inner single quotes
# protect the expanded values in the relay's shell. ~ is expanded on the relay.
ssh <relay-host> "
cd '$RELAY_WORKDIR' &&
python3 relay_xperience10m_selection.py \
  --selection-json xperience10m_128_episode_selection.json \
  --relay-root '$RELAY_ROOT' \
  --batch-max-gib 40 \
  --batch-max-episodes 16 \
  --transfer-host '$TRAINING_HOST' \
  --transfer-root '$TRAINING_DATA_ROOT' \
  --ssh-key ~/.ssh/xperience10m_relay_ed25519 \
  --transfer-mode chunked \
  --chunk-parallel 8 \
  --chunk-size-mib 8 \
  --chunk-threshold-mib 8 \
  --delete-after-transfer \
  --dry-run
"
```
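
To make the chunking flags concrete, here is a minimal sketch of how a file might be cut into byte ranges. It rests on an assumption about flag semantics that the plan does not spell out: files at or below `--chunk-threshold-mib` are sent whole, and larger files are split into `--chunk-size-mib` pieces. `chunk_ranges` is a hypothetical helper, not a function from `parallel_chunk_transfer.py`.

```python
MIB = 1024 * 1024


def chunk_ranges(file_size, chunk_size_mib=8, threshold_mib=8):
    """Return (offset, length) byte ranges for one file.

    Assumption: files at or below the threshold are sent whole; larger
    files are cut into fixed-size chunks with a shorter final chunk.
    """
    if file_size <= threshold_mib * MIB:
        return [(0, file_size)]
    chunk = chunk_size_mib * MIB
    return [
        (start, min(chunk, file_size - start))
        for start in range(0, file_size, chunk)
    ]


# A 20 MiB file yields two full 8 MiB chunks and one 4 MiB remainder.
ranges = chunk_ranges(20 * MIB)
print(len(ranges), ranges[-1])
```

Under this reading, `--chunk-parallel 8` would bound how many of these ranges are in flight at once.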

## Start Relay

Run in a persistent terminal or `tmux` session on the relay:

```bash
export HF_TOKEN=...
cd "$RELAY_WORKDIR"
python3 relay_xperience10m_selection.py \
  --selection-json xperience10m_128_episode_selection.json \
  --relay-root "$RELAY_ROOT" \
  --batch-max-gib 40 \
  --batch-max-episodes 16 \
  --transfer-host "$TRAINING_HOST" \
  --transfer-root "$TRAINING_DATA_ROOT" \
  --ssh-key ~/.ssh/xperience10m_relay_ed25519 \
  --transfer-mode chunked \
  --chunk-parallel 8 \
  --chunk-size-mib 8 \
  --chunk-threshold-mib 8 \
  --delete-after-transfer
```

Batch sizing is intentionally conservative. A 40 GiB batch size keeps restarts
and partial-transfer cleanup cheaper than treating the full 277.71 GiB selection
as one unit. Future batches can be downloaded in a separate prefetch-only relay
process after disk headroom is checked.
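
As a rough illustration of how the two caps interact, the sketch below packs episodes greedily under both limits. `plan_batches` is a hypothetical helper, not the relay script's batching logic, and the uniform episode size is an assumption taken from the average of 277.71 GiB over 128 episodes.

```python
def plan_batches(episode_sizes_gib, max_gib=40.0, max_episodes=16):
    """Greedily pack episodes, in order, into batches under both caps."""
    batches, current, current_gib = [], [], 0.0
    for index, size in enumerate(episode_sizes_gib):
        over_size = current_gib + size > max_gib
        over_count = len(current) >= max_episodes
        # Close the current batch when adding this episode would break a cap.
        if current and (over_size or over_count):
            batches.append(current)
            current, current_gib = [], 0.0
        current.append(index)
        current_gib += size
    if current:
        batches.append(current)
    return batches


# Assumption: every episode has the average size (277.71 GiB / 128 episodes).
sizes = [277.71 / 128] * 128
batches = plan_batches(sizes)
print(len(batches), "batches;", max(len(b) for b in batches), "episodes max")
```

With average-sized episodes, 16 episodes come to about 34.7 GiB, so the episode cap is reached before the 40 GiB cap and the selection splits into 8 batches. Larger-than-average episodes would make the size cap bind instead.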

## Training-Host Data Validation

After transfer completes:

```bash
cd "$TRAINING_REPO"
python3 scripts/omni/discover_xperience10m_sources.py \
  --workspace "$TRAINING_REPO" \
  --data-root "$TRAINING_DATA_ROOT" \
  --output results/omni_finetune/source_discovery_128.json \
  --report-output results/omni_finetune/DATA_BLOCKER_REPORT_128.md \
  --target-episodes 128 \
  --skip-modelscope \
  --skip-huggingface
```

Then build the episode manifest:

```bash
python3 scripts/omni/build_episode_manifest.py \
  --workspace "$TRAINING_REPO" \
  --data-root "$TRAINING_DATA_ROOT" \
  --max-episodes 128 \
  --train-fraction 0.75 \
  --val-fraction 0.125 \
  --test-fraction 0.125 \
  --split-seed 7 \
  --output results/omni_finetune/episode_manifest_128.json
```
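
For intuition about what these fractions and the seed imply, here is a minimal sketch of a seeded session-level split. It is not necessarily how `build_episode_manifest.py` assigns splits; `split_sessions` and the session IDs are hypothetical.

```python
import random


def split_sessions(session_ids, train=0.75, val=0.125, seed=7):
    """Shuffle unique sessions with a fixed seed, then cut into three splits."""
    sessions = sorted(set(session_ids))
    random.Random(seed).shuffle(sessions)
    n_train = int(len(sessions) * train)
    n_val = int(len(sessions) * val)
    return {
        "train": sessions[:n_train],
        "val": sessions[n_train:n_train + n_val],
        "test": sessions[n_train + n_val:],  # remainder goes to test
    }


# Hypothetical session IDs; the real ones come from the staged data.
splits = split_sessions([f"session_{i:03d}" for i in range(128)])
print({name: len(ids) for name, ids in splits.items()})
```

Splitting on sessions rather than on windows is what keeps held-out sessions fully unseen during training. With 128 unique sessions, these fractions give 96 / 16 / 16.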

## Content Rebalance Gate

Parse staged annotations before training:

```bash
python3 scripts/omni/audit_staged_xperience10m_content.py \
  --data-root "$TRAINING_DATA_ROOT" \
  --selection-json results/omni_finetune/xperience10m_128_episode_selection.json \
  --output-json results/omni_finetune/staged_content_audit_128.json \
  --output-csv results/omni_finetune/staged_content_audit_128.csv \
  --report-output results/omni_finetune/STAGED_CONTENT_AUDIT_128.md
```

If a category dominates train, val, or test, swap episodes before training.
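
One way to make "dominates" checkable is to compute each category's share per split. The sketch below is illustrative only: the 0.5 threshold, the category labels, and `dominant_categories` are assumptions, not values or functions from the audit script.

```python
from collections import Counter


def dominant_categories(episode_categories, max_share=0.5):
    """Return {category: share} for categories above max_share in one split."""
    counts = Counter(episode_categories)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items() if n / total > max_share}


# Hypothetical labels for a 16-episode validation split.
val_labels = ["cooking"] * 10 + ["cleaning"] * 4 + ["assembly"] * 2
print(dominant_categories(val_labels))
```

Running the check separately on train, val, and test shows which split needs an episode swap.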

## Training Order

### 1. Qwen3-Omni LoRA Baseline

Use this as the first real multi-episode SFT run because the repo already has
working Qwen3-Omni training/eval scripts.

Expected dataset:

- 128 episodes
- 32,768 max windows at 256 windows per episode
- held-out sessions in val/test
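
The window budget follows directly from the split. This small sketch assumes the 256-window cap applies uniformly to every episode in every split; the names are illustrative.

```python
WINDOWS_PER_EPISODE = 256  # per-episode cap from the list above
split = {"train": 96, "val": 16, "test": 16}  # episode counts from the selection

# Upper bounds only; episodes with fewer usable windows lower these numbers.
max_windows = {name: n * WINDOWS_PER_EPISODE for name, n in split.items()}
total = sum(max_windows.values())
print(max_windows, total)
```

Of the 32,768 maximum windows, at most 24,576 are available for training; the rest belong to held-out sessions.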

### 2. Cosmos3-Nano Compatibility

Cosmos3-Nano should be treated as a second branch:

- first run inference compatibility on a few staged clips,
- then adapt data format for Cosmos video/action tasks,
- then run post-training only after the Qwen3-Omni baseline and the content audit pass.

Good Cosmos tasks:

- video + text -> physical reasoning text,
- video + text -> future state/action label,
- video + action/text -> future video,
- video + text -> action trajectory proxy.

Do not start with Cosmos3-Super. Cosmos3-Nano is the practical first target;
Super is for a later run after data format, metrics, and compute are stable.

## Acceptance Gates

- 128 selected episodes staged on the training host.
- No `visualization.rrd` in training data.
- 128 unique sessions preserved.
- Train/val/test session leakage is zero.
- Content audit reviewed before training.
- Qwen3-Omni eval runs on held-out sessions.
- Cosmos3-Nano branch starts with compatibility, not immediate full fine-tune.
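
Several of these gates can be checked mechanically. The sketch below assumes a manifest shaped as `{split: [episode, ...]}` where each episode has `session` and `files` keys; that layout, the file names, and `check_gates` are hypothetical and may not match `episode_manifest_128.json`.

```python
def check_gates(manifest, expected_episodes=128):
    """Return a list of gate violations for a {split: [episode, ...]} manifest.

    Assumption: each episode is a dict with "session" and "files" keys.
    """
    problems = []
    episodes = [ep for eps in manifest.values() for ep in eps]
    if len(episodes) != expected_episodes:
        problems.append(f"expected {expected_episodes} episodes, found {len(episodes)}")
    sessions = [ep["session"] for ep in episodes]
    if len(set(sessions)) != len(sessions):
        problems.append("sessions are not unique")
    seen = {}  # session -> first split it appeared in
    for split, eps in manifest.items():
        for ep in eps:
            other = seen.setdefault(ep["session"], split)
            if other != split:
                problems.append(f"session {ep['session']} leaks across {other}/{split}")
            if any(name.endswith("visualization.rrd") for name in ep["files"]):
                problems.append(f"visualization.rrd present in {ep['session']}")
    return problems


# Hypothetical two-episode manifest with one leak and one stray RRD file.
demo = {
    "train": [{"session": "s1", "files": ["video.mp4"]}],
    "val": [{"session": "s1", "files": ["video.mp4", "visualization.rrd"]}],
}
print(check_gates(demo, expected_episodes=2))
```

An empty list means the episode count, session uniqueness, leakage, and RRD gates pass; the content audit review and the eval runs still need a human sign-off.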