# Xperience-10M 128-Episode Relay and Fine-Tune Plan
This is the executable plan for moving from metadata selection to real
multi-episode training. It does not claim model-quality results until data is
downloaded, staged, audited, trained, and evaluated on held-out sessions.
## Current Preflight
| Host | Role | Status |
| --- | --- | --- |
| HF-reachable relay host | Dataset download relay | Needs Hugging Face access and enough scratch storage for the batch in transfer plus any prefetched batch |
| Training host | Persistent data + training | Needs enough storage for the staged selection and the training/eval stack |
Conclusion: use a Hugging Face-reachable machine as the download relay and the
training machine as the persistent data store. The current transfer path uses
chunked parallel file transfer and overlapping batch prefetch, so the relay can
send the current batch to the training host while it downloads later batches.
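The overlap described above can be sketched as a two-stage pipeline. This is a minimal illustration, not the logic of `relay_xperience10m_selection.py`: `run_relay`, `download`, and `transfer` are hypothetical names, and the real relay may schedule its work differently.

```python
from concurrent.futures import ThreadPoolExecutor


def run_relay(batches, download, transfer):
    """Transfer batch i while batch i+1 downloads in a background thread.

    `download(batch)` and `transfer(staged)` are hypothetical callables
    supplied by the caller; `batches` is an ordered list.
    """
    if not batches:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(download, batches[0])
        for next_batch in batches[1:] + [None]:
            staged = pending.result()  # wait for the current download
            if next_batch is not None:
                # Start prefetching the next batch before transferring.
                pending = pool.submit(download, next_batch)
            transfer(staged)  # runs while the prefetch is in flight
```

With this shape the relay holds at most two batches on disk at once (one in transfer, one downloading), which is why scratch headroom needs checking before prefetch is enabled.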
Current execution status:
- a 128-episode relay job has been launched on an HF-reachable host,
- chunked parallel transfer is active for staged files,
- overlapping batch prefetch is active for later batches,
- no multi-episode model-quality training result is claimed yet.
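As a rough picture of what chunked transfer means, the sketch below cuts one file into byte ranges that parallel workers could send independently. It assumes that files at or below a threshold are sent whole and larger files are split into fixed-size chunks; `plan_chunks` is a hypothetical helper, and the actual behavior of `parallel_chunk_transfer.py` may differ.

```python
MIB = 1024 * 1024


def plan_chunks(file_size, chunk_size=8 * MIB, threshold=8 * MIB):
    """Return (offset, length) byte ranges for one file.

    Assumption: files at or below `threshold` are sent whole; larger files
    are cut into `chunk_size` ranges, with a shorter final range.
    """
    if file_size <= threshold:
        return [(0, file_size)]
    return [
        (offset, min(chunk_size, file_size - offset))
        for offset in range(0, file_size, chunk_size)
    ]


# A 20 MiB file becomes two 8 MiB ranges and one 4 MiB range.
chunks = plan_chunks(20 * MIB)
```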
## Selected Data
- Selection file: `results/omni_finetune/xperience10m_128_episode_selection.json`
- Download list: `results/omni_finetune/xperience10m_128_episode_download_files.txt`
- Episodes: 128
- Sessions: 128 unique sessions
- Split: 96 train / 16 val / 16 test
- Files: 896 training files
- Excluded: `visualization.rrd`
- Estimated training-host storage: 277.71 GiB excluding RRD
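The figures above are mutually consistent, which a few lines of arithmetic can confirm. The per-episode average is derived here only for orientation; individual episode sizes are not stated in this plan.

```python
EPISODES = 128
FILES = 896
TOTAL_GIB = 277.71
SPLIT = {"train": 96, "val": 16, "test": 16}

# The split must account for every selected episode.
assert sum(SPLIT.values()) == EPISODES

files_per_episode = FILES / EPISODES        # 7 files per episode
avg_gib_per_episode = TOTAL_GIB / EPISODES  # roughly 2.17 GiB on average
fractions = {name: n / EPISODES for name, n in SPLIT.items()}

print(files_per_episode, round(avg_gib_per_episode, 2), fractions)
```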
## Relay Setup
Define host-specific paths outside the public artifact:
```bash
export RELAY_WORKDIR=/path/to/ropedia-episode-task-suite
export RELAY_ROOT=/path/to/xperience10m_relay
export TRAINING_HOST=<training-user>@<training-host>
export TRAINING_REPO=/path/to/ropedia-episode-task-suite
export TRAINING_DATA_ROOT=/path/to/xperience10m_128
```
Create a dedicated relay-to-training SSH key:
```bash
ssh <relay-host> 'mkdir -p ~/.ssh && chmod 700 ~/.ssh && test -f ~/.ssh/xperience10m_relay_ed25519 || ssh-keygen -t ed25519 -N "" -f ~/.ssh/xperience10m_relay_ed25519 -C xperience10m-relay-to-training'
ssh <relay-host> 'cat ~/.ssh/xperience10m_relay_ed25519.pub'
```
Append that public key to the training host's `~/.ssh/authorized_keys`, then verify from the relay:
```bash
ssh <relay-host> 'ssh -i ~/.ssh/xperience10m_relay_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=accept-new <training-user>@<training-host> hostname'
```
## Copy Minimal Repo Files to Relay
```bash
# Double quotes expand $RELAY_WORKDIR locally, matching the rsync destination below.
ssh <relay-host> "mkdir -p '$RELAY_WORKDIR'"
rsync -av \
scripts/omni/relay_xperience10m_selection.py \
scripts/omni/parallel_chunk_transfer.py \
results/omni_finetune/xperience10m_128_episode_selection.json \
<relay-host>:"$RELAY_WORKDIR"/
```
## Relay Dry Run
```bash
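# The single-quoted command expands variables on the relay, so the Relay Setup
# exports must be visible to the relay's non-interactive shell.
ssh <relay-host> '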
cd "$RELAY_WORKDIR" &&
python3 relay_xperience10m_selection.py \
--selection-json xperience10m_128_episode_selection.json \
--relay-root "$RELAY_ROOT" \
--batch-max-gib 40 \
--batch-max-episodes 16 \
--transfer-host "$TRAINING_HOST" \
--transfer-root "$TRAINING_DATA_ROOT" \
--ssh-key ~/.ssh/xperience10m_relay_ed25519 \
--transfer-mode chunked \
--chunk-parallel 8 \
--chunk-size-mib 8 \
--chunk-threshold-mib 8 \
--delete-after-transfer \
--dry-run
'
```
## Start Relay
Run in a persistent terminal or `tmux` session on the relay:
```bash
export HF_TOKEN=...
cd "$RELAY_WORKDIR"
python3 relay_xperience10m_selection.py \
--selection-json xperience10m_128_episode_selection.json \
--relay-root "$RELAY_ROOT" \
--batch-max-gib 40 \
--batch-max-episodes 16 \
--transfer-host "$TRAINING_HOST" \
--transfer-root "$TRAINING_DATA_ROOT" \
--ssh-key ~/.ssh/xperience10m_relay_ed25519 \
--transfer-mode chunked \
--chunk-parallel 8 \
--chunk-size-mib 8 \
--chunk-threshold-mib 8 \
--delete-after-transfer
```
Batch sizing is intentionally conservative. A 40 GiB batch size keeps restarts
and partial-transfer cleanup cheaper than treating the full 277.71 GiB selection
as one unit. Future batches can be downloaded in a separate prefetch-only relay
process after disk headroom is checked.
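To see how the two caps interact, here is a minimal greedy packing sketch. It assumes a batch closes when adding the next episode would exceed either `--batch-max-gib` or `--batch-max-episodes`; `plan_batches` is a hypothetical function, the uniform episode sizes are illustrative only, and the relay script's real batching rule may differ.

```python
def plan_batches(episodes, max_gib=40.0, max_episodes=16):
    """Greedily pack (episode_id, size_gib) pairs into ordered batches."""
    batches, current, current_gib = [], [], 0.0
    for episode_id, size_gib in episodes:
        over_size = bool(current) and current_gib + size_gib > max_gib
        over_count = len(current) >= max_episodes
        if over_size or over_count:
            batches.append(current)
            current, current_gib = [], 0.0
        # An episode larger than max_gib still gets a batch of its own.
        current.append(episode_id)
        current_gib += size_gib
    if current:
        batches.append(current)
    return batches


# Illustration only: pretend every episode is the average size.
episodes = [(f"ep{i:03d}", 277.71 / 128) for i in range(128)]
batches = plan_batches(episodes)
```

Under the uniform-size assumption, 16 episodes come to about 34.7 GiB, so the episode cap binds before the size cap and the selection falls into 8 batches.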
## Training-Host Data Validation
After transfer completes:
```bash
cd "$TRAINING_REPO"
python3 scripts/omni/discover_xperience10m_sources.py \
--workspace "$TRAINING_REPO" \
--data-root "$TRAINING_DATA_ROOT" \
--output results/omni_finetune/source_discovery_128.json \
--report-output results/omni_finetune/DATA_BLOCKER_REPORT_128.md \
--target-episodes 128 \
--skip-modelscope \
--skip-huggingface
```
Then build the episode manifest:
```bash
python3 scripts/omni/build_episode_manifest.py \
--workspace "$TRAINING_REPO" \
--data-root "$TRAINING_DATA_ROOT" \
--max-episodes 128 \
--train-fraction 0.75 \
--val-fraction 0.125 \
--test-fraction 0.125 \
--split-seed 7 \
--output results/omni_finetune/episode_manifest_128.json
```
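For intuition about what the fractions and seed produce, the sketch below does a deterministic session-level split. It is an assumption about the general approach, not the implementation in `build_episode_manifest.py`; `split_sessions` and the synthetic session IDs are hypothetical.

```python
import random


def split_sessions(session_ids, train=0.75, val=0.125, seed=7):
    """Shuffle unique sessions with a fixed seed, then slice into splits."""
    ids = sorted(set(session_ids))  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train)
    n_val = round(len(ids) * val)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # remainder goes to test
    }


# Synthetic IDs for illustration; real IDs come from the staged data.
splits = split_sessions([f"session_{i:03d}" for i in range(128)])
```

Splitting by session rather than by window is what keeps val and test held out, and with 128 unique sessions the fractions yield the 96 / 16 / 16 split listed under Selected Data.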
## Content Rebalance Gate
Parse staged annotations before training:
```bash
python3 scripts/omni/audit_staged_xperience10m_content.py \
--data-root "$TRAINING_DATA_ROOT" \
--selection-json results/omni_finetune/xperience10m_128_episode_selection.json \
--output-json results/omni_finetune/staged_content_audit_128.json \
--output-csv results/omni_finetune/staged_content_audit_128.csv \
--report-output results/omni_finetune/STAGED_CONTENT_AUDIT_128.md
```
If one category dominates any split (train, val, or test), swap episodes to rebalance before training.
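One way to make "dominates" concrete is a per-split share check. The sketch below is illustrative: the 0.5 threshold, the category names, and `dominant_categories` are all assumptions, and the audit script's actual output format is not specified here.

```python
from collections import Counter


def dominant_categories(split_to_categories, max_share=0.5):
    """Return (split, category, share) for categories above `max_share`.

    `max_share` is a hypothetical threshold; pick one that suits the data.
    """
    flagged = []
    for split, categories in split_to_categories.items():
        counts = Counter(categories)
        total = sum(counts.values())
        for category, n in counts.items():
            if total and n / total > max_share:
                flagged.append((split, category, n / total))
    return flagged


# Made-up labels: one category per episode in each split.
example = {
    "train": ["cooking"] * 70 + ["cleaning"] * 26,
    "val": ["cooking"] * 8 + ["cleaning"] * 8,
}
flagged = dominant_categories(example)
```

In this made-up example only the train split is flagged, since "cooking" covers 70 of its 96 episodes.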
## Training Order
### 1. Qwen3-Omni LoRA Baseline
Use this as the first real multi-episode SFT run because the repo already has
working Qwen3-Omni training/eval scripts.
Expected dataset:
- 128 episodes
- 32,768 max windows at 256 windows per episode
- held-out sessions in val/test
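The window budget follows directly from the split sizes. The per-split numbers below are upper bounds derived from the figures above, assuming every episode yields the full 256 windows.

```python
WINDOWS_PER_EPISODE = 256
SPLIT = {"train": 96, "val": 16, "test": 16}

# Upper bound on windows per split if every episode yields 256 windows.
max_windows = {name: n * WINDOWS_PER_EPISODE for name, n in SPLIT.items()}
total_windows = sum(max_windows.values())

print(max_windows, total_windows)
```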
### 2. Cosmos3-Nano Compatibility
Cosmos3-Nano should be treated as a second branch:
- first run inference compatibility on a few staged clips,
- then adapt data format for Cosmos video/action tasks,
- then run post-training only after the Qwen3-Omni baseline and the content audit pass.
Good Cosmos tasks:
- video + text -> physical reasoning text,
- video + text -> future state/action label,
- video + action/text -> future video,
- video + text -> action trajectory proxy.
Do not start with Cosmos3-Super. Cosmos3-Nano is the practical first target;
Super is for a later run after data format, metrics, and compute are stable.
## Acceptance Gates
- 128 selected episodes staged on the training host.
- No `visualization.rrd` in training data.
- 128 unique sessions preserved.
- Train/val/test session leakage is zero.
- Content audit reviewed before training.
- Qwen3-Omni eval runs on held-out sessions.
- Cosmos3-Nano branch starts with compatibility, not immediate full fine-tune.
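Two of these gates are easy to check mechanically. The sketch below assumes the splits are available as a mapping from split name to session IDs and that the staged data lives under one directory tree; `session_leakage` and `find_rrd_files` are hypothetical helpers, not scripts in this repo.

```python
from pathlib import Path


def session_leakage(splits):
    """Return session IDs that appear in more than one split."""
    seen, leaked = {}, set()
    for split, sessions in splits.items():
        for session in set(sessions):
            if session in seen and seen[session] != split:
                leaked.add(session)
            seen.setdefault(session, split)
    return leaked


def find_rrd_files(data_root):
    """List any visualization.rrd files staged under data_root."""
    return sorted(Path(data_root).rglob("visualization.rrd"))
```

Both gates pass when the function returns an empty result: no leaked sessions and no staged `visualization.rrd` files.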