cy0307/ropedia-xperience-10m-task-baselines

@@ -74,7 +74,7 @@ before the multi-episode omni-model stage becomes a real held-out evaluation.
 | Task suite | 12 human-readable embodied-AI task contracts with input, process, output, metrics, predictions, and case-study walkthroughs |
 | Baselines | Minimal linear/ridge/logistic heads plus compact PyTorch MLP task heads over the same chronological split; companion simple/NN metadata baselines are also aligned to the selected 128-episode 96/16/16 split |
 | Research directions | Task mapping and extension probes for human modeling, 3D/4D reconstruction, egocentric interaction, and world modeling |
-| Scale-up path | A first selected-episode Qwen3-Omni LoRA diagnostic pilot has completed on the 96/16/16 split; same-split simple/NN metadata baselines now cover the 12 task ids as a companion comparison. The Qwen result proves the multi-episode export/train/eval/package loop, but the weak held-out metrics make it a baseline for error analysis rather than a strong model. Cosmos 3/world-model and VLA/policy branches reuse the same split and package contract after their targets are implemented. |
+| Scale-up path | The selected-episode Qwen3-Omni LoRA final diagnostic result is verified on the 96/16/16 split; same-split simple/NN metadata baselines now cover the 12 task ids as a companion comparison. The Qwen result proves the multi-episode export/train/eval/package loop and meets the strict-JSON target, but weak action/subtask metrics make it a baseline for error analysis rather than a strong model. Cosmos3/world-model and VLA/policy branches reuse the same split and package contract after their targets are implemented. |
 | Public surfaces | GitHub repo, GitHub Pages dashboard, GHCR static-site package, HF Space, HF artifact dataset, HF baseline-model repo, and HF collection |
 For the fastest interpretation of the current metrics, start with
@@ -111,7 +111,7 @@ This project is best read as a staged embodied-AI research study:
 | Task suite | Twelve human-readable tasks cover action, procedure, contact, object, language, retrieval, reconstruction, order, and synchronization questions. | [`RESEARCH_TAKEAWAYS.md`](RESEARCH_TAKEAWAYS.md), [`results/episode_task_suite/summary_report.json`](results/episode_task_suite/summary_report.json) |
 | Baselines | Minimal heads and compact PyTorch MLP heads provide a first controlled comparison on the same chronological split; the selected 128-episode setup also has same-split simple/NN metadata baselines for JSON-supported tasks. | [`results/episode_task_suite/neural_mlp/`](results/episode_task_suite/neural_mlp/), [`results/omni_finetune/multi_episode_128_task_baselines/BASELINE_ALIGNMENT_REPORT.md`](results/omni_finetune/multi_episode_128_task_baselines/BASELINE_ALIGNMENT_REPORT.md) |
 | Diagnostics | Audio contribution, modality ablations, timeline overlays, object labels, and alignment stress tests show which signals are useful and which tasks remain hard. | [`results/audio_ablation/AUDIO_ABLATION_SUMMARY.md`](results/audio_ablation/AUDIO_ABLATION_SUMMARY.md), [`docs/single_episode_explorer.html`](docs/single_episode_explorer.html) |
-| Scale-up | The selected 128-episode Qwen3-Omni LoRA diagnostic pilot has a verified validation-aware held-out package: 96/16/16 selected episodes, 3,808 exported windows, 512 validation windows, 448 held-out test windows, and public-safe metrics/predictions. Same-split simple/NN metadata baselines are published separately for the 12 task ids. JSON validity is 87.50%, below the 98% target, so the next pass focuses on structured-output reliability and task-quality error analysis. | [`RESEARCH_ROADMAP.md`](RESEARCH_ROADMAP.md), [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md), [`docs/data/omni_finetune_verified_result.json`](docs/data/omni_finetune_verified_result.json), [`results/omni_finetune/verified_public/`](results/omni_finetune/verified_public/), [`results/omni_finetune/multi_episode_128_task_baselines/BASELINE_ALIGNMENT_REPORT.md`](results/omni_finetune/multi_episode_128_task_baselines/BASELINE_ALIGNMENT_REPORT.md) |
+| Scale-up | The selected 128-episode Qwen3-Omni LoRA diagnostic path has a final verified held-out package: 96/16/16 selected episodes, 3,808 exported windows, 512 validation windows, 448 held-out test windows, and public-safe metrics/predictions. Same-split simple/NN metadata baselines are published for the 12 task ids, and the first Cosmos3-Nano future-window compatibility package is verified as a separate world-model branch. The final Qwen pass reaches 99.78% JSON validity, meeting the 98% target, while action/subtask quality remains weak and is the next error-analysis target. | [`RESEARCH_ROADMAP.md`](RESEARCH_ROADMAP.md), [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md), [`docs/data/omni_model_comparison.json`](docs/data/omni_model_comparison.json), [`docs/data/omni_finetune_verified_result.json`](docs/data/omni_finetune_verified_result.json), [`results/omni_finetune/OMNI_MODEL_COMPARISON.md`](results/omni_finetune/OMNI_MODEL_COMPARISON.md), [`results/omni_finetune/verified_public/`](results/omni_finetune/verified_public/), [`results/omni_finetune/multi_episode_128_task_baselines/BASELINE_ALIGNMENT_REPORT.md`](results/omni_finetune/multi_episode_128_task_baselines/BASELINE_ALIGNMENT_REPORT.md) |
 Detailed dataset notes, reproduction checks, and generated JSON reports are
 included for readers who want to inspect the implementation, but they are
@@ -133,7 +133,7 @@ They give the current research state in one compact table:
 | Dataset context | Official Xperience-10M links, sample-vs-gated-data boundary, modality coverage, and redistribution policy are documented |
 | Evaluation protocol | Verified generated protocol for windowing, split policy, leakage controls, and per-task metrics |
 | Website and Hub pages | Public dashboard, Hugging Face Space, artifact dataset, baseline model repo, and collection use the same project framing and links |
-| Qwen3-Omni multi-episode pilot | Verified diagnostic result package exists for the selected 96/16/16 episode split; current held-out metrics are weak and below the JSON-validity quality target |
+| Qwen3-Omni multi-episode pilot | Final verified diagnostic result package exists for the selected 96/16/16 episode split; JSON validity meets the target, while action/subtask metrics remain weak |
 | Raw Xperience-10M data / full Qwen weights | Not redistributed |
 ## 90-Second Research Project Path
@@ -152,7 +152,7 @@ If you are reading the project cold, open these in order:
 | 8 | What research directions does this support? | [`RESEARCH_ROADMAP.md`](RESEARCH_ROADMAP.md), [`docs/data/research_directions.json`](docs/data/research_directions.json), [`docs/data/research_direction_extensions.json`](docs/data/research_direction_extensions.json) | The tasks are mapped to human modeling, 3D/4D reconstruction, egocentric interaction, and world modeling. |
 | 9 | Which foundation model comes next? | [`FOUNDATION_MODEL_PLAN.md`](FOUNDATION_MODEL_PLAN.md), [`docs/data/foundation_model_plan.json`](docs/data/foundation_model_plan.json), [`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`](XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md) | Qwen3-Omni is the first held-out LoRA baseline; Cosmos 3 is the first world-model branch; policy models wait for explicit action targets; Xperience-native pretraining is the full-corpus future goal. |
 | 10 | How do I reproduce it? | [`REPRODUCIBILITY.md`](REPRODUCIBILITY.md), [`notes/reproducibility_audit.md`](notes/reproducibility_audit.md) | Public commands and expected outputs are documented for the sample-episode task suite. |
-| 11 | What is still pending? | [`docs/data/omni_finetune_verified_result.json`](docs/data/omni_finetune_verified_result.json), [`DATA_ACCESS_STATUS.md`](results/omni_finetune/DATA_ACCESS_STATUS.md), [`MULTI_EPISODE_ACCESS_STATUS.md`](results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md) | The first held-out diagnostic pilot is verified; strong model quality remains pending because JSON validity is 87.50% and action/subtask metrics remain weak. |
+| 11 | What is still pending? | [`docs/data/omni_finetune_verified_result.json`](docs/data/omni_finetune_verified_result.json), [`DATA_ACCESS_STATUS.md`](results/omni_finetune/DATA_ACCESS_STATUS.md), [`MULTI_EPISODE_ACCESS_STATUS.md`](results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md) | The final held-out diagnostic Qwen pass is verified and JSON-validity target is met; strong action/subtask model quality remains pending. |
 A compact reader-path summary is available at
 [`docs/data/project_packet.json`](docs/data/project_packet.json).
@@ -481,8 +481,8 @@ python scripts/train_all_modalities_model.py --workspace /path/to/workspace
 This repo includes a first Qwen3-Omni fine-tuning path over Xperience-10M. The
 repository separates public-sample evidence from multi-episode fine-tuning
-artifacts. The validation-aware selected-episode held-out package is now verified as a
-diagnostic pilot, not a strong final model.
+artifacts. The selected-episode held-out package is now verified as a
+diagnostic result, not a strong final action/subtask model.
 The useful distinction is:
 - direct Qwen3-Omni inputs: RGB/fisheye video, embedded MP4 audio, and language
@@ -505,6 +505,11 @@ for public README, website, or Hugging Face updates only after the validator
 passes and `scripts/omni/package_verified_omni_result.py` creates a
 public-safe derived-artifact package. The current verified package is listed in
 [`docs/data/omni_finetune_verified_result.json`](docs/data/omni_finetune_verified_result.json).
+The current cross-version comparison is generated at
+[`docs/data/omni_model_comparison.json`](docs/data/omni_model_comparison.json)
+and [`results/omni_finetune/OMNI_MODEL_COMPARISON.md`](results/omni_finetune/OMNI_MODEL_COMPARISON.md);
+it separates the single-episode task suite, 128-episode aligned simple/NN
+baselines, and verified Qwen3/Cosmos model-branch packages.
 ### Sample Count Decision
@@ -544,13 +549,16 @@ Current status in this repo:
 - gated_metadata_audit: 12,102 complete visible episodes across 802 complete sessions
 - selected_episode_plan: 128 source-balanced episodes, 96/16/16 train/val/test
 - selected_download_size: 277.71 GiB excluding `visualization.rrd`
-- verified_validation_aware_diagnostic_package: true
+- verified_final_diagnostic_package: true
 - selected_split: 96 train / 16 validation / 16 held-out test episodes
 - exported_windows: 2,848 train / 512 validation / 448 test
 - validation_samples_used: 512
 - held_out_eval: 448 test windows from 14 exported test episodes
-- train_loss / val_loss: 0.4130 / 0.0331
-- current_quality_target: JSON validity 87.50%, below the 98% target
+- final_train_loss / final_val_loss: 0.0277 / 0.0278
+- current_quality_target: JSON validity 99.78%, meeting the 98% target; action/subtask quality remains weak
+- qwen3_lora_adapter_repo: https://huggingface.co/cy0307/ropedia-qwen3-omni-lora-128ep
+- 128_aligned_baselines: 12 task ids, 8 simple metadata/text baselines, 6 neural metadata/text baselines
+- cosmos3_branch: verified Cosmos3-Nano future-window compatibility package, 378 held-out future-window predictions from 14 test episodes
 - gated dataset: available for selected multi-episode data preparation
 - source_discovery: `results/omni_finetune/source_discovery.json`
 - data_status: `results/omni_finetune/DATA_ACCESS_STATUS.md`
@@ -716,12 +724,12 @@ windows without depending on Qwen chat-message records.
 The public-safe verified package intentionally excludes raw data, base Qwen
 weights, LoRA weights, and full checkpoints. Adapter upload is a separate step:
 use it only when the intended adapter directory is present and the model card
-clearly distinguishes older smoke weights from the selected-episode diagnostic
-or validation-aware run.
+clearly distinguishes older smoke weights from the final selected-episode
+diagnostic run.
 ```bash
 python3 scripts/omni/upload_qwen3_omni_lora_to_hf.py \
-  --repo-id cy0307/ropedia-qwen3-omni-lora-smoke \
+  --repo-id cy0307/ropedia-qwen3-omni-lora-128ep \
   --source-dir /path/to/adapter_upload_package \
   --message "Upload Xperience-10M Qwen3-Omni LoRA pilot"
 ```
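
The JSON-validity figures quoted in this change (87.50% in the removed lines, 99.78% in the added lines, against a 98% target) can be illustrated with a minimal sketch. The helper `json_validity_rate` is hypothetical and is not the repository's evaluator; it assumes an output counts as valid when it parses as a JSON object. The counts 392 and 447 are also an assumption: they are what the two percentages would imply if the 448 held-out test windows were the denominator, which the diff does not state.

```python
import json


def json_validity_rate(outputs):
    """Return the fraction of model outputs that parse as a JSON object.

    Hypothetical helper for illustration only; the project's own
    evaluator may apply stricter schema checks.
    """
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except (TypeError, ValueError):
            # Not a string, or not parseable as JSON.
            continue
        if isinstance(parsed, dict):
            valid += 1
    return valid / len(outputs)


# Made-up outputs, not taken from the verified package.
sample_outputs = ['{"action": "pour"}', '{"subtask": 3}', "not json", "[1, 2]"]
sample_rate = json_validity_rate(sample_outputs)
print(f"sample validity: {sample_rate:.2%}")

# Assumption: the 448 held-out test windows are the denominator.
TEST_WINDOWS = 448
TARGET = 0.98
for label, valid_count in [("earlier pass", 392), ("final pass", 447)]:
    rate = valid_count / TEST_WINDOWS
    status = "meets" if rate >= TARGET else "is below"
    print(f"{label}: {rate:.2%} {status} the {TARGET:.0%} target")
```

Under that assumption, a single unparseable output among 448 is enough to produce 99.78%, which is why the remaining error analysis centers on action/subtask quality rather than on output structure.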