| # Omni Model Comparison |
|
|
| Generated: `2026-06-21T15:17:00+00:00` |
|
|
| Compare only rows with the same scope and target. Single-episode raw-feature metrics, 128-episode metadata baselines, Qwen3 structured JSON metrics, and the two Cosmos3 targets answer different questions: Nano future-window retrieval versus Super structured JSON Reasoner evaluation. |
|
|
| ## Current Result Versions |
|
|
| | version | status | scope | source | |
| | --- | --- | --- | --- | |
| | Single-Episode Public-Sample 20-Task Suite | verified | one public Xperience-10M sample episode | `results/episode_task_suite/summary_report.json` | |
| | 128-Episode Aligned Simple/NN Baselines | pass | selected 128-episode 96/16/16 split | `results/omni_finetune/multi_episode_128_task_baselines/BASELINE_ALIGNMENT_REPORT.md` | |
| | 128-Episode Foundation-Model Branches | partial_verified | selected 128-episode split and compatible derived windows | `results/omni_finetune/verified_public/` | |
| |
| Read the three rows this way: |
| |
| - Version 1 is the public-sample 20-task surface: unified task heads, historical provenance rows, and the 180-row method-task matrix. |
| - Version 2 is the selected 128-episode same-split simple/NN baseline alignment. |
| - The selected-128 model-diagnostic group contains the current Qwen3-Omni LoRA JSON-task row, Cosmos3-Nano future-window compatibility result, Cosmos3-Super Reasoner base-weight JSON-task evaluation, and the separate Cosmos3-Super Forward-Dynamics LoRA adapter artifact. |
| |
| ## Model-Family Grouped View |
| |
| - Use model_groups when comparing one-episode and 128-episode artifacts within the same model family. |
| - Task-head baselines have both a one-episode public-sample run and a 128-episode same-split metadata/text run. |
| - Qwen3-Omni has a one-episode sensor-adapter smoke test, full-parameter feasibility gates, and separate 128-episode LoRA diagnostic packages; the newest verified full-eval 128-episode adapter belongs in the Qwen LoRA model repo. |
| - Cosmos3-Nano has a 128-episode future-window compatibility package. |
| - Cosmos3-Super now has both a 128-episode base-weight Reasoner evaluation on the JSON task and a fine-tuned forward-dynamics LoRA branch over camera-pose proxy targets. |
|
|
| ### Minimal and Neural Task Heads |
|
|
| This is the cleanest 1-episode versus 128-episode grouping for the same simple/NN task-head family, but the feature surface changes from raw public-sample features to public-safe 128-episode metadata/text features. |
|
|
| - Weight repo policy: https://huggingface.co/cy0307/ropedia-xperience-10m-task-baselines |
|
|
| | scope | status | run | counts | metrics | source | |
| | --- | --- | --- | --- | --- | --- | |
| | 1 episode | verified | Single-Episode Public-Sample 20-Task Suite | 1 episodes, 1161 windows/samples | | `results/episode_task_suite/summary_report.json` | |
| | 128 episode | pass | 128-Episode Aligned Simple/NN Baselines | 3808 windows/samples | | `results/omni_finetune/multi_episode_128_task_baselines/BASELINE_ALIGNMENT_REPORT.md` | |
|
|
| ### Qwen3-Omni LoRA |
|
|
| The one-episode Qwen entry is only a sensor-adapter smoke test with Qwen3 weights unloaded. The 128-episode entries are real held-out LoRA diagnostics; the current final adapter belongs in the separate Qwen model repo. The full-parameter rows are feasibility gates only and intentionally publish no checkpoints or full-parameter weights. |
|
|
| - Weight repo policy: https://huggingface.co/cy0307/ropedia-qwen3-omni-lora-128ep |
|
|
| | scope | status | run | counts | metrics | source | |
| | --- | --- | --- | --- | --- | --- | |
| | 1 episode | verified_smoke | Qwen3-Omni Sensor-Adapter Smoke | 1 episodes, 59 windows/samples | train_final_loss=1.4479, accuracy=0.0000, macro_f1=0.0000 | `results/omni_exploration/qwen3_adapter_smoke/metrics.json` | |
| | full-param gate | passed | Full-Parameter 1-Step Feasibility Smoke | 8 windows/samples | full_parameter_gate=passed, observed_train_steps=1, final_step_loss=1.2726, epoch_train_loss=1.2726, checkpoint_saved=False | `results/omni_finetune/xperience10m_qwen3_omni_128ep_fullparam_smoke_preemptible_8gpu_20260609/fullparam_feasibility_summary.json` | |
| | full-param gate | passed | Full-Parameter 8-Step Short Train | 64 windows/samples | full_parameter_gate=passed, observed_train_steps=8, final_step_loss=1.1805, epoch_train_loss=1.2190, checkpoint_saved=False | `results/omni_finetune/xperience10m_qwen3_omni_128ep_fullparam_shorttrain8_preemptible_8gpu_20260609/fullparam_shorttrain8_summary.json` | |
| | full-param gate | passed | Full-Parameter 32-Step Pilot | 256 windows/samples | full_parameter_gate=passed, observed_train_steps=32, final_step_loss=0.2206, epoch_train_loss=0.8451, checkpoint_saved=False | `results/omni_finetune/xperience10m_qwen3_omni_128ep_fullparam_pilot32_preemptible_8gpu_20260609/fullparam_pilot32_summary.json` | |
| | full-param gate | passed | Full-Parameter 64-Step Pilot | 512 windows/samples | full_parameter_gate=passed, observed_train_steps=64, final_step_loss=0.0112, epoch_train_loss=0.4434, checkpoint_saved=False | `results/omni_finetune/xperience10m_qwen3_omni_128ep_fullparam_pilot64_preemptible_8gpu_20260609/fullparam_pilot64_summary.json` | |
| | full-param gate | preempted_for_qwen_v5_handoff | Full-Parameter 128-Step Opportunistic Pilot | 1024 windows/samples | full_parameter_gate=preempted_for_qwen_v5_handoff, observed_train_steps=0, final_step_loss=, epoch_train_loss=, checkpoint_saved=False | `results/omni_finetune/xperience10m_qwen3_omni_128ep_fullparam_pilot128_preemptible_8gpu_20260609/fullparam_pilot128_summary.json` | |
| | full-param gate | passed | Full-Parameter 128-Step Post-Qwen-v5 Pilot | 1024 windows/samples | full_parameter_gate=passed, observed_train_steps=128, final_step_loss=0.0137, epoch_train_loss=0.2158, checkpoint_saved=False | `results/omni_finetune/xperience10m_qwen3_omni_128ep_fullparam_pilot128_after_qwen_v5_preemptible_8gpu_20260609/training_metadata.json` | |
| | full-param gate | passed | Full-Parameter 256-Step Post-Qwen-v6 Pilot | 2048 windows/samples | full_parameter_gate=passed, observed_train_steps=256, final_step_loss=0.0096, epoch_train_loss=0.1158, checkpoint_saved=False | `results/omni_finetune/xperience10m_qwen3_omni_128ep_fullparam_pilot256_after_qwen_v6_preemptible_8gpu_20260611/training_metadata.json` | |
| | 128 episode | verified | Qwen3-Omni LoRA | 119 episodes, 3808 windows/samples, 448 eval | json_validity_rate=0.8750, action_macro_f1=0.0027, transition_accuracy=0.8504, contact_accuracy=0.6451 | `results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_96train_16val_16test_valmon_20260605_eval/verified_result_summary.json` | |
| | 128 episode | verified | Qwen3-Omni LoRA | 119 episodes, 3808 windows/samples, 448 eval | json_validity_rate=0.8527, action_macro_f1=0.0021, transition_accuracy=0.8281, contact_accuracy=0.6518 | `results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu_lora_fsdp_full_train_noval_tail_logits_fullstatesave_v6_eval_test_full/verified_result_summary.json` | |
| | 128 episode | verified | Qwen3-Omni LoRA | 119 episodes, 34269 windows/samples, 4032 eval | json_validity_rate=1.0000, action_macro_f1=0.0023, transition_accuracy=0.9908, contact_accuracy=0.7865 | `results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_multiscale_cap96_v5_full8gpu_lora_eval_test_full/verified_result_summary.json` | |
| | 128 episode | verified current | Qwen3-Omni LoRA | 119 episodes, 34269 windows/samples, 4032 eval | json_validity_rate=0.9990, action_macro_f1=0.0029, transition_accuracy=0.9898, contact_accuracy=0.8177 | `results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_multiscale_cap96_v6_rank64_lr5e5_full8gpu_lora_eval_test_full/verified_result_summary.json` | |
| | 128 episode | verified | Qwen3-Omni LoRA | 119 episodes, 3808 windows/samples, 448 eval | json_validity_rate=0.9978, action_macro_f1=0.0024, transition_accuracy=0.9710, contact_accuracy=0.7188 | `results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_structured_json_v2_reuse_full8gpu_lora_eval_test_full/verified_result_summary.json` | |
| | 128 episode | verified | Qwen3-Omni LoRA | 119 episodes, 3808 windows/samples, 448 eval | json_validity_rate=1.0000, action_macro_f1=0.0022, transition_accuracy=0.9732, contact_accuracy=0.7210 | `results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_structured_json_v3_strict_label_prompt_reuse_lora_eval_test_full/verified_result_summary.json` | |
| | 128 episode | verified | Qwen3-Omni LoRA | 119 episodes, 3808 windows/samples, 448 eval | json_validity_rate=1.0000, action_macro_f1=0.0019, transition_accuracy=0.9732, contact_accuracy=0.7299 | `results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_structured_json_v4_4epoch_full8gpu_lora_eval_test_full/verified_result_summary.json` | |
|
|
| ### Cosmos3-Nano Future-Window World Model |
|
|
| The current 128-episode Cosmos result is a public-safe future-window compatibility adapter. It is not yet a full Cosmos diffusion/LoRA weight release. |
|
|
| - Weight repo policy: planned: cy0307/ropedia-cosmos3-nano-future-window-lora-128ep after real adapter weights exist |
|
|
| | scope | status | run | counts | metrics | source | |
| | --- | --- | --- | --- | --- | --- | |
| | 1 episode | not_run | Cosmos3-Nano One-Episode Fine-Tune | | | | |
| | 128 episode | verified current | Cosmos3-Nano Future-Window World Model | 119 episodes, 3213 windows/samples, 378 eval | future_retrieval_mrr=0.0221, temporal_consistency=0.0952, transition_accuracy=0.9683, contact_accuracy=0.7434 | `results/omni_finetune/verified_public/xperience10m_cosmos3_nano_128ep_future_window_h5_compat_adapter_eval_test_full/verified_result_summary.json` | |
|
|
| ### Cosmos3-Super Reasoner |
|
|
| Cosmos3-Super is now represented by a verified 448-window held-out Reasoner evaluation on the same JSON task as Qwen3. It uses staged base weights through vLLM, so it is a Cosmos3 diagnostic, not a weight release. A camera-pose proxy forward-dynamics target export now passes the contract audit and schema-only packer smoke; the separate Forward-Dynamics LoRA group records the trainable adapter run and loss-based held-out evaluation. |
|
|
| - Weight repo policy: none for this run; staged base weights only, no new fine-tuned weights |
|
|
| | scope | status | run | counts | metrics | source | |
| | --- | --- | --- | --- | --- | --- | |
| | 1 episode | not_run | Cosmos3-Super One-Episode Fine-Tune | | | | |
| | readiness | blocked_until_trainer_implemented | Cosmos3-Super Training Readiness Probe | 3808 windows/samples | diffusers_runtime_supported=True, chat_sft_supported=False, weights_updated=False | `results/omni_finetune/xperience10m_cosmos3_super_training_readiness_20260607/training_readiness.json` | |
| | staging readiness | blocked_until_trainer_implemented | Cosmos3-Super Remote Staging Readiness Probe | 3808 windows/samples | diffusers_runtime_supported=False, weights_updated=False | `results/omni_finetune/xperience10m_cosmos3_super_training_readiness_metadata_a100_20260609/training_readiness.json` | |
| | action target contract | ready_for_forward_dynamics_trainer | Cosmos3-Super Camera-Pose Target Audit | 3808 windows/samples | domain_name=camera_pose, raw_action_dim=9, mode=forward_dynamics, valid_action_targets=3808, weights_updated=False | `results/omni_finetune/xperience10m_cosmos3_super_training_contract_audit_camera_pose_20260608/training_contract_audit.json` | |
| | batch packer | pass | Cosmos3-Super Action Batch Packer Smoke | 1 windows/samples | mode=forward_dynamics, loss_surface=vision_velocity_conditioned_on_camera_pose, pipeline_loaded=False, weights_updated=False | `results/omni_finetune/xperience10m_cosmos3_super_action_packer_schema_smoke_20260608/packer_summary.json` | |
| | 128 episode | verified current | Cosmos3-Super Reasoner | 119 episodes, 3808 windows/samples, 448 eval | json_validity_rate=0.5112, action_macro_f1=0.0008, transition_accuracy=0.3683, contact_accuracy=0.3214 | `results/omni_finetune/verified_public/xperience10m_cosmos3_super_reasoner_128ep_test_full_20260607/verified_result_summary.json` | |
|
|
| ### Cosmos3-Super Forward-Dynamics LoRA |
|
|
| This is the first verified Cosmos3-Super fine-tuned adapter branch. Its metric is forward-dynamics MSE, so compare it to world-model loss or future-prediction targets, not to Qwen JSON classification accuracy. |
|
|
| - Weight repo policy: https://huggingface.co/cy0307/ropedia-cosmos3-super-forward-dynamics-lora-128ep |
|
|
| | scope | status | run | counts | metrics | source | |
| | --- | --- | --- | --- | --- | --- | |
| | 1 episode | verified_smoke | Cosmos3-Super Forward-Dynamics Overfit Smoke | | | `results/omni_finetune/xperience10m_cosmos3_super_forward_dynamics_lora_overfit_after_qwen_v4_20260608_fsdp8_attn256_gradfix_savefix2/` | |
| | 128 episode | verified current | Cosmos3-Super Forward-Dynamics LoRA | 119 episodes, 3808 windows/samples, 448 eval | test_forward_dynamics_mse=3.6853, val_forward_dynamics_mse=4.0082, train_final_loss=1.0785, adapter_parameter_numel=26214400 | `results/omni_finetune/verified_public/xperience10m_cosmos3_super_forward_dynamics_lora_128ep_train1epoch_256_attn_full8gpu_20260608_eval_test_full_fsdp/verified_result_summary.json` | |
|
|
| ## 128-Episode Task Baselines |
|
|
| | task | simple | neural | |
| | --- | ---: | ---: | |
| | Action Recognition | macro_f1 0.0002 | macro_f1 0.0000 | |
| | Procedure Step Recognition | macro_f1 0.0000 | macro_f1 0.0000 | |
| | Action Boundary Detection | macro_f1 0.5220 | macro_f1 0.4582 | |
| | Next-Action Prediction | macro_f1 0.0002 | macro_f1 0.0000 | |
| | Hand Trajectory Forecasting | mpjpe | | |
| | Contact State Prediction | macro_f1 0.5168 | macro_f1 0.2195 | |
| | Object Relevance Prediction | micro_f1 0.1822 | micro_f1 0.1054 | |
| | Language Grounding | mrr 0.0128 | | |
| | Cross-Modal Retrieval | mrr | | |
| | Cross-Modal Reconstruction | r2 | | |
| | Temporal Order Verification | f1 0.3271 | | |
| | Multimodal Synchronization Detection | f1 | | |
|
|
| ## Verified Model Branches |
|
|
| | branch | backbone | eval samples | held-out episodes | key metrics | |
| | --- | --- | ---: | ---: | --- | |
| | Cosmos3-Nano Future-Window World Model | `cosmos_world_model` | 378 | 14 | future_retrieval_mrr=0.0221, temporal_consistency=0.0952, transition_accuracy=0.9683, contact_accuracy=0.7434 | |
| | Cosmos3-Super Forward-Dynamics LoRA | `cosmos3_super_forward_dynamics` | 448 | 14 | adapter_parameter_numel=26214400, test_forward_dynamics_mse=3.6853, train_final_loss=1.0785, val_forward_dynamics_mse=4.0082 | |
| | Cosmos3-Super Reasoner | `cosmos3_super_reasoner` | 448 | 14 | json_validity_rate=0.5112, action_macro_f1=0.0008, transition_accuracy=0.3683, contact_accuracy=0.3214 | |
| | Qwen3-Omni LoRA | `qwen3_omni_lora` | 448 | 14 | json_validity_rate=0.8750, action_macro_f1=0.0027, transition_accuracy=0.8504, contact_accuracy=0.6451 | |
| | Qwen3-Omni LoRA | `qwen3_omni_lora` | 448 | 14 | json_validity_rate=0.8527, action_macro_f1=0.0021, transition_accuracy=0.8281, contact_accuracy=0.6518 | |
| | Qwen3-Omni LoRA | `qwen3_omni_lora` | 4032 | 14 | json_validity_rate=1.0000, action_macro_f1=0.0023, transition_accuracy=0.9908, contact_accuracy=0.7865 | |
| | Qwen3-Omni LoRA | `qwen3_omni_lora` | 4032 | 14 | json_validity_rate=0.9990, action_macro_f1=0.0029, transition_accuracy=0.9898, contact_accuracy=0.8177 | |
| | Qwen3-Omni LoRA | `qwen3_omni_lora` | 448 | 14 | json_validity_rate=0.9978, action_macro_f1=0.0024, transition_accuracy=0.9710, contact_accuracy=0.7188 | |
| | Qwen3-Omni LoRA | `qwen3_omni_lora` | 448 | 14 | json_validity_rate=1.0000, action_macro_f1=0.0022, transition_accuracy=0.9732, contact_accuracy=0.7210 | |
| | Qwen3-Omni LoRA | `qwen3_omni_lora` | 448 | 14 | json_validity_rate=1.0000, action_macro_f1=0.0019, transition_accuracy=0.9732, contact_accuracy=0.7299 | |
|
|
| ## Pending |
|
|
| - Use the verified Qwen3 v6 rank64/lr5e-5 dense multiscale full-eval package as the latest current Qwen row; the v5 release tag remains pinned as the previous verified release. |
| - Read results/omni_finetune/QWEN3_V5_V6_COMPARISON_20260614.md before claiming v6 is globally better than v5, because v6 improves action macro-F1 and contact accuracy but regresses subtask, next-action, object micro-F1, and JSON validity slightly. |
| |