Omni Model Comparison

Generated: 2026-06-21T15:17:00+00:00

Compare only rows with the same scope and target. Single-episode raw-feature metrics, 128-episode metadata baselines, Qwen3 structured JSON metrics, and the two Cosmos3 targets answer different questions. The two Cosmos3 targets also differ from each other: Nano is evaluated on future-window retrieval, Super on structured JSON Reasoner output.
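The comparison rule above can be sketched as a grouping step: rows are bucketed by (scope, target) and only rows inside one bucket are compared. This is a minimal illustration, not the report generator's code. The record layout and the `target` labels are assumptions made for the example; the run names and metric values are copied from this report.

```python
from collections import defaultdict

# Hypothetical row records. The "target" labels are assumed names;
# the metric values are copied from the tables in this report.
rows = [
    {"run": "Qwen3-Omni LoRA (structured_json_v4)", "scope": "128 episode",
     "target": "structured_json", "contact_accuracy": 0.7299},
    {"run": "Cosmos3-Super Reasoner", "scope": "128 episode",
     "target": "structured_json", "contact_accuracy": 0.3214},
    {"run": "Cosmos3-Nano Future-Window World Model", "scope": "128 episode",
     "target": "future_window", "contact_accuracy": 0.7434},
    {"run": "Qwen3-Omni Sensor-Adapter Smoke", "scope": "1 episode",
     "target": "sensor_adapter_smoke", "accuracy": 0.0},
]


def comparable_groups(records):
    """Bucket run names by (scope, target); compare only within a bucket."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["scope"], record["target"])].append(record["run"])
    return dict(groups)


groups = comparable_groups(rows)
for key, runs in groups.items():
    print(key, runs)
```

Under these assumed labels, the 448-window Qwen3 row and the Cosmos3-Super Reasoner row share a bucket, while the Cosmos3-Nano row and the one-episode smoke test each sit alone.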

Current Result Versions

version | status | scope | source
Single-Episode Public-Sample 20-Task Suite | verified | one public Xperience-10M sample episode | results/episode_task_suite/summary_report.json
128-Episode Aligned Simple/NN Baselines | pass | selected 128-episode 96/16/16 split | results/omni_finetune/multi_episode_128_task_baselines/BASELINE_ALIGNMENT_REPORT.md
128-Episode Foundation-Model Branches | partial_verified | selected 128-episode split and compatible derived windows | results/omni_finetune/verified_public/

Read the three rows this way:

  • Version 1 is the public-sample 20-task surface: unified task heads, historical provenance rows, and the 180-row method-task matrix.
  • Version 2 is the selected 128-episode same-split simple/NN baseline alignment.
  • Version 3 is the selected-128 model-diagnostic group: the current Qwen3-Omni LoRA JSON-task row, the Cosmos3-Nano future-window compatibility result, the Cosmos3-Super Reasoner base-weight JSON-task evaluation, and the separate Cosmos3-Super Forward-Dynamics LoRA adapter artifact.

Model-Family Grouped View

  • Use model_groups when comparing one-episode and 128-episode artifacts within the same model family.
  • Task-head baselines have both a one-episode public-sample run and a 128-episode same-split metadata/text run.
  • Qwen3-Omni has a one-episode sensor-adapter smoke test, full-parameter feasibility gates, and separate 128-episode LoRA diagnostic packages; the newest verified full-eval 128-episode adapter belongs in the Qwen LoRA model repo.
  • Cosmos3-Nano has a 128-episode future-window compatibility package.
  • Cosmos3-Super now has both a 128-episode base-weight Reasoner evaluation on the JSON task and a fine-tuned forward-dynamics LoRA branch over camera-pose proxy targets.

Minimal and Neural Task Heads

This is the cleanest 1-episode versus 128-episode grouping for the same simple/NN task-head family, but the feature surface changes from raw public-sample features to public-safe 128-episode metadata/text features.

scope | status | run | counts | metrics | source
1 episode | verified | Single-Episode Public-Sample 20-Task Suite | 1 episode, 1161 windows/samples | - | results/episode_task_suite/summary_report.json
128 episode | pass | 128-Episode Aligned Simple/NN Baselines | 3808 windows/samples | - | results/omni_finetune/multi_episode_128_task_baselines/BASELINE_ALIGNMENT_REPORT.md

Qwen3-Omni LoRA

The one-episode Qwen entry is only a sensor-adapter smoke test, run without Qwen3 weights loaded. The 128-episode entries are real held-out LoRA diagnostics; the current final adapter belongs in the separate Qwen model repo. The full-parameter rows are feasibility gates only and intentionally publish no checkpoints or full-parameter weights.

scope status run counts metrics source
1 episode verified_smoke Qwen3-Omni Sensor-Adapter Smoke 1 episode, 59 windows/samples train_final_loss=1.4479, accuracy=0.0000, macro_f1=0.0000 results/omni_exploration/qwen3_adapter_smoke/metrics.json
full-param gate passed Full-Parameter 1-Step Feasibility Smoke 8 windows/samples full_parameter_gate=passed, observed_train_steps=1, final_step_loss=1.2726, epoch_train_loss=1.2726, checkpoint_saved=False results/omni_finetune/xperience10m_qwen3_omni_128ep_fullparam_smoke_preemptible_8gpu_20260609/fullparam_feasibility_summary.json
full-param gate passed Full-Parameter 8-Step Short Train 64 windows/samples full_parameter_gate=passed, observed_train_steps=8, final_step_loss=1.1805, epoch_train_loss=1.2190, checkpoint_saved=False results/omni_finetune/xperience10m_qwen3_omni_128ep_fullparam_shorttrain8_preemptible_8gpu_20260609/fullparam_shorttrain8_summary.json
full-param gate passed Full-Parameter 32-Step Pilot 256 windows/samples full_parameter_gate=passed, observed_train_steps=32, final_step_loss=0.2206, epoch_train_loss=0.8451, checkpoint_saved=False results/omni_finetune/xperience10m_qwen3_omni_128ep_fullparam_pilot32_preemptible_8gpu_20260609/fullparam_pilot32_summary.json
full-param gate passed Full-Parameter 64-Step Pilot 512 windows/samples full_parameter_gate=passed, observed_train_steps=64, final_step_loss=0.0112, epoch_train_loss=0.4434, checkpoint_saved=False results/omni_finetune/xperience10m_qwen3_omni_128ep_fullparam_pilot64_preemptible_8gpu_20260609/fullparam_pilot64_summary.json
full-param gate preempted_for_qwen_v5_handoff Full-Parameter 128-Step Opportunistic Pilot 1024 windows/samples full_parameter_gate=preempted_for_qwen_v5_handoff, observed_train_steps=0, final_step_loss=, epoch_train_loss=, checkpoint_saved=False results/omni_finetune/xperience10m_qwen3_omni_128ep_fullparam_pilot128_preemptible_8gpu_20260609/fullparam_pilot128_summary.json
full-param gate passed Full-Parameter 128-Step Post-Qwen-v5 Pilot 1024 windows/samples full_parameter_gate=passed, observed_train_steps=128, final_step_loss=0.0137, epoch_train_loss=0.2158, checkpoint_saved=False results/omni_finetune/xperience10m_qwen3_omni_128ep_fullparam_pilot128_after_qwen_v5_preemptible_8gpu_20260609/training_metadata.json
full-param gate passed Full-Parameter 256-Step Post-Qwen-v6 Pilot 2048 windows/samples full_parameter_gate=passed, observed_train_steps=256, final_step_loss=0.0096, epoch_train_loss=0.1158, checkpoint_saved=False results/omni_finetune/xperience10m_qwen3_omni_128ep_fullparam_pilot256_after_qwen_v6_preemptible_8gpu_20260611/training_metadata.json
128 episode verified Qwen3-Omni LoRA 119 episodes, 3808 windows/samples, 448 eval json_validity_rate=0.8750, action_macro_f1=0.0027, transition_accuracy=0.8504, contact_accuracy=0.6451 results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_96train_16val_16test_valmon_20260605_eval/verified_result_summary.json
128 episode verified Qwen3-Omni LoRA 119 episodes, 3808 windows/samples, 448 eval json_validity_rate=0.8527, action_macro_f1=0.0021, transition_accuracy=0.8281, contact_accuracy=0.6518 results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu_lora_fsdp_full_train_noval_tail_logits_fullstatesave_v6_eval_test_full/verified_result_summary.json
128 episode verified Qwen3-Omni LoRA 119 episodes, 34269 windows/samples, 4032 eval json_validity_rate=1.0000, action_macro_f1=0.0023, transition_accuracy=0.9908, contact_accuracy=0.7865 results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_multiscale_cap96_v5_full8gpu_lora_eval_test_full/verified_result_summary.json
128 episode verified current Qwen3-Omni LoRA 119 episodes, 34269 windows/samples, 4032 eval json_validity_rate=0.9990, action_macro_f1=0.0029, transition_accuracy=0.9898, contact_accuracy=0.8177 results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_multiscale_cap96_v6_rank64_lr5e5_full8gpu_lora_eval_test_full/verified_result_summary.json
128 episode verified Qwen3-Omni LoRA 119 episodes, 3808 windows/samples, 448 eval json_validity_rate=0.9978, action_macro_f1=0.0024, transition_accuracy=0.9710, contact_accuracy=0.7188 results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_structured_json_v2_reuse_full8gpu_lora_eval_test_full/verified_result_summary.json
128 episode verified Qwen3-Omni LoRA 119 episodes, 3808 windows/samples, 448 eval json_validity_rate=1.0000, action_macro_f1=0.0022, transition_accuracy=0.9732, contact_accuracy=0.7210 results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_structured_json_v3_strict_label_prompt_reuse_lora_eval_test_full/verified_result_summary.json
128 episode verified Qwen3-Omni LoRA 119 episodes, 3808 windows/samples, 448 eval json_validity_rate=1.0000, action_macro_f1=0.0019, transition_accuracy=0.9732, contact_accuracy=0.7299 results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_structured_json_v4_4epoch_full8gpu_lora_eval_test_full/verified_result_summary.json
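The metrics column in the table above packs several key=value pairs into one cell, and the preempted 128-step row leaves some values empty. A reader who wants to work with these cells programmatically could parse them roughly as follows. The function name `parse_metrics` and the choice to map empty values to `None` are assumptions made for this sketch; the summary JSON files listed in the source column may expose the same values more directly.

```python
def parse_metrics(cell):
    """Parse a 'key=value, key=value' metrics cell into a dict.

    Empty values (as in the preempted run) become None, 'True'/'False'
    become booleans, numeric strings become floats, and anything else
    is kept as a string.
    """
    metrics = {}
    for item in cell.split(","):
        key, sep, raw = item.strip().partition("=")
        if not sep:
            continue  # skip fragments that carry no '='
        raw = raw.strip()
        if raw == "":
            metrics[key] = None
        elif raw in ("True", "False"):
            metrics[key] = raw == "True"
        else:
            try:
                metrics[key] = float(raw)
            except ValueError:
                metrics[key] = raw
    return metrics


# Metrics cell copied from the preempted 128-step pilot row.
example = parse_metrics(
    "full_parameter_gate=preempted_for_qwen_v5_handoff, observed_train_steps=0, "
    "final_step_loss=, epoch_train_loss=, checkpoint_saved=False"
)
print(example)
```

Mapping the empty losses to `None` keeps a run that never trained from being mistaken for a run with zero loss.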

Cosmos3-Nano Future-Window World Model

The current 128-episode Cosmos result is a public-safe future-window compatibility adapter. It is not yet a full Cosmos diffusion/LoRA weight release.

  • Weight repo policy: planned as cy0307/ropedia-cosmos3-nano-future-window-lora-128ep, to be published after real adapter weights exist
scope status run counts metrics source
1 episode not_run Cosmos3-Nano One-Episode Fine-Tune
128 episode verified current Cosmos3-Nano Future-Window World Model 119 episodes, 3213 windows/samples, 378 eval future_retrieval_mrr=0.0221, temporal_consistency=0.0952, transition_accuracy=0.9683, contact_accuracy=0.7434 results/omni_finetune/verified_public/xperience10m_cosmos3_nano_128ep_future_window_h5_compat_adapter_eval_test_full/verified_result_summary.json
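For readers unfamiliar with the future_retrieval_mrr column, mean reciprocal rank is conventionally the average of 1/rank of the correct item over all queries. The sketch below shows that standard definition only. How the Cosmos3-Nano package builds its candidate set and ranks future windows is not described in this report, so the example ranks are made up.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where rank is the 1-based position of the correct item."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    if any(rank < 1 for rank in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Made-up ranks for three queries, for illustration only.
toy_mrr = mean_reciprocal_rank([1, 2, 4])

# 1 / MRR is the harmonic mean of the ranks, which gives a rough feel
# for the reported value of 0.0221.
implied_harmonic_mean_rank = 1.0 / 0.0221

print(round(toy_mrr, 4), round(implied_harmonic_mean_rank, 1))
```

Read this way, the reported MRR corresponds to a harmonic-mean rank of about 45, which is only informative once the candidate-pool size is known.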

Cosmos3-Super Reasoner

Cosmos3-Super is now represented by a verified 448-window held-out Reasoner evaluation on the same JSON task as Qwen3. It uses staged base weights through vLLM, so it is a Cosmos3 diagnostic, not a weight release. A camera-pose proxy forward-dynamics target export now passes the contract audit and schema-only packer smoke; the separate Forward-Dynamics LoRA group records the trainable adapter run and loss-based held-out evaluation.

  • Weight repo policy: none for this run; staged base weights only, no new fine-tuned weights
scope status run counts metrics source
1 episode not_run Cosmos3-Super One-Episode Fine-Tune
readiness blocked_until_trainer_implemented Cosmos3-Super Training Readiness Probe 3808 windows/samples diffusers_runtime_supported=True, chat_sft_supported=False, weights_updated=False results/omni_finetune/xperience10m_cosmos3_super_training_readiness_20260607/training_readiness.json
staging readiness blocked_until_trainer_implemented Cosmos3-Super Remote Staging Readiness Probe 3808 windows/samples diffusers_runtime_supported=False, weights_updated=False results/omni_finetune/xperience10m_cosmos3_super_training_readiness_metadata_a100_20260609/training_readiness.json
action target contract ready_for_forward_dynamics_trainer Cosmos3-Super Camera-Pose Target Audit 3808 windows/samples domain_name=camera_pose, raw_action_dim=9, mode=forward_dynamics, valid_action_targets=3808, weights_updated=False results/omni_finetune/xperience10m_cosmos3_super_training_contract_audit_camera_pose_20260608/training_contract_audit.json
batch packer pass Cosmos3-Super Action Batch Packer Smoke 1 windows/samples mode=forward_dynamics, loss_surface=vision_velocity_conditioned_on_camera_pose, pipeline_loaded=False, weights_updated=False results/omni_finetune/xperience10m_cosmos3_super_action_packer_schema_smoke_20260608/packer_summary.json
128 episode verified current Cosmos3-Super Reasoner 119 episodes, 3808 windows/samples, 448 eval json_validity_rate=0.5112, action_macro_f1=0.0008, transition_accuracy=0.3683, contact_accuracy=0.3214 results/omni_finetune/verified_public/xperience10m_cosmos3_super_reasoner_128ep_test_full_20260607/verified_result_summary.json
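The json_validity_rate column is easiest to read as a fraction of held-out windows. The sketch below assumes that "valid" means the model output parses as a JSON object; the report does not state the exact rule, and a stricter schema check is possible. The field names in the sample outputs are hypothetical.

```python
import json


def json_validity_rate(outputs):
    """Fraction of outputs that parse as a JSON object (assumed validity rule)."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    valid = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            continue
        if isinstance(parsed, dict):
            valid += 1
    return valid / len(outputs)


# Hypothetical model outputs; the field names are illustrative only.
sample_outputs = [
    '{"action": "pour", "contact": true}',
    '{"action": "pour"',          # truncated, does not parse
    "not json",
    '{"transition": false}',
]
sample_rate = json_validity_rate(sample_outputs)

# The reported Cosmos3-Super rate over 448 windows is consistent with
# 229 parseable outputs under this reading.
implied_rate = round(229 / 448, 4)

print(sample_rate, implied_rate)
```

The same arithmetic puts the first Qwen3-Omni LoRA row (0.8750 over 448 windows) at 392 valid outputs.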

Cosmos3-Super Forward-Dynamics LoRA

This is the first verified Cosmos3-Super fine-tuned adapter branch. Its metric is forward-dynamics MSE, so compare it to world-model loss or future-prediction targets, not to Qwen JSON classification accuracy.

scope status run counts metrics source
1 episode verified_smoke Cosmos3-Super Forward-Dynamics Overfit Smoke results/omni_finetune/xperience10m_cosmos3_super_forward_dynamics_lora_overfit_after_qwen_v4_20260608_fsdp8_attn256_gradfix_savefix2/
128 episode verified current Cosmos3-Super Forward-Dynamics LoRA 119 episodes, 3808 windows/samples, 448 eval test_forward_dynamics_mse=3.6853, val_forward_dynamics_mse=4.0082, train_final_loss=1.0785, adapter_parameter_numel=26214400 results/omni_finetune/verified_public/xperience10m_cosmos3_super_forward_dynamics_lora_128ep_train1epoch_256_attn_full8gpu_20260608_eval_test_full_fsdp/verified_result_summary.json
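Because this branch reports mean squared error rather than accuracy, lower is better and the scale depends on the target units. The sketch below shows the generic MSE definition on plain Python lists. The actual loss surface is listed elsewhere in this report as vision_velocity_conditioned_on_camera_pose, so the tensors involved are not simple vectors like these; the numbers in the toy call are made up.

```python
def mean_squared_error(pred, target):
    """Mean of squared element-wise differences between two equal-length sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Toy values for illustration only.
toy_mse = mean_squared_error([0.0, 1.0, 2.0], [0.0, 0.0, 0.0])

# Gap between the reported validation and test MSE of the 128-episode run.
val_minus_test = round(4.0082 - 3.6853, 4)

print(toy_mse, val_minus_test)
```

Both held-out values sit well above train_final_loss=1.0785. The report does not say whether the training loss and the evaluation MSE are computed on the same surface, so that gap should be interpreted with care.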

128-Episode Task Baselines

task simple neural
Action Recognition macro_f1 0.0002 macro_f1 0.0000
Procedure Step Recognition macro_f1 0.0000 macro_f1 0.0000
Action Boundary Detection macro_f1 0.5220 macro_f1 0.4582
Next-Action Prediction macro_f1 0.0002 macro_f1 0.0000
Hand Trajectory Forecasting mpjpe
Contact State Prediction macro_f1 0.5168 macro_f1 0.2195
Object Relevance Prediction micro_f1 0.1822 micro_f1 0.1054
Language Grounding mrr 0.0128
Cross-Modal Retrieval mrr
Cross-Modal Reconstruction r2
Temporal Order Verification f1 0.3271
Multimodal Synchronization Detection f1
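The table mixes macro_f1 and micro_f1, which behave differently when classes are imbalanced. The sketch below computes both for a single-label toy problem. It uses the standard definitions, not the baseline suite's code, and the labels are invented. Whether class imbalance explains the near-zero action scores above is not stated in the report.

```python
def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = sum(per_class) / len(per_class)          # every class weighs the same
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)  # every sample weighs the same
    return macro, micro


# Invented labels: a predictor that always answers the majority class.
y_true = ["idle"] * 8 + ["pour", "cut"]
y_pred = ["idle"] * 10
macro, micro = macro_micro_f1(y_true, y_pred)
print(round(macro, 4), round(micro, 4))
```

In this toy case the majority-class predictor reaches a micro-F1 of 0.8 but a macro-F1 below 0.3, because the two classes it never predicts each contribute a zero.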

Verified Model Branches

branch | backbone | eval samples | held-out episodes | key metrics
Cosmos3-Nano Future-Window World Model | cosmos_world_model | 378 | 14 | future_retrieval_mrr=0.0221, temporal_consistency=0.0952, transition_accuracy=0.9683, contact_accuracy=0.7434
Cosmos3-Super Forward-Dynamics LoRA | cosmos3_super_forward_dynamics | 448 | 14 | adapter_parameter_numel=26214400, test_forward_dynamics_mse=3.6853, train_final_loss=1.0785, val_forward_dynamics_mse=4.0082
Cosmos3-Super Reasoner | cosmos3_super_reasoner | 448 | 14 | json_validity_rate=0.5112, action_macro_f1=0.0008, transition_accuracy=0.3683, contact_accuracy=0.3214
Qwen3-Omni LoRA | qwen3_omni_lora | 448 | 14 | json_validity_rate=0.8750, action_macro_f1=0.0027, transition_accuracy=0.8504, contact_accuracy=0.6451
Qwen3-Omni LoRA | qwen3_omni_lora | 448 | 14 | json_validity_rate=0.8527, action_macro_f1=0.0021, transition_accuracy=0.8281, contact_accuracy=0.6518
Qwen3-Omni LoRA | qwen3_omni_lora | 4032 | 14 | json_validity_rate=1.0000, action_macro_f1=0.0023, transition_accuracy=0.9908, contact_accuracy=0.7865
Qwen3-Omni LoRA | qwen3_omni_lora | 4032 | 14 | json_validity_rate=0.9990, action_macro_f1=0.0029, transition_accuracy=0.9898, contact_accuracy=0.8177
Qwen3-Omni LoRA | qwen3_omni_lora | 448 | 14 | json_validity_rate=0.9978, action_macro_f1=0.0024, transition_accuracy=0.9710, contact_accuracy=0.7188
Qwen3-Omni LoRA | qwen3_omni_lora | 448 | 14 | json_validity_rate=1.0000, action_macro_f1=0.0022, transition_accuracy=0.9732, contact_accuracy=0.7210
Qwen3-Omni LoRA | qwen3_omni_lora | 448 | 14 | json_validity_rate=1.0000, action_macro_f1=0.0019, transition_accuracy=0.9732, contact_accuracy=0.7299

Pending

  • Use the verified Qwen3 v6 rank64/lr5e-5 dense multiscale full-eval package as the current Qwen row; the v5 release tag remains pinned as the previous verified release.
  • Read results/omni_finetune/QWEN3_V5_V6_COMPARISON_20260614.md before claiming v6 is globally better than v5, because v6 improves action macro-F1 and contact accuracy but regresses subtask, next-action, object micro-F1, and JSON validity slightly.
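The v5 versus v6 caution can be checked against the four metrics this report lists for the two multiscale rows (4032 eval windows each). The sketch below computes per-metric deltas and assumes higher is better for all four. Subtask, next-action, and object micro-F1 do not appear in this report's tables, so they are not covered here.

```python
# Values copied from the v5 and v6 multiscale rows in this report.
V5 = {
    "json_validity_rate": 1.0000,
    "action_macro_f1": 0.0023,
    "transition_accuracy": 0.9908,
    "contact_accuracy": 0.7865,
}
V6 = {
    "json_validity_rate": 0.9990,
    "action_macro_f1": 0.0029,
    "transition_accuracy": 0.9898,
    "contact_accuracy": 0.8177,
}


def metric_deltas(old, new):
    """Return new - old for every metric present in both dicts, rounded to 4 places."""
    return {key: round(new[key] - old[key], 4) for key in old if key in new}


deltas = metric_deltas(V5, V6)
improved = sorted(key for key, delta in deltas.items() if delta > 0)
regressed = sorted(key for key, delta in deltas.items() if delta < 0)

print("improved:", improved)
print("regressed:", regressed)
```

On these four metrics, v6 gains on action macro-F1 and contact accuracy and loses about 0.001 on JSON validity and transition accuracy, which matches the mixed picture described above.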