
Two Evidence-Line Result Summary

Generated: 2026-06-22T09:56:30+00:00.

Source matrix: docs/data/task_method_20_result_matrix.json

Interpretation rule: Read the 1-episode line as the inspectable task lab. Read the 128-episode line as the selected comparison surface for metadata/raw baselines, Qwen3-Omni v6 LoRA, Cosmos3-Super, and Cosmos3-Nano.

Read This First

The suite has two public reading lanes. Line 1 is the fully inspectable one-episode task lab. Line 2 is the 128-episode comparison surface for aligned baselines, the Qwen3-Omni series, and the Cosmos3 series. Compare scores within the same lane first.

Score formula: 2 single-episode methods × 20 tasks = 40 records; 7 selected-128 methods × 20 tasks = 140 records; total public matrix = 180/180 scored records.
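The record counts can be re-derived with a few lines of arithmetic. The sketch below is illustrative only: the dictionary names and the `line_totals` helper are hypothetical, and the numbers are copied from this summary rather than read from the source matrix.

```python
# Counts copied from this summary; names here are illustrative, not part of the repo.
TASKS_PER_METHOD = 20
METHODS_PER_LINE = {"1 sample episode": 2, "128 selected episodes": 7}
PROXY_PER_LINE = {"1 sample episode": 0, "128 selected episodes": 6}


def line_totals(line: str) -> dict:
    """Return scored, direct, and proxy record counts for one evidence line."""
    scored = METHODS_PER_LINE[line] * TASKS_PER_METHOD
    proxy = PROXY_PER_LINE[line]
    return {"scored": scored, "direct": scored - proxy, "proxy": proxy}


totals = {line: line_totals(line) for line in METHODS_PER_LINE}
grand = {
    key: sum(per_line[key] for per_line in totals.values())
    for key in ("scored", "direct", "proxy")
}

for line, per_line in totals.items():
    print(line, per_line)
print("total", grand)
```

Keeping the per-line totals separate before summing mirrors the reading rule: the 180-record total is a bookkeeping figure, while comparisons stay inside one line.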

| Line | What the scores mean | Best use | Read separately from |
| --- | --- | --- | --- |
| 1 sample episode | 40/40 direct scores from Minimal and Neural MLP heads on the same 20 task contracts. | Inspect the raw sample, understand file organization, reproduce the 20 task targets, and compare Minimal vs Neural MLP behavior inside one episode. | The selected-128 comparison rows and broader held-out model behavior. |
| 128 selected episodes | 140/140 selected-128 scores across seven methods: 134 direct scores plus 6 documented compact-proxy scores. | Compare same-split metadata/raw baselines, Qwen3-Omni v6, Cosmos3-Super, and Cosmos3-Nano while keeping the 6 compact-proxy cells visible. | Direct raw-target interpretation for the proxy-marked cells. |

Public Score Totals

  • Lines: 2
  • Tasks per method: 20
  • Methods: 9
  • Scored records: 180/180
  • Direct scores: 174
  • Compact-proxy scores: 6 documented cells

Line Ledger And Entry Points

| Line | Methods | Tasks | Scored records | Direct scores | Proxy scores | Primary visuals | Source artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 sample episode | 2 | 20 | 40/40 | 40 | 0 | docs/assets/charts/two_evidence_line_map.svg, docs/assets/charts/single_episode_task_model_radar.svg | docs/data/single_episode_task_model_radar.json, docs/data/two_evidence_line_result_summary.json, results/episode_task_suite/summary_report.json, results/episode_task_suite/feature_manifest.json, docs/single_episode_explorer.html |
| 128 selected episodes | 7 | 20 | 140/140 | 134 | 6 | docs/assets/charts/two_evidence_line_map.svg, docs/assets/charts/episode128_task_model_radar.svg, docs/assets/charts/unified_task_model_radar.svg | docs/data/episode128_task_model_radar.json, docs/data/two_evidence_line_result_summary.json, docs/data/xperience10m_128_episode_feature_index.json, docs/data/omni_model_comparison.json, docs/data/qwen3_omni_run_lineage.json, docs/data/task_method_20_gap_audit.json |

Method Blocks By Evidence Line

| Line | Method block | Methods | Scored records | Direct scores | Proxy scores | Evidence type | Read as |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 sample episode | Task-head baselines | Minimal, Neural MLP | 40/40 | 40 | 0 | Direct target metrics on the public sample windows. | Task construction, local reproducibility, and Minimal-vs-Neural behavior. |
| 128 selected episodes | Aligned baseline heads | 128ep Aligned Simple, 128ep Aligned NN, 128ep Raw Simple, 128ep Raw NN | 80/80 | 74 | 6 | Direct processed-target metrics where available; compact proxies for documented raw-target gaps. | Same-split metadata/raw-feature baseline comparison. |
| 128 selected episodes | Qwen3-Omni series | Qwen3-Omni v6 LoRA | 20/20 | 20 | 0 | Verified selected-128 Qwen3-Omni v6 LoRA plus source-linked task-specific probes. | Trainable Qwen3-Omni diagnostic baseline on the selected-128 surface. |
| 128 selected episodes | Cosmos3 series | Cosmos3-Super Reasoner, Cosmos3-Nano Future Window | 40/40 | 40 | 0 | Verified Cosmos3-Super Reasoner and Cosmos3-Nano Future Window public-safe artifacts. | Cosmos3 reasoner and future-window diagnostics on the selected-128 surface. |

Method Detail By Line

| Line | Method | Method detail | Scored records | Direct scores | Proxy scores |
| --- | --- | --- | --- | --- | --- |
| 1 sample episode | Minimal | Single-episode simple heads over the public sample split. | 20/20 | 20 | 0 |
| 1 sample episode | Neural MLP | Single-episode compact PyTorch MLP heads on the same 20 task contracts. | 20/20 | 20 | 0 |
| 128 selected episodes | 128ep Aligned Simple | 128-episode aligned simple baselines: JSONL metadata/text tasks plus staged sensor-block tasks where the processed target exists. | 20/20 | 19 | 1 |
| 128 selected episodes | 128ep Aligned NN | 128-episode aligned MLP baselines: JSONL metadata/text tasks plus staged sensor-block tasks where the processed target exists. | 20/20 | 19 | 1 |
| 128 selected episodes | 128ep Raw Simple | 128-episode 4430-dim sensor NPZ simple heads; tasks 15 and 19 use compact proxies. | 20/20 | 18 | 2 |
| 128 selected episodes | 128ep Raw NN | 128-episode 4430-dim sensor NPZ MLP heads; tasks 15 and 19 use compact proxies. | 20/20 | 18 | 2 |
| 128 selected episodes | Qwen3-Omni v6 LoRA | Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future/retrieval/sensor-target probes scored from task-specific JSON. | 20/20 | 20 | 0 |
| 128 selected episodes | Cosmos3-Super Reasoner | Verified Cosmos3-Super base-weight Reasoner JSON-task evaluation, plus probes for tasks 5, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, and 20 where public metrics exist. | 20/20 | 20 | 0 |
| 128 selected episodes | Cosmos3-Nano Future Window | Verified Cosmos3-Nano future-window compatibility metrics, plus model-output probes for tasks 2, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, and 19, and a derived task-20 boundary timing probe scored from held-out future-window artifacts. | 20/20 | 20 | 0 |

Related Model Artifacts

| Artifact | Role | Link or path |
| --- | --- | --- |
| Qwen3-Omni v1-v6 run lineage | Explains the LoRA/evaluation version ladder; v6 is the current 20-task matrix row, v5 remains the pinned prior release, and v1-v4 are lineage/ablation evidence. | docs/data/qwen3_omni_run_lineage.json |
| Cosmos3-Super Forward-Dynamics LoRA | Separate fine-tuned adapter artifact for forward-dynamics loss metrics; published with weights/results but not counted as a 20-task matrix method row. | https://huggingface.co/cy0307/ropedia-cosmos3-super-forward-dynamics-lora-128ep |

Proxy-Scored Cells

| Task | Task label | Method | Metric | Reason |
| --- | --- | --- | --- | --- |
| 15 | Interaction Text Prediction | 128ep Raw Simple | macro_f1 | Documented compact proxy completion for this raw128 task axis. |
| 15 | Interaction Text Prediction | 128ep Raw NN | macro_f1 | Documented compact proxy completion for this raw128 task axis. |
| 19 | Camera-View Synchronization Retrieval | 128ep Aligned Simple | mrr | Paired camera-view embeddings are absent from the 128 JSONL/feature export; metadata features retrieve the synchronized same-window depth/audio block as a documented compact synchronization proxy. |
| 19 | Camera-View Synchronization Retrieval | 128ep Aligned NN | mrr | Paired camera-view embeddings are absent from the 128 JSONL/feature export; metadata features retrieve the synchronized same-window depth/audio block as a documented compact synchronization proxy. |
| 19 | Camera-View Synchronization Retrieval | 128ep Raw Simple | mrr | Documented compact proxy completion for this raw128 task axis. |
| 19 | Camera-View Synchronization Retrieval | 128ep Raw NN | mrr | Documented compact proxy completion for this raw128 task axis. |
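As a cross-check, the six proxy cells can be tallied per method and per task and compared with the direct/proxy counts reported for each method. This is a minimal sketch: the `PROXY_CELLS` list is typed in by hand from the rows above, and the variable names are illustrative rather than taken from the repository.

```python
from collections import Counter

TASKS_PER_METHOD = 20

# (task, method, metric) triples copied by hand from the proxy-scored cells.
PROXY_CELLS = [
    (15, "128ep Raw Simple", "macro_f1"),
    (15, "128ep Raw NN", "macro_f1"),
    (19, "128ep Aligned Simple", "mrr"),
    (19, "128ep Aligned NN", "mrr"),
    (19, "128ep Raw Simple", "mrr"),
    (19, "128ep Raw NN", "mrr"),
]

# Count proxy cells per method and per task.
proxy_per_method = Counter(method for _, method, _ in PROXY_CELLS)
proxy_per_task = Counter(task for task, _, _ in PROXY_CELLS)

# Direct scores for a method are its 20 tasks minus its proxy cells.
direct_per_method = {
    method: TASKS_PER_METHOD - count for method, count in proxy_per_method.items()
}

for method in sorted(proxy_per_method):
    print(method, "proxy:", proxy_per_method[method], "direct:", direct_per_method[method])
print("proxy cells per task:", dict(proxy_per_task))
```

The tallies line up with the method detail counts: the two aligned baselines carry one proxy cell each (task 19), and the two raw baselines carry two each (tasks 15 and 19).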

Reading Order

| Step | Reason |
| --- | --- |
| Choose the evidence line | Line 1 answers task-lab and reproducibility questions; line 2 answers selected-128 comparison questions. |
| Open the matching radar | Use the 1-episode radar for Minimal-vs-Neural behavior and the 128-episode radar for metadata/raw baselines, Qwen3-Omni v6, Cosmos3-Super, and Cosmos3-Nano. |
| Inspect the matrix row | Every numeric score is tied to a method, task, metric key, source artifact, and proxy flag. |
| Check proxy cells before interpreting totals | The six compact-proxy cells are numeric but are not direct raw-target measurements. |
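The reading order can be sketched as a small filter over matrix rows. Everything about the record layout here is an assumption: the `ScoreRecord` field names are hypothetical and may not match the keys in docs/data/task_method_20_result_matrix.json, the angle-bracket strings are placeholders for values not given in this summary, and the scores are set to NaN so that no result is implied.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ScoreRecord:
    """Hypothetical row shape: one score tied to its provenance fields."""
    line: str
    method: str
    task: int
    metric_key: str
    score: float
    source_artifact: str
    is_proxy: bool


def read_line(records: List[ScoreRecord], line: str) -> Tuple[List[ScoreRecord], List[ScoreRecord]]:
    """Keep one evidence line, then split its rows into direct and proxy cells."""
    in_line = [r for r in records if r.line == line]
    direct = [r for r in in_line if not r.is_proxy]
    proxy = [r for r in in_line if r.is_proxy]
    return direct, proxy


# Placeholder rows: scores are NaN on purpose, and bracketed strings are stand-ins.
records = [
    ScoreRecord("128 selected episodes", "128ep Raw Simple", 19, "mrr",
                math.nan, "<source artifact path>", True),
    ScoreRecord("128 selected episodes", "128ep Aligned Simple", 15, "<metric key>",
                math.nan, "<source artifact path>", False),
    ScoreRecord("1 sample episode", "Minimal", 1, "<metric key>",
                math.nan, "<source artifact path>", False),
]

direct, proxy = read_line(records, "128 selected episodes")
print("direct:", [(r.method, r.task) for r in direct])
print("proxy:", [(r.method, r.task) for r in proxy])
```

Filtering by line before anything else keeps the two lanes from being mixed, and returning the proxy rows as a separate list keeps them visible instead of folding them into a single total.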

Reader Policy

  • 1 sample episode: Use for task construction, raw-file inspection, local reproducibility, and controlled Minimal-vs-Neural baseline behavior.
  • 128 selected episodes: Use for held-out comparison, metadata/raw-feature baselines, Qwen3-Omni v6 LoRA, Cosmos3-Super Reasoner, Cosmos3-Nano Future Window, and scale-up decisions.
  • Proxy scores: Proxy-scored cells stay numeric only when the source artifact and reason are attached; they should not be read as direct raw-target measurements.
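The policy can be expressed as two small checks. This is a sketch under stated assumptions: both function names and their parameters are hypothetical and do not describe tooling in the repository.

```python
from typing import Optional


def proxy_cell_is_readable(is_proxy: bool,
                           source_artifact: Optional[str],
                           reason: Optional[str]) -> bool:
    """Direct cells always pass; proxy cells need both a source artifact and a reason."""
    if not is_proxy:
        return True
    has_artifact = bool(source_artifact and source_artifact.strip())
    has_reason = bool(reason and reason.strip())
    return has_artifact and has_reason


def same_lane(line_a: str, line_b: str) -> bool:
    """Scores are compared directly only when both come from the same evidence line."""
    return line_a == line_b


# A proxy cell with its documentation attached stays numeric.
print(proxy_cell_is_readable(True, "<source artifact path>", "documented compact proxy"))
# A proxy cell with no reason attached should not be shown as a number.
print(proxy_cell_is_readable(True, "<source artifact path>", ""))
# Cross-lane pairs are read separately, not compared.
print(same_lane("1 sample episode", "128 selected episodes"))
```

A check of this kind would sit in front of any chart or table renderer, so that an undocumented proxy cell is withheld rather than shown next to direct measurements.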