cy0307 committed · Commit ef97957 · verified · 1 Parent(s): 4ad6b11

Add files using upload-large-folder tool

TASK_METHOD_20_GAP_AUDIT.md CHANGED
@@ -1,6 +1,6 @@
1
  # Task Method 20-Result Gap Audit
2
 
3
- Generated: `2026-06-17T13:55:12+00:00`
4
 
5
  This audit is the explicit gap ledger for the 9-method x 20-task result matrix.
6
  It keeps missing cells visible while preserving the rule that a numeric score
@@ -9,8 +9,8 @@ requires a real task target and source artifact.
9
  ## Score Summary
10
 
11
  - Method-task records: `180`
12
- - Numeric scored records: `116`
13
- - Scoreless records: `64`
14
  - Proxy-scored records: `4`
15
  - Source matrix: [`docs/data/task_method_20_result_matrix.json`](docs/data/task_method_20_result_matrix.json)
16
 
@@ -24,7 +24,7 @@ requires a real task target and source artifact.
24
  | 128ep Metadata NN | metadata128_neural_mlp | 6/20 | 14 | 0 | not_supported_by_metadata_only_package: 14, scored: 6 |
25
  | 128ep Raw Simple | raw128_simple | 20/20 | 0 | 2 | proxy_scored: 2, scored: 18 |
26
  | 128ep Raw NN | raw128_neural_mlp | 20/20 | 0 | 2 | proxy_scored: 2, scored: 18 |
27
- | Qwen3-Omni v6 LoRA | qwen3_omni_v6_lora | 10/20 | 10 | 0 | not_evaluated_in_verified_package: 10, scored: 10 |
28
  | Cosmos3-Super Reasoner | cosmos3_super_reasoner | 7/20 | 13 | 0 | not_evaluated_in_verified_package: 13, scored: 7 |
29
  | Cosmos3-Nano Future Window | cosmos3_nano_future_window | 5/20 | 15 | 0 | not_evaluated_in_verified_package: 15, scored: 5 |
30
 
@@ -32,7 +32,7 @@ requires a real task target and source artifact.
32
 
33
  | Status | Count | Next step |
34
  | --- | --- | --- |
35
- | not_evaluated_in_verified_package | 38 | Generate verified model outputs for this task contract and score them against the held-out labels. |
36
  | not_supported_by_metadata_only_package | 22 | Run the task with raw sensor-feature blocks or add a task-specific metadata target builder before assigning a numeric score. |
37
  | unsupported_without_required_target | 4 | Export the missing target field for this 128-episode method, then rerun the same train/validation/test split. |
38
 
@@ -61,12 +61,10 @@ requires a real task target and source artifact.
61
  | 10 | Cross-Modal Reconstruction | Cosmos3-Super Reasoner | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
62
  | 10 | Cross-Modal Reconstruction | Cosmos3-Nano Future Window | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
63
  | 11 | Temporal Order Verification | 128ep Metadata NN | not supported | Run the task with raw sensor-feature blocks or add a task-specific metadata target builder before assigning a numeric score. |
64
- | 11 | Temporal Order Verification | Qwen3-Omni v6 LoRA | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
65
  | 11 | Temporal Order Verification | Cosmos3-Super Reasoner | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
66
  | 11 | Temporal Order Verification | Cosmos3-Nano Future Window | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
67
  | 12 | Multimodal Synchronization Detection | 128ep Metadata Simple | unsupported | Export the missing target field for this 128-episode method, then rerun the same train/validation/test split. |
68
  | 12 | Multimodal Synchronization Detection | 128ep Metadata NN | not supported | Run the task with raw sensor-feature blocks or add a task-specific metadata target builder before assigning a numeric score. |
69
- | 12 | Multimodal Synchronization Detection | Qwen3-Omni v6 LoRA | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
70
  | 12 | Multimodal Synchronization Detection | Cosmos3-Super Reasoner | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
71
  | 12 | Multimodal Synchronization Detection | Cosmos3-Nano Future Window | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
72
  | 13 | Long-Horizon Next-Action Forecasting | 128ep Metadata Simple | not supported | Run the task with raw sensor-feature blocks or add a task-specific metadata target builder before assigning a numeric score. |
@@ -101,7 +99,6 @@ requires a real task target and source artifact.
101
  | 19 | Camera-View Synchronization Retrieval | Cosmos3-Nano Future Window | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
102
  | 20 | Time-to-Next-Transition Regression | 128ep Metadata Simple | not supported | Run the task with raw sensor-feature blocks or add a task-specific metadata target builder before assigning a numeric score. |
103
  | 20 | Time-to-Next-Transition Regression | 128ep Metadata NN | not supported | Run the task with raw sensor-feature blocks or add a task-specific metadata target builder before assigning a numeric score. |
104
- | 20 | Time-to-Next-Transition Regression | Qwen3-Omni v6 LoRA | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
105
  | 20 | Time-to-Next-Transition Regression | Cosmos3-Super Reasoner | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
106
  | 20 | Time-to-Next-Transition Regression | Cosmos3-Nano Future Window | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
107
 
 
1
  # Task Method 20-Result Gap Audit
2
 
3
+ Generated: `2026-06-17T21:17:51+00:00`
4
 
5
  This audit is the explicit gap ledger for the 9-method x 20-task result matrix.
6
  It keeps missing cells visible while preserving the rule that a numeric score
 
9
  ## Score Summary
10
 
11
  - Method-task records: `180`
12
+ - Numeric scored records: `119`
13
+ - Scoreless records: `61`
14
  - Proxy-scored records: `4`
15
  - Source matrix: [`docs/data/task_method_20_result_matrix.json`](docs/data/task_method_20_result_matrix.json)
16
 
 
24
  | 128ep Metadata NN | metadata128_neural_mlp | 6/20 | 14 | 0 | not_supported_by_metadata_only_package: 14, scored: 6 |
25
  | 128ep Raw Simple | raw128_simple | 20/20 | 0 | 2 | proxy_scored: 2, scored: 18 |
26
  | 128ep Raw NN | raw128_neural_mlp | 20/20 | 0 | 2 | proxy_scored: 2, scored: 18 |
27
+ | Qwen3-Omni v6 LoRA | qwen3_omni_v6_lora | 13/20 | 7 | 0 | not_evaluated_in_verified_package: 7, scored: 13 |
28
  | Cosmos3-Super Reasoner | cosmos3_super_reasoner | 7/20 | 13 | 0 | not_evaluated_in_verified_package: 13, scored: 7 |
29
  | Cosmos3-Nano Future Window | cosmos3_nano_future_window | 5/20 | 15 | 0 | not_evaluated_in_verified_package: 15, scored: 5 |
30
 
 
32
 
33
  | Status | Count | Next step |
34
  | --- | --- | --- |
35
+ | not_evaluated_in_verified_package | 35 | Generate verified model outputs for this task contract and score them against the held-out labels. |
36
  | not_supported_by_metadata_only_package | 22 | Run the task with raw sensor-feature blocks or add a task-specific metadata target builder before assigning a numeric score. |
37
  | unsupported_without_required_target | 4 | Export the missing target field for this 128-episode method, then rerun the same train/validation/test split. |
38
 
 
61
  | 10 | Cross-Modal Reconstruction | Cosmos3-Super Reasoner | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
62
  | 10 | Cross-Modal Reconstruction | Cosmos3-Nano Future Window | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
63
  | 11 | Temporal Order Verification | 128ep Metadata NN | not supported | Run the task with raw sensor-feature blocks or add a task-specific metadata target builder before assigning a numeric score. |
 
64
  | 11 | Temporal Order Verification | Cosmos3-Super Reasoner | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
65
  | 11 | Temporal Order Verification | Cosmos3-Nano Future Window | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
66
  | 12 | Multimodal Synchronization Detection | 128ep Metadata Simple | unsupported | Export the missing target field for this 128-episode method, then rerun the same train/validation/test split. |
67
  | 12 | Multimodal Synchronization Detection | 128ep Metadata NN | not supported | Run the task with raw sensor-feature blocks or add a task-specific metadata target builder before assigning a numeric score. |
 
68
  | 12 | Multimodal Synchronization Detection | Cosmos3-Super Reasoner | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
69
  | 12 | Multimodal Synchronization Detection | Cosmos3-Nano Future Window | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
70
  | 13 | Long-Horizon Next-Action Forecasting | 128ep Metadata Simple | not supported | Run the task with raw sensor-feature blocks or add a task-specific metadata target builder before assigning a numeric score. |
 
99
  | 19 | Camera-View Synchronization Retrieval | Cosmos3-Nano Future Window | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
100
  | 20 | Time-to-Next-Transition Regression | 128ep Metadata Simple | not supported | Run the task with raw sensor-feature blocks or add a task-specific metadata target builder before assigning a numeric score. |
101
  | 20 | Time-to-Next-Transition Regression | 128ep Metadata NN | not supported | Run the task with raw sensor-feature blocks or add a task-specific metadata target builder before assigning a numeric score. |
 
102
  | 20 | Time-to-Next-Transition Regression | Cosmos3-Super Reasoner | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
103
  | 20 | Time-to-Next-Transition Regression | Cosmos3-Nano Future Window | not evaluated | Generate verified model outputs for this task contract and score them against the held-out labels. |
104
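The updated counts in this audit should stay mutually consistent: scored plus scoreless records equal the 180 method-task records, and the per-status gap counts sum to the scoreless total. A minimal sketch of that check follows. The key names are taken from the JSON fragments shown in this commit, but the flat dictionary layout and the `check_gap_ledger` helper are assumptions for illustration, not the repository's actual audit code.

```python
# Hypothetical consistency check for the gap ledger counts.
# Values are copied from the updated audit; the flat layout is an assumption.
summary = {
    "method_task_record_count": 180,
    "scored_method_task_count": 119,
    "scoreless_method_task_count": 61,
}
missing_by_status = {
    "not_evaluated_in_verified_package": 35,
    "not_supported_by_metadata_only_package": 22,
    "unsupported_without_required_target": 4,
}


def check_gap_ledger(summary: dict, missing_by_status: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    total = summary["method_task_record_count"]
    scored = summary["scored_method_task_count"]
    scoreless = summary["scoreless_method_task_count"]
    if scored + scoreless != total:
        problems.append(f"scored + scoreless = {scored + scoreless}, expected {total}")
    gap_sum = sum(missing_by_status.values())
    if gap_sum != scoreless:
        problems.append(f"status counts sum to {gap_sum}, expected {scoreless}")
    return problems


print(check_gap_ledger(summary, missing_by_status))
```

The same check also passes for the previous revision's counts (116 scored, 64 scoreless, with 38 + 22 + 4 gap records), which suggests the three newly scored Qwen3-Omni cells moved between buckets without changing the record total.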
 
TASK_METHOD_20_RESULT_MATRIX.md CHANGED
@@ -12,7 +12,7 @@ Legend: `score` = numeric task score, `proxy` = documented raw128 compact proxy
12
  | 128ep Metadata NN | 20 | 6 | 0 | 14 | not supported 14, scored 6 |
13
  | 128ep Raw Simple | 20 | 20 | 2 | 0 | proxy scored 2, scored 18 |
14
  | 128ep Raw NN | 20 | 20 | 2 | 0 | proxy scored 2, scored 18 |
15
- | Qwen3-Omni v6 LoRA | 20 | 10 | 0 | 10 | not evaluated 10, scored 10 |
16
  | Cosmos3-Super Reasoner | 20 | 7 | 0 | 13 | not evaluated 13, scored 7 |
17
  | Cosmos3-Nano Future Window | 20 | 5 | 0 | 15 | not evaluated 15, scored 5 |
18
 
@@ -28,8 +28,8 @@ Legend: `score` = numeric task score, `proxy` = documented raw128 compact proxy
28
  | 08 | Language Grounding | score | score | score | not supported | score | score | not evaluated | not evaluated | not evaluated |
29
  | 09 | Cross-Modal Retrieval | score | score | unsupported | not supported | score | score | not evaluated | not evaluated | score |
30
  | 10 | Cross-Modal Reconstruction | score | score | unsupported | not supported | score | score | not evaluated | not evaluated | not evaluated |
31
- | 11 | Temporal Order Verification | score | score | score | not supported | score | score | not evaluated | not evaluated | not evaluated |
32
- | 12 | Multimodal Synchronization Detection | score | score | unsupported | not supported | score | score | not evaluated | not evaluated | not evaluated |
33
  | 13 | Long-Horizon Next-Action Forecasting | score | score | not supported | not supported | score | score | score | not evaluated | not evaluated |
34
  | 14 | Long-Horizon Next-Subtask Forecasting | score | score | not supported | not supported | score | score | score | not evaluated | not evaluated |
35
  | 15 | Interaction Text Prediction | score | score | not supported | not supported | proxy | proxy | not evaluated | not evaluated | not evaluated |
@@ -37,6 +37,6 @@ Legend: `score` = numeric task score, `proxy` = documented raw128 compact proxy
37
  | 17 | Future Object-Set Forecasting | score | score | not supported | not supported | score | score | score | not evaluated | not evaluated |
38
  | 18 | IMU-to-Hand Pose Reconstruction | score | score | not supported | not supported | score | score | not evaluated | not evaluated | not evaluated |
39
  | 19 | Camera-View Synchronization Retrieval | score | score | not supported | not supported | proxy | proxy | not evaluated | not evaluated | not evaluated |
40
- | 20 | Time-to-Next-Transition Regression | score | score | not supported | not supported | score | score | not evaluated | not evaluated | not evaluated |
41
 
42
  Sources and raw values are in `docs/data/task_method_20_result_matrix.json` and `docs/data/unified_task_model_radar.json`.
 
12
  | 128ep Metadata NN | 20 | 6 | 0 | 14 | not supported 14, scored 6 |
13
  | 128ep Raw Simple | 20 | 20 | 2 | 0 | proxy scored 2, scored 18 |
14
  | 128ep Raw NN | 20 | 20 | 2 | 0 | proxy scored 2, scored 18 |
15
+ | Qwen3-Omni v6 LoRA | 20 | 13 | 0 | 7 | not evaluated 7, scored 13 |
16
  | Cosmos3-Super Reasoner | 20 | 7 | 0 | 13 | not evaluated 13, scored 7 |
17
  | Cosmos3-Nano Future Window | 20 | 5 | 0 | 15 | not evaluated 15, scored 5 |
18
 
 
28
  | 08 | Language Grounding | score | score | score | not supported | score | score | not evaluated | not evaluated | not evaluated |
29
  | 09 | Cross-Modal Retrieval | score | score | unsupported | not supported | score | score | not evaluated | not evaluated | score |
30
  | 10 | Cross-Modal Reconstruction | score | score | unsupported | not supported | score | score | not evaluated | not evaluated | not evaluated |
31
+ | 11 | Temporal Order Verification | score | score | score | not supported | score | score | score | not evaluated | not evaluated |
32
+ | 12 | Multimodal Synchronization Detection | score | score | unsupported | not supported | score | score | score | not evaluated | not evaluated |
33
  | 13 | Long-Horizon Next-Action Forecasting | score | score | not supported | not supported | score | score | score | not evaluated | not evaluated |
34
  | 14 | Long-Horizon Next-Subtask Forecasting | score | score | not supported | not supported | score | score | score | not evaluated | not evaluated |
35
  | 15 | Interaction Text Prediction | score | score | not supported | not supported | proxy | proxy | not evaluated | not evaluated | not evaluated |
 
37
  | 17 | Future Object-Set Forecasting | score | score | not supported | not supported | score | score | score | not evaluated | not evaluated |
38
  | 18 | IMU-to-Hand Pose Reconstruction | score | score | not supported | not supported | score | score | not evaluated | not evaluated | not evaluated |
39
  | 19 | Camera-View Synchronization Retrieval | score | score | not supported | not supported | proxy | proxy | not evaluated | not evaluated | not evaluated |
40
+ | 20 | Time-to-Next-Transition Regression | score | score | not supported | not supported | score | score | score | not evaluated | not evaluated |
41
 
42
  Sources and raw values are in `docs/data/task_method_20_result_matrix.json` and `docs/data/unified_task_model_radar.json`.
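The per-method summary counts can be cross-checked against the matrix rows by tallying the status cells. The sketch below parses one Markdown row from the updated matrix and counts each status. It assumes the first two cells are the task number and task label and the remaining nine are method statuses; `tally_row` is a hypothetical helper, not part of the repository.

```python
from collections import Counter


def tally_row(row: str) -> Counter:
    """Count status cells in one Markdown matrix row.

    Assumes the first two cells are the task number and task label.
    """
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    return Counter(cells[2:])


# Updated row for task 11, copied from the matrix in this commit.
row = (
    "| 11 | Temporal Order Verification | score | score | score | not supported "
    "| score | score | score | not evaluated | not evaluated |"
)
counts = tally_row(row)
print(counts)
```

Summing such tallies down a column, rather than across a row, would give the per-method scored and scoreless totals reported in the summary table.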
assets/charts/episode128_task_model_radar.svg CHANGED
assets/charts/unified_task_model_radar.svg CHANGED
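The radar charts plot normalized scores. According to the `normalization_policy` recorded in `data/episode128_task_model_radar.json`, bounded higher-is-better metrics are clipped to [0, 1], and lower-error metrics are converted to `best_observed_value / raw_value` within the same task. A minimal sketch of both rules follows. The function names are hypothetical, and rejecting a non-positive error value is an assumption, since the policy text does not say how a zero error is handled.

```python
def normalize_higher_is_better(raw: float) -> float:
    """Clip a bounded higher-is-better metric (for example F1) to [0, 1]."""
    return min(1.0, max(0.0, raw))


def normalize_lower_is_better(raw: float, best_observed: float) -> float:
    """Convert a lower-is-better error metric to best_observed / raw.

    best_observed is the smallest error seen for the same task.
    Rejecting raw <= 0 is an assumption made for this sketch.
    """
    if raw <= 0:
        raise ValueError("raw error must be positive")
    return best_observed / raw


# The temporal-order F1 from this commit is already in [0, 1], so it is unchanged.
f1_normalized = normalize_higher_is_better(0.40984631701404173)

# Illustrative values only: an error four times the best observed maps to 0.25.
mae_normalized = normalize_lower_is_better(raw=40.0, best_observed=10.0)
print(f1_normalized, mae_normalized)
```

Under this rule the best method on a lower-is-better task plots at 1.0 and every other method plots at a fraction of that, which is consistent with the Qwen3-Omni time-to-transition MAE of 134.07 receiving a normalized score near 0.079.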
data/episode128_task_model_radar.json CHANGED
@@ -1,12 +1,12 @@
1
  {
2
  "title": "128-Episode 20-Task Radar",
3
  "status": "pass",
4
- "generated_at_utc": "2026-06-17T13:55:02+00:00",
5
  "description": "Selected 128-episode metadata/raw baselines plus verified Qwen3/Cosmos branches. Every method has 20 records; numeric scores appear only where the public artifact produced that task target.",
6
  "task_count": 20,
7
  "method_count": 7,
8
  "method_task_record_count": 140,
9
- "scored_method_task_count": 76,
10
  "normalization_policy": {
11
  "higher_is_better": "bounded metrics are plotted directly on 0-1 axes after clipping to [0, 1]",
12
  "lower_is_better": "lower-error metrics are converted to best_observed_value / raw_value within the same task",
@@ -127,17 +127,17 @@
127
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
128
  "plotted_as": "colored point overlay",
129
  "result_record_count": 20,
130
- "scored_task_count": 10,
131
- "covered_task_count": 10,
132
  "proxy_scored_task_count": 0,
133
- "scoreless_task_count": 10,
134
  "unsupported_task_count": 0,
135
- "not_evaluated_task_count": 10,
136
  "status_counts": {
137
- "not_evaluated_in_verified_package": 10,
138
- "scored": 10
139
  },
140
- "coverage_fraction": 0.5,
141
  "result_record_fraction": 1.0
142
  },
143
  {
@@ -1157,15 +1157,15 @@
1157
  "status_label": "scored"
1158
  },
1159
  "qwen3_omni_v6_lora": {
1160
- "raw": null,
1161
- "metric_key": "f1",
1162
- "source": null,
1163
  "scope": "multi_episode_128_partial_model_overlay",
1164
- "status": "not_evaluated_in_verified_package",
1165
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1166
- "normalized_score": null,
1167
- "raw_text": "n/a",
1168
- "status_label": "not evaluated"
1169
  },
1170
  "cosmos3_super_reasoner": {
1171
  "raw": null,
@@ -1248,15 +1248,15 @@
1248
  "status_label": "scored"
1249
  },
1250
  "qwen3_omni_v6_lora": {
1251
- "raw": null,
1252
- "metric_key": "f1",
1253
- "source": null,
1254
  "scope": "multi_episode_128_partial_model_overlay",
1255
- "status": "not_evaluated_in_verified_package",
1256
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1257
- "normalized_score": null,
1258
- "raw_text": "n/a",
1259
- "status_label": "not evaluated"
1260
  },
1261
  "cosmos3_super_reasoner": {
1262
  "raw": null,
@@ -1976,15 +1976,15 @@
1976
  "status_label": "scored"
1977
  },
1978
  "qwen3_omni_v6_lora": {
1979
- "raw": null,
1980
- "metric_key": "mae",
1981
- "source": null,
1982
  "scope": "multi_episode_128_partial_model_overlay",
1983
- "status": "not_evaluated_in_verified_package",
1984
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1985
- "normalized_score": null,
1986
- "raw_text": "n/a",
1987
- "status_label": "not evaluated"
1988
  },
1989
  "cosmos3_super_reasoner": {
1990
  "raw": null,
@@ -3350,17 +3350,17 @@
3350
  "task_label": "Temporal Order Verification",
3351
  "series_id": "qwen3_omni_v6_lora",
3352
  "method": "Qwen3-Omni v6 LoRA",
3353
- "status": "not_evaluated_in_verified_package",
3354
- "status_label": "not evaluated",
3355
- "scored": false,
3356
  "proxy_scored": false,
3357
- "raw": null,
3358
- "raw_text": "n/a",
3359
- "normalized_score": null,
3360
- "metric_key": "f1",
3361
- "source": null,
3362
  "scope": "multi_episode_128_partial_model_overlay",
3363
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
3364
  },
3365
  {
3366
  "task_number": 11,
@@ -3476,17 +3476,17 @@
3476
  "task_label": "Multimodal Synchronization Detection",
3477
  "series_id": "qwen3_omni_v6_lora",
3478
  "method": "Qwen3-Omni v6 LoRA",
3479
- "status": "not_evaluated_in_verified_package",
3480
- "status_label": "not evaluated",
3481
- "scored": false,
3482
  "proxy_scored": false,
3483
- "raw": null,
3484
- "raw_text": "n/a",
3485
- "normalized_score": null,
3486
- "metric_key": "f1",
3487
- "source": null,
3488
  "scope": "multi_episode_128_partial_model_overlay",
3489
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
3490
  },
3491
  {
3492
  "task_number": 12,
@@ -4484,17 +4484,17 @@
4484
  "task_label": "Time-to-Next-Transition Regression",
4485
  "series_id": "qwen3_omni_v6_lora",
4486
  "method": "Qwen3-Omni v6 LoRA",
4487
- "status": "not_evaluated_in_verified_package",
4488
- "status_label": "not evaluated",
4489
- "scored": false,
4490
  "proxy_scored": false,
4491
- "raw": null,
4492
- "raw_text": "n/a",
4493
- "normalized_score": null,
4494
- "metric_key": "mae",
4495
- "source": null,
4496
  "scope": "multi_episode_128_partial_model_overlay",
4497
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
4498
  },
4499
  {
4500
  "task_number": 20,
 
1
  {
2
  "title": "128-Episode 20-Task Radar",
3
  "status": "pass",
4
+ "generated_at_utc": "2026-06-17T21:17:41+00:00",
5
  "description": "Selected 128-episode metadata/raw baselines plus verified Qwen3/Cosmos branches. Every method has 20 records; numeric scores appear only where the public artifact produced that task target.",
6
  "task_count": 20,
7
  "method_count": 7,
8
  "method_task_record_count": 140,
9
+ "scored_method_task_count": 79,
10
  "normalization_policy": {
11
  "higher_is_better": "bounded metrics are plotted directly on 0-1 axes after clipping to [0, 1]",
12
  "lower_is_better": "lower-error metrics are converted to best_observed_value / raw_value within the same task",
 
127
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
128
  "plotted_as": "colored point overlay",
129
  "result_record_count": 20,
130
+ "scored_task_count": 13,
131
+ "covered_task_count": 13,
132
  "proxy_scored_task_count": 0,
133
+ "scoreless_task_count": 7,
134
  "unsupported_task_count": 0,
135
+ "not_evaluated_task_count": 7,
136
  "status_counts": {
137
+ "not_evaluated_in_verified_package": 7,
138
+ "scored": 13
139
  },
140
+ "coverage_fraction": 0.65,
141
  "result_record_fraction": 1.0
142
  },
143
  {
 
1157
  "status_label": "scored"
1158
  },
1159
  "qwen3_omni_v6_lora": {
1160
+ "raw": 0.40984631701404173,
1161
+ "metric_key": "temporal_order_f1",
1162
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/temporal_order/metrics.json",
1163
  "scope": "multi_episode_128_partial_model_overlay",
1164
+ "status": "scored",
1165
+ "reason": null,
1166
+ "normalized_score": 0.40984631701404173,
1167
+ "raw_text": "0.4098",
1168
+ "status_label": "scored"
1169
  },
1170
  "cosmos3_super_reasoner": {
1171
  "raw": null,
 
1248
  "status_label": "scored"
1249
  },
1250
  "qwen3_omni_v6_lora": {
1251
+ "raw": 0.3344936184319576,
1252
+ "metric_key": "misalignment_detection_f1",
1253
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/misalignment_detection/metrics.json",
1254
  "scope": "multi_episode_128_partial_model_overlay",
1255
+ "status": "scored",
1256
+ "reason": null,
1257
+ "normalized_score": 0.3344936184319576,
1258
+ "raw_text": "0.3345",
1259
+ "status_label": "scored"
1260
  },
1261
  "cosmos3_super_reasoner": {
1262
  "raw": null,
 
1976
  "status_label": "scored"
1977
  },
1978
  "qwen3_omni_v6_lora": {
1979
+ "raw": 134.0687422166874,
1980
+ "metric_key": "time_to_transition_mae",
1981
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/time_to_transition/metrics.json",
1982
  "scope": "multi_episode_128_partial_model_overlay",
1983
+ "status": "scored",
1984
+ "reason": null,
1985
+ "normalized_score": 0.07859666766782253,
1986
+ "raw_text": "134.07",
1987
+ "status_label": "scored"
1988
  },
1989
  "cosmos3_super_reasoner": {
1990
  "raw": null,
 
3350
  "task_label": "Temporal Order Verification",
3351
  "series_id": "qwen3_omni_v6_lora",
3352
  "method": "Qwen3-Omni v6 LoRA",
3353
+ "status": "scored",
3354
+ "status_label": "scored",
3355
+ "scored": true,
3356
  "proxy_scored": false,
3357
+ "raw": 0.40984631701404173,
3358
+ "raw_text": "0.4098",
3359
+ "normalized_score": 0.40984631701404173,
3360
+ "metric_key": "temporal_order_f1",
3361
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/temporal_order/metrics.json",
3362
  "scope": "multi_episode_128_partial_model_overlay",
3363
+ "reason": null
3364
  },
3365
  {
3366
  "task_number": 11,
 
3476
  "task_label": "Multimodal Synchronization Detection",
3477
  "series_id": "qwen3_omni_v6_lora",
3478
  "method": "Qwen3-Omni v6 LoRA",
3479
+ "status": "scored",
3480
+ "status_label": "scored",
3481
+ "scored": true,
3482
  "proxy_scored": false,
3483
+ "raw": 0.3344936184319576,
3484
+ "raw_text": "0.3345",
3485
+ "normalized_score": 0.3344936184319576,
3486
+ "metric_key": "misalignment_detection_f1",
3487
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/misalignment_detection/metrics.json",
3488
  "scope": "multi_episode_128_partial_model_overlay",
3489
+ "reason": null
3490
  },
3491
  {
3492
  "task_number": 12,
 
4484
  "task_label": "Time-to-Next-Transition Regression",
4485
  "series_id": "qwen3_omni_v6_lora",
4486
  "method": "Qwen3-Omni v6 LoRA",
4487
+ "status": "scored",
4488
+ "status_label": "scored",
4489
+ "scored": true,
4490
  "proxy_scored": false,
4491
+ "raw": 134.0687422166874,
4492
+ "raw_text": "134.07",
4493
+ "normalized_score": 0.07859666766782253,
4494
+ "metric_key": "time_to_transition_mae",
4495
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/time_to_transition/metrics.json",
4496
  "scope": "multi_episode_128_partial_model_overlay",
4497
+ "reason": null
4498
  },
4499
  {
4500
  "task_number": 20,
data/publication_audit.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
  "status": "pass",
3
- "generated_at_utc": "2026-06-17T21:12:50+00:00",
4
  "checks": [
5
  {
6
  "name": "required_publication_assets_present",
@@ -206,8 +206,8 @@
206
  "github_repo": {
207
  "root": "repo",
208
  "exists": true,
209
- "file_count": 1232,
210
- "text_file_count": 1034,
211
  "largest_file": {
212
  "path": "results/episode_task_suite/modality_reconstruction/predictions.npz",
213
  "bytes": 55702978
 
1
  {
2
  "status": "pass",
3
+ "generated_at_utc": "2026-06-17T21:25:35+00:00",
4
  "checks": [
5
  {
6
  "name": "required_publication_assets_present",
 
206
  "github_repo": {
207
  "root": "repo",
208
  "exists": true,
209
+ "file_count": 1250,
210
+ "text_file_count": 1052,
211
  "largest_file": {
212
  "path": "results/episode_task_suite/modality_reconstruction/predictions.npz",
213
  "bytes": 55702978
data/single_episode_task_model_radar.json CHANGED
@@ -1,7 +1,7 @@
1
  {
2
  "title": "Single-Episode 20-Task Radar",
3
  "status": "pass",
4
- "generated_at_utc": "2026-06-17T13:55:02+00:00",
5
  "description": "Minimal and Neural MLP baselines on the one public sample episode, both scored on all 20 task contracts.",
6
  "task_count": 20,
7
  "method_count": 2,
 
1
  {
2
  "title": "Single-Episode 20-Task Radar",
3
  "status": "pass",
4
+ "generated_at_utc": "2026-06-17T21:17:41+00:00",
5
  "description": "Minimal and Neural MLP baselines on the one public sample episode, both scored on all 20 task contracts.",
6
  "task_count": 20,
7
  "method_count": 2,
data/task_method_20_gap_audit.json CHANGED
@@ -1,10 +1,10 @@
1
  {
2
- "generated_at_utc": "2026-06-17T13:55:12+00:00",
3
  "immediate_actions": [
4
  {
5
  "artifact": "docs/data/task_method_20_gap_audit.json",
6
  "id": "gap_audit",
7
- "purpose": "Keep the 64 scoreless cells visible and reproducible."
8
  },
9
  {
10
  "artifact": "scripts/omni/score_model_output_probes.py",
@@ -101,11 +101,11 @@
101
  "proxy_scored_task_count": 0,
102
  "result_record_count": 20,
103
  "scope": "128 selected episodes, held-out test",
104
- "scored_task_count": 10,
105
- "scoreless_task_count": 10,
106
  "status_counts": {
107
- "not_evaluated_in_verified_package": 10,
108
- "scored": 10
109
  }
110
  },
111
  "raw128_neural_mlp": {
@@ -140,10 +140,10 @@
140
  "cosmos3_super_reasoner": 13,
141
  "metadata128_neural_mlp": 14,
142
  "metadata128_simple": 12,
143
- "qwen3_omni_v6_lora": 10
144
  },
145
  "missing_by_status": {
146
- "not_evaluated_in_verified_package": 38,
147
  "not_supported_by_metadata_only_package": 22,
148
  "unsupported_without_required_target": 4
149
  },
@@ -183,15 +183,13 @@
183
  "11 Temporal Order Verification": [
184
  "cosmos3_nano_future_window",
185
  "cosmos3_super_reasoner",
186
- "metadata128_neural_mlp",
187
- "qwen3_omni_v6_lora"
188
  ],
189
  "12 Multimodal Synchronization Detection": [
190
  "cosmos3_nano_future_window",
191
  "cosmos3_super_reasoner",
192
  "metadata128_neural_mlp",
193
- "metadata128_simple",
194
- "qwen3_omni_v6_lora"
195
  ],
196
  "13 Long-Horizon Next-Action Forecasting": [
197
  "cosmos3_nano_future_window",
@@ -241,8 +239,7 @@
241
  "cosmos3_nano_future_window",
242
  "cosmos3_super_reasoner",
243
  "metadata128_neural_mlp",
244
- "metadata128_simple",
245
- "qwen3_omni_v6_lora"
246
  ]
247
  },
248
  "missing_records": [
@@ -519,19 +516,6 @@
519
  "task_label": "Temporal Order Verification",
520
  "task_number": 11
521
  },
522
- {
523
- "method": "Qwen3-Omni v6 LoRA",
524
- "metric_key": "f1",
525
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
526
- "recommended_next_step": "Generate verified model outputs for this task contract and score them against the held-out labels.",
527
- "scope": "multi_episode_128_partial_model_overlay",
528
- "series_id": "qwen3_omni_v6_lora",
529
- "status": "not_evaluated_in_verified_package",
530
- "status_label": "not evaluated",
531
- "task_id": "temporal_order",
532
- "task_label": "Temporal Order Verification",
533
- "task_number": 11
534
- },
535
  {
536
  "method": "Cosmos3-Super Reasoner",
537
  "metric_key": "f1",
@@ -584,19 +568,6 @@
584
  "task_label": "Multimodal Synchronization Detection",
585
  "task_number": 12
586
  },
587
- {
588
- "method": "Qwen3-Omni v6 LoRA",
589
- "metric_key": "f1",
590
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
591
- "recommended_next_step": "Generate verified model outputs for this task contract and score them against the held-out labels.",
592
- "scope": "multi_episode_128_partial_model_overlay",
593
- "series_id": "qwen3_omni_v6_lora",
594
- "status": "not_evaluated_in_verified_package",
595
- "status_label": "not evaluated",
596
- "task_id": "misalignment_detection",
597
- "task_label": "Multimodal Synchronization Detection",
598
- "task_number": 12
599
- },
600
  {
601
  "method": "Cosmos3-Super Reasoner",
602
  "metric_key": "f1",
@@ -1039,19 +1010,6 @@
1039
  "task_label": "Time-to-Next-Transition Regression",
1040
  "task_number": 20
1041
  },
1042
- {
1043
- "method": "Qwen3-Omni v6 LoRA",
1044
- "metric_key": "mae",
1045
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1046
- "recommended_next_step": "Generate verified model outputs for this task contract and score them against the held-out labels.",
1047
- "scope": "multi_episode_128_partial_model_overlay",
1048
- "series_id": "qwen3_omni_v6_lora",
1049
- "status": "not_evaluated_in_verified_package",
1050
- "status_label": "not evaluated",
1051
- "task_id": "time_to_transition",
1052
- "task_label": "Time-to-Next-Transition Regression",
1053
- "task_number": 20
1054
- },
1055
  {
1056
  "method": "Cosmos3-Super Reasoner",
1057
  "metric_key": "mae",
@@ -1125,8 +1083,8 @@
1125
  "method_count": 9,
1126
  "method_task_record_count": 180,
1127
  "proxy_scored_method_task_count": 4,
1128
- "scored_method_task_count": 116,
1129
- "scoreless_method_task_count": 64,
1130
  "task_count": 20
1131
  },
1132
  "source_matrix": "docs/data/task_method_20_result_matrix.json",
 
1
  {
2
+ "generated_at_utc": "2026-06-17T21:17:51+00:00",
3
  "immediate_actions": [
4
  {
5
  "artifact": "docs/data/task_method_20_gap_audit.json",
6
  "id": "gap_audit",
7
+ "purpose": "Keep the 61 scoreless cells visible and reproducible."
8
  },
9
  {
10
  "artifact": "scripts/omni/score_model_output_probes.py",
 
101
  "proxy_scored_task_count": 0,
102
  "result_record_count": 20,
103
  "scope": "128 selected episodes, held-out test",
104
+ "scored_task_count": 13,
105
+ "scoreless_task_count": 7,
106
  "status_counts": {
107
+ "not_evaluated_in_verified_package": 7,
108
+ "scored": 13
109
  }
110
  },
111
  "raw128_neural_mlp": {
 
140
  "cosmos3_super_reasoner": 13,
141
  "metadata128_neural_mlp": 14,
142
  "metadata128_simple": 12,
143
+ "qwen3_omni_v6_lora": 7
144
  },
145
  "missing_by_status": {
146
+ "not_evaluated_in_verified_package": 35,
147
  "not_supported_by_metadata_only_package": 22,
148
  "unsupported_without_required_target": 4
149
  },
 
183
  "11 Temporal Order Verification": [
184
  "cosmos3_nano_future_window",
185
  "cosmos3_super_reasoner",
186
+ "metadata128_neural_mlp"
 
187
  ],
188
  "12 Multimodal Synchronization Detection": [
189
  "cosmos3_nano_future_window",
190
  "cosmos3_super_reasoner",
191
  "metadata128_neural_mlp",
192
+ "metadata128_simple"
 
193
  ],
194
  "13 Long-Horizon Next-Action Forecasting": [
195
  "cosmos3_nano_future_window",
 
239
  "cosmos3_nano_future_window",
240
  "cosmos3_super_reasoner",
241
  "metadata128_neural_mlp",
242
+ "metadata128_simple"
 
243
  ]
244
  },
245
  "missing_records": [
 
516
  "task_label": "Temporal Order Verification",
517
  "task_number": 11
518
  },
519
  {
520
  "method": "Cosmos3-Super Reasoner",
521
  "metric_key": "f1",
 
568
  "task_label": "Multimodal Synchronization Detection",
569
  "task_number": 12
570
  },
571
  {
572
  "method": "Cosmos3-Super Reasoner",
573
  "metric_key": "f1",
 
1010
  "task_label": "Time-to-Next-Transition Regression",
1011
  "task_number": 20
1012
  },
1013
  {
1014
  "method": "Cosmos3-Super Reasoner",
1015
  "metric_key": "mae",
 
1083
  "method_count": 9,
1084
  "method_task_record_count": 180,
1085
  "proxy_scored_method_task_count": 4,
1086
+ "scored_method_task_count": 119,
1087
+ "scoreless_method_task_count": 61,
1088
  "task_count": 20
1089
  },
1090
  "source_matrix": "docs/data/task_method_20_result_matrix.json",
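
Review note: the gap-audit hunks above change three linked counts together (scored 116 to 119, scoreless 64 to 61, `not_evaluated_in_verified_package` 38 to 35). A minimal sketch of a consistency check over those numbers follows. The function name `check_gap_counts` and the way the two dictionaries are passed in are assumptions for illustration; the diff does not show how the repository validates these fields.

```python
# Hypothetical helper (not from the repository): checks that the summary
# counts shown in the gap-audit diff agree with each other.

def check_gap_counts(summary, missing_by_status):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    total = summary["method_task_record_count"]
    scored = summary["scored_method_task_count"]
    scoreless = summary["scoreless_method_task_count"]

    # Every method-task cell is either scored or scoreless.
    if scored + scoreless != total:
        problems.append(f"scored + scoreless = {scored + scoreless}, expected {total}")

    # The per-status breakdown must account for every scoreless cell.
    status_total = sum(missing_by_status.values())
    if status_total != scoreless:
        problems.append(f"status breakdown sums to {status_total}, expected {scoreless}")

    # The matrix is methods x tasks.
    grid = summary["method_count"] * summary["task_count"]
    if grid != total:
        problems.append(f"method_count * task_count = {grid}, expected {total}")
    return problems


# Values copied from the "+" side of the diff above.
summary = {
    "method_count": 9,
    "task_count": 20,
    "method_task_record_count": 180,
    "scored_method_task_count": 119,
    "scoreless_method_task_count": 61,
}
missing_by_status = {
    "not_evaluated_in_verified_package": 35,
    "not_supported_by_metadata_only_package": 22,
    "unsupported_without_required_target": 4,
}

print(check_gap_counts(summary, missing_by_status))  # → []
```

Mixing a stale value into the new summary (for example the old scored count of 116 with the new scoreless count of 61) makes the first check fail, which is the kind of partial regeneration this check is meant to catch.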
data/task_method_20_result_matrix.json CHANGED
@@ -1,11 +1,11 @@
1
  {
2
  "title": "Task Method 20-Result Matrix",
3
  "status": "pass",
4
- "generated_at_utc": "2026-06-17T13:55:02+00:00",
5
  "task_count": 20,
6
  "method_count": 9,
7
  "method_task_record_count": 180,
8
- "scored_method_task_count": 116,
9
  "series": [
10
  {
11
  "id": "minimal",
@@ -161,17 +161,17 @@
161
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
162
  "plotted_as": "colored point overlay",
163
  "result_record_count": 20,
164
- "scored_task_count": 10,
165
- "covered_task_count": 10,
166
  "proxy_scored_task_count": 0,
167
- "scoreless_task_count": 10,
168
  "unsupported_task_count": 0,
169
- "not_evaluated_task_count": 10,
170
  "status_counts": {
171
- "not_evaluated_in_verified_package": 10,
172
- "scored": 10
173
  },
174
- "coverage_fraction": 0.5,
175
  "result_record_fraction": 1.0
176
  },
177
  {
@@ -1958,17 +1958,17 @@
1958
  "task_label": "Temporal Order Verification",
1959
  "series_id": "qwen3_omni_v6_lora",
1960
  "method": "Qwen3-Omni v6 LoRA",
1961
- "status": "not_evaluated_in_verified_package",
1962
- "status_label": "not evaluated",
1963
- "scored": false,
1964
  "proxy_scored": false,
1965
- "raw": null,
1966
- "raw_text": "n/a",
1967
- "normalized_score": null,
1968
- "metric_key": "f1",
1969
- "source": null,
1970
  "scope": "multi_episode_128_partial_model_overlay",
1971
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
1972
  },
1973
  {
1974
  "task_number": 11,
@@ -2120,17 +2120,17 @@
2120
  "task_label": "Multimodal Synchronization Detection",
2121
  "series_id": "qwen3_omni_v6_lora",
2122
  "method": "Qwen3-Omni v6 LoRA",
2123
- "status": "not_evaluated_in_verified_package",
2124
- "status_label": "not evaluated",
2125
- "scored": false,
2126
  "proxy_scored": false,
2127
- "raw": null,
2128
- "raw_text": "n/a",
2129
- "normalized_score": null,
2130
- "metric_key": "f1",
2131
- "source": null,
2132
  "scope": "multi_episode_128_partial_model_overlay",
2133
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
2134
  },
2135
  {
2136
  "task_number": 12,
@@ -3416,17 +3416,17 @@
3416
  "task_label": "Time-to-Next-Transition Regression",
3417
  "series_id": "qwen3_omni_v6_lora",
3418
  "method": "Qwen3-Omni v6 LoRA",
3419
- "status": "not_evaluated_in_verified_package",
3420
- "status_label": "not evaluated",
3421
- "scored": false,
3422
  "proxy_scored": false,
3423
- "raw": null,
3424
- "raw_text": "n/a",
3425
- "normalized_score": null,
3426
- "metric_key": "mae",
3427
- "source": null,
3428
  "scope": "multi_episode_128_partial_model_overlay",
3429
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
3430
  },
3431
  {
3432
  "task_number": 20,
 
1
  {
2
  "title": "Task Method 20-Result Matrix",
3
  "status": "pass",
4
+ "generated_at_utc": "2026-06-17T21:17:41+00:00",
5
  "task_count": 20,
6
  "method_count": 9,
7
  "method_task_record_count": 180,
8
+ "scored_method_task_count": 119,
9
  "series": [
10
  {
11
  "id": "minimal",
 
161
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
162
  "plotted_as": "colored point overlay",
163
  "result_record_count": 20,
164
+ "scored_task_count": 13,
165
+ "covered_task_count": 13,
166
  "proxy_scored_task_count": 0,
167
+ "scoreless_task_count": 7,
168
  "unsupported_task_count": 0,
169
+ "not_evaluated_task_count": 7,
170
  "status_counts": {
171
+ "not_evaluated_in_verified_package": 7,
172
+ "scored": 13
173
  },
174
+ "coverage_fraction": 0.65,
175
  "result_record_fraction": 1.0
176
  },
177
  {
 
1958
  "task_label": "Temporal Order Verification",
1959
  "series_id": "qwen3_omni_v6_lora",
1960
  "method": "Qwen3-Omni v6 LoRA",
1961
+ "status": "scored",
1962
+ "status_label": "scored",
1963
+ "scored": true,
1964
  "proxy_scored": false,
1965
+ "raw": 0.40984631701404173,
1966
+ "raw_text": "0.4098",
1967
+ "normalized_score": 0.40984631701404173,
1968
+ "metric_key": "temporal_order_f1",
1969
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/temporal_order/metrics.json",
1970
  "scope": "multi_episode_128_partial_model_overlay",
1971
+ "reason": null
1972
  },
1973
  {
1974
  "task_number": 11,
 
2120
  "task_label": "Multimodal Synchronization Detection",
2121
  "series_id": "qwen3_omni_v6_lora",
2122
  "method": "Qwen3-Omni v6 LoRA",
2123
+ "status": "scored",
2124
+ "status_label": "scored",
2125
+ "scored": true,
2126
  "proxy_scored": false,
2127
+ "raw": 0.3344936184319576,
2128
+ "raw_text": "0.3345",
2129
+ "normalized_score": 0.3344936184319576,
2130
+ "metric_key": "misalignment_detection_f1",
2131
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/misalignment_detection/metrics.json",
2132
  "scope": "multi_episode_128_partial_model_overlay",
2133
+ "reason": null
2134
  },
2135
  {
2136
  "task_number": 12,
 
3416
  "task_label": "Time-to-Next-Transition Regression",
3417
  "series_id": "qwen3_omni_v6_lora",
3418
  "method": "Qwen3-Omni v6 LoRA",
3419
+ "status": "scored",
3420
+ "status_label": "scored",
3421
+ "scored": true,
3422
  "proxy_scored": false,
3423
+ "raw": 134.0687422166874,
3424
+ "raw_text": "134.07",
3425
+ "normalized_score": 0.07859666766782253,
3426
+ "metric_key": "time_to_transition_mae",
3427
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/time_to_transition/metrics.json",
3428
  "scope": "multi_episode_128_partial_model_overlay",
3429
+ "reason": null
3430
  },
3431
  {
3432
  "task_number": 20,
data/task_surface_integrity.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
  "status": "pass",
3
- "generated_at_utc": "2026-06-17T20:46:02+00:00",
4
  "summary": {
5
  "task_count": 12,
6
  "expected_task_count": 12,
 
1
  {
2
  "status": "pass",
3
+ "generated_at_utc": "2026-06-17T21:25:26+00:00",
4
  "summary": {
5
  "task_count": 12,
6
  "expected_task_count": 12,
data/unified_task_model_radar.json CHANGED
@@ -1,11 +1,11 @@
1
  {
2
  "title": "Unified 20-Task Model Radar",
3
  "status": "pass",
4
- "generated_at_utc": "2026-06-17T13:55:02+00:00",
5
  "task_count": 20,
6
  "method_count": 9,
7
  "method_task_record_count": 180,
8
- "scored_method_task_count": 116,
9
  "normalization_policy": {
10
  "higher_is_better": "bounded metrics are plotted directly on 0-1 axes after clipping to [0, 1]",
11
  "lower_is_better": "lower-error metrics are converted to best_observed_value / raw_value within the same task",
@@ -170,17 +170,17 @@
170
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
171
  "plotted_as": "colored point overlay",
172
  "result_record_count": 20,
173
- "scored_task_count": 10,
174
- "covered_task_count": 10,
175
  "proxy_scored_task_count": 0,
176
- "scoreless_task_count": 10,
177
  "unsupported_task_count": 0,
178
- "not_evaluated_task_count": 10,
179
  "status_counts": {
180
- "not_evaluated_in_verified_package": 10,
181
- "scored": 10
182
  },
183
- "coverage_fraction": 0.5,
184
  "result_record_fraction": 1.0
185
  },
186
  {
@@ -1375,6 +1375,17 @@
1375
  "raw_text": "0.8520",
1376
  "status_label": "scored"
1377
  },
1378
  "metadata128_simple": {
1379
  "raw": 0.4198864140782312,
1380
  "metric_key": "f1",
@@ -1419,17 +1430,6 @@
1419
  "raw_text": "n/a",
1420
  "status_label": "not supported"
1421
  },
1422
- "qwen3_omni_v6_lora": {
1423
- "raw": null,
1424
- "metric_key": "f1",
1425
- "source": null,
1426
- "scope": "multi_episode_128_partial_model_overlay",
1427
- "status": "not_evaluated_in_verified_package",
1428
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1429
- "normalized_score": null,
1430
- "raw_text": "n/a",
1431
- "status_label": "not evaluated"
1432
- },
1433
  "cosmos3_super_reasoner": {
1434
  "raw": null,
1435
  "metric_key": "f1",
@@ -1486,6 +1486,17 @@
1486
  "raw_text": "0.7153",
1487
  "status_label": "scored"
1488
  },
1489
  "metadata128_simple": {
1490
  "raw": null,
1491
  "metric_key": "f1",
@@ -1530,17 +1541,6 @@
1530
  "raw_text": "n/a",
1531
  "status_label": "not supported"
1532
  },
1533
- "qwen3_omni_v6_lora": {
1534
- "raw": null,
1535
- "metric_key": "f1",
1536
- "source": null,
1537
- "scope": "multi_episode_128_partial_model_overlay",
1538
- "status": "not_evaluated_in_verified_package",
1539
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1540
- "normalized_score": null,
1541
- "raw_text": "n/a",
1542
- "status_label": "not evaluated"
1543
- },
1544
  "cosmos3_super_reasoner": {
1545
  "raw": null,
1546
  "metric_key": "f1",
@@ -2374,6 +2374,17 @@
2374
  "raw_text": "10.55",
2375
  "status_label": "scored"
2376
  },
2377
  "raw128_simple": {
2378
  "raw": 52.32759475708008,
2379
  "metric_key": "mae",
@@ -2418,17 +2429,6 @@
2418
  "raw_text": "n/a",
2419
  "status_label": "not supported"
2420
  },
2421
- "qwen3_omni_v6_lora": {
2422
- "raw": null,
2423
- "metric_key": "mae",
2424
- "source": null,
2425
- "scope": "multi_episode_128_partial_model_overlay",
2426
- "status": "not_evaluated_in_verified_package",
2427
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
2428
- "normalized_score": null,
2429
- "raw_text": "n/a",
2430
- "status_label": "not evaluated"
2431
- },
2432
  "cosmos3_super_reasoner": {
2433
  "raw": null,
2434
  "metric_key": "mae",
@@ -2492,7 +2492,7 @@
2492
  "title": "Qwen3-Omni v6 LoRA",
2493
  "status": "verified",
2494
  "task_aligned_axes": "Qwen3",
2495
- "coverage": "20 records / 10 scored task-aligned axes",
2496
  "headline": "JSON validity 0.9990; action macro-F1 0.0029",
2497
  "source": "results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_multiscale_cap96_v6_rank64_lr5e5_full8gpu_lora_eval_test_full/eval/metrics.json"
2498
  },
@@ -4256,17 +4256,17 @@
4256
  "task_label": "Temporal Order Verification",
4257
  "series_id": "qwen3_omni_v6_lora",
4258
  "method": "Qwen3-Omni v6 LoRA",
4259
- "status": "not_evaluated_in_verified_package",
4260
- "status_label": "not evaluated",
4261
- "scored": false,
4262
  "proxy_scored": false,
4263
- "raw": null,
4264
- "raw_text": "n/a",
4265
- "normalized_score": null,
4266
- "metric_key": "f1",
4267
- "source": null,
4268
  "scope": "multi_episode_128_partial_model_overlay",
4269
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
4270
  },
4271
  {
4272
  "task_number": 11,
@@ -4418,17 +4418,17 @@
4418
  "task_label": "Multimodal Synchronization Detection",
4419
  "series_id": "qwen3_omni_v6_lora",
4420
  "method": "Qwen3-Omni v6 LoRA",
4421
- "status": "not_evaluated_in_verified_package",
4422
- "status_label": "not evaluated",
4423
- "scored": false,
4424
  "proxy_scored": false,
4425
- "raw": null,
4426
- "raw_text": "n/a",
4427
- "normalized_score": null,
4428
- "metric_key": "f1",
4429
- "source": null,
4430
  "scope": "multi_episode_128_partial_model_overlay",
4431
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
4432
  },
4433
  {
4434
  "task_number": 12,
@@ -5714,17 +5714,17 @@
5714
  "task_label": "Time-to-Next-Transition Regression",
5715
  "series_id": "qwen3_omni_v6_lora",
5716
  "method": "Qwen3-Omni v6 LoRA",
5717
- "status": "not_evaluated_in_verified_package",
5718
- "status_label": "not evaluated",
5719
- "scored": false,
5720
  "proxy_scored": false,
5721
- "raw": null,
5722
- "raw_text": "n/a",
5723
- "normalized_score": null,
5724
- "metric_key": "mae",
5725
- "source": null,
5726
  "scope": "multi_episode_128_partial_model_overlay",
5727
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
5728
  },
5729
  {
5730
  "task_number": 20,
 
1
  {
2
  "title": "Unified 20-Task Model Radar",
3
  "status": "pass",
4
+ "generated_at_utc": "2026-06-17T21:17:41+00:00",
5
  "task_count": 20,
6
  "method_count": 9,
7
  "method_task_record_count": 180,
8
+ "scored_method_task_count": 119,
9
  "normalization_policy": {
10
  "higher_is_better": "bounded metrics are plotted directly on 0-1 axes after clipping to [0, 1]",
11
  "lower_is_better": "lower-error metrics are converted to best_observed_value / raw_value within the same task",
 
170
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
171
  "plotted_as": "colored point overlay",
172
  "result_record_count": 20,
173
+ "scored_task_count": 13,
174
+ "covered_task_count": 13,
175
  "proxy_scored_task_count": 0,
176
+ "scoreless_task_count": 7,
177
  "unsupported_task_count": 0,
178
+ "not_evaluated_task_count": 7,
179
  "status_counts": {
180
+ "not_evaluated_in_verified_package": 7,
181
+ "scored": 13
182
  },
183
+ "coverage_fraction": 0.65,
184
  "result_record_fraction": 1.0
185
  },
186
  {
 
1375
  "raw_text": "0.8520",
1376
  "status_label": "scored"
1377
  },
1378
+ "qwen3_omni_v6_lora": {
1379
+ "raw": 0.40984631701404173,
1380
+ "metric_key": "temporal_order_f1",
1381
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/temporal_order/metrics.json",
1382
+ "scope": "multi_episode_128_partial_model_overlay",
1383
+ "status": "scored",
1384
+ "reason": null,
1385
+ "normalized_score": 0.40984631701404173,
1386
+ "raw_text": "0.4098",
1387
+ "status_label": "scored"
1388
+ },
1389
  "metadata128_simple": {
1390
  "raw": 0.4198864140782312,
1391
  "metric_key": "f1",
 
1430
  "raw_text": "n/a",
1431
  "status_label": "not supported"
1432
  },
1433
  "cosmos3_super_reasoner": {
1434
  "raw": null,
1435
  "metric_key": "f1",
 
1486
  "raw_text": "0.7153",
1487
  "status_label": "scored"
1488
  },
1489
+ "qwen3_omni_v6_lora": {
1490
+ "raw": 0.3344936184319576,
1491
+ "metric_key": "misalignment_detection_f1",
1492
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/misalignment_detection/metrics.json",
1493
+ "scope": "multi_episode_128_partial_model_overlay",
1494
+ "status": "scored",
1495
+ "reason": null,
1496
+ "normalized_score": 0.3344936184319576,
1497
+ "raw_text": "0.3345",
1498
+ "status_label": "scored"
1499
+ },
1500
  "metadata128_simple": {
1501
  "raw": null,
1502
  "metric_key": "f1",
 
1541
  "raw_text": "n/a",
1542
  "status_label": "not supported"
1543
  },
1544
  "cosmos3_super_reasoner": {
1545
  "raw": null,
1546
  "metric_key": "f1",
 
2374
  "raw_text": "10.55",
2375
  "status_label": "scored"
2376
  },
2377
+ "qwen3_omni_v6_lora": {
2378
+ "raw": 134.0687422166874,
2379
+ "metric_key": "time_to_transition_mae",
2380
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/time_to_transition/metrics.json",
2381
+ "scope": "multi_episode_128_partial_model_overlay",
2382
+ "status": "scored",
2383
+ "reason": null,
2384
+ "normalized_score": 0.07859666766782253,
2385
+ "raw_text": "134.07",
2386
+ "status_label": "scored"
2387
+ },
2388
  "raw128_simple": {
2389
  "raw": 52.32759475708008,
2390
  "metric_key": "mae",
 
2429
  "raw_text": "n/a",
2430
  "status_label": "not supported"
2431
  },
2432
  "cosmos3_super_reasoner": {
2433
  "raw": null,
2434
  "metric_key": "mae",
 
2492
  "title": "Qwen3-Omni v6 LoRA",
2493
  "status": "verified",
2494
  "task_aligned_axes": "Qwen3",
2495
+ "coverage": "20 records / 13 scored task-aligned axes",
2496
  "headline": "JSON validity 0.9990; action macro-F1 0.0029",
2497
  "source": "results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_multiscale_cap96_v6_rank64_lr5e5_full8gpu_lora_eval_test_full/eval/metrics.json"
2498
  },
 
4256
  "task_label": "Temporal Order Verification",
4257
  "series_id": "qwen3_omni_v6_lora",
4258
  "method": "Qwen3-Omni v6 LoRA",
4259
+ "status": "scored",
4260
+ "status_label": "scored",
4261
+ "scored": true,
4262
  "proxy_scored": false,
4263
+ "raw": 0.40984631701404173,
4264
+ "raw_text": "0.4098",
4265
+ "normalized_score": 0.40984631701404173,
4266
+ "metric_key": "temporal_order_f1",
4267
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/temporal_order/metrics.json",
4268
  "scope": "multi_episode_128_partial_model_overlay",
4269
+ "reason": null
4270
  },
4271
  {
4272
  "task_number": 11,
 
4418
  "task_label": "Multimodal Synchronization Detection",
4419
  "series_id": "qwen3_omni_v6_lora",
4420
  "method": "Qwen3-Omni v6 LoRA",
4421
+ "status": "scored",
4422
+ "status_label": "scored",
4423
+ "scored": true,
4424
  "proxy_scored": false,
4425
+ "raw": 0.3344936184319576,
4426
+ "raw_text": "0.3345",
4427
+ "normalized_score": 0.3344936184319576,
4428
+ "metric_key": "misalignment_detection_f1",
4429
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/misalignment_detection/metrics.json",
4430
  "scope": "multi_episode_128_partial_model_overlay",
4431
+ "reason": null
4432
  },
4433
  {
4434
  "task_number": 12,
 
5714
  "task_label": "Time-to-Next-Transition Regression",
5715
  "series_id": "qwen3_omni_v6_lora",
5716
  "method": "Qwen3-Omni v6 LoRA",
5717
+ "status": "scored",
5718
+ "status_label": "scored",
5719
+ "scored": true,
5720
  "proxy_scored": false,
5721
+ "raw": 134.0687422166874,
5722
+ "raw_text": "134.07",
5723
+ "normalized_score": 0.07859666766782253,
5724
+ "metric_key": "time_to_transition_mae",
5725
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/time_to_transition/metrics.json",
5726
  "scope": "multi_episode_128_partial_model_overlay",
5727
+ "reason": null
5728
  },
5729
  {
5730
  "task_number": 20,
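
Review note: the `normalization_policy` strings in this file explain why the two new F1 records have `normalized_score` equal to `raw`, while the new MAE record maps 134.07 to about 0.0786. The sketch below restates that policy in Python under stated assumptions: the function names are hypothetical, and the best observed MAE for task 20 is not shown in this diff, so it is inferred here from the published `raw` and `normalized_score` pair rather than read from the data.

```python
# Hypothetical restatement of the normalization_policy text; the function
# names are illustrative and not taken from the repository.

def normalize_higher_is_better(raw):
    """Bounded metrics (e.g. F1) are plotted directly after clipping to [0, 1]."""
    return min(1.0, max(0.0, raw))


def normalize_lower_is_better(raw, best_observed):
    """Lower-error metrics (e.g. MAE) become best_observed / raw within a task."""
    if raw <= 0:
        raise ValueError("raw error must be positive for a ratio-based score")
    return best_observed / raw


# Values copied from the "+" side of the diff above.
temporal_order_f1 = 0.40984631701404173
qwen_mae_raw = 134.0687422166874
qwen_mae_normalized = 0.07859666766782253

# F1 already lies in [0, 1], so the normalized score equals the raw score.
f1_score = normalize_higher_is_better(temporal_order_f1)

# Inferred, not read from the diff: the best MAE implied by the published pair.
implied_best_mae = qwen_mae_normalized * qwen_mae_raw

# Feeding the inferred best value back in reproduces the published score.
mae_score = normalize_lower_is_better(qwen_mae_raw, implied_best_mae)

print(f1_score, round(implied_best_mae, 2), mae_score)
```

Under this reading, the task 20 score of roughly 0.08 means the Qwen3-Omni v6 LoRA error is about 12.7 times the best observed MAE for that task, not that the probe failed to run.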
data/website_integrity.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
  "status": "pass",
3
- "generated_at_utc": "2026-06-17T21:12:34+00:00",
4
  "docs_root": "docs",
5
  "site_base": "/ropedia-xperience-10m-task-suite/",
6
  "summary": {
@@ -316,7 +316,7 @@
316
  },
317
  {
318
  "path": "data/episode128_task_model_radar.json",
319
- "bytes": 187388,
320
  "top_level_type": "dict"
321
  },
322
  {
@@ -486,12 +486,12 @@
486
  },
487
  {
488
  "path": "data/task_method_20_gap_audit.json",
489
- "bytes": 55745,
490
  "top_level_type": "dict"
491
  },
492
  {
493
  "path": "data/task_method_20_result_matrix.json",
494
- "bytes": 129749,
495
  "top_level_type": "dict"
496
  },
497
  {
@@ -526,7 +526,7 @@
526
  },
527
  {
528
  "path": "data/unified_task_model_radar.json",
529
- "bytes": 231240,
530
  "top_level_type": "dict"
531
  },
532
  {
@@ -566,7 +566,7 @@
566
  {
567
  "path": "assets/charts/episode128_task_model_radar.svg",
568
  "exists": true,
569
- "bytes": 44044,
570
  "format": "SVG",
571
  "has_viewbox": true
572
  },
@@ -636,7 +636,7 @@
636
  {
637
  "path": "assets/charts/unified_task_model_radar.svg",
638
  "exists": true,
639
- "bytes": 50060,
640
  "format": "SVG",
641
  "has_viewbox": true
642
  },
 
1
  {
2
  "status": "pass",
3
+ "generated_at_utc": "2026-06-17T21:25:27+00:00",
4
  "docs_root": "docs",
5
  "site_base": "/ropedia-xperience-10m-task-suite/",
6
  "summary": {
 
316
  },
317
  {
318
  "path": "data/episode128_task_model_radar.json",
319
+ "bytes": 187309,
320
  "top_level_type": "dict"
321
  },
322
  {
 
486
  },
487
  {
488
  "path": "data/task_method_20_gap_audit.json",
489
+ "bytes": 53574,
490
  "top_level_type": "dict"
491
  },
492
  {
493
  "path": "data/task_method_20_result_matrix.json",
494
+ "bytes": 129707,
495
  "top_level_type": "dict"
496
  },
497
  {
 
526
  },
527
  {
528
  "path": "data/unified_task_model_radar.json",
529
+ "bytes": 231161,
530
  "top_level_type": "dict"
531
  },
532
  {
 
566
  {
567
  "path": "assets/charts/episode128_task_model_radar.svg",
568
  "exists": true,
569
+ "bytes": 44378,
570
  "format": "SVG",
571
  "has_viewbox": true
572
  },
 
636
  {
637
  "path": "assets/charts/unified_task_model_radar.svg",
638
  "exists": true,
639
+ "bytes": 50394,
640
  "format": "SVG",
641
  "has_viewbox": true
642
  },
docs/data/episode128_task_model_radar.json CHANGED
@@ -1,12 +1,12 @@
1
  {
2
  "title": "128-Episode 20-Task Radar",
3
  "status": "pass",
4
- "generated_at_utc": "2026-06-17T13:55:02+00:00",
5
  "description": "Selected 128-episode metadata/raw baselines plus verified Qwen3/Cosmos branches. Every method has 20 records; numeric scores appear only where the public artifact produced that task target.",
6
  "task_count": 20,
7
  "method_count": 7,
8
  "method_task_record_count": 140,
9
- "scored_method_task_count": 76,
10
  "normalization_policy": {
11
  "higher_is_better": "bounded metrics are plotted directly on 0-1 axes after clipping to [0, 1]",
12
  "lower_is_better": "lower-error metrics are converted to best_observed_value / raw_value within the same task",
@@ -127,17 +127,17 @@
127
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
128
  "plotted_as": "colored point overlay",
129
  "result_record_count": 20,
130
- "scored_task_count": 10,
131
- "covered_task_count": 10,
132
  "proxy_scored_task_count": 0,
133
- "scoreless_task_count": 10,
134
  "unsupported_task_count": 0,
135
- "not_evaluated_task_count": 10,
136
  "status_counts": {
137
- "not_evaluated_in_verified_package": 10,
138
- "scored": 10
139
  },
140
- "coverage_fraction": 0.5,
141
  "result_record_fraction": 1.0
142
  },
143
  {
@@ -1157,15 +1157,15 @@
1157
  "status_label": "scored"
1158
  },
1159
  "qwen3_omni_v6_lora": {
1160
- "raw": null,
1161
- "metric_key": "f1",
1162
- "source": null,
1163
  "scope": "multi_episode_128_partial_model_overlay",
1164
- "status": "not_evaluated_in_verified_package",
1165
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1166
- "normalized_score": null,
1167
- "raw_text": "n/a",
1168
- "status_label": "not evaluated"
1169
  },
1170
  "cosmos3_super_reasoner": {
1171
  "raw": null,
@@ -1248,15 +1248,15 @@
1248
  "status_label": "scored"
1249
  },
1250
  "qwen3_omni_v6_lora": {
1251
- "raw": null,
1252
- "metric_key": "f1",
1253
- "source": null,
1254
  "scope": "multi_episode_128_partial_model_overlay",
1255
- "status": "not_evaluated_in_verified_package",
1256
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1257
- "normalized_score": null,
1258
- "raw_text": "n/a",
1259
- "status_label": "not evaluated"
1260
  },
1261
  "cosmos3_super_reasoner": {
1262
  "raw": null,
@@ -1976,15 +1976,15 @@
1976
  "status_label": "scored"
1977
  },
1978
  "qwen3_omni_v6_lora": {
1979
- "raw": null,
1980
- "metric_key": "mae",
1981
- "source": null,
1982
  "scope": "multi_episode_128_partial_model_overlay",
1983
- "status": "not_evaluated_in_verified_package",
1984
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1985
- "normalized_score": null,
1986
- "raw_text": "n/a",
1987
- "status_label": "not evaluated"
1988
  },
1989
  "cosmos3_super_reasoner": {
1990
  "raw": null,
@@ -3350,17 +3350,17 @@
3350
  "task_label": "Temporal Order Verification",
3351
  "series_id": "qwen3_omni_v6_lora",
3352
  "method": "Qwen3-Omni v6 LoRA",
3353
- "status": "not_evaluated_in_verified_package",
3354
- "status_label": "not evaluated",
3355
- "scored": false,
3356
  "proxy_scored": false,
3357
- "raw": null,
3358
- "raw_text": "n/a",
3359
- "normalized_score": null,
3360
- "metric_key": "f1",
3361
- "source": null,
3362
  "scope": "multi_episode_128_partial_model_overlay",
3363
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
3364
  },
3365
  {
3366
  "task_number": 11,
@@ -3476,17 +3476,17 @@
3476
  "task_label": "Multimodal Synchronization Detection",
3477
  "series_id": "qwen3_omni_v6_lora",
3478
  "method": "Qwen3-Omni v6 LoRA",
3479
- "status": "not_evaluated_in_verified_package",
3480
- "status_label": "not evaluated",
3481
- "scored": false,
3482
  "proxy_scored": false,
3483
- "raw": null,
3484
- "raw_text": "n/a",
3485
- "normalized_score": null,
3486
- "metric_key": "f1",
3487
- "source": null,
3488
  "scope": "multi_episode_128_partial_model_overlay",
3489
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
3490
  },
3491
  {
3492
  "task_number": 12,
@@ -4484,17 +4484,17 @@
4484
  "task_label": "Time-to-Next-Transition Regression",
4485
  "series_id": "qwen3_omni_v6_lora",
4486
  "method": "Qwen3-Omni v6 LoRA",
4487
- "status": "not_evaluated_in_verified_package",
4488
- "status_label": "not evaluated",
4489
- "scored": false,
4490
  "proxy_scored": false,
4491
- "raw": null,
4492
- "raw_text": "n/a",
4493
- "normalized_score": null,
4494
- "metric_key": "mae",
4495
- "source": null,
4496
  "scope": "multi_episode_128_partial_model_overlay",
4497
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
4498
  },
4499
  {
4500
  "task_number": 20,
 
1
  {
2
  "title": "128-Episode 20-Task Radar",
3
  "status": "pass",
4
+ "generated_at_utc": "2026-06-17T21:17:41+00:00",
5
  "description": "Selected 128-episode metadata/raw baselines plus verified Qwen3/Cosmos branches. Every method has 20 records; numeric scores appear only where the public artifact produced that task target.",
6
  "task_count": 20,
7
  "method_count": 7,
8
  "method_task_record_count": 140,
9
+ "scored_method_task_count": 79,
10
  "normalization_policy": {
11
  "higher_is_better": "bounded metrics are plotted directly on 0-1 axes after clipping to [0, 1]",
12
  "lower_is_better": "lower-error metrics are converted to best_observed_value / raw_value within the same task",
 
127
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
128
  "plotted_as": "colored point overlay",
129
  "result_record_count": 20,
130
+ "scored_task_count": 13,
131
+ "covered_task_count": 13,
132
  "proxy_scored_task_count": 0,
133
+ "scoreless_task_count": 7,
134
  "unsupported_task_count": 0,
135
+ "not_evaluated_task_count": 7,
136
  "status_counts": {
137
+ "not_evaluated_in_verified_package": 7,
138
+ "scored": 13
139
  },
140
+ "coverage_fraction": 0.65,
141
  "result_record_fraction": 1.0
142
  },
143
  {
 
1157
  "status_label": "scored"
1158
  },
1159
  "qwen3_omni_v6_lora": {
1160
+ "raw": 0.40984631701404173,
1161
+ "metric_key": "temporal_order_f1",
1162
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/temporal_order/metrics.json",
1163
  "scope": "multi_episode_128_partial_model_overlay",
1164
+ "status": "scored",
1165
+ "reason": null,
1166
+ "normalized_score": 0.40984631701404173,
1167
+ "raw_text": "0.4098",
1168
+ "status_label": "scored"
1169
  },
1170
  "cosmos3_super_reasoner": {
1171
  "raw": null,
 
1248
  "status_label": "scored"
1249
  },
1250
  "qwen3_omni_v6_lora": {
1251
+ "raw": 0.3344936184319576,
1252
+ "metric_key": "misalignment_detection_f1",
1253
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/misalignment_detection/metrics.json",
1254
  "scope": "multi_episode_128_partial_model_overlay",
1255
+ "status": "scored",
1256
+ "reason": null,
1257
+ "normalized_score": 0.3344936184319576,
1258
+ "raw_text": "0.3345",
1259
+ "status_label": "scored"
1260
  },
1261
  "cosmos3_super_reasoner": {
1262
  "raw": null,
 
1976
  "status_label": "scored"
1977
  },
1978
  "qwen3_omni_v6_lora": {
1979
+ "raw": 134.0687422166874,
1980
+ "metric_key": "time_to_transition_mae",
1981
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/time_to_transition/metrics.json",
1982
  "scope": "multi_episode_128_partial_model_overlay",
1983
+ "status": "scored",
1984
+ "reason": null,
1985
+ "normalized_score": 0.07859666766782253,
1986
+ "raw_text": "134.07",
1987
+ "status_label": "scored"
1988
  },
1989
  "cosmos3_super_reasoner": {
1990
  "raw": null,
 
3350
  "task_label": "Temporal Order Verification",
3351
  "series_id": "qwen3_omni_v6_lora",
3352
  "method": "Qwen3-Omni v6 LoRA",
3353
+ "status": "scored",
3354
+ "status_label": "scored",
3355
+ "scored": true,
3356
  "proxy_scored": false,
3357
+ "raw": 0.40984631701404173,
3358
+ "raw_text": "0.4098",
3359
+ "normalized_score": 0.40984631701404173,
3360
+ "metric_key": "temporal_order_f1",
3361
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/temporal_order/metrics.json",
3362
  "scope": "multi_episode_128_partial_model_overlay",
3363
+ "reason": null
3364
  },
3365
  {
3366
  "task_number": 11,
 
3476
  "task_label": "Multimodal Synchronization Detection",
3477
  "series_id": "qwen3_omni_v6_lora",
3478
  "method": "Qwen3-Omni v6 LoRA",
3479
+ "status": "scored",
3480
+ "status_label": "scored",
3481
+ "scored": true,
3482
  "proxy_scored": false,
3483
+ "raw": 0.3344936184319576,
3484
+ "raw_text": "0.3345",
3485
+ "normalized_score": 0.3344936184319576,
3486
+ "metric_key": "misalignment_detection_f1",
3487
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/misalignment_detection/metrics.json",
3488
  "scope": "multi_episode_128_partial_model_overlay",
3489
+ "reason": null
3490
  },
3491
  {
3492
  "task_number": 12,
 
4484
  "task_label": "Time-to-Next-Transition Regression",
4485
  "series_id": "qwen3_omni_v6_lora",
4486
  "method": "Qwen3-Omni v6 LoRA",
4487
+ "status": "scored",
4488
+ "status_label": "scored",
4489
+ "scored": true,
4490
  "proxy_scored": false,
4491
+ "raw": 134.0687422166874,
4492
+ "raw_text": "134.07",
4493
+ "normalized_score": 0.07859666766782253,
4494
+ "metric_key": "time_to_transition_mae",
4495
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/time_to_transition/metrics.json",
4496
  "scope": "multi_episode_128_partial_model_overlay",
4497
+ "reason": null
4498
  },
4499
  {
4500
  "task_number": 20,
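The `normalization_policy` strings in this radar file describe two rules: bounded higher-is-better metrics are clipped to [0, 1], and lower-error metrics become `best_observed_value / raw_value` within the same task. A minimal sketch of those two rules follows. The function names are hypothetical, and the sketch covers only the part of the policy text visible in this diff, not the repository's actual scoring script.

```python
def normalize_higher_is_better(raw: float) -> float:
    """Clip a bounded metric (for example F1) to the [0, 1] plotting range."""
    return min(1.0, max(0.0, raw))


def normalize_lower_is_better(raw: float, best_observed: float) -> float:
    """Map an error metric (for example MAE) to best_observed / raw within one task."""
    if raw <= 0:
        raise ValueError("raw error must be positive")
    return best_observed / raw


# F1 from the temporal-order record in this diff: already in [0, 1], so unchanged.
f1_score = normalize_higher_is_better(0.40984631701404173)

# Illustrative MAE values (assumed, not taken from the diff): best 10.0, this method 40.0.
mae_score = normalize_lower_is_better(40.0, best_observed=10.0)

print(f1_score, mae_score)
```

Read this way, the time-to-transition record above (raw MAE 134.07, normalized score 0.0786) implies a best observed MAE of about 10.5 for that task, assuming no further adjustment is applied.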
docs/data/publication_audit.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
  "status": "pass",
3
- "generated_at_utc": "2026-06-17T21:12:50+00:00",
4
  "checks": [
5
  {
6
  "name": "required_publication_assets_present",
@@ -206,8 +206,8 @@
206
  "github_repo": {
207
  "root": "repo",
208
  "exists": true,
209
- "file_count": 1232,
210
- "text_file_count": 1034,
211
  "largest_file": {
212
  "path": "results/episode_task_suite/modality_reconstruction/predictions.npz",
213
  "bytes": 55702978
 
1
  {
2
  "status": "pass",
3
+ "generated_at_utc": "2026-06-17T21:25:35+00:00",
4
  "checks": [
5
  {
6
  "name": "required_publication_assets_present",
 
206
  "github_repo": {
207
  "root": "repo",
208
  "exists": true,
209
+ "file_count": 1250,
210
+ "text_file_count": 1052,
211
  "largest_file": {
212
  "path": "results/episode_task_suite/modality_reconstruction/predictions.npz",
213
  "bytes": 55702978
docs/data/single_episode_task_model_radar.json CHANGED
@@ -1,7 +1,7 @@
1
  {
2
  "title": "Single-Episode 20-Task Radar",
3
  "status": "pass",
4
- "generated_at_utc": "2026-06-17T13:55:02+00:00",
5
  "description": "Minimal and Neural MLP baselines on the one public sample episode, both scored on all 20 task contracts.",
6
  "task_count": 20,
7
  "method_count": 2,
 
1
  {
2
  "title": "Single-Episode 20-Task Radar",
3
  "status": "pass",
4
+ "generated_at_utc": "2026-06-17T21:17:41+00:00",
5
  "description": "Minimal and Neural MLP baselines on the one public sample episode, both scored on all 20 task contracts.",
6
  "task_count": 20,
7
  "method_count": 2,
docs/data/task_method_20_gap_audit.json CHANGED
@@ -1,10 +1,10 @@
1
  {
2
- "generated_at_utc": "2026-06-17T13:55:12+00:00",
3
  "immediate_actions": [
4
  {
5
  "artifact": "docs/data/task_method_20_gap_audit.json",
6
  "id": "gap_audit",
7
- "purpose": "Keep the 64 scoreless cells visible and reproducible."
8
  },
9
  {
10
  "artifact": "scripts/omni/score_model_output_probes.py",
@@ -101,11 +101,11 @@
101
  "proxy_scored_task_count": 0,
102
  "result_record_count": 20,
103
  "scope": "128 selected episodes, held-out test",
104
- "scored_task_count": 10,
105
- "scoreless_task_count": 10,
106
  "status_counts": {
107
- "not_evaluated_in_verified_package": 10,
108
- "scored": 10
109
  }
110
  },
111
  "raw128_neural_mlp": {
@@ -140,10 +140,10 @@
140
  "cosmos3_super_reasoner": 13,
141
  "metadata128_neural_mlp": 14,
142
  "metadata128_simple": 12,
143
- "qwen3_omni_v6_lora": 10
144
  },
145
  "missing_by_status": {
146
- "not_evaluated_in_verified_package": 38,
147
  "not_supported_by_metadata_only_package": 22,
148
  "unsupported_without_required_target": 4
149
  },
@@ -183,15 +183,13 @@
183
  "11 Temporal Order Verification": [
184
  "cosmos3_nano_future_window",
185
  "cosmos3_super_reasoner",
186
- "metadata128_neural_mlp",
187
- "qwen3_omni_v6_lora"
188
  ],
189
  "12 Multimodal Synchronization Detection": [
190
  "cosmos3_nano_future_window",
191
  "cosmos3_super_reasoner",
192
  "metadata128_neural_mlp",
193
- "metadata128_simple",
194
- "qwen3_omni_v6_lora"
195
  ],
196
  "13 Long-Horizon Next-Action Forecasting": [
197
  "cosmos3_nano_future_window",
@@ -241,8 +239,7 @@
241
  "cosmos3_nano_future_window",
242
  "cosmos3_super_reasoner",
243
  "metadata128_neural_mlp",
244
- "metadata128_simple",
245
- "qwen3_omni_v6_lora"
246
  ]
247
  },
248
  "missing_records": [
@@ -519,19 +516,6 @@
519
  "task_label": "Temporal Order Verification",
520
  "task_number": 11
521
  },
522
- {
523
- "method": "Qwen3-Omni v6 LoRA",
524
- "metric_key": "f1",
525
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
526
- "recommended_next_step": "Generate verified model outputs for this task contract and score them against the held-out labels.",
527
- "scope": "multi_episode_128_partial_model_overlay",
528
- "series_id": "qwen3_omni_v6_lora",
529
- "status": "not_evaluated_in_verified_package",
530
- "status_label": "not evaluated",
531
- "task_id": "temporal_order",
532
- "task_label": "Temporal Order Verification",
533
- "task_number": 11
534
- },
535
  {
536
  "method": "Cosmos3-Super Reasoner",
537
  "metric_key": "f1",
@@ -584,19 +568,6 @@
584
  "task_label": "Multimodal Synchronization Detection",
585
  "task_number": 12
586
  },
587
- {
588
- "method": "Qwen3-Omni v6 LoRA",
589
- "metric_key": "f1",
590
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
591
- "recommended_next_step": "Generate verified model outputs for this task contract and score them against the held-out labels.",
592
- "scope": "multi_episode_128_partial_model_overlay",
593
- "series_id": "qwen3_omni_v6_lora",
594
- "status": "not_evaluated_in_verified_package",
595
- "status_label": "not evaluated",
596
- "task_id": "misalignment_detection",
597
- "task_label": "Multimodal Synchronization Detection",
598
- "task_number": 12
599
- },
600
  {
601
  "method": "Cosmos3-Super Reasoner",
602
  "metric_key": "f1",
@@ -1039,19 +1010,6 @@
1039
  "task_label": "Time-to-Next-Transition Regression",
1040
  "task_number": 20
1041
  },
1042
- {
1043
- "method": "Qwen3-Omni v6 LoRA",
1044
- "metric_key": "mae",
1045
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1046
- "recommended_next_step": "Generate verified model outputs for this task contract and score them against the held-out labels.",
1047
- "scope": "multi_episode_128_partial_model_overlay",
1048
- "series_id": "qwen3_omni_v6_lora",
1049
- "status": "not_evaluated_in_verified_package",
1050
- "status_label": "not evaluated",
1051
- "task_id": "time_to_transition",
1052
- "task_label": "Time-to-Next-Transition Regression",
1053
- "task_number": 20
1054
- },
1055
  {
1056
  "method": "Cosmos3-Super Reasoner",
1057
  "metric_key": "mae",
@@ -1125,8 +1083,8 @@
1125
  "method_count": 9,
1126
  "method_task_record_count": 180,
1127
  "proxy_scored_method_task_count": 4,
1128
- "scored_method_task_count": 116,
1129
- "scoreless_method_task_count": 64,
1130
  "task_count": 20
1131
  },
1132
  "source_matrix": "docs/data/task_method_20_result_matrix.json",
 
1
  {
2
+ "generated_at_utc": "2026-06-17T21:17:51+00:00",
3
  "immediate_actions": [
4
  {
5
  "artifact": "docs/data/task_method_20_gap_audit.json",
6
  "id": "gap_audit",
7
+ "purpose": "Keep the 61 scoreless cells visible and reproducible."
8
  },
9
  {
10
  "artifact": "scripts/omni/score_model_output_probes.py",
 
101
  "proxy_scored_task_count": 0,
102
  "result_record_count": 20,
103
  "scope": "128 selected episodes, held-out test",
104
+ "scored_task_count": 13,
105
+ "scoreless_task_count": 7,
106
  "status_counts": {
107
+ "not_evaluated_in_verified_package": 7,
108
+ "scored": 13
109
  }
110
  },
111
  "raw128_neural_mlp": {
 
140
  "cosmos3_super_reasoner": 13,
141
  "metadata128_neural_mlp": 14,
142
  "metadata128_simple": 12,
143
+ "qwen3_omni_v6_lora": 7
144
  },
145
  "missing_by_status": {
146
+ "not_evaluated_in_verified_package": 35,
147
  "not_supported_by_metadata_only_package": 22,
148
  "unsupported_without_required_target": 4
149
  },
 
183
  "11 Temporal Order Verification": [
184
  "cosmos3_nano_future_window",
185
  "cosmos3_super_reasoner",
186
+ "metadata128_neural_mlp"
 
187
  ],
188
  "12 Multimodal Synchronization Detection": [
189
  "cosmos3_nano_future_window",
190
  "cosmos3_super_reasoner",
191
  "metadata128_neural_mlp",
192
+ "metadata128_simple"
 
193
  ],
194
  "13 Long-Horizon Next-Action Forecasting": [
195
  "cosmos3_nano_future_window",
 
239
  "cosmos3_nano_future_window",
240
  "cosmos3_super_reasoner",
241
  "metadata128_neural_mlp",
242
+ "metadata128_simple"
 
243
  ]
244
  },
245
  "missing_records": [
 
516
  "task_label": "Temporal Order Verification",
517
  "task_number": 11
518
  },
519
  {
520
  "method": "Cosmos3-Super Reasoner",
521
  "metric_key": "f1",
 
568
  "task_label": "Multimodal Synchronization Detection",
569
  "task_number": 12
570
  },
571
  {
572
  "method": "Cosmos3-Super Reasoner",
573
  "metric_key": "f1",
 
1010
  "task_label": "Time-to-Next-Transition Regression",
1011
  "task_number": 20
1012
  },
1013
  {
1014
  "method": "Cosmos3-Super Reasoner",
1015
  "metric_key": "mae",
 
1083
  "method_count": 9,
1084
  "method_task_record_count": 180,
1085
  "proxy_scored_method_task_count": 4,
1086
+ "scored_method_task_count": 119,
1087
+ "scoreless_method_task_count": 61,
1088
  "task_count": 20
1089
  },
1090
  "source_matrix": "docs/data/task_method_20_result_matrix.json",
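The summary counts in this gap audit should agree with each other: scored plus scoreless records equal the total, and the per-status missing counts sum to the scoreless total. The sketch below checks those identities using the values shown in this diff. The helper names are hypothetical and are not part of the repository.

```python
import math


def check_counts(total, scored, scoreless, missing_by_status):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    if scored + scoreless != total:
        problems.append(f"scored + scoreless = {scored + scoreless}, expected {total}")
    missing_total = sum(missing_by_status.values())
    if missing_total != scoreless:
        problems.append(f"missing_by_status sums to {missing_total}, expected {scoreless}")
    return problems


def coverage_fraction(scored_tasks, task_count):
    """Fraction of task contracts with a numeric score for one method."""
    return scored_tasks / task_count


# Values copied from the updated audit in this diff.
problems = check_counts(
    total=180,
    scored=119,
    scoreless=61,
    missing_by_status={
        "not_evaluated_in_verified_package": 35,
        "not_supported_by_metadata_only_package": 22,
        "unsupported_without_required_target": 4,
    },
)
qwen_coverage = coverage_fraction(13, 20)  # Qwen3-Omni v6 LoRA: 13 of 20 tasks scored

print(problems, qwen_coverage)
```

The same check also passes on the previous values (116 scored, 64 scoreless, 38 + 22 + 4 missing), which is consistent with three Qwen3-Omni cells moving from not evaluated to scored.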
docs/data/task_method_20_result_matrix.json CHANGED
@@ -1,11 +1,11 @@
1
  {
2
  "title": "Task Method 20-Result Matrix",
3
  "status": "pass",
4
- "generated_at_utc": "2026-06-17T13:55:02+00:00",
5
  "task_count": 20,
6
  "method_count": 9,
7
  "method_task_record_count": 180,
8
- "scored_method_task_count": 116,
9
  "series": [
10
  {
11
  "id": "minimal",
@@ -161,17 +161,17 @@
161
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
162
  "plotted_as": "colored point overlay",
163
  "result_record_count": 20,
164
- "scored_task_count": 10,
165
- "covered_task_count": 10,
166
  "proxy_scored_task_count": 0,
167
- "scoreless_task_count": 10,
168
  "unsupported_task_count": 0,
169
- "not_evaluated_task_count": 10,
170
  "status_counts": {
171
- "not_evaluated_in_verified_package": 10,
172
- "scored": 10
173
  },
174
- "coverage_fraction": 0.5,
175
  "result_record_fraction": 1.0
176
  },
177
  {
@@ -1958,17 +1958,17 @@
1958
  "task_label": "Temporal Order Verification",
1959
  "series_id": "qwen3_omni_v6_lora",
1960
  "method": "Qwen3-Omni v6 LoRA",
1961
- "status": "not_evaluated_in_verified_package",
1962
- "status_label": "not evaluated",
1963
- "scored": false,
1964
  "proxy_scored": false,
1965
- "raw": null,
1966
- "raw_text": "n/a",
1967
- "normalized_score": null,
1968
- "metric_key": "f1",
1969
- "source": null,
1970
  "scope": "multi_episode_128_partial_model_overlay",
1971
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
1972
  },
1973
  {
1974
  "task_number": 11,
@@ -2120,17 +2120,17 @@
2120
  "task_label": "Multimodal Synchronization Detection",
2121
  "series_id": "qwen3_omni_v6_lora",
2122
  "method": "Qwen3-Omni v6 LoRA",
2123
- "status": "not_evaluated_in_verified_package",
2124
- "status_label": "not evaluated",
2125
- "scored": false,
2126
  "proxy_scored": false,
2127
- "raw": null,
2128
- "raw_text": "n/a",
2129
- "normalized_score": null,
2130
- "metric_key": "f1",
2131
- "source": null,
2132
  "scope": "multi_episode_128_partial_model_overlay",
2133
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
2134
  },
2135
  {
2136
  "task_number": 12,
@@ -3416,17 +3416,17 @@
3416
  "task_label": "Time-to-Next-Transition Regression",
3417
  "series_id": "qwen3_omni_v6_lora",
3418
  "method": "Qwen3-Omni v6 LoRA",
3419
- "status": "not_evaluated_in_verified_package",
3420
- "status_label": "not evaluated",
3421
- "scored": false,
3422
  "proxy_scored": false,
3423
- "raw": null,
3424
- "raw_text": "n/a",
3425
- "normalized_score": null,
3426
- "metric_key": "mae",
3427
- "source": null,
3428
  "scope": "multi_episode_128_partial_model_overlay",
3429
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
3430
  },
3431
  {
3432
  "task_number": 20,
 
1
  {
2
  "title": "Task Method 20-Result Matrix",
3
  "status": "pass",
4
+ "generated_at_utc": "2026-06-17T21:17:41+00:00",
5
  "task_count": 20,
6
  "method_count": 9,
7
  "method_task_record_count": 180,
8
+ "scored_method_task_count": 119,
9
  "series": [
10
  {
11
  "id": "minimal",
 
161
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
162
  "plotted_as": "colored point overlay",
163
  "result_record_count": 20,
164
+ "scored_task_count": 13,
165
+ "covered_task_count": 13,
166
  "proxy_scored_task_count": 0,
167
+ "scoreless_task_count": 7,
168
  "unsupported_task_count": 0,
169
+ "not_evaluated_task_count": 7,
170
  "status_counts": {
171
+ "not_evaluated_in_verified_package": 7,
172
+ "scored": 13
173
  },
174
+ "coverage_fraction": 0.65,
175
  "result_record_fraction": 1.0
176
  },
177
  {
 
1958
  "task_label": "Temporal Order Verification",
1959
  "series_id": "qwen3_omni_v6_lora",
1960
  "method": "Qwen3-Omni v6 LoRA",
1961
+ "status": "scored",
1962
+ "status_label": "scored",
1963
+ "scored": true,
1964
  "proxy_scored": false,
1965
+ "raw": 0.40984631701404173,
1966
+ "raw_text": "0.4098",
1967
+ "normalized_score": 0.40984631701404173,
1968
+ "metric_key": "temporal_order_f1",
1969
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/temporal_order/metrics.json",
1970
  "scope": "multi_episode_128_partial_model_overlay",
1971
+ "reason": null
1972
  },
1973
  {
1974
  "task_number": 11,
 
2120
  "task_label": "Multimodal Synchronization Detection",
2121
  "series_id": "qwen3_omni_v6_lora",
2122
  "method": "Qwen3-Omni v6 LoRA",
2123
+ "status": "scored",
2124
+ "status_label": "scored",
2125
+ "scored": true,
2126
  "proxy_scored": false,
2127
+ "raw": 0.3344936184319576,
2128
+ "raw_text": "0.3345",
2129
+ "normalized_score": 0.3344936184319576,
2130
+ "metric_key": "misalignment_detection_f1",
2131
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/misalignment_detection/metrics.json",
2132
  "scope": "multi_episode_128_partial_model_overlay",
2133
+ "reason": null
2134
  },
2135
  {
2136
  "task_number": 12,
 
3416
  "task_label": "Time-to-Next-Transition Regression",
3417
  "series_id": "qwen3_omni_v6_lora",
3418
  "method": "Qwen3-Omni v6 LoRA",
3419
+ "status": "scored",
3420
+ "status_label": "scored",
3421
+ "scored": true,
3422
  "proxy_scored": false,
3423
+ "raw": 134.0687422166874,
3424
+ "raw_text": "134.07",
3425
+ "normalized_score": 0.07859666766782253,
3426
+ "metric_key": "time_to_transition_mae",
3427
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/time_to_transition/metrics.json",
3428
  "scope": "multi_episode_128_partial_model_overlay",
3429
+ "reason": null
3430
  },
3431
  {
3432
  "task_number": 20,
docs/data/task_surface_integrity.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
  "status": "pass",
3
- "generated_at_utc": "2026-06-17T20:46:02+00:00",
4
  "summary": {
5
  "task_count": 12,
6
  "expected_task_count": 12,
 
1
  {
2
  "status": "pass",
3
+ "generated_at_utc": "2026-06-17T21:25:26+00:00",
4
  "summary": {
5
  "task_count": 12,
6
  "expected_task_count": 12,
docs/data/unified_task_model_radar.json CHANGED
@@ -1,11 +1,11 @@
1
  {
2
  "title": "Unified 20-Task Model Radar",
3
  "status": "pass",
4
- "generated_at_utc": "2026-06-17T13:55:02+00:00",
5
  "task_count": 20,
6
  "method_count": 9,
7
  "method_task_record_count": 180,
8
- "scored_method_task_count": 116,
9
  "normalization_policy": {
10
  "higher_is_better": "bounded metrics are plotted directly on 0-1 axes after clipping to [0, 1]",
11
  "lower_is_better": "lower-error metrics are converted to best_observed_value / raw_value within the same task",
@@ -170,17 +170,17 @@
170
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
171
  "plotted_as": "colored point overlay",
172
  "result_record_count": 20,
173
- "scored_task_count": 10,
174
- "covered_task_count": 10,
175
  "proxy_scored_task_count": 0,
176
- "scoreless_task_count": 10,
177
  "unsupported_task_count": 0,
178
- "not_evaluated_task_count": 10,
179
  "status_counts": {
180
- "not_evaluated_in_verified_package": 10,
181
- "scored": 10
182
  },
183
- "coverage_fraction": 0.5,
184
  "result_record_fraction": 1.0
185
  },
186
  {
@@ -1375,6 +1375,17 @@
1375
  "raw_text": "0.8520",
1376
  "status_label": "scored"
1377
  },
1378
  "metadata128_simple": {
1379
  "raw": 0.4198864140782312,
1380
  "metric_key": "f1",
@@ -1419,17 +1430,6 @@
1419
  "raw_text": "n/a",
1420
  "status_label": "not supported"
1421
  },
1422
- "qwen3_omni_v6_lora": {
1423
- "raw": null,
1424
- "metric_key": "f1",
1425
- "source": null,
1426
- "scope": "multi_episode_128_partial_model_overlay",
1427
- "status": "not_evaluated_in_verified_package",
1428
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1429
- "normalized_score": null,
1430
- "raw_text": "n/a",
1431
- "status_label": "not evaluated"
1432
- },
1433
  "cosmos3_super_reasoner": {
1434
  "raw": null,
1435
  "metric_key": "f1",
@@ -1486,6 +1486,17 @@
1486
  "raw_text": "0.7153",
1487
  "status_label": "scored"
1488
  },
1489
  "metadata128_simple": {
1490
  "raw": null,
1491
  "metric_key": "f1",
@@ -1530,17 +1541,6 @@
1530
  "raw_text": "n/a",
1531
  "status_label": "not supported"
1532
  },
1533
- "qwen3_omni_v6_lora": {
1534
- "raw": null,
1535
- "metric_key": "f1",
1536
- "source": null,
1537
- "scope": "multi_episode_128_partial_model_overlay",
1538
- "status": "not_evaluated_in_verified_package",
1539
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1540
- "normalized_score": null,
1541
- "raw_text": "n/a",
1542
- "status_label": "not evaluated"
1543
- },
1544
  "cosmos3_super_reasoner": {
1545
  "raw": null,
1546
  "metric_key": "f1",
@@ -2374,6 +2374,17 @@
2374
  "raw_text": "10.55",
2375
  "status_label": "scored"
2376
  },
2377
  "raw128_simple": {
2378
  "raw": 52.32759475708008,
2379
  "metric_key": "mae",
@@ -2418,17 +2429,6 @@
2418
  "raw_text": "n/a",
2419
  "status_label": "not supported"
2420
  },
2421
- "qwen3_omni_v6_lora": {
2422
- "raw": null,
2423
- "metric_key": "mae",
2424
- "source": null,
2425
- "scope": "multi_episode_128_partial_model_overlay",
2426
- "status": "not_evaluated_in_verified_package",
2427
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
2428
- "normalized_score": null,
2429
- "raw_text": "n/a",
2430
- "status_label": "not evaluated"
2431
- },
2432
  "cosmos3_super_reasoner": {
2433
  "raw": null,
2434
  "metric_key": "mae",
@@ -2492,7 +2492,7 @@
2492
  "title": "Qwen3-Omni v6 LoRA",
2493
  "status": "verified",
2494
  "task_aligned_axes": "Qwen3",
2495
- "coverage": "20 records / 10 scored task-aligned axes",
2496
  "headline": "JSON validity 0.9990; action macro-F1 0.0029",
2497
  "source": "results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_multiscale_cap96_v6_rank64_lr5e5_full8gpu_lora_eval_test_full/eval/metrics.json"
2498
  },
@@ -4256,17 +4256,17 @@
4256
  "task_label": "Temporal Order Verification",
4257
  "series_id": "qwen3_omni_v6_lora",
4258
  "method": "Qwen3-Omni v6 LoRA",
4259
- "status": "not_evaluated_in_verified_package",
4260
- "status_label": "not evaluated",
4261
- "scored": false,
4262
  "proxy_scored": false,
4263
- "raw": null,
4264
- "raw_text": "n/a",
4265
- "normalized_score": null,
4266
- "metric_key": "f1",
4267
- "source": null,
4268
  "scope": "multi_episode_128_partial_model_overlay",
4269
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
4270
  },
4271
  {
4272
  "task_number": 11,
@@ -4418,17 +4418,17 @@
4418
  "task_label": "Multimodal Synchronization Detection",
4419
  "series_id": "qwen3_omni_v6_lora",
4420
  "method": "Qwen3-Omni v6 LoRA",
4421
- "status": "not_evaluated_in_verified_package",
4422
- "status_label": "not evaluated",
4423
- "scored": false,
4424
  "proxy_scored": false,
4425
- "raw": null,
4426
- "raw_text": "n/a",
4427
- "normalized_score": null,
4428
- "metric_key": "f1",
4429
- "source": null,
4430
  "scope": "multi_episode_128_partial_model_overlay",
4431
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
4432
  },
4433
  {
4434
  "task_number": 12,
@@ -5714,17 +5714,17 @@
5714
  "task_label": "Time-to-Next-Transition Regression",
5715
  "series_id": "qwen3_omni_v6_lora",
5716
  "method": "Qwen3-Omni v6 LoRA",
5717
- "status": "not_evaluated_in_verified_package",
5718
- "status_label": "not evaluated",
5719
- "scored": false,
5720
  "proxy_scored": false,
5721
- "raw": null,
5722
- "raw_text": "n/a",
5723
- "normalized_score": null,
5724
- "metric_key": "mae",
5725
- "source": null,
5726
  "scope": "multi_episode_128_partial_model_overlay",
5727
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
5728
  },
5729
  {
5730
  "task_number": 20,
 
1
  {
2
  "title": "Unified 20-Task Model Radar",
3
  "status": "pass",
4
+ "generated_at_utc": "2026-06-17T21:17:41+00:00",
5
  "task_count": 20,
6
  "method_count": 9,
7
  "method_task_record_count": 180,
8
+ "scored_method_task_count": 119,
9
  "normalization_policy": {
10
  "higher_is_better": "bounded metrics are plotted directly on 0-1 axes after clipping to [0, 1]",
11
  "lower_is_better": "lower-error metrics are converted to best_observed_value / raw_value within the same task",
 
170
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
171
  "plotted_as": "colored point overlay",
172
  "result_record_count": 20,
173
+ "scored_task_count": 13,
174
+ "covered_task_count": 13,
175
  "proxy_scored_task_count": 0,
176
+ "scoreless_task_count": 7,
177
  "unsupported_task_count": 0,
178
+ "not_evaluated_task_count": 7,
179
  "status_counts": {
180
+ "not_evaluated_in_verified_package": 7,
181
+ "scored": 13
182
  },
183
+ "coverage_fraction": 0.65,
184
  "result_record_fraction": 1.0
185
  },
186
  {
 
1375
  "raw_text": "0.8520",
1376
  "status_label": "scored"
1377
  },
1378
+ "qwen3_omni_v6_lora": {
1379
+ "raw": 0.40984631701404173,
1380
+ "metric_key": "temporal_order_f1",
1381
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/temporal_order/metrics.json",
1382
+ "scope": "multi_episode_128_partial_model_overlay",
1383
+ "status": "scored",
1384
+ "reason": null,
1385
+ "normalized_score": 0.40984631701404173,
1386
+ "raw_text": "0.4098",
1387
+ "status_label": "scored"
1388
+ },
1389
  "metadata128_simple": {
1390
  "raw": 0.4198864140782312,
1391
  "metric_key": "f1",
 
1430
  "raw_text": "n/a",
1431
  "status_label": "not supported"
1432
  },
1433
  "cosmos3_super_reasoner": {
1434
  "raw": null,
1435
  "metric_key": "f1",
 
1486
  "raw_text": "0.7153",
1487
  "status_label": "scored"
1488
  },
1489
+ "qwen3_omni_v6_lora": {
1490
+ "raw": 0.3344936184319576,
1491
+ "metric_key": "misalignment_detection_f1",
1492
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/misalignment_detection/metrics.json",
1493
+ "scope": "multi_episode_128_partial_model_overlay",
1494
+ "status": "scored",
1495
+ "reason": null,
1496
+ "normalized_score": 0.3344936184319576,
1497
+ "raw_text": "0.3345",
1498
+ "status_label": "scored"
1499
+ },
1500
  "metadata128_simple": {
1501
  "raw": null,
1502
  "metric_key": "f1",
 
1541
  "raw_text": "n/a",
1542
  "status_label": "not supported"
1543
  },
1544
  "cosmos3_super_reasoner": {
1545
  "raw": null,
1546
  "metric_key": "f1",
 
2374
  "raw_text": "10.55",
2375
  "status_label": "scored"
2376
  },
2377
+ "qwen3_omni_v6_lora": {
2378
+ "raw": 134.0687422166874,
2379
+ "metric_key": "time_to_transition_mae",
2380
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/time_to_transition/metrics.json",
2381
+ "scope": "multi_episode_128_partial_model_overlay",
2382
+ "status": "scored",
2383
+ "reason": null,
2384
+ "normalized_score": 0.07859666766782253,
2385
+ "raw_text": "134.07",
2386
+ "status_label": "scored"
2387
+ },
2388
  "raw128_simple": {
2389
  "raw": 52.32759475708008,
2390
  "metric_key": "mae",
 
2429
  "raw_text": "n/a",
2430
  "status_label": "not supported"
2431
  },
2432
  "cosmos3_super_reasoner": {
2433
  "raw": null,
2434
  "metric_key": "mae",
 
2492
  "title": "Qwen3-Omni v6 LoRA",
2493
  "status": "verified",
2494
  "task_aligned_axes": "Qwen3",
2495
+ "coverage": "20 records / 13 scored task-aligned axes",
2496
  "headline": "JSON validity 0.9990; action macro-F1 0.0029",
2497
  "source": "results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_multiscale_cap96_v6_rank64_lr5e5_full8gpu_lora_eval_test_full/eval/metrics.json"
2498
  },
 
4256
  "task_label": "Temporal Order Verification",
4257
  "series_id": "qwen3_omni_v6_lora",
4258
  "method": "Qwen3-Omni v6 LoRA",
4259
+ "status": "scored",
4260
+ "status_label": "scored",
4261
+ "scored": true,
4262
  "proxy_scored": false,
4263
+ "raw": 0.40984631701404173,
4264
+ "raw_text": "0.4098",
4265
+ "normalized_score": 0.40984631701404173,
4266
+ "metric_key": "temporal_order_f1",
4267
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/temporal_order/metrics.json",
4268
  "scope": "multi_episode_128_partial_model_overlay",
4269
+ "reason": null
4270
  },
4271
  {
4272
  "task_number": 11,
 
4418
  "task_label": "Multimodal Synchronization Detection",
4419
  "series_id": "qwen3_omni_v6_lora",
4420
  "method": "Qwen3-Omni v6 LoRA",
4421
+ "status": "scored",
4422
+ "status_label": "scored",
4423
+ "scored": true,
4424
  "proxy_scored": false,
4425
+ "raw": 0.3344936184319576,
4426
+ "raw_text": "0.3345",
4427
+ "normalized_score": 0.3344936184319576,
4428
+ "metric_key": "misalignment_detection_f1",
4429
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/misalignment_detection/metrics.json",
4430
  "scope": "multi_episode_128_partial_model_overlay",
4431
+ "reason": null
4432
  },
4433
  {
4434
  "task_number": 12,
 
5714
  "task_label": "Time-to-Next-Transition Regression",
5715
  "series_id": "qwen3_omni_v6_lora",
5716
  "method": "Qwen3-Omni v6 LoRA",
5717
+ "status": "scored",
5718
+ "status_label": "scored",
5719
+ "scored": true,
5720
  "proxy_scored": false,
5721
+ "raw": 134.0687422166874,
5722
+ "raw_text": "134.07",
5723
+ "normalized_score": 0.07859666766782253,
5724
+ "metric_key": "time_to_transition_mae",
5725
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/time_to_transition/metrics.json",
5726
  "scope": "multi_episode_128_partial_model_overlay",
5727
+ "reason": null
5728
  },
5729
  {
5730
  "task_number": 20,
docs/data/website_integrity.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
  "status": "pass",
3
- "generated_at_utc": "2026-06-17T21:12:34+00:00",
4
  "docs_root": "docs",
5
  "site_base": "/ropedia-xperience-10m-task-suite/",
6
  "summary": {
@@ -316,7 +316,7 @@
316
  },
317
  {
318
  "path": "data/episode128_task_model_radar.json",
319
- "bytes": 187388,
320
  "top_level_type": "dict"
321
  },
322
  {
@@ -486,12 +486,12 @@
486
  },
487
  {
488
  "path": "data/task_method_20_gap_audit.json",
489
- "bytes": 55745,
490
  "top_level_type": "dict"
491
  },
492
  {
493
  "path": "data/task_method_20_result_matrix.json",
494
- "bytes": 129749,
495
  "top_level_type": "dict"
496
  },
497
  {
@@ -526,7 +526,7 @@
526
  },
527
  {
528
  "path": "data/unified_task_model_radar.json",
529
- "bytes": 231240,
530
  "top_level_type": "dict"
531
  },
532
  {
@@ -566,7 +566,7 @@
566
  {
567
  "path": "assets/charts/episode128_task_model_radar.svg",
568
  "exists": true,
569
- "bytes": 44044,
570
  "format": "SVG",
571
  "has_viewbox": true
572
  },
@@ -636,7 +636,7 @@
636
  {
637
  "path": "assets/charts/unified_task_model_radar.svg",
638
  "exists": true,
639
- "bytes": 50060,
640
  "format": "SVG",
641
  "has_viewbox": true
642
  },
 
1
  {
2
  "status": "pass",
3
+ "generated_at_utc": "2026-06-17T21:25:27+00:00",
4
  "docs_root": "docs",
5
  "site_base": "/ropedia-xperience-10m-task-suite/",
6
  "summary": {
 
316
  },
317
  {
318
  "path": "data/episode128_task_model_radar.json",
319
+ "bytes": 187309,
320
  "top_level_type": "dict"
321
  },
322
  {
 
486
  },
487
  {
488
  "path": "data/task_method_20_gap_audit.json",
489
+ "bytes": 53574,
490
  "top_level_type": "dict"
491
  },
492
  {
493
  "path": "data/task_method_20_result_matrix.json",
494
+ "bytes": 129707,
495
  "top_level_type": "dict"
496
  },
497
  {
 
526
  },
527
  {
528
  "path": "data/unified_task_model_radar.json",
529
+ "bytes": 231161,
530
  "top_level_type": "dict"
531
  },
532
  {
 
566
  {
567
  "path": "assets/charts/episode128_task_model_radar.svg",
568
  "exists": true,
569
+ "bytes": 44378,
570
  "format": "SVG",
571
  "has_viewbox": true
572
  },
 
636
  {
637
  "path": "assets/charts/unified_task_model_radar.svg",
638
  "exists": true,
639
+ "bytes": 50394,
640
  "format": "SVG",
641
  "has_viewbox": true
642
  },
metrics/episode128_task_model_radar.json CHANGED
@@ -1,12 +1,12 @@
1
  {
2
  "title": "128-Episode 20-Task Radar",
3
  "status": "pass",
4
- "generated_at_utc": "2026-06-17T13:55:02+00:00",
5
  "description": "Selected 128-episode metadata/raw baselines plus verified Qwen3/Cosmos branches. Every method has 20 records; numeric scores appear only where the public artifact produced that task target.",
6
  "task_count": 20,
7
  "method_count": 7,
8
  "method_task_record_count": 140,
9
- "scored_method_task_count": 76,
10
  "normalization_policy": {
11
  "higher_is_better": "bounded metrics are plotted directly on 0-1 axes after clipping to [0, 1]",
12
  "lower_is_better": "lower-error metrics are converted to best_observed_value / raw_value within the same task",
@@ -127,17 +127,17 @@
127
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
128
  "plotted_as": "colored point overlay",
129
  "result_record_count": 20,
130
- "scored_task_count": 10,
131
- "covered_task_count": 10,
132
  "proxy_scored_task_count": 0,
133
- "scoreless_task_count": 10,
134
  "unsupported_task_count": 0,
135
- "not_evaluated_task_count": 10,
136
  "status_counts": {
137
- "not_evaluated_in_verified_package": 10,
138
- "scored": 10
139
  },
140
- "coverage_fraction": 0.5,
141
  "result_record_fraction": 1.0
142
  },
143
  {
@@ -1157,15 +1157,15 @@
1157
  "status_label": "scored"
1158
  },
1159
  "qwen3_omni_v6_lora": {
1160
- "raw": null,
1161
- "metric_key": "f1",
1162
- "source": null,
1163
  "scope": "multi_episode_128_partial_model_overlay",
1164
- "status": "not_evaluated_in_verified_package",
1165
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1166
- "normalized_score": null,
1167
- "raw_text": "n/a",
1168
- "status_label": "not evaluated"
1169
  },
1170
  "cosmos3_super_reasoner": {
1171
  "raw": null,
@@ -1248,15 +1248,15 @@
1248
  "status_label": "scored"
1249
  },
1250
  "qwen3_omni_v6_lora": {
1251
- "raw": null,
1252
- "metric_key": "f1",
1253
- "source": null,
1254
  "scope": "multi_episode_128_partial_model_overlay",
1255
- "status": "not_evaluated_in_verified_package",
1256
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1257
- "normalized_score": null,
1258
- "raw_text": "n/a",
1259
- "status_label": "not evaluated"
1260
  },
1261
  "cosmos3_super_reasoner": {
1262
  "raw": null,
@@ -1976,15 +1976,15 @@
1976
  "status_label": "scored"
1977
  },
1978
  "qwen3_omni_v6_lora": {
1979
- "raw": null,
1980
- "metric_key": "mae",
1981
- "source": null,
1982
  "scope": "multi_episode_128_partial_model_overlay",
1983
- "status": "not_evaluated_in_verified_package",
1984
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1985
- "normalized_score": null,
1986
- "raw_text": "n/a",
1987
- "status_label": "not evaluated"
1988
  },
1989
  "cosmos3_super_reasoner": {
1990
  "raw": null,
@@ -3350,17 +3350,17 @@
3350
  "task_label": "Temporal Order Verification",
3351
  "series_id": "qwen3_omni_v6_lora",
3352
  "method": "Qwen3-Omni v6 LoRA",
3353
- "status": "not_evaluated_in_verified_package",
3354
- "status_label": "not evaluated",
3355
- "scored": false,
3356
  "proxy_scored": false,
3357
- "raw": null,
3358
- "raw_text": "n/a",
3359
- "normalized_score": null,
3360
- "metric_key": "f1",
3361
- "source": null,
3362
  "scope": "multi_episode_128_partial_model_overlay",
3363
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
3364
  },
3365
  {
3366
  "task_number": 11,
@@ -3476,17 +3476,17 @@
3476
  "task_label": "Multimodal Synchronization Detection",
3477
  "series_id": "qwen3_omni_v6_lora",
3478
  "method": "Qwen3-Omni v6 LoRA",
3479
- "status": "not_evaluated_in_verified_package",
3480
- "status_label": "not evaluated",
3481
- "scored": false,
3482
  "proxy_scored": false,
3483
- "raw": null,
3484
- "raw_text": "n/a",
3485
- "normalized_score": null,
3486
- "metric_key": "f1",
3487
- "source": null,
3488
  "scope": "multi_episode_128_partial_model_overlay",
3489
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
3490
  },
3491
  {
3492
  "task_number": 12,
@@ -4484,17 +4484,17 @@
4484
  "task_label": "Time-to-Next-Transition Regression",
4485
  "series_id": "qwen3_omni_v6_lora",
4486
  "method": "Qwen3-Omni v6 LoRA",
4487
- "status": "not_evaluated_in_verified_package",
4488
- "status_label": "not evaluated",
4489
- "scored": false,
4490
  "proxy_scored": false,
4491
- "raw": null,
4492
- "raw_text": "n/a",
4493
- "normalized_score": null,
4494
- "metric_key": "mae",
4495
- "source": null,
4496
  "scope": "multi_episode_128_partial_model_overlay",
4497
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
4498
  },
4499
  {
4500
  "task_number": 20,
 
1
  {
2
  "title": "128-Episode 20-Task Radar",
3
  "status": "pass",
4
+ "generated_at_utc": "2026-06-17T21:17:41+00:00",
5
  "description": "Selected 128-episode metadata/raw baselines plus verified Qwen3/Cosmos branches. Every method has 20 records; numeric scores appear only where the public artifact produced that task target.",
6
  "task_count": 20,
7
  "method_count": 7,
8
  "method_task_record_count": 140,
9
+ "scored_method_task_count": 79,
10
  "normalization_policy": {
11
  "higher_is_better": "bounded metrics are plotted directly on 0-1 axes after clipping to [0, 1]",
12
  "lower_is_better": "lower-error metrics are converted to best_observed_value / raw_value within the same task",
 
127
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
128
  "plotted_as": "colored point overlay",
129
  "result_record_count": 20,
130
+ "scored_task_count": 13,
131
+ "covered_task_count": 13,
132
  "proxy_scored_task_count": 0,
133
+ "scoreless_task_count": 7,
134
  "unsupported_task_count": 0,
135
+ "not_evaluated_task_count": 7,
136
  "status_counts": {
137
+ "not_evaluated_in_verified_package": 7,
138
+ "scored": 13
139
  },
140
+ "coverage_fraction": 0.65,
141
  "result_record_fraction": 1.0
142
  },
143
  {
 
1157
  "status_label": "scored"
1158
  },
1159
  "qwen3_omni_v6_lora": {
1160
+ "raw": 0.40984631701404173,
1161
+ "metric_key": "temporal_order_f1",
1162
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/temporal_order/metrics.json",
1163
  "scope": "multi_episode_128_partial_model_overlay",
1164
+ "status": "scored",
1165
+ "reason": null,
1166
+ "normalized_score": 0.40984631701404173,
1167
+ "raw_text": "0.4098",
1168
+ "status_label": "scored"
1169
  },
1170
  "cosmos3_super_reasoner": {
1171
  "raw": null,
 
1248
  "status_label": "scored"
1249
  },
1250
  "qwen3_omni_v6_lora": {
1251
+ "raw": 0.3344936184319576,
1252
+ "metric_key": "misalignment_detection_f1",
1253
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/misalignment_detection/metrics.json",
1254
  "scope": "multi_episode_128_partial_model_overlay",
1255
+ "status": "scored",
1256
+ "reason": null,
1257
+ "normalized_score": 0.3344936184319576,
1258
+ "raw_text": "0.3345",
1259
+ "status_label": "scored"
1260
  },
1261
  "cosmos3_super_reasoner": {
1262
  "raw": null,
 
1976
  "status_label": "scored"
1977
  },
1978
  "qwen3_omni_v6_lora": {
1979
+ "raw": 134.0687422166874,
1980
+ "metric_key": "time_to_transition_mae",
1981
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/time_to_transition/metrics.json",
1982
  "scope": "multi_episode_128_partial_model_overlay",
1983
+ "status": "scored",
1984
+ "reason": null,
1985
+ "normalized_score": 0.07859666766782253,
1986
+ "raw_text": "134.07",
1987
+ "status_label": "scored"
1988
  },
1989
  "cosmos3_super_reasoner": {
1990
  "raw": null,
 
3350
  "task_label": "Temporal Order Verification",
3351
  "series_id": "qwen3_omni_v6_lora",
3352
  "method": "Qwen3-Omni v6 LoRA",
3353
+ "status": "scored",
3354
+ "status_label": "scored",
3355
+ "scored": true,
3356
  "proxy_scored": false,
3357
+ "raw": 0.40984631701404173,
3358
+ "raw_text": "0.4098",
3359
+ "normalized_score": 0.40984631701404173,
3360
+ "metric_key": "temporal_order_f1",
3361
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/temporal_order/metrics.json",
3362
  "scope": "multi_episode_128_partial_model_overlay",
3363
+ "reason": null
3364
  },
3365
  {
3366
  "task_number": 11,
 
3476
  "task_label": "Multimodal Synchronization Detection",
3477
  "series_id": "qwen3_omni_v6_lora",
3478
  "method": "Qwen3-Omni v6 LoRA",
3479
+ "status": "scored",
3480
+ "status_label": "scored",
3481
+ "scored": true,
3482
  "proxy_scored": false,
3483
+ "raw": 0.3344936184319576,
3484
+ "raw_text": "0.3345",
3485
+ "normalized_score": 0.3344936184319576,
3486
+ "metric_key": "misalignment_detection_f1",
3487
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/misalignment_detection/metrics.json",
3488
  "scope": "multi_episode_128_partial_model_overlay",
3489
+ "reason": null
3490
  },
3491
  {
3492
  "task_number": 12,
 
4484
  "task_label": "Time-to-Next-Transition Regression",
4485
  "series_id": "qwen3_omni_v6_lora",
4486
  "method": "Qwen3-Omni v6 LoRA",
4487
+ "status": "scored",
4488
+ "status_label": "scored",
4489
+ "scored": true,
4490
  "proxy_scored": false,
4491
+ "raw": 134.0687422166874,
4492
+ "raw_text": "134.07",
4493
+ "normalized_score": 0.07859666766782253,
4494
+ "metric_key": "time_to_transition_mae",
4495
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/time_to_transition/metrics.json",
4496
  "scope": "multi_episode_128_partial_model_overlay",
4497
+ "reason": null
4498
  },
4499
  {
4500
  "task_number": 20,
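Reviewer note on the `normalization_policy` strings in this radar file: they describe two rules, direct plotting after clipping to [0, 1] for higher-is-better metrics and `best_observed_value / raw_value` for lower-error metrics. The sketch below is one possible reading of those two sentences, not the project's scoring code. The function names are hypothetical, returning `None` for scoreless cells and for non-positive errors is an assumption, and the best observed value is passed in by the caller because this diff does not show it.

```python
from typing import Optional


def normalize_higher_is_better(raw: Optional[float]) -> Optional[float]:
    """Clip a bounded metric (e.g. F1) to [0, 1]; scoreless cells stay None."""
    if raw is None:
        return None
    return min(1.0, max(0.0, raw))


def normalize_lower_is_better(
    raw: Optional[float], best_observed: Optional[float]
) -> Optional[float]:
    """Convert an error metric (e.g. MAE) to best_observed / raw.

    Assumption: a missing or non-positive raw error yields None rather
    than a division error; the diff does not say how that case is handled.
    """
    if raw is None or best_observed is None or raw <= 0:
        return None
    return best_observed / raw


# The temporal-order F1 from the diff is already inside [0, 1], so it passes
# through unchanged, which matches raw == normalized_score in those records.
f1_score = normalize_higher_is_better(0.40984631701404173)

# Illustrative numbers only: an error twice the best one maps to 0.5.
mae_score = normalize_lower_is_better(20.0, 10.0)
```

Under this reading, a lower-is-better score depends on every method scored for the same task, so adding a new method with a smaller error would lower the normalized scores of the existing ones.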
metrics/publication_audit.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
  "status": "pass",
3
- "generated_at_utc": "2026-06-17T21:12:50+00:00",
4
  "checks": [
5
  {
6
  "name": "required_publication_assets_present",
@@ -206,8 +206,8 @@
206
  "github_repo": {
207
  "root": "repo",
208
  "exists": true,
209
- "file_count": 1232,
210
- "text_file_count": 1034,
211
  "largest_file": {
212
  "path": "results/episode_task_suite/modality_reconstruction/predictions.npz",
213
  "bytes": 55702978
 
1
  {
2
  "status": "pass",
3
+ "generated_at_utc": "2026-06-17T21:25:35+00:00",
4
  "checks": [
5
  {
6
  "name": "required_publication_assets_present",
 
206
  "github_repo": {
207
  "root": "repo",
208
  "exists": true,
209
+ "file_count": 1250,
210
+ "text_file_count": 1052,
211
  "largest_file": {
212
  "path": "results/episode_task_suite/modality_reconstruction/predictions.npz",
213
  "bytes": 55702978
metrics/single_episode_task_model_radar.json CHANGED
@@ -1,7 +1,7 @@
1
  {
2
  "title": "Single-Episode 20-Task Radar",
3
  "status": "pass",
4
- "generated_at_utc": "2026-06-17T13:55:02+00:00",
5
  "description": "Minimal and Neural MLP baselines on the one public sample episode, both scored on all 20 task contracts.",
6
  "task_count": 20,
7
  "method_count": 2,
 
1
  {
2
  "title": "Single-Episode 20-Task Radar",
3
  "status": "pass",
4
+ "generated_at_utc": "2026-06-17T21:17:41+00:00",
5
  "description": "Minimal and Neural MLP baselines on the one public sample episode, both scored on all 20 task contracts.",
6
  "task_count": 20,
7
  "method_count": 2,
metrics/task_method_20_gap_audit.json CHANGED
@@ -1,10 +1,10 @@
1
  {
2
- "generated_at_utc": "2026-06-17T13:55:12+00:00",
3
  "immediate_actions": [
4
  {
5
  "artifact": "docs/data/task_method_20_gap_audit.json",
6
  "id": "gap_audit",
7
- "purpose": "Keep the 64 scoreless cells visible and reproducible."
8
  },
9
  {
10
  "artifact": "scripts/omni/score_model_output_probes.py",
@@ -101,11 +101,11 @@
101
  "proxy_scored_task_count": 0,
102
  "result_record_count": 20,
103
  "scope": "128 selected episodes, held-out test",
104
- "scored_task_count": 10,
105
- "scoreless_task_count": 10,
106
  "status_counts": {
107
- "not_evaluated_in_verified_package": 10,
108
- "scored": 10
109
  }
110
  },
111
  "raw128_neural_mlp": {
@@ -140,10 +140,10 @@
140
  "cosmos3_super_reasoner": 13,
141
  "metadata128_neural_mlp": 14,
142
  "metadata128_simple": 12,
143
- "qwen3_omni_v6_lora": 10
144
  },
145
  "missing_by_status": {
146
- "not_evaluated_in_verified_package": 38,
147
  "not_supported_by_metadata_only_package": 22,
148
  "unsupported_without_required_target": 4
149
  },
@@ -183,15 +183,13 @@
183
  "11 Temporal Order Verification": [
184
  "cosmos3_nano_future_window",
185
  "cosmos3_super_reasoner",
186
- "metadata128_neural_mlp",
187
- "qwen3_omni_v6_lora"
188
  ],
189
  "12 Multimodal Synchronization Detection": [
190
  "cosmos3_nano_future_window",
191
  "cosmos3_super_reasoner",
192
  "metadata128_neural_mlp",
193
- "metadata128_simple",
194
- "qwen3_omni_v6_lora"
195
  ],
196
  "13 Long-Horizon Next-Action Forecasting": [
197
  "cosmos3_nano_future_window",
@@ -241,8 +239,7 @@
241
  "cosmos3_nano_future_window",
242
  "cosmos3_super_reasoner",
243
  "metadata128_neural_mlp",
244
- "metadata128_simple",
245
- "qwen3_omni_v6_lora"
246
  ]
247
  },
248
  "missing_records": [
@@ -519,19 +516,6 @@
519
  "task_label": "Temporal Order Verification",
520
  "task_number": 11
521
  },
522
- {
523
- "method": "Qwen3-Omni v6 LoRA",
524
- "metric_key": "f1",
525
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
526
- "recommended_next_step": "Generate verified model outputs for this task contract and score them against the held-out labels.",
527
- "scope": "multi_episode_128_partial_model_overlay",
528
- "series_id": "qwen3_omni_v6_lora",
529
- "status": "not_evaluated_in_verified_package",
530
- "status_label": "not evaluated",
531
- "task_id": "temporal_order",
532
- "task_label": "Temporal Order Verification",
533
- "task_number": 11
534
- },
535
  {
536
  "method": "Cosmos3-Super Reasoner",
537
  "metric_key": "f1",
@@ -584,19 +568,6 @@
584
  "task_label": "Multimodal Synchronization Detection",
585
  "task_number": 12
586
  },
587
- {
588
- "method": "Qwen3-Omni v6 LoRA",
589
- "metric_key": "f1",
590
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
591
- "recommended_next_step": "Generate verified model outputs for this task contract and score them against the held-out labels.",
592
- "scope": "multi_episode_128_partial_model_overlay",
593
- "series_id": "qwen3_omni_v6_lora",
594
- "status": "not_evaluated_in_verified_package",
595
- "status_label": "not evaluated",
596
- "task_id": "misalignment_detection",
597
- "task_label": "Multimodal Synchronization Detection",
598
- "task_number": 12
599
- },
600
  {
601
  "method": "Cosmos3-Super Reasoner",
602
  "metric_key": "f1",
@@ -1039,19 +1010,6 @@
1039
  "task_label": "Time-to-Next-Transition Regression",
1040
  "task_number": 20
1041
  },
1042
- {
1043
- "method": "Qwen3-Omni v6 LoRA",
1044
- "metric_key": "mae",
1045
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1046
- "recommended_next_step": "Generate verified model outputs for this task contract and score them against the held-out labels.",
1047
- "scope": "multi_episode_128_partial_model_overlay",
1048
- "series_id": "qwen3_omni_v6_lora",
1049
- "status": "not_evaluated_in_verified_package",
1050
- "status_label": "not evaluated",
1051
- "task_id": "time_to_transition",
1052
- "task_label": "Time-to-Next-Transition Regression",
1053
- "task_number": 20
1054
- },
1055
  {
1056
  "method": "Cosmos3-Super Reasoner",
1057
  "metric_key": "mae",
@@ -1125,8 +1083,8 @@
1125
  "method_count": 9,
1126
  "method_task_record_count": 180,
1127
  "proxy_scored_method_task_count": 4,
1128
- "scored_method_task_count": 116,
1129
- "scoreless_method_task_count": 64,
1130
  "task_count": 20
1131
  },
1132
  "source_matrix": "docs/data/task_method_20_result_matrix.json",
 
1
  {
2
+ "generated_at_utc": "2026-06-17T21:17:51+00:00",
3
  "immediate_actions": [
4
  {
5
  "artifact": "docs/data/task_method_20_gap_audit.json",
6
  "id": "gap_audit",
7
+ "purpose": "Keep the 61 scoreless cells visible and reproducible."
8
  },
9
  {
10
  "artifact": "scripts/omni/score_model_output_probes.py",
 
101
  "proxy_scored_task_count": 0,
102
  "result_record_count": 20,
103
  "scope": "128 selected episodes, held-out test",
104
+ "scored_task_count": 13,
105
+ "scoreless_task_count": 7,
106
  "status_counts": {
107
+ "not_evaluated_in_verified_package": 7,
108
+ "scored": 13
109
  }
110
  },
111
  "raw128_neural_mlp": {
 
140
  "cosmos3_super_reasoner": 13,
141
  "metadata128_neural_mlp": 14,
142
  "metadata128_simple": 12,
143
+ "qwen3_omni_v6_lora": 7
144
  },
145
  "missing_by_status": {
146
+ "not_evaluated_in_verified_package": 35,
147
  "not_supported_by_metadata_only_package": 22,
148
  "unsupported_without_required_target": 4
149
  },
 
183
  "11 Temporal Order Verification": [
184
  "cosmos3_nano_future_window",
185
  "cosmos3_super_reasoner",
186
+ "metadata128_neural_mlp"
 
187
  ],
188
  "12 Multimodal Synchronization Detection": [
189
  "cosmos3_nano_future_window",
190
  "cosmos3_super_reasoner",
191
  "metadata128_neural_mlp",
192
+ "metadata128_simple"
 
193
  ],
194
  "13 Long-Horizon Next-Action Forecasting": [
195
  "cosmos3_nano_future_window",
 
239
  "cosmos3_nano_future_window",
240
  "cosmos3_super_reasoner",
241
  "metadata128_neural_mlp",
242
+ "metadata128_simple"
 
243
  ]
244
  },
245
  "missing_records": [
 
516
  "task_label": "Temporal Order Verification",
517
  "task_number": 11
518
  },
519
  {
520
  "method": "Cosmos3-Super Reasoner",
521
  "metric_key": "f1",
 
568
  "task_label": "Multimodal Synchronization Detection",
569
  "task_number": 12
570
  },
571
  {
572
  "method": "Cosmos3-Super Reasoner",
573
  "metric_key": "f1",
 
1010
  "task_label": "Time-to-Next-Transition Regression",
1011
  "task_number": 20
1012
  },
1013
  {
1014
  "method": "Cosmos3-Super Reasoner",
1015
  "metric_key": "mae",
 
1083
  "method_count": 9,
1084
  "method_task_record_count": 180,
1085
  "proxy_scored_method_task_count": 4,
1086
+ "scored_method_task_count": 119,
1087
+ "scoreless_method_task_count": 61,
1088
  "task_count": 20
1089
  },
1090
  "source_matrix": "docs/data/task_method_20_result_matrix.json",
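Reviewer note on the regenerated gap-audit counts: the updated values should satisfy a few arithmetic identities (scored plus scoreless equals the record count, the record count equals methods times tasks, and the per-status missing counts sum to the scoreless total). The sketch below checks them with the numbers from the new side of this diff. `check_counts` is a hypothetical helper, and passing `missing_by_status` as a separate argument is an assumption about how the two objects relate inside the JSON file.

```python
from typing import Dict, List


def check_counts(summary: Dict[str, int], missing_by_status: Dict[str, int]) -> List[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    total = summary["method_task_record_count"]
    scored = summary["scored_method_task_count"]
    scoreless = summary["scoreless_method_task_count"]

    if summary["method_count"] * summary["task_count"] != total:
        problems.append("method_count * task_count != method_task_record_count")
    if scored + scoreless != total:
        problems.append("scored + scoreless != method_task_record_count")
    if sum(missing_by_status.values()) != scoreless:
        problems.append("missing_by_status does not sum to scoreless count")
    return problems


# Values copied from the new side of the gap-audit diff.
summary = {
    "method_count": 9,
    "method_task_record_count": 180,
    "proxy_scored_method_task_count": 4,
    "scored_method_task_count": 119,
    "scoreless_method_task_count": 61,
    "task_count": 20,
}
missing_by_status = {
    "not_evaluated_in_verified_package": 35,
    "not_supported_by_metadata_only_package": 22,
    "unsupported_without_required_target": 4,
}

problems = check_counts(summary, missing_by_status)
```

The same check applied to the old side (116 scored, 64 scoreless, 38 not evaluated) also balances, which suggests the three newly scored Qwen3-Omni cells were moved consistently across all counters.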
metrics/task_method_20_result_matrix.json CHANGED
@@ -1,11 +1,11 @@
1
  {
2
  "title": "Task Method 20-Result Matrix",
3
  "status": "pass",
4
- "generated_at_utc": "2026-06-17T13:55:02+00:00",
5
  "task_count": 20,
6
  "method_count": 9,
7
  "method_task_record_count": 180,
8
- "scored_method_task_count": 116,
9
  "series": [
10
  {
11
  "id": "minimal",
@@ -161,17 +161,17 @@
161
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
162
  "plotted_as": "colored point overlay",
163
  "result_record_count": 20,
164
- "scored_task_count": 10,
165
- "covered_task_count": 10,
166
  "proxy_scored_task_count": 0,
167
- "scoreless_task_count": 10,
168
  "unsupported_task_count": 0,
169
- "not_evaluated_task_count": 10,
170
  "status_counts": {
171
- "not_evaluated_in_verified_package": 10,
172
- "scored": 10
173
  },
174
- "coverage_fraction": 0.5,
175
  "result_record_fraction": 1.0
176
  },
177
  {
@@ -1958,17 +1958,17 @@
1958
  "task_label": "Temporal Order Verification",
1959
  "series_id": "qwen3_omni_v6_lora",
1960
  "method": "Qwen3-Omni v6 LoRA",
1961
- "status": "not_evaluated_in_verified_package",
1962
- "status_label": "not evaluated",
1963
- "scored": false,
1964
  "proxy_scored": false,
1965
- "raw": null,
1966
- "raw_text": "n/a",
1967
- "normalized_score": null,
1968
- "metric_key": "f1",
1969
- "source": null,
1970
  "scope": "multi_episode_128_partial_model_overlay",
1971
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
1972
  },
1973
  {
1974
  "task_number": 11,
@@ -2120,17 +2120,17 @@
2120
  "task_label": "Multimodal Synchronization Detection",
2121
  "series_id": "qwen3_omni_v6_lora",
2122
  "method": "Qwen3-Omni v6 LoRA",
2123
- "status": "not_evaluated_in_verified_package",
2124
- "status_label": "not evaluated",
2125
- "scored": false,
2126
  "proxy_scored": false,
2127
- "raw": null,
2128
- "raw_text": "n/a",
2129
- "normalized_score": null,
2130
- "metric_key": "f1",
2131
- "source": null,
2132
  "scope": "multi_episode_128_partial_model_overlay",
2133
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
2134
  },
2135
  {
2136
  "task_number": 12,
@@ -3416,17 +3416,17 @@
3416
  "task_label": "Time-to-Next-Transition Regression",
3417
  "series_id": "qwen3_omni_v6_lora",
3418
  "method": "Qwen3-Omni v6 LoRA",
3419
- "status": "not_evaluated_in_verified_package",
3420
- "status_label": "not evaluated",
3421
- "scored": false,
3422
  "proxy_scored": false,
3423
- "raw": null,
3424
- "raw_text": "n/a",
3425
- "normalized_score": null,
3426
- "metric_key": "mae",
3427
- "source": null,
3428
  "scope": "multi_episode_128_partial_model_overlay",
3429
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
3430
  },
3431
  {
3432
  "task_number": 20,
 
1
  {
2
  "title": "Task Method 20-Result Matrix",
3
  "status": "pass",
4
+ "generated_at_utc": "2026-06-17T21:17:41+00:00",
5
  "task_count": 20,
6
  "method_count": 9,
7
  "method_task_record_count": 180,
8
+ "scored_method_task_count": 119,
9
  "series": [
10
  {
11
  "id": "minimal",
 
161
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
162
  "plotted_as": "colored point overlay",
163
  "result_record_count": 20,
164
+ "scored_task_count": 13,
165
+ "covered_task_count": 13,
166
  "proxy_scored_task_count": 0,
167
+ "scoreless_task_count": 7,
168
  "unsupported_task_count": 0,
169
+ "not_evaluated_task_count": 7,
170
  "status_counts": {
171
+ "not_evaluated_in_verified_package": 7,
172
+ "scored": 13
173
  },
174
+ "coverage_fraction": 0.65,
175
  "result_record_fraction": 1.0
176
  },
177
  {
 
1958
  "task_label": "Temporal Order Verification",
1959
  "series_id": "qwen3_omni_v6_lora",
1960
  "method": "Qwen3-Omni v6 LoRA",
1961
+ "status": "scored",
1962
+ "status_label": "scored",
1963
+ "scored": true,
1964
  "proxy_scored": false,
1965
+ "raw": 0.40984631701404173,
1966
+ "raw_text": "0.4098",
1967
+ "normalized_score": 0.40984631701404173,
1968
+ "metric_key": "temporal_order_f1",
1969
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/temporal_order/metrics.json",
1970
  "scope": "multi_episode_128_partial_model_overlay",
1971
+ "reason": null
1972
  },
1973
  {
1974
  "task_number": 11,
 
2120
  "task_label": "Multimodal Synchronization Detection",
2121
  "series_id": "qwen3_omni_v6_lora",
2122
  "method": "Qwen3-Omni v6 LoRA",
2123
+ "status": "scored",
2124
+ "status_label": "scored",
2125
+ "scored": true,
2126
  "proxy_scored": false,
2127
+ "raw": 0.3344936184319576,
2128
+ "raw_text": "0.3345",
2129
+ "normalized_score": 0.3344936184319576,
2130
+ "metric_key": "misalignment_detection_f1",
2131
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/misalignment_detection/metrics.json",
2132
  "scope": "multi_episode_128_partial_model_overlay",
2133
+ "reason": null
2134
  },
2135
  {
2136
  "task_number": 12,
 
3416
  "task_label": "Time-to-Next-Transition Regression",
3417
  "series_id": "qwen3_omni_v6_lora",
3418
  "method": "Qwen3-Omni v6 LoRA",
3419
+ "status": "scored",
3420
+ "status_label": "scored",
3421
+ "scored": true,
3422
  "proxy_scored": false,
3423
+ "raw": 134.0687422166874,
3424
+ "raw_text": "134.07",
3425
+ "normalized_score": 0.07859666766782253,
3426
+ "metric_key": "time_to_transition_mae",
3427
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/time_to_transition/metrics.json",
3428
  "scope": "multi_episode_128_partial_model_overlay",
3429
+ "reason": null
3430
  },
3431
  {
3432
  "task_number": 20,
metrics/task_surface_integrity.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
  "status": "pass",
3
- "generated_at_utc": "2026-06-17T20:46:02+00:00",
4
  "summary": {
5
  "task_count": 12,
6
  "expected_task_count": 12,
 
1
  {
2
  "status": "pass",
3
+ "generated_at_utc": "2026-06-17T21:25:26+00:00",
4
  "summary": {
5
  "task_count": 12,
6
  "expected_task_count": 12,
metrics/unified_task_model_radar.json CHANGED
@@ -1,11 +1,11 @@
1
  {
2
  "title": "Unified 20-Task Model Radar",
3
  "status": "pass",
4
- "generated_at_utc": "2026-06-17T13:55:02+00:00",
5
  "task_count": 20,
6
  "method_count": 9,
7
  "method_task_record_count": 180,
8
- "scored_method_task_count": 116,
9
  "normalization_policy": {
10
  "higher_is_better": "bounded metrics are plotted directly on 0-1 axes after clipping to [0, 1]",
11
  "lower_is_better": "lower-error metrics are converted to best_observed_value / raw_value within the same task",
@@ -170,17 +170,17 @@
170
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
171
  "plotted_as": "colored point overlay",
172
  "result_record_count": 20,
173
- "scored_task_count": 10,
174
- "covered_task_count": 10,
175
  "proxy_scored_task_count": 0,
176
- "scoreless_task_count": 10,
177
  "unsupported_task_count": 0,
178
- "not_evaluated_task_count": 10,
179
  "status_counts": {
180
- "not_evaluated_in_verified_package": 10,
181
- "scored": 10
182
  },
183
- "coverage_fraction": 0.5,
184
  "result_record_fraction": 1.0
185
  },
186
  {
@@ -1375,6 +1375,17 @@
1375
  "raw_text": "0.8520",
1376
  "status_label": "scored"
1377
  },
1378
  "metadata128_simple": {
1379
  "raw": 0.4198864140782312,
1380
  "metric_key": "f1",
@@ -1419,17 +1430,6 @@
1419
  "raw_text": "n/a",
1420
  "status_label": "not supported"
1421
  },
1422
- "qwen3_omni_v6_lora": {
1423
- "raw": null,
1424
- "metric_key": "f1",
1425
- "source": null,
1426
- "scope": "multi_episode_128_partial_model_overlay",
1427
- "status": "not_evaluated_in_verified_package",
1428
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1429
- "normalized_score": null,
1430
- "raw_text": "n/a",
1431
- "status_label": "not evaluated"
1432
- },
1433
  "cosmos3_super_reasoner": {
1434
  "raw": null,
1435
  "metric_key": "f1",
@@ -1486,6 +1486,17 @@
1486
  "raw_text": "0.7153",
1487
  "status_label": "scored"
1488
  },
1489
  "metadata128_simple": {
1490
  "raw": null,
1491
  "metric_key": "f1",
@@ -1530,17 +1541,6 @@
1530
  "raw_text": "n/a",
1531
  "status_label": "not supported"
1532
  },
1533
- "qwen3_omni_v6_lora": {
1534
- "raw": null,
1535
- "metric_key": "f1",
1536
- "source": null,
1537
- "scope": "multi_episode_128_partial_model_overlay",
1538
- "status": "not_evaluated_in_verified_package",
1539
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
1540
- "normalized_score": null,
1541
- "raw_text": "n/a",
1542
- "status_label": "not evaluated"
1543
- },
1544
  "cosmos3_super_reasoner": {
1545
  "raw": null,
1546
  "metric_key": "f1",
@@ -2374,6 +2374,17 @@
2374
  "raw_text": "10.55",
2375
  "status_label": "scored"
2376
  },
2377
  "raw128_simple": {
2378
  "raw": 52.32759475708008,
2379
  "metric_key": "mae",
@@ -2418,17 +2429,6 @@
2418
  "raw_text": "n/a",
2419
  "status_label": "not supported"
2420
  },
2421
- "qwen3_omni_v6_lora": {
2422
- "raw": null,
2423
- "metric_key": "mae",
2424
- "source": null,
2425
- "scope": "multi_episode_128_partial_model_overlay",
2426
- "status": "not_evaluated_in_verified_package",
2427
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score",
2428
- "normalized_score": null,
2429
- "raw_text": "n/a",
2430
- "status_label": "not evaluated"
2431
- },
2432
  "cosmos3_super_reasoner": {
2433
  "raw": null,
2434
  "metric_key": "mae",
@@ -2492,7 +2492,7 @@
2492
  "title": "Qwen3-Omni v6 LoRA",
2493
  "status": "verified",
2494
  "task_aligned_axes": "Qwen3",
2495
- "coverage": "20 records / 10 scored task-aligned axes",
2496
  "headline": "JSON validity 0.9990; action macro-F1 0.0029",
2497
  "source": "results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_multiscale_cap96_v6_rank64_lr5e5_full8gpu_lora_eval_test_full/eval/metrics.json"
2498
  },
@@ -4256,17 +4256,17 @@
4256
  "task_label": "Temporal Order Verification",
4257
  "series_id": "qwen3_omni_v6_lora",
4258
  "method": "Qwen3-Omni v6 LoRA",
4259
- "status": "not_evaluated_in_verified_package",
4260
- "status_label": "not evaluated",
4261
- "scored": false,
4262
  "proxy_scored": false,
4263
- "raw": null,
4264
- "raw_text": "n/a",
4265
- "normalized_score": null,
4266
- "metric_key": "f1",
4267
- "source": null,
4268
  "scope": "multi_episode_128_partial_model_overlay",
4269
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
4270
  },
4271
  {
4272
  "task_number": 11,
@@ -4418,17 +4418,17 @@
4418
  "task_label": "Multimodal Synchronization Detection",
4419
  "series_id": "qwen3_omni_v6_lora",
4420
  "method": "Qwen3-Omni v6 LoRA",
4421
- "status": "not_evaluated_in_verified_package",
4422
- "status_label": "not evaluated",
4423
- "scored": false,
4424
  "proxy_scored": false,
4425
- "raw": null,
4426
- "raw_text": "n/a",
4427
- "normalized_score": null,
4428
- "metric_key": "f1",
4429
- "source": null,
4430
  "scope": "multi_episode_128_partial_model_overlay",
4431
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
4432
  },
4433
  {
4434
  "task_number": 12,
@@ -5714,17 +5714,17 @@
5714
  "task_label": "Time-to-Next-Transition Regression",
5715
  "series_id": "qwen3_omni_v6_lora",
5716
  "method": "Qwen3-Omni v6 LoRA",
5717
- "status": "not_evaluated_in_verified_package",
5718
- "status_label": "not evaluated",
5719
- "scored": false,
5720
  "proxy_scored": false,
5721
- "raw": null,
5722
- "raw_text": "n/a",
5723
- "normalized_score": null,
5724
- "metric_key": "mae",
5725
- "source": null,
5726
  "scope": "multi_episode_128_partial_model_overlay",
5727
- "reason": "the verified public model package did not ask this branch to emit that task target; a new task-specific evaluation package is required for a numeric score"
5728
  },
5729
  {
5730
  "task_number": 20,
 
1
  {
2
  "title": "Unified 20-Task Model Radar",
3
  "status": "pass",
4
+ "generated_at_utc": "2026-06-17T21:17:41+00:00",
5
  "task_count": 20,
6
  "method_count": 9,
7
  "method_task_record_count": 180,
8
+ "scored_method_task_count": 119,
9
  "normalization_policy": {
10
  "higher_is_better": "bounded metrics are plotted directly on 0-1 axes after clipping to [0, 1]",
11
  "lower_is_better": "lower-error metrics are converted to best_observed_value / raw_value within the same task",
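The two `normalization_policy` rules above can be sketched in a few lines of Python. This is a minimal illustration, not the generator's actual code: the function names are hypothetical, "best observed" is assumed to mean the lowest error among the methods scored on the same task, and the sample values are made up.

```python
from typing import Optional


def normalize_higher_is_better(raw: float) -> float:
    """Clip a bounded metric (for example F1) to the [0, 1] radar axis."""
    return max(0.0, min(1.0, raw))


def normalize_lower_is_better(raw: float, best_observed: float) -> Optional[float]:
    """Map an error metric (for example MAE) to best_observed / raw."""
    if raw <= 0:
        # Assumption: a non-positive error cannot be placed on the axis.
        return None
    return best_observed / raw


# Illustrative values only, not taken from the result matrix.
task_mae = {"method_a": 10.0, "method_b": 40.0}
# Assumption: the best observed value is the minimum error within the task.
best = min(task_mae.values())
scores = {name: normalize_lower_is_better(v, best) for name, v in task_mae.items()}
```

Under this reading, the best method on a lower-is-better task always lands at 1.0 and every other method falls in proportion to its error. The records in this diff are consistent with it: the F1 probes carry identical `raw` and `normalized_score` values, while the time-to-transition MAE of 134.07 maps to a `normalized_score` of about 0.0786.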
 
170
  "method_detail": "Verified held-out Qwen3-Omni v6 LoRA metrics, plus task 16 and any completed private-GPU future-task probes scored from task-specific JSON.",
171
  "plotted_as": "colored point overlay",
172
  "result_record_count": 20,
173
+ "scored_task_count": 13,
174
+ "covered_task_count": 13,
175
  "proxy_scored_task_count": 0,
176
+ "scoreless_task_count": 7,
177
  "unsupported_task_count": 0,
178
+ "not_evaluated_task_count": 7,
179
  "status_counts": {
180
+ "not_evaluated_in_verified_package": 7,
181
+ "scored": 13
182
  },
183
+ "coverage_fraction": 0.65,
184
  "result_record_fraction": 1.0
185
  },
186
  {
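The per-method counters in the hunk above appear to follow directly from `status_counts`. A minimal sketch of that arithmetic, assuming covered tasks are the scored plus proxy-scored ones and that every other record is scoreless (`summarize_coverage` is a hypothetical helper, not a function from this repository):

```python
def summarize_coverage(status_counts: dict, record_count: int) -> dict:
    """Derive coverage counters from a per-method status tally."""
    scored = status_counts.get("scored", 0)
    proxy = status_counts.get("proxy_scored", 0)
    # Assumption: proxy-scored tasks count as covered.
    covered = scored + proxy
    return {
        "covered_task_count": covered,
        "scoreless_task_count": record_count - covered,
        "coverage_fraction": covered / record_count,
    }


# Status tally for Qwen3-Omni v6 LoRA after this commit.
summary = summarize_coverage(
    {"not_evaluated_in_verified_package": 7, "scored": 13},
    record_count=20,
)
```

With the tally shown in the diff, this gives 13 covered tasks, 7 scoreless tasks, and a coverage fraction of 0.65, matching the updated fields.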
 
1375
  "raw_text": "0.8520",
1376
  "status_label": "scored"
1377
  },
1378
+ "qwen3_omni_v6_lora": {
1379
+ "raw": 0.40984631701404173,
1380
+ "metric_key": "temporal_order_f1",
1381
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/temporal_order/metrics.json",
1382
+ "scope": "multi_episode_128_partial_model_overlay",
1383
+ "status": "scored",
1384
+ "reason": null,
1385
+ "normalized_score": 0.40984631701404173,
1386
+ "raw_text": "0.4098",
1387
+ "status_label": "scored"
1388
+ },
1389
  "metadata128_simple": {
1390
  "raw": 0.4198864140782312,
1391
  "metric_key": "f1",
 
1430
  "raw_text": "n/a",
1431
  "status_label": "not supported"
1432
  },
1433
  "cosmos3_super_reasoner": {
1434
  "raw": null,
1435
  "metric_key": "f1",
 
1486
  "raw_text": "0.7153",
1487
  "status_label": "scored"
1488
  },
1489
+ "qwen3_omni_v6_lora": {
1490
+ "raw": 0.3344936184319576,
1491
+ "metric_key": "misalignment_detection_f1",
1492
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/misalignment_detection/metrics.json",
1493
+ "scope": "multi_episode_128_partial_model_overlay",
1494
+ "status": "scored",
1495
+ "reason": null,
1496
+ "normalized_score": 0.3344936184319576,
1497
+ "raw_text": "0.3345",
1498
+ "status_label": "scored"
1499
+ },
1500
  "metadata128_simple": {
1501
  "raw": null,
1502
  "metric_key": "f1",
 
1541
  "raw_text": "n/a",
1542
  "status_label": "not supported"
1543
  },
1544
  "cosmos3_super_reasoner": {
1545
  "raw": null,
1546
  "metric_key": "f1",
 
2374
  "raw_text": "10.55",
2375
  "status_label": "scored"
2376
  },
2377
+ "qwen3_omni_v6_lora": {
2378
+ "raw": 134.0687422166874,
2379
+ "metric_key": "time_to_transition_mae",
2380
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/time_to_transition/metrics.json",
2381
+ "scope": "multi_episode_128_partial_model_overlay",
2382
+ "status": "scored",
2383
+ "reason": null,
2384
+ "normalized_score": 0.07859666766782253,
2385
+ "raw_text": "134.07",
2386
+ "status_label": "scored"
2387
+ },
2388
  "raw128_simple": {
2389
  "raw": 52.32759475708008,
2390
  "metric_key": "mae",
 
2429
  "raw_text": "n/a",
2430
  "status_label": "not supported"
2431
  },
2432
  "cosmos3_super_reasoner": {
2433
  "raw": null,
2434
  "metric_key": "mae",
 
2492
  "title": "Qwen3-Omni v6 LoRA",
2493
  "status": "verified",
2494
  "task_aligned_axes": "Qwen3",
2495
+ "coverage": "20 records / 13 scored task-aligned axes",
2496
  "headline": "JSON validity 0.9990; action macro-F1 0.0029",
2497
  "source": "results/omni_finetune/verified_public/xperience10m_qwen3_omni_128ep_multiscale_cap96_v6_rank64_lr5e5_full8gpu_lora_eval_test_full/eval/metrics.json"
2498
  },
 
4256
  "task_label": "Temporal Order Verification",
4257
  "series_id": "qwen3_omni_v6_lora",
4258
  "method": "Qwen3-Omni v6 LoRA",
4259
+ "status": "scored",
4260
+ "status_label": "scored",
4261
+ "scored": true,
4262
  "proxy_scored": false,
4263
+ "raw": 0.40984631701404173,
4264
+ "raw_text": "0.4098",
4265
+ "normalized_score": 0.40984631701404173,
4266
+ "metric_key": "temporal_order_f1",
4267
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/temporal_order/metrics.json",
4268
  "scope": "multi_episode_128_partial_model_overlay",
4269
+ "reason": null
4270
  },
4271
  {
4272
  "task_number": 11,
 
4418
  "task_label": "Multimodal Synchronization Detection",
4419
  "series_id": "qwen3_omni_v6_lora",
4420
  "method": "Qwen3-Omni v6 LoRA",
4421
+ "status": "scored",
4422
+ "status_label": "scored",
4423
+ "scored": true,
4424
  "proxy_scored": false,
4425
+ "raw": 0.3344936184319576,
4426
+ "raw_text": "0.3345",
4427
+ "normalized_score": 0.3344936184319576,
4428
+ "metric_key": "misalignment_detection_f1",
4429
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/misalignment_detection/metrics.json",
4430
  "scope": "multi_episode_128_partial_model_overlay",
4431
+ "reason": null
4432
  },
4433
  {
4434
  "task_number": 12,
 
5714
  "task_label": "Time-to-Next-Transition Regression",
5715
  "series_id": "qwen3_omni_v6_lora",
5716
  "method": "Qwen3-Omni v6 LoRA",
5717
+ "status": "scored",
5718
+ "status_label": "scored",
5719
+ "scored": true,
5720
  "proxy_scored": false,
5721
+ "raw": 134.0687422166874,
5722
+ "raw_text": "134.07",
5723
+ "normalized_score": 0.07859666766782253,
5724
+ "metric_key": "time_to_transition_mae",
5725
+ "source": "results/omni_finetune/xperience10m_qwen3_omni_v6_order_sync_time_probes_a100_20260617T132500Z/time_to_transition/metrics.json",
5726
  "scope": "multi_episode_128_partial_model_overlay",
5727
+ "reason": null
5728
  },
5729
  {
5730
  "task_number": 20,
metrics/website_integrity.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
  "status": "pass",
3
- "generated_at_utc": "2026-06-17T21:12:34+00:00",
4
  "docs_root": "docs",
5
  "site_base": "/ropedia-xperience-10m-task-suite/",
6
  "summary": {
@@ -316,7 +316,7 @@
316
  },
317
  {
318
  "path": "data/episode128_task_model_radar.json",
319
- "bytes": 187388,
320
  "top_level_type": "dict"
321
  },
322
  {
@@ -486,12 +486,12 @@
486
  },
487
  {
488
  "path": "data/task_method_20_gap_audit.json",
489
- "bytes": 55745,
490
  "top_level_type": "dict"
491
  },
492
  {
493
  "path": "data/task_method_20_result_matrix.json",
494
- "bytes": 129749,
495
  "top_level_type": "dict"
496
  },
497
  {
@@ -526,7 +526,7 @@
526
  },
527
  {
528
  "path": "data/unified_task_model_radar.json",
529
- "bytes": 231240,
530
  "top_level_type": "dict"
531
  },
532
  {
@@ -566,7 +566,7 @@
566
  {
567
  "path": "assets/charts/episode128_task_model_radar.svg",
568
  "exists": true,
569
- "bytes": 44044,
570
  "format": "SVG",
571
  "has_viewbox": true
572
  },
@@ -636,7 +636,7 @@
636
  {
637
  "path": "assets/charts/unified_task_model_radar.svg",
638
  "exists": true,
639
- "bytes": 50060,
640
  "format": "SVG",
641
  "has_viewbox": true
642
  },
 
1
  {
2
  "status": "pass",
3
+ "generated_at_utc": "2026-06-17T21:25:27+00:00",
4
  "docs_root": "docs",
5
  "site_base": "/ropedia-xperience-10m-task-suite/",
6
  "summary": {
 
316
  },
317
  {
318
  "path": "data/episode128_task_model_radar.json",
319
+ "bytes": 187309,
320
  "top_level_type": "dict"
321
  },
322
  {
 
486
  },
487
  {
488
  "path": "data/task_method_20_gap_audit.json",
489
+ "bytes": 53574,
490
  "top_level_type": "dict"
491
  },
492
  {
493
  "path": "data/task_method_20_result_matrix.json",
494
+ "bytes": 129707,
495
  "top_level_type": "dict"
496
  },
497
  {
 
526
  },
527
  {
528
  "path": "data/unified_task_model_radar.json",
529
+ "bytes": 231161,
530
  "top_level_type": "dict"
531
  },
532
  {
 
566
  {
567
  "path": "assets/charts/episode128_task_model_radar.svg",
568
  "exists": true,
569
+ "bytes": 44378,
570
  "format": "SVG",
571
  "has_viewbox": true
572
  },
 
636
  {
637
  "path": "assets/charts/unified_task_model_radar.svg",
638
  "exists": true,
639
+ "bytes": 50394,
640
  "format": "SVG",
641
  "has_viewbox": true
642
  },