<svg xmlns="http://www.w3.org/2000/svg" width="1500" height="1840" viewBox="0 0 1500 1840">
<defs><marker id="arrow2" viewBox="0 0 10 10" refX="8" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse"><path d="M 0 0 L 10 5 L 0 10 z" fill="#cbd5e1"/></marker></defs>
<rect width="100%" height="100%" fill="#ffffff"/>
<text x="60" y="56" font-family="Arial, sans-serif" font-size="34" font-weight="700" fill="#10141f">Minimal Architectures for the 12 Ropedia Episode Tasks</text>
<text x="60" y="88" font-family="Arial, sans-serif" font-size="16" fill="#5b6475">Generated from scripts/episode_task_suite.py semantics and committed summary metrics. These are minimal baselines, not deep foundation models.</text>
<line x1="382" y1="177" x2="396" y2="177" stroke="#cbd5e1" stroke-width="3" marker-end="url(#arrow2)"/>
<line x1="732" y1="177" x2="746" y2="177" stroke="#cbd5e1" stroke-width="3" marker-end="url(#arrow2)"/>
<line x1="1092" y1="177" x2="1106" y2="177" stroke="#cbd5e1" stroke-width="3" marker-end="url(#arrow2)"/>
<rect x="60" y="122" width="310" height="110" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="60" y="122" width="8" height="110" rx="4" fill="#1f63e9"/>
<text x="84" y="153" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#10141f">Shared episode windows</text>
<text x="84" y="180" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">5,821 frames -&gt; 1,161 windows</text>
<text x="84" y="198" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">20-frame window, 5-frame stride</text>
<text x="84" y="216" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">chronological 70/30 split</text>
<rect x="410" y="122" width="310" height="110" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="410" y="122" width="8" height="110" rx="4" fill="#008b9a"/>
<text x="434" y="153" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#10141f">Feature vector</text>
<text x="434" y="180" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X_all = 8,378 dimensions</text>
<text x="434" y="198" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">17 named modality blocks</text>
<text x="434" y="216" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">mean/std fit on train only</text>
<rect x="760" y="122" width="320" height="110" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="760" y="122" width="8" height="110" rx="4" fill="#0a7f55"/>
<text x="784" y="153" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#10141f">Reusable heads</text>
<text x="784" y="180" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">linear softmax classifier</text>
<text x="784" y="198" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">dual ridge regression/projection</text>
<text x="784" y="216" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">multi-label logistic + cosine rank</text>
<rect x="1120" y="122" width="320" height="110" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="1120" y="122" width="8" height="110" rx="4" fill="#b65b04"/>
<text x="1144" y="153" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#10141f">Artifacts</text>
<text x="1144" y="180" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">metrics.json, predictions.csv/npz</text>
<text x="1144" y="198" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">model.npz with scaler and weights</text>
<text x="1144" y="216" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">summary_report.json: source of numbers</text>
<rect x="60" y="270" width="660" height="100" rx="8" fill="#f8fafc" stroke="#dce2ec"/>
<text x="78" y="303" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#1f63e9">Softmax classifier</text>
<text x="78" y="330" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">logits = z(X)W + b; cross-entropy (CE) + L2 penalty, with class weights</text>
<rect x="780" y="270" width="660" height="100" rx="8" fill="#f8fafc" stroke="#dce2ec"/>
<text x="798" y="303" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#0a7f55">Ridge regression/projection</text>
<text x="798" y="330" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">closed-form dual ridge on z(X), z(Y); used for forecast and reconstruction</text>
<rect x="60" y="394" width="660" height="100" rx="8" fill="#f8fafc" stroke="#dce2ec"/>
<text x="78" y="427" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#008b9a">Ridge + cosine ranking</text>
<text x="78" y="454" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">project one modality into another feature space, then rank candidates by</text>
<text x="78" y="472" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">cosine similarity</text>
<rect x="780" y="394" width="660" height="100" rx="8" fill="#f8fafc" stroke="#dce2ec"/>
<text x="798" y="427" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#b65b04">Multi-label logistic</text>
<text x="798" y="454" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">sigmoid heads for object vocabulary; threshold 0.5 with top-1 fallback</text>
<rect x="60" y="540" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="60" y="540" width="8" height="248" rx="4" fill="#1f63e9"/>
<rect x="80" y="558" width="96" height="24" rx="6" fill="#f8fafc" stroke="#1f63e9"/>
<text x="128" y="575" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#1f63e9">softmax</text>
<text x="80" y="612" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">timeline_action</text>
<text x="80" y="644" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">INPUT</text>
<text x="152" y="644" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X_all window, 8,378d</text>
<text x="80" y="669" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">HEAD</text>
<text x="152" y="669" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -&gt; linear softmax, class-weighted</text>
<text x="152" y="686" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">CE + L2</text>
<text x="80" y="711" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">OUTPUT</text>
<text x="152" y="711" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">current action class, 18 classes</text>
<text x="80" y="736" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">METRIC</text>
<text x="152" y="736" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">macro-F1 0.0500</text>
<rect x="530" y="540" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="530" y="540" width="8" height="248" rx="4" fill="#1f63e9"/>
<rect x="550" y="558" width="96" height="24" rx="6" fill="#f8fafc" stroke="#1f63e9"/>
<text x="598" y="575" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#1f63e9">softmax</text>
<text x="550" y="612" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">timeline_subtask</text>
<text x="550" y="644" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">INPUT</text>
<text x="622" y="644" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X_all window, 8,378d</text>
<text x="550" y="669" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">HEAD</text>
<text x="622" y="669" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -&gt; linear softmax, class-weighted</text>
<text x="622" y="686" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">CE + L2</text>
<text x="550" y="711" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">OUTPUT</text>
<text x="622" y="711" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">current subtask class, 14 classes</text>
<text x="550" y="736" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">METRIC</text>
<text x="622" y="736" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">macro-F1 0.0495</text>
<rect x="1000" y="540" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="1000" y="540" width="8" height="248" rx="4" fill="#1f63e9"/>
<rect x="1020" y="558" width="96" height="24" rx="6" fill="#f8fafc" stroke="#1f63e9"/>
<text x="1068" y="575" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#1f63e9">softmax</text>
<text x="1020" y="612" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">transition_detection</text>
<text x="1020" y="644" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">INPUT</text>
<text x="1092" y="644" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X_all window, 8,378d</text>
<text x="1020" y="669" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">HEAD</text>
<text x="1092" y="669" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -&gt; linear softmax, class-weighted</text>
<text x="1092" y="686" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">CE + L2</text>
<text x="1020" y="711" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">OUTPUT</text>
<text x="1092" y="711" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">steady vs transition near action boundary</text>
<text x="1020" y="736" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">METRIC</text>
<text x="1092" y="736" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">macro-F1 0.6552; boundary-F1 0.2143</text>
<rect x="60" y="818" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="60" y="818" width="8" height="248" rx="4" fill="#1f63e9"/>
<rect x="80" y="836" width="96" height="24" rx="6" fill="#f8fafc" stroke="#1f63e9"/>
<text x="128" y="853" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#1f63e9">softmax</text>
<text x="80" y="890" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">next_action</text>
<text x="80" y="922" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">INPUT</text>
<text x="152" y="922" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X_all at time t, 8,378d</text>
<text x="80" y="947" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">HEAD</text>
<text x="152" y="947" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -&gt; linear softmax, class-weighted</text>
<text x="152" y="964" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">CE + L2</text>
<text x="80" y="989" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">OUTPUT</text>
<text x="152" y="989" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">action at t+20 frames</text>
<text x="80" y="1014" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">METRIC</text>
<text x="152" y="1014" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">macro-F1 0.0593</text>
<rect x="530" y="818" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="530" y="818" width="8" height="248" rx="4" fill="#0a7f55"/>
<rect x="550" y="836" width="96" height="24" rx="6" fill="#f8fafc" stroke="#0a7f55"/>
<text x="598" y="853" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#0a7f55">ridge</text>
<text x="550" y="890" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">hand_trajectory_forecast</text>
<text x="550" y="922" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">INPUT</text>
<text x="622" y="922" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X_all at time t, 8,378d</text>
<text x="550" y="947" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">HEAD</text>
<text x="622" y="947" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score X/Y -&gt; dual ridge regression,</text>
<text x="622" y="964" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">L2=10</text>
<text x="550" y="989" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">OUTPUT</text>
<text x="622" y="989" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">future hand joints, 1,260d</text>
<text x="550" y="1014" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">METRIC</text>
<text x="622" y="1014" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">MPJPE 0.8223</text>
<rect x="1000" y="818" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="1000" y="818" width="8" height="248" rx="4" fill="#1f63e9"/>
<rect x="1020" y="836" width="96" height="24" rx="6" fill="#f8fafc" stroke="#1f63e9"/>
<text x="1068" y="853" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#1f63e9">softmax</text>
<text x="1020" y="890" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">contact_prediction</text>
<text x="1020" y="922" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">INPUT</text>
<text x="1092" y="922" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X without contact/text leakage, 7,335d</text>
<text x="1020" y="947" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">HEAD</text>
<text x="1092" y="947" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -&gt; linear softmax on observed</text>
<text x="1092" y="964" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">labels</text>
<text x="1020" y="989" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">OUTPUT</text>
<text x="1092" y="989" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">any body contact in window; degenerate</text>
<text x="1092" y="1006" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">one-class sample</text>
<text x="1020" y="1031" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">METRIC</text>
<text x="1092" y="1031" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">macro-F1 1.0000 (trivial: only one class present)</text>
<rect x="60" y="1096" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="60" y="1096" width="8" height="248" rx="4" fill="#b65b04"/>
<rect x="80" y="1114" width="96" height="24" rx="6" fill="#f8fafc" stroke="#b65b04"/>
<text x="128" y="1131" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#b65b04">multilabel</text>
<text x="80" y="1168" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">object_relevance</text>
<text x="80" y="1200" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#b65b04">INPUT</text>
<text x="152" y="1200" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X without caption text, 7,482d</text>
<text x="80" y="1225" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#b65b04">HEAD</text>
<text x="152" y="1225" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -&gt; sigmoid multi-label logistic,</text>
<text x="152" y="1242" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">weighted</text>
<text x="80" y="1267" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#b65b04">OUTPUT</text>
<text x="152" y="1267" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">multi-hot object set, 34 objects</text>
<text x="80" y="1292" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#b65b04">METRIC</text>
<text x="152" y="1292" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">micro-F1 0.1839</text>
<rect x="530" y="1096" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="530" y="1096" width="8" height="248" rx="4" fill="#008b9a"/>
<rect x="550" y="1114" width="96" height="24" rx="6" fill="#f8fafc" stroke="#008b9a"/>
<text x="598" y="1131" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#008b9a">ridge+rank</text>
<text x="550" y="1168" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">caption_grounding</text>
<text x="550" y="1200" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">INPUT</text>
<text x="622" y="1200" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">sensor 7,482d -&gt; text space 896d</text>
<text x="550" y="1225" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">HEAD</text>
<text x="622" y="1225" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">ridge projection, then cosine ranking</text>
<text x="550" y="1250" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">OUTPUT</text>
<text x="622" y="1250" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">text query retrieves matching time window</text>
<text x="550" y="1275" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">METRIC</text>
<text x="622" y="1275" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">MRR 0.0172</text>
<rect x="1000" y="1096" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="1000" y="1096" width="8" height="248" rx="4" fill="#008b9a"/>
<rect x="1020" y="1114" width="96" height="24" rx="6" fill="#f8fafc" stroke="#008b9a"/>
<text x="1068" y="1131" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#008b9a">ridge+rank</text>
<text x="1020" y="1168" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">cross_modal_retrieval</text>
<text x="1020" y="1200" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">INPUT</text>
<text x="1092" y="1200" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">motion/IMU/camera 2,247d -&gt; visual 5,096d</text>
<text x="1020" y="1225" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">HEAD</text>
<text x="1092" y="1225" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">ridge projection, then cosine ranking</text>
<text x="1020" y="1250" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">OUTPUT</text>
<text x="1092" y="1250" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">retrieve matching depth/video window</text>
<text x="1020" y="1275" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">METRIC</text>
<text x="1092" y="1275" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">top-5 0.3764</text>
<rect x="60" y="1374" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="60" y="1374" width="8" height="248" rx="4" fill="#0a7f55"/>
<rect x="80" y="1392" width="96" height="24" rx="6" fill="#f8fafc" stroke="#0a7f55"/>
<text x="128" y="1409" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#0a7f55">ridge</text>
<text x="80" y="1446" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">modality_reconstruction</text>
<text x="80" y="1478" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">INPUT</text>
<text x="152" y="1478" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">motion/IMU/camera 2,247d</text>
<text x="80" y="1503" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">HEAD</text>
<text x="152" y="1503" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score X/Y -&gt; dual ridge regression,</text>
<text x="152" y="1520" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">L2=10</text>
<text x="80" y="1545" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">OUTPUT</text>
<text x="152" y="1545" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">depth/video feature vector, 5,096d</text>
<text x="80" y="1570" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">METRIC</text>
<text x="152" y="1570" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">R&#178; -0.0160</text>
<rect x="530" y="1374" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="530" y="1374" width="8" height="248" rx="4" fill="#1f63e9"/>
<rect x="550" y="1392" width="96" height="24" rx="6" fill="#f8fafc" stroke="#1f63e9"/>
<text x="598" y="1409" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#1f63e9">softmax</text>
<text x="550" y="1446" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">temporal_order</text>
<text x="550" y="1478" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">INPUT</text>
<text x="622" y="1478" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">concat[x_t, x_t+1, diff], 25,134d</text>
<text x="550" y="1503" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">HEAD</text>
<text x="622" y="1503" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -&gt; binary linear softmax, CE + L2</text>
<text x="550" y="1528" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">OUTPUT</text>
<text x="622" y="1528" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">correct vs reversed adjacent windows</text>
<text x="550" y="1553" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">METRIC</text>
<text x="622" y="1553" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">F1 0.5487</text>
<rect x="1000" y="1374" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="1000" y="1374" width="8" height="248" rx="4" fill="#1f63e9"/>
<rect x="1020" y="1392" width="96" height="24" rx="6" fill="#f8fafc" stroke="#1f63e9"/>
<text x="1068" y="1409" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#1f63e9">softmax</text>
<text x="1020" y="1446" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">misalignment_detection</text>
<text x="1020" y="1478" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">INPUT</text>
<text x="1092" y="1478" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">concat[motion_t, visual_t/visual_t+8],</text>
<text x="1092" y="1495" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">7,343d</text>
<text x="1020" y="1520" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">HEAD</text>
<text x="1092" y="1520" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -&gt; binary linear softmax, CE + L2</text>
<text x="1020" y="1545" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">OUTPUT</text>
<text x="1092" y="1545" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">aligned vs shifted by 8 windows</text>
<text x="1020" y="1570" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">METRIC</text>
<text x="1092" y="1570" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">F1 0.4866</text>
<rect x="60" y="1688" width="1380" height="72" rx="8" fill="#f8fafc" stroke="#dce2ec"/>
<text x="84" y="1718" font-family="Arial, sans-serif" font-size="15" fill="#273143">Interpretation: this suite tests whether each input/output contract is wired correctly before scaling to many episodes.</text>
<text x="84" y="1742" font-family="Arial, sans-serif" font-size="15" fill="#273143">Research-grade claims need held-out episode splits and stronger sequence/vision-language/robot-policy models.</text>
</svg>