<svg xmlns="http://www.w3.org/2000/svg" width="1500" height="1840" viewBox="0 0 1500 1840">
<defs><marker id="arrow2" viewBox="0 0 10 10" refX="8" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse"><path d="M 0 0 L 10 5 L 0 10 z" fill="#cbd5e1"/></marker></defs>
<rect width="100%" height="100%" fill="#ffffff"/>
<text x="60" y="56" font-family="Arial, sans-serif" font-size="34" font-weight="700" fill="#10141f">Minimal Architectures for the 12 Ropedia Episode Tasks</text>
<text x="60" y="88" font-family="Arial, sans-serif" font-size="16" fill="#5b6475">Generated from scripts/episode_task_suite.py semantics and committed summary metrics. These are minimal baselines, not deep foundation models.</text>
<line x1="382" y1="177" x2="396" y2="177" stroke="#cbd5e1" stroke-width="3" marker-end="url(#arrow2)"/>
<line x1="732" y1="177" x2="746" y2="177" stroke="#cbd5e1" stroke-width="3" marker-end="url(#arrow2)"/>
<line x1="1092" y1="177" x2="1106" y2="177" stroke="#cbd5e1" stroke-width="3" marker-end="url(#arrow2)"/>
<rect x="60" y="122" width="310" height="110" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="60" y="122" width="8" height="110" rx="4" fill="#1f63e9"/>
<text x="84" y="153" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#10141f">Shared episode windows</text>
<text x="84" y="180" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">5,821 frames -> 1,161 windows</text>
<text x="84" y="198" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">20-frame window, 5-frame stride</text>
<text x="84" y="216" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">chronological 70/30 split</text>
<rect x="410" y="122" width="310" height="110" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="410" y="122" width="8" height="110" rx="4" fill="#008b9a"/>
<text x="434" y="153" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#10141f">Feature vector</text>
<text x="434" y="180" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X_all = 8,378 dimensions</text>
<text x="434" y="198" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">17 named modality blocks</text>
<text x="434" y="216" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">mean/std fit on train only</text>
<rect x="760" y="122" width="320" height="110" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="760" y="122" width="8" height="110" rx="4" fill="#0a7f55"/>
<text x="784" y="153" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#10141f">Reusable heads</text>
<text x="784" y="180" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">linear softmax classifier</text>
<text x="784" y="198" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">dual ridge regression/projection</text>
<text x="784" y="216" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">multi-label logistic + cosine rank</text>
<rect x="1120" y="122" width="320" height="110" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="1120" y="122" width="8" height="110" rx="4" fill="#b65b04"/>
<text x="1144" y="153" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#10141f">Artifacts</text>
<text x="1144" y="180" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">metrics.json, predictions.csv/npz</text>
<text x="1144" y="198" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">model.npz with scaler and weights</text>
<text x="1144" y="216" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">summary_report.json: source of numbers</text>
<rect x="60" y="270" width="660" height="100" rx="8" fill="#f8fafc" stroke="#dce2ec"/>
<text x="78" y="303" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#1f63e9">Softmax classifier</text>
<text x="78" y="330" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">logits = z(X)W + b; class-weighted cross-entropy (CE) + L2 penalty</text>
<rect x="780" y="270" width="660" height="100" rx="8" fill="#f8fafc" stroke="#dce2ec"/>
<text x="798" y="303" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#0a7f55">Ridge regression/projection</text>
<text x="798" y="330" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">closed-form dual ridge on z(X), z(Y); used for forecast and reconstruction</text>
<rect x="60" y="394" width="660" height="100" rx="8" fill="#f8fafc" stroke="#dce2ec"/>
<text x="78" y="427" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#008b9a">Ridge + cosine ranking</text>
<text x="78" y="454" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">project one modality into another feature space, then rank candidates by</text>
<text x="78" y="472" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">cosine</text>
<rect x="780" y="394" width="660" height="100" rx="8" fill="#f8fafc" stroke="#dce2ec"/>
<text x="798" y="427" font-family="Arial, sans-serif" font-size="18" font-weight="700" fill="#b65b04">Multi-label logistic</text>
<text x="798" y="454" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">sigmoid heads for object vocabulary; threshold 0.5 with top-1 fallback</text>
<rect x="60" y="540" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="60" y="540" width="8" height="248" rx="4" fill="#1f63e9"/>
<rect x="80" y="558" width="96" height="24" rx="6" fill="#f8fafc" stroke="#1f63e9"/>
<text x="128" y="575" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#1f63e9">softmax</text>
<text x="80" y="612" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">timeline_action</text>
<text x="80" y="644" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">INPUT</text>
<text x="152" y="644" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X_all window, 8,378d</text>
<text x="80" y="669" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">HEAD</text>
<text x="152" y="669" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -> linear softmax, class-weighted</text>
<text x="152" y="686" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">CE + L2</text>
<text x="80" y="711" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">OUTPUT</text>
<text x="152" y="711" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">current action class, 18 classes</text>
<text x="80" y="736" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">METRIC</text>
<text x="152" y="736" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">macro-F1 0.0500</text>
<rect x="530" y="540" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="530" y="540" width="8" height="248" rx="4" fill="#1f63e9"/>
<rect x="550" y="558" width="96" height="24" rx="6" fill="#f8fafc" stroke="#1f63e9"/>
<text x="598" y="575" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#1f63e9">softmax</text>
<text x="550" y="612" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">timeline_subtask</text>
<text x="550" y="644" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">INPUT</text>
<text x="622" y="644" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X_all window, 8,378d</text>
<text x="550" y="669" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">HEAD</text>
<text x="622" y="669" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -> linear softmax, class-weighted</text>
<text x="622" y="686" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">CE + L2</text>
<text x="550" y="711" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">OUTPUT</text>
<text x="622" y="711" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">current subtask class, 14 classes</text>
<text x="550" y="736" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">METRIC</text>
<text x="622" y="736" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">macro-F1 0.0495</text>
<rect x="1000" y="540" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="1000" y="540" width="8" height="248" rx="4" fill="#1f63e9"/>
<rect x="1020" y="558" width="96" height="24" rx="6" fill="#f8fafc" stroke="#1f63e9"/>
<text x="1068" y="575" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#1f63e9">softmax</text>
<text x="1020" y="612" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">transition_detection</text>
<text x="1020" y="644" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">INPUT</text>
<text x="1092" y="644" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X_all window, 8,378d</text>
<text x="1020" y="669" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">HEAD</text>
<text x="1092" y="669" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -> linear softmax, class-weighted</text>
<text x="1092" y="686" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">CE + L2</text>
<text x="1020" y="711" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">OUTPUT</text>
<text x="1092" y="711" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">steady vs transition near action boundary</text>
<text x="1020" y="736" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">METRIC</text>
<text x="1092" y="736" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">macro-F1 0.6552; boundary-F1 0.2143</text>
<rect x="60" y="818" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="60" y="818" width="8" height="248" rx="4" fill="#1f63e9"/>
<rect x="80" y="836" width="96" height="24" rx="6" fill="#f8fafc" stroke="#1f63e9"/>
<text x="128" y="853" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#1f63e9">softmax</text>
<text x="80" y="890" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">next_action</text>
<text x="80" y="922" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">INPUT</text>
<text x="152" y="922" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X_all at time t, 8,378d</text>
<text x="80" y="947" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">HEAD</text>
<text x="152" y="947" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -> linear softmax, class-weighted</text>
<text x="152" y="964" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">CE + L2</text>
<text x="80" y="989" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">OUTPUT</text>
<text x="152" y="989" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">action at t+20 frames</text>
<text x="80" y="1014" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">METRIC</text>
<text x="152" y="1014" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">macro-F1 0.0593</text>
<rect x="530" y="818" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="530" y="818" width="8" height="248" rx="4" fill="#0a7f55"/>
<rect x="550" y="836" width="96" height="24" rx="6" fill="#f8fafc" stroke="#0a7f55"/>
<text x="598" y="853" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#0a7f55">ridge</text>
<text x="550" y="890" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">hand_trajectory_forecast</text>
<text x="550" y="922" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">INPUT</text>
<text x="622" y="922" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X_all at time t, 8,378d</text>
<text x="550" y="947" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">HEAD</text>
<text x="622" y="947" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score X/Y -> dual ridge regression,</text>
<text x="622" y="964" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">L2=10</text>
<text x="550" y="989" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">OUTPUT</text>
<text x="622" y="989" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">future hand joints, 1,260d</text>
<text x="550" y="1014" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">METRIC</text>
<text x="622" y="1014" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">MPJPE 0.8223</text>
<rect x="1000" y="818" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="1000" y="818" width="8" height="248" rx="4" fill="#1f63e9"/>
<rect x="1020" y="836" width="96" height="24" rx="6" fill="#f8fafc" stroke="#1f63e9"/>
<text x="1068" y="853" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#1f63e9">softmax</text>
<text x="1020" y="890" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">contact_prediction</text>
<text x="1020" y="922" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">INPUT</text>
<text x="1092" y="922" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X without contact/text leakage, 7,335d</text>
<text x="1020" y="947" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">HEAD</text>
<text x="1092" y="947" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -> linear softmax on observed</text>
<text x="1092" y="964" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">labels</text>
<text x="1020" y="989" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">OUTPUT</text>
<text x="1092" y="989" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">any body contact in window; degenerate</text>
<text x="1092" y="1006" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">one-class sample</text>
<text x="1020" y="1031" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">METRIC</text>
<text x="1092" y="1031" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">macro-F1 1.0000</text>
<rect x="60" y="1096" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="60" y="1096" width="8" height="248" rx="4" fill="#b65b04"/>
<rect x="80" y="1114" width="96" height="24" rx="6" fill="#f8fafc" stroke="#b65b04"/>
<text x="128" y="1131" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#b65b04">multilabel</text>
<text x="80" y="1168" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">object_relevance</text>
<text x="80" y="1200" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#b65b04">INPUT</text>
<text x="152" y="1200" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">X without caption text, 7,482d</text>
<text x="80" y="1225" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#b65b04">HEAD</text>
<text x="152" y="1225" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -> sigmoid multi-label logistic,</text>
<text x="152" y="1242" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">weighted</text>
<text x="80" y="1267" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#b65b04">OUTPUT</text>
<text x="152" y="1267" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">multi-hot object set, 34 objects</text>
<text x="80" y="1292" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#b65b04">METRIC</text>
<text x="152" y="1292" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">micro-F1 0.1839</text>
<rect x="530" y="1096" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="530" y="1096" width="8" height="248" rx="4" fill="#008b9a"/>
<rect x="550" y="1114" width="96" height="24" rx="6" fill="#f8fafc" stroke="#008b9a"/>
<text x="598" y="1131" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#008b9a">ridge+rank</text>
<text x="550" y="1168" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">caption_grounding</text>
<text x="550" y="1200" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">INPUT</text>
<text x="622" y="1200" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">sensor 7,482d -> text space 896d</text>
<text x="550" y="1225" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">HEAD</text>
<text x="622" y="1225" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">ridge projection, then cosine ranking</text>
<text x="550" y="1250" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">OUTPUT</text>
<text x="622" y="1250" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">text query retrieves matching time window</text>
<text x="550" y="1275" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">METRIC</text>
<text x="622" y="1275" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">MRR 0.0172</text>
<rect x="1000" y="1096" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="1000" y="1096" width="8" height="248" rx="4" fill="#008b9a"/>
<rect x="1020" y="1114" width="96" height="24" rx="6" fill="#f8fafc" stroke="#008b9a"/>
<text x="1068" y="1131" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#008b9a">ridge+rank</text>
<text x="1020" y="1168" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">cross_modal_retrieval</text>
<text x="1020" y="1200" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">INPUT</text>
<text x="1092" y="1200" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">motion/IMU/camera 2,247d -> visual 5,096d</text>
<text x="1020" y="1225" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">HEAD</text>
<text x="1092" y="1225" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">ridge projection, then cosine ranking</text>
<text x="1020" y="1250" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">OUTPUT</text>
<text x="1092" y="1250" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">retrieve matching depth/video window</text>
<text x="1020" y="1275" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#008b9a">METRIC</text>
<text x="1092" y="1275" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">top-5 0.3764</text>
<rect x="60" y="1374" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="60" y="1374" width="8" height="248" rx="4" fill="#0a7f55"/>
<rect x="80" y="1392" width="96" height="24" rx="6" fill="#f8fafc" stroke="#0a7f55"/>
<text x="128" y="1409" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#0a7f55">ridge</text>
<text x="80" y="1446" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">modality_reconstruction</text>
<text x="80" y="1478" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">INPUT</text>
<text x="152" y="1478" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">motion/IMU/camera 2,247d</text>
<text x="80" y="1503" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">HEAD</text>
<text x="152" y="1503" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score X/Y -> dual ridge regression,</text>
<text x="152" y="1520" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">L2=10</text>
<text x="80" y="1545" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">OUTPUT</text>
<text x="152" y="1545" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">depth/video feature vector, 5,096d</text>
<text x="80" y="1570" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#0a7f55">METRIC</text>
<text x="152" y="1570" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">R² -0.0160</text>
<rect x="530" y="1374" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="530" y="1374" width="8" height="248" rx="4" fill="#1f63e9"/>
<rect x="550" y="1392" width="96" height="24" rx="6" fill="#f8fafc" stroke="#1f63e9"/>
<text x="598" y="1409" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#1f63e9">softmax</text>
<text x="550" y="1446" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">temporal_order</text>
<text x="550" y="1478" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">INPUT</text>
<text x="622" y="1478" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">concat[x_t, x_t+1, diff], 25,134d</text>
<text x="550" y="1503" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">HEAD</text>
<text x="622" y="1503" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -> binary linear softmax, CE + L2</text>
<text x="550" y="1528" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">OUTPUT</text>
<text x="622" y="1528" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">correct vs reversed adjacent windows</text>
<text x="550" y="1553" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">METRIC</text>
<text x="622" y="1553" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">F1 0.5487</text>
<rect x="1000" y="1374" width="440" height="248" rx="8" fill="#ffffff" stroke="#dce2ec" stroke-width="2"/>
<rect x="1000" y="1374" width="8" height="248" rx="4" fill="#1f63e9"/>
<rect x="1020" y="1392" width="96" height="24" rx="6" fill="#f8fafc" stroke="#1f63e9"/>
<text x="1068" y="1409" text-anchor="middle" font-family="Arial, sans-serif" font-size="11" font-weight="700" fill="#1f63e9">softmax</text>
<text x="1020" y="1446" font-family="Arial, sans-serif" font-size="20" font-weight="700" fill="#10141f">misalignment_detection</text>
<text x="1020" y="1478" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">INPUT</text>
<text x="1092" y="1478" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">concat[motion_t, visual_t/visual_t+8],</text>
<text x="1092" y="1495" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">7,343d</text>
<text x="1020" y="1520" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">HEAD</text>
<text x="1092" y="1520" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">z-score -> binary linear softmax, CE + L2</text>
<text x="1020" y="1545" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">OUTPUT</text>
<text x="1092" y="1545" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">aligned vs shifted by 8 windows</text>
<text x="1020" y="1570" font-family="Arial, sans-serif" font-size="12" font-weight="700" fill="#1f63e9">METRIC</text>
<text x="1092" y="1570" font-family="Arial, sans-serif" font-size="13" font-weight="500" fill="#394255">F1 0.4866</text>
<rect x="60" y="1688" width="1380" height="72" rx="8" fill="#f8fafc" stroke="#dce2ec"/>
<text x="84" y="1718" font-family="Arial, sans-serif" font-size="15" fill="#273143">Interpretation: this suite tests whether each input/output contract is wired correctly before scaling to many episodes.</text>
<text x="84" y="1742" font-family="Arial, sans-serif" font-size="15" fill="#273143">Research-grade claims need held-out episode splits and stronger sequence/vision-language/robot-policy models.</text>
</svg>