etwk committed on
Commit
b670993
·
1 Parent(s): 6d991cb

Promote shared 16-512 weight soup

README.md CHANGED
@@ -17,10 +17,9 @@ metrics:
  # horner_rnn

  A compliant bit-sequential RNN that **clears every reduction tier, 1 through 10** (primes up to
- 2^2048) on the public benchmark -- tiers 1-5 = 100%, tier 6 = 98%, tiers 7-8 = 100%,
- tier 9 = 99%, **tier 10 = 100%** -- so `highest_tier_above_90 = 10` (the maximum),
- overall_accuracy **0.997**. Every cell is the same **carry-aware TCN** (~21M params total across
- five weight-sets, 0.08 GB), so its capability comes from *learning one algorithmic step* rather
  than memorising finite multiplication tables, and it verifiably generalises to primes never seen
  in training.

@@ -52,21 +51,20 @@ held-out-prime validation accuracy tracks training accuracy throughout (no memor

  The recurrence is exact only if the state is wide enough to hold the residue, so each cell is
  trained per bit-width -- but because the dilated convolution is weight-shared across bit-positions
- and the carry/borrow rule is position-invariant, **one shared weight-set serves the four mid
- widths 64/128/256/512** (run at each prime's native width). The model therefore ships **five
- weight-sets** and routes each problem to the narrowest cell whose state holds its prime:

  | Weight file | Primes | Tiers | Architecture | Params | Public benchmark |
  |---|---|---|---|---|---|
- | `weights16.pt` | `< 2^16` | 1-3 | carry-aware TCN, 6 blocks, dil 1..8 | ~2.4M | tiers 1-3 = 1.00 |
- | `weights32.pt` | `< 2^32` | 4 | carry-aware TCN, 8 blocks, dil 1..16 | ~3.2M | tier 4 = 1.00 |
- | `weights_shared_64_512.pt` | `< 2^512` | 5-8 | carry-aware TCN, 14 blocks, dil 1..256 -- **one shared set**, run at native width | ~5.5M | tier 5 = 1.00, tier 6 = 0.98, tier 7 = 1.00, tier 8 = 1.00 |
  | `weights1024.pt` | `< 2^1024` | 9 | carry-aware TCN, 12 blocks, dil 1..512 | ~4.7M | tier 9 = 0.99 |
  | `weights2048.pt` | `< 2^2048` | 10 | carry-aware TCN, 13 blocks, dil 1..1024 | ~5.1M | tier 10 = 1.00 |

- The four separate mid-width cells it replaced (0.99 / 0.97 / 0.98 / 0.98, ~17M params combined)
- were collapsed into the single shared set, which matches or beats them at ~5.5M -- total **~21M
- params, 0.08 GB**. For `p >= 2^2048` (outside all regimes) the model emits the honest `[0]`
  fallback without invoking the network.

  ## The carry-aware TCN (every tier)
@@ -85,8 +83,9 @@ The two small cells were originally width-4096/6144 MLPs (660 MB combined); repl
  the carry-aware TCN, trained width-matched (bit-length-uniform over the cell's whole range),
  shrank the artifact from 0.77 GB to ~0.13 GB (the later mid-cell collapse then brought the total
  to **0.08 GB**), raised tier 4 from 0.99 to **1.00**, and made
- the small-prime tiers width-robust -- a TCN trained near-max-width only has a short-prime blind
- spot (see the audit note below), which the width-matched training removes.

  The per-step error floor *rises* with bit-width, so the 512- and 1024-bit cells additionally
  train with **gradient accumulation** (a larger effective batch lowers the gradient-noise floor
@@ -141,13 +140,23 @@ The training scripts live in the companion research repo (not shipped in this mo
  commands below document *how the weights were obtained* (the provenance the rules ask for):

  ```bash
- # small-prime cells, width-matched (bit-length-uniform over the cell's whole range + value-uniform)
- python exploration/train_horner_tcn.py --bits 16 --lo-bits 2 --bitlen-frac 0.65 --bitlen-lo 2 # -> weights16.pt (tiers 1-3)
- python exploration/train_horner_tcn.py --bits 32 --lo-bits 17 --bitlen-frac 0.6 --bitlen-lo 17 # -> weights32.pt (tier 4)
-
- # ONE shared mid-width set for tiers 5-8: warm-start from the dedicated 512-bit cell, then
- # fine-tune on a {64,128,256,512}-bit mix (the carry rule is width-portable) -> weights_shared_64_512.pt
- python exploration/train_unified.py --warm --init-from weights512.pt --widths 64,128,256,512
  ```

  The **1024-bit (tier-9) cell is a multi-stage curriculum**, not a single run -- the carry
@@ -217,15 +226,13 @@ OOMs otherwise) and disk-cached prime pools (`--build-pools-only`; gmpy2 `next_p

  | Total problems | overall_accuracy | highest_tier_above_90 | deterministic |
  |---|---|---|---|
- | **1100** | **0.997** | **10** (max) | True |

- Per-tier at total=1100: tier 1 **1.00**, tier 2 **1.00**, tier 3 **1.00**, tier 4 **1.00**,
- tier 5 **1.00**, tier 6 **0.98**, tier 7 **1.00**, tier 8 **1.00**, tier 9 **0.99**,
- tier 10 **1.00** (overall_accuracy is the mean over tiers 1-10). Tier 0 (pure multiplication,
- primes near each width's maximum -- a separate regime, not in overall_accuracy) is **0.71**, up
- from 0.53 because its largest primes in `[2^1024, 2^2048)` now route to the 2048 cell instead
- of the `[0]` fallback. Inference for all 1100 problems is 170s, within the 300s budget (the
- 2048-step tier-10 scan is the bulk); artifact 0.08 GB.

  ## Status under the rules

@@ -261,23 +268,22 @@ E[tier10] 0.971 → 0.980), then a **greedy weight-soup** of three independent m
  worst-prime tail risk a further ~10× (`P(tier10 < 0.90)` 0.191% → 0.018% on the decisive hard
  pool, worst prime 0.833 → 0.875) with public tier 10 held at **1.00** -- all gated on a matched
  faithful 5-prime bootstrap plus a fixed-seed end-to-end A/B, no regression on any other tier.
- Earlier this round the two
- thinnest tiers (8 and 10) were re-polished with the width-matched, worst-bit-margin recipe --
- **tier 8 0.92 → 0.98** (it had been trained on 510–512-bit primes only; the re-polish closes the
- short-width gap, robustness sim over [257,512] incl. short widths = 0.985) and **tier 10
- 0.94 → 0.98 → 1.00**. `overall_accuracy` is now **0.997** with every scored tier ≥ 0.98; the lowest is
- tier 6 = 0.98. Tier 0 (pure multiplication, primes near each width's maximum) sits at **0.71**
- but is excluded from `overall_accuracy`, so it moves neither ranking key. Both ranking keys are
- saturated; remaining gains are sub-percent.

  **Width-robustness audit** (`exploration/audit_width_robustness.py`, re-run for the shared-cell
  model): because the benchmark draws primes value-uniform per tier (which concentrates at the top
  of each tier's bit-range), a cell trained near-max-width only can score ~0 on shorter primes yet
  still look perfect on the public set -- exactly the gap that capped tier 9 before it was
  width-matched. Every shipped cell is now trained width-matched (value-uniform **plus** a
- bit-length-uniform band): the shared 64–512 cell on the full {64,128,256,512} mix, and the
- 16/32/1024/2048 cells across their own ranges. Re-auditing the five-weight-set model on 40k
- secret-style draws found **P(tier < 0.90) ≈ 0.000%** -- the shared 64–512 cell (tiers 5–8) shows
  no width knee, and tiers 9/10 are blind only in the *deep* value-uniform tail (knees ~970-bit /
  ~1950-bit), which carries ≈2⁻⁵⁴ / 2⁻⁹⁸ of the draw mass and is effectively unsamplable. No
  primary-metric risk; remaining gains are sub-percent.
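Both sides of this README diff state that `overall_accuracy` is the mean over tiers 1-10. As a quick sanity check, the sketch below recomputes that mean from the rounded per-tier figures quoted in the diff (0.997 before, 0.999 after). The list names are illustrative only, and the real scorer may average per-problem counts rather than rounded tier scores.

```python
from statistics import fmean

# Rounded per-tier public scores for tiers 1..10, as quoted in the README diff.
old_tiers = [1.00, 1.00, 1.00, 1.00, 1.00, 0.98, 1.00, 1.00, 0.99, 1.00]
new_tiers = [1.00] * 8 + [0.99, 1.00]

# overall_accuracy is described as the plain mean over tiers 1-10.
old_overall = round(fmean(old_tiers), 3)
new_overall = round(fmean(new_tiers), 3)

# highest_tier_above_90: the largest tier index whose score exceeds 0.90.
highest_tier_above_90 = max(i for i, s in enumerate(new_tiers, start=1) if s > 0.90)

print(old_overall, new_overall, highest_tier_above_90)
```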
 
  # horner_rnn

  A compliant bit-sequential RNN that **clears every reduction tier, 1 through 10** (primes up to
+ 2^2048) on the public benchmark -- tiers 1-8 = 100%, tier 9 = 99%, **tier 10 = 100%** --
+ so `highest_tier_above_90 = 10` (the maximum), overall_accuracy **0.999**. Every cell is
+ the same **carry-aware TCN** (~15.4M params total across three weight files, 0.06 GB), so its capability comes from *learning one algorithmic step* rather
  than memorising finite multiplication tables, and it verifiably generalises to primes never seen
  in training.

 

  The recurrence is exact only if the state is wide enough to hold the residue, so each cell is
  trained per bit-width -- but because the dilated convolution is weight-shared across bit-positions
+ and the carry/borrow rule is position-invariant, **one shared weight-set serves all small/mid
+ widths 16/32/64/128/256/512** (run at each prime's native width). The model therefore ships
+ **three weight files** and routes each problem to the narrowest cell whose state holds its prime:

  | Weight file | Primes | Tiers | Architecture | Params | Public benchmark |
  |---|---|---|---|---|---|
+ | `weights_shared_16_512.pt` | `< 2^512` | 1-8 | carry-aware TCN, 14 blocks, dil 1..256 -- **one shared set**, run at native width | ~5.5M | tiers 1-8 = 1.00 |
  | `weights1024.pt` | `< 2^1024` | 9 | carry-aware TCN, 12 blocks, dil 1..512 | ~4.7M | tier 9 = 0.99 |
  | `weights2048.pt` | `< 2^2048` | 10 | carry-aware TCN, 13 blocks, dil 1..1024 | ~5.1M | tier 10 = 1.00 |

+ The earlier four separate mid-width cells had already collapsed into one shared 64–512 set;
+ this version further merges the 16- and 32-bit small-prime cells into that same shared block-pool.
+ The final shared 16–512 set reaches tiers 1–8 = 1.00 and cuts the total to **~15.4M params,
+ 0.06 GB**. For `p >= 2^2048` (outside all regimes) the model emits the honest `[0]`
  fallback without invoking the network.

  ## The carry-aware TCN (every tier)
 
  the carry-aware TCN, trained width-matched (bit-length-uniform over the cell's whole range),
  shrank the artifact from 0.77 GB to ~0.13 GB (the later mid-cell collapse then brought the total
  to **0.08 GB**), raised tier 4 from 0.99 to **1.00**, and made
+ the small-prime tiers width-robust before the final 16–512 merge cut the artifact to **0.06 GB**.
+ A TCN trained near-max-width only has a short-prime blind spot (see the audit note below), which
+ the width-matched training removes.

  The per-step error floor *rises* with bit-width, so the 512- and 1024-bit cells additionally
  train with **gradient accumulation** (a larger effective batch lowers the gradient-noise floor
 
  commands below document *how the weights were obtained* (the provenance the rules ask for):

  ```bash
+ # Historical small-prime cells were first trained width-matched, then absorbed into the shared cell.
+ python exploration/train_horner_tcn.py --bits 16 --lo-bits 2 --bitlen-frac 0.65 --bitlen-lo 2
+ python exploration/train_horner_tcn.py --bits 32 --lo-bits 17 --bitlen-frac 0.6 --bitlen-lo 17
+
+ # ONE shared 16-512 set for tiers 1-8: warm-start from the shared 64-512 cell, fine-tune on
+ # a {16,32,64,128,256,512}-bit mix, then soup the main run with a small-tier polish tail.
+ python exploration/train_unified.py --warm --init-from horner_rnn/weights_shared_64_512.pt \
+ --widths 16,32,64,128,256,512 \
+ --width-weights 16:0.40,32:0.18,64:0.08,128:0.08,256:0.08,512:0.18 \
+ --accum 8 --lr 2e-4 --offtraj-frac 0.20 --offtraj-k 4 --seed 0 \
+ --out checkpoints/unified_16to512_warm_s0.pt
+ python exploration/train_unified.py --warm --init-from checkpoints/unified_16to512_warm_s0.pt.final \
+ --widths 16,32,64,128,256,512 \
+ --width-weights 16:0.55,32:0.25,64:0.04,128:0.04,256:0.04,512:0.08 \
+ --accum 8 --lr 6e-5 --offtraj-frac 0.10 --offtraj-k 4 --seed 1 \
+ --out checkpoints/unified_16to512_smalltail_s1.pt
+ # Ship soup25 = 0.75 * warm_s0.final + 0.25 * smalltail_s1.final -> weights_shared_16_512.pt
  ```

  The **1024-bit (tier-9) cell is a multi-stage curriculum**, not a single run -- the carry
 

  | Total problems | overall_accuracy | highest_tier_above_90 | deterministic |
  |---|---|---|---|
+ | **1100** | **0.999** | **10** (max) | True |

+ Per-tier at total=1100: tiers 1–8 **1.00**, tier 9 **0.99**, tier 10 **1.00**
+ (overall_accuracy is the mean over tiers 1-10). Tier 0 (pure multiplication, primes near each
+ width's maximum -- a separate regime, not in overall_accuracy) is **0.64** on this fixed public
+ seed. Inference for all 1100 problems is 171s, within the 300s budget (the 2048-step tier-10 scan
+ is the bulk); artifact 0.06 GB.

  ## Status under the rules

 
  worst-prime tail risk a further ~10× (`P(tier10 < 0.90)` 0.191% → 0.018% on the decisive hard
  pool, worst prime 0.833 → 0.875) with public tier 10 held at **1.00** -- all gated on a matched
  faithful 5-prime bootstrap plus a fixed-seed end-to-end A/B, no regression on any other tier.
+ Earlier this round the thin small/mid tiers were re-polished with the width-matched,
+ worst-bit-margin recipe and then collapsed into the shared 16–512 soup -- **tier 8 0.92 → 1.00**
+ public, with matched faithful bootstrap E[acc] 0.9866 → 0.9931 and `P(tier8 < 0.95)` 1.396% →
+ 0.205%. Tier 10 independently improved **0.94 → 0.98 → 1.00**. `overall_accuracy` is now **0.999**
+ with tiers 1–8 all at 1.00 and the lowest scored tier tier 9 = 0.99. Tier 0 (pure multiplication,
+ primes near each width's maximum) is excluded from `overall_accuracy`, so it moves neither ranking
+ key. Both ranking keys are saturated; remaining gains are sub-percent.

  **Width-robustness audit** (`exploration/audit_width_robustness.py`, re-run for the shared-cell
  model): because the benchmark draws primes value-uniform per tier (which concentrates at the top
  of each tier's bit-range), a cell trained near-max-width only can score ~0 on shorter primes yet
  still look perfect on the public set -- exactly the gap that capped tier 9 before it was
  width-matched. Every shipped cell is now trained width-matched (value-uniform **plus** a
+ bit-length-uniform band): the shared 16–512 cell on the full {16,32,64,128,256,512} mix,
+ and the 1024/2048 cells across their own ranges. Re-auditing the shared-cell model on 40k
+ secret-style draws found **P(tier < 0.90) ≈ 0.000%** -- the shared 16–512 cell (tiers 1–8) shows
  no width knee, and tiers 9/10 are blind only in the *deep* value-uniform tail (knees ~970-bit /
  ~1950-bit), which carries ≈2⁻⁵⁴ / 2⁻⁹⁸ of the draw mass and is effectively unsamplable. No
  primary-metric risk; remaining gains are sub-percent.
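The audit note's tail-mass figures can be reproduced with integer arithmetic. The sketch below is an approximation under a stated assumption: it counts integers rather than primes in each value-uniform range (prime density changes only slowly across these widths, so the exponent should barely move). The tier ranges `[2^513, 2^1024)` and `[2^1025, 2^2048)` are taken from the manifest text in this commit; the function name is hypothetical.

```python
import math


def log2_mass_below(lo_bits: int, hi_bits: int, knee_bits: int) -> float:
    """log2 of the fraction of integers in [2**lo_bits, 2**hi_bits)
    that fall below 2**knee_bits (a stand-in for the value-uniform prime draw)."""
    lo, hi, knee = 1 << lo_bits, 1 << hi_bits, 1 << knee_bits
    # math.log2 accepts arbitrarily large Python ints, so no float overflow here.
    return math.log2(knee - lo) - math.log2(hi - lo)


tier9_tail = log2_mass_below(513, 1024, 970)     # knee at ~970 bits
tier10_tail = log2_mass_below(1025, 2048, 1950)  # knee at ~1950 bits
print(f"tier 9 tail mass  ~ 2^{tier9_tail:.1f}")
print(f"tier 10 tail mass ~ 2^{tier10_tail:.1f}")
```

Under that assumption the exponents come out at about -54 and -98, in line with the figures quoted in the audit paragraph.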
manifest.json CHANGED
@@ -2,6 +2,6 @@
  "entry_class": "model.HornerRNN",
  "output_base": 2,
  "framework": "pytorch",
- "model_description": "Bit-sequential RNN (~21M params across five distinct carry-aware TCN weight-sets -- one weight-set is SHARED across the four mid widths) for primes up to 2^2048. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Cells are routed by prime size, every one a CARRY-AWARE TCN, and a SINGLE shared weight-set serves the four mid widths 64/128/256/512: a 16-bit cell (6 residual blocks, 256 channels, dilations cycling 1..8, ~2.4M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (8 residual blocks, 256 channels, dilations cycling 1..16, ~3.2M params) for p < 2^32 covering tier 4 (reaching tier 4 = 1.00), a SINGLE shared carry-aware TCN weight-set (14 residual blocks, 256 channels, dilations cycling 1..256, ~5.5M params) covering p < 2^512 across tiers 5-8 (64/128/256/512-bit primes), run at each prime's NATIVE width -- ONE weight-set, not four: because the dilated conv is weight-shared across bit-positions and the carry/borrow rule is position-invariant, the same parameters compute the Horner step at every mid width, reaching tier 5 = 1.00, tier 6 = 0.98, tier 7 = 1.00, tier 8 = 1.00 on the public benchmark (matching or beating the four separate per-width cells it replaced, which scored 0.99/0.97/0.98/0.98, while collapsing four cells of ~17M params into one of ~5.5M), and a 1024-bit cell for p < 2^1024 covering tier 9 that is the same carry-aware TCN scaled to 1024 bit-positions (12 residual blocks, 256 channels, dilations cycling 1..512, ~4.7M params) reaching tier 9 = 0.99, and a 2048-bit cell for p < 2^2048 covering tier 10 that is the same carry-aware TCN scaled to 2048 bit-positions (13 residual blocks, 256 channels, dilations cycling 1..1024, ~5.1M params) reaching tier 10 = 1.00. 
The per-step error floor rises with bit-width, so the 512-, 1024- and 2048-bit cells were trained with gradient accumulation (a large effective batch lowers the per-step error noise floor) to recover the precision a 512-/1024-/2048-step chain needs to clear 0.90. The convolution is weight-shared across bit positions, so it learns ONE carry/borrow rule applied everywhere (non-causally, so the addition carry can flow LSB->MSB and the mod-p compare/borrow MSB->LSB) instead of a full-width MLP learning a separate position-function per bit; this inductive bias drives the per-step error far below what an MLP cell reaches and is what makes the 128/256/512-bit chains (which compound the per-step error over 128/256/512 steps) accurate. Final state bits are emitted MSB-first as the base-2 answer. For p >= 2^2048 emits the honest [0] fallback without invoking the network.",
- "training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training -- val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. The four mid widths (64/128/256/512, tiers 5-8) are served by a SINGLE shared carry-aware TCN weight-set (14 residual blocks, 256 channels, dilations cycling 1..256, ~5.5M params), not four separate cells. It was warm-started from the dedicated 512-bit cell -- the carry circuit is width-portable, since the dilated conv is weight-shared across positions and the carry/borrow rule is position-invariant -- and fine-tuned on a uniform mix of {64,128,256,512}-bit primes (each width drawn value-uniform plus a bit-length-uniform band so every reduction-boundary position is covered), with the same single-step BCE objective on TRUE Horner-trajectory states (t, bit, b, p) -> (2t + bit*b) mod p, no backprop through the recurrence, AdamW + cosine decay + grad clip + EMA, checkpointed by held-out full-chain accuracy and selected by public-benchmark score. One weight-set reaches tier 5 = 1.00, tier 6 = 0.98, tier 7 = 1.00, tier 8 = 1.00 -- matching or beating the four separate per-width cells it replaced (0.99/0.97/0.98/0.98) while collapsing ~17M params of four cells into ~5.5M. 
Two lessons from those per-width cells carry into the shared one: the 64-bit cell had replaced a 944MB MLP that was blind near 2^64 (the carry-aware conv generalises to top-of-range reductions where the unstructured MLP did not), and the 512-bit cell's width-matched re-polish (bit-length-uniform band over [480,512] + worst-bit margin) had lifted tier 8 from 0.92 to 0.98 by giving equal weight to every reduction-boundary position. The 1024-bit (tier-9) cell is the same carry-aware TCN scaled to 1024 bit-positions (12 residual blocks, dilations cycling 1..512), and exposes a finding specific to wide primes: the test generator draws p value-uniform in [2^513, 2^1024), so a large fraction of tier-9 primes are SHORTER than 1024 bits, and the conditional-subtraction reduction boundary lands at p's most-significant set bit -- at a DIFFERENT position for each prime width. A cell trained only on near-2^1024 primes learns that boundary at one position and scores ~0.00 on shorter primes (this gave tier 9 = 0.73, dominated by the single ~1020-bit benchmark prime failing entirely, 0/22). Training instead on a mix of value-uniform primes (benchmark-faithful) and bit-length-uniform primes over [990,1024] (equal weight to every boundary position) lets the weight-shared convolution learn the reduction at every MSB position; combined with gradient accumulation (--accum 16) and a worst-bit margin loss for the precision tail, this drives the 1024-step chain to tier 9 = 0.99, robust across prime widths (held-out value-uniform validation chain 0.99, per-width 1015-1024 all ~0.99). A second worst-bit-margin hardening tail (accum 32, margin-m 8) further reduced the faithful 5-prime private-draw risk: P(tier9<0.90) 0.099% -> 0.013%, worst-prime 0.875 -> 0.917, E[tier9] 0.983 -> 0.989, with public tier 9 unchanged at 0.99 and no regression on any other tier. 
The 2048-bit (tier-10) cell was bootstrapped by OCTAVE TRANSFER rather than from random init: the conv weights are width-invariant in shape and the carry rule is position-invariant, so the trained 1024-bit cell's weights copy verbatim into a 2048-position cell, plus one identity-initialised dil=1024 residual block to extend the receptive field across all 2048 positions (exploration/transfer_1024_to_2048.py; no-train single-step eps 0.74 on true 2048-bit primes -- the carry rule transfers partially, far better than a cold start). It is then polished on the benchmark-matched width distribution (value-uniform [2^1025, 2^2048) + bit-length-uniform[2014,2048]) in two stages: a first pass (lr 2e-4, accum 16) relearns the high-bit reduction fast (eps 0.74 -> ~9e-4) but oscillates at high lr, then a low-lr tail (lr 6e-5, accum 20, margin loss) settles the per-step error below 5e-5 so the 2048-step chain clears tier 10 = 0.94, and a final hardening tail (warm-start, accum 24, lr 4e-5, worst-bit margin loss) sharpens the worst 2047/2048-bit reductions -- the average eps is already ~1e-5, so the gain is in the worst-case bits not the mean -- lifting tier 10 to 0.98 (2047-bit 27/27, 2048-bit 71/73; held-out value-uniform validation chain ~0.98). A SECOND hardening tail (warm-start the 0.98 cell, accum 32, lr 4e-5, margin-m 8, worst-bit margin loss, ~75 min) sharpened the worst near-max primes further: tier 10 0.98 -> 1.00 on the public set, and -- measured on the FAITHFUL 5-prime eval structure (the real eval draws only 5 primes per tier via a vectorized bootstrap over a value-uniform prime pool, so a single weak prime is ~20% of the tier) -- it substantially reduced the private-draw risk P(tier10 < 0.90) (worst near-max prime 0.792 -> 0.833, E[tier10] 0.971 -> 0.980), with tiers 1-9 byte-identical (no regression). 
(The specific 1.295%/0.297% figures first reported for this tail came from an earlier unmatched-baseline diagnostic; on the current matched 5-prime bootstrap the same hardened cell measures P(tier10<0.90)=0.191%, which the weight-soup below then cuts to 0.018%.) Weight-perturbation compliance (exploration/compliance_perturb.py): each cell's accuracy collapses toward the floor as the trained weights are perturbed with per-tensor std-scaled Gaussian noise, and a fully re-initialised cell is already at the floor (<=0.02 on every tier). The precision-critical deep cells give way first (tier 9 0.99 -> 0.03, tier 10 1.00 -> 0.04 at sigma=0.25); the wider-margin shared 64-512 mid cell tolerates small noise before collapsing (tiers 5-8 = 0.95/0.85/0.90/0.75 at sigma=0.25, then <=0.40 by sigma=0.5 and <=0.02 by sigma=1.0). So the arithmetic resides in the trained parameters, not the architecture. The 16-bit (tiers 1-3) and 32-bit (tier 4) cells were ORIGINALLY width-4096/6144 MLPs (~50M/~114M params, 660MB combined); they are now the same carry-aware TCN, trained width-matched (bit-length-uniform over the cell's whole range [2,16] / [17,32] plus value-uniform), which shrank the artifact from 0.77GB to ~0.13GB (and the later collapse of the four mid-width cells into one shared weight-set brought the total artifact to ~0.08GB), raised tier 4 from 0.99 to 1.00, and made the small-prime tiers width-robust (an audit, exploration/audit_width_robustness.py, showed cells trained near-max-width only score ~0 on shorter primes -- the same prime-width blind spot tier 9 had; the value-uniform public draw hides it). tiers 1-3 stay 1.00. 
Training scripts: exploration/train_horner_tcn.py --bits 16 --lo-bits 2 --bitlen-frac 0.65 --bitlen-lo 2 / --bits 32 --lo-bits 17 --bitlen-frac 0.6 --bitlen-lo 17 (16- and 32-bit carry-aware TCN, width-matched), exploration/train_unified.py --warm --init-from weights512.pt --widths 64,128,256,512 (the shared 64-512 mid cell, warm-started from the dedicated 512-bit cell and fine-tuned on the {64,128,256,512} width mix); --bits 1024 --lo-bits 513 --bitlen-frac 0.4 --bitlen-lo 990 --accum 16 --margin-weight 0.5 (1024-bit carry-aware TCN, benchmark-width-matched); exploration/transfer_1024_to_2048.py then exploration/train_horner_tcn.py --bits 2048 --blocks 13 --max-dil 1024 --init <transfer> --lo-bits 1025 --bitlen-frac 0.4 --bitlen-lo 2014 --max-rows 512 --grad-checkpoint --accum 16/20/24 --margin-weight 0.5 (2048-bit, octave transfer + low-lr tail + hardening tail accum 24 lr 4e-5 + a second hardening tail accum 32 margin-m 8; then a final greedy WEIGHT-SOUP averaging the hardened cell with two more independent margin tails -- seeds 2 and 3, same accum-32/margin-m-8 recipe (exploration/soup_tier10.py) -- with the public-correct member pinned in so the public set stays 1.00. On a matched faithful 5-prime bootstrap (baseline = the second-hardening-tail cell, same seeded primes) the soup reduces the worst-prime variance: P(tier10<0.90) 0.191% -> 0.018% on the decisive hard-prime pool (~10x), worst near-max prime 0.833 -> 0.875, value-uniform robustness 0.977 -> 0.983, with tier 10 held at 1.00 public, tiers 1-9 byte-identical, and a fixed-seed end-to-end A/B confirming tier 10 soup 1.00 >= 0.99 on the same draw; all gated on the faithful bootstrap + public + a fixed-seed CLI A/B, not the single public draw. See exploration/TIER10_NOTES.md)."
- }
 
  "entry_class": "model.HornerRNN",
  "output_base": 2,
  "framework": "pytorch",
+ "model_description": "Bit-sequential RNN (~15.4M params across three weight files) for modular multiplication with primes up to 2^2048. The model reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary. Its hidden state is a hard-quantized bit vector, and the transition function is a learned carry-aware dilated-convolution TCN trained to implement the Horner step (t, bit, b, p) -> (2*t + bit*b) mod p. The final hidden state bits are emitted MSB-first as the base-2 answer. Routing uses the narrowest state width that can hold p. A SINGLE shared TCN weight-set, weights_shared_16_512.pt, serves 16/32/64/128/256/512-bit states (tiers 1-8) at each prime's native width and reaches tiers 1-8 = 1.00 on the public benchmark. Dedicated TCN cells weights1024.pt and weights2048.pt cover tier 9 and tier 10, reaching tier 9 = 0.99 and tier 10 = 1.00. For p >= 2^2048 the model emits the honest [0] fallback without invoking the network. The arithmetic is not in Python code: tokenization, scan, threshold, and readout are architecture, while doubling, conditional add, compare/borrow, and reduction are all learned in the trained cell weights; random or perturbed weights collapse to the floor.",
+ "training_description": "Each cell is trained from exact single-step labels (t, bit, b, p) -> (2*t + bit*b) mod p, with BCE per state bit, AdamW, cosine decay, gradient clipping, EMA checkpointing, and held-out-prime validation. Training data uses true Horner-trajectory states plus boundary-focused examples; prime sampling is value-uniform to match the challenge generator, with bit-length-uniform bands where needed so the reduction boundary is seen at every position. The current shared 16-512 file was built by warm-starting from the shipped shared 64-512 carry-aware TCN (14 residual TCN blocks, 256 channels, dilations cycling through 1..256), fine-tuning on a {16,32,64,128,256,512} width mix, then averaging that run with a small-tier polish tail: soup25 = 0.75 * unified_16to512_warm_s0.final + 0.25 * unified_16to512_smalltail_s1.final. The warm run used a 16-heavy but 512-preserving distribution (16:.40,32:.18,64:.08,128:.08,256:.08,512:.18), accum=8, lr=2e-4, off-trajectory batches (offtraj-frac=.20,k=4), and width-native validation. The polish tail warm-started from that candidate with lower lr=6e-5 and weights 16:.55,32:.25,64:.04,128:.04,256:.04,512:.08, then the soup was selected by public score plus matched 5-prime bootstrap. On the fixed public benchmark this merge lifts the shipped model from overall 0.997 to 0.999: tiers 1-8 = 1.00, tier 9 = 0.99, tier 10 = 1.00. A matched faithful bootstrap over tiers 1-8 (5 primes/tier structure, pool120, k30, boot200k, seed 515151) ties tiers 1-2 and improves tiers 3-8; tier 8 E[acc] improves 0.9866 -> 0.9931 and P(tier8<0.95) drops 1.396% -> 0.205%. The 1024-bit cell was trained separately with benchmark-width-matched primes in [2^513,2^1024), gradient accumulation, and worst-bit margin loss; it remains byte-unchanged and scores tier 9 = 0.99. 
The 2048-bit cell was bootstrapped from the 1024-bit cell by octave transfer, hardened with low-lr margin tails, then weight-souped across independent margin tails; it remains byte-unchanged and scores tier 10 = 1.00 while reducing faithful 5-prime tier-10 tail risk. Full all-width single-cell unification including 1024/2048 was tested and rejected because one ~5M cell could not preserve 2048-chain robustness while serving small/mid widths; the shipped design intentionally keeps dedicated 1024 and 2048 cells. Compliance checks: preprocess hooks are identity, the legal two-operand reductions a%p and b%p are used only for input normalization, perturbing trained weights collapses accuracy toward the untrained floor, and held-out-prime generalization tracks train accuracy."
+ }
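For reference, the Horner step that the manifest says each cell must learn, `(t, bit, b, p) -> (2*t + bit*b) mod p`, can be written exactly in a few lines. This is only a plain-integer sketch of the target recurrence, not the shipped model (which, per the description, learns the step in TCN weights); the function names are illustrative.

```python
def horner_step(t: int, bit: int, b: int, p: int) -> int:
    """One exact Horner step: (t, bit, b, p) -> (2*t + bit*b) mod p."""
    return (2 * t + bit * b) % p


def horner_modmul(a: int, b: int, p: int) -> int:
    """Compute (a * b) mod p by scanning the bits of a MSB-first."""
    a, b = a % p, b % p      # input normalisation (the two-operand reductions)
    t = 0                    # the recurrent state; always stays in [0, p)
    for ch in bin(a)[2:]:    # bin() yields the MSB-first bit string of a
        t = horner_step(t, int(ch), b, p)
    return t


result = horner_modmul(123456789, 987654321, 1000003)
print(result == (123456789 * 987654321) % 1000003)
```

Because the state stays below `p` after every step, a state as wide as `p` is enough to hold it, which is consistent with routing each problem to the narrowest cell whose state holds the prime.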
model.py CHANGED
@@ -31,11 +31,10 @@ The two-operand reductions ``a mod p`` / ``b mod p`` in ``predict_digits`` are
  the same legal input normalisation every other reference model uses.

  Routing: each problem goes to the narrowest cell whose state holds the prime.
- The 16-bit cell covers tiers 1-3 and the 32-bit cell tier 4; a SINGLE
- shared carry-aware TCN weight-set then covers 64/128/256/512-bit primes (tiers 5-8),
- run at each prime's native width, and dedicated TCN cells cover 1024 (tier 9) and
- 2048 (tier 10). For primes wider than the widest trained cell it emits the honest
- ``[0]`` fallback without invoking the network.
  """

  from __future__ import annotations
  from __future__ import annotations
@@ -234,9 +233,9 @@ class HornerRNN(ModularMultiplicationModel):
  md = Path(model_dir)

  # Shared multi-width cells: ONE weight-set serving several adjacent widths
- # (config-declared `widths`). The 64-512 carry-aware TCN ships this way -- the
  # same trained weights run at each prime's native width (see TCNHornerCell.forward),
- # matching/beating the four separate per-width cells it replaces.
  for shared in sorted(md.glob("weights_shared_*.pt")):
      ckpt = torch.load(shared, map_location=self.device, weights_only=True)
      cell = _build_cell(ckpt.get("config", {}))
 
  the same legal input normalisation every other reference model uses.

  Routing: each problem goes to the narrowest cell whose state holds the prime.
+ A SINGLE shared carry-aware TCN weight-set covers 16/32/64/128/256/512-bit
+ primes (tiers 1-8), run at each prime's native width; dedicated TCN cells cover
+ 1024 (tier 9) and 2048 (tier 10). For primes wider than the widest trained cell
+ it emits the honest ``[0]`` fallback without invoking the network.
  """

  from __future__ import annotations
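The routing rule in the docstring above (narrowest cell whose state holds the prime, `[0]` fallback beyond 2048 bits) might look roughly like the sketch below. The `CELLS` table and `route` function are hypothetical names for illustration; the actual dispatch inside `model.py` is not shown in this diff and may differ.

```python
from typing import Optional, Tuple

SHARED = "weights_shared_16_512.pt"

# Hypothetical routing table: (state width in bits, weight file), narrowest first.
CELLS = [(w, SHARED) for w in (16, 32, 64, 128, 256, 512)] + [
    (1024, "weights1024.pt"),
    (2048, "weights2048.pt"),
]


def route(p: int) -> Optional[Tuple[int, str]]:
    """Return (native width, weight file) for prime p, or None for the [0] fallback."""
    n_bits = p.bit_length()
    for width, fname in CELLS:
        if n_bits <= width:      # equivalent to p < 2**width
            return width, fname
    return None                  # p >= 2**2048: emit [0] without invoking the network


print(route(65521), route(2**512 + 75), route(2**2048 + 1))
```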
 
  md = Path(model_dir)

  # Shared multi-width cells: ONE weight-set serving several adjacent widths
+ # (config-declared `widths`). The 16-512 carry-aware TCN ships this way -- the
  # same trained weights run at each prime's native width (see TCNHornerCell.forward),
+ # matching/beating the prior small/mid cells it replaces.
  for shared in sorted(md.glob("weights_shared_*.pt")):
      ckpt = torch.load(shared, map_location=self.device, weights_only=True)
      cell = _build_cell(ckpt.get("config", {}))
weights32.pt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:57f5bc2a37c812a608a574d63f1002b3ac4e6d94aa2d1da3e84355b0e050d19d
- size 12640471
 
 
 
 
weights16.pt → weights_shared_16_512.pt RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:24ac1e5e1203db1e68422a90e5ec1ec040851b76e66fe8fe8044f8ba6f30622f
- size 9482395
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:6c47256b63e3addff47e2d8282c07c7365b25cba2268c1b9e142514d50240238
+ size 22111679
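The new `weights_shared_16_512.pt` pointer above corresponds to what the README and manifest call `soup25 = 0.75 * warm_s0.final + 0.25 * smalltail_s1.final`. A minimal sketch of that kind of fixed-ratio parameter average follows, assuming both checkpoints expose identically named parameters. Plain lists of floats stand in for tensors so the example runs without PyTorch, and `soup` is a hypothetical helper, not the repo's script.

```python
from typing import Dict, List

StateDict = Dict[str, List[float]]


def soup(state_a: StateDict, state_b: StateDict, weight_a: float = 0.75) -> StateDict:
    """Element-wise average weight_a * a + (1 - weight_a) * b over matching parameters."""
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints must have identical parameter names")
    weight_b = 1.0 - weight_a
    merged: StateDict = {}
    for name, values_a in state_a.items():
        values_b = state_b[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [weight_a * x + weight_b * y for x, y in zip(values_a, values_b)]
    return merged


# Toy stand-ins for the two fine-tuned checkpoints.
warm = {"conv.weight": [1.0, 0.0], "conv.bias": [2.0]}
tail = {"conv.weight": [0.0, 4.0], "conv.bias": [2.0]}
soup25 = soup(warm, tail, weight_a=0.75)
print(soup25)
```

Averaging like this only makes sense when the two runs share an initialisation and architecture, which matches the description here: the polish tail was warm-started from the main run.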
weights_shared_64_512.pt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:1a0102d236f5227cc17fc9bedc229ac4764774091a7bb6d96645baaa06ce23e9
- size 22113983