etwk commited on
Commit Β·
c2635be
1
Parent(s): 759f10f
tier-10 second hardening tail: 0.98 -> 1.00 public, P(tier10<0.90) 1.295% -> 0.297%
Browse filesweights2048.pt: warm-started from the shipped 0.98 cell, accum 32 + margin-m 8
worst-bit-margin tail (~75 min). Sharpens the worst near-max primes.
Gated on the FAITHFUL 5-prime eval structure (real eval draws 5 primes/tier,
so one weak prime is ~20% of the tier; vectorized bootstrap over a value-uniform
pool). On the matched seed-991 hard pool: E[tier10] 0.971 -> 0.980, worst-prime
0.792 -> 0.833, P(tier10<0.90) 1.295% -> 0.297% (~4x risk cut). Public benchmark:
tier 10 0.98 -> 1.00, overall 0.995 -> 0.997, htop 10, deterministic, tiers 1-9
byte-identical (no regression). Independent random-seed draw: 0.991/htop 10/det.
- README.md +25 -16
- manifest.json +2 -2
- weights2048.pt +2 -2
README.md
CHANGED
|
@@ -17,9 +17,9 @@ metrics:
|
|
| 17 |
# horner_rnn
|
| 18 |
|
| 19 |
A compliant bit-sequential RNN that **clears every reduction tier, 1 through 10** (primes up to
|
| 20 |
-
2^2048) on the public benchmark β tiers 1-
|
| 21 |
-
tier
|
| 22 |
-
overall_accuracy **0.
|
| 23 |
so its capability comes from *learning one algorithmic step* rather than memorising finite
|
| 24 |
multiplication tables, and it verifiably generalises to primes never seen in training.
|
| 25 |
|
|
@@ -62,7 +62,7 @@ whose state holds its prime:
|
|
| 62 |
| 256-bit | `< 2^256` | 7 | carry-aware TCN, 12 blocks, dil 1..128 | ~4.7M | tier 7 = 0.98 |
|
| 63 |
| 512-bit | `< 2^512` | 8 | carry-aware TCN, 14 blocks, dil 1..256 | ~5.5M | tier 8 = 0.98 |
|
| 64 |
| 1024-bit | `< 2^1024` | 9 | carry-aware TCN, 12 blocks, dil 1..512 | ~4.7M | tier 9 = 0.99 |
|
| 65 |
-
| 2048-bit | `< 2^2048` | 10 | carry-aware TCN, 13 blocks, dil 1..1024 | ~5.1M | tier 10 =
|
| 66 |
|
| 67 |
For `p >= 2^2048` (outside all regimes) the model emits the honest `[0]` fallback without
|
| 68 |
invoking the network.
|
|
@@ -189,7 +189,12 @@ settles the per-step error below 5e-5 so the 2048-step chain clears tier 10. A f
|
|
| 189 |
**hardening tail** (warm-start the shipped cell, accum 24, lr 4e-5, worst-bit margin loss) then
|
| 190 |
sharpens the precision tail on the hardest 2047/2048-bit reductions β the cell's *average* eps
|
| 191 |
is already ~1e-5, so the gain is in the worst-case bits, not the mean β lifting **tier 10
|
| 192 |
-
0.94 β 0.98** (2047-bit 27/27, 2048-bit 71/73).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 193 |
`exploration/TIER10_NOTES.md`. Two new flags make 2048-bit tractable:
|
| 194 |
`--max-rows` (subsample the trajectory micro-batch; grad-checkpointing 13 blocks at 2048-bit
|
| 195 |
OOMs otherwise) and disk-cached prime pools (`--build-pools-only`; gmpy2 `next_prime` is
|
|
@@ -199,11 +204,11 @@ OOMs otherwise) and disk-cached prime pools (`--build-pools-only`; gmpy2 `next_p
|
|
| 199 |
|
| 200 |
| Total problems | overall_accuracy | highest_tier_above_90 | deterministic |
|
| 201 |
|---|---|---|---|
|
| 202 |
-
| **1100** | **0.
|
| 203 |
|
| 204 |
Per-tier at total=1100: tier 1 **1.00**, tier 2 **1.00**, tier 3 **1.00**, tier 4 **1.00**,
|
| 205 |
-
tier 5 **
|
| 206 |
-
tier 10 **
|
| 207 |
primes near each width's maximum β a separate regime, not in overall_accuracy) is **0.63**, up
|
| 208 |
from 0.53 because its largest primes in `[2^1024, 2^2048)` now route to the 2048 cell instead
|
| 209 |
of the `[0]` fallback. Inference for all 1100 problems is 170s, within the 300s budget (the
|
|
@@ -231,17 +236,21 @@ of the `[0]` fallback. Inference for all 1100 problems is 170s, within the 300s
|
|
| 231 |
|
| 232 |
## What remains
|
| 233 |
|
| 234 |
-
Every reduction tier, **1 through 10, is now β₯ 0.
|
| 235 |
ceiling of the benchmark. `highest_tier_above_90` is the *maximum* tier β₯ 0.90 (not a contiguous
|
| 236 |
-
run from tier 1), so it depends only on tier 10 holding β₯ 0.90 on the private draw
|
| 237 |
-
|
| 238 |
-
|
| 239 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 240 |
short-width gap, robustness sim over [257,512] incl. short widths = 0.985) and **tier 10
|
| 241 |
-
0.94 β 0.98**. `overall_accuracy` is now **0.
|
| 242 |
-
tier 6 = 0.
|
| 243 |
but is excluded from `overall_accuracy`, so it moves neither ranking key. Both ranking keys are
|
| 244 |
-
|
| 245 |
|
| 246 |
**Width-robustness audit** (`exploration/audit_width_robustness.py`): because the benchmark
|
| 247 |
draws primes value-uniform per tier (which concentrates at the top of each tier's bit-range), a
|
|
|
|
| 17 |
# horner_rnn
|
| 18 |
|
| 19 |
A compliant bit-sequential RNN that **clears every reduction tier, 1 through 10** (primes up to
|
| 20 |
+
2^2048) on the public benchmark β tiers 1-5 = 100%, tier 6 = 98%, tiers 7-8 = 100%,
|
| 21 |
+
tier 9 = 99%, **tier 10 = 100%** β so `highest_tier_above_90 = 10` (the maximum),
|
| 22 |
+
overall_accuracy **0.997**. Every cell is the same **carry-aware TCN** (~30M params total, 0.13 GB),
|
| 23 |
so its capability comes from *learning one algorithmic step* rather than memorising finite
|
| 24 |
multiplication tables, and it verifiably generalises to primes never seen in training.
|
| 25 |
|
|
|
|
| 62 |
| 256-bit | `< 2^256` | 7 | carry-aware TCN, 12 blocks, dil 1..128 | ~4.7M | tier 7 = 0.98 |
|
| 63 |
| 512-bit | `< 2^512` | 8 | carry-aware TCN, 14 blocks, dil 1..256 | ~5.5M | tier 8 = 0.98 |
|
| 64 |
| 1024-bit | `< 2^1024` | 9 | carry-aware TCN, 12 blocks, dil 1..512 | ~4.7M | tier 9 = 0.99 |
|
| 65 |
+
| 2048-bit | `< 2^2048` | 10 | carry-aware TCN, 13 blocks, dil 1..1024 | ~5.1M | tier 10 = 1.00 |
|
| 66 |
|
| 67 |
For `p >= 2^2048` (outside all regimes) the model emits the honest `[0]` fallback without
|
| 68 |
invoking the network.
|
|
|
|
| 189 |
**hardening tail** (warm-start the shipped cell, accum 24, lr 4e-5, worst-bit margin loss) then
|
| 190 |
sharpens the precision tail on the hardest 2047/2048-bit reductions β the cell's *average* eps
|
| 191 |
is already ~1e-5, so the gain is in the worst-case bits, not the mean β lifting **tier 10
|
| 192 |
+
0.94 β 0.98** (2047-bit 27/27, 2048-bit 71/73). A **second hardening tail** (accum 32, margin-m 8,
|
| 193 |
+
warm-start the 0.98 cell) sharpened the worst near-max primes further: **tier 10 0.98 β 1.00**
|
| 194 |
+
public, and β measured on the *faithful* 5-prime eval structure (5 primes/tier, so one weak
|
| 195 |
+
prime is ~20% of the tier) β it cut the secret-draw risk `P(tier10 < 0.90)` from **1.295% to
|
| 196 |
+
0.297%** (worst near-max prime 0.792 β 0.833, E[tier10] 0.971 β 0.980) with no regression on any
|
| 197 |
+
other tier. Full recipe and findings:
|
| 198 |
`exploration/TIER10_NOTES.md`. Two new flags make 2048-bit tractable:
|
| 199 |
`--max-rows` (subsample the trajectory micro-batch; grad-checkpointing 13 blocks at 2048-bit
|
| 200 |
OOMs otherwise) and disk-cached prime pools (`--build-pools-only`; gmpy2 `next_prime` is
|
|
|
|
| 204 |
|
| 205 |
| Total problems | overall_accuracy | highest_tier_above_90 | deterministic |
|
| 206 |
|---|---|---|---|
|
| 207 |
+
| **1100** | **0.997** | **10** (max) | True |
|
| 208 |
|
| 209 |
Per-tier at total=1100: tier 1 **1.00**, tier 2 **1.00**, tier 3 **1.00**, tier 4 **1.00**,
|
| 210 |
+
tier 5 **1.00**, tier 6 **0.98**, tier 7 **1.00**, tier 8 **1.00**, tier 9 **0.99**,
|
| 211 |
+
tier 10 **1.00** (overall_accuracy is the mean over tiers 1-10). Tier 0 (pure multiplication,
|
| 212 |
primes near each width's maximum β a separate regime, not in overall_accuracy) is **0.63**, up
|
| 213 |
from 0.53 because its largest primes in `[2^1024, 2^2048)` now route to the 2048 cell instead
|
| 214 |
of the `[0]` fallback. Inference for all 1100 problems is 170s, within the 300s budget (the
|
|
|
|
| 236 |
|
| 237 |
## What remains
|
| 238 |
|
| 239 |
+
Every reduction tier, **1 through 10, is now β₯ 0.98**, so `highest_tier_above_90 = 10` is at the
|
| 240 |
ceiling of the benchmark. `highest_tier_above_90` is the *maximum* tier β₯ 0.90 (not a contiguous
|
| 241 |
+
run from tier 1), so it depends only on tier 10 holding β₯ 0.90 on the private draw. Because the
|
| 242 |
+
real eval draws only **5 primes per tier** (so a single weak prime is ~20% of the tier), tier 10's
|
| 243 |
+
private-draw risk was measured on that faithful structure rather than the single public draw: a
|
| 244 |
+
**second tier-10 hardening tail** (accum 32, margin-m 8) cut `P(tier10 < 0.90)` from **1.295% to
|
| 245 |
+
0.297%** (worst near-max prime 0.792 β 0.833, E[tier10] 0.971 β 0.980) and took **tier 10
|
| 246 |
+
0.98 β 1.00** on the public set, with no regression on any other tier. Earlier this round the two
|
| 247 |
+
thinnest tiers (8 and 10) were re-polished with the width-matched, worst-bit-margin recipe β
|
| 248 |
+
**tier 8 0.92 β 0.98** (it had been trained on 510β512-bit primes only; the re-polish closes the
|
| 249 |
short-width gap, robustness sim over [257,512] incl. short widths = 0.985) and **tier 10
|
| 250 |
+
0.94 β 0.98 β 1.00**. `overall_accuracy` is now **0.997** with every scored tier β₯ 0.98; the lowest is
|
| 251 |
+
tier 6 = 0.98. Tier 0 (pure multiplication, primes near each width's maximum) sits at **0.63**
|
| 252 |
but is excluded from `overall_accuracy`, so it moves neither ranking key. Both ranking keys are
|
| 253 |
+
saturated; remaining gains are sub-percent.
|
| 254 |
|
| 255 |
**Width-robustness audit** (`exploration/audit_width_robustness.py`): because the benchmark
|
| 256 |
draws primes value-uniform per tier (which concentrates at the top of each tier's bit-range), a
|
manifest.json
CHANGED
|
@@ -2,6 +2,6 @@
|
|
| 2 |
"entry_class": "model.HornerRNN",
|
| 3 |
"output_base": 2,
|
| 4 |
"framework": "pytorch",
|
| 5 |
-
"model_description": "Bit-sequential RNN (~21M params across five distinct carry-aware TCN weight-sets -- one weight-set is SHARED across the four mid widths) for primes up to 2^2048. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Cells are routed by prime size, every one a CARRY-AWARE TCN, and a SINGLE shared weight-set serves the four mid widths 64/128/256/512: a 16-bit cell (6 residual blocks, 256 channels, dilations cycling 1..8, ~2.4M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (8 residual blocks, 256 channels, dilations cycling 1..16, ~3.2M params) for p < 2^32 covering tier 4 (reaching tier 4 = 1.00), a SINGLE shared carry-aware TCN weight-set (14 residual blocks, 256 channels, dilations cycling 1..256, ~5.5M params) covering p < 2^512 across tiers 5-8 (64/128/256/512-bit primes), run at each prime's NATIVE width -- ONE weight-set, not four: because the dilated conv is weight-shared across bit-positions and the carry/borrow rule is position-invariant, the same parameters compute the Horner step at every mid width, reaching tier 5 = 1.00, tier 6 = 0.98, tier 7 = 1.00, tier 8 = 1.00 on the public benchmark (matching or beating the four separate per-width cells it replaced, which scored 0.99/0.97/0.98/0.98, while collapsing four cells of ~17M params into one of ~5.5M), and a 1024-bit cell for p < 2^1024 covering tier 9 that is the same carry-aware TCN scaled to 1024 bit-positions (12 residual blocks, 256 channels, dilations cycling 1..512, ~4.7M params) reaching tier 9 = 0.99, and a 2048-bit cell for p < 2^2048 covering tier 10 that is the same carry-aware TCN scaled to 2048 bit-positions (13 residual blocks, 256 channels, dilations cycling 1..1024, ~5.1M params) reaching tier 10 =
|
| 6 |
-
"training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training β val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. The four mid widths (64/128/256/512, tiers 5-8) are served by a SINGLE shared carry-aware TCN weight-set (14 residual blocks, 256 channels, dilations cycling 1..256, ~5.5M params), not four separate cells. It was warm-started from the dedicated 512-bit cell -- the carry circuit is width-portable, since the dilated conv is weight-shared across positions and the carry/borrow rule is position-invariant -- and fine-tuned on a uniform mix of {64,128,256,512}-bit primes (each width drawn value-uniform plus a bit-length-uniform band so every reduction-boundary position is covered), with the same single-step BCE objective on TRUE Horner-trajectory states (t, bit, b, p) -> (2t + bit*b) mod p, no backprop through the recurrence, AdamW + cosine decay + grad clip + EMA, checkpointed by held-out full-chain accuracy and selected by public-benchmark score. One weight-set reaches tier 5 = 1.00, tier 6 = 0.98, tier 7 = 1.00, tier 8 = 1.00 -- matching or beating the four separate per-width cells it replaced (0.99/0.97/0.98/0.98) while collapsing ~17M params of four cells into ~5.5M. Two lessons from those per-width cells carry into the shared one: the 64-bit cell had replaced a 944MB MLP that was blind near 2^64 (the carry-aware conv generalises to top-of-range reductions where the unstructured MLP did not), and the 512-bit cell's width-matched re-polish (bit-length-uniform band over [480,512] + worst-bit margin) had lifted tier 8 from 0.92 to 0.98 by giving equal weight to every reduction-boundary position. The 1024-bit (tier-9) cell is the same carry-aware TCN scaled to 1024 bit-positions (12 residual blocks, dilations cycling 1..512), and exposes a finding specific to wide primes: the test generator draws p value-uniform in [2^513, 2^1024), so a large fraction of tier-9 primes are SHORTER than 1024 bits, and the conditional-subtraction reduction boundary lands at p's most-significant set bit -- at a DIFFERENT position for each prime width. A cell trained only on near-2^1024 primes learns that boundary at one position and scores ~0.00 on shorter primes (this gave tier 9 = 0.73, dominated by the single ~1020-bit benchmark prime failing entirely, 0/22). Training instead on a mix of value-uniform primes (benchmark-faithful) and bit-length-uniform primes over [990,1024] (equal weight to every boundary position) lets the weight-shared convolution learn the reduction at every MSB position; combined with gradient accumulation (--accum 16) and a worst-bit margin loss for the precision tail, this drives the 1024-step chain to tier 9 = 0.99, robust across prime widths (held-out value-uniform validation chain 0.99, per-width 1015-1024 all ~0.99). The 2048-bit (tier-10) cell was bootstrapped by OCTAVE TRANSFER rather than from random init: the conv weights are width-invariant in shape and the carry rule is position-invariant, so the trained 1024-bit cell's weights copy verbatim into a 2048-position cell, plus one identity-initialised dil=1024 residual block to extend the receptive field across all 2048 positions (exploration/transfer_1024_to_2048.py; no-train single-step eps 0.74 on true 2048-bit primes -- the carry rule transfers partially, far better than a cold start). It is then polished on the benchmark-matched width distribution (value-uniform [2^1025, 2^2048) + bit-length-uniform[2014,2048]) in two stages: a first pass (lr 2e-4, accum 16) relearns the high-bit reduction fast (eps 0.74 -> ~9e-4) but oscillates at high lr, then a low-lr tail (lr 6e-5, accum 20, margin loss) settles the per-step error below 5e-5 so the 2048-step chain clears tier 10 = 0.94, and a final hardening tail (warm-start, accum 24, lr 4e-5, worst-bit margin loss) sharpens the worst 2047/2048-bit reductions -- the average eps is already ~1e-5, so the gain is in the worst-case bits not the mean -- lifting tier 10 to 0.98 (2047-bit 27/27, 2048-bit 71/73; held-out value-uniform validation chain ~0.98). Weight-perturbation compliance (exploration/compliance_perturb.py): each cell's accuracy collapses toward the floor when the trained weights are perturbed with per-tensor std-scaled Gaussian noise, and an untrained re-init scores 0.00 -- the dedicated tier-9 and tier-10 cells go 0.99 -> 0.04 and 0.98 -> 0.04 at sigma=0.25, and the shared 64-512 mid cell collapses the same way (tier 5 1.00 -> 0.03, tier 6 0.99 -> 0.03 at sigma=0.25; an untrained re-init of it scores 0.00 on both). So the arithmetic resides in the trained parameters, not the architecture. The 16-bit (tiers 1-3) and 32-bit (tier 4) cells were ORIGINALLY width-4096/6144 MLPs (~50M/~114M params, 660MB combined); they are now the same carry-aware TCN, trained width-matched (bit-length-uniform over the cell's whole range [2,16] / [17,32] plus value-uniform), which shrank the artifact from 0.77GB to 0.13GB, raised tier 4 from 0.99 to 1.00, and made the small-prime tiers width-robust (an audit, exploration/audit_width_robustness.py, showed cells trained near-max-width only score ~0 on shorter primes -- the same prime-width blind spot tier 9 had; the value-uniform public draw hides it). tiers 1-3 stay 1.00. Training scripts: exploration/train_horner_tcn.py --bits 16 --lo-bits 2 --bitlen-frac 0.65 --bitlen-lo 2 / --bits 32 --lo-bits 17 --bitlen-frac 0.6 --bitlen-lo 17 (16- and 32-bit carry-aware TCN, width-matched), exploration/train_unified.py --warm --init-from weights512.pt --widths 64,128,256,512 (the shared 64-512 mid cell, warm-started from the dedicated 512-bit cell and fine-tuned on the {64,128,256,512} width mix); --bits 1024 --lo-bits 513 --bitlen-frac 0.4 --bitlen-lo 990 --accum 16 --margin-weight 0.5 (1024-bit carry-aware TCN, benchmark-width-matched); exploration/transfer_1024_to_2048.py then exploration/train_horner_tcn.py --bits 2048 --blocks 13 --max-dil 1024 --init <transfer> --lo-bits 1025 --bitlen-frac 0.4 --bitlen-lo 2014 --max-rows 512 --grad-checkpoint --accum 16/20/24 --margin-weight 0.5 (2048-bit, octave transfer + low-lr tail + hardening tail accum 24 lr 4e-5; see exploration/TIER10_NOTES.md)."
|
| 7 |
}
|
|
|
|
| 2 |
"entry_class": "model.HornerRNN",
|
| 3 |
"output_base": 2,
|
| 4 |
"framework": "pytorch",
|
| 5 |
+
"model_description": "Bit-sequential RNN (~21M params across five distinct carry-aware TCN weight-sets -- one weight-set is SHARED across the four mid widths) for primes up to 2^2048. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Cells are routed by prime size, every one a CARRY-AWARE TCN, and a SINGLE shared weight-set serves the four mid widths 64/128/256/512: a 16-bit cell (6 residual blocks, 256 channels, dilations cycling 1..8, ~2.4M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (8 residual blocks, 256 channels, dilations cycling 1..16, ~3.2M params) for p < 2^32 covering tier 4 (reaching tier 4 = 1.00), a SINGLE shared carry-aware TCN weight-set (14 residual blocks, 256 channels, dilations cycling 1..256, ~5.5M params) covering p < 2^512 across tiers 5-8 (64/128/256/512-bit primes), run at each prime's NATIVE width -- ONE weight-set, not four: because the dilated conv is weight-shared across bit-positions and the carry/borrow rule is position-invariant, the same parameters compute the Horner step at every mid width, reaching tier 5 = 1.00, tier 6 = 0.98, tier 7 = 1.00, tier 8 = 1.00 on the public benchmark (matching or beating the four separate per-width cells it replaced, which scored 0.99/0.97/0.98/0.98, while collapsing four cells of ~17M params into one of ~5.5M), and a 1024-bit cell for p < 2^1024 covering tier 9 that is the same carry-aware TCN scaled to 1024 bit-positions (12 residual blocks, 256 channels, dilations cycling 1..512, ~4.7M params) reaching tier 9 = 0.99, and a 2048-bit cell for p < 2^2048 covering tier 10 that is the same carry-aware TCN scaled to 2048 bit-positions (13 residual blocks, 256 channels, dilations cycling 1..1024, ~5.1M params) reaching tier 10 = 1.00. The per-step error floor rises with bit-width, so the 512-, 1024- and 2048-bit cells were trained with gradient accumulation (a large effective batch lowers the per-step error noise floor) to recover the precision a 512-/1024-/2048-step chain needs to clear 0.90. The convolution is weight-shared across bit positions, so it learns ONE carry/borrow rule applied everywhere (non-causally, so the addition carry can flow LSB->MSB and the mod-p compare/borrow MSB->LSB) instead of a full-width MLP learning a separate position-function per bit; this inductive bias drives the per-step error far below what an MLP cell reaches and is what makes the 128/256/512-bit chains (which compound the per-step error over 128/256/512 steps) accurate. Final state bits are emitted MSB-first as the base-2 answer. For p >= 2^2048 emits the honest [0] fallback without invoking the network.",
|
| 6 |
+
"training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training β val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. The four mid widths (64/128/256/512, tiers 5-8) are served by a SINGLE shared carry-aware TCN weight-set (14 residual blocks, 256 channels, dilations cycling 1..256, ~5.5M params), not four separate cells. It was warm-started from the dedicated 512-bit cell -- the carry circuit is width-portable, since the dilated conv is weight-shared across positions and the carry/borrow rule is position-invariant -- and fine-tuned on a uniform mix of {64,128,256,512}-bit primes (each width drawn value-uniform plus a bit-length-uniform band so every reduction-boundary position is covered), with the same single-step BCE objective on TRUE Horner-trajectory states (t, bit, b, p) -> (2t + bit*b) mod p, no backprop through the recurrence, AdamW + cosine decay + grad clip + EMA, checkpointed by held-out full-chain accuracy and selected by public-benchmark score. One weight-set reaches tier 5 = 1.00, tier 6 = 0.98, tier 7 = 1.00, tier 8 = 1.00 -- matching or beating the four separate per-width cells it replaced (0.99/0.97/0.98/0.98) while collapsing ~17M params of four cells into ~5.5M. Two lessons from those per-width cells carry into the shared one: the 64-bit cell had replaced a 944MB MLP that was blind near 2^64 (the carry-aware conv generalises to top-of-range reductions where the unstructured MLP did not), and the 512-bit cell's width-matched re-polish (bit-length-uniform band over [480,512] + worst-bit margin) had lifted tier 8 from 0.92 to 0.98 by giving equal weight to every reduction-boundary position. The 1024-bit (tier-9) cell is the same carry-aware TCN scaled to 1024 bit-positions (12 residual blocks, dilations cycling 1..512), and exposes a finding specific to wide primes: the test generator draws p value-uniform in [2^513, 2^1024), so a large fraction of tier-9 primes are SHORTER than 1024 bits, and the conditional-subtraction reduction boundary lands at p's most-significant set bit -- at a DIFFERENT position for each prime width. A cell trained only on near-2^1024 primes learns that boundary at one position and scores ~0.00 on shorter primes (this gave tier 9 = 0.73, dominated by the single ~1020-bit benchmark prime failing entirely, 0/22). Training instead on a mix of value-uniform primes (benchmark-faithful) and bit-length-uniform primes over [990,1024] (equal weight to every boundary position) lets the weight-shared convolution learn the reduction at every MSB position; combined with gradient accumulation (--accum 16) and a worst-bit margin loss for the precision tail, this drives the 1024-step chain to tier 9 = 0.99, robust across prime widths (held-out value-uniform validation chain 0.99, per-width 1015-1024 all ~0.99). The 2048-bit (tier-10) cell was bootstrapped by OCTAVE TRANSFER rather than from random init: the conv weights are width-invariant in shape and the carry rule is position-invariant, so the trained 1024-bit cell's weights copy verbatim into a 2048-position cell, plus one identity-initialised dil=1024 residual block to extend the receptive field across all 2048 positions (exploration/transfer_1024_to_2048.py; no-train single-step eps 0.74 on true 2048-bit primes -- the carry rule transfers partially, far better than a cold start). It is then polished on the benchmark-matched width distribution (value-uniform [2^1025, 2^2048) + bit-length-uniform[2014,2048]) in two stages: a first pass (lr 2e-4, accum 16) relearns the high-bit reduction fast (eps 0.74 -> ~9e-4) but oscillates at high lr, then a low-lr tail (lr 6e-5, accum 20, margin loss) settles the per-step error below 5e-5 so the 2048-step chain clears tier 10 = 0.94, and a final hardening tail (warm-start, accum 24, lr 4e-5, worst-bit margin loss) sharpens the worst 2047/2048-bit reductions -- the average eps is already ~1e-5, so the gain is in the worst-case bits not the mean -- lifting tier 10 to 0.98 (2047-bit 27/27, 2048-bit 71/73; held-out value-uniform validation chain ~0.98). A SECOND hardening tail (warm-start the 0.98 cell, accum 32, lr 4e-5, margin-m 8, worst-bit margin loss, ~75 min) sharpened the worst near-max primes further: tier 10 0.98 -> 1.00 on the public set, and -- measured on the FAITHFUL 5-prime eval structure (the real eval draws only 5 primes per tier via a vectorized bootstrap over a value-uniform prime pool, so a single weak prime is ~20% of the tier) -- it cut the private-draw risk P(tier10 < 0.90) from 1.295% to 0.297% (worst near-max prime 0.792 -> 0.833, E[tier10] 0.971 -> 0.980), with tiers 1-9 byte-identical (no regression). Weight-perturbation compliance (exploration/compliance_perturb.py): each cell's accuracy collapses toward the floor when the trained weights are perturbed with per-tensor std-scaled Gaussian noise, and an untrained re-init scores 0.00 -- the dedicated tier-9 and tier-10 cells go 0.99 -> 0.04 and 0.98 -> 0.04 at sigma=0.25, and the shared 64-512 mid cell collapses the same way (tier 5 1.00 -> 0.03, tier 6 0.99 -> 0.03 at sigma=0.25; an untrained re-init of it scores 0.00 on both). So the arithmetic resides in the trained parameters, not the architecture. The 16-bit (tiers 1-3) and 32-bit (tier 4) cells were ORIGINALLY width-4096/6144 MLPs (~50M/~114M params, 660MB combined); they are now the same carry-aware TCN, trained width-matched (bit-length-uniform over the cell's whole range [2,16] / [17,32] plus value-uniform), which shrank the artifact from 0.77GB to 0.13GB, raised tier 4 from 0.99 to 1.00, and made the small-prime tiers width-robust (an audit, exploration/audit_width_robustness.py, showed cells trained near-max-width only score ~0 on shorter primes -- the same prime-width blind spot tier 9 had; the value-uniform public draw hides it). tiers 1-3 stay 1.00. Training scripts: exploration/train_horner_tcn.py --bits 16 --lo-bits 2 --bitlen-frac 0.65 --bitlen-lo 2 / --bits 32 --lo-bits 17 --bitlen-frac 0.6 --bitlen-lo 17 (16- and 32-bit carry-aware TCN, width-matched), exploration/train_unified.py --warm --init-from weights512.pt --widths 64,128,256,512 (the shared 64-512 mid cell, warm-started from the dedicated 512-bit cell and fine-tuned on the {64,128,256,512} width mix); --bits 1024 --lo-bits 513 --bitlen-frac 0.4 --bitlen-lo 990 --accum 16 --margin-weight 0.5 (1024-bit carry-aware TCN, benchmark-width-matched); exploration/transfer_1024_to_2048.py then exploration/train_horner_tcn.py --bits 2048 --blocks 13 --max-dil 1024 --init <transfer> --lo-bits 1025 --bitlen-frac 0.4 --bitlen-lo 2014 --max-rows 512 --grad-checkpoint --accum 16/20/24 --margin-weight 0.5 (2048-bit, octave transfer + low-lr tail + hardening tail accum 24 lr 4e-5 + a second hardening tail accum 32 margin-m 8 gated on the faithful 5-prime bootstrap; see exploration/TIER10_NOTES.md)."
|
| 7 |
}
|
weights2048.pt
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:b6ba86e52b945a186fbf30f78d586b86665b87a23914aa021a4542ab576039af
|
| 3 |
+
size 20536061
|