etwk commited on
Commit Β·
b670993
1
Parent(s): 6d991cb
Promote shared 16-512 weight soup
Browse files- README.md +47 -41
- manifest.json +3 -3
- model.py +6 -7
- weights32.pt +0 -3
- weights16.pt β weights_shared_16_512.pt +2 -2
- weights_shared_64_512.pt +0 -3
README.md
CHANGED
|
@@ -17,10 +17,9 @@ metrics:
|
|
| 17 |
# horner_rnn
|
| 18 |
|
| 19 |
A compliant bit-sequential RNN that **clears every reduction tier, 1 through 10** (primes up to
|
| 20 |
-
2^2048) on the public benchmark β tiers 1-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
five weight-sets, 0.08 GB), so its capability comes from *learning one algorithmic step* rather
|
| 24 |
than memorising finite multiplication tables, and it verifiably generalises to primes never seen
|
| 25 |
in training.
|
| 26 |
|
|
@@ -52,21 +51,20 @@ held-out-prime validation accuracy tracks training accuracy throughout (no memor
|
|
| 52 |
|
| 53 |
The recurrence is exact only if the state is wide enough to hold the residue, so each cell is
|
| 54 |
trained per bit-width β but because the dilated convolution is weight-shared across bit-positions
|
| 55 |
-
and the carry/borrow rule is position-invariant, **one shared weight-set serves
|
| 56 |
-
widths 64/128/256/512** (run at each prime's native width). The model therefore ships
|
| 57 |
-
weight
|
| 58 |
|
| 59 |
| Weight file | Primes | Tiers | Architecture | Params | Public benchmark |
|
| 60 |
|---|---|---|---|---|---|
|
| 61 |
-
| `
|
| 62 |
-
| `weights32.pt` | `< 2^32` | 4 | carry-aware TCN, 8 blocks, dil 1..16 | ~3.2M | tier 4 = 1.00 |
|
| 63 |
-
| `weights_shared_64_512.pt` | `< 2^512` | 5-8 | carry-aware TCN, 14 blocks, dil 1..256 β **one shared set**, run at native width | ~5.5M | tier 5 = 1.00, tier 6 = 0.98, tier 7 = 1.00, tier 8 = 1.00 |
|
| 64 |
| `weights1024.pt` | `< 2^1024` | 9 | carry-aware TCN, 12 blocks, dil 1..512 | ~4.7M | tier 9 = 0.99 |
|
| 65 |
| `weights2048.pt` | `< 2^2048` | 10 | carry-aware TCN, 13 blocks, dil 1..1024 | ~5.1M | tier 10 = 1.00 |
|
| 66 |
|
| 67 |
-
The four separate mid-width cells
|
| 68 |
-
|
| 69 |
-
|
|
|
|
| 70 |
fallback without invoking the network.
|
| 71 |
|
| 72 |
## The carry-aware TCN (every tier)
|
|
@@ -85,8 +83,9 @@ The two small cells were originally width-4096/6144 MLPs (660 MB combined); repl
|
|
| 85 |
the carry-aware TCN, trained width-matched (bit-length-uniform over the cell's whole range),
|
| 86 |
shrank the artifact from 0.77 GB to ~0.13 GB (the later mid-cell collapse then brought the total
|
| 87 |
to **0.08 GB**), raised tier 4 from 0.99 to **1.00**, and made
|
| 88 |
-
the small-prime tiers width-robust
|
| 89 |
-
spot (see the audit note below), which
|
|
|
|
| 90 |
|
| 91 |
The per-step error floor *rises* with bit-width, so the 512- and 1024-bit cells additionally
|
| 92 |
train with **gradient accumulation** (a larger effective batch lowers the gradient-noise floor
|
|
@@ -141,13 +140,23 @@ The training scripts live in the companion research repo (not shipped in this mo
|
|
| 141 |
commands below document *how the weights were obtained* (the provenance the rules ask for):
|
| 142 |
|
| 143 |
```bash
|
| 144 |
-
# small-prime cells
|
| 145 |
-
python exploration/train_horner_tcn.py --bits 16 --lo-bits 2 --bitlen-frac 0.65 --bitlen-lo 2
|
| 146 |
-
python exploration/train_horner_tcn.py --bits 32 --lo-bits 17 --bitlen-frac 0.6 --bitlen-lo 17
|
| 147 |
-
|
| 148 |
-
# ONE shared
|
| 149 |
-
#
|
| 150 |
-
python exploration/train_unified.py --warm --init-from
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 151 |
```
|
| 152 |
|
| 153 |
The **1024-bit (tier-9) cell is a multi-stage curriculum**, not a single run β the carry
|
|
@@ -217,15 +226,13 @@ OOMs otherwise) and disk-cached prime pools (`--build-pools-only`; gmpy2 `next_p
|
|
| 217 |
|
| 218 |
| Total problems | overall_accuracy | highest_tier_above_90 | deterministic |
|
| 219 |
|---|---|---|---|
|
| 220 |
-
| **1100** | **0.
|
| 221 |
|
| 222 |
-
Per-tier at total=1100:
|
| 223 |
-
|
| 224 |
-
|
| 225 |
-
|
| 226 |
-
|
| 227 |
-
of the `[0]` fallback. Inference for all 1100 problems is 170s, within the 300s budget (the
|
| 228 |
-
2048-step tier-10 scan is the bulk); artifact 0.08 GB.
|
| 229 |
|
| 230 |
## Status under the rules
|
| 231 |
|
|
@@ -261,23 +268,22 @@ E[tier10] 0.971 β 0.980), then a **greedy weight-soup** of three independent m
|
|
| 261 |
worst-prime tail risk a further ~10Γ (`P(tier10 < 0.90)` 0.191% β 0.018% on the decisive hard
|
| 262 |
pool, worst prime 0.833 β 0.875) with public tier 10 held at **1.00** β all gated on a matched
|
| 263 |
faithful 5-prime bootstrap plus a fixed-seed end-to-end A/B, no regression on any other tier.
|
| 264 |
-
Earlier this round the
|
| 265 |
-
|
| 266 |
-
|
| 267 |
-
|
| 268 |
-
|
| 269 |
-
|
| 270 |
-
|
| 271 |
-
saturated; remaining gains are sub-percent.
|
| 272 |
|
| 273 |
**Width-robustness audit** (`exploration/audit_width_robustness.py`, re-run for the shared-cell
|
| 274 |
model): because the benchmark draws primes value-uniform per tier (which concentrates at the top
|
| 275 |
of each tier's bit-range), a cell trained near-max-width only can score ~0 on shorter primes yet
|
| 276 |
still look perfect on the public set β exactly the gap that capped tier 9 before it was
|
| 277 |
width-matched. Every shipped cell is now trained width-matched (value-uniform **plus** a
|
| 278 |
-
bit-length-uniform band): the shared
|
| 279 |
-
|
| 280 |
-
secret-style draws found **P(tier < 0.90) β 0.000%** β the shared
|
| 281 |
no width knee, and tiers 9/10 are blind only in the *deep* value-uniform tail (knees ~970-bit /
|
| 282 |
~1950-bit), which carries β2β»β΅β΄ / 2β»βΉβΈ of the draw mass and is effectively unsamplable. No
|
| 283 |
primary-metric risk; remaining gains are sub-percent.
|
|
|
|
| 17 |
# horner_rnn
|
| 18 |
|
| 19 |
A compliant bit-sequential RNN that **clears every reduction tier, 1 through 10** (primes up to
|
| 20 |
+
2^2048) on the public benchmark β tiers 1-8 = 100%, tier 9 = 99%, **tier 10 = 100%** β
|
| 21 |
+
so `highest_tier_above_90 = 10` (the maximum), overall_accuracy **0.999**. Every cell is
|
| 22 |
+
the same **carry-aware TCN** (~15.4M params total across three weight files, 0.06 GB), so its capability comes from *learning one algorithmic step* rather
|
|
|
|
| 23 |
than memorising finite multiplication tables, and it verifiably generalises to primes never seen
|
| 24 |
in training.
|
| 25 |
|
|
|
|
| 51 |
|
| 52 |
The recurrence is exact only if the state is wide enough to hold the residue, so each cell is
|
| 53 |
trained per bit-width β but because the dilated convolution is weight-shared across bit-positions
|
| 54 |
+
and the carry/borrow rule is position-invariant, **one shared weight-set serves all small/mid
|
| 55 |
+
widths 16/32/64/128/256/512** (run at each prime's native width). The model therefore ships
|
| 56 |
+
**three weight files** and routes each problem to the narrowest cell whose state holds its prime:
|
| 57 |
|
| 58 |
| Weight file | Primes | Tiers | Architecture | Params | Public benchmark |
|
| 59 |
|---|---|---|---|---|---|
|
| 60 |
+
| `weights_shared_16_512.pt` | `< 2^512` | 1-8 | carry-aware TCN, 14 blocks, dil 1..256 β **one shared set**, run at native width | ~5.5M | tiers 1-8 = 1.00 |
|
|
|
|
|
|
|
| 61 |
| `weights1024.pt` | `< 2^1024` | 9 | carry-aware TCN, 12 blocks, dil 1..512 | ~4.7M | tier 9 = 0.99 |
|
| 62 |
| `weights2048.pt` | `< 2^2048` | 10 | carry-aware TCN, 13 blocks, dil 1..1024 | ~5.1M | tier 10 = 1.00 |
|
| 63 |
|
| 64 |
+
The earlier four separate mid-width cells had already collapsed into one shared 64β512 set;
|
| 65 |
+
this version further merges the 16- and 32-bit small-prime cells into that same shared block-pool.
|
| 66 |
+
The final shared 16β512 set reaches tiers 1β8 = 1.00 and cuts the total to **~15.4M params,
|
| 67 |
+
0.06 GB**. For `p >= 2^2048` (outside all regimes) the model emits the honest `[0]`
|
| 68 |
fallback without invoking the network.
|
| 69 |
|
| 70 |
## The carry-aware TCN (every tier)
|
|
|
|
| 83 |
the carry-aware TCN, trained width-matched (bit-length-uniform over the cell's whole range),
|
| 84 |
shrank the artifact from 0.77 GB to ~0.13 GB (the later mid-cell collapse then brought the total
|
| 85 |
to **0.08 GB**), raised tier 4 from 0.99 to **1.00**, and made
|
| 86 |
+
the small-prime tiers width-robust before the final 16β512 merge cut the artifact to **0.06 GB**.
|
| 87 |
+
A TCN trained near-max-width only has a short-prime blind spot (see the audit note below), which
|
| 88 |
+
the width-matched training removes.
|
| 89 |
|
| 90 |
The per-step error floor *rises* with bit-width, so the 512- and 1024-bit cells additionally
|
| 91 |
train with **gradient accumulation** (a larger effective batch lowers the gradient-noise floor
|
|
|
|
| 140 |
commands below document *how the weights were obtained* (the provenance the rules ask for):
|
| 141 |
|
| 142 |
```bash
|
| 143 |
+
# Historical small-prime cells were first trained width-matched, then absorbed into the shared cell.
|
| 144 |
+
python exploration/train_horner_tcn.py --bits 16 --lo-bits 2 --bitlen-frac 0.65 --bitlen-lo 2
|
| 145 |
+
python exploration/train_horner_tcn.py --bits 32 --lo-bits 17 --bitlen-frac 0.6 --bitlen-lo 17
|
| 146 |
+
|
| 147 |
+
# ONE shared 16-512 set for tiers 1-8: warm-start from the shared 64-512 cell, fine-tune on
|
| 148 |
+
# a {16,32,64,128,256,512}-bit mix, then soup the main run with a small-tier polish tail.
|
| 149 |
+
python exploration/train_unified.py --warm --init-from horner_rnn/weights_shared_64_512.pt \
|
| 150 |
+
--widths 16,32,64,128,256,512 \
|
| 151 |
+
--width-weights 16:0.40,32:0.18,64:0.08,128:0.08,256:0.08,512:0.18 \
|
| 152 |
+
--accum 8 --lr 2e-4 --offtraj-frac 0.20 --offtraj-k 4 --seed 0 \
|
| 153 |
+
--out checkpoints/unified_16to512_warm_s0.pt
|
| 154 |
+
python exploration/train_unified.py --warm --init-from checkpoints/unified_16to512_warm_s0.pt.final \
|
| 155 |
+
--widths 16,32,64,128,256,512 \
|
| 156 |
+
--width-weights 16:0.55,32:0.25,64:0.04,128:0.04,256:0.04,512:0.08 \
|
| 157 |
+
--accum 8 --lr 6e-5 --offtraj-frac 0.10 --offtraj-k 4 --seed 1 \
|
| 158 |
+
--out checkpoints/unified_16to512_smalltail_s1.pt
|
| 159 |
+
# Ship soup25 = 0.75 * warm_s0.final + 0.25 * smalltail_s1.final -> weights_shared_16_512.pt
|
| 160 |
```
|
| 161 |
|
| 162 |
The **1024-bit (tier-9) cell is a multi-stage curriculum**, not a single run β the carry
|
|
|
|
| 226 |
|
| 227 |
| Total problems | overall_accuracy | highest_tier_above_90 | deterministic |
|
| 228 |
|---|---|---|---|
|
| 229 |
+
| **1100** | **0.999** | **10** (max) | True |
|
| 230 |
|
| 231 |
+
Per-tier at total=1100: tiers 1β8 **1.00**, tier 9 **0.99**, tier 10 **1.00**
|
| 232 |
+
(overall_accuracy is the mean over tiers 1-10). Tier 0 (pure multiplication, primes near each
|
| 233 |
+
width's maximum β a separate regime, not in overall_accuracy) is **0.64** on this fixed public
|
| 234 |
+
seed. Inference for all 1100 problems is 171s, within the 300s budget (the 2048-step tier-10 scan
|
| 235 |
+
is the bulk); artifact 0.06 GB.
|
|
|
|
|
|
|
| 236 |
|
| 237 |
## Status under the rules
|
| 238 |
|
|
|
|
| 268 |
worst-prime tail risk a further ~10Γ (`P(tier10 < 0.90)` 0.191% β 0.018% on the decisive hard
|
| 269 |
pool, worst prime 0.833 β 0.875) with public tier 10 held at **1.00** β all gated on a matched
|
| 270 |
faithful 5-prime bootstrap plus a fixed-seed end-to-end A/B, no regression on any other tier.
|
| 271 |
+
Earlier this round the thin small/mid tiers were re-polished with the width-matched,
|
| 272 |
+
worst-bit-margin recipe and then collapsed into the shared 16β512 soup β **tier 8 0.92 β 1.00**
|
| 273 |
+
public, with matched faithful bootstrap E[acc] 0.9866 β 0.9931 and `P(tier8 < 0.95)` 1.396% β
|
| 274 |
+
0.205%. Tier 10 independently improved **0.94 β 0.98 β 1.00**. `overall_accuracy` is now **0.999**
|
| 275 |
+
with tiers 1β8 all at 1.00 and the lowest scored tier tier 9 = 0.99. Tier 0 (pure multiplication,
|
| 276 |
+
primes near each width's maximum) is excluded from `overall_accuracy`, so it moves neither ranking
|
| 277 |
+
key. Both ranking keys are saturated; remaining gains are sub-percent.
|
|
|
|
| 278 |
|
| 279 |
**Width-robustness audit** (`exploration/audit_width_robustness.py`, re-run for the shared-cell
|
| 280 |
model): because the benchmark draws primes value-uniform per tier (which concentrates at the top
|
| 281 |
of each tier's bit-range), a cell trained near-max-width only can score ~0 on shorter primes yet
|
| 282 |
still look perfect on the public set β exactly the gap that capped tier 9 before it was
|
| 283 |
width-matched. Every shipped cell is now trained width-matched (value-uniform **plus** a
|
| 284 |
+
bit-length-uniform band): the shared 16β512 cell on the full {16,32,64,128,256,512} mix,
|
| 285 |
+
and the 1024/2048 cells across their own ranges. Re-auditing the shared-cell model on 40k
|
| 286 |
+
secret-style draws found **P(tier < 0.90) β 0.000%** β the shared 16β512 cell (tiers 1β8) shows
|
| 287 |
no width knee, and tiers 9/10 are blind only in the *deep* value-uniform tail (knees ~970-bit /
|
| 288 |
~1950-bit), which carries β2β»β΅β΄ / 2β»βΉβΈ of the draw mass and is effectively unsamplable. No
|
| 289 |
primary-metric risk; remaining gains are sub-percent.
|
manifest.json
CHANGED
|
@@ -2,6 +2,6 @@
|
|
| 2 |
"entry_class": "model.HornerRNN",
|
| 3 |
"output_base": 2,
|
| 4 |
"framework": "pytorch",
|
| 5 |
-
"model_description": "Bit-sequential RNN (~
|
| 6 |
-
"training_description": "Each
|
| 7 |
-
}
|
|
|
|
| 2 |
"entry_class": "model.HornerRNN",
|
| 3 |
"output_base": 2,
|
| 4 |
"framework": "pytorch",
|
| 5 |
+
"model_description": "Bit-sequential RNN (~15.4M params across three weight files) for modular multiplication with primes up to 2^2048. The model reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary. Its hidden state is a hard-quantized bit vector, and the transition function is a learned carry-aware dilated-convolution TCN trained to implement the Horner step (t, bit, b, p) -> (2*t + bit*b) mod p. The final hidden state bits are emitted MSB-first as the base-2 answer. Routing uses the narrowest state width that can hold p. A SINGLE shared TCN weight-set, weights_shared_16_512.pt, serves 16/32/64/128/256/512-bit states (tiers 1-8) at each prime's native width and reaches tiers 1-8 = 1.00 on the public benchmark. Dedicated TCN cells weights1024.pt and weights2048.pt cover tier 9 and tier 10, reaching tier 9 = 0.99 and tier 10 = 1.00. For p >= 2^2048 the model emits the honest [0] fallback without invoking the network. The arithmetic is not in Python code: tokenization, scan, threshold, and readout are architecture, while doubling, conditional add, compare/borrow, and reduction are all learned in the trained cell weights; random or perturbed weights collapse to the floor.",
|
| 6 |
+
"training_description": "Each cell is trained from exact single-step labels (t, bit, b, p) -> (2*t + bit*b) mod p, with BCE per state bit, AdamW, cosine decay, gradient clipping, EMA checkpointing, and held-out-prime validation. Training data uses true Horner-trajectory states plus boundary-focused examples; prime sampling is value-uniform to match the challenge generator, with bit-length-uniform bands where needed so the reduction boundary is seen at every position. The current shared 16-512 file was built by warm-starting from the shipped shared 64-512 carry-aware TCN (14 residual TCN blocks, 256 channels, dilations cycling through 1..256), fine-tuning on a {16,32,64,128,256,512} width mix, then averaging that run with a small-tier polish tail: soup25 = 0.75 * unified_16to512_warm_s0.final + 0.25 * unified_16to512_smalltail_s1.final. The warm run used a 16-heavy but 512-preserving distribution (16:.40,32:.18,64:.08,128:.08,256:.08,512:.18), accum=8, lr=2e-4, off-trajectory batches (offtraj-frac=.20,k=4), and width-native validation. The polish tail warm-started from that candidate with lower lr=6e-5 and weights 16:.55,32:.25,64:.04,128:.04,256:.04,512:.08, then the soup was selected by public score plus matched 5-prime bootstrap. On the fixed public benchmark this merge lifts the shipped model from overall 0.997 to 0.999: tiers 1-8 = 1.00, tier 9 = 0.99, tier 10 = 1.00. A matched faithful bootstrap over tiers 1-8 (5 primes/tier structure, pool120, k30, boot200k, seed 515151) ties tiers 1-2 and improves tiers 3-8; tier 8 E[acc] improves 0.9866 -> 0.9931 and P(tier8<0.95) drops 1.396% -> 0.205%. The 1024-bit cell was trained separately with benchmark-width-matched primes in [2^513,2^1024), gradient accumulation, and worst-bit margin loss; it remains byte-unchanged and scores tier 9 = 0.99. The 2048-bit cell was bootstrapped from the 1024-bit cell by octave transfer, hardened with low-lr margin tails, then weight-souped across independent margin tails; it remains byte-unchanged and scores tier 10 = 1.00 while reducing faithful 5-prime tier-10 tail risk. Full all-width single-cell unification including 1024/2048 was tested and rejected because one ~5M cell could not preserve 2048-chain robustness while serving small/mid widths; the shipped design intentionally keeps dedicated 1024 and 2048 cells. Compliance checks: preprocess hooks are identity, the legal two-operand reductions a%p and b%p are used only for input normalization, perturbing trained weights collapses accuracy toward the untrained floor, and held-out-prime generalization tracks train accuracy."
|
| 7 |
+
}
|
model.py
CHANGED
|
@@ -31,11 +31,10 @@ The two-operand reductions ``a mod p`` / ``b mod p`` in ``predict_digits`` are
|
|
| 31 |
the same legal input normalisation every other reference model uses.
|
| 32 |
|
| 33 |
Routing: each problem goes to the narrowest cell whose state holds the prime.
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
``[0]`` fallback without invoking the network.
|
| 39 |
"""
|
| 40 |
|
| 41 |
from __future__ import annotations
|
|
@@ -234,9 +233,9 @@ class HornerRNN(ModularMultiplicationModel):
|
|
| 234 |
md = Path(model_dir)
|
| 235 |
|
| 236 |
# Shared multi-width cells: ONE weight-set serving several adjacent widths
|
| 237 |
-
# (config-declared `widths`). The
|
| 238 |
# same trained weights run at each prime's native width (see TCNHornerCell.forward),
|
| 239 |
-
# matching/beating the
|
| 240 |
for shared in sorted(md.glob("weights_shared_*.pt")):
|
| 241 |
ckpt = torch.load(shared, map_location=self.device, weights_only=True)
|
| 242 |
cell = _build_cell(ckpt.get("config", {}))
|
|
|
|
| 31 |
the same legal input normalisation every other reference model uses.
|
| 32 |
|
| 33 |
Routing: each problem goes to the narrowest cell whose state holds the prime.
|
| 34 |
+
A SINGLE shared carry-aware TCN weight-set covers 16/32/64/128/256/512-bit
|
| 35 |
+
primes (tiers 1-8), run at each prime's native width; dedicated TCN cells cover
|
| 36 |
+
1024 (tier 9) and 2048 (tier 10). For primes wider than the widest trained cell
|
| 37 |
+
it emits the honest ``[0]`` fallback without invoking the network.
|
|
|
|
| 38 |
"""
|
| 39 |
|
| 40 |
from __future__ import annotations
|
|
|
|
| 233 |
md = Path(model_dir)
|
| 234 |
|
| 235 |
# Shared multi-width cells: ONE weight-set serving several adjacent widths
|
| 236 |
+
# (config-declared `widths`). The 16-512 carry-aware TCN ships this way β the
|
| 237 |
# same trained weights run at each prime's native width (see TCNHornerCell.forward),
|
| 238 |
+
# matching/beating the prior small/mid cells it replaces.
|
| 239 |
for shared in sorted(md.glob("weights_shared_*.pt")):
|
| 240 |
ckpt = torch.load(shared, map_location=self.device, weights_only=True)
|
| 241 |
cell = _build_cell(ckpt.get("config", {}))
|
weights32.pt
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:57f5bc2a37c812a608a574d63f1002b3ac4e6d94aa2d1da3e84355b0e050d19d
|
| 3 |
-
size 12640471
|
|
|
|
|
|
|
|
|
|
|
|
weights16.pt β weights_shared_16_512.pt
RENAMED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:6c47256b63e3addff47e2d8282c07c7365b25cba2268c1b9e142514d50240238
|
| 3 |
+
size 22111679
|
weights_shared_64_512.pt
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:1a0102d236f5227cc17fc9bedc229ac4764774091a7bb6d96645baaa06ce23e9
|
| 3 |
-
size 22113983
|
|
|
|
|
|
|
|
|
|
|
|