etwk committed on
Commit · 6b82250 · 1 Parent(s): faa14cf
Update card + manifest: tier 5 0.74 -> 0.98 (trajectory training), highest_tier_above_90 5
weights64.pt is the trajectory-trained 0.98 cell (md5 07851d3d..., git-LFS object). model.py/train.py unchanged.
- README.md +17 -11
- manifest.json +1 -1
README.md
CHANGED
@@ -16,9 +16,8 @@ A compliant **bit-sequential RNN** that computes `(a · b) mod p` for primes `p`
 multiplication tables. Entry for the
 [Modular Arithmetic Challenge](https://github.com/SAIRcompetition/modular-arithmetic-challenge).
 
-- **Saturates tiers 1–
-- **
-- **overall_accuracy 0.483**, `highest_tier_above_90 = 4`
+- **Saturates tiers 1–5** (all primes `< 2⁶⁴`): tiers 1–3 = 100%, tier 4 = 99%, tier 5 = 98%
+- **overall_accuracy 0.507**, `highest_tier_above_90 = 5`
 - Verifiably **generalises to primes never seen in training** (held-out-prime validation
 accuracy tracks training accuracy — no memorisation gap)
 
@@ -52,12 +51,15 @@ holds the prime:
 |---|---|---|---|---|---|---|
 | `weights16.pt` | 16-bit | `< 2¹⁶` | 1–3 | 4096 / 4 | ~50M | tiers 1–3 = 1.00 |
 | `weights32.pt` | 32-bit | `< 2³²` | 4 | 6144 / 4 | ~114M | tier 4 = 0.99 |
-| `weights64.pt` | 64-bit | `< 2⁶⁴` | 5 | 4096 / 7, residual | ~236M | tier 5 = 0.
+| `weights64.pt` | 64-bit | `< 2⁶⁴` | 5 | 4096 / 7, residual | ~236M | tier 5 = 0.98 |
 
 The 64-bit cell needs **depth and residual connections** the narrower cells do not: a 64-bit
 modular Horner step hides two long carry chains (the `2t + bit·b` addition and the
 compare-and-subtract reduction), and exact n-bit carry propagation wants MLP depth ~log₂(n).
-
+The last push from tier 5 = 0.74 to 0.98 came from training the 64-bit cell's single-step
+examples on the **states the chain actually visits** (the true Horner trajectory) rather than
+uniformly sampled `t` — see *Training*. For `p ≥ 2⁶⁴` the model emits the honest `[0]`
+fallback without invoking the network.
 
 Also in the repo: `model.py` (the `HornerRNN` entry class + `HornerCell`), `manifest.json`
 (challenge manifest), `train.py` (the 16-bit trainer).
@@ -106,18 +108,22 @@ cell is *at* the floor. The capability therefore resides in the trained paramete
 |---|---|---|---|---|---|---|
 | tier 3 (16-bit cell) | 1.00 | 1.00 | 0.98 | 0.74 | 0.06 | 0.00 |
 | tier 4 (32-bit cell) | 0.99 | 0.99 | 0.86 | 0.04 | 0.02 | 0.00 |
-| tier 5 (64-bit cell) | 0.
+| tier 5 (64-bit cell) | 0.98 | 0.95 | 0.65 | 0.03 | 0.01 | 0.00 |
 
 Generalisation against memorisation: 10% of primes at each bit-width were held out of
 training entirely; chain accuracy on them matches the training primes.
 
 ## Training
 
-Single-step examples `(t, bit, b, p) → (2t + bit·b) mod p` over each tier's prime range
-
-
-
-
+Single-step examples `(t, bit, b, p) → (2t + bit·b) mod p` over each tier's prime range; BCE
+per state bit, AdamW + cosine decay + EMA, checkpointed by full-chain accuracy on held-out
+primes. The 64-bit cell adds a second fine-tuning phase whose single steps are drawn from the
+**true Horner trajectory** — `t` is an actual chain intermediate `(a_{≥i}·b) mod p`, not a
+uniform sample — matching the training distribution to the states the chain visits at
+inference. This lifts tier 5 from 0.74 to 0.98 with no capacity change and no backprop through
+the recurrence (ordinary supervised BCE on the same single-step target). Training code and the
+full write-up live in the solutions repo (link in the model card metadata / challenge
+leaderboard).
 
 ## License
 
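The Horner step and the difference between trajectory sampling and uniform sampling, as described in the README changes above, can be illustrated with exact integer arithmetic. This is a minimal sketch, not the repository's training code: the function names are hypothetical, and reading `p.bit_length()` bits per chain is an assumption made for the example (the shipped cells may use a fixed 16/32/64-bit width).

```python
import random


def horner_step(t: int, bit: int, b: int, p: int) -> int:
    """Exact single-step target the cell is trained to reproduce."""
    return (2 * t + bit * b) % p


def trajectory_examples(a: int, b: int, p: int):
    """Yield (t, bit, b, p, target) for every step of the true chain.

    Bits of (a mod p) are read MSB-first. Using p.bit_length() bits is an
    assumption of this sketch, not something the README states.
    """
    a, b = a % p, b % p
    t = 0
    for i in reversed(range(p.bit_length())):
        bit = (a >> i) & 1
        target = horner_step(t, bit, b, p)
        yield t, bit, b, p, target
        t = target  # the next input state is the one the chain actually visits


def uniform_example(b: int, p: int, rng: random.Random):
    """One example with t sampled uniformly from [0, p), for contrast."""
    t, bit = rng.randrange(p), rng.randrange(2)
    return t, bit, b % p, p, horner_step(t, bit, b, p)


def multiply_mod(a: int, b: int, p: int) -> int:
    """Run the whole chain; the final state equals (a * b) mod p."""
    final = 0
    for *_, final in trajectory_examples(a, b, p):
        pass
    return final
```

Both generators emit the same kind of supervised pair; only the distribution of `t` differs. In the trajectory version every `t` is a chain intermediate of the form `((a >> i) * b) % p`, which is the state distribution the recurrence sees at inference.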
manifest.json
CHANGED
@@ -3,5 +3,5 @@
 "output_base": 2,
 "framework": "pytorch",
 "model_description": "Bit-sequential RNN (~400M params across three cells) for primes up to 2^64. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function is an MLP that must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Three cells are shipped and routed by prime size: a 16-bit cell (width 4096 depth 4, ~50M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (width 6144 depth 4, ~114M params) for p < 2^32 covering tier 4, and a 64-bit cell (width 4096 depth 7 with pre-norm residual blocks, ~236M params) for p < 2^64 covering tier 5 — the wider carry chains of a 64-bit modular step need the extra depth. Final state bits are emitted MSB-first as the base-2 answer. For p >= 2^64 emits the honest [0] fallback without invoking the network.",
-"training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch
+"training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training — val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. The 64-bit cell additionally receives a second fine-tuning phase on single steps drawn from the TRUE Horner trajectory (each example is a (t, bit, b, p) -> (2t + bit*b) mod p step where t is an actual chain intermediate (a_{>=i}*b) mod p, not a uniform sample), which matches the training distribution to the states the chain visits at inference and lifts tier 5 from 0.74 to 0.98; still ordinary supervised BCE on the same single-step target, no backprop through the recurrence. Training scripts: train.py (16-bit), exploration/train_horner32.py (32-bit), exploration/train_horner64.py (64-bit phase 1, --residual) then exploration/train_horner64_traj.py (64-bit phase 2, trajectory)."
 }
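The boundary mining mentioned in `training_description` (half of each batch with `2t + bit*b` within +/-2 of a multiple of `p`) could be sketched as below. This is an illustration under stated assumptions, not the author's sampler: the function names are hypothetical, rejection sampling is a choice made for simplicity, and restricting the multiples to `p` and `2p` (the two thresholds a compare-and-subtract reduction tests when `0 <= t, b < p`) is an assumption.

```python
import random


def near_boundary(t: int, bit: int, b: int, p: int, margin: int = 2) -> bool:
    """True if 2t + bit*b lies within `margin` of some multiple of p."""
    r = (2 * t + bit * b) % p
    return min(r, p - r) <= margin


def sample_boundary_step(p: int, rng: random.Random, margin: int = 2):
    """Rejection-sample (t, bit, b) with 2t + bit*b = k*p + d, |d| <= margin.

    Assumption: only k in {1, 2} is used, i.e. the thresholds p and 2p.
    """
    while True:
        b = rng.randrange(p)
        bit = rng.randrange(2)
        k = rng.choice((1, 2))
        d = rng.randint(-margin, margin)
        twice_t = k * p + d - bit * b
        # t must be an integer in [0, p); otherwise draw again
        if twice_t % 2 == 0 and 0 <= twice_t // 2 < p:
            return twice_t // 2, bit, b


def mixed_batch(p: int, size: int, rng: random.Random):
    """First half boundary-mined, second half uniform over (t, bit, b)."""
    half = size // 2
    batch = [sample_boundary_step(p, rng) for _ in range(half)]
    batch += [
        (rng.randrange(p), rng.randrange(2), rng.randrange(p))
        for _ in range(size - half)
    ]
    return batch
```

A call such as `mixed_batch(65537, 64, random.Random(0))` returns 64 `(t, bit, b)` triples, the first 32 of which satisfy `near_boundary`. Targets would then be computed as `(2*t + bit*b) % p`, the same single-step target used everywhere else.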