Commit 6b82250 by etwk · 1 Parent(s): faa14cf

Update card + manifest: tier 5 0.74 -> 0.98 (trajectory training), highest_tier_above_90 5


weights64.pt is the trajectory-trained 0.98 cell (md5 07851d3d..., git-LFS object). model.py/train.py unchanged.
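
The truncated md5 quoted in the message can be checked locally once the LFS object has been pulled. A minimal sketch, assuming the file sits at `weights64.pt` in the working tree; `md5_hex` and `matches_md5_prefix` are hypothetical helper names, and only the eight hex digits quoted above are compared because the full digest is not given here.

```python
import hashlib


def md5_hex(path, chunk_size=1 << 20):
    """Stream a file through MD5 so a large weights file need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_prefix(path, prefix):
    """True if the file's MD5 hex digest starts with a (possibly truncated) prefix."""
    return md5_hex(path).startswith(prefix.lower())
```

Under those assumptions the check would be `matches_md5_prefix("weights64.pt", "07851d3d")`. If the repo was cloned without git-LFS, the file on disk is a small text pointer rather than the weights, so the digest will not match.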

Files changed (2)
  1. README.md +17 -11
  2. manifest.json +1 -1
README.md CHANGED
@@ -16,9 +16,8 @@ A compliant **bit-sequential RNN** that computes `(a · b) mod p` for primes `p`
16
  multiplication tables. Entry for the
17
  [Modular Arithmetic Challenge](https://github.com/SAIRcompetition/modular-arithmetic-challenge).
18
 
19
- - **Saturates tiers 1–4** (all primes `< 2³²`): tiers 1–3 = 100%, tier 4 = 99%
20
- - **Tier 5** (33–64-bit primes) = 0.74 on the public benchmark
21
- - **overall_accuracy 0.483**, `highest_tier_above_90 = 4`
22
  - Verifiably **generalises to primes never seen in training** (held-out-prime validation
23
  accuracy tracks training accuracy — no memorisation gap)
24
 
@@ -52,12 +51,15 @@ holds the prime:
52
  |---|---|---|---|---|---|---|
53
  | `weights16.pt` | 16-bit | `< 2¹⁶` | 1–3 | 4096 / 4 | ~50M | tiers 1–3 = 1.00 |
54
  | `weights32.pt` | 32-bit | `< 2³²` | 4 | 6144 / 4 | ~114M | tier 4 = 0.99 |
55
- | `weights64.pt` | 64-bit | `< 2⁶⁴` | 5 | 4096 / 7, residual | ~236M | tier 5 = 0.74 |
56
 
57
  The 64-bit cell needs **depth and residual connections** the narrower cells do not: a 64-bit
58
  modular Horner step hides two long carry chains (the `2t + bit·b` addition and the
59
  compare-and-subtract reduction), and exact n-bit carry propagation wants MLP depth ~log₂(n).
60
- For `p ≥ 2⁶⁴` the model emits the honest `[0]` fallback without invoking the network.
 
 
 
61
 
62
  Also in the repo: `model.py` (the `HornerRNN` entry class + `HornerCell`), `manifest.json`
63
  (challenge manifest), `train.py` (the 16-bit trainer).
@@ -106,18 +108,22 @@ cell is *at* the floor. The capability therefore resides in the trained paramete
106
  |---|---|---|---|---|---|---|
107
  | tier 3 (16-bit cell) | 1.00 | 1.00 | 0.98 | 0.74 | 0.06 | 0.00 |
108
  | tier 4 (32-bit cell) | 0.99 | 0.99 | 0.86 | 0.04 | 0.02 | 0.00 |
109
- | tier 5 (64-bit cell) | 0.74 | 0.71 | 0.46 | 0.01 | 0.01 | 0.00 |
110
 
111
  Generalisation against memorisation: 10% of primes at each bit-width were held out of
112
  training entirely; chain accuracy on them matches the training primes.
113
 
114
  ## Training
115
 
116
- Single-step examples `(t, bit, b, p) → (2t + bit·b) mod p` over each tier's prime range,
117
- half of each batch mined near the comparison boundary where errors concentrate; BCE per
118
- state bit, AdamW + cosine decay + EMA, checkpointed by full-chain accuracy on held-out
119
- primes. Training code and the full write-up live in the solutions repo (link in the model
120
- card metadata / challenge leaderboard).
 
 
 
 
121
 
122
  ## License
123
 
 
16
  multiplication tables. Entry for the
17
  [Modular Arithmetic Challenge](https://github.com/SAIRcompetition/modular-arithmetic-challenge).
18
 
19
+ - **Saturates tiers 1–5** (all primes `< 2⁶⁴`): tiers 1–3 = 100%, tier 4 = 99%, tier 5 = 98%
20
+ - **overall_accuracy 0.507**, `highest_tier_above_90 = 5`
 
21
  - Verifiably **generalises to primes never seen in training** (held-out-prime validation
22
  accuracy tracks training accuracy — no memorisation gap)
23
 
 
51
  |---|---|---|---|---|---|---|
52
  | `weights16.pt` | 16-bit | `< 2¹⁶` | 1–3 | 4096 / 4 | ~50M | tiers 1–3 = 1.00 |
53
  | `weights32.pt` | 32-bit | `< 2³²` | 4 | 6144 / 4 | ~114M | tier 4 = 0.99 |
54
+ | `weights64.pt` | 64-bit | `< 2⁶⁴` | 5 | 4096 / 7, residual | ~236M | tier 5 = 0.98 |
55
 
56
  The 64-bit cell needs **depth and residual connections** the narrower cells do not: a 64-bit
57
  modular Horner step hides two long carry chains (the `2t + bit·b` addition and the
58
  compare-and-subtract reduction), and exact n-bit carry propagation wants MLP depth ~log₂(n).
59
+ The last push from tier 5 = 0.74 to 0.98 came from training the 64-bit cell's single-step
60
+ examples on the **states the chain actually visits** (the true Horner trajectory) rather than
61
+ uniformly sampled `t` — see *Training*. For `p ≥ 2⁶⁴` the model emits the honest `[0]`
62
+ fallback without invoking the network.
63
 
64
  Also in the repo: `model.py` (the `HornerRNN` entry class + `HornerCell`), `manifest.json`
65
  (challenge manifest), `train.py` (the 16-bit trainer).
 
108
  |---|---|---|---|---|---|---|
109
  | tier 3 (16-bit cell) | 1.00 | 1.00 | 0.98 | 0.74 | 0.06 | 0.00 |
110
  | tier 4 (32-bit cell) | 0.99 | 0.99 | 0.86 | 0.04 | 0.02 | 0.00 |
111
+ | tier 5 (64-bit cell) | 0.98 | 0.95 | 0.65 | 0.03 | 0.01 | 0.00 |
112
 
113
  Generalisation against memorisation: 10% of primes at each bit-width were held out of
114
  training entirely; chain accuracy on them matches the training primes.
115
 
116
  ## Training
117
 
118
+ Single-step examples `(t, bit, b, p) → (2t + bit·b) mod p` over each tier's prime range; BCE
119
+ per state bit, AdamW + cosine decay + EMA, checkpointed by full-chain accuracy on held-out
120
+ primes. The 64-bit cell adds a second fine-tuning phase whose single steps are drawn from the
121
+ **true Horner trajectory**: `t` is an actual chain intermediate `(a_{≥i}·b) mod p`, not a
122
+ uniform sample, which matches the training distribution to the states the chain visits at
123
+ inference. This lifts tier 5 from 0.74 to 0.98 with no capacity change and no backprop through
124
+ the recurrence (ordinary supervised BCE on the same single-step target). Training code and the
125
+ full write-up live in the solutions repo (link in the model card metadata / challenge
126
+ leaderboard).
127
 
128
  ## License
129
 
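
The difference between uniform and trajectory sampling in the README's Training section can be illustrated in plain Python integers. This is only a sketch of the arithmetic the cell is trained to imitate, not the repo's training code: the function names are hypothetical, and `n_bits` is assumed to be wide enough to hold `a`.

```python
import random


def horner_step(t, bit, b, p):
    """The single-step target from the README: (2t + bit*b) mod p."""
    return (2 * t + bit * b) % p


def horner_chain(a, b, p, n_bits):
    """Read a's bits MSB-first; return the final state and each visited (t, bit)."""
    b %= p
    t = 0
    visited = []
    for i in reversed(range(n_bits)):
        bit = (a >> i) & 1
        visited.append((t, bit))  # t here equals (top bits of a already read) * b mod p
        t = horner_step(t, bit, b, p)
    return t, visited


def trajectory_examples(a, b, p, n_bits):
    """Single-step examples whose t is a state the chain actually visits."""
    _, visited = horner_chain(a, b, p, n_bits)
    b %= p
    return [((t, bit, b, p), horner_step(t, bit, b, p)) for t, bit in visited]


def uniform_examples(b, p, count, rng):
    """Single-step examples with t drawn uniformly from [0, p)."""
    b %= p
    examples = []
    for _ in range(count):
        t, bit = rng.randrange(p), rng.randrange(2)
        examples.append(((t, bit, b, p), horner_step(t, bit, b, p)))
    return examples
```

Both generators emit the same `(t, bit, b, p) → target` shape, so the supervised loss is unchanged; only the distribution of `t` differs. For instance, `horner_chain(13, 7, 11, 4)` ends on `(13 * 7) % 11`.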
manifest.json CHANGED
@@ -3,5 +3,5 @@
3
  "output_base": 2,
4
  "framework": "pytorch",
5
  "model_description": "Bit-sequential RNN (~400M params across three cells) for primes up to 2^64. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function is an MLP that must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Three cells are shipped and routed by prime size: a 16-bit cell (width 4096 depth 4, ~50M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (width 6144 depth 4, ~114M params) for p < 2^32 covering tier 4, and a 64-bit cell (width 4096 depth 7 with pre-norm residual blocks, ~236M params) for p < 2^64 covering tier 5 — the wider carry chains of a 64-bit modular step need the extra depth. Final state bits are emitted MSB-first as the base-2 answer. For p >= 2^64 emits the honest [0] fallback without invoking the network.",
6
- "training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch (more for the 64-bit fine-tune) mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training — val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. Training scripts: train.py (16-bit), exploration/train_horner32.py (32-bit), exploration/train_horner64.py (64-bit, --residual)."
7
  }
 
3
  "output_base": 2,
4
  "framework": "pytorch",
5
  "model_description": "Bit-sequential RNN (~400M params across three cells) for primes up to 2^64. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function is an MLP that must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Three cells are shipped and routed by prime size: a 16-bit cell (width 4096 depth 4, ~50M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (width 6144 depth 4, ~114M params) for p < 2^32 covering tier 4, and a 64-bit cell (width 4096 depth 7 with pre-norm residual blocks, ~236M params) for p < 2^64 covering tier 5 — the wider carry chains of a 64-bit modular step need the extra depth. Final state bits are emitted MSB-first as the base-2 answer. For p >= 2^64 emits the honest [0] fallback without invoking the network.",
6
+ "training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training — val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. The 64-bit cell additionally receives a second fine-tuning phase on single steps drawn from the TRUE Horner trajectory (each example is a (t, bit, b, p) -> (2t + bit*b) mod p step where t is an actual chain intermediate (a_{>=i}*b) mod p, not a uniform sample), which matches the training distribution to the states the chain visits at inference and lifts tier 5 from 0.74 to 0.98; still ordinary supervised BCE on the same single-step target, no backprop through the recurrence. Training scripts: train.py (16-bit), exploration/train_horner32.py (32-bit), exploration/train_horner64.py (64-bit phase 1, --residual) then exploration/train_horner64_traj.py (64-bit phase 2, trajectory)."
7
  }
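
Two details from the manifest text lend themselves to a short sketch: routing by prime size, and the boundary-mining criterion (`2t + bit*b` within ±2 of a multiple of `p`). The code below is an illustration under assumptions, not the repo's implementation: the function names are hypothetical, and the sampler targets only the multiples `p` and `2p`, on the assumption that those are the thresholds where the compare-and-subtract result changes.

```python
import random


def route_cell(p):
    """Pick a weights file by prime size; None stands for the [0] fallback."""
    if p < 2**16:
        return "weights16.pt"
    if p < 2**32:
        return "weights32.pt"
    if p < 2**64:
        return "weights64.pt"
    return None


def near_boundary(t, bit, b, p, margin=2):
    """True if 2t + bit*b lies within `margin` of some multiple of p."""
    r = (2 * t + bit * b) % p
    return min(r, p - r) <= margin


def sample_near_boundary(p, rng, margin=2):
    """Draw (t, bit, b) with 0 <= t, b < p so that 2t + bit*b is near p or 2p."""
    while True:
        bit = rng.randrange(2)
        b = rng.randrange(p)
        target = rng.choice((1, 2)) * p + rng.randint(-margin, margin)
        twice_t = target - bit * b
        # Retry unless the implied t is a valid state in [0, p).
        if twice_t >= 0 and twice_t % 2 == 0 and twice_t // 2 < p:
            return twice_t // 2, bit, b
```

Constructing `t` from the target sum, rather than rejecting uniform draws, keeps the sampler practical for 64-bit primes, where a uniform `t` almost never lands within ±2 of a multiple of `p`.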