XllentAI/modular_arithmetic

@@ -16,9 +16,8 @@ A compliant **bit-sequential RNN** that computes `(a · b) mod p` for primes `p`
 multiplication tables. Entry for the
 [Modular Arithmetic Challenge](https://github.com/SAIRcompetition/modular-arithmetic-challenge).
-- **Saturates tiers 1–4** (all primes `< 2³²`): tiers 1–3 = 100%, tier 4 = 99%
-- **Tier 5** (33–64-bit primes) = 0.74 on the public benchmark
-- **overall_accuracy 0.483**, `highest_tier_above_90 = 4`
 - Verifiably **generalises to primes never seen in training** (held-out-prime validation
   accuracy tracks training accuracy — no memorisation gap)
@@ -52,12 +51,15 @@ holds the prime:
 |---|---|---|---|---|---|---|
 | `weights16.pt` | 16-bit | `< 2¹⁶` | 1–3 | 4096 / 4 | ~50M | tiers 1–3 = 1.00 |
 | `weights32.pt` | 32-bit | `< 2³²` | 4 | 6144 / 4 | ~114M | tier 4 = 0.99 |
-| `weights64.pt` | 64-bit | `< 2⁶⁴` | 5 | 4096 / 7, residual | ~236M | tier 5 = 0.74 |
 The 64-bit cell needs **depth and residual connections** the narrower cells do not: a 64-bit
 modular Horner step hides two long carry chains (the `2t + bit·b` addition and the
 compare-and-subtract reduction), and exact n-bit carry propagation wants MLP depth ~log₂(n).
-For `p ≥ 2⁶⁴` the model emits the honest `[0]` fallback without invoking the network.
 Also in the repo: `model.py` (the `HornerRNN` entry class + `HornerCell`), `manifest.json`
 (challenge manifest), `train.py` (the 16-bit trainer).
@@ -106,18 +108,22 @@ cell is *at* the floor. The capability therefore resides in the trained paramete
 |---|---|---|---|---|---|---|
 | tier 3 (16-bit cell) | 1.00 | 1.00 | 0.98 | 0.74 | 0.06 | 0.00 |
 | tier 4 (32-bit cell) | 0.99 | 0.99 | 0.86 | 0.04 | 0.02 | 0.00 |
-| tier 5 (64-bit cell) | 0.74 | 0.71 | 0.46 | 0.01 | 0.01 | 0.00 |
 Generalisation against memorisation: 10% of primes at each bit-width were held out of
 training entirely; chain accuracy on them matches the training primes.
 ## Training
-Single-step examples `(t, bit, b, p) → (2t + bit·b) mod p` over each tier's prime range,
-half of each batch mined near the comparison boundary where errors concentrate; BCE per
-state bit, AdamW + cosine decay + EMA, checkpointed by full-chain accuracy on held-out
-primes. Training code and the full write-up live in the solutions repo (link in the model
-card metadata / challenge leaderboard).
 ## License

 multiplication tables. Entry for the
 [Modular Arithmetic Challenge](https://github.com/SAIRcompetition/modular-arithmetic-challenge).
+- **Saturates tiers 1–5** (all primes `< 2⁶⁴`): tiers 1–3 = 100%, tier 4 = 99%, tier 5 = 98%
+- **overall_accuracy 0.507**, `highest_tier_above_90 = 5`
 - Verifiably **generalises to primes never seen in training** (held-out-prime validation
   accuracy tracks training accuracy — no memorisation gap)
 |---|---|---|---|---|---|---|
 | `weights16.pt` | 16-bit | `< 2¹⁶` | 1–3 | 4096 / 4 | ~50M | tiers 1–3 = 1.00 |
 | `weights32.pt` | 32-bit | `< 2³²` | 4 | 6144 / 4 | ~114M | tier 4 = 0.99 |
+| `weights64.pt` | 64-bit | `< 2⁶⁴` | 5 | 4096 / 7, residual | ~236M | tier 5 = 0.98 |
 The 64-bit cell needs **depth and residual connections** the narrower cells do not: a 64-bit
 modular Horner step hides two long carry chains (the `2t + bit·b` addition and the
 compare-and-subtract reduction), and exact n-bit carry propagation wants MLP depth ~log₂(n).
+The last push from tier 5 = 0.74 to 0.98 came from training the 64-bit cell's single-step
+examples on the **states the chain actually visits** (the true Horner trajectory) rather than
+uniformly sampled `t` — see *Training*. For `p ≥ 2⁶⁴` the model emits the honest `[0]`
+fallback without invoking the network.
 Also in the repo: `model.py` (the `HornerRNN` entry class + `HornerCell`), `manifest.json`
 (challenge manifest), `train.py` (the 16-bit trainer).
 |---|---|---|---|---|---|---|
 | tier 3 (16-bit cell) | 1.00 | 1.00 | 0.98 | 0.74 | 0.06 | 0.00 |
 | tier 4 (32-bit cell) | 0.99 | 0.99 | 0.86 | 0.04 | 0.02 | 0.00 |
+| tier 5 (64-bit cell) | 0.98 | 0.95 | 0.65 | 0.03 | 0.01 | 0.00 |
 Generalisation against memorisation: 10% of primes at each bit-width were held out of
 training entirely; chain accuracy on them matches the training primes.
 ## Training
+Single-step examples `(t, bit, b, p) → (2t + bit·b) mod p` over each tier's prime range; BCE
+per state bit, AdamW + cosine decay + EMA, checkpointed by full-chain accuracy on held-out
+primes. The 64-bit cell adds a second fine-tuning phase whose single steps are drawn from the
+**true Horner trajectory** — `t` is an actual chain intermediate `(a_{≥i}·b) mod p`, not a
+uniform sample — matching the training distribution to the states the chain visits at
+inference. This lifts tier 5 from 0.74 to 0.98 with no capacity change and no backprop through
+the recurrence (ordinary supervised BCE on the same single-step target). Training code and the
+full write-up live in the solutions repo (link in the model card metadata / challenge
+leaderboard).
 ## License

manifest.json

@@ -3,5 +3,5 @@
   "output_base": 2,
   "framework": "pytorch",
   "model_description": "Bit-sequential RNN (~400M params across three cells) for primes up to 2^64. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function is an MLP that must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Three cells are shipped and routed by prime size: a 16-bit cell (width 4096 depth 4, ~50M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (width 6144 depth 4, ~114M params) for p < 2^32 covering tier 4, and a 64-bit cell (width 4096 depth 7 with pre-norm residual blocks, ~236M params) for p < 2^64 covering tier 5 — the wider carry chains of a 64-bit modular step need the extra depth. Final state bits are emitted MSB-first as the base-2 answer. For p >= 2^64 emits the honest [0] fallback without invoking the network.",
-  "training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch (more for the 64-bit fine-tune) mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training — val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. Training scripts: train.py (16-bit), exploration/train_horner32.py (32-bit), exploration/train_horner64.py (64-bit, --residual)."
 }

   "output_base": 2,
   "framework": "pytorch",
   "model_description": "Bit-sequential RNN (~400M params across three cells) for primes up to 2^64. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function is an MLP that must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Three cells are shipped and routed by prime size: a 16-bit cell (width 4096 depth 4, ~50M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (width 6144 depth 4, ~114M params) for p < 2^32 covering tier 4, and a 64-bit cell (width 4096 depth 7 with pre-norm residual blocks, ~236M params) for p < 2^64 covering tier 5 — the wider carry chains of a 64-bit modular step need the extra depth. Final state bits are emitted MSB-first as the base-2 answer. For p >= 2^64 emits the honest [0] fallback without invoking the network.",
+  "training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training — val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. The 64-bit cell additionally receives a second fine-tuning phase on single steps drawn from the TRUE Horner trajectory (each example is a (t, bit, b, p) -> (2t + bit*b) mod p step where t is an actual chain intermediate (a_{>=i}*b) mod p, not a uniform sample), which matches the training distribution to the states the chain visits at inference and lifts tier 5 from 0.74 to 0.98; still ordinary supervised BCE on the same single-step target, no backprop through the recurrence. Training scripts: train.py (16-bit), exploration/train_horner32.py (32-bit), exploration/train_horner64.py (64-bit phase 1, --residual) then exploration/train_horner64_traj.py (64-bit phase 2, trajectory)."
 }
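
The trajectory fine-tuning described in this change can be illustrated with plain integer arithmetic. What follows is a minimal sketch, not the repository's training code: the function names (`horner_step`, `horner_chain`, `trajectory_examples`, `uniform_example`) are hypothetical, and padding `a mod p` to `p.bit_length()` bits is an assumption made for illustration. It shows the single-step target `(t, bit, b, p) -> (2t + bit*b) mod p`, the chain that applies it MSB-first, and the difference between sampling `t` uniformly and taking `t` from the states the chain actually visits.

```python
import random


def horner_step(t, bit, b, p):
    """Single-step target: (t, bit, b, p) -> (2t + bit*b) mod p."""
    return (2 * t + bit * b) % p


def msb_first_bits(x, width):
    """Bits of x, most significant first, zero-padded to `width` bits."""
    return [(x >> i) & 1 for i in reversed(range(width))]


def horner_chain(a, b, p):
    """Return every state the chain visits; the last one is (a*b) mod p.

    Assumption: a mod p is read at a fixed width of p.bit_length() bits.
    """
    b %= p
    states = [0]
    for bit in msb_first_bits(a % p, p.bit_length()):
        states.append(horner_step(states[-1], bit, b, p))
    return states


def trajectory_examples(a, b, p):
    """Single-step examples whose t is an actual chain intermediate."""
    b %= p
    bits = msb_first_bits(a % p, p.bit_length())
    states = horner_chain(a, b, p)
    # Pair each visited state with the bit consumed and the state that follows.
    return [(t, bit, b, p, nxt) for t, bit, nxt in zip(states, bits, states[1:])]


def uniform_example(b, p, rng):
    """Single-step example with t drawn uniformly from [0, p)."""
    b %= p
    t = rng.randrange(p)
    bit = rng.randrange(2)
    return (t, bit, b, p, horner_step(t, bit, b, p))
```

Both samplers produce examples for the same supervised target; only the distribution of `t` differs. Calling `trajectory_examples(a, b, p)` for random `a` and `b` yields one example per bit of the chain, so the early, small-`t` states that a uniform sample over `[0, p)` almost never produces for a large prime are represented in proportion to how often inference sees them.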